Frigade faced a critical challenge: users were complaining about their AI assistant’s performance.
That familiar sinking feeling. They knew the AI worked—sometimes. But they were flying blind, relying on user feedback and “vibes” to understand performance. Sound familiar?
Partnering with Wicked Data, Frigade implemented evaluation-driven development to transform their AI from guesswork to systematic knowing. For this B2B SaaS company, where every user interaction matters and churn is costly, the results were game-changing: 40% better response acceptability and 75% lower user-perceived latency, with a team that could now confidently answer "That is a known failure mode" instead of scrambling for explanations.
The Challenge: Flying Blind with AI Performance
Frigade’s engineering team faced the classic “black box” problem that plagues AI implementations:
- User complaints about unhelpful AI responses, but no systematic way to diagnose why
- Slow performance causing user abandonment, but no clear metrics to optimize
- Difficult to prioritize improvements when everything felt like guesswork
- CEO pressure to “fix the AI” without clear data on what was actually broken
This scenario breeds the familiar cycle: reactive firefighting instead of proactive improvement. Frigade’s team was talented, but they were flying blind. They needed to move from “vibes-based” AI development to systematic evaluation that connected performance to business outcomes.
The Solution: Evaluation-Driven Development
Frigade’s ambition was an AI assistant that actively shows users how to achieve tasks, rather than merely pointing to documentation. But first, they needed to stop flying blind.
Wicked Data provided the systematic evaluation framework that transformed their AI development from guesswork to knowing. The key wasn’t just building better AI—it was building the systematic process to measure, diagnose, and improve AI performance consistently.
The Key Unlock: From Guesswork to Systematic Knowing
The engagement’s cornerstone was implementing evaluation-driven development (EDD)—the systematic approach that replaces “vibes-based” AI development with data-driven confidence.
When you’re flying blind with AI performance, every bug feels like a mystery and every improvement feels like luck. EDD provides the systematic framework to measure progress, diagnose failures, and concentrate efforts on changes that actually matter.
As detailed in “AI Observability: You can’t fix what you can’t see,” you can’t optimize what you can’t measure systematically. This is the foundation that transforms engineering teams from reactive firefighters to proactive builders.
Wicked Data instituted a robust evaluation process (a minimal harness sketch follows this list) including:
- Representative Query Sets: Curated queries and inputs reflecting genuine user scenarios.
- Automated System Testing: A custom CLI script to automate query testing against live system versions.
- Trace Recording: Braintrust integration for detailed AI interaction trace recording—vital for AI observability.
- Annotation & Acceptability Scoring: A Braintrust-configured process for annotating traces to assess AI response acceptability.
- Performance Metrics & Reporting: Custom CLI scripts to download annotated logs and compute key metrics, creating a version-controlled “performance snapshot” in Git for trend analysis.
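To make this concrete, here is a minimal sketch of what such a query-runner CLI can look like. The endpoint, payload shape, and trace fields below are illustrative assumptions, not Frigade's actual interface, and traces are written to local JSONL rather than showing a specific Braintrust call:

```python
#!/usr/bin/env python3
"""Sketch of a query-runner CLI. The endpoint, payload shape, and trace
fields are illustrative assumptions, not Frigade's actual interface."""
import argparse
import json
import time
from pathlib import Path

import requests  # third-party: pip install requests


def run_query(endpoint: str, query: str) -> dict:
    """Send one query to the live system and capture a trace record."""
    start = time.monotonic()
    resp = requests.post(endpoint, json={"query": query}, timeout=60)
    resp.raise_for_status()
    return {
        "input": query,
        "output": resp.json(),
        "latency_s": round(time.monotonic() - start, 3),
    }


def main() -> None:
    parser = argparse.ArgumentParser(description="Run a query set against a live system version.")
    parser.add_argument("queries", type=Path, help='JSONL file with one {"query": ...} object per line')
    parser.add_argument("--endpoint", required=True, help="URL of the assistant under test")
    parser.add_argument("--out", type=Path, default=Path("traces.jsonl"), help="where to write traces")
    args = parser.parse_args()

    with args.out.open("w") as out:
        for line in args.queries.read_text().splitlines():
            if not line.strip():
                continue  # skip blank lines in the query file
            record = run_query(args.endpoint, json.loads(line)["query"])
            out.write(json.dumps(record) + "\n")  # one trace per line, ready to upload and annotate


if __name__ == "__main__":
    main()
```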
This framework facilitated granular analysis of metrics (a snapshot-script sketch follows this list) including:
- Input/Output Tokens
- Latency (notably Time to First Record)
- Percent Acceptable Responses
- Document Relevance
- Detailed breakdowns by response type, customer, and other dimensions.
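As a rough illustration, a snapshot script along these lines can compute those metrics from annotated logs. The field names here (`acceptable`, `relevant_docs`, `ttfr_s`, `response_type`) are hypothetical stand-ins for whatever the annotation process actually records:

```python
"""Sketch of a performance-snapshot script. Field names ('acceptable',
'relevant_docs', 'ttfr_s', 'response_type') are hypothetical stand-ins
for whatever the annotation process actually records."""
import json
from collections import defaultdict
from pathlib import Path
from statistics import median


def pct(flags: list) -> float:
    """Percentage of truthy values, e.g. percent-acceptable responses."""
    return round(100 * sum(bool(f) for f in flags) / len(flags), 1) if flags else 0.0


def snapshot(trace_file: Path) -> dict:
    traces = [json.loads(line) for line in trace_file.read_text().splitlines() if line.strip()]
    by_type = defaultdict(list)
    for t in traces:
        by_type[t.get("response_type", "unknown")].append(t)
    return {
        "n_queries": len(traces),
        "pct_acceptable": pct([t["acceptable"] for t in traces]),
        "pct_relevant_docs": pct([t["relevant_docs"] for t in traces]),
        "median_ttfr_s": median(t["ttfr_s"] for t in traces),
        # Breakdowns by dimension make regressions easy to localize.
        "pct_acceptable_by_type": {
            k: pct([t["acceptable"] for t in v]) for k, v in by_type.items()
        },
    }


if __name__ == "__main__":
    # Committing the snapshot file to Git gives a version-controlled trend line.
    Path("perf_snapshot.json").write_text(json.dumps(snapshot(Path("annotated_traces.jsonl")), indent=2))
```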
From Insights to Impact: Iterative Improvements
This robust evaluation system enabled Wicked Data and Frigade to meticulously “follow the breadcrumbs” in the data, pinpointing critical areas for enhancement. This data-centric approach spurred improvements in:
- Document Chunking and Indexing: Optimizing information processing for superior retrieval.
- Document Embedding: Enhancing metadata to improve query-document matching.
- User-Perceived Latency: Transitioning to streaming responses and prioritizing “time to first record,” significantly boosting user experience (see the streaming sketch after this list). Strategies for tackling such challenges are explored in “Understanding and Addressing AI Latency.”
- Structured Output Format: Ensuring more dependable integration and predictable AI outputs (see the schema-validation sketch after this list).
- Guidelines and Instructions: Refining system prompts for enhanced consistency and accuracy.
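To illustrate the latency change, here is a hedged sketch of streaming with a "time to first record" measurement. The OpenAI Python SDK and model name are purely illustrative stand-ins; the case study does not name Frigade's model provider:

```python
"""Sketch: stream a response and measure "time to first record." The OpenAI
Python SDK and model name are illustrative stand-ins; the case study does
not name Frigade's model provider."""
import time

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.monotonic()
first_token_at = None
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How do I publish my onboarding flow?"}],
    stream=True,  # tokens arrive incrementally instead of as one final blob
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.monotonic()  # the user sees output from this moment
    print(delta, end="", flush=True)

print(f"\ntime to first record: {first_token_at - start:.2f}s")
```

Because the user starts reading as soon as the first tokens arrive, perceived latency is governed by time to first record rather than total generation time.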
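And a minimal sketch of schema-validated output, using Pydantic as one common approach to predictable AI outputs; the `AssistantAnswer` fields are hypothetical, not Frigade's actual schema:

```python
"""Sketch: validate model output against a schema so downstream code gets a
predictable shape. The AssistantAnswer fields are hypothetical."""
from pydantic import BaseModel, ValidationError


class AssistantAnswer(BaseModel):
    steps: list[str]            # ordered actions the user should take
    doc_url: str | None = None  # optional supporting documentation link


raw = '{"steps": ["Open Flows", "Click New Flow"], "doc_url": null}'
try:
    answer = AssistantAnswer.model_validate_json(raw)  # Pydantic v2 API
    print(answer.steps)
except ValidationError as err:
    # A malformed response is caught here instead of breaking the product UI.
    print("unacceptable response shape:", err)
```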
Measurable Results: A Leap in AI Performance & User Experience
The systematic application of EDD and targeted refinements yielded significant, quantifiable outcomes, making Frigade’s AI 40% better and 75% faster:
- 40% increase in response acceptability, ensuring users receive consistently helpful and accurate guidance.
- 35% increase in queries retrieving relevant documents, making the assistant demonstrably more knowledgeable.
- 75% reduction in user-perceived latency, delivering faster assistance.
- 70% increase in the accuracy of rejecting queries outside the assistant’s scope, maintaining focus and reliability.
These advancements directly translated into a vastly improved user experience, minimizing frustration and empowering Frigade’s users to achieve their objectives with greater ease and speed.
Empowering Frigade for Ongoing Success: Mentorship and Knowledge Transfer
Beyond the immediate technical deliverables, Wicked Data emphasized mentoring and knowledge transfer to the Frigade team. The engagement equipped Frigade with the skills and processes to independently run their own evaluation cycles. Frigade’s team can now pair their deep product knowledge with systematic measurement to generate new insights and continuously iterate on their AI system.
Conclusion: From Sinking Feeling to Confident Answers
Frigade’s success story illustrates the transformation every engineering leader craves: moving from that sinking feeling of flying blind to the confidence of systematic knowing.
Before: “Why are users complaining about the AI assistant?” After: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
This transformation is possible for your team. Through evaluation-driven development, you can stop flying blind and start building AI with predictable, measurable outcomes.
Ready to Stop Guessing and Start Knowing?
If you want to build systematic AI evaluation that transforms your team from reactive firefighters to proactive builders, I can help you implement this framework in just one week.
My intensive workshop walks your engineering team through building a robust evaluation system using your own data, in your own codebase: the same approach that helped Frigade achieve 40% better response acceptability and 75% lower user-perceived latency.
Schedule a free consultation to discuss your team’s specific challenges →