Quality Assurance for AI

Traditional software quality assurance relies on predictable, deterministic behavior—the same input always produces the same output. But AI systems fundamentally break this assumption, creating a new challenge for quality assurance. This document outlines how to adapt QA practices for the AI era through effective evaluation and annotation systems. Drawing from my experience leading machine learning teams at LinkedIn, Google, and HealthRhythms, I present a framework for building evaluation systems that deliver the core benefits of traditional QA—higher quality releases, faster development cycles, and a proactive quality culture—in the context of AI products.

The Quality Assurance Challenge in AI

What is Quality Assurance?

Traditional quality assurance in software focuses on verifying that systems behave as expected. It typically involves:

  • Unit tests that verify individual functions
  • Integration tests that verify component interactions
  • End-to-end tests that verify complete workflows
  • Manual testing to catch issues that automated tests miss

When done well, QA delivers critical benefits:

  1. Higher quality releases - Fewer bugs reach production
  2. Faster development cycles - Issues are caught early when they’re cheaper to fix
  3. More confident releases - Teams know their changes won’t break existing functionality
  4. Proactive quality culture - Problems are prevented rather than remediated

Why AI Breaks Traditional QA

AI systems introduce variability that fundamentally breaks traditional QA approaches:

  1. Non-deterministic outputs - The same input may produce different outputs
  2. Contextual dependencies - Subtle changes in context can significantly change outputs
  3. Combinatorial explosion - The number of possible scenarios becomes effectively infinite
  4. Emergent behaviors - Systems exhibit behaviors not explicitly programmed

Consider a simple recommendation system with just four personalization factors. In traditional software, where each factor is simply on or off, this creates 16 (2^4) distinct scenarios to test. But in AI systems, each factor might influence outcomes probabilistically across a continuous spectrum, creating effectively infinite potential outputs.

When I led candidate recommendation systems at LinkedIn, we couldn’t possibly test every potential recommendation for every recruiter. Instead, we needed a different approach to ensure quality while maintaining development velocity.

What Are Evaluations and Why Do We Need Them?

In the context of AI, evaluations are structured assessments of system quality based on:

  1. Representative samples of system behavior
  2. Human judgment applied to these samples
  3. Consistent criteria for assessing quality
  4. Quantitative metrics derived from these judgments

Evaluations serve as the foundation of AI quality assurance by:

  1. Establishing baselines - Understanding current performance
  2. Measuring improvements - Quantifying the impact of changes
  3. Identifying issues - Spotting problems before they affect users
  4. Setting standards - Creating consistent quality expectations

When done systematically, evaluations enable the same benefits as traditional QA for AI systems:

  • Higher quality AI products through systematic identification of issues
  • Faster development cycles by focusing efforts on high-impact improvements
  • More confident releases based on quantitative quality assessments
  • Proactive quality culture that identifies issues before users do

At LinkedIn, implementing structured evaluations increased our experimentation velocity by 3x and led to double-digit improvements in core metrics. Teams with robust evaluation frameworks consistently outpaced those relying on ad-hoc testing.

The Five Critical Capabilities for AI Quality Assurance

Based on my experience across multiple organizations, there are five essential capabilities that every AI system needs for effective quality assurance:

1. Observability: Capturing What Your System Does

Observability means comprehensively tracking how your AI system processes information and generates outputs. You need to log:

  • User inputs and context
  • Intermediate processing steps (e.g., retrieval results, reasoning steps)
  • Final outputs and confidence scores
  • System metadata (timing, resource usage, etc.)

At HealthRhythms, we implemented comprehensive logging across our prediction pipeline. This allowed us to trace issues through our entire system and identify problems that would have been invisible otherwise.

Implementation tip: Log everything in structured formats with correlation IDs to trace requests through complex systems. Use a centralized system that allows you to recreate the entire processing chain. Find more tips in my AI Observability guide.
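
To make this concrete, here is a minimal sketch in Python using only the standard library; the stage names and payload fields are illustrative, not a prescribed schema:

    import json
    import logging
    import time
    import uuid

    logger = logging.getLogger("ai_pipeline")
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_event(correlation_id: str, stage: str, payload: dict) -> None:
        """Emit one structured log line for a single pipeline stage."""
        logger.info(json.dumps({
            "correlation_id": correlation_id,  # ties every stage of one request together
            "stage": stage,                    # e.g. "input", "retrieval", "ranking", "output"
            "timestamp": time.time(),
            "payload": payload,
        }))

    # One correlation ID per user request, reused at every stage.
    request_id = str(uuid.uuid4())
    log_event(request_id, "input", {"query": "data scientist roles in Austin"})
    log_event(request_id, "retrieval", {"candidates_returned": 50, "latency_ms": 112})
    log_event(request_id, "output", {"recommendations_shown": 10, "top_score": 0.87})

Because every line carries the same correlation ID, you can later reassemble the full processing chain for any single request.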

2. Data Viewing: Understanding Both Perspectives

Data viewing means being able to see both what the user experienced and what the system was “thinking” behind the scenes. This dual perspective is critical because AI decisions often depend on factors invisible to users.

Example: In a content recommendation system:

  • User perspective: A simple list of recommended articles with minimal explanation
  • System perspective: Retrieved content candidates, their relevance scores, personalization factors applied, and confidence metrics

At HealthRhythms, we built tools that showed both the user’s view (simple visualizations) and the system view (detailed prediction data, confidence intervals, feature importance). This dual perspective was essential for understanding why our predictions sometimes missed the mark despite good overall metrics.

Implementation tip: Create viewing interfaces that seamlessly switch between user and system perspectives. Allow annotators to see exactly what the user saw while having access to the additional context that informed the system’s decisions.
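
A rough sketch of what switching perspectives can look like, assuming the logged record already carries both; every key below is a placeholder for whatever your system actually records:

    def get_view(record: dict, perspective: str) -> dict:
        """Return the user-facing slice or the full system context of one logged record."""
        user_view = {
            "query": record["query"],
            "shown_items": record["shown_items"],  # exactly what appeared in the UI
        }
        if perspective == "user":
            return user_view
        # The system view layers internal context on top of what the user saw.
        return {
            **user_view,
            "retrieved_candidates": record["retrieved_candidates"],
            "relevance_scores": record["relevance_scores"],
            "personalization_factors": record["personalization_factors"],
            "confidence": record["confidence"],
        }

The useful property is that the user view is a strict subset of the system view: annotators always see exactly what the user saw, plus the hidden context that produced it.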

3. Annotation: Applying Human Judgment Systematically

Annotation is the process of enriching examples with human judgment. An effective annotation system should support:

  • Binary acceptable/unacceptable judgments
  • Structured categorization of issues
  • Hierarchical tagging systems
  • Free-form notes for edge cases
  • Multiple annotators with tracking

At LinkedIn, our annotation system evolved from simple spreadsheets to a sophisticated platform that supported complex taxonomies of recommendation quality. This evolution dramatically improved our ability to identify and address specific types of failures.

Implementation tip: Establish a regular annotation practice with scheduled sessions where team members annotate together. This builds shared understanding, ensures consistent standards, and makes annotation an integral part of your development process rather than an afterthought.
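
One way to represent a single annotation with these capabilities is sketched below; the field names, tag format, and severity levels are illustrative assumptions rather than a fixed standard:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Optional

    @dataclass
    class Annotation:
        example_id: str                    # correlation ID of the logged interaction
        annotator_id: str                  # who made the judgment (enables agreement tracking)
        acceptable: bool                   # binary acceptable/unacceptable judgment
        issue_tags: list[str] = field(default_factory=list)  # hierarchical tags, e.g. "relevance/wrong_seniority"
        severity: Optional[str] = None     # e.g. "minor", "major", "critical"
        notes: str = ""                    # free-form notes for edge cases
        annotated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    # Example: two annotators judging the same recommendation.
    a1 = Annotation("req-123", "alice", acceptable=False,
                    issue_tags=["relevance/wrong_seniority"], severity="major",
                    notes="Candidate is a new grad; role requires 8+ years.")
    a2 = Annotation("req-123", "bob", acceptable=False,
                    issue_tags=["relevance/wrong_seniority"], severity="minor")

Recording the annotator on every judgment is what later makes inter-annotator agreement monitoring and guideline refinement possible.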

4. Evaluation: Measuring Overall System Performance

Evaluation means aggregating annotations to assess overall system quality. Your evaluation capability should:

  • Calculate key performance metrics
  • Allow comparison across system versions
  • Support segmentation by user types, contexts, or other dimensions
  • Identify statistically significant changes
  • Produce actionable insights rather than just numbers

When building recommendation systems at LinkedIn, we developed evaluation frameworks that could quickly tell us if a change improved quality for specific user segments while potentially harming others. This nuanced understanding was impossible with simple A/B testing alone.

Implementation tip: Design your evaluation metrics to align directly with business outcomes, not just technical perfection. The technically “best” model may not be the best model for your business.
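
As a sketch of the aggregation step, the snippet below computes acceptability rates per user segment and compares two system versions with a plain two-proportion z-test; the field names and counts are invented for illustration:

    import math
    from collections import defaultdict

    def acceptability_by_segment(annotations: list[dict]) -> dict[str, float]:
        """Aggregate binary judgments into an acceptability rate per user segment."""
        totals, accepted = defaultdict(int), defaultdict(int)
        for a in annotations:
            totals[a["segment"]] += 1
            accepted[a["segment"]] += int(a["acceptable"])
        return {seg: accepted[seg] / totals[seg] for seg in totals}

    def two_proportion_z(accepted_a: int, n_a: int, accepted_b: int, n_b: int) -> float:
        """z-statistic for comparing acceptability rates of two system versions."""
        p_a, p_b = accepted_a / n_a, accepted_b / n_b
        p_pool = (accepted_a + accepted_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        return (p_a - p_b) / se

    # Version B looks better overall, but check each segment before shipping.
    z = two_proportion_z(accepted_a=412, n_a=500, accepted_b=441, n_b=500)
    print(f"z = {z:.2f}")  # z = -2.59 here; |z| > 1.96 is significant at the 5% level

Segment-level rates are what surface the pattern described above, where a change improves quality for one group while quietly regressing another.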

5. Error Analysis: Understanding Failure Patterns

Error analysis means systematically investigating why your system fails. This capability should:

  • Identify common patterns in failures
  • Quantify the impact of different error types
  • Prioritize issues based on frequency and severity
  • Generate hypotheses for improvement
  • Validate that fixes actually address root causes

At Google, we used systematic error analysis to prioritize improvements to our tools recommendation system. By categorizing failures, we could focus engineering efforts on the issues that would have the biggest impact on user experience.

Implementation tip: Schedule regular error analysis reviews where team members collaboratively review annotations, identify patterns, and brainstorm solutions. Making this a consistent practice ensures that insights from annotations actually drive improvements.
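
One simple way to turn annotations into a priority list is to weight each error category by frequency and severity, as sketched below; the severity weights are illustrative assumptions:

    from collections import Counter

    SEVERITY_WEIGHT = {"minor": 1, "major": 3, "critical": 10}  # illustrative weights

    def prioritize_errors(annotations: list[dict]) -> list[tuple[str, int]]:
        """Rank error categories by frequency weighted by severity."""
        impact = Counter()
        for a in annotations:
            if a["acceptable"]:
                continue
            for tag in a["issue_tags"]:
                impact[tag] += SEVERITY_WEIGHT.get(a.get("severity", "minor"), 1)
        return impact.most_common()

    annotations = [
        {"acceptable": False, "issue_tags": ["relevance/stale_profile"], "severity": "major"},
        {"acceptable": False, "issue_tags": ["formatting/truncated"], "severity": "minor"},
        {"acceptable": False, "issue_tags": ["relevance/stale_profile"], "severity": "critical"},
        {"acceptable": True, "issue_tags": []},
    ]
    print(prioritize_errors(annotations))
    # [('relevance/stale_profile', 13), ('formatting/truncated', 1)]

Sorting by weighted impact rather than raw counts keeps rare but severe failures from being drowned out by frequent cosmetic ones.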

What Do We Need to Annotate and How Do We Get That Data?

Effective annotation requires capturing the complete context of AI system interactions:

Critical Information to Capture

  1. Inputs and Context
    • User queries or actions
    • Relevant user information
    • Environmental factors
  2. System Processing
    • Intermediate steps and decisions
    • Retrieved information (for RAG systems)
    • Prompts and parameters used
  3. Outputs
    • Final responses or recommendations
    • Confidence scores or rankings
    • Timing information
  4. Human Judgments
    • Acceptability determinations
    • Error categorizations
    • Severity ratings
    • Notes and explanations
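
Putting the four groups above together, one captured interaction might look roughly like this; every field name is illustrative, and the judgments slot is filled in later during annotation:

    # One captured interaction, ready for annotation (all field names are illustrative).
    interaction_record = {
        "inputs": {
            "query": "articles about sleep and mood",
            "user_context": {"segment": "clinician", "locale": "en-US"},
            "environment": {"device": "mobile", "app_version": "4.2.1"},
        },
        "processing": {
            "retrieved_documents": ["doc_481", "doc_212", "doc_977"],
            "prompt_template": "summarize_v3",
            "model_parameters": {"temperature": 0.2},
        },
        "outputs": {
            "response_id": "resp-789",
            "recommendations": ["doc_212", "doc_481"],
            "confidence": 0.71,
            "latency_ms": 340,
        },
        "judgments": [],  # appended later by annotators
    }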

Data Sources

The most valuable data comes from these sources:

  1. Production logs - Real user interactions with your actual system
  2. Synthetic data generation - Generating new samples using AI
  3. Adversarial testing - Deliberately challenging your system
  4. Competitive analysis - Comparing your system to alternatives

When I led ML infrastructure development at LinkedIn, we built systems to automatically capture and store representative samples of user interactions. This gave us a continuous stream of real-world examples for annotation and analysis.

Why Do We Need to Annotate Scalably?

The Economics of Annotation

Annotation can quickly become a bottleneck without scalable approaches:

  1. Volume challenges - Production AI systems generate millions of interactions
  2. Expertise requirements - Quality annotation often requires domain knowledge
  3. Consistency concerns - Maintaining uniform standards across annotators is difficult
  4. Speed requirements - Development velocity depends on annotation turnaround

The User-Annotator Experience Gap

A critical challenge in annotation is that annotators often need to see and evaluate different information than what users see:

  • Users experience the system naturally, focusing on their immediate goals
  • Annotators need additional context about system decisions and alternatives
  • Annotation interfaces must present information that isn’t visible to users
  • The annotator experience must be optimized for efficiency and consistency, not discovery or engagement

Example: For a generative AI system using RAG:

  • User sees: A natural language response to their query
  • Annotator needs to see: The original query, retrieved documents, model prompts, confidence scores, and alternative responses considered

At HealthRhythms, we created specialized views for annotators that showed both the user-facing output and the internal system state. This dual perspective was essential for accurate annotation but would have been overwhelming for actual users.

Scaling Strategies

Scalable annotation requires multiple approaches:

1. User-Generated Annotations

Users themselves can provide valuable feedback through:

  • Explicit mechanisms (thumbs up/down, ratings, flags)
  • Implicit signals (engagement metrics, abandonment rates)
  • Follow-up actions (corrections, refinements)

When building our recruiting products at LinkedIn, we incorporated user feedback directly into our annotation system. Recruiters could flag inappropriate recommendations, and these flags were automatically routed to our annotation team for detailed review.

Implementation tip: Make feedback mechanisms simple and unobtrusive for users. The best user annotation systems feel like a natural part of the product experience, not an additional burden.
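
A minimal sketch of that routing, with an in-process queue standing in for whatever durable store your annotation tooling actually uses:

    from queue import Queue

    # Hypothetical in-process review queue; in production this would be a durable store.
    annotation_queue: Queue = Queue()

    def record_user_feedback(example_id: str, signal: str, comment: str = "") -> None:
        """Capture a lightweight user signal and route negative ones for detailed review."""
        feedback = {"example_id": example_id, "signal": signal, "comment": comment}
        if signal in {"thumbs_down", "flagged"}:
            annotation_queue.put(feedback)  # surfaces the example to the annotation team

    record_user_feedback("req-123", "thumbs_down", "This candidate is not a fit.")
    record_user_feedback("req-124", "thumbs_up")
    print(annotation_queue.qsize())  # 1 -- only the negative signal needs human review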

2. Automation of Routine Annotation

Not all annotation requires human judgment:

  • Rule-based systems can identify clear cases
  • Statistical outlier detection can flag unusual examples
  • Clustering can group similar examples for batch annotation
  • Models can pre-classify examples for human verification
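
A small sketch of rule-based triage along these lines; the thresholds and routing labels are illustrative assumptions, not recommended values:

    def triage(record: dict) -> str:
        """Route an example to auto-pass, automatic review, or sampled review using simple rules."""
        # Clear-cut cases can be resolved without a human in the loop.
        if record["confidence"] >= 0.95 and not record["user_flagged"]:
            return "auto_pass"
        if record["user_flagged"] or record["confidence"] < 0.30:
            return "queue_for_review"
        # Outliers (e.g. unusually slow responses) also get flagged for review.
        if record["latency_ms"] > 5000:
            return "queue_for_review"
        return "sample_for_review"  # everything else is sampled at a lower rate

    print(triage({"confidence": 0.97, "user_flagged": False, "latency_ms": 220}))  # auto_pass
    print(triage({"confidence": 0.55, "user_flagged": True, "latency_ms": 220}))   # queue_for_review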

3. Leveraging AI for Annotation

Modern AI systems can themselves assist in the annotation process:

  • Using models to suggest annotations
  • Identifying examples where the model is uncertain
  • Flagging examples that differ from previously seen patterns
  • Checking for annotation consistency across similar examples

At HealthRhythms, we trained models specifically to identify potential issues in our healthcare predictions. These models helped prioritize which examples needed human review, reducing our annotation load by over 60%.
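
One common pattern is to send humans the examples the model is least certain about first. The sketch below scores uncertainty with prediction entropy; it is a simplified stand-in, not the specific models we trained at HealthRhythms:

    import math

    def entropy(probs: list[float]) -> float:
        """Shannon entropy of a model's predicted class probabilities."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    def select_for_review(examples: list[dict], budget: int) -> list[dict]:
        """Send the examples the model is least sure about to human annotators."""
        return sorted(examples, key=lambda e: entropy(e["probs"]), reverse=True)[:budget]

    examples = [
        {"id": "req-1", "probs": [0.98, 0.02]},  # confident -- likely fine to skip
        {"id": "req-2", "probs": [0.51, 0.49]},  # uncertain -- worth a human look
        {"id": "req-3", "probs": [0.70, 0.30]},
    ]
    print([e["id"] for e in select_for_review(examples, budget=2)])  # ['req-2', 'req-3']

Sorting by uncertainty concentrates scarce annotation time on the examples most likely to be wrong or novel.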

What is the Ideal Annotation Interface?

The ideal annotation interface balances comprehensiveness with efficiency while addressing the challenge of replicating the user experience.

Core Requirements

1. Dual-Perspective Visibility

Annotators need to see:

  • User View: Exact replication of what the user saw
    • For web applications, rendered pages, not just the raw HTML/CSS
    • For mobile apps, screen captures or faithful reproductions
    • For conversational AI, the full conversation history
  • System View: Additional context invisible to users
    • Retrieved documents in RAG systems
    • Confidence scores and alternatives considered
    • Feature importance and decision factors
    • Intermediate reasoning steps

Example: For a content recommendation system:

  • User View: The actual rendered recommendation UI with images, headlines, and layout
  • System View: Relevance scores, retrieval sources, personalization factors, and other candidates considered

2. Annotation Capabilities

  • Binary acceptability judgments
  • Multi-level classification systems
  • Hierarchical tag ontologies
  • Free-form notes
  • Severity classification
  • Confidence indicators

3. Efficiency Features

  • Keyboard shortcuts for common actions
  • Batch operations for similar examples
  • Saved views and filters
  • Template responses for common issues
  • Progress tracking and productivity metrics

4. Collaboration Tools

  • Assignment and routing capabilities
  • Inter-annotator agreement monitoring
  • Discussion threads for complex cases
  • Guideline references and examples
  • Quality control and review workflows
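
Inter-annotator agreement, listed above, is straightforward to monitor; the sketch below computes Cohen’s kappa for two annotators making binary acceptability judgments, with invented labels:

    def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
        """Agreement between two annotators, corrected for chance agreement."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        p_a_true = sum(labels_a) / n
        p_b_true = sum(labels_b) / n
        expected = p_a_true * p_b_true + (1 - p_a_true) * (1 - p_b_true)
        return (observed - expected) / (1 - expected)

    alice = [True, True, False, True, False, True]
    bob   = [True, False, False, True, False, True]
    print(f"kappa = {cohens_kappa(alice, bob):.2f}")  # kappa = 0.67

Persistently low agreement on the same examples usually signals that the annotation guidelines need refinement.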

Implementation Challenges and Solutions

The core challenge of annotation interfaces is faithfully replicating the user experience while providing additional context. This is difficult because:

  1. Annotation typically happens in a separate system from the user-facing product
  2. User interfaces may be dynamic or stateful
  3. System context may be voluminous or complex
  4. The full user context (device, environment) may be hard to capture

Creative solutions include:

  • Browser extensions that capture the complete rendered state of web applications
  • Video capture of user sessions with synchronized system logs
  • Embedded annotation capabilities within the actual product (for internal users)
  • Sandboxed environments that recreate the user’s context for annotators

At Google, we developed a system that could recreate the exact search results page a user saw, complete with all ranking signals and internal metadata. This allowed annotators to see both the user experience and the factors that influenced it.

Making Annotation a Consistent Practice

To maximize the value of annotations, establish a regular cadence:

  1. Schedule weekly annotation sessions where team members annotate together
    • This builds shared understanding of quality standards
    • It ensures consistent application of annotation criteria
    • It creates a feedback loop between engineering and quality assessment
  2. Review annotations regularly in team meetings
    • Discuss interesting or challenging examples
    • Identify emerging patterns in system behavior
    • Refine annotation guidelines based on edge cases
  3. Rotate annotation responsibilities across team members
    • Ensures everyone maintains connection to real user experiences
    • Builds empathy for user problems
    • Distributes the annotation workload

At LinkedIn, we instituted “Annotation Wednesdays” where the entire team would spend 1-2 hours reviewing and annotating examples together. This practice significantly improved our shared understanding of quality issues and led to better-targeted engineering solutions.

Putting It All Together: The Virtuous Cycle

When implemented effectively, these five capabilities create a virtuous cycle:

  1. Observability captures comprehensive data about system behavior
  2. Data viewing enables you to understand the user experience and system context
  3. Annotation applies human judgment to examples
  4. Evaluation measures overall system performance
  5. Error analysis identifies patterns and priorities for improvement

This cycle accelerates development by focusing engineering efforts on the highest-impact improvements. It also creates a learning system that gets better at identifying and addressing issues over time.

The Competitive Advantage

Companies that master these capabilities gain three critical advantages:

  1. Faster iteration cycles - More efficient problem identification and solution validation
  2. Higher quality products - More comprehensive understanding of failure modes
  3. Better monitoring capabilities - Earlier detection of emerging issues

In the AI product landscape, the speed of your learning and improvement cycle becomes a primary differentiator. Companies that can identify, understand, and fix issues faster will consistently outperform their competitors.

Looking for help building these capabilities in your team? Book a free consult with me.