Skylar Payne

Topic hub

AI Evals

Evals are the operating loop for improving AI systems. Not a leaderboard, not a vibes dashboard, and definitely not a magic judge prompt. The useful version connects real failures to review, measurement, product changes, and regression checks.

Start here

  1. Why evals are the operating loop for improving AI systems, not a box-checking score.
  2. How to turn trace review into a failure taxonomy and regression set.
  3. Where AI judges help, where they lie, and how to calibrate them against human review.
  4. How online and offline evals fit into the same product improvement loop.

Core concepts

Failure taxonomy

The map of what breaks, why it breaks, and which failures are worth preventing.
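
A taxonomy stays honest when it lives next to the labeled traces it came from. A minimal sketch, assuming each reviewed trace carries one or more failure-mode labels plus a note; the record fields and failure names here are illustrative, not from any particular tool:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ReviewedTrace:
    trace_id: str
    user_input: str
    model_output: str
    failure_modes: list[str] = field(default_factory=list)  # empty list = the trace passed
    note: str = ""

def failure_taxonomy(traces: list[ReviewedTrace]) -> Counter:
    """Count how often each failure mode shows up across reviewed traces."""
    return Counter(mode for t in traces for mode in t.failure_modes)

# The counts are what tell you which failures are frequent enough to be worth preventing.
reviewed = [
    ReviewedTrace("t1", "refund status?", "sure, anything else?", ["ignored_context"]),
    ReviewedTrace("t2", "cancel my order", "calling search_docs", ["wrong_tool_call", "ignored_context"]),
    ReviewedTrace("t3", "hi", "hello! how can I help?", []),
]
print(failure_taxonomy(reviewed).most_common())
# [('ignored_context', 2), ('wrong_tool_call', 1)]
```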

Review UI

The place humans inspect traces, label failures, and create the ground truth for improvement.
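
Whatever the review surface looks like, its output is what matters: one labeled record per inspected trace, stored somewhere the rest of the loop can read. A minimal sketch, assuming labels get appended to a JSONL file; the path and schema are assumptions, not a fixed convention:

```python
import json
from pathlib import Path

LABELS_PATH = Path("review_labels.jsonl")  # hypothetical location

def record_review(trace_id: str, failure_modes: list[str], note: str = "") -> None:
    """Append one human judgment; an empty failure_modes list means the trace passed."""
    record = {"trace_id": trace_id, "failure_modes": failure_modes, "note": note}
    with LABELS_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# A reviewer marks one trace as a context failure and another as a clean pass.
record_review("t2", ["ignored_context"], note="answered without reading the order history")
record_review("t3", [])
```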

Judge calibration

How to keep model-as-judge systems honest instead of outsourcing taste to another black box.
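
Calibration here just means checking the judge against human labels on the same traces before trusting it at scale. A minimal sketch, assuming paired binary pass/fail verdicts; it reports raw agreement plus the judge's false-pass rate, the miss that usually hurts most:

```python
def calibrate_judge(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts to human verdicts on the same traces (True = pass)."""
    assert len(human) == len(judge) and human, "need paired labels"
    agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
    human_fails = [(h, j) for h, j in zip(human, judge) if not h]
    # Fraction of human-labeled failures the judge waved through.
    false_pass_rate = (
        sum(j for _, j in human_fails) / len(human_fails) if human_fails else 0.0
    )
    return {"agreement": agreement, "false_pass_rate": false_pass_rate}

# 80% agreement sounds fine, yet the judge misses half of the real failures.
print(calibrate_judge(
    human=[True, True, False, False, True],
    judge=[True, True, True, False, True],
))
# {'agreement': 0.8, 'false_pass_rate': 0.5}
```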

Regression sets

The durable examples that make sure yesterday's fix does not become tomorrow's surprise outage.
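
Mechanically, a regression set is the labeled examples behind yesterday's fixes, replayed against today's system. A minimal sketch using pytest, with a hypothetical `run_pipeline` entry point and illustrative examples; the specifics will differ in any real codebase:

```python
import pytest

# Each regression example pins a past failure: the input that exposed it and a
# minimal check the current output must satisfy. Hypothetical data and checks.
REGRESSIONS = [
    {"trace_id": "t2", "user_input": "cancel my order", "must_contain": "order"},
    {"trace_id": "t7", "user_input": "refund status?", "must_contain": "refund"},
]

def run_pipeline(user_input: str) -> str:
    """Placeholder for the real system under test."""
    raise NotImplementedError

@pytest.mark.parametrize("example", REGRESSIONS, ids=lambda e: e["trace_id"])
def test_no_regression(example):
    output = run_pipeline(example["user_input"])
    # The simplest useful check; judge-based or structured checks slot in the same way.
    assert example["must_contain"] in output
```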

Evals TILs

Future short eval notes will appear here, below the canonical guides, which is exactly where they belong.