Failure taxonomy
The map of what breaks, why it breaks, and which failures are worth preventing.
Topic hub
Evals are the operating loop for improving AI systems. Not a leaderboard, not a vibes dashboard, and definitely not a magic judge prompt. The useful version connects real failures to review, measurement, product changes, and regression checks.
The place humans inspect traces, label failures, and create the ground truth for improvement.
How to keep model-as-judge systems honest instead of outsourcing taste to another black box.
The durable examples that make sure yesterday's fix does not become tomorrow's surprise outage.
Essay · ai · evals · engineering
A streamlined system for AI evaluation that closes the gap between seeing problems and fixing them.
Essay · ai · rag · evals
How Frigade Slashed Latency & Boosted User Helpfulness
Essay · ai · flywheel · evaluation
Essay · ai · evals · flywheel
An AI Maturity Model
Essay · ai · evals
A Practical Guide to Evaluation-Driven Improvement
TIL · ai · evals · ai reliability
Logs tell you what happened. Annotations tell you what it meant, why it failed, and whether the fix helped.
TIL · ai · agents · evals
One giant prompt can hide five separate jobs. Split the work so each part has a smaller contract and a failure you can actually name.
TIL · ai · rag · evals
A bad RAG answer does not tell you whether retrieval failed, generation failed, or the product asked an impossible question. Split the blame before fixing anything.
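One way to split the blame is to label each failing case with three facts and let a small function name the culprit. The sketch below is illustrative only: `RagCase`, its fields, and the label strings are hypothetical names, and it assumes a reviewer has already marked whether the question was answerable and which document should have been retrieved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RagCase:
    """One reviewed RAG example (all fields are hypothetical labels)."""
    answerable: bool               # does the corpus contain an answer at all?
    gold_doc_id: Optional[str]     # document a reviewer says should be retrieved
    retrieved_ids: List[str]       # what the retriever actually returned
    answer_correct: bool           # reviewer's verdict on the final answer


def blame(case: RagCase) -> str:
    """Attribute a case to the earliest stage that went wrong."""
    if not case.answerable:
        return "impossible_question"
    if case.gold_doc_id not in case.retrieved_ids:
        return "retrieval"
    if not case.answer_correct:
        return "generation"
    return "ok"
```

Checking the stages in pipeline order matters: a wrong answer built on missing context is counted as a retrieval failure, not a generation failure.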
TIL · ai · evals · ai reliability
If an AI answer goes sideways and you cannot see the prompt, model, latency, tokens, retrieved context, and failure path, you are debugging from vibes.
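A minimal sketch of a trace record that carries the fields listed above. The `Trace` class, its field names, and the JSON-lines output are assumptions for illustration, not the schema of any particular tracing tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Trace:
    """One model call, with enough context to debug it later."""
    prompt: str
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    retrieved_context: List[str] = field(default_factory=list)
    failure_path: List[str] = field(default_factory=list)  # steps leading to the failure


def to_log_line(trace: Trace) -> str:
    """Serialize a trace as one JSON line so it can be grepped and replayed."""
    return json.dumps(asdict(trace), sort_keys=True)
```

Writing one JSON object per line keeps the log appendable and easy to filter by model, latency, or failure step.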
TIL · ai · evals · ai reliability
If every eval run emails a customer, updates production state, or fires a webhook, you do not have an eval harness. You have a hostage situation.
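One common way to avoid this is to pass side-effecting clients into the code under test, then swap in a recording stand-in during eval runs. The sketch below assumes that design; `RecordingSink` and `handle_ticket` are hypothetical names, not part of any real library.

```python
class RecordingSink:
    """Stands in for email or webhook clients during eval runs.

    It records what would have been sent instead of sending it.
    """

    def __init__(self):
        self.calls = []

    def send(self, kind, payload):
        self.calls.append((kind, payload))
        return {"status": "recorded"}


def handle_ticket(text, sink):
    """Hypothetical system under test: drafts a reply and 'sends' it."""
    reply = f"Thanks, we received: {text}"
    sink.send("email", {"body": reply})
    return reply


# In an eval run, inspect the recorded calls instead of a customer's inbox.
sink = RecordingSink()
handle_ticket("My export is stuck", sink)
```

The recorded calls also become something to assert on: an eval can check that exactly one email would have gone out and what it said.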
TIL · ai · rag · evals
Before blaming the model, inspect the chunks. Duplicate, empty, bloated, or low-signal chunks can wreck retrieval quietly.
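A rough chunk audit can be a few lines. The sketch below flags empty, duplicate, bloated, and very short chunks; the function name and the `max_chars` and `min_chars` thresholds are arbitrary assumptions, and "very short" is only a crude proxy for low-signal.

```python
import hashlib


def audit_chunks(chunks, max_chars=2000, min_chars=20):
    """Return chunk indices grouped by problem type."""
    seen = set()
    report = {"empty": [], "duplicate": [], "bloated": [], "short": []}
    for i, text in enumerate(chunks):
        # Normalize whitespace and case so trivially different copies match.
        norm = " ".join(text.split()).lower()
        if not norm:
            report["empty"].append(i)
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            report["duplicate"].append(i)
        else:
            seen.add(digest)
        if len(norm) > max_chars:
            report["bloated"].append(i)
        elif len(norm) < min_chars:
            report["short"].append(i)
    return report
```

Exact-hash matching only catches copies that are identical after normalization; near-duplicates would need a similarity measure on top.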
TIL · ai · evals · ai reliability
When a multi-step AI run fails once and then refuses to fail again, replay beats superstition. Capture the calls, context, and intermediate state.
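A minimal record-and-replay sketch, assuming each external call (model, tool, retriever) can be wrapped as a plain function. `Recorder` and `Replayer` are hypothetical helpers written for this example.

```python
class Recorder:
    """Wraps a function and keeps a tape of every call and its result."""

    def __init__(self, fn):
        self.fn = fn
        self.tape = []

    def __call__(self, *args, **kwargs):
        result = self.fn(*args, **kwargs)
        self.tape.append({"args": list(args), "kwargs": kwargs, "result": result})
        return result


class Replayer:
    """Returns recorded results in order, without calling the real function."""

    def __init__(self, tape):
        self._tape = list(tape)
        self._pos = 0

    def __call__(self, *args, **kwargs):
        entry = self._tape[self._pos]
        # Fail loudly if the rerun asks for something the recording never saw.
        if entry["args"] != list(args) or entry["kwargs"] != kwargs:
            raise AssertionError(f"call {self._pos} diverged from the recording")
        self._pos += 1
        return entry["result"]
```

During the failing run, wrap the flaky dependency in `Recorder`; afterwards, hand its tape to `Replayer` and rerun the pipeline against the exact same intermediate results.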
TIL · ai · rag · evals
A citation is not proof just because the model printed a source name. Verify that the source exists and actually supports the claim.
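The two checks can be sketched as a lookup plus a support test. Here "support" is approximated by a normalized substring match on a quoted span, which is a deliberately crude assumption; real support checks usually need human review or a stronger comparison. `verify_citation` and the `sources` mapping are hypothetical.

```python
def _norm(s):
    """Collapse whitespace and lowercase for a forgiving comparison."""
    return " ".join(s.split()).lower()


def verify_citation(source_id, quote, sources):
    """Classify a citation as missing_source, unsupported, or supported.

    sources maps source ids to their full text (an assumed shape).
    """
    text = sources.get(source_id)
    if text is None:
        return "missing_source"      # the model named a source that does not exist
    q = _norm(quote)
    if q and q in _norm(text):
        return "supported"           # the quoted span appears in the source
    return "unsupported"             # source exists but does not contain the claim
```

Keeping "missing" and "unsupported" as separate outcomes is useful because they tend to have different fixes: one is a hallucinated reference, the other a misread or misattributed one.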