If you only read one thing
Start with the piece that turns evals from a measurement project into a product operating loop.
Series
A reading path for teams that need AI quality to become inspectable, fixable, and steadily better. The point is not to worship a metric. The point is to build the loop that tells you what broke, why, and what to change next.
TIL · ai · evals · ai reliability
Logs tell you what happened. Annotations tell you what it meant, why it failed, and whether the fix helped.
TIL · ai · agents · evals
One giant prompt can hide five separate jobs. Split the work so each part has a smaller contract and a failure you can actually name.
TIL · ai · rag · evals
A bad RAG answer does not tell you whether retrieval failed, generation failed, or the product asked an impossible question. Split the blame before fixing anything.
TIL · ai · evals · ai reliability
If an AI answer goes sideways and you cannot see the prompt, model, latency, tokens, retrieved context, and failure path, you are debugging from vibes.
TIL · ai · evals · ai reliability
If every eval run emails a customer, updates production state, or fires a webhook, you do not have an eval harness. You have a hostage situation.
TIL · ai · rag · evals
Before blaming the model, inspect the chunks. Duplicate, empty, bloated, or low-signal chunks can wreck retrieval quietly.
TIL · ai · evals · ai reliability
When a multi-step AI run fails once and then refuses to fail again, replay beats superstition. Capture the calls, context, and intermediate state.
TIL · ai · rag · evals
A citation is not proof just because the model printed a source name. Verify that the source exists and actually supports the claim.
Essay · ai · evals · engineering
A streamlined system for AI evaluation that closes the gap between seeing problems and fixing them.
Essay · ai · rag · evals
How Frigade Slashed Latency & Boosted User Helpfulness
Essay · ai · flywheel · evaluation
Essay · ai · evals · flywheel
An AI Maturity Model
Essay · ai · evals
A Practical Guide to Evaluation-Driven Improvement
Use the AI Evals topic hub for concepts, TILs, and future case-study links.