1. Understand the loop
Good evals are not a score. They are the feedback loop for making product behavior better.
Technical library
This is the front door for the deeper technical work: evals, reliability, agents, RAG, observability, and the operating loops that keep AI products from becoming a pile of clever demos.
Start here
Trace review, taxonomies, and regression sets are how vague “quality” becomes fixable work.
The durable win is a repeatable review → change → measure loop, not one heroic prompt rewrite.
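As a rough illustration of that loop, the sketch below tallies reviewer labels into a failure taxonomy, then scores a baseline and a changed system against the same small regression set. Everything named here (`RegressionCase`, `failure_counts`, `pass_rate`, the labels, and the toy `baseline` and `candidate` functions) is hypothetical and stands in for whatever a real product would use.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RegressionCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def failure_counts(annotations: list[dict]) -> Counter:
    """Review step: tally annotated traces by failure label."""
    labels: list[Optional[str]] = [a["failure"] for a in annotations]
    return Counter(label for label in labels if label is not None)


def pass_rate(system: Callable[[str], str], cases: list[RegressionCase]) -> float:
    """Measure step: fraction of regression cases the system passes."""
    if not cases:
        raise ValueError("regression set is empty")
    passed = sum(1 for case in cases if case.check(system(case.prompt)))
    return passed / len(cases)


# Hypothetical reviewer annotations on four traces.
annotations = [
    {"trace_id": "t1", "failure": "missing_citation"},
    {"trace_id": "t2", "failure": None},
    {"trace_id": "t3", "failure": "missing_citation"},
    {"trace_id": "t4", "failure": "wrong_tone"},
]

# Regression cases written for the most common failure.
cases = [
    RegressionCase("c1", "What is the refund policy?", lambda out: "[source]" in out),
    RegressionCase("c2", "How long does shipping take?", lambda out: "[source]" in out),
]


def baseline(prompt: str) -> str:
    # Toy stand-in for the system before the change.
    return f"Answer to: {prompt}"


def candidate(prompt: str) -> str:
    # Toy stand-in for the system after the change.
    return f"Answer to: {prompt} [source]"


top_failure, count = failure_counts(annotations).most_common(1)[0]
before = pass_rate(baseline, cases)
after = pass_rate(candidate, cases)
print(f"top failure: {top_failure} ({count}); pass rate {before:.0%} -> {after:.0%}")
```

The shape is the point: the taxonomy decides what to fix next, and the regression set decides whether the fix held.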
Evaluation loops, failure taxonomies, AI judges, and review systems that teams actually use.
Patterns for turning prototypes into systems that stay observable, debuggable, and safe under real use.
Multi-step workflows, tool use, approval gates, memory, state, and traces.
Retrieval quality, grounding, latency, evaluation, and the places RAG still breaks.
Small durable lessons from real work: weird eval failures, implementation tricks, debugging patterns, and corrections to naive mental models.
Essay · ai · agents · ai reliability
When agent instructions turn into all-caps rules, the fix is often to move the requirement out of the prompt and into a workflow that can check it.
TIL: Annotate AI Traces
TIL · ai · evals · ai reliability
Logs tell you what happened. Annotations tell you what it meant, why it failed, and whether the fix helped.
TIL: Break AI Workflows Into Parts You Can Grade
TIL · ai · agents · evals
One giant prompt can hide five separate jobs. Split the work so each part has a smaller contract and a failure you can actually name.
TIL: Evaluate RAG Retrieval Separately
TIL · ai · rag · evals
A bad RAG answer does not tell you whether retrieval failed, generation failed, or the product asked an impossible question. Split the blame before fixing anything.
TIL: Instrument AI Calls Before You Debug
TIL · ai · evals · ai reliability
If an AI answer goes sideways and you cannot see the prompt, model, latency, tokens, retrieved context, and failure path, you are debugging from vibes.
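One way to capture the fields the instrumentation note lists is a thin wrapper around the model call. This is a minimal sketch, not any particular SDK's API: `CallRecord`, `instrumented_call`, the `call_model` signature, and the shape of its return value are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallRecord:
    prompt: str
    model: str
    retrieved_context: list[str]
    latency_s: float = 0.0
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    output: Optional[str] = None
    error: Optional[str] = None  # failure path: exception type and message


def instrumented_call(
    call_model: Callable[[str, str], dict],  # hypothetical client function
    prompt: str,
    model: str,
    retrieved_context: list[str],
    log: list[CallRecord],
) -> str:
    """Run one model call and append a record whether it succeeds or fails."""
    record = CallRecord(prompt, model, list(retrieved_context))
    start = time.perf_counter()
    try:
        result = call_model(prompt, model)
        record.output = result["text"]
        record.prompt_tokens = result.get("prompt_tokens")
        record.completion_tokens = result.get("completion_tokens")
        return record.output
    except Exception as exc:
        record.error = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the failure
    finally:
        record.latency_s = time.perf_counter() - start
        log.append(record)


# Toy stand-ins for a real model client.
def fake_model(prompt: str, model: str) -> dict:
    return {"text": "30 days", "prompt_tokens": 12, "completion_tokens": 2}


def broken_model(prompt: str, model: str) -> dict:
    raise TimeoutError("upstream timed out")


log: list[CallRecord] = []
context = ["Refunds are accepted within 30 days."]
answer = instrumented_call(fake_model, "What is the refund window?", "example-model", context, log)
try:
    instrumented_call(broken_model, "What is the refund window?", "example-model", context, log)
except TimeoutError:
    pass  # the failed call is still recorded in log
```

Recording in `finally` means a failed call leaves the same kind of record as a successful one, so the failure path can be reviewed next to the prompt and context that produced it.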