Technical library
This is the front door for the deeper technical work: evals, reliability, agents, RAG, observability, and the operating loops that keep AI products from becoming a pile of clever demos.
Start here
1. Understand the loop: Good evals are not a score. They are the feedback loop for making product behavior better.
2. Trace review, taxonomies, and regression sets are how vague “quality” becomes fixable work.
3. The durable win is a repeatable review → change → measure loop, not one heroic prompt rewrite; a minimal sketch of that loop follows below.
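To make the loop concrete, here is a minimal Python sketch under stated assumptions: a tiny regression set whose cases carry failure-taxonomy tags, and a stand-in model_under_test callable. Every name in it (the cases, the tags, model_under_test) is hypothetical and only illustrates the shape of the loop, not this site's tooling.

```python
# A minimal sketch of the review -> change -> measure loop.
# All names below are illustrative, not real tooling.
from collections import defaultdict

# Regression set: each failure found in trace review becomes a case,
# tagged with its failure-taxonomy bucket so fixes can be targeted.
CASES = [
    {"input": "refund policy?", "must_contain": "30 days", "tag": "retrieval-miss"},
    {"input": "cancel my order", "must_contain": "order number", "tag": "missing-clarification"},
]

def model_under_test(prompt: str) -> str:
    # Stand-in for the real system; swap in the actual call.
    return "Refunds are accepted within 30 days of purchase."

def measure(run, cases):
    """Run every case and bucket failures by taxonomy tag."""
    failures = defaultdict(list)
    for case in cases:
        output = run(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures[case["tag"]].append(case["input"])
    passed = len(cases) - sum(len(v) for v in failures.values())
    return passed / len(cases), failures

score, failures = measure(model_under_test, CASES)
print(f"pass rate: {score:.0%}")
for tag, inputs in failures.items():
    # Review these buckets, change the prompt or pipeline, then re-measure.
    print(f"{tag}: {len(inputs)} failing case(s)")
```

The per-tag buckets are the point: they turn a single pass rate into a list of fixable work, and rerunning the same set after a change is the “measure” step.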
Evals
Evaluation loops, failure taxonomies, AI judges, and review systems that teams actually use.
Reliability & observability
Patterns for turning prototypes into systems that stay observable, debuggable, and safe under real use.
Agents
Multi-step workflows, tool use, approval gates, memory, state, and traces.
RAG
Retrieval quality, grounding, latency, evaluation, and the places RAG still breaks.
TIL
Small durable lessons from real work: weird eval failures, implementation tricks, debugging patterns, and corrections to naive mental models.
No TILs are published yet; the site has a first-class lane ready for the first one.
Essay
Anthropic changed OpenClaw billing. We ran evals, tuned the bootstrap files, and GPT-5.4 got a lot better.
Essay · ai · dspy · engineering
Any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.
Essay · ai · evals · engineering
A streamlined system for AI evaluation that closes the gap between seeing problems and fixing them.
Essay · ai engineering · leadership · technical implementation