Best next move
Start with tracing, annotation, and record/replay. Those pieces make the feedback loop visible, which gives the rest of the series something concrete to point at.
Series
A reading path for AI systems that have to survive contact with real users. The source material is the Mirascope Effective AI tips corpus; the site only promotes pieces once they have a real job in the library.
TIL · ai · ai reliability · ai platforms
AI features get scary when prompts, logs, evals, schemas, fallbacks, and product code all live in the same pile. Give the weird part one stable interface so changes have a place to go.
TIL · ai · evals · ai reliability
If an AI answer goes sideways and you cannot see the prompt, model, latency, tokens, retrieved context, and failure path, you are debugging from vibes.
TIL · ai · evals · ai reliability
Logs tell you what happened. Annotations tell you what it meant, why it failed, and whether the fix helped.
TIL · ai · ai reliability · ai platforms
If the rest of your app needs data, make the model return data. Do not make downstream code scrape nice-sounding paragraphs forever.
TIL · ai · rag · evals
A bad RAG answer does not tell you whether retrieval failed, generation failed, or the product asked an impossible question. Split the blame before fixing anything.
TIL · ai · evals · ai reliability
If every eval run emails a customer, updates production state, or fires a webhook, you do not have an eval harness. You have a hostage situation.
TIL · ai · rag · evals
Before blaming the model, inspect the chunks. Duplicate, empty, bloated, or low-signal chunks can wreck retrieval quietly.
TIL · ai · agents · evals
One giant prompt can hide five separate jobs. Split the work so each part has a smaller contract and a failure you can actually name.
TIL · ai · rag · evals
A citation is not proof just because the model printed a source name. Verify that the source exists and actually supports the claim.
TIL · ai · evals · ai reliability
When a multi-step AI run fails once and then refuses to fail again, replay beats superstition. Capture the calls, context, and intermediate state.
TIL · ai · agents · ai reliability
Agents should not get to delete files, send messages, spend money, publish content, or mutate production just because the next step looks obvious.
TIL · ai · agents · ai reliability
Agents get less spooky when they have named states, constrained transitions, and a record of how each decision moved the process forward.
Essay · ai · agents · ai reliability
When agent instructions turn into all-caps rules, the fix is often to move the requirement out of the prompt and into a workflow that can check it.
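A minimal sketch of what the agent pieces above could look like once the rule lives in code instead of the prompt: named states, a transition table, an approval gate before any side effect, and a history of how each decision moved the run forward. Everything here is hypothetical and for illustration only (the `State` names, the `ALLOWED` table, `AgentRun`, and `run_with_approval` are not taken from any library or from the series itself).

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PLANNING = "planning"
    AWAITING_APPROVAL = "awaiting_approval"
    EXECUTING = "executing"
    DONE = "done"
    REJECTED = "rejected"


# Hypothetical transition table: any move not listed here is refused.
ALLOWED = {
    State.PLANNING: {State.AWAITING_APPROVAL},
    State.AWAITING_APPROVAL: {State.EXECUTING, State.REJECTED},
    State.EXECUTING: {State.DONE},
    State.DONE: set(),
    State.REJECTED: set(),
}


@dataclass
class AgentRun:
    state: State = State.PLANNING
    history: list = field(default_factory=list)

    def move(self, target: State, reason: str) -> None:
        # The workflow, not the prompt, enforces which steps are legal.
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {target.value}"
            )
        # Record every decision so the run can be reviewed later.
        self.history.append((self.state.value, target.value, reason))
        self.state = target


def run_with_approval(action: str, approved: bool) -> AgentRun:
    run = AgentRun()
    run.move(State.AWAITING_APPROVAL, f"proposed: {action}")
    if not approved:
        run.move(State.REJECTED, "reviewer declined")
        return run
    run.move(State.EXECUTING, "reviewer approved")
    # A real side effect would happen here; this sketch only records it.
    run.move(State.DONE, f"completed: {action}")
    return run
```

In this sketch the agent cannot jump from planning straight to executing: `AgentRun().move(State.EXECUTING, "looks obvious")` raises `ValueError`, and that is the check an all-caps instruction could never guarantee.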
Interfaces, replay, validation, approvals, and the habits that keep demos from hurting real users.
Tracing, annotation, record/replay, and splitting fuzzy work into pieces you can review.
Retriever evals, chunk quality, citation validation, reranking, and query rewriting.
Approvals, explicit states, safer tool use, and agent loops with actual limits.
Replayable evidence, code-agent boundaries, PR review, and verification habits for AI-assisted coding.
Shared platform pieces for traces, schemas, routing, retries, cost controls, and model changes.
Use the Library as the front door and the AI Evals hub as the first topic anchor.