Topic hub

AI Reliability

Reliable AI features are mostly boring on purpose. The work usually sits at the boundary between the model and the rest of the product: traces, schemas, retries, evals, replay, and clear ownership for risky actions.

Start here

→ Give the unreliable part a stable interface.
→ Capture enough evidence to debug the next failure.
→ Validate outputs before they leak into product state.
→ Keep risky side effects behind explicit gates.
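
The four points above can be sketched in one small piece of code. This is a minimal illustration, not a reference implementation: the refund scenario and every name in it (RefundDecision, TraceRecord, propose_refund, execute_refund, call_model, approve, issue_refund) are hypothetical, and call_model stands in for whatever client you actually use.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


class InvalidModelOutput(ValueError):
    """Raised when the model's text does not match the expected shape."""


@dataclass
class RefundDecision:
    # Hypothetical typed result that the rest of the app consumes.
    order_id: str
    amount_cents: int
    reason: str


@dataclass
class TraceRecord:
    # Evidence kept for every completed call, valid or not.
    prompt: str
    raw_output: str
    latency_s: float
    error: Optional[str] = None


def parse_refund_decision(raw: str) -> RefundDecision:
    """Validate raw model text before anything downstream sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidModelOutput("expected a JSON object")
    order_id = data.get("order_id")
    amount = data.get("amount_cents")
    reason = data.get("reason")
    if not isinstance(order_id, str) or not order_id:
        raise InvalidModelOutput("order_id must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount < 0:
        raise InvalidModelOutput("amount_cents must be a non-negative integer")
    if not isinstance(reason, str):
        raise InvalidModelOutput("reason must be a string")
    return RefundDecision(order_id, amount, reason)


def propose_refund(
    prompt: str,
    call_model: Callable[[str], str],
    traces: List[TraceRecord],
) -> RefundDecision:
    """Stable interface: callers get a RefundDecision or an exception."""
    start = time.perf_counter()
    raw = call_model(prompt)
    record = TraceRecord(prompt, raw, time.perf_counter() - start)
    traces.append(record)  # record the call before validating it
    try:
        return parse_refund_decision(raw)
    except InvalidModelOutput as exc:
        record.error = str(exc)  # keep the failure reason with the trace
        raise


def execute_refund(
    decision: RefundDecision,
    approve: Callable[[RefundDecision], bool],
    issue_refund: Callable[[RefundDecision], None],
) -> bool:
    """Explicit gate: the side effect runs only after approval."""
    if not approve(decision):
        return False
    issue_refund(decision)
    return True
```

Because the model call, the approval check, and the side effect are all passed in as plain callables, a test or eval run can swap in fakes and never touch a real customer. In this sketch a trace is only written when call_model returns; a fuller version would probably also record calls that raise.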

Related writing

Jun 15, 2026

If It Has to Happen, Don’t Put It in the Prompt

Essay · ai · agents · ai reliability

When agent instructions turn into all-caps rules, the fix is often to move the requirement out of the prompt and into a workflow that can check it.

TILs

May 28, 2026

TIL: Annotate AI Traces

TIL · ai · evals · ai reliability

Logs tell you what happened. Annotations tell you what it meant, why it failed, and whether the fix helped.

May 28, 2026

TIL: Evaluate RAG Retrieval Separately

TIL · ai · rag · evals

A bad RAG answer does not tell you whether retrieval failed, generation failed, or the product asked an impossible question. Split the blame before fixing anything.

May 28, 2026

TIL: Instrument AI Calls Before You Debug

TIL · ai · evals · ai reliability

If an AI answer goes sideways and you cannot see the prompt, model, latency, tokens, retrieved context, and failure path, you are debugging from vibes.

May 28, 2026

TIL: Make AI Pipelines Safe to Replay

TIL · ai · evals · ai reliability

If every eval run emails a customer, updates production state, or fires a webhook, you do not have an eval harness. You have a hostage situation.

May 28, 2026

TIL: Model Agent Workflows as State Graphs

TIL · ai · agents · ai reliability

Agents get less spooky when they have named states, constrained transitions, and a record of how each decision moved the process forward.

May 28, 2026

TIL: Quality Check RAG Chunks

TIL · ai · rag · evals

Before blaming the model, inspect the chunks. Duplicate, empty, bloated, or low-signal chunks can wreck retrieval quietly.

May 28, 2026

TIL: Record and Replay AI Workflows

TIL · ai · evals · ai reliability

When a multi-step AI run fails once and then refuses to fail again, replay beats superstition. Capture the calls, context, and intermediate state.

May 28, 2026

TIL: Put Approval Before Risky Agent Tools

TIL · ai · agents · ai reliability

Agents should not get to delete files, send messages, spend money, publish content, or mutate production just because the next step looks obvious.

May 28, 2026

TIL: Validate RAG Citations

TIL · ai · rag · evals

A citation is not proof just because the model printed a source name. Verify that the source exists and actually supports the claim.

May 28, 2026

TIL: Structure LLM Outputs at the Boundary

TIL · ai · ai reliability · ai platforms

If the rest of your app needs data, make the model return data. Do not make downstream code scrape nice-sounding paragraphs forever.

May 10, 2026

TIL: Make AI Features Boring to Change

TIL · ai · ai reliability · ai platforms

AI features get scary when prompts, logs, evals, schemas, fallbacks, and product code all live in the same pile. Give the weird part one stable interface so changes have a place to go.