Technical library

Build AI systems that actually improve.

This is the front door for the deeper technical work: evals, reliability, agents, RAG, observability, and the operating loops that keep AI products from becoming a pile of clever demos.

Start here

The shortest useful path

Open the evals hub →

1. Understand the loop

Good evals are not a score. They are the feedback loop for making product behavior better.

2. Find the failure modes

Trace review, taxonomies, and regression sets are how vague “quality” becomes fixable work.

3. Ship the operating system

The durable win is a repeatable review → change → measure loop, not one heroic prompt rewrite.

Core topics

AI Evals

Evaluation loops, failure taxonomies, AI judges, and review systems that teams actually use.

AI Reliability

Patterns for turning prototypes into systems that stay observable, debuggable, and safe under real use.

Agents

Multi-step workflows, tool use, approval gates, memory, state, and traces.

RAG

Retrieval quality, grounding, latency, evaluation, and the places RAG still breaks.

TIL / short notes

Small durable lessons from real work: weird eval failures, implementation tricks, debugging patterns, and corrections to naive mental models.

View all short notes →

TIL
TIL: Make AI Features Boring to Change
AI features get scary to change when prompts, logs, evals, schema validation, fallbacks, and product code all blur together. Give the unreliable part one reliable interface so the next change has an obvious home.

Series

Practical AI Evals

A practical reading path for building evals that improve product behavior instead of becoming dashboard theater.

Effective AI Engineering

A curated path through reliability, observability, RAG, agents, guardrails, and production feedback loops.

Latest writing

All posts →

May 10, 2026

TIL: Make AI Features Boring to Change

TIL · ai · ai reliability · architecture

AI features get scary to change when prompts, logs, evals, schema validation, fallbacks, and product code all blur together. Give the unreliable part one reliable interface so the next change has an obvious home.

Apr 4, 2026

GPT-5.4 in OpenClaw doesn’t suck. Your prompts do.

Essay

Anthropic changed OpenClaw billing. We ran evals, tuned the bootstrap files, and GPT-5.4 got a lot better.

Mar 21, 2026

If DSPy is So Great, Why Isn't Anyone Using It?

Essay · ai · dspy · engineering

Any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.

Mar 17, 2026

Evals That Actually Get Used

Essay · ai · evals · engineering

A streamlined system for AI evaluation that closes the gap between seeing problems and fixing them.

Nov 27, 2025

Bringing Data Science Back to AI Engineering

Essay · ai engineering · leadership · technical implementation