TIL: Record and Replay AI Workflows
The worst AI bug is the one you cannot make happen twice.
A customer gets a strange answer. The trace shows three model calls, two retrieval steps, a parser repair, and one fallback. You rerun the job and it works. You rerun it again and it fails differently. Everyone gets tired and the fix becomes a prompt tweak nobody trusts.
Record/replay is the antidote.
During the original run, capture the inputs, model responses, retrieved context ids, tool outputs, structured intermediate state, and final result. During replay, feed the captured artifacts back in instead of asking the live feature to rediscover the same path.
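The capture and replay halves can be sketched together. This is a minimal illustration, not the source's implementation: `Recorder`, `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the two `fake_*` functions stand in for a live retriever and model call.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Recorder:
    """Stores step outputs in "record" mode and returns them in "replay" mode."""
    mode: str = "record"  # "record" or "replay"
    steps: dict[str, Any] = field(default_factory=dict)

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.mode == "replay":
            # Replay never calls the live function; a missing step raises KeyError.
            return self.steps[name]
        result = fn()
        self.steps[name] = result
        return result

    def dumps(self) -> str:
        """Serialize the captured steps so the run outlives the process."""
        return json.dumps(self.steps, sort_keys=True)

    @classmethod
    def loads(cls, payload: str) -> "Recorder":
        return cls(mode="replay", steps=json.loads(payload))


def answer_question(question: str, rec: Recorder, retrieve, generate) -> str:
    """A two-step workflow where every nondeterministic call goes through rec.step."""
    rec.step("input", lambda: question)
    doc_ids = rec.step("retrieval", lambda: retrieve(question))
    raw = rec.step("generation", lambda: generate(question, doc_ids))
    return raw.strip()


# Hypothetical stand-ins for the live retriever and model call.
def fake_retrieve(question):
    return ["doc-7", "doc-12"]


def fake_generate(question, doc_ids):
    return f" Answer from {len(doc_ids)} docs "


def must_not_run(*args):
    raise AssertionError("live call attempted during replay")


# Original run: capture everything, then persist it.
rec = Recorder()
original = answer_question("why?", rec, fake_retrieve, fake_generate)
saved = rec.dumps()

# Replay: the live functions are replaced with ones that fail if called.
replayed = answer_question("why?", Recorder.loads(saved), must_not_run, must_not_run)
assert replayed == original
```

Routing each call through a named step is one possible design. It keeps the workflow code the same in both modes, so the replayed path is the recorded path rather than a fresh attempt.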
From there, debug one layer at a time:
- replay with the same retrieval results
- swap only the generation prompt
- compare parser behavior against the same raw response
- test a validation fix against the exact failure
- add the case to a regression suite
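As one example of working a single layer, the sketch below compares two parsers against the same recorded raw response and then pins the case as a regression test. The shape of `RECORDED_RUN`, the fenced-JSON failure, and both parser functions are assumptions made up for illustration.

```python
import json

# A captured run (hypothetical shape); the raw model response is kept verbatim.
RECORDED_RUN = {
    "input": "Extract the invoice total.",
    "retrieved_ids": ["doc-7", "doc-12"],
    "raw_response": '```json\n{"total": 41.5}\n```',
}


def parse_v1(raw: str) -> dict:
    """Original parser: assumes the model returns bare JSON."""
    return json.loads(raw)


def parse_v2(raw: str) -> dict:
    """Candidate fix: strip a surrounding Markdown code fence before parsing."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line, then everything from the closing fence on.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)


def replay_parser(parser, run: dict):
    """Run one parser against the recorded raw response; no model call is made."""
    try:
        return parser(run["raw_response"])
    except json.JSONDecodeError as exc:
        return exc


def test_recorded_failure_is_fixed():
    # The old parser still fails on the exact recorded response...
    assert isinstance(replay_parser(parse_v1, RECORDED_RUN), json.JSONDecodeError)
    # ...and the candidate fix handles it.
    assert replay_parser(parse_v2, RECORDED_RUN) == {"total": 41.5}


test_recorded_failure_is_fixed()
```

Because the raw response is held fixed, any difference between the two results comes from the parser alone. The same pattern applies to swapping a generation prompt while holding the recorded retrieval results fixed.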
Record/replay is especially useful in AI coding and agent work, where a bad final diff or decision may depend on several earlier model and tool choices. If you cannot reconstruct the path, review turns into archaeology.
Record/replay also changes the cost profile of debugging. You stop paying the full token bill for every investigation, and you stop pretending a live rerun is the same experiment.
Build replay before the critical incident. Once the weird run is gone, it is gone. A screenshot of the bad answer is not enough. You need the calls and state that produced it.
Nondeterministic systems need deterministic evidence. Capture enough to replay the failure before you try to fix it.
Related: AI evals and AI reliability.
Part of the Effective AI Engineering series.
Source: adapted from Mirascope’s “Record and Replay”, MIT licensed.