Skylar Payne

A trace is evidence. It is not understanding.

You can have thousands of AI call logs and still not know what broke. Prompt, response, latency, cost, retrieved documents, tool calls. All useful. None of it tells you whether the answer was wrong, why it was wrong, or whether you saw the same failure yesterday.

Here is the trap. Teams add tracing, feel responsible for about three days, then drown in their own trace table.

Add annotation.

During review, save the judgment in a shape the code can reuse:

  • pass/fail or quality score
  • failure type
  • expected behavior
  • severity
  • whether retrieval, generation, parsing, or UX caused the problem
  • whether this example should enter a regression set
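One way to picture that shape is a small record type. This is a minimal sketch, not a schema from any particular tracing tool: the `Annotation` class, the `Stage` enum, the field names, and the `trace-0001` id are all hypothetical, chosen to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Stage(str, Enum):
    """Where the problem came from (hypothetical buckets, taken from the list above)."""
    RETRIEVAL = "retrieval"
    GENERATION = "generation"
    PARSING = "parsing"
    UX = "ux"


@dataclass
class Annotation:
    """A reviewer's judgment about one traced AI call (illustrative field names)."""
    trace_id: str                  # links the judgment back to the raw trace
    passed: bool                   # pass/fail; a numeric score would also work
    failure_type: Optional[str]    # short, countable label; None when the call passed
    expected_behavior: str         # what the answer should have done
    severity: str                  # e.g. "low", "medium", "high"
    failed_stage: Optional[Stage]  # retrieval, generation, parsing, or UX
    add_to_regression_set: bool    # should this example be replayed later?


# Hypothetical example: the support bot that misread the refund policy.
example = Annotation(
    trace_id="trace-0001",
    passed=False,
    failure_type="policy_misread",
    expected_behavior="State only what the refund policy actually allows.",
    severity="high",
    failed_stage=Stage.GENERATION,
    add_to_regression_set=True,
)

# Plain JSON keeps the judgment easy to store next to the trace it describes.
record = json.dumps(asdict(example))
```

Keeping `failure_type` a short string rather than free text is the design choice that matters here: it is what makes the label countable later.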

With those fields saved, the pile of weird outputs starts turning into an eval dataset.

A support bot says the refund policy allows something it does not. The trace is interesting once. The annotation keeps paying rent: policy_misread, high_severity, retrieval_ok, generation_failed, add_to_regression_set.

With that label, the next engineer has somewhere to stand. They can count how often the failure happens, compare prompt versions against the same bucket, and decide whether the fix belongs in retrieval, a schema, a refusal path, or the product copy.
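Counting and comparing can be very plain code once the labels exist. The sketch below assumes annotations are stored as dictionaries with `prompt_version`, `failure_type`, and `passed` keys; those names, the helper functions, and the sample rows are made up for illustration and are not real results.

```python
from collections import Counter

# Hypothetical annotated traces; in practice these would come from your trace store.
annotations = [
    {"prompt_version": "v1", "failure_type": "policy_misread", "passed": False},
    {"prompt_version": "v1", "failure_type": None, "passed": True},
    {"prompt_version": "v1", "failure_type": "policy_misread", "passed": False},
    {"prompt_version": "v2", "failure_type": "policy_misread", "passed": False},
    {"prompt_version": "v2", "failure_type": None, "passed": True},
    {"prompt_version": "v2", "failure_type": None, "passed": True},
    {"prompt_version": "v2", "failure_type": "parse_error", "passed": False},
]


def failure_counts(rows):
    """Count how often each failure type appears among failed calls."""
    return Counter(r["failure_type"] for r in rows if not r["passed"])


def failure_rate_by_version(rows, failure_type):
    """Share of calls per prompt version that hit one failure bucket."""
    totals = Counter()
    hits = Counter()
    for r in rows:
        totals[r["prompt_version"]] += 1
        if r["failure_type"] == failure_type:
            hits[r["prompt_version"]] += 1
    return {version: hits[version] / totals[version] for version in totals}


counts = failure_counts(annotations)
rates = failure_rate_by_version(annotations, "policy_misread")
```

With rows like these, `counts` says which bucket is biggest and `rates` says whether a prompt change moved it, which is most of what the next engineer needs before deciding where the fix belongs.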

Evals stop being dashboard theater when the labels get this concrete. The best eval sets usually come from production traces someone bothered to label carefully. The user already found the edge case; do not lose it.

Use annotation with tracing. First make every AI call visible. Then annotate the calls that matter. Then promote the patterns into regression checks.
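The last step, promoting a pattern into a regression check, might look roughly like this. It is a sketch under assumptions: the case format, the `run_regression` helper, the `must_not_contain` rule, and the stand-in `fake_bot` are all hypothetical, and a substring check is a crude placeholder for whatever grading you actually trust.

```python
from typing import Callable

# Hypothetical cases promoted from annotations flagged add_to_regression_set.
regression_set = [
    {
        "input": "Can I get a refund after 60 days?",
        "must_not_contain": ["yes, you can"],  # wording the bad answer used
        "failure_type": "policy_misread",
    },
]


def run_regression(cases, generate: Callable[[str], str]):
    """Replay each promoted case and return the ones that fail again."""
    failures = []
    for case in cases:
        output = generate(case["input"]).lower()
        for phrase in case["must_not_contain"]:
            if phrase.lower() in output:
                failures.append((case["failure_type"], case["input"]))
                break  # one hit is enough to mark the case as failed
    return failures


def fake_bot(prompt: str) -> str:
    """Stand-in for the real model call; swap in your own client here."""
    return "Refunds are available within 30 days of purchase."


failures = run_regression(regression_set, fake_bot)
```

An empty `failures` list means the old mistake did not come back for this prompt version; anything else names the bucket that regressed.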

Logging captures incidents. Annotation makes the mess reusable.

Part of the Effective AI Engineering series.

Source: adapted from Mirascope’s “Don’t Just Log, Annotate!”, MIT licensed.