Skylar Payne

Testing an AI pipeline should not require bravery.

But plenty of first versions mix everything together: fetch live data, call the model, write to the database, send the email, notify Slack, close the ticket. Then someone asks for evals and the room gets quiet.

Nobody wants the test run to email a real customer.

The fear is a design smell. Useful computation has been mixed with irreversible side effects.

Split them.

Keep the path boring: collect or receive inputs, make the AI decision, decide whether the result can leave the sandbox, and perform side effects only after the checks pass.

For support, that might mean:

  1. load the ticket and customer context
  2. generate a proposed response
  3. validate tone, policy, and citations
  4. save the draft
  5. require human approval before sending
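The five steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: every name here (`Draft`, `generate`, `validate`, `propose`, `send_if_approved`) is hypothetical, the model is any callable that maps a prompt string to a reply string, and the validation rules are toy stand-ins for real tone, policy, and citation checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# All names in this sketch are hypothetical; they are not from any library.


@dataclass(frozen=True)
class Draft:
    ticket_id: str
    body: str
    problems: Tuple[str, ...]


def generate(ticket: Dict[str, str], model: Callable[[str], str]) -> str:
    """Step 2: build a prompt from already-loaded context and call the model."""
    prompt = f"Reply to {ticket['customer']} about: {ticket['text']}"
    return model(prompt)


def validate(body: str) -> Tuple[str, ...]:
    """Step 3: toy checks standing in for tone, policy, and citation rules."""
    problems = []
    if not body.strip():
        problems.append("empty response")
    if "guarantee" in body.lower():
        problems.append("policy: do not promise guarantees")
    return tuple(problems)


def propose(ticket: Dict[str, str], model: Callable[[str], str],
            drafts: Dict[str, Draft]) -> Draft:
    """Steps 2-4: generate, validate, save the draft. Nothing reaches the customer."""
    body = generate(ticket, model)
    draft = Draft(ticket["id"], body, validate(body))
    drafts[draft.ticket_id] = draft
    return draft


def send_if_approved(draft: Draft, approved_by: Optional[str],
                     send: Callable[[str, str], None]) -> bool:
    """Step 5: the only function that can reach the customer."""
    if draft.problems or not approved_by:
        return False
    send(draft.ticket_id, draft.body)
    return True


# Usage with a fake model and a fake sender, safe to run in an eval loop.
ticket = {"id": "T-1", "customer": "Ada", "text": "My invoice is wrong"}
drafts: Dict[str, Draft] = {}
sent = []
draft = propose(ticket, lambda prompt: "Sorry about that. A corrected invoice is on its way.", drafts)
was_sent = send_if_approved(draft, None, lambda tid, body: sent.append(tid))
```

In this sketch an eval can call `propose` as many times as it likes, because the sender is only ever passed to `send_if_approved`, and that function refuses to act without a named approver and a clean validation result.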

Then evals can run generation and validation as often as needed without touching the customer. Record/replay can reproduce failures. Reviewers can inspect drafts before anything ships. The approval gate guards the one risky step: sending.
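One way record/replay might look, assuming the model is a plain callable from prompt to text: wrap it so each output is stored under a hash of the prompt, then replay from that store without calling the model at all. The `RecordReplayModel` class and the in-memory `cassette` dict are hypothetical names for this sketch; a real setup would likely persist the store to disk and include model settings in the key.

```python
import hashlib
from typing import Callable, Dict, Optional


class RecordReplayModel:
    """Hypothetical wrapper: records outputs by prompt hash, or replays them."""

    def __init__(self, model: Optional[Callable[[str], str]],
                 cassette: Dict[str, str], replay: bool = False):
        self.model = model
        self.cassette = cassette
        self.replay = replay

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if self.replay:
            # Replay mode never calls the model; an unseen prompt raises KeyError.
            return self.cassette[key]
        output = self.model(prompt)
        self.cassette[key] = output
        return output


# Record once against a stand-in "live" model, then replay with no model at all.
cassette: Dict[str, str] = {}
recorder = RecordReplayModel(lambda prompt: prompt.upper(), cassette)
first = recorder("refund request")

replayer = RecordReplayModel(None, cassette, replay=True)
replayed = replayer("refund request")
```

Because the replayer holds no model, a failing case can be rerun through validation repeatedly with identical inputs, which is what makes the failure reproducible.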

Improvement becomes possible when eval runs stop feeling like small acts of courage. If every test mutates production, the team will stop testing. If every replay triggers webhooks, nobody will replay. If every model experiment can send an email, model experiments become political events.

Pure-enough pipelines keep the risky actions small and explicit.
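A possible way to keep risky actions "small and explicit" is to have the pipeline return descriptions of side effects as plain data, and let a single gated function perform them. The sketch below assumes hypothetical names throughout (`Effect`, `plan_effects`, `run_effects`, and the handler keys); it is one pattern among several, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Effect:
    kind: str      # hypothetical kinds, e.g. "send_email", "close_ticket"
    target: str
    payload: str


def plan_effects(ticket_id: str, reply: str) -> List[Effect]:
    """Pure: describes what should happen and performs nothing."""
    return [
        Effect("send_email", ticket_id, reply),
        Effect("close_ticket", ticket_id, ""),
    ]


def run_effects(effects: List[Effect],
                handlers: Dict[str, Callable[[Effect], None]],
                approved: bool) -> List[Effect]:
    """The gate: performs effects only when approved, returns what was performed."""
    if not approved:
        return []
    performed = []
    for effect in effects:
        handlers[effect.kind](effect)
        performed.append(effect)
    return performed


# Usage with fake handlers that only append to a log.
log: List[str] = []
handlers = {
    "send_email": lambda e: log.append(f"email:{e.target}"),
    "close_ticket": lambda e: log.append(f"close:{e.target}"),
}
planned = plan_effects("T-1", "A corrected invoice is on its way.")
skipped = run_effects(planned, handlers, approved=False)
```

Evals and reviewers can inspect `planned` directly, since it is only data. Swapping the fake handlers for real ones is then the single place where the pipeline touches the outside world.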

Put side effects behind a gate. Make the AI decision replayable before you let it touch the world.

Related: AI evals, AI reliability, and AI platforms.

Part of the Effective AI Engineering series.

Source: adapted from Mirascope’s “Pure Pipelines”, MIT licensed.