TIL: Make AI Features Boring to Change
A new feature only needs an LLM call.
It always starts off just that easy. A user asks a question, you search the docs, pass the docs to the model, parse the answer, and ship something that works.
Soon the LLM call by itself isn’t enough. You need RAG so the model has the right context. Then you need to parse the response into a format the rest of the backend understands.
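At this point the whole feature might still fit in a dozen lines. A minimal sketch of that first version, reusing the same assumed helpers (`search_docs`, `build_prompt`, an OpenAI `client`) that the snippets later in this post lean on:

```python
import json

def answer_question(query: str) -> dict:
    docs = search_docs(query)            # retrieval: give the model context
    prompt = build_prompt(query, docs)   # stuff the docs into the prompt

    # One model call, then parse the JSON the backend expects.
    response = client.responses.create(model="gpt-4.1", input=prompt)
    return json.loads(response.output_text)
```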
Still not too bad.
You roll up your sleeves, take a deep breath, and say: “Claude, implement this. Make no mistakes.”
You do a few validation passes. The answers look right. Easy peasy.
But then users show up, and the problems start.
Timeouts during onboarding. Product wants a warmer tone. Legal asks why one answer has no citations. Four percent of requests fail schema validation. OpenAI released a new model, and someone wants to test it before Friday.
Each change was reasonable on its own. That is what makes the resulting code so annoying. Nobody set out to build a mess. But now every change feels a little risky, and that’s the problem.
The feature is no longer boring to change.
```python
import json

def answer_question(user_id: str, query: str) -> dict:
    docs = search_docs(query)
    prompt = build_prompt(query, docs, tone="confident")

    # Timeout handling, inline.
    try:
        response = client.responses.create(
            model="gpt-4.1",
            input=prompt,
            timeout=12,
        )
    except TimeoutError:
        metrics.increment("ai.timeout")
        return {
            "text": "Try again in a minute.",
            "citations": [],
        }

    # Prompt logging, inline.
    log_prompt(
        user_id=user_id,
        prompt=prompt,
        raw=response.output_text,
    )

    # Schema parsing and repair, inline.
    try:
        answer = json.loads(response.output_text)
    except json.JSONDecodeError:
        answer = repair_json(response.output_text)

    # Citation patching, inline.
    if not answer.get("citations"):
        answer["citations"] = guess_citations(docs)

    return answer
```

The mess means there’s no locality.
Instead of making a change in one place, an experiment touches prompts, logs, eval fixtures, product copy, retrieval, schema validation, parser repair, and maybe the UI.
Basic debugging is harder than it should be:
- Which prompt version produced this answer?
- Did the fallback model run?
- Was this a retrieval miss or a generation miss?
- Which layer owns citation handling?
If you cannot answer those questions, or tell where a change belongs, the reliability work is spread across too many layers.
So give that work a home. Put the model call behind an interface, and keep the feature code focused on the user action.
```python
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential

class Answer(BaseModel):
    text: str
    citations: list[str]

@trace_ai_call(name="answer_question")
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(max=10),
)
def generate_answer(
    query: str,
    docs: list[Document],
) -> Answer:
    prompt = prompts.render(
        "answer_question",
        query=query,
        docs=docs,
    )
    return call_model(
        prompt,
        model="gpt-4.1",
        response_model=Answer,
        fallback_model="claude-sonnet-4",
    )

def answer_question(user_id: str, query: str) -> Answer:
    docs = search_docs(query)
    return generate_answer(query=query, docs=docs)
```

Now the product function says what the product does: search docs and ask for an answer.
The generation function owns the model-specific work: prompt rendering, model choice, tracing, retries, response validation, fallbacks, cost tags, and eval hooks.
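`call_model` is this post’s wrapper, not a library function. A minimal sketch of its fallback-and-validation half, assuming a hypothetical `complete(model, prompt)` that routes to the right provider and returns raw text:

```python
from pydantic import BaseModel, ValidationError

def call_model(
    prompt: str,
    model: str,
    response_model: type[BaseModel],
    fallback_model: str | None = None,
) -> BaseModel:
    # Try the primary model, then the fallback; validate every response
    # against the expected schema before anything reaches product code.
    candidates = [model] + ([fallback_model] if fallback_model else [])
    last_error: Exception | None = None
    for name in candidates:
        try:
            raw = complete(name, prompt)  # hypothetical provider router
            return response_model.model_validate_json(raw)
        except (TimeoutError, ValidationError) as exc:
            last_error = exc  # fall through to the next candidate
    raise last_error  # every candidate failed
```

The exact loop doesn’t matter. What matters is that schema repair and fallback logic have exactly one home.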
This gives you locality. Prompt changes go in one place. Model experiments go in one place. Retry policy, validation, fallback behavior, cost tags, and eval hooks all live next to the LLM call instead of leaking through the rest of the product.
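`trace_ai_call` is likewise assumed here, not a real decorator. A minimal stdlib-only sketch of the kind of record it might emit:

```python
import functools
import logging
import time

logger = logging.getLogger("ai.trace")

def trace_ai_call(name: str):
    # Hypothetical decorator: one structured log line per LLM call, so
    # "which call failed, and how long did it take?" has a single answer.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            outcome = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                logger.info(
                    "ai_call name=%s outcome=%s duration_s=%.2f",
                    name, outcome, time.monotonic() - start,
                )
        return wrapper
    return decorator
```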
The test is simple. When something breaks, can you answer the debugging questions from one trace and one module? When Product wants a tone experiment, do you know where it goes? When Legal asks about citations, do you know which layer owns the fix?
Models will keep being weird. Techniques to control models will keep evolving.
Put the churn behind an interface.
And let your next change be boring.