TIL: Make AI Features Boring to Change
A new feature only needs an LLM call.
It always starts off just that easy. A user asks a question, you search the docs, pass the docs to the model, parse the answer, and ship something that works.
Soon the LLM call by itself isn’t enough. You need RAG so the model has the right context. Then you need to parse the response into a format the rest of the backend understands.
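At this point the whole feature might still fit in a dozen lines. A minimal sketch of that first version, reusing the same assumed helpers (`search_docs`, `build_prompt`, an OpenAI `client`) that the snippets later in this post lean on:

```python
import json

def answer_question(query: str) -> dict:
    docs = search_docs(query)            # retrieval: give the model context
    prompt = build_prompt(query, docs)   # stuff the docs into the prompt

    # One model call, then parse the JSON the backend expects.
    response = client.responses.create(model="gpt-4.1", input=prompt)
    return json.loads(response.output_text)
```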
Still not too bad.
You roll up your sleeves, take a deep breath, and say: “Claude, implement this. Make no mistakes.”
You do a few validation passes. The answers look right. Easy peasy.
But then users show up, and the problems start.
Timeouts during onboarding. Product wants a warmer tone. Legal asks why one answer has no citations. Four percent of requests fail schema validation. OpenAI released a new model, and someone wants to test it before Friday.
Each change was reasonable on its own. That is what makes the resulting code so annoying. Nobody set out to build a mess. But now every change feels a little risky, and that’s the problem.
The feature is no longer boring to change.
```python
import json

def answer_question(user_id: str, query: str) -> dict:
    docs = search_docs(query)
    prompt = build_prompt(query, docs, tone="confident")

    # Timeout handling, inline.
    try:
        response = client.responses.create(
            model="gpt-4.1",
            input=prompt,
            timeout=12,
        )
    except TimeoutError:
        metrics.increment("ai.timeout")
        return {
            "text": "Try again in a minute.",
            "citations": [],
        }

    # Prompt logging, inline.
    log_prompt(
        user_id=user_id,
        prompt=prompt,
        raw=response.output_text,
    )

    # Schema parsing and repair, inline.
    try:
        answer = json.loads(response.output_text)
    except json.JSONDecodeError:
        answer = repair_json(response.output_text)

    # Citation patching, inline.
    if not answer.get("citations"):
        answer["citations"] = guess_citations(docs)

    return answer
```

The mess means there’s no locality.
Instead of making a change in one place, an experiment touches prompts, logs, eval fixtures, product copy, retrieval, schema validation, parser repair, and maybe the UI.
Basic debugging is harder than it should be:
- Which prompt version produced this answer?
- Did the fallback model run?
- Was this a retrieval miss or a generation miss?
- Which layer owns citation handling?
If you cannot answer those questions, or tell where a change belongs, the reliability work is spread across too many layers.
So give that work a home. Put the model call behind an interface, and keep the feature code focused on the user action.
```python
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential

class Answer(BaseModel):
    text: str
    citations: list[str]

@trace_ai_call(name="answer_question")
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(max=10),
)
def generate_answer(
    query: str,
    docs: list[Document],
) -> Answer:
    prompt = prompts.render(
        "answer_question",
        query=query,
        docs=docs,
    )
    return call_model(
        prompt,
        model="gpt-4.1",
        response_model=Answer,
        fallback_model="claude-sonnet-4",
    )

def answer_question(user_id: str, query: str) -> Answer:
    docs = search_docs(query)
    return generate_answer(query=query, docs=docs)
```

Now the product function says what the product does: search docs and ask for an answer.
The generation function owns the model-specific work: prompt rendering, model choice, tracing, retries, response validation, fallbacks, cost tags, and eval hooks.
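`call_model` is this post’s wrapper, not a library function. A minimal sketch of its fallback-and-validation half, assuming a hypothetical `complete(model, prompt)` that routes to the right provider and returns raw text:

```python
from pydantic import BaseModel, ValidationError

def call_model(
    prompt: str,
    model: str,
    response_model: type[BaseModel],
    fallback_model: str | None = None,
) -> BaseModel:
    # Try the primary model, then the fallback; validate every response
    # against the expected schema before anything reaches product code.
    candidates = [model] + ([fallback_model] if fallback_model else [])
    last_error: Exception | None = None
    for name in candidates:
        try:
            raw = complete(name, prompt)  # hypothetical provider router
            return response_model.model_validate_json(raw)
        except (TimeoutError, ValidationError) as exc:
            last_error = exc  # fall through to the next candidate
    raise last_error  # every candidate failed
```

The exact loop doesn’t matter. What matters is that schema repair and fallback logic have exactly one home.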
This gives you locality. Prompt changes go in one place. Model experiments go in one place. Retry policy, validation, fallback behavior, cost tags, and eval hooks all live next to the LLM call instead of leaking through the rest of the product.
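`trace_ai_call` is likewise assumed here, not a real decorator. A minimal stdlib-only sketch of the kind of record it might emit:

```python
import functools
import logging
import time

logger = logging.getLogger("ai.trace")

def trace_ai_call(name: str):
    # Hypothetical decorator: one structured log line per LLM call, so
    # "which call failed, and how long did it take?" has a single answer.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            outcome = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                logger.info(
                    "ai_call name=%s outcome=%s duration_s=%.2f",
                    name, outcome, time.monotonic() - start,
                )
        return wrapper
    return decorator
```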
The test is simple. When something breaks, can you answer the debugging questions from one trace and one module? When Product wants a tone experiment, do you know where it goes? When Legal asks about citations, do you know which layer owns the fix?
Models will keep being weird. Techniques to control models will keep evolving.
Put the churn behind an interface.
And let your next change be boring.