If DSPy is So Great, Why Isn't Anyone Using It?
DSPy Has the Best Ideas in AI Engineering. So Why Does Everyone Bounce Off It?
For a framework that promises to solve AI engineering's worst problems, the gap between the quality of its ideas and its adoption is brutal.
But teams that stick with DSPy report something consistent: their AI code becomes maintainable. They can swap models without rewriting everything. They can actually improve their systems instead of praying each prompt change doesn't break production.

So why aren't more people using it?
DSPy’s problem isn’t that it’s wrong. It’s that it’s hard. The abstractions are unfamiliar. The documentation assumes you already think in modules and optimizers. The learning curve is real.
But I keep watching the same thing happen:
Any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.
You’re going to build these patterns anyway. You’ll just do it worse, over six months, through pain.
The evolution of every AI system
Let me walk you through what actually happens. I’ve seen this play out dozens of times.
Stage 1: Ship it
You need to extract company names from text. You write this:
```python
from openai import OpenAI

client = OpenAI()

def extract_company(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Extract the company name from: {text}"}],
    )
    return response.choices[0].message.content
```

It works. You ship it. Life is good.
Stage 2: “Can we tweak the prompt without deploying?”
Product wants to iterate faster. Redeploying for every prompt change is annoying. So you store prompts in a database:
```python
from openai import OpenAI

from myapp.config import get_prompt

client = OpenAI()

def extract_company(text: str) -> str:
    prompt_template = get_prompt("extract_company")
    prompt = prompt_template.format(text=text)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Now you have a prompts table. And a little admin UI to edit it. And then you had to add version history, because someone broke prod last Tuesday.
Stage 3: “It keeps returning garbage formats”
Sometimes the model returns "Company: Acme Corp" instead of just "Acme Corp". So you add structured outputs:
```python
from openai import OpenAI
from pydantic import BaseModel

from myapp.config import get_prompt

client = OpenAI()

class CompanyExtraction(BaseModel):
    company_name: str
    confidence: float

def extract_company(text: str) -> CompanyExtraction:
    prompt_template = get_prompt("extract_company_v2")
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```

You now have typed inputs and outputs. A schema for what goes in and what comes out.
Stage 4: “We need to handle failures”
The API times out sometimes. Parsing fails occasionally. You add retries:
```python
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

from myapp.config import get_prompt

client = OpenAI()

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def extract_company(text: str) -> CompanyExtraction:
    prompt_template = get_prompt("extract_company_v2")
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```

Every LLM call now has retry logic wrapped around it.
Stage 5: “Now we need RAG”
Extraction alone isn’t enough. You need to search a knowledge base first:
```python
from openai import OpenAI

from myapp.config import get_prompt
# embed() and vector_db come from your retrieval stack, defined elsewhere

client = OpenAI()

def extract_company_with_context(text: str) -> CompanyExtraction:
    # Step 1: Retrieve relevant context
    query_embedding = embed(text)
    docs = vector_db.search(query_embedding, top_k=5)
    context = "\n".join([d.content for d in docs])

    # Step 2: Extract with context
    prompt_template = get_prompt("extract_company_with_rag")
    prompt = prompt_template.format(text=text, context=context)

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```

Two steps now. Different prompts for each. The retriever has its own config: top_k, similarity threshold, which embedding model. Your prompts table is getting crowded.
Stage 6: “How do we know if this is getting better?”
You’ve changed the prompt 12 times this month. Is it actually better? Worse? You build an eval harness:
```python
def evaluate(dataset: list[dict]) -> dict:
    results = []
    for example in dataset:
        prediction = extract_company_with_context(example["text"])
        results.append({
            "correct": prediction.company_name == example["expected"],
            "confidence": prediction.confidence,
        })
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }
```

This file grows to 400 lines. You need golden datasets somewhere. CI integration so evals run on PRs. Historical tracking so you can compare this week's prompt to last week's. Cost tracking, because someone asked how much the evals cost.
Stage 7: “Let’s try Claude instead… oh no”
Anthropic released something new. You want to test it.
But your code is full of openai.chat.completions.create calls. OpenAI and Anthropic have different APIs. Your structured output parsing doesn't transfer cleanly. The prompts that worked for GPT-4 perform differently on Claude.
You need to re-run all your evals, but first you need to refactor every call site to support multiple providers.
Everything is coupled. The model, the prompt, the parsing, the retry behavior. All tangled together.
Six months in, you finally sit down to untangle this:
```python
class LLMModule:
    def __init__(self, signature: type[BaseModel], prompt_key: str):
        self.signature = signature
        self.prompt_key = prompt_key

    def forward(self, **kwargs) -> BaseModel:
        prompt = get_prompt(self.prompt_key).format(**kwargs)
        return self._call_llm(prompt)

    def _call_llm(self, prompt: str) -> BaseModel:
        # Model-agnostic, with retries, parsing, validation
        ...

extract_company = LLMModule(
    signature=CompanyExtraction,
    prompt_key="extract_company_v3",
)

result = extract_company.forward(text="...")
```

You now have typed signatures, composable modules, swappable backends, centralized retry logic, and prompt management separated from application code.
You just spent six months building half of DSPy. And it’s still worse than DSPy.
What you built (whether you meant to or not)
DSPy didn’t invent anything new. It packaged patterns that every serious AI system ends up needing:
Signatures
Typed inputs and outputs. What goes in, what comes out, with a schema.
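You don't need a framework to get this. A typed contract in plain Python captures the same idea; here's a minimal sketch using stdlib dataclasses (the names are illustrative, not DSPy's API):

```python
from dataclasses import dataclass

# A "signature" is just a declared contract for one LLM task:
# the fields that go in and the fields that come out.
@dataclass
class ExtractCompanyInput:
    text: str

@dataclass
class ExtractCompanyOutput:
    company_name: str
    confidence: float

# Any runner can construct and validate against this contract,
# regardless of which model or prompt produced the values.
out = ExtractCompanyOutput(company_name="Acme Corp", confidence=0.92)
print(out.company_name)  # → Acme Corp
```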
Modules
Composable units you can chain, swap, and test independently.
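In plain Python the module pattern is roughly this (a sketch with made-up names, not DSPy's API): every unit exposes one entry point, so units chain, swap, and test in isolation:

```python
from typing import Callable

class Module:
    """A composable unit: one callable step with a single entry point."""
    def __init__(self, fn: Callable[[str], str]):
        self.fn = fn

    def forward(self, x: str) -> str:
        return self.fn(x)

class Pipeline(Module):
    """Chains modules; each one can be swapped or tested independently."""
    def __init__(self, *modules: Module):
        self.modules = modules

    def forward(self, x: str) -> str:
        for m in self.modules:
            x = m.forward(x)
        return x

# Toy steps standing in for LLM calls, retrievers, parsers, etc.
clean = Module(str.strip)
upper = Module(str.upper)
print(Pipeline(clean, upper).forward("  acme corp  "))  # → ACME CORP
```

In a real system each module would wrap an LLM call or a retriever, but the composition logic stays exactly this simple.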
Optimizers
Logic that improves prompts, separated from the logic that runs them.
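The core move is separating "what makes a prompt good" from "how prompts get run". A toy sketch of that separation, with a fake runner standing in for the LLM (illustrative only, not DSPy's optimizers):

```python
from typing import Callable

def optimize_prompt(
    candidates: list[str],
    run: Callable[[str, str], str],   # (prompt_template, input) -> output
    dataset: list[tuple[str, str]],   # (input, expected) pairs
) -> str:
    """Return the candidate template with the best exact-match accuracy.
    The runner knows nothing about optimization, and vice versa."""
    def score(template: str) -> float:
        hits = sum(run(template, x) == y for x, y in dataset)
        return hits / len(dataset)
    return max(candidates, key=score)

# Fake runner: the 'first word' template happens to extract correctly.
def fake_run(template: str, x: str) -> str:
    return x.split()[0] if "first word" in template else x

best = optimize_prompt(
    ["Echo: {text}", "Return the first word of: {text}"],
    fake_run,
    [("Acme announced earnings", "Acme")],
)
print(best)  # → Return the first word of: {text}
```

DSPy's optimizers are far more sophisticated, but they live behind the same boundary: a metric, a dataset, and a runner they treat as a black box.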
These are software engineering fundamentals. Separation of concerns. Composability. Declarative interfaces.
So why do experienced engineers forget this stuff when they start building with LLMs?
Why good engineers write bad AI code
Weird feedback loops
You can't step through a prompt. The output is probabilistic. When it finally works, you don't want to touch it.
Pressure to ship
Getting an LLM to work feels like an accomplishment. Clean architecture feels like a luxury for later.
Unclear boundaries
Where do you draw the boundaries? Your prompts are both code and data. Nothing is familiar.
So engineers do what works in the moment. Inline prompts. Copy-paste with tweaks. One-off solutions that become permanent.
Six months later? Drowning in accidental complexity.
DSPy forces you to think about these abstractions upfront. That's why the learning curve feels steep. The alternative is discovering the patterns through pain.
What you should actually do
Option 1: Use DSPy
Accept the learning curve. Read the docs. Build a few toy projects until the abstractions click. Then use it for real work.
Option 2: Steal the ideas
Don't use DSPy, but build with its patterns from day one. See below.
If you're stealing the ideas, build with the same three patterns: typed signatures for every LLM call, composable modules you can test in isolation, and evaluation and optimization kept separate from execution.
The point
DSPy has adoption problems because it asks you to think differently before you’ve felt the pain of thinking the same way everyone else does.
The patterns DSPy embodies aren’t optional. If your AI system gets complex enough, you will reinvent them. The only question is whether you do it deliberately or accidentally.
You don’t have to use DSPy. But you should build like someone who understands why it exists.
Want help building AI systems that don't turn into spaghetti?
I help teams design AI architectures that scale. Let's talk about your system.
Book a Free Consultation →