TIL: Evaluate RAG Retrieval Separately
A RAG answer can fail in at least two boring ways.
The retriever can bring back the wrong context. Or the model can ignore good context and produce a bad answer anyway.
If you only grade the final answer, both failures look the same: the user got garbage. Fine for product triage. Useless for engineering diagnosis.
So give the retriever its own eval, separate from the answer eval.
For each question in your test set, decide up front which documents or chunks should be retrieved. Then measure whether the retriever brings them back, before the generator gets involved. Track recall at k, chunks that dominate the results across unrelated questions, document types that never appear, and relevant material that ranked too low to fit in the context window.
Only then grade the generated answer.
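Here is a minimal sketch of that retrieval check, assuming each test question has hand-labeled chunk IDs and the retriever returns a ranked list of IDs. The names `recall_at_k`, `ranked_too_low`, `never_retrieved`, and the chunk IDs are hypothetical, made up for illustration rather than taken from any library.

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    """Fraction of the expected chunks that appear in the top k results."""
    if not expected_ids:
        # Questions with no expected chunks are "unanswerable" and need their own handling.
        raise ValueError("expected_ids is empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & expected_ids) / len(expected_ids)


def ranked_too_low(retrieved_ids: list[str], expected_ids: set[str], k: int) -> dict[str, int]:
    """Expected chunks that were retrieved, but at a 1-based rank beyond k."""
    return {
        chunk_id: rank
        for rank, chunk_id in enumerate(retrieved_ids, start=1)
        if rank > k and chunk_id in expected_ids
    }


def never_retrieved(retrieved_ids: list[str], expected_ids: set[str]) -> set[str]:
    """Expected chunks that did not come back at any rank."""
    return expected_ids - set(retrieved_ids)


# Hypothetical labeled case: two policy chunks should come back for this question.
expected = {"policy-pto-1", "policy-pto-2"}
ranked = ["faq-7", "policy-pto-1", "blog-3", "handbook-2", "policy-pto-2"]

print(recall_at_k(ranked, expected, k=3))     # 0.5
print(ranked_too_low(ranked, expected, k=3))  # {'policy-pto-2': 5}
print(never_retrieved(ranked, expected))      # set()
```

Here `k` stands in for however many chunks actually fit in the context window. A chunk that shows up in `ranked_too_low` points at ranking, while one in `never_retrieved` points further back at indexing or chunking.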
The debugging conversation gets sharper:
- If retrieval missed the policy page, fix indexing, chunking, ranking, query rewriting, or metadata.
- If retrieval found the policy page and the answer was still wrong, fix the prompt, citation behavior, structured output, or refusal path.
- If the question has no answer in the corpus, model that case explicitly (for example, with an “unanswerable” label in the eval set) instead of punishing the generator for refusing to invent one.
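The three cases above can be sketched as a small triage function. This is one possible arrangement, not a prescribed one: the `Failure` enum, the `triage` name, and the `answer_correct` and `refused` flags are all hypothetical, and the flags are assumed to come from a separate answer grader. An empty `expected_ids` is used here as the marker for a question the corpus cannot answer.

```python
from enum import Enum


class Failure(Enum):
    NONE = "none"
    RETRIEVAL = "retrieval"      # look at indexing, chunking, ranking, query rewriting, metadata
    GENERATION = "generation"    # look at prompt, citations, structured output, refusal path
    ANSWERED_UNANSWERABLE = "answered_unanswerable"


def triage(
    expected_ids: set[str],
    retrieved_ids: list[str],
    answer_correct: bool,
    refused: bool,
    k: int,
) -> Failure:
    """Decide which stage to investigate for a single eval case."""
    if not expected_ids:
        # Labeled unanswerable: refusing is the correct behavior, not a failure.
        return Failure.NONE if refused else Failure.ANSWERED_UNANSWERABLE
    if not expected_ids <= set(retrieved_ids[:k]):
        # Some expected evidence never reached the generator, so check retrieval
        # first, even if the answer happened to come out right.
        return Failure.RETRIEVAL
    return Failure.NONE if answer_correct else Failure.GENERATION


print(triage({"policy"}, ["faq", "blog"], answer_correct=False, refused=False, k=2))
# Failure.RETRIEVAL
print(triage({"policy"}, ["policy", "blog"], answer_correct=False, refused=False, k=2))
# Failure.GENERATION
```

Checking retrieval before looking at the answer is a deliberate choice in this sketch: a correct answer produced without the expected evidence is treated as a retrieval problem that happened to go unnoticed.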
Teams get into trouble when they treat “RAG quality” as one number. RAG has steps. Each step needs its own check.
Once real users arrive, a few bad answers can send the team in circles: new model, longer prompt, more documents, different chunk size, another reranker. Some of those changes help. Some just move the bug.
A retrieval eval gives you the first fork in the road.
Before arguing with the generator, prove the right evidence made it into the room.
Part of the Effective AI Engineering series.
Source: adapted from Mirascope’s “Isolate & Evaluate Your RAG Retriever”, MIT licensed.