RAG that actually works: evaluation before vibes

Most retrieval-augmented systems fail not at retrieval, but at proving they're right. Here's the engineering discipline that turns a flaky RAG demo into something your customers can trust.

DerbaTech Engineering · 12 min read

Retrieval-augmented generation is the most over-demoed and under-engineered pattern in applied AI today. A capable developer can wire up an embedding model, a vector database and a chat endpoint in an afternoon, paste in a few documents, and produce something that looks magical in a Friday demo. The same system, three weeks later, confidently tells a customer about a refund policy that does not exist. The gap between those two moments is not a model problem. It is an engineering problem, and it is almost entirely about measurement.

At DerbaTech we have rebuilt enough RAG systems — for fintech support desks, healthcare knowledge bases and internal developer assistants — to have a strong opinion about why they fail and what separates the ones that survive contact with real users from the ones quietly switched off after launch. This article is that opinion, written for engineering leaders who need RAG to be reliable, not just impressive.

Why RAG demos lie

A demo is a curated environment. The questions are ones the builder already knows the system can answer, the documents are clean, and nobody is adversarial. Production is the opposite: users ask questions in ways you never anticipated, your corpus is messy and contradictory, and the cost of a wrong answer is real. The demo optimises for the best case; production is judged on the worst case.

When a RAG system gives a wrong answer, there are only a few possible causes, and naming them precisely is the first step to fixing them. Either the relevant information was never retrieved, or it was retrieved but drowned out by irrelevant context, or it was retrieved and the model ignored or misread it, or the information simply is not in your corpus and the model invented something to fill the silence. Each of these has a different fix, and you cannot apply the right fix until you can measure which one is happening.
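
One way to make that triage mechanical, for an answer already judged wrong, is sketched below. It is a rough heuristic rather than a definitive classifier: the `FailureMode` names, the `diagnose` function and the `noise_threshold` cut-off are all assumptions introduced for illustration, and it presumes each evaluation question is labelled with the documents that should have been retrieved.

```python
from enum import Enum


class FailureMode(Enum):
    NOT_IN_CORPUS = "answer is not in the corpus; the model invented one"
    NOT_RETRIEVED = "relevant document was never retrieved"
    DROWNED_OUT = "relevant document retrieved but buried in irrelevant context"
    IGNORED_OR_MISREAD = "relevant document retrieved cleanly but ignored or misread"


def diagnose(gold_ids: set[str], retrieved_ids: list[str],
             noise_threshold: float = 0.5) -> FailureMode:
    """Attribute a wrong answer to one of the four causes.

    gold_ids: documents that can answer the question (empty if none exist).
    retrieved_ids: what the retriever actually put in the context.
    noise_threshold: hypothetical cut-off; below this share of relevant
    documents in the context we blame noise rather than the model.
    """
    if not gold_ids:
        return FailureMode.NOT_IN_CORPUS
    hits = [doc_id for doc_id in retrieved_ids if doc_id in gold_ids]
    if not hits:
        return FailureMode.NOT_RETRIEVED
    precision = len(hits) / len(retrieved_ids)
    if precision < noise_threshold:
        return FailureMode.DROWNED_OUT
    return FailureMode.IGNORED_OR_MISREAD
```

Counting how many failures land in each bucket tells you whether to spend the next week on the corpus, the retriever, the re-ranker or the prompt.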

If you can't measure faithfulness, you're not shipping RAG. You're shipping a confident guess with a citation-shaped decoration.

Retrieval is the system, not the model

The single most common mistake teams make is treating retrieval as a solved commodity — "just embed and search" — and pouring all their attention into prompt engineering. In practice, the quality of a RAG answer is bounded by the quality of what you retrieve. No prompt can rescue an answer when the right document never made it into the context window. So the first place to invest engineering effort is the retrieval pipeline, and the first thing to build is a way to measure it independently of the language model.

Retrieval quality has two dimensions worth measuring separately. Recall asks: of all the documents that could answer this question, how many did we surface? Precision asks: of the documents we surfaced, how many were actually relevant? High recall with low precision floods the context with noise; high precision with low recall leaves the model starved of the fact it needs. You want both, and you cannot improve what you do not measure on a fixed set of representative queries.
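
Both numbers are simple to compute once each evaluation query is labelled with its relevant documents. The sketch below assumes document IDs are strings and measures both at a cut-off `k`; the function names are ours, not those of any particular library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0  # undefined without labels; treated as 0 in this sketch
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


# Illustrative labels for a single query.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}
print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, k=4))  # 2 of 4 results relevant
```

Averaging these over the fixed query set gives the retrieval numbers to track independently of the language model.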

Chunking is a design decision, not a default

How you split documents into retrievable units quietly determines your ceiling. Chunk too small and you fragment the context a fact depends on; chunk too large and you dilute the embedding so it matches everything weakly and nothing strongly. The right strategy is domain-specific: legal contracts, API documentation, support tickets and clinical notes all want different boundaries. Structure-aware chunking that respects headings, tables and logical sections almost always beats naive fixed-size windows — but the only way to know is to evaluate retrieval recall across chunking strategies on your own data.
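
To make the contrast concrete, here is a minimal sketch of the two approaches. It assumes documents use Markdown-style `#` headings, which will not be true of every corpus; `fixed_size_chunks`, `heading_chunks` and `max_chars` are illustrative names, and a real pipeline would also need to handle tables and very long paragraphs.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def heading_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split at headings; pack paragraphs when a section is too long."""
    chunks: list[str] = []
    # Zero-width split just before every line that starts with a heading.
    for section in re.split(r"(?m)^(?=#{1,6} )", text):
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        first, _, rest = section.partition("\n")
        heading, body = (first, rest) if first.startswith("#") else ("", section)
        current = heading
        for para in body.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if current != heading and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = heading  # repeat the heading so context survives
            current = f"{current}\n\n{para}" if current else para
        if current != heading:
            chunks.append(current)
    return chunks
```

Running both over the same corpus and comparing retrieval recall on the evaluation set is the experiment that settles the question for your data.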

Hybrid retrieval beats pure vector search

Dense vector search is excellent at capturing semantic similarity but notoriously weak on exact matches (product codes, error numbers, names, acronyms), which are precisely the tokens that matter most in enterprise queries. Sparse keyword search (BM25 and its relatives) is the mirror image: strong on exact tokens, weak on paraphrase and synonyms. Combining the two and fusing their rankings consistently outperforms either alone on real corpora. It is one of the highest-return changes you can make, and it is cheap.

Fusing the ranked lists from vector and keyword search is itself a design decision. Reciprocal Rank Fusion (RRF) is a widely used default: it combines ranks rather than scores, so the two channels need not be calibrated against each other. On top of hybrid retrieval, a cross-encoder re-ranking model lifts precision substantially, because it scores each candidate jointly with the full query instead of comparing independently computed embeddings as the first-pass bi-encoder does. Rerankers used in production include Cohere Rerank and BGE-Reranker-v2-M3, both of which slot easily into a two-stage pipeline. The pattern is to retrieve broadly with a fast method, then re-rank narrowly with a slower, more accurate one, which gives you both recall and precision without paying to run the expensive model over your whole corpus.
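
RRF itself fits in a few lines. The sketch below assumes each channel returns an ordered list of document IDs, best first; the smoothing constant `k = 60` is the value commonly used with RRF, and the tie-break on document ID is our own addition to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists using ranks only, never raw scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # illustrative dense-search ranking
keyword_hits = ["c", "a", "d"]  # illustrative BM25 ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that both channels rank highly rise to the top, while a document found by only one channel still survives into the candidate set passed to the re-ranker.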

A RAG pipeline: documents are ingested and chunked into a hybrid vector and keyword index. A user query drives hybrid retrieval, re-ranking and grounded generation to produce a cited answer. An evaluation harness scores faithfulness, recall and quality, and feeds tuning back into retrieval and generation.
The RAG pipeline, instrumented. The evaluation loop is not an afterthought — it is what turns the system from static into improvable.

Faithfulness is a number, not a feeling

Once retrieval is solid, the question becomes whether the generated answer is actually supported by what was retrieved. This property, called faithfulness or groundedness, is the one most teams never measure, and it is the one that destroys trust fastest. An answer can be fluent, well formatted and apparently well cited, and still assert something the sources never said.

Faithfulness can and should be scored automatically. The most practical approach uses a strong model as a judge: given the retrieved context and the generated answer, it decides whether each claim in the answer is entailed by the context. Run this over a fixed evaluation set on every change and you have a number that moves up or down with each modification to your prompt, retrieval or model. A change that improves fluency but lowers faithfulness is a regression, and the harness should treat it as one.
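
A minimal sketch of claim-level scoring follows. The judge is passed in as a function so that the harness does not depend on any particular model; `toy_judge` is a deliberately crude stand-in (word containment) used only so the example runs, and in practice it would be replaced by a call to a strong model asked whether the context entails the claim. Splitting claims on sentence boundaries is also a simplifying assumption.

```python
import re
from typing import Callable

Judge = Callable[[str, str], bool]  # (claim, context) -> is the claim entailed?


def split_claims(answer: str) -> list[str]:
    """Approximate claims as sentences; real systems may decompose further."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def faithfulness(answer: str, context: str, judge: Judge) -> float:
    """Fraction of claims in the answer that the judge finds supported."""
    claims = split_claims(answer)
    if not claims:
        return 1.0  # an empty answer makes no unsupported claims
    return sum(1 for claim in claims if judge(claim, context)) / len(claims)


def toy_judge(claim: str, context: str) -> bool:
    """Stand-in for an LLM judge: every word of the claim occurs in context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    return all(word in context_words for word in re.findall(r"\w+", claim.lower()))


context = "Refunds are issued within 5 days."
answer = "Refunds are issued within 5 days. Refunds are issued within 30 days."
print(faithfulness(answer, context, toy_judge))  # one of two claims supported
```

Averaged over the evaluation set, this is the number the harness compares before and after each change.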

Alongside faithfulness, score answer relevance (did it actually address the question?) and, where you have reference answers, correctness. These three numbers — retrieval recall, faithfulness and answer relevance — form a dashboard that tells you not just whether the system is good, but why it is good or bad, which is the only way to improve it methodically.

Building an evaluation set you trust

The evaluation set is the most valuable artefact in a serious RAG project, and it is worth building by hand. Start with real questions: pull them from support logs, sales calls, internal Slack channels — wherever your users actually ask things. Add adversarial cases deliberately: questions whose answer is not in the corpus (the system should decline, not invent), ambiguous questions, and questions that require combining multiple documents. For each, record the ideal answer and the documents that should have been retrieved.

A hundred carefully chosen examples beat ten thousand synthetic ones. The set should be small enough to run cheaply on every change and rich enough to expose the failure modes you actually care about. Treat it like a test suite, because that is exactly what it is. As you discover new failures in production, add them to the set — this is how the system gets monotonically better instead of oscillating.
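
Treating the set as a test suite can be as literal as the sketch below: one record per case, and a gate that fails the build when any tracked metric drops. The `EvalCase` fields, the `regressions` helper and the `tolerance` value are assumptions for illustration, and the metric values shown are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    ideal_answer: str
    # Documents that should be retrieved; empty means "not in corpus".
    expected_doc_ids: frozenset[str] = frozenset()
    tags: tuple[str, ...] = ()  # e.g. ("adversarial", "multi-document")

    @property
    def should_decline(self) -> bool:
        """With no supporting document, the right behaviour is to decline."""
        return not self.expected_doc_ids


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.01) -> list[str]:
    """Names of metrics that fell by more than `tolerance` versus baseline."""
    return sorted(
        name for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    )


# Illustrative numbers only: fluency-motivated change that hurt faithfulness.
baseline = {"recall": 0.82, "faithfulness": 0.91, "relevance": 0.88}
candidate = {"recall": 0.85, "faithfulness": 0.86, "relevance": 0.88}
print(regressions(baseline, candidate))
```

A non-empty result blocks the change, which is how an improvement in one area is stopped from silently regressing another.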

Citations earn trust — and force honesty

Inline citations are not a cosmetic feature. They change the contract between the system and the user from "trust me" to "verify me," and that shift is what makes enterprise stakeholders comfortable putting AI in front of customers. A support agent who can click a citation and confirm the answer in two seconds will use the system; one who has to take it on faith will not.

Citations also impose discipline on the system itself. If you require the model to cite a source for every claim, and you verify those citations actually support the claim, you have closed the loop between generation and retrieval. Claims that cannot be grounded should be refused, not fabricated. The willingness to say "I don't have information on that" is, counterintuitively, one of the strongest signals of a trustworthy system.
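
Verifying citations can be sketched as a post-generation check. The example assumes the model is prompted to end each sentence's claim with markers such as `[kb-1]`, which is a convention of this sketch and not a standard; `supports` is a pluggable check (for instance the same model-as-judge used for faithfulness), shown here with a trivial substring stand-in.

```python
import re
from typing import Callable


def check_citations(answer: str, sources: dict[str, str],
                    supports: Callable[[str, str], bool]) -> list[tuple[str, str]]:
    """Return (claim, problem) pairs for uncited or unsupported sentences."""
    problems: list[tuple[str, str]] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = re.findall(r"\[([^\]]+)\]", sentence)
        claim = re.sub(r"\s*\[[^\]]+\]", "", sentence)  # text without markers
        if not cited:
            problems.append((claim, "no citation"))
            continue
        for source_id in cited:
            if source_id not in sources:
                problems.append((claim, f"unknown source {source_id}"))
            elif not supports(claim, sources[source_id]):
                problems.append((claim, f"not supported by {source_id}"))
    return problems


sources = {
    "kb-1": "Refunds take 5 days from approval.",
    "kb-2": "Gift cards are final sale.",
}
answer = "Refunds take 5 days [kb-1]. Gift cards are refundable [kb-2]. Shipping is free."
naive_supports = lambda claim, source: claim.rstrip(".").lower() in source.lower()
for claim, problem in check_citations(answer, sources, naive_supports):
    print(f"{problem}: {claim}")
```

Any sentence that comes back with a problem is a candidate for removal or for replacing the answer with a refusal.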

The production concerns nobody demos

A RAG system that works in a notebook still has to survive production, where a different set of properties matters. These rarely show up in demos and almost always show up in incident reviews.

  • Freshness: when a source document changes, how quickly does the answer change? Stale answers erode trust as fast as wrong ones. Your ingestion pipeline needs a clear, monitored update path.
  • Access control: two users asking the same question should get answers grounded only in documents they are each allowed to see. Enforce this at retrieval time, not as a post-hoc filter — it is a security boundary, not a UX preference.
  • Latency: hybrid retrieval plus re-ranking plus generation adds up. Budget it, measure the tail (p95, p99), and use caching for repeated queries so common questions return instantly.
  • Cost: every query may touch an embedding model, a re-ranker and a generation model. At scale this is a real line item. Caching, smaller models for easy queries, and good retrieval (so you send less context) all bend the curve.
  • Observability: log retrievals, scores and answers so that when something goes wrong, you can replay exactly what the system saw and decided.
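
The access-control point is the one most often implemented the wrong way round, so here is a minimal in-memory sketch of filtering before ranking. The `Chunk` type, the group-based permission model and the word-overlap scorer are all assumptions for illustration; a real vector store would apply the same restriction as a metadata filter inside the search call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


def retrieve(query: str, index: list[Chunk], user_groups: set[str],
             score: Callable[[str, str], float], top_k: int = 5) -> list[Chunk]:
    """Rank only the chunks this user is permitted to see."""
    # Filter first: forbidden text never competes for top_k slots and
    # never reaches the generation step.
    visible = [c for c in index if c.allowed_groups & user_groups]
    ranked = sorted(visible, key=lambda c: score(query, c.text), reverse=True)
    return ranked[:top_k]


def overlap(query: str, text: str) -> float:
    """Toy scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


index = [
    Chunk("handbook", "refund policy for customers", frozenset({"support", "finance"})),
    Chunk("ledger", "refund totals by customer account", frozenset({"finance"})),
]
print([c.doc_id for c in retrieve("refund policy", index, {"support"}, overlap)])
```

Filtering after ranking would be weaker on both counts: restricted chunks could crowd permitted ones out of the top results, and a bug in the later filter would leak their text into the prompt.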

For teams operating under Indian and global data-protection expectations, access control and data residency are not optional polish. A RAG system over internal documents is a query engine over potentially sensitive data, and it must be designed as one — with isolation, audit trails, and the ability to run inside your own cloud where required.

Common pitfalls we see repeatedly

  1. Optimising the prompt before fixing retrieval. If the right document is not in context, no prompt will save you. Measure retrieval first.
  2. No evaluation set, so every change is argued rather than measured, and improvements in one area silently regress another.
  3. Treating chunking as a default rather than a tuned, evaluated decision specific to the document type.
  4. Pure vector search with no keyword channel, which fails exactly on the codes and names enterprise users search for.
  5. Letting the model answer when it should decline, because nobody measured the "not in corpus" case.
  6. Shipping without citations, then being unable to explain to a compliance team why any given answer should be trusted.

Query understanding: the step before retrieval

There is a stage most RAG tutorials skip entirely, and it is one of the highest-leverage places to invest once retrieval and evaluation are in place: understanding the query before you search with it. Users do not phrase questions the way documents are written. They use shorthand, omit context that lives in the previous turn of a conversation, misspell product names, and bundle two questions into one sentence. Embedding the raw query and hoping for the best leaves a great deal of recall on the table.

Query transformation closes that gap. The techniques are simple to describe and consistently effective in practice, and each can be evaluated independently against your retrieval recall metric so you adopt only what helps.

  • Query rewriting: use a small, fast model to rewrite a terse or conversational query into a fuller, self-contained search query before retrieval — especially important in multi-turn chat, where the real question depends on earlier turns.
  • Query expansion: generate a handful of paraphrases or related queries, retrieve for each, and fuse the results — this lifts recall on queries where the user's wording differs from the document's.
  • Decomposition: split a compound question into its parts, retrieve for each, and assemble the evidence before generating — the only reliable way to answer questions that span multiple documents.
  • Hypothetical document embeddings: generate a hypothetical answer and embed that for retrieval, since a hypothetical answer often sits closer in vector space to the real source than the question does.
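
As one example, query expansion with fused results can be sketched as follows. Both `expand` (which would normally call a small model to produce paraphrases) and `search` (your existing retriever, returning document IDs best first) are passed in as functions and are assumptions of this sketch; the fusion step is rank-based, in the style of Reciprocal Rank Fusion.

```python
from typing import Callable


def expanded_search(query: str,
                    expand: Callable[[str], list[str]],
                    search: Callable[[str], list[str]],
                    top_k: int = 5, k: int = 60) -> list[str]:
    """Retrieve for the query and its paraphrases, then fuse by rank."""
    scores: dict[str, float] = {}
    for variant in [query, *expand(query)]:
        for rank, doc_id in enumerate(search(variant), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:top_k]


# Canned stand-ins so the example runs without a model or an index.
paraphrases = {"pwd reset": ["how do I reset my password", "password recovery steps"]}
results = {
    "pwd reset": ["d2"],
    "how do I reset my password": ["d1", "d2"],
    "password recovery steps": ["d1", "d3"],
}
print(expanded_search("pwd reset",
                      expand=lambda q: paraphrases.get(q, []),
                      search=lambda q: results.get(q, [])))
```

Here the terse original query alone would have missed `d1` entirely, which is the kind of recall gain the evaluation set should confirm before the technique is kept.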

None of these is expensive, and each is measurable. The discipline is the same as everywhere else in this article: add the technique, run the evaluation set, keep it if recall improves and drop it if it does not. Query understanding is where a competent RAG system becomes a genuinely good one, because it fixes the failures that happen before the model ever sees a document.

Beyond naive retrieval: structure and multi-hop

The simplest RAG — embed chunks, retrieve top-k, stuff into context — handles a surprising amount, but it hits a ceiling on two kinds of question that show up constantly in real products: questions that require combining information from several places, and questions over structured or semi-structured data.

Multi-hop questions — "which of our enterprise customers in the western region renewed after a support escalation last quarter?" — cannot be answered by retrieving a single chunk, because no single chunk contains the answer. These need either decomposition (break the question into hops, retrieve and reason step by step) or a knowledge graph that encodes the relationships explicitly so they can be traversed. Knowing which of your questions are multi-hop, and how many, is itself a useful exercise: if they are common, naive top-k retrieval will quietly fail on them and your evaluation set must include them to catch it.

Structured data is the other ceiling. A great deal of valuable information lives in tables, databases and spreadsheets, not prose, and embedding a table as if it were a paragraph loses the structure that gives it meaning. The right pattern is often hybrid: let the model translate a natural-language question into a query against the structured source, and reserve semantic retrieval for the genuinely unstructured content. A RAG system that knows when to retrieve text and when to query a table is far more capable than one that treats everything as a blob of prose.
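
A rough sketch of that routing decision is below, using SQLite from the Python standard library as the structured source. The `is_structured` classifier and the `to_sql` translator would normally be model calls and are stubbed here; the table and its rows are invented for the example, and the `SELECT`-only check is a minimal guard for illustration, not a substitute for a read-only connection and proper permissions.

```python
import sqlite3
from typing import Callable


def route(question: str,
          is_structured: Callable[[str], bool],
          to_sql: Callable[[str], str],
          search_text: Callable[[str], list[str]],
          conn: sqlite3.Connection) -> tuple[str, list]:
    """Send a question to the table or to text retrieval, not both."""
    if is_structured(question):
        sql = to_sql(question)
        if not sql.lstrip().lower().startswith("select"):
            raise ValueError("refusing to run a non-SELECT statement")
        return "table", conn.execute(sql).fetchall()
    return "text", search_text(question)


# Invented data standing in for a real structured source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE renewals (customer TEXT, region TEXT)")
conn.executemany("INSERT INTO renewals VALUES (?, ?)",
                 [("Acme", "west"), ("Globex", "west"), ("Initech", "east")])

kind, rows = route(
    "How many renewals were in the west?",
    is_structured=lambda q: "how many" in q.lower(),
    to_sql=lambda q: "SELECT COUNT(*) FROM renewals WHERE region = 'west'",
    search_text=lambda q: ["chunk-1"],
    conn=conn,
)
print(kind, rows)
```

The rows returned from the table then go into the generation prompt in place of retrieved prose, so counts and aggregates come from the database instead of from the model's reading of an embedded table.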

The lesson is not that you need all of this on day one. It is that you should know which of these patterns your real questions demand, build the simplest thing that handles them, and let your evaluation set — populated with the hard, real questions — tell you when naive retrieval has run out of road.

What good looks like

A RAG system you can trust in production has a few unmistakable characteristics. It is built around an evaluation harness that runs on every change and reports faithfulness, recall and relevance as numbers. It uses hybrid retrieval and re-ranking, with chunking tuned to the document type. It cites its sources and verifies those citations. It declines gracefully when the answer is not in the corpus. It enforces access control at retrieval time and updates promptly when sources change. And crucially, it improves over time, because every production failure becomes a new test case rather than a recurring embarrassment.

None of this is exotic. It is ordinary engineering discipline applied to a probabilistic system — the same discipline that separates software you can deploy from a script that works on the author's machine. The teams that internalise this ship RAG that earns its place in the product. The teams that chase the demo keep rebuilding the same fragile thing.

If you are building or rescuing a RAG system and want it to hold up under real traffic and real scrutiny, this is the kind of work we do every day. Start with the evaluation harness, be honest about the numbers, and let the measurements — not the vibes — tell you where to invest. That is RAG that actually works.
