Building an LLM evaluation harness your team will actually trust
You cannot improve what you cannot measure, and you cannot ship what you cannot trust. A practical guide to evaluation harnesses that turn LLM development from guesswork into engineering.

Ask most teams how they know their LLM feature is working and you will hear some variation of "we tried a few prompts and it looked good." That is not evaluation; that is hope with extra steps. It works until the day you change a prompt, swap a model to save cost, or upgrade to a new version — and discover, usually from a customer, that something you were not watching quietly broke. The antidote is an evaluation harness: a repeatable, automated way to measure whether your system is getting better or worse, run on every change.
Evaluation is the single highest-leverage investment in any serious LLM project, and it is the one most teams skip because it feels like overhead. This article lays out how we build harnesses that engineering teams actually trust — trust enough to gate deployments on, which is the real test of whether an evaluation is worth anything.
Why traditional testing isn't enough
Conventional software tests assert exact outputs: given this input, expect exactly this output. LLMs break that model. The same prompt can produce different wordings that are equally correct, or subtly different wordings where one is right and one is dangerously wrong. You cannot assert string equality on a paragraph of generated text. This is why teams reach for "it looked good" — the tooling instinct they have does not fit the problem.
The shift required is from binary pass/fail on exact strings to graded scores on properties. Instead of asking "is the output exactly X?", you ask "does the output satisfy the properties we care about, and how well?" Those properties might be correctness, faithfulness to a source, adherence to a format, tone, safety, or task completion. Each becomes a measurable dimension, and the harness aggregates them into a scorecard you can track over time.
The anatomy of a harness
Every effective evaluation harness has the same four parts: a dataset of representative inputs, the candidate system under test, a set of graders that score outputs, and a gate that decides whether a change is allowed to ship. Get these four right and the rest is detail.
1. The dataset is everything
The evaluation dataset is the most valuable and most under-invested artefact in LLM work. It should be built from real usage wherever possible — actual user questions, actual documents, actual edge cases that caused problems. Synthetic data has its place for coverage, but a harness built only on synthetic examples measures how well your system handles imaginary inputs, which is not the same as the inputs it will actually face.
A good dataset deliberately includes adversarial and edge cases: inputs designed to trip the system, questions with no valid answer, ambiguous requests, and the long tail of weird-but-real cases. These are where systems fail, and a dataset that omits them gives you a comforting, useless score. Curate it by hand, version it like code, and grow it every time production surprises you.
2. Choosing the right graders
Graders fall into three tiers, and a mature harness uses all three because each catches what the others miss.
- Heuristic graders are cheap, deterministic checks: does the output parse as valid JSON, match a regex, fall within a length bound, contain a required field, or avoid a forbidden term? Use these liberally — they are fast, free, and catch a surprising share of failures.
- LLM-as-judge graders use a strong model with a clear rubric to score subjective properties: is this answer faithful to the source, helpful, on-topic, appropriately toned? They scale far better than humans and, when the rubric is well-designed, correlate strongly with human judgement.
- Human review is the gold standard and the most expensive. You cannot review everything, so sample — especially the cases where automated graders disagree or score near the threshold. Human labels also calibrate your LLM judge.
3. Making LLM-as-judge reliable
Using a model to grade a model sounds circular, and done carelessly it is. The trick is to make the judge's task far easier than the original task. Generating a good answer is hard; deciding whether a given answer is supported by a given source is much easier, and models are reliable at the easier task. Constrain the judge with an explicit rubric, ask for a structured verdict rather than a vibe, and where possible give it the reference material to check against rather than relying on its own knowledge.
Calibrate the judge against human labels on a sample. If the judge and your human reviewers agree most of the time, you can trust it to scale; if they diverge, fix the rubric before you rely on the number. A judge you have not calibrated is a random number generator with good manners. One operational discipline that matters more than it sounds: pin the judge model to a specific dated version snapshot (e.g. a versioned API alias rather than "latest") — judge model upgrades shift score distributions in ways that silently break trend comparisons across eval runs, making it impossible to tell whether a change in score reflects your system improving or the judge changing its mind.
Make the grader's job easier than the generator's job. Verifying a claim against a source is reliable; asking a model for a vague quality score is not.
4. The gate turns scores into decisions
Scores that nobody acts on are decoration. The final piece is a gate wired into your deployment pipeline: a change ships only if it does not regress the metrics that matter. Define thresholds per metric, run the harness automatically on every candidate prompt, model or retrieval change, and block the ones that regress. This is what makes evaluation real — it stops being a report you glance at and becomes a guardrail you cannot accidentally bypass.
Metrics that mean something
Generic metrics borrowed from academic benchmarks rarely map to what your product needs. Define metrics in terms of your task. For a support assistant: resolution rate, faithfulness to the knowledge base, escalation appropriateness. For a coding assistant: does the generated code run, pass tests, and follow conventions? For a summariser: coverage of key points and absence of fabrication. The best metric is one your product manager and your engineer both agree reflects success.
Track each metric as a distribution, not just an average. An average hides the tail, and the tail is where trust dies — the one-in-fifty answer that is confidently wrong matters more than the forty-nine that are fine. Watch the worst cases as deliberately as the mean.
Offline and online evaluation
Everything above is offline evaluation: run before deployment, on a fixed dataset, fast enough to gate releases. It is necessary but not sufficient, because no offline dataset perfectly predicts live behaviour. The complement is online evaluation: measuring the system on real traffic in production.
Online signals include implicit feedback (did the user accept the answer, retry, or escalate to a human?), explicit feedback (thumbs up/down), and sampled production traffic run through your offline graders. The crucial discipline is to feed online failures back into the offline dataset, so that a problem discovered in production becomes a permanent test case. This loop is what makes the system improve monotonically instead of fixing one bug while reintroducing another.
Avoiding the traps
Evaluation harnesses fail in predictable ways, and knowing them in advance saves months.
- 1Overfitting to the eval set: if you tune endlessly against the same examples, you optimise for the test, not the task. Keep a held-out set you tune against less often, and refresh examples regularly.
- 2An uncalibrated judge: trusting LLM-as-judge scores without ever comparing them to human labels. The number feels objective and may be meaningless.
- 3Averages that hide tails: a 95% mean score can still mean one in twenty users gets a harmful answer. Inspect the failures, not just the summary.
- 4A static dataset: production drifts, user behaviour changes, and a dataset frozen at launch slowly stops measuring reality. Grow it continuously.
- 5Measuring what is easy instead of what matters: latency and token count are easy to measure and rarely the point. Measure task success even when it is hard.
Different systems need different evaluations
There is no universal evaluation; there is only evaluation that fits your task. The graders and metrics that matter for a retrieval system are different from those for an agent, which are different again from those for a generative feature. A harness that measures the wrong things produces a comforting number and no insight. Worth being concrete about how the three common cases differ.
Evaluating retrieval and RAG
For a RAG system, evaluation splits cleanly into two halves that should be measured separately: did retrieval find the right context, and did generation use it faithfully? Conflating them hides which half is broken. Measure retrieval with recall and precision against a set of questions for which you know the correct source documents. Measure generation with faithfulness — is every claim supported by the retrieved context — and answer relevance. When the overall answer is wrong, these separate numbers tell you immediately whether to fix the retriever or the generator, which is the difference between a targeted fix and a week of guessing.
Evaluating agents
Agents are harder to evaluate because success is a trajectory, not a single output. An agent can reach the right answer through a reckless path that happened to work, or fail a task despite reasoning well. Evaluate both the outcome (did it complete the task correctly?) and the process (did it call the right tools, stay within policy, avoid unnecessary or risky actions?). Trajectory evaluation — scoring the sequence of steps, not just the final state — catches the agent that succeeds by luck today and fails catastrophically tomorrow. This is also where your trace store earns its keep, because you cannot evaluate a trajectory you did not record.
Evaluating open-ended generation
For summarisation, drafting and other open-ended generation, there is rarely one correct output, which is exactly why string-matching fails and why this is the home territory of LLM-as-judge and human review. Define the properties that matter — coverage of key points, absence of fabrication, appropriate length and tone — and grade against those. Where you have reference outputs, pairwise comparison (is output A better than output B?) is often more reliable than absolute scoring, because models and humans alike are better at ranking two things than at assigning a calibrated score to one.
The economics of grading
Evaluation is not free, and pretending otherwise leads to harnesses that are too expensive to run often, which means they don't get run, which defeats the purpose. Every LLM-as-judge call costs tokens; every human review costs time. A harness that costs a fortune to execute will be run quarterly instead of on every change, and a quarterly evaluation is barely an evaluation at all.
The way out is tiering, matching the cost of the grader to the value of the signal. Run cheap heuristic checks on every example on every change — they are nearly free and catch a large share of regressions. Run the more expensive LLM-as-judge graders on every change but over a curated subset sized to be affordable. Reserve human review for periodic deep audits and for calibrating the judge. This tiered approach keeps the fast feedback loop fast and cheap while preserving the depth you need, and it is the practical answer to the objection that real evaluation is too costly to do continuously.
The eval set as a living asset
The most valuable evaluation sets are not written once; they accumulate. Every production incident, every customer complaint, every surprising failure should become a permanent test case. Over time the set becomes an institutional memory of every way your system has ever been wrong — and the guarantee that none of those failures silently returns. This is how a system gets monotonically better instead of fixing one bug while quietly reintroducing another.
Treat the evaluation set with the same care as production code: version it, review changes to it, and understand that it encodes your definition of quality. When stakeholders disagree about whether the system is good enough, the conversation should be about the evaluation set and its thresholds, not about anecdotes. A shared, concrete definition of success — captured in examples and metrics — is one of the most powerful alignment tools a team can have, because it turns subjective arguments into objective ones.
There is a governance dimension too. For organisations operating under regulatory scrutiny, a documented, versioned evaluation process is increasingly part of demonstrating that an AI system is fit for purpose and was tested before deployment. The harness that makes your engineers confident is the same artefact that makes your auditors comfortable — another reason the unglamorous work pays off twice.
What it buys you
A team with a trusted evaluation harness moves differently. They upgrade models the day a better one ships, because the harness tells them in an hour whether it is actually better for their task rather than just better on someone else's benchmark. They refactor prompts fearlessly, because regressions are caught automatically. They negotiate with stakeholders using numbers instead of anecdotes. And they sleep at night, because the thing that would have embarrassed them in front of a customer was blocked at the gate.
There is a cultural shift that comes with all of this, and it may matter more than any technique. A team with a trusted harness argues less and measures more. Disagreements about whether an answer is good enough stop being battles of opinion and seniority and become questions you can settle by looking at the scorecard. New engineers can change the system safely on their first week, because the gate protects them from shipping a regression they did not know they were causing. Product and engineering share one definition of success, written down in examples. This is, quietly, one of the largest organisational benefits of evaluation: it replaces politics with evidence, and evidence scales in a way that authority never does.
It also changes the relationship with the underlying models. The frontier moves monthly; new models, new versions and new providers appear constantly. Without a harness, every model change is a leap of faith and a fresh round of manual spot-checking. With one, a model upgrade is a one-hour experiment: point the harness at the new model, read the scorecard, and adopt it only if it wins on your task. The teams that move fastest in this space are not the ones with the cleverest prompts; they are the ones whose evaluation is good enough that they can try anything cheaply and keep only what measurably helps.
This is the difference between treating LLMs as a science experiment and treating them as production software. The harness is unglamorous — nobody demos their evaluation pipeline — but it is the foundation that everything reliable is built on. Build it first, calibrate it honestly, gate on it ruthlessly, and grow it forever. Your future self, staring at a model upgrade with a deadline, will be grateful you did.
Put this into production
This is the kind of work we do every day. Explore the related service, or tell us what you're building.


