ModelsStrategy

Fine-tune or RAG? A decision framework that saves months

Teams reach for fine-tuning when they need retrieval, and for retrieval when they need fine-tuning. A clear framework for telling the two apart — and knowing when you need both.

DerbaTech Engineering10 December 202512 min read

Fine-tune or RAG? A decision framework that saves months — cover

"Should we fine-tune our own model?" is one of the most common questions we hear from teams starting serious AI work, and the honest answer is usually "probably not yet, and possibly not ever." Fine-tuning has a gravitational pull on engineering teams — it feels like the real, sophisticated thing to do, the move that turns a generic API into a proprietary asset. But most of the time the problem people are trying to solve with fine-tuning is actually a retrieval problem, and reaching for the wrong tool costs months and a substantial budget before the mismatch becomes obvious.

The confusion is understandable, because fine-tuning and retrieval-augmented generation both make a model "know" things it didn't before. But they do fundamentally different jobs, and the distinction between them is the most useful thing you can internalise before committing to an architecture. This article gives you the framework we use to make that call.

Two different problems

Here is the distinction that resolves most of the confusion. Retrieval changes what the model knows. Fine-tuning changes how the model behaves. If your problem is that the model lacks facts — your documents, your policies, your latest data — that is a knowledge problem, and retrieval is the answer. If your problem is that the model knows enough but won't reliably produce the format, tone, or task-specific behaviour you need, that is a behaviour problem, and fine-tuning is the answer.

Most business problems people bring to us are knowledge problems wearing a behaviour-problem costume. "The model doesn't know our products" is not a reason to fine-tune; it is a reason to build retrieval over your product catalogue. Fine-tuning facts into a model's weights is the worst of both worlds: expensive to do, impossible to update without redoing it, and unverifiable because the knowledge is baked into billions of parameters rather than sitting in a document you can point to.

A decision flow: starting from 'what's the gap?', if knowledge or facts are missing, use retrieval (RAG); if behaviour, format or latency is the issue, fine-tune or distill; otherwise start with prompting and evals. Both paths can lead to a hybrid that retrieves facts and fine-tunes the style. — The decision in one picture. Most teams should start at the bottom — prompting and evals — and only move outward when evidence demands it.

When retrieval is the answer

Use retrieval when the gap is knowledge. The signs are clear: the model gives correct-sounding answers that are factually wrong about your specific domain, or it lacks information that exists in your documents, or your knowledge changes frequently enough that baking it into a model would mean constant retraining.

Retrieval has decisive advantages for knowledge problems. It updates instantly — change the source document and the next answer reflects it, with no retraining. It keeps knowledge auditable, because every answer can cite the document it came from, which matters enormously for trust and compliance. And it is far cheaper to build and maintain than a fine-tuning pipeline. For the large majority of "make the model know our stuff" problems, retrieval is not just the better tool; it is the obviously better tool.

When fine-tuning is the answer

Use fine-tuning when the gap is behaviour, and prompting has genuinely failed to close it. The legitimate cases are real and worth doing well.

Format and structure: you need the model to produce output in a precise, consistent shape that prompting alone does not reliably enforce — a specialised schema, a domain-specific notation, a rigid style.
Tone and voice: a consistent brand or domain voice that few-shot examples approximate but never quite nail across the full range of inputs.
Narrow task specialisation: a well-defined, repetitive task where a smaller fine-tuned model matches a large general one at a fraction of the cost and latency.
Latency and cost at scale: distilling the behaviour of a large model into a smaller one you can serve cheaply, once the task is stable and well-understood.
Teaching a skill, not a fact: a reasoning pattern or a way of approaching problems that is hard to convey in a prompt but learnable from many examples.

Notice what these have in common: none of them is about knowledge. They are all about how the model acts. When you genuinely have a behaviour problem and you have the labelled examples to teach the desired behaviour, fine-tuning earns its keep — often spectacularly, in the form of a small, fast, cheap model that does one job better than a giant general model.

Most "we need a custom model" problems are actually "we need better retrieval" problems. Diagnose the gap before you choose the tool.

Start with prompting and evals

Before either retrieval or fine-tuning, there is a step teams skip at their peril: serious prompt engineering with a real evaluation harness. Modern frontier models are extraordinarily capable when prompted well, and a great many problems that teams assume require fine-tuning dissolve under good prompting, few-shot examples, and clear instructions. The only way to know is to measure — which is why an evaluation set comes first.

Prompting is the cheapest experiment you can run. If prompting plus retrieval gets you to your quality bar, you are done, with a system that is cheap, flexible and easy to maintain. Only when you have measured prompting to its ceiling and found it genuinely insufficient for a behaviour problem should you reach for fine-tuning. Treating fine-tuning as the last resort rather than the first instinct saves more projects than any other single piece of advice we give.

When you need both

The most sophisticated systems often combine the two, because real problems frequently have both a knowledge gap and a behaviour gap. The pattern is clean once you see it: use retrieval to supply the facts, and fine-tune to shape how those facts are used. A fine-tuned model that has learned your domain's reasoning style and output format, fed grounded facts by a retrieval pipeline, can outperform either approach alone.

But hybrid is an endpoint, not a starting point. You arrive at it by first solving the knowledge problem with retrieval, confirming that a behaviour gap remains, and then fine-tuning specifically to close that residual gap — with the retrieval already in place during fine-tuning so the model learns to use retrieved context well. Jumping straight to a hybrid system before understanding which gap is which is how teams build complicated machines that solve a problem they never diagnosed.

The economics matter

The decision is not only technical; it is economic, and the costs differ in kind, not just degree.

1Retrieval costs are ongoing and operational: embedding, storage, and the inference to retrieve and generate. They scale with usage and are predictable.
2Fine-tuning costs are front-loaded and recurring per update: data collection and labelling (often the largest hidden cost), training compute, evaluation, and the engineering to maintain the pipeline. Every time your requirements change, you may pay again.
3The hidden cost of fine-tuning is data. A fine-tune is only as good as its training examples, and assembling a high-quality labelled dataset is usually the hard, expensive, slow part — far more than the training run itself.
4Distillation flips the economics in your favour once a task is stable: a small fine-tuned model can be dramatically cheaper to serve at scale than calling a frontier API for every request.

For many teams, especially those scaling in cost-sensitive markets, the right long-term architecture is retrieval plus prompting on a capable model for flexibility, with targeted distillation into smaller models for the high-volume, stable tasks where serving cost dominates. This captures flexibility where you need it and efficiency where it pays.

A checklist for the decision

When a team asks us whether to fine-tune, we walk through a short diagnostic that almost always clarifies the answer.

Is the problem that the model lacks facts? If yes, you need retrieval, not fine-tuning.
Does the needed knowledge change over time? If yes, retrieval — baking changing facts into weights guarantees staleness.
Do you need answers to be auditable and citable? If yes, retrieval, because fine-tuned knowledge cannot cite a source.
Have you genuinely exhausted prompting and few-shot examples, measured against an eval set? If no, do that first.
Is the remaining gap about behaviour, format, tone or latency? If yes, fine-tuning is now on the table.
Do you have, or can you affordably build, a high-quality labelled dataset for the behaviour? If no, fine-tuning will disappoint regardless of intent.

What fine-tuning actually involves

Teams that decide to fine-tune often underestimate what the project actually entails, because the training run — the part that sounds technical and hard — is the easy part. The hard parts are everything around it, and being honest about them up front prevents the most common form of disappointment.

The first reality is data. A fine-tune is only as good as its training examples, and assembling a few thousand high-quality, correctly labelled examples that represent the behaviour you want is usually the largest cost and the longest pole in the project. The data has to be consistent — contradictory examples teach the model to be inconsistent — and it has to cover the range of inputs the model will actually see. Many fine-tuning efforts fail not because the technique was wrong but because the dataset was small, noisy, or unrepresentative.

The second reality is method. Full fine-tuning, which updates all of a model's weights, is expensive and rarely necessary. Parameter-efficient methods such as LoRA keep the base model's weights frozen during training while learning compact low-rank adapter matrices that capture the target behaviour. Once trained, those adapters can be merged back into the base weights, producing a single model artifact with no inference-time overhead — you get the efficiency of parameter-efficient training and the serving simplicity of a standard model. These methods achieve most of the benefit of full fine-tuning at a fraction of the cost and are the right default for most behaviour-shaping tasks. Choosing the base model matters too: a smaller open model fine-tuned for a narrow task often beats a larger general model on that task while costing far less to serve.

Data collection and labelling — usually the dominant cost, and the one that determines success.
Method selection — parameter-efficient fine-tuning (LoRA and relatives) before full fine-tuning, almost always.
Base model selection — the smallest model that can learn the behaviour, for cheaper serving.
Iteration — fine-tuning is rarely one-and-done; you train, evaluate, find gaps, augment the data, and repeat.

The hybrid, in practice

When a problem genuinely has both a knowledge gap and a behaviour gap, the hybrid is worth building carefully rather than accidentally. The most robust pattern fine-tunes a model specifically to work well with retrieved context — teaching it not facts, but how to use retrieved evidence: how to ground its answers, how to cite, how to handle conflicting or missing sources, and how to produce your required output format from whatever the retriever provides.

This is a meaningfully different objective from fine-tuning facts into weights. You are teaching a skill — disciplined use of evidence — that generalises across documents, rather than memorising specific documents that will go stale. The retrieval layer keeps the knowledge current and auditable; the fine-tune makes the model reliably good at turning that knowledge into the answer you want. Done in that order, with retrieval first and fine-tuning to close the residual behaviour gap, the hybrid delivers what neither approach manages alone. Done as a leap straight to a complicated system, it usually delivers a maintenance burden and an unclear win.

Evaluating the decision after you make it

Whichever path you choose, the decision is not validated by intuition but by measurement — which is why an evaluation harness is a prerequisite, not a follow-up. Before fine-tuning, you should have a number for how well prompting plus retrieval performs, so that after fine-tuning you can say precisely whether it was worth it. Too many teams fine-tune, feel that the result is better, and never confirm it against a held-out set — only to discover later that the gain was marginal or imaginary and the maintenance cost was not.

The same harness also guards against fine-tuning's particular hazard: a fine-tune that improves the target behaviour while quietly degrading general capability or introducing new failure modes. A model tuned hard for one format can become brittle outside it. Evaluate across the full range of inputs, not just the cases you tuned for, and keep the comparison against the simpler prompting-plus-retrieval baseline visible. If the fine-tune does not beat the baseline by a margin that justifies its cost and rigidity, the right engineering decision is to ship the baseline — and the only way to know is to have measured both.

The bottom line

Fine-tuning is a powerful, legitimate tool that is reached for too early and too often, usually to solve a problem it is the wrong tool for. Start by diagnosing the gap: knowledge or behaviour. Solve knowledge gaps with retrieval, which is cheaper, updatable and auditable. Solve behaviour gaps with fine-tuning — but only after prompting has been measured to its limit, and only when you have the data to do it well. Combine them when, and only when, a measured behaviour gap remains after retrieval is in place.

It is worth naming why the wrong instinct is so common, because understanding the pull helps resist it. Fine-tuning feels like ownership. Calling someone else's API to retrieve your own documents feels like renting; training your own model feels like building an asset. That emotional framing leads teams to over-value fine-tuning and under-value retrieval, even when retrieval is plainly the better engineering choice. The reframe that helps: your proprietary advantage is your data and how well you use it, not whether that data lives in model weights or in a retrieval index. A great retrieval system over proprietary data is every bit as much a moat as a fine-tune — and a more flexible, auditable, maintainable one.

Get this decision right and you save months of building the wrong thing. Get it wrong and you spend a quarter assembling a training pipeline for a problem a good retrieval system would have solved in a fortnight. The framework is simple, the discipline is in applying it honestly: diagnose before you build, measure before you commit, and reach for the expensive, rigid option only when the cheap, flexible one has demonstrably run out of road.

When teams bring us this decision, we almost always start the same way: build the evaluation harness, establish the prompting-plus-retrieval baseline, and only then have a grounded conversation about whether fine-tuning is warranted. More often than not, retrieval and good prompting clear the bar, and the team keeps a system that is cheaper and easier to live with. When fine-tuning genuinely is the answer, the same baseline proves it was worth the investment. Either way, the decision is made with evidence rather than instinct — which is the whole point of the framework.

Put this into production

This is the kind of work we do every day. Explore the related service, or tell us what you're building.

RAG & Knowledge Systems Custom ML & Model Development Start a project