The real economics of LLM inference: cost without compromise

An AI feature that delights in a pilot can quietly become unaffordable at scale. These are the levers that cut inference cost by up to an order of magnitude without cutting the quality your users notice.

DerbaTech Engineering · 12 min read

Plenty of AI products die not because they don't work, but because they don't work at a price anyone can sustain. A feature that costs a few rupees per interaction is delightful in a pilot with a hundred users and ruinous with a million. Inference cost is the line item that quietly decides whether an AI product has a business model, and it is the one teams think about last — usually when the first full-scale invoice arrives and the mood in the room changes.

The good news is that inference cost is one of the most optimisable parts of an AI system. It is routine to cut cost by five to ten times without users noticing any drop in quality, because most systems leave easy savings on the table. This article is a tour of the levers, ordered roughly by return on effort, so you can think about cost as a design parameter from the start rather than a crisis at scale.

Understand what you're paying for

Before optimising, know the cost structure. LLM inference is priced, directly or indirectly, on tokens: the input tokens you send (your prompt, context and retrieved data) and the output tokens the model generates. Output tokens typically cost several times more than input tokens, and generation is the slower, more expensive operation: output tokens are produced one at a time, each needing its own forward pass through the model, whereas the input tokens are processed together in parallel.

This asymmetry has immediate implications. Bloated prompts (stuffing the entire knowledge base into context "just in case") are pure waste, paid on every single call. Verbose outputs cost more than concise ones and are slower. The first optimisation is simply not paying for tokens you don't need: tight prompts, well-targeted retrieval so you send only relevant context, and output constraints that stop the model from rambling. These changes cost little to implement and often reclaim around a third of the bill.
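To make the asymmetry concrete, here is a minimal sketch of per-request cost. The prices are invented for illustration (output billed at four times the input rate) and are not any provider's real rates; `TokenPrices` and `request_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Illustrative prices per million tokens; not any provider's real rates."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, prices: TokenPrices) -> float:
    """Cost of one call: input and output tokens are billed at different rates."""
    return (
        input_tokens * prices.input_per_million
        + output_tokens * prices.output_per_million
    ) / 1_000_000


# Assumed prices: output costs four times as much as input.
PRICES = TokenPrices(input_per_million=1.0, output_per_million=4.0)

# The same feature before and after trimming the prompt and capping the output.
bloated = request_cost(input_tokens=6_000, output_tokens=800, prices=PRICES)
trimmed = request_cost(input_tokens=1_500, output_tokens=300, prices=PRICES)
saving = 1 - trimmed / bloated  # fraction of the per-request bill removed
```

With these assumed numbers the trimmed request costs well under half of the bloated one. The exact saving on a real system depends entirely on how much of the prompt and output was unnecessary to begin with.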

An inference cost pipeline: a request first hits a semantic cache, returning instantly on a cache hit. On a miss, a model router sends simple queries to a small or distilled model and hard ones to a frontier model. Both flow into optimised serving — continuous batching, INT8 or FP8 quantization, vLLM or TGI — before returning a response.
The cost levers in sequence. Each stage either avoids work (cache, routing) or makes the unavoidable work cheaper (batching, quantization).

Lever 1 — Don't compute what you can cache

The cheapest inference is the one you never run. A large fraction of real-world queries are repeats or near-repeats: the same questions, the same documents, the same prompts. A semantic cache stores previous query-response pairs and returns the stored answer when a new query is similar enough, skipping the model entirely. On systems with repetitive traffic — support assistants, internal tools — caching can absorb a substantial share of requests at near-zero marginal cost and near-zero latency.

Caching needs care: define similarity thresholds so you don't serve a stale or subtly wrong cached answer, respect personalisation and access control so one user's cached answer never leaks to another, and set sensible invalidation when underlying data changes. Done well, it is the highest-return lever available, because it removes work entirely rather than merely making it cheaper.
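A minimal sketch of the idea, under loud assumptions: `embed` here is a toy bag-of-words stand-in for a real embedding model, the `SemanticCache` class and its threshold are hypothetical, and lookup is a linear scan where a production system would likely use a vector index. It does show the three cautions above: a similarity threshold, per-user scoping, and invalidation.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        # Entries are partitioned by scope so answers never cross users or tenants.
        self._entries: dict[str, list[tuple[Counter, str]]] = {}

    def get(self, scope: str, query: str) -> Optional[str]:
        """Return the best cached answer in this scope, or None below the threshold."""
        query_vec = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries.get(scope, []):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, scope: str, query: str, answer: str) -> None:
        self._entries.setdefault(scope, []).append((embed(query), answer))

    def invalidate(self, scope: str) -> None:
        """Drop a scope's entries, for example when its underlying data changes."""
        self._entries.pop(scope, None)
```

The threshold is the setting to tune against your evaluation set: too low and users get answers to questions they did not ask, too high and the cache rarely hits.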

Lever 2 — Route to the right-sized model

Not every query needs your most capable, most expensive model. Many are simple — a classification, a short factual lookup, a routine transformation — and a smaller or distilled model handles them perfectly at a fraction of the cost and latency. A model router inspects each request and sends it to the cheapest model that can handle it, reserving the frontier model for the genuinely hard cases.

The economics here are dramatic because query difficulty is usually very skewed: a small minority of requests are hard, and the rest are easy. If routing sends the easy majority to a model that costs a tenth as much, the blended cost collapses even though your hardest queries still get the best model. The engineering challenge is classifying difficulty reliably and cheaply — and, as always, measuring with an evaluation harness that the routed-down answers are still good enough.
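The blended-cost arithmetic is simple enough to sketch. The function name, the 85 percent easy share and the relative prices are assumptions chosen for illustration; costs are expressed relative to one frontier-model call.

```python
def blended_cost(
    easy_share: float,
    small_cost: float,
    frontier_cost: float,
    router_cost: float = 0.0,
) -> float:
    """Average cost per request when the easy share goes to the small model.

    router_cost is the per-request cost of classifying difficulty, paid on every call.
    """
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return router_cost + easy_share * small_cost + (1.0 - easy_share) * frontier_cost


# Assumed: 85% of traffic is easy, and the small model costs a tenth as much.
unrouted = blended_cost(easy_share=0.0, small_cost=0.1, frontier_cost=1.0)
routed = blended_cost(easy_share=0.85, small_cost=0.1, frontier_cost=1.0)
reduction_factor = unrouted / routed
```

Under these assumptions the blended cost falls to roughly a quarter of the unrouted cost. The sketch ignores misrouted queries that have to be escalated and answered twice, which a real estimate should include.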

Lever 3 — Make the unavoidable cheaper with quantization

For the inference you do run — especially on open models you host yourself — quantization reduces the precision of the model's weights from 16-bit floats to 8-bit integers or 8-bit floats, and sometimes lower. This shrinks memory footprint and increases throughput substantially, letting you serve more requests per GPU. Modern quantization techniques preserve quality remarkably well; the difference is often imperceptible on real tasks, while the cost reduction is very real.

Beyond 8-bit, 4-bit weight quantization (methods such as GPTQ, AWQ and NF4, or the 4-bit variants packaged in GGUF files) is widely used for serving open models. Relative to a 16-bit baseline it cuts weight memory roughly fourfold, about twice the saving of 8-bit; throughput gains are less predictable and depend on the serving engine's kernels, the batch size and whether the workload is bound by memory bandwidth. On long-running services the smaller footprint makes a meaningful difference to hardware costs. The caveat is real: quality impact at 4-bit is more pronounced than at 8-bit and must be validated per model and per task. What is imperceptible for a summarisation workload may be unacceptable for a precision-critical extraction task.

Quantization is not free of trade-offs — aggressive low-bit quantization can degrade quality on harder tasks — which is exactly why you evaluate. Run the quantized model through the same harness as the full-precision one, confirm the quality metrics hold on your task, and adopt the most aggressive quantization that passes. This is the disciplined version of "cheaper without compromise": you measure the compromise and only accept it when there isn't one that matters.
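Two small calculations capture this section: how much weight memory each precision needs, and how to pick the lowest precision that still passes evaluation. Both functions are hypothetical helpers, the scores are made-up harness results, and the memory estimate covers weights only (it ignores the KV cache, activations and the small overhead of quantization scales).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def most_aggressive_passing(
    scores: dict[int, float], baseline_bits: int, max_drop: float
) -> int:
    """Lowest bit width whose quality score is within max_drop of the baseline."""
    baseline = scores[baseline_bits]
    passing = [bits for bits, score in scores.items() if baseline - score <= max_drop]
    return min(passing)  # the baseline itself always passes, so this is never empty


# A 7-billion-parameter model at three precisions.
memory = {bits: weight_memory_gb(7e9, bits) for bits in (16, 8, 4)}

# Made-up evaluation scores keyed by bit width; a real harness would produce these.
scores = {16: 0.91, 8: 0.905, 4: 0.86}
chosen_bits = most_aggressive_passing(scores, baseline_bits=16, max_drop=0.01)
```

Here the 7B model needs about 14 GB, 7 GB and 3.5 GB of weight memory at 16, 8 and 4 bits, and with the assumed scores the 8-bit variant is the most aggressive one that stays within a one-point drop.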

Lever 4 — Batching and serving efficiency

If you serve your own models, how you serve them matters enormously. Naive serving processes one request at a time and leaves the GPU idle between them. Continuous batching — packing multiple requests through the model together and dynamically adding new ones as others finish — keeps the expensive hardware busy and can multiply throughput several times over. Purpose-built serving engines implement this and other optimisations like paged attention out of the box.

The practical advice is to use a serving stack designed for LLM inference rather than rolling your own. The throughput difference between a well-configured modern server and a basic implementation is large enough to change your hardware budget outright. This is infrastructure work, but it is well-trodden infrastructure work, and the return is immediate utilisation of capacity you are already paying for.
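To see why continuous batching helps, here is a toy simulation, not a description of how any particular serving engine schedules work. It assumes all requests are queued at the start, each request needs one decode step per output token, and a step takes the same time however many slots are busy.

```python
import heapq


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Fixed batches: each batch runs until its longest request has finished."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """A finished request frees its slot immediately for the next queued request."""
    finish_times: list[int] = []  # min-heap of the steps at which busy slots free up
    for length in lengths:
        if len(finish_times) < slots:
            start = 0  # a slot is still empty at the beginning
        else:
            start = heapq.heappop(finish_times)  # wait for the earliest slot to free
        heapq.heappush(finish_times, start + length)
    return max(finish_times, default=0)


# Output lengths in tokens for eight queued requests, served with four slots.
lengths = [10, 200, 15, 20, 180, 12, 25, 30]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
```

For this mix the static schedule needs 380 steps and the continuous one 200, because short requests no longer wait for the longest member of their batch. The gain depends on how varied the output lengths are; with uniform lengths the two schedules take the same number of steps.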

Lever 5 — Distillation for high-volume tasks

When a task is stable, well-defined and high-volume, distillation is the heavy artillery. You use a large model to generate training data, then train a much smaller model to replicate its behaviour on that specific task. The result is a small, fast, cheap model that matches the big one where it matters, for the one job you have distilled. For a high-traffic, narrow task, serving a distilled model instead of calling a frontier API for every request can cut the cost of that task by an order of magnitude.

Distillation is an investment with a payback period: it costs engineering effort and data generation up front, and it pays off only when volume is high enough and the task stable enough to amortise that cost. The decision is economic — compute the break-even point — but for the high-volume core of a mature product, it is frequently the single largest saving available.
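The break-even calculation can be sketched in a few lines. All figures are placeholders, and the function names are hypothetical; the upfront cost should include engineering time and data generation as well as training compute.

```python
import math


def break_even_requests(
    upfront_cost: float,
    frontier_cost_per_request: float,
    distilled_cost_per_request: float,
) -> float:
    """Number of requests after which the distilled model has paid for itself."""
    saving = frontier_cost_per_request - distilled_cost_per_request
    if saving <= 0:
        return math.inf  # the distilled model is no cheaper, so it never pays back
    return upfront_cost / saving


def payback_days(
    upfront_cost: float,
    frontier_cost_per_request: float,
    distilled_cost_per_request: float,
    requests_per_day: float,
) -> float:
    """Payback period in days at a steady daily volume."""
    requests = break_even_requests(
        upfront_cost, frontier_cost_per_request, distilled_cost_per_request
    )
    return requests / requests_per_day


# Placeholder figures: 20,000 upfront, per-request cost falling from 0.01 to 0.001.
high_volume = payback_days(20_000, 0.01, 0.001, requests_per_day=100_000)
low_volume = payback_days(20_000, 0.01, 0.001, requests_per_day=1_000)
```

With these placeholder numbers the investment pays back in about three weeks at 100,000 requests a day and takes roughly six years at 1,000 a day, which is the volume dependence the decision turns on.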

Architecture choices that compound

Beyond the individual levers, a few architectural habits keep cost under control as you scale.

  • Stream responses so users perceive speed even when total generation time is unchanged — a cheaper model that streams often feels better than an expensive one that doesn't.
  • Set output length limits appropriate to the task; uncapped generation is uncapped cost.
  • Retrieve precisely so you send less context — good retrieval is a cost optimisation as well as a quality one.
  • Cache at multiple levels: full responses, retrieved documents, and embeddings, each of which avoids recomputation.
  • Pick the deployment model deliberately — hosted APIs trade higher per-token cost for zero operational burden; self-hosting trades operational work for control and, at scale, lower unit cost.

Hosted versus self-hosted

The largest architectural cost decision is whether to call a hosted API or run open models on your own infrastructure. Hosted APIs are the right choice early: zero operational burden, instant access to frontier models, and you pay only for what you use. The crossover comes at scale. Once volume is high and predictable, self-hosting open models — with quantization, batching and the serving optimisations above — can be markedly cheaper per unit, and it brings the data-control and residency benefits that matter for regulated workloads.

There is no universal answer; there is a crossover point that depends on your volume, your team's operational capacity, and your compliance needs. The mistake is treating the early choice as permanent. Many mature systems end up hybrid: hosted frontier models for the long tail of hard, low-volume queries, and self-hosted distilled or quantized models for the high-volume core. Design so you can move workloads between the two as the economics shift.

Prompt caching and the input-token problem

There is a specific, large cost that hides in plain sight: the part of your prompt that is identical on every call. A system prompt, a set of few-shot examples, a tool schema, a long instruction block — these can run to thousands of tokens, and naively you pay to process them on every single request even though they never change. For high-traffic applications this fixed overhead can dominate the bill.

Prompt caching addresses exactly this. The provider or serving stack caches the processed representation of a stable prefix so that subsequent requests reuse it instead of recomputing it, often at a steep discount on those cached tokens. The engineering implication is to structure prompts so the stable content comes first and the variable content last, maximising the reusable prefix. It is a small change to how you assemble prompts and frequently one of the larger line-item reductions available, precisely because it attacks a cost that scales with every request rather than with the hard ones.
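A sketch of the prompt-assembly change and its effect on input cost. The prompt parts, token counts, price and 90 percent discount are all assumptions; real providers differ on discount, minimum prefix length, cache lifetime and whether writing to the cache carries its own charge.

```python
# Hypothetical prompt parts; in a real system these would be your own.
SYSTEM_PROMPT = "You are a support assistant for an online store."
FEW_SHOT_EXAMPLES = "Q: Where is my order?\nA: Check the tracking link in your email."
TOOL_SCHEMA = '{"name": "lookup_order", "parameters": {"order_id": "string"}}'

# Everything that never changes is joined once, in a fixed order.
STABLE_PREFIX = "\n\n".join([SYSTEM_PROMPT, FEW_SHOT_EXAMPLES, TOOL_SCHEMA])


def build_prompt(retrieved_context: str, user_query: str) -> str:
    """Stable content first, per-request content last, so the prefix is identical."""
    return "\n\n".join([STABLE_PREFIX, retrieved_context, user_query])


def input_cost(
    prefix_tokens: int,
    variable_tokens: int,
    price_per_token: float,
    cached_discount: float,
    cache_hit: bool,
) -> float:
    """Input cost of one request; cached prefix tokens are billed at a discount."""
    prefix_rate = price_per_token * (1.0 - cached_discount) if cache_hit else price_per_token
    return prefix_tokens * prefix_rate + variable_tokens * price_per_token


# Assumed: 3,000 stable tokens, 500 variable tokens, 90% discount on cached tokens.
miss = input_cost(3_000, 500, price_per_token=1e-6, cached_discount=0.9, cache_hit=False)
hit = input_cost(3_000, 500, price_per_token=1e-6, cached_discount=0.9, cache_hit=True)
```

The important property is that every prompt produced by `build_prompt` starts with exactly the same characters. Putting a timestamp or a user name at the top of the prompt would break that and forfeit the discount.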

The same thinking applies to retrieval. If you re-embed the same documents repeatedly, you are paying for work you have already done; cache embeddings. If you retrieve the same context for similar queries, cache the retrieval. Each layer of caching removes a category of repeated computation, and they compound.
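For embeddings, an exact-match cache keyed on a content hash is often enough. This is a minimal in-memory sketch; `EmbeddingCache` is a hypothetical class and `fake_embed` is a stand-in for whatever paid embedding call you actually make.

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """Embeds each distinct text once and reuses the result afterwards."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, Sequence[float]] = {}
        self.misses = 0  # number of times the underlying model was actually called

    def get(self, text: str) -> Sequence[float]:
        # Key on a hash of the content so identical documents share one entry.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Stand-in for a paid embedding call; replace with your real model."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("refund policy for damaged items")
second = cache.get("refund policy for damaged items")  # served from the cache
```

In production the store would usually be a shared key-value service rather than a dictionary, and the key should also include the embedding model's name so that a model upgrade does not return stale vectors.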

Throughput, latency and the hardware you already pay for

When you host your own models, the dominant cost is the GPU, and the metric that determines your unit economics is throughput — how many requests you can serve per GPU per second. Two systems running the identical model on identical hardware can differ several-fold in throughput based purely on how well they keep the GPU busy. A GPU sitting idle between requests is money burning with nothing to show for it.

This is why continuous batching matters so much: it keeps the expensive accelerator saturated by always having work in flight. But there is a genuine tension between throughput and latency that you must manage deliberately. Larger batches improve throughput and therefore cost per request, but can increase the latency any individual user experiences. The right balance depends on the product — a background document-processing job can tolerate latency for maximum throughput, while an interactive assistant must cap latency even at some cost in efficiency.

  • Measure throughput (requests per GPU-second) and the latency distribution together — optimising one blind to the other is how you accidentally ruin the experience or the budget.
  • Tune batch size to the latency budget of the specific product, not to a generic default.
  • Watch the tail latencies (p95, p99), because the worst case is what users remember and what SLAs are written against.
  • Right-size the hardware to the model — over-provisioning wastes money, under-provisioning throttles throughput and inflates per-request cost.
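The first three points can be combined into one selection rule: among the batch sizes you have benchmarked, take the one with the highest throughput whose tail latency fits the product's budget. The sketch below uses a nearest-rank percentile and invented measurements; the function names and figures are assumptions.

```python
import math
from typing import Optional


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pick_batch_size(
    measurements: dict[int, tuple[float, list[float]]], p95_budget_ms: float
) -> Optional[int]:
    """Highest-throughput batch size whose p95 latency fits within the budget.

    measurements maps batch size to (requests per GPU-second, latency samples in ms).
    """
    within_budget = [
        (throughput, batch_size)
        for batch_size, (throughput, latencies) in measurements.items()
        if percentile(latencies, 95) <= p95_budget_ms
    ]
    return max(within_budget)[1] if within_budget else None


# Invented benchmark results for three batch sizes.
measurements = {
    4: (20.0, [200.0] * 20),
    8: (35.0, [300.0] * 19 + [450.0]),
    16: (50.0, [600.0] * 20),
}
interactive_choice = pick_batch_size(measurements, p95_budget_ms=500.0)
background_choice = pick_batch_size(measurements, p95_budget_ms=5_000.0)
```

With a 500 ms budget the interactive product settles on batch size 8, while the latency-tolerant background job can take the full throughput of batch size 16. The same benchmark data gives two different answers for two different products.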

Capacity planning and the cost of idle

Self-hosting introduces a cost that hosted APIs hide: capacity you pay for whether or not you use it. A GPU reserved for your peak traffic sits underused during quiet hours, and that idle time is pure cost. Traffic that is spiky — busy during business hours in your market, quiet overnight — makes this worse, and Indian and regional products with concentrated daytime usage feel it acutely.

The levers here are operational. Autoscaling adds and removes serving capacity with demand so you are not paying peak prices around the clock, though model loading times make this less instant than scaling stateless web servers, so it needs tuning. Batching latency-tolerant work into off-peak windows uses capacity that would otherwise be idle. And the hosted-versus-self-hosted decision reappears here in economic terms: for spiky or unpredictable traffic, a hosted API's pay-per-use model may genuinely be cheaper than paying for idle hardware, while steady high-volume traffic rewards the lower unit cost of well-utilised owned infrastructure.

The general principle is that utilisation, not raw price, determines real cost. A cheap GPU at twenty percent utilisation can cost more per request than an expensive one at ninety. Capacity planning — matching provisioned capacity to actual demand — is therefore as much a cost lever as any model-level optimisation, and it is one teams routinely ignore until the finance team asks why the infrastructure bill does not track usage.
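The utilisation claim is easy to check with arithmetic. This sketch assumes, for simplicity, that both GPUs have the same peak throughput; the hourly prices and the function name are invented for illustration.

```python
def cost_per_request(
    gpu_price_per_hour: float,
    peak_requests_per_second: float,
    utilisation: float,
) -> float:
    """Hardware cost per served request at a given average utilisation."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    served_per_hour = peak_requests_per_second * 3600 * utilisation
    return gpu_price_per_hour / served_per_hour


# Assumed: a GPU at 1.0 per hour running 20% busy, and one at 3.0 per hour running 90% busy.
cheap_idle = cost_per_request(1.0, peak_requests_per_second=10.0, utilisation=0.2)
pricey_busy = cost_per_request(3.0, peak_requests_per_second=10.0, utilisation=0.9)
```

Even at three times the hourly price, the well-utilised GPU comes out cheaper per request in this example. In practice a more expensive accelerator usually has higher peak throughput too, which widens the gap further.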

Measure cost like you measure quality

The throughline of every lever is measurement. You cannot optimise cost you do not track, and you cannot safely optimise it without watching quality at the same time. Instrument cost per request, per feature and per user, and track it alongside your quality metrics so that every cost optimisation is validated against the evaluation harness. A change that halves cost and quietly drops faithfulness by ten points is not a win; it is a deferred incident.

One framing makes all of this concrete for decision-makers: compute your cost per successful outcome, not just your cost per API call. A cheap model that fails a third of the time and forces a retry or a human handoff is not cheap once you count the failures. A slightly more expensive model that gets it right the first time may be the lower-cost option per resolved ticket or per completed task. Pairing the cost metric with the quality metric this way prevents the classic false economy of optimising the per-call price while quietly destroying the unit economics of the actual business outcome.
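One simple way to compute this metric, as a sketch: divide the expected spend per attempt, including the downstream cost of a failure, by the success rate. The model is deliberately crude (every failure is assumed to incur the same follow-up cost), and the figures are invented.

```python
def cost_per_success(
    cost_per_call: float, success_rate: float, failure_cost: float = 0.0
) -> float:
    """Expected spend per successful outcome.

    failure_cost is what a failed attempt costs downstream, for example the
    share of a human handoff attributed to it.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_spend_per_attempt = cost_per_call + (1.0 - success_rate) * failure_cost
    return expected_spend_per_attempt / success_rate


# Invented figures: a cheap model that fails a third of the time,
# and a model five times the price that succeeds 97% of the time.
cheap_model = cost_per_success(0.002, success_rate=2 / 3, failure_cost=0.05)
strong_model = cost_per_success(0.010, success_rate=0.97, failure_cost=0.05)
```

With these numbers the model that is five times pricier per call is less than half the cost per resolved task. Set `failure_cost` to zero and the cheap model wins again, which shows that the answer depends on what a failure really costs your business.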

Treated this way, inference cost stops being a scary surprise and becomes an ordinary engineering parameter you tune with the same discipline as latency or accuracy. The order-of-magnitude savings are real and routinely achievable — caching, routing, quantization, batching and distillation, each validated against quality. The teams that build with cost in mind from the start ship AI products with a sustainable business model. The teams that ignore it until the invoice arrives often discover their delightful feature was never economically viable — which is a far worse problem to find late than early.

The encouraging part is how much headroom most systems have. We rarely meet a production AI workload that cannot be made several times cheaper without users noticing, simply by applying these levers in order: trim wasted tokens, cache aggressively, route to right-sized models, quantize and batch what remains, and distill the high-volume core. Each is ordinary engineering, each is validated against quality, and together they routinely turn an unaffordable feature into a profitable one. Designing for cost from day one is the difference between an AI product that scales into a business and one that scales into a liability.
