Agents · MLOps

Shipping AI agents to production without the chaos

Autonomous agents are thrilling in a notebook and terrifying in production. Guardrails, observability and gradual rollout are the boring engineering that turns one into the other.

DerbaTech Engineering · 18 February 2026 · 12 min read


An AI agent is a language model given the ability to act: to call tools, query systems, write data, trigger workflows. That capability is exactly what makes agents transformative and exactly what makes them dangerous. A chatbot that hallucinates produces an embarrassing sentence. An agent that hallucinates can issue a refund, delete a record, or send an email to the wrong customer. The autonomy is the product and the liability in the same breath.

We have built agent systems that run quietly in production handling real operational load, and we have been called in to rescue ones that went sideways. The pattern that separates the two is not the cleverness of the model or the elegance of the prompt. It is engineering discipline: guardrails, observability and a rollout strategy that earns autonomy gradually rather than granting it all at once. This article is about that discipline.

The notebook-to-production gap

In a notebook, an agent that completes a complex task on the first try feels like the future has arrived. The temptation is to wire it directly into production systems and let it run. This is how the horror stories start. The notebook hides everything that matters in production: the long tail of inputs the agent has never seen, the partial failures of the tools it calls, the adversarial users who will probe it, and the simple fact that an agent acting thousands of times a day will eventually hit a case where its plausible-looking plan is catastrophically wrong.

Production agents are not a fundamentally different model from notebook agents. They are the same model wrapped in scaffolding that constrains, observes and recovers. The intelligence is the easy part — frontier models are already capable enough for most agentic tasks. The hard part, and the part that determines success, is the engineering around the model.

Guardrails before autonomy

The first principle is that not every action carries the same risk, and the system must know the difference. Reading data is low-risk. Sending an irreversible communication, moving money, or deleting records is high-risk. Classify every tool the agent can call by the cost of it going wrong, and apply controls proportional to that cost.

For low-risk actions, let the agent act freely — that is where the productivity comes from. For high-risk or irreversible actions, interpose a checkpoint: a dry-run that previews exactly what will happen, and a human approval gate for the cases that warrant it. This single design decision — graduated autonomy keyed to risk — prevents the failure modes that cause leadership to ban agents entirely. It lets you capture most of the value while bounding the downside.
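A minimal sketch of graduated autonomy might look like the following. The `Tool` structure, the two risk tiers and the example tools are illustrative assumptions, not a prescribed API; the point is only that the risk tier lives on the tool and the gate is applied in one place.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Risk(Enum):
    LOW = "low"    # reads and other reversible actions
    HIGH = "high"  # irreversible or costly actions


@dataclass
class Tool:
    name: str
    risk: Risk
    run: Callable[[dict], str]      # performs the action
    preview: Callable[[dict], str]  # dry-run: describes the action, no side effects


def execute(tool: Tool, args: dict, approve: Callable[[str], bool]) -> str:
    """Run low-risk tools directly; gate high-risk tools behind a dry-run and approval."""
    if tool.risk is Risk.LOW:
        return tool.run(args)
    plan = tool.preview(args)  # show exactly what would happen
    if not approve(plan):      # human (or policy) decides on the preview
        return f"blocked: {plan}"
    return tool.run(args)


# Hypothetical tools, for illustration only.
lookup = Tool("lookup_order", Risk.LOW,
              run=lambda a: f"order {a['id']}: shipped",
              preview=lambda a: f"would look up order {a['id']}")
refund = Tool("issue_refund", Risk.HIGH,
              run=lambda a: f"refunded {a['amount']} on order {a['id']}",
              preview=lambda a: f"would refund {a['amount']} on order {a['id']}")

print(execute(lookup, {"id": 7}, approve=lambda plan: False))
print(execute(refund, {"id": 7, "amount": 40}, approve=lambda plan: False))
```

Here the `approve` callback stands in for whatever approval channel you use; the low-risk lookup never consults it, while the refund cannot run without it.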

Figure: Planner, executor and guardrails. A goal goes to a planner that decomposes it into steps. The executor acts on each step, but every action passes a guardrail and policy check: low-risk actions go straight to tools, while high-risk actions require a dry-run and human approval before reaching tools. Results flow back to the executor, every action is written to a trace and eval store, and the planner replans until the goal is met. The guardrail check is the gate every action passes through, and where risk-based controls live.

Policy as code, not as prompt

A common mistake is to encode the rules of what an agent may do entirely in its prompt — "do not delete anything without confirmation," and so on. Prompts are suggestions, not guarantees; a sufficiently confused or manipulated model will violate them. Hard constraints belong in a policy layer outside the model: an allowlist of permitted tools, parameter validation, rate limits, and approval requirements enforced in code that the model cannot talk its way around. The model proposes; the policy layer disposes.
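As a rough illustration, a policy layer can be a small piece of ordinary code that every proposed tool call must pass. The class below is a sketch under our own assumptions: the names `PolicyLayer` and `Decision`, the tool names and the limits are all hypothetical, and a production version would persist its rate-limit state rather than hold it in memory.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


class PolicyLayer:
    """Checks a proposed tool call in code, independent of anything the prompt says."""

    def __init__(self, allowlist, validators, needs_approval, max_calls, per_seconds):
        self.allowlist = set(allowlist)
        self.validators = validators            # tool name -> function(args) -> bool
        self.needs_approval = set(needs_approval)
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self._calls = deque()                   # timestamps of recently allowed calls

    def check(self, tool, args, approved=False, now=None):
        now = time.monotonic() if now is None else now
        if tool not in self.allowlist:
            return Decision(False, f"{tool} is not on the allowlist")
        validate = self.validators.get(tool)
        if validate is not None and not validate(args):
            return Decision(False, f"invalid arguments for {tool}")
        if tool in self.needs_approval and not approved:
            return Decision(False, f"{tool} requires human approval")
        # Forget calls that have aged out of the rate-limit window.
        while self._calls and now - self._calls[0] >= self.per_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return Decision(False, "rate limit exceeded")
        self._calls.append(now)
        return Decision(True, "ok")


# Hypothetical configuration, for illustration only.
policy = PolicyLayer(
    allowlist={"read_record", "issue_refund"},
    validators={"issue_refund": lambda a: 0 < a.get("amount", 0) <= 100},
    needs_approval={"issue_refund"},
    max_calls=2,
    per_seconds=60,
)
print(policy.check("delete_record", {"id": 7}, now=0.0))
print(policy.check("issue_refund", {"amount": 50}, now=0.0))
```

Both example calls are denied: the first because the tool is not on the allowlist, the second because it lacks approval. Nothing the model writes in its output can change either result.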

If you can't see it, you can't ship it

The second principle is total observability. An agent makes a sequence of decisions — plan, tool call, result, replan — and when something goes wrong you need to replay that sequence exactly. Without traces, debugging an agent failure is archaeology by guesswork. With them, it is a matter of reading the transcript and seeing precisely where the reasoning or the tool result went off the rails.

Instrument every step: the plan the agent formed, every tool call with its arguments and result, the intermediate reasoning where available, and the final outcome. Store these traces durably. They serve three purposes at once — debugging incidents, feeding your evaluation pipeline, and providing the audit trail that compliance and security teams will require for any agent touching sensitive systems.
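One simple way to capture such a trace is an append-only list of typed events serialised as JSON lines. The event kinds and field names below are assumptions chosen for illustration; most teams would emit the same information through whatever tracing system they already run.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    kind: str      # e.g. "plan", "tool_call", "tool_result", "outcome"
    payload: dict  # must be JSON-serialisable
    ts: float = field(default_factory=time.time)


@dataclass
class Trace:
    goal: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, kind, **payload):
        self.events.append(TraceEvent(kind, payload))

    def to_jsonl(self):
        """One JSON object per event, suitable for appending to a durable log."""
        return "\n".join(
            json.dumps({"trace_id": self.trace_id, "goal": self.goal, **asdict(e)})
            for e in self.events
        )


trace = Trace(goal="resolve ticket 123")
trace.record("plan", steps=["look up order", "reply to customer"])
trace.record("tool_call", tool="lookup_order", args={"id": 7})
trace.record("tool_result", tool="lookup_order", result="shipped")
trace.record("outcome", status="resolved")
print(trace.to_jsonl())
```

Because every event carries the same `trace_id`, the full sequence for one task can be replayed in order, which is what debugging, evaluation and audit all need.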

Architectures that stay sane

Most reliable agent systems converge on a planner/executor structure. A planner decomposes a goal into steps and decides what to do next; an executor carries out one step at a time, observes the result, and reports back so the planner can adapt. Separating planning from execution makes behaviour far more inspectable than a single monolithic loop, and it gives you natural points to insert guardrails and checkpoints.
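Stripped to its skeleton, the planner/executor structure is a short loop. In this sketch the planner and executor are plain functions passed in as parameters; `scripted_planner` and `echo_executor` are stand-ins we made up so the loop can run without a model.

```python
from typing import Callable, Optional

Step = dict  # e.g. {"tool": "lookup", "args": {...}}


def run_agent(
    goal: str,
    plan_next: Callable[[str, list], Optional[Step]],  # planner: None means "done"
    execute: Callable[[Step], str],                    # executor: runs one step
    max_steps: int = 10,
) -> list:
    """Alternate planning and execution until the planner reports completion."""
    history = []  # (step, result) pairs the planner adapts to
    for _ in range(max_steps):
        step = plan_next(goal, history)
        if step is None:
            return history
        result = execute(step)  # guardrail and policy checks would wrap this call
        history.append((step, result))
    raise RuntimeError(f"goal not met within {max_steps} steps")


# Stand-in planner and executor, for illustration only.
def scripted_planner(goal, history):
    script = [
        {"tool": "lookup", "args": {"id": 7}},
        {"tool": "reply", "args": {"text": "your order has shipped"}},
    ]
    return script[len(history)] if len(history) < len(script) else None


def echo_executor(step):
    return f"ran {step['tool']}"


history = run_agent("answer order query", scripted_planner, echo_executor)
print(history)
```

The single `execute(step)` call is the natural seam for guardrails: every action the agent takes passes through that one line.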

Resist the urge to build a sprawling swarm of agents talking to each other before you have a single agent working reliably. Multi-agent systems are powerful but multiply the failure surface and the difficulty of debugging. The right progression is one well-instrumented agent doing one bounded job, then expanding scope as confidence grows. Complexity should be earned, not assumed.

Bounded scope beats general autonomy

The agents that work in production are almost always narrow. "Resolve this specific class of support ticket" succeeds where "handle anything a customer asks" fails. A bounded task has a definable success criterion, a manageable set of tools, and a tractable evaluation set. General autonomy sounds impressive and behaves unpredictably. Ship the narrow agent, prove it, then widen the aperture deliberately.

Gradual rollout is the strategy

You do not launch an agent the way you launch a static feature. You earn its autonomy one notch at a time, and each notch is justified by evidence.

1. Shadow mode: the agent runs alongside the existing process, proposing actions without executing them. You compare its proposals to what actually happened and measure how often it would have been right.
2. Human-in-the-loop: the agent acts, but a human approves each action before it takes effect. This builds the trace history and the trust, and surfaces the failure modes safely.
3. Supervised autonomy: the agent acts on low-risk actions automatically, escalating only the high-risk or low-confidence cases to a human.
4. Full autonomy on the bounded task, with monitoring and the ability to pull it back instantly if metrics degrade.

Each stage is gated by data from the one before. If shadow mode shows the agent would have been wrong 8% of the time on a high-stakes action, you do not advance; you fix it first. This approach is slower than a big-bang launch, but the agent is dramatically more likely to still be running, and trusted, six months later.
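The stages and the evidence gate between them can be sketched in a few lines. The stage names mirror the list above; the 1% threshold in the usage example is a made-up number, and the right threshold depends on the stakes of the action.

```python
from enum import IntEnum


class Stage(IntEnum):
    SHADOW = 1
    HUMAN_IN_THE_LOOP = 2
    SUPERVISED = 3
    FULL = 4


def error_rate(proposed, actual):
    """Share of cases where the agent's proposal differed from what actually happened."""
    if len(proposed) != len(actual) or not proposed:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(p != a for p, a in zip(proposed, actual))
    return wrong / len(proposed)


def next_stage(current, observed_error_rate, max_error_rate):
    """Advance one stage only when the evidence clears the threshold."""
    if current is Stage.FULL or observed_error_rate > max_error_rate:
        return current
    return Stage(current + 1)


# Shadow-mode comparison on hypothetical data.
proposed = ["refund", "refund", "escalate", "close"]
actual = ["refund", "close", "escalate", "close"]
rate = error_rate(proposed, actual)
print(rate, next_stage(Stage.SHADOW, rate, max_error_rate=0.01).name)
```

With one disagreement in four cases the agent stays in shadow mode. The same function would hold it there at an 8% error rate and release it only once the measured rate drops below the agreed bar.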

Handling failure gracefully

Agents will fail; the question is how. A well-engineered agent fails safely: it recognises when it is stuck or uncertain and escalates to a human rather than barrelling ahead. Build in explicit uncertainty handling: confidence thresholds, step limits to prevent runaway loops, and timeouts. An agent that loops forever or confidently takes a wrong irreversible action is far worse than one that stops and asks for help.
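The three limits can share one loop. The sketch below assumes each proposed step arrives with a confidence score, which is itself an assumption: how you obtain a usable confidence signal varies by system. The default thresholds are placeholders.

```python
import time
from typing import NamedTuple


class Outcome(NamedTuple):
    status: str    # "completed" or "escalated"
    actions: list  # actions taken before stopping
    reason: str    # empty when completed


def guarded_run(steps, min_confidence=0.7, max_steps=5, timeout_s=30.0,
                clock=time.monotonic):
    """Take (action, confidence) steps in order, escalating as soon as a limit trips."""
    start = clock()
    actions = []
    for action, confidence in steps:
        if len(actions) >= max_steps:
            return Outcome("escalated", actions, "step limit reached")
        if clock() - start > timeout_s:
            return Outcome("escalated", actions, "timeout")
        if confidence < min_confidence:
            return Outcome("escalated", actions, f"low confidence on {action}")
        actions.append(action)  # a real agent would execute the action here
    return Outcome("completed", actions, "")


print(guarded_run([("lookup_order", 0.9), ("issue_refund", 0.4)]))
print(guarded_run([("retry", 0.9)] * 10, max_steps=3))
```

In both examples the agent stops before doing damage: once because it is unsure about a refund, once because it is repeating itself.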

Design the escalation path as carefully as the happy path. When the agent hands off to a human, it should hand off context: what it was trying to do, what it tried, and why it stopped. A good handoff turns a failure into a minor interruption; a bad one turns it into a frustrated user and a confused operator.
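A handoff can be as small as a structured record with those three pieces of context. The field names here are our own illustration, not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    goal: str                                      # what the agent was trying to do
    attempted: list = field(default_factory=list)  # what it tried
    stop_reason: str = ""                          # why it stopped

    def summary(self) -> str:
        tried = "; ".join(self.attempted) or "nothing yet"
        return (f"Goal: {self.goal}\n"
                f"Tried: {tried}\n"
                f"Stopped because: {self.stop_reason}")


handoff = Handoff(
    goal="refund order 7",
    attempted=["looked up order 7", "previewed refund of 40"],
    stop_reason="refund requires human approval",
)
print(handoff.summary())
```

An operator who receives this summary can act immediately instead of reconstructing the situation from logs.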

The operational reality

Running agents in production is an ongoing operation, not a one-time deploy. Tool APIs change and break. Models get updated and behave differently. New categories of input appear. Costs can spike if an agent enters an expensive loop. You need monitoring on success rate, latency, cost per task, and escalation rate, with alerts when any of them drift — and an evaluation harness so that when you update the model or a prompt, you know before shipping whether the agent still behaves.
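A first version of drift alerting needs little more than a baseline and a tolerance. The metric names follow the paragraph above; the baseline values, the sample data and the 20% tolerance are invented for illustration.

```python
from statistics import mean


def drift_alerts(baseline, recent, tolerance=0.2):
    """Flag metrics whose recent mean moved more than `tolerance` (relative) from baseline."""
    alerts = []
    for name, base in baseline.items():
        current = mean(recent[name])
        # Zero baselines are skipped because relative drift is undefined for them.
        if base and abs(current - base) / abs(base) > tolerance:
            alerts.append(f"{name}: baseline {base}, recent {current:.3g}")
    return alerts


# Hypothetical numbers, for illustration only.
baseline = {"success_rate": 0.95, "latency_s": 4.0,
            "cost_per_task_usd": 0.10, "escalation_rate": 0.05}
recent = {"success_rate": [0.94, 0.96], "latency_s": [4.1, 3.9],
          "cost_per_task_usd": [0.30, 0.34], "escalation_rate": [0.05, 0.05]}

alerts = drift_alerts(baseline, recent)
print(alerts)
```

In this sample only cost per task has drifted, which is the signature of an agent that has started taking more steps or entering loops while still appearing to succeed.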

For organisations operating under regulatory expectations — financial services, healthcare, and the data-protection regimes that increasingly apply across India and globally — the audit trail and the policy layer are not engineering niceties. They are how you demonstrate that an autonomous system acted within its permitted bounds, and they need to be designed in from the first commit, not bolted on after an auditor asks.

Memory and context management

An agent working on anything non-trivial quickly accumulates more context than fits in a model's window: the original goal, every step taken, every tool result, every observation. Naively appending all of it leads to two failure modes at once — the context overflows and gets truncated unpredictably, and the cost per step climbs as the history grows. Context management is therefore not an optimisation; it is a correctness concern, because an agent that loses the wrong piece of context makes the wrong decision.

Effective agents manage memory deliberately. Short-term working memory holds the current task's recent steps. A summarisation step compresses older history into a compact running summary so the agent retains the thread without carrying every token. Longer-term memory — facts the agent should remember across sessions — lives in an external store the agent retrieves from when relevant, rather than in the prompt. Designing this hierarchy is one of the quieter determinants of whether an agent stays coherent over a long task or drifts into confusion halfway through.
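The hierarchy described above can be sketched with a bounded window, a running summary and a lookup store. Everything here is a simplification under stated assumptions: `summarise` would normally be a model call rather than string concatenation, and `long_term` stands in for an external store with real retrieval.

```python
from collections import deque


class AgentMemory:
    def __init__(self, summarise, window=3):
        self.summarise = summarise  # function(old_summary, evicted_step) -> new summary
        self.window = window
        self.recent = deque()       # short-term working memory
        self.summary = ""           # compressed older history
        self.long_term = {}         # stand-in for an external store

    def add_step(self, step):
        self.recent.append(step)
        while len(self.recent) > self.window:
            evicted = self.recent.popleft()
            self.summary = self.summarise(self.summary, evicted)

    def context(self, goal, recall_keys=()):
        """Assemble what goes into the next model call."""
        facts = [self.long_term[k] for k in recall_keys if k in self.long_term]
        return {"goal": goal, "facts": facts,
                "summary": self.summary, "recent": list(self.recent)}


def join_summary(old, step):
    # Placeholder for a model-generated summary.
    return f"{old}; {step}" if old else step


memory = AgentMemory(summarise=join_summary, window=2)
memory.long_term["refund_limit"] = "refunds over 100 need approval"
for i in range(1, 5):
    memory.add_step(f"step {i}")

print(memory.context("resolve ticket 123", recall_keys=["refund_limit"]))
```

The goal and retrieved facts are always present in the assembled context, so a constraint given early cannot silently fall out of the window as the history grows.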

The practical signs that memory needs attention are familiar to anyone who has operated agents: the agent forgets a constraint it was given earlier, repeats a step it already completed, or contradicts an earlier decision. Each of those is a context-management failure, and each is fixable with deliberate memory design rather than a more powerful model.

Cost and latency are design constraints

Agents are expensive in a way single-shot LLM calls are not, because a single agent task involves many model calls — one per planning step, often several per step once tool results come back. A task that takes ten reasoning steps makes at least ten model calls, and the cost and latency add up fast. An agent feature that is delightful in testing can become unaffordable or unacceptably slow at scale, and this surprises teams who budgeted as though each task were one call.
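A back-of-the-envelope model makes the gap concrete. It assumes each model call re-reads the whole history so far, so input tokens grow with every step; the token counts and per-token prices below are invented round numbers, not any provider's rates.

```python
def task_cost(steps, base_tokens, tokens_added_per_step, output_tokens,
              price_in, price_out):
    """Cost of one task when each model call re-reads the growing history.

    Prices are per token. Call k (0-based) reads
    base_tokens + k * tokens_added_per_step input tokens.
    """
    input_total = sum(base_tokens + k * tokens_added_per_step for k in range(steps))
    output_total = steps * output_tokens
    return input_total * price_in + output_total * price_out


# Hypothetical figures, for illustration only.
params = dict(base_tokens=1000, tokens_added_per_step=500, output_tokens=200,
              price_in=1e-6, price_out=4e-6)

single_call = task_cost(steps=1, **params)
ten_steps = task_cost(steps=10, **params)
print(f"one call: {single_call:.4f}, ten-step task: {ten_steps:.4f}")
```

Under these assumptions the ten-step task costs roughly 22 times a single call, not 10 times, because the history each call re-reads keeps growing. That compounding is what budgets built on a one-call mental model miss.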

Controlling this is a design exercise, not an afterthought.

- Use a smaller, faster model for routine steps and reserve the most capable model for genuinely hard planning; most steps in most tasks are easy.
- Cap the number of steps to prevent runaway loops; the cap protects both cost and the user experience.
- Parallelise independent steps rather than running everything sequentially, which cuts wall-clock latency on multi-part tasks.
- Cache tool results and sub-plans where the same work recurs across tasks.
- Stream progress to the user so a multi-step task feels responsive even when its total duration is unavoidably longer than a single call.
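Three of these levers (model routing, caching and parallel execution) fit in one short sketch using Python's standard `asyncio`. The model names and the `slow_lookup` tool are placeholders, and the cache is deliberately naive: it does not expire entries or deduplicate concurrent requests for the same key.

```python
import asyncio


def pick_model(step_kind):
    """Route hard planning to the large model and everything else to the small one."""
    return "large-model" if step_kind == "plan" else "small-model"


class CachedTool:
    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.calls = 0  # underlying calls actually made

    async def __call__(self, key):
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = await self.fn(key)
        return self.cache[key]


async def slow_lookup(key):
    await asyncio.sleep(0.01)  # stands in for a network call
    return f"result for {key}"


async def main():
    lookup = CachedTool(slow_lookup)
    # Independent steps run concurrently instead of one after another.
    first = await asyncio.gather(lookup("a"), lookup("b"), lookup("c"))
    second = await lookup("a")  # served from the cache, no new call
    return first, second, lookup.calls


first, second, calls = asyncio.run(main())
print(first, second, calls)
```

The three lookups take about as long as one, and the repeated lookup costs nothing, which is the kind of saving that has to be designed in rather than patched on.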

The throughline is that an agent's economics are a property of its architecture, decided early. Teams that treat cost and latency as design constraints from the first sketch ship agents that scale; teams that treat them as something to optimise later often find the architecture has to change to make the numbers work.

Multi-agent systems: when they earn their complexity

Multi-agent architectures — several specialised agents collaborating, often coordinated by an orchestrator — are genuinely powerful and genuinely overused. They make sense when a problem decomposes naturally into specialised roles: a researcher agent that gathers information, a writer that drafts, a critic that reviews. The specialisation can improve quality, because each agent has a focused role and a tailored prompt and toolset, the same way a well-structured team outperforms one generalist doing everything.

But every additional agent multiplies the failure surface, the cost, and the difficulty of debugging. Agents talking to agents can amplify each other's errors, loop indefinitely, or produce emergent behaviour nobody designed. The honest guidance is to earn multi-agent complexity rather than assume it: start with a single well-instrumented agent, prove it on the bounded task, and introduce additional agents only when a clear role separation justifies the added surface — and only with the same guardrails, traces and evaluation applied to each agent individually. Complexity that is not earned is just risk you have volunteered for.

The unglamorous truth

Agents succeed in production through boring engineering. Guardrails keyed to risk. Policy enforced in code. Total observability. Bounded scope. Gradual, evidence-gated rollout. Graceful failure and clean escalation. None of it is the part that makes a demo go viral, and all of it is the part that determines whether your agent is still running next quarter.

It is worth being honest about why this discipline is so often skipped. Guardrails, traces and gradual rollout slow down the thrilling part — the moment the agent does something clever on its own — and they require building infrastructure that has no demo value. Under deadline pressure, with a working notebook in hand, the temptation to wire it straight into production is enormous. Almost every agent horror story we have been called to clean up started with exactly that decision: a capable prototype promoted to production without the scaffolding, because the scaffolding felt like a luxury the timeline could not afford. It always turns out to have been the cheaper option.

There is also an organisational dimension. Agents touch real systems and take real actions, which means their failures are visible to customers, finance and sometimes regulators. The teams that succeed treat an agent rollout as a cross-functional effort, not a purely technical one: operations defines which actions are high-risk, security reviews the tool permissions, and the people whose work the agent automates are involved in shaping and supervising it rather than having it imposed on them. An agent that operators trust and helped design gets used and improved; one dropped on them as a replacement gets quietly worked around. The technology is only half the deployment.

The good news is that this is a solved problem in the sense that the patterns are known and repeatable. The teams that treat agents as a serious engineering effort — with the same rigour they would apply to any system that can take consequential action — ship agents that quietly do real work. The teams that chase the autonomous-everything dream without the scaffolding spend their time cleaning up after it. Build the boring parts well, and the impressive parts take care of themselves.

If you are weighing an agent project — or rescuing one that outran its guardrails — the most useful first move is to pick one narrow, valuable task, instrument it end to end, and earn its autonomy in stages with the evidence to back each step. That is precisely the kind of disciplined, production-first agent work we do at DerbaTech: bounded scope, real observability, policy enforced in code, and a rollout your operators and your auditors can both trust. Start small, measure honestly, and expand only what the data has proven.
