Autonomous agents that quietly run a third of back-office operations
A planner/executor agent system that reasons over the client's tools and APIs to resolve routine operational exceptions end to end — safely, with guardrails and human approval where the stakes demand it.

- ~30% of routine ops tasks automated
- <2s typical action latency
- 0 unsupervised high-risk actions
The client operates a logistics platform whose operations team spent a large share of every day on repetitive exception handling: a shipment flagged for a mismatch, a record that needed reconciling across two systems, a routine status update that required touching three different tools. The work was necessary, unglamorous, and an enormous drain on a skilled team's time — exactly the kind of work that looks automatable until you try to automate it with rigid rules.
They had tried rules-based automation before and been burned. Every new edge case broke a brittle script, and the long tail of exceptions was precisely where the time went. They came to us wanting to know whether AI agents could handle the judgement that rules could not — and, reasonably, they were nervous about handing autonomy to a system that acts on their live operational data.
The challenge
This was a problem with real upside and real danger in equal measure. The upside was obvious: a third of the team's work was routine enough to delegate. The danger was that an agent with access to operational systems can take actions that are expensive or impossible to undo, and a confidently wrong agent is far worse than a slow human.
- The tasks required reasoning over several systems at once — the kind of cross-tool judgement that defeated rules-based automation.
- Some actions were low-risk and reversible; others moved real inventory or money and could not be allowed to happen unsupervised.
- The operations team needed to stay in control and in the loop, not be replaced by an opaque system they could not trust or correct.
- Any solution had to integrate with the client's existing tools and APIs rather than demanding they rebuild their stack.
The honest first conversation was about scope. We did not promise an agent that could handle anything; we proposed agents that would handle a bounded, well-defined class of exceptions extremely reliably, and expand only as the evidence justified it.
Our approach
We built a planner/executor agent system: a planner decomposes an incoming exception into steps and decides what to do next, and an executor carries out one step at a time against the client's APIs, observing each result before proceeding. Separating planning from execution made the agent's behaviour inspectable and gave us natural points to enforce control.
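The plan, execute, observe loop described above can be sketched in a few lines. This is a minimal illustration, not the production system: `run_agent`, `toy_planner`, the tool names, and the in-memory `SHIPMENTS` table are all hypothetical, and the planner here is a hard-coded stand-in for the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    tool: str   # hypothetical tool name, e.g. "lookup_shipment"
    args: dict  # keyword arguments for that tool


@dataclass
class Observation:
    step: Step
    result: dict  # what the tool returned


# A planner sees the exception and everything observed so far, and returns
# the next step, or None once it considers the exception resolved.
Planner = Callable[[dict, List[Observation]], Optional[Step]]


def run_agent(exception: dict, planner: Planner,
              tools: Dict[str, Callable[..., dict]],
              max_steps: int = 10) -> List[Observation]:
    """Plan one step, execute it, observe the result, repeat."""
    history: List[Observation] = []
    for _ in range(max_steps):  # hard cap so a confused planner cannot loop forever
        step = planner(exception, history)
        if step is None:
            break
        result = tools[step.tool](**step.args)
        history.append(Observation(step, result))
    return history


def toy_planner(exception: dict, history: List[Observation]) -> Optional[Step]:
    """Stand-in for the model: look up the shipment, then fix a status mismatch."""
    sid = exception["shipment_id"]
    if not history:
        return Step("lookup_shipment", {"shipment_id": sid})
    if len(history) == 1 and history[0].result["status"] != exception["expected_status"]:
        return Step("update_status",
                    {"shipment_id": sid, "status": exception["expected_status"]})
    return None


# In-memory stand-ins for the client's APIs (assumptions for illustration).
SHIPMENTS = {"S1": {"status": "pending"}}


def lookup_shipment(shipment_id: str) -> dict:
    return dict(SHIPMENTS[shipment_id])


def update_status(shipment_id: str, status: str) -> dict:
    SHIPMENTS[shipment_id]["status"] = status
    return {"ok": True}


TOOLS = {"lookup_shipment": lookup_shipment, "update_status": update_status}

if __name__ == "__main__":
    trace = run_agent({"shipment_id": "S1", "expected_status": "delivered"},
                      toy_planner, TOOLS)
    print([obs.step.tool for obs in trace])
```

Because the executor runs exactly one step between planner calls, the line that invokes `tools[step.tool]` is the single place where a control check or a trace write would naturally go.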
Guardrails keyed to risk
The heart of the design is that every action the agent can take is classified by the cost of it going wrong, and the controls are proportional. Low-risk, reversible actions execute freely — that is where the productivity comes from. High-risk or irreversible actions require a dry-run preview and a human approval before they touch the systems of record. This single decision is what made the client comfortable granting any autonomy at all, because the downside was bounded by design.
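One way to picture risk-proportional control is a small gate that every proposed action passes through. The sketch below is illustrative only: the tool names, the `RISK_BY_TOOL` table, and the `gate` function are assumptions, and the choice to treat unclassified tools as high risk is ours rather than something the project documents.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"    # reversible, cheap to undo
    HIGH = "high"  # moves inventory or money, or cannot be undone


# Hypothetical classification table; a real one would be agreed with the ops team.
RISK_BY_TOOL = {
    "update_status": Risk.LOW,
    "add_note": Risk.LOW,
    "adjust_inventory": Risk.HIGH,
    "issue_refund": Risk.HIGH,
}


@dataclass
class Decision:
    execute: bool
    reason: str
    preview: Optional[str] = None  # dry-run description shown to the approver


def gate(tool: str, args: dict, approved_by: Optional[str] = None) -> Decision:
    """Decide whether a proposed action may run now."""
    # Assumption: anything not explicitly classified is treated as high risk.
    risk = RISK_BY_TOOL.get(tool, Risk.HIGH)
    if risk is Risk.LOW:
        return Decision(True, "low-risk action, executes freely")
    if approved_by is None:
        preview = f"DRY RUN: would call {tool} with {args}"
        return Decision(False, "high-risk action awaiting human approval", preview)
    return Decision(True, f"high-risk action approved by {approved_by}")
```

Calling `gate("issue_refund", {"order_id": "O1", "amount": 40})` returns a decision with `execute=False` and a dry-run preview; the same call with `approved_by="ops-lead"` returns `execute=True`.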
Policy enforced in code, not in prompts
We did not rely on instructing the model to behave. The rules of what the agent may do — the allowlist of tools, parameter validation, spending and scope limits, the approval requirements — live in a policy layer outside the model that it cannot talk its way around. The model proposes; the policy layer disposes. That distinction is what separates an agent you can run in production from one you can only run in a demo.
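A policy layer of this kind can be as plain as a table of rules and a function that checks every proposed call before it reaches an API. The following is a hedged sketch under our own assumptions: `ToolPolicy`, `POLICY`, `enforce`, the tool names, and the 100.0 spending limit are all hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


class PolicyViolation(Exception):
    """Raised when a proposed call breaks a rule; the call is never executed."""


@dataclass(frozen=True)
class ToolPolicy:
    required_args: frozenset
    max_amount: Optional[float] = None  # spending limit, if the tool moves money
    needs_approval: bool = False


# The allowlist: a tool that is not in this table cannot be called at all.
# Names and limits are illustrative assumptions.
POLICY = {
    "update_status": ToolPolicy(frozenset({"shipment_id", "status"})),
    "issue_refund": ToolPolicy(frozenset({"order_id", "amount"}),
                               max_amount=100.0, needs_approval=True),
}


def enforce(tool: str, args: dict, approved: bool = False) -> None:
    """Validate a model-proposed call. Returns None if allowed, raises otherwise."""
    policy = POLICY.get(tool)
    if policy is None:
        raise PolicyViolation(f"tool not on allowlist: {tool}")
    if set(args) != set(policy.required_args):
        raise PolicyViolation(f"unexpected or missing arguments for {tool}")
    if policy.max_amount is not None:
        amount = args.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or not 0 < amount <= policy.max_amount:
            raise PolicyViolation(f"amount outside permitted range for {tool}")
    if policy.needs_approval and not approved:
        raise PolicyViolation(f"{tool} requires human approval")
```

The point of the design is that `enforce` never sees the model's reasoning, only the concrete call it proposed, so no phrasing in a prompt or a tool result can change the outcome.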
Total observability
Every plan, every tool call with its arguments and result, and every outcome is written to a durable trace store. When something goes wrong — and in the early stages it did — we could replay exactly what the agent saw and decided, fix the cause, and add the case to the evaluation set so it could never recur silently. The same traces gave the operations team and the client's auditors a complete record of what the agent did and on whose authority.
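As a rough illustration of what a replayable trace can look like, the sketch below appends one JSON object per event to a file and reads a single run back in order. The `TraceStore` class, its field names, and the JSON Lines file are assumptions made for the example; the case study does not say which storage technology was used.

```python
import json
import time
from pathlib import Path
from typing import List


class TraceStore:
    """Append-only event log, one JSON object per line (hypothetical design)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def record(self, run_id: str, kind: str, payload: dict,
               actor: str = "agent") -> None:
        # `actor` captures on whose authority the event happened,
        # e.g. "agent" or the name of the human approver.
        event = {"run_id": run_id, "ts": time.time(), "kind": kind,
                 "actor": actor, "payload": payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self, run_id: str) -> List[dict]:
        """Return every event for one run, in the order it was written."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            events = [json.loads(line) for line in f if line.strip()]
        return [e for e in events if e["run_id"] == run_id]
```

A failing run found through `replay` can be copied straight into an evaluation set, since each event already holds the tool arguments and the result the agent observed.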
Earning autonomy in stages
We did not flip a switch. The agent ran first in shadow mode, proposing actions without executing them while we compared its proposals to what the team actually did. Then it moved to human-in-the-loop, acting only with approval, which built both the trace history and the team's trust. Only once the evidence showed it was reliable on a class of exceptions did we let it act autonomously on the low-risk ones, escalating the rest. Each stage was gated by data from the one before.
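The stage gating can be expressed as a simple promotion rule over shadow-mode or approval data. The sketch below is hypothetical throughout: the `Mode` names mirror the stages described above, but the agreement metric and the thresholds (`min_rate=0.98`, `min_samples=500`) are placeholder values of ours, not figures from the project.

```python
from enum import Enum
from typing import Sequence


class Mode(Enum):
    SHADOW = 1               # proposes actions, executes nothing
    HUMAN_IN_THE_LOOP = 2    # acts only with approval
    AUTONOMOUS_LOW_RISK = 3  # acts alone on low-risk actions, escalates the rest


def agreement_rate(proposals: Sequence[str], human_actions: Sequence[str]) -> float:
    """Fraction of cases where the agent proposed what the team actually did."""
    if not proposals or len(proposals) != len(human_actions):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == h for p, h in zip(proposals, human_actions))
    return matches / len(proposals)


def next_mode(current: Mode, rate: float, sample_size: int,
              min_rate: float = 0.98, min_samples: int = 500) -> Mode:
    """Promote one stage at a time, and only when the evidence clears the bar.

    The default thresholds are illustrative placeholders.
    """
    if sample_size < min_samples or rate < min_rate:
        return current
    if current is Mode.SHADOW:
        return Mode.HUMAN_IN_THE_LOOP
    if current is Mode.HUMAN_IN_THE_LOOP:
        return Mode.AUTONOMOUS_LOW_RISK
    return current
```

In practice the rule would be evaluated per exception class, so that one class can be promoted while another stays in shadow mode.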
The results
Roughly a third of the team's routine back-office work now runs without human intervention, and exceptions that used to sit in a queue for hours are resolved in under two seconds end to end (p50 action latency from exception receipt to system-of-record update, measured at steady-state load). Critically, not a single high-risk action has executed unsupervised — the guardrails held, and the operations team supervises outcomes rather than doing the busywork. The team was not replaced; their time was redirected from repetitive handling to the genuinely hard cases that need human judgement.
Because the system is observable and evaluated, the client can extend it to new exception types with confidence, following the same shadow-to-autonomy path. What began as a nervous experiment became a dependable part of operations — and a template the client now applies to other process areas.
What made it work
Agents succeed in production through boring engineering, and this was a case study in exactly that: risk-classified guardrails, policy in code, total observability, bounded scope, and a rollout that earned autonomy one evidence-backed notch at a time. The intelligence was the easy part. The discipline around it is what turned an autonomous system from a liability the client feared into an asset they rely on.
“They were honest about what AI could and couldn't do for us, then shipped an agent that quietly automates a third of our back-office work.”


