Fintech & Insurance · Series B Fintech

A grounded RAG assistant that enterprise customers actually trust

We turned a flaky internal chatbot into a cited, evaluated knowledge assistant serving thousands of support and operations queries a day — with the faithfulness and latency an enterprise can stand behind.

−40% Support handling time
95% Answer faithfulness (eval)
100% Answers carrying citations

The client is a fast-growing fintech whose support and operations teams field thousands of questions a day about products, policies and edge cases: the kind of detailed, high-stakes queries where a wrong answer is not just embarrassing but potentially a compliance problem. They had built an internal assistant on top of a large language model in the hope of deflecting some of that load. It demoed beautifully, but in production nobody trusted it.

By the time we were brought in, the assistant had quietly fallen out of use. Agents had learned that it sometimes invented policies that did not exist, cited nothing, and gave different answers to the same question on different days. The leadership team still believed in the opportunity — the volume of repetitive, knowable questions was real — but they needed a system the support floor would actually rely on and the compliance team would actually approve. That is a very different bar from a working demo.

The challenge

The problems were concrete and compounding. The prototype hallucinated because it leaned on the model's parametric knowledge rather than the company's actual documentation. It cited nothing, so an agent had no way to verify an answer in the two seconds they had before responding to a customer. And the knowledge it needed to draw on was scattered and inconsistent.

  • Knowledge lived across hundreds of policy PDFs, a sprawling help centre, and three internal tools, each with different access rules and update cadences.
  • Many questions hinged on exact identifiers — product codes, plan names, regulatory references — exactly the tokens pure semantic search handles worst.
  • Compliance required that every answer be traceable to an authoritative source, without exception, before the tool could face customers even indirectly.
  • Latency mattered: an answer that took eight seconds was useless to an agent on a live call.

Underlying all of it was the absence of any way to measure whether the system was right. Every change to the prototype had been a matter of opinion, which meant improvements in one area silently broke another, and nobody could say with confidence whether the thing was getting better or worse.

Our approach

We treated this as a retrieval and evaluation problem first and a generation problem second. Before touching prompts, we built the pipeline that would let us measure retrieval quality independently, and the evaluation harness that would turn every subsequent decision into a measurement rather than an argument.

Architecture of the fintech RAG assistant: knowledge sources (PDFs, help centre, tools) are ingested, chunked and indexed in a hybrid vector and keyword store. Retrieval with access control feeds re-ranking and grounded generation with citations, surfaced to the support agent. An evaluation harness scores faithfulness and recall on every release and tunes retrieval.
The system we built: hybrid retrieval with access control, grounded generation with citations, and an evaluation harness wired into every release.

Retrieval that finds the right thing

We replaced naive vector search with hybrid retrieval — combining semantic search with keyword search — so that questions about specific product codes and policy references actually surfaced the right document. On top of that we added a re-ranking model that scored candidates against the query, lifting precision so the generation step received a tight, relevant context rather than a noisy one. Chunking was redesigned to respect the structure of the source documents, keeping policies and their conditions together instead of fragmenting them.
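
One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The case study does not say which fusion method was used, so treat the following as a minimal sketch under that assumption. The function name, the document IDs and the constant k=60 are illustrative and not taken from the engagement.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query that mentions an exact plan code.
# The keyword index finds the plan document; the semantic index misses it.
keyword_hits = ["policy-plan-x17", "faq-fees"]
semantic_hits = ["faq-fees", "policy-refunds", "faq-onboarding"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

In this toy example the plan document that only the keyword list found still lands in the top two fused results, which is the behaviour hybrid retrieval is meant to provide for exact identifiers. A re-ranking model would then score the fused candidates against the query before generation.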

Citations and access control, enforced where it counts

Every answer now carries inline citations to the source documents, which transformed the agent experience: a claim can be verified in a glance rather than taken on faith. Crucially, access control is enforced at retrieval time, not bolted on afterwards — the system can only ground an answer in documents the current context is authorised to use, which satisfied the compliance team's hard requirement and closed off a whole class of data-leakage risk.
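
Enforcing access control at retrieval time means the filter runs before ranking and generation, so an unauthorised document can never reach the prompt. The sketch below illustrates that ordering only. The `Chunk` type, the group names and the term-overlap scoring are hypothetical stand-ins, not the client's schema or retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document


def authorised_candidates(chunks, caller_groups):
    """Keep only chunks the caller may read, before any ranking happens."""
    caller_groups = frozenset(caller_groups)
    return [c for c in chunks if c.allowed_groups & caller_groups]


def retrieve(query, chunks, caller_groups, top_k=3):
    """Toy retrieval: filter by access first, then rank by term overlap."""
    terms = set(query.lower().split())
    allowed = authorised_candidates(chunks, caller_groups)
    scored = [(len(terms & set(c.text.lower().split())), c) for c in allowed]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [c for _, c in scored[:top_k]]


# Hypothetical corpus: one help-centre chunk and one restricted chunk.
corpus = [
    Chunk("help-fees", "card fees are listed in the fee schedule",
          frozenset({"support", "ops"})),
    Chunk("internal-risk", "internal card fees override rules",
          frozenset({"risk"})),
]

results = retrieve("card fees", corpus, caller_groups={"support"})
print([c.doc_id for c in results])
```

Because the restricted chunk is removed before scoring, a support agent's answer cannot be grounded in it or cite it, however well it matches the query.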

Measuring faithfulness on every release

We built an evaluation harness around a dataset of real questions pulled from support logs, deliberately seeded with adversarial cases — questions whose answer was not in the corpus, where the right behaviour is to decline rather than invent. Every release is scored automatically for faithfulness, retrieval recall and answer relevance. A change that improves fluency but lowers faithfulness is treated as a regression and blocked, which is what finally made progress monotonic instead of a game of whack-a-mole.
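
The blocking rule described above can be sketched as a comparison between a baseline scorecard and a candidate scorecard. This is an illustration under assumptions: the function name, the tolerance and every score below are made up for the example and are not results from the engagement.

```python
GUARDED_METRICS = ("faithfulness", "retrieval_recall", "answer_relevance")


def gate_release(baseline, candidate, guarded=GUARDED_METRICS, tolerance=0.005):
    """Return (ok, regressions) for a candidate release.

    A release is blocked if any guarded metric drops by more than
    `tolerance` relative to the baseline, regardless of gains elsewhere.
    """
    regressions = {}
    for metric in guarded:
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return (not regressions, regressions)


# Hypothetical scorecards: the candidate reads better but is less faithful.
baseline = {"faithfulness": 0.94, "retrieval_recall": 0.90, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.91, "retrieval_recall": 0.91, "answer_relevance": 0.93}

ok, regressions = gate_release(baseline, candidate)
print("release allowed" if ok else f"release blocked: {regressions}")
```

Checking each guarded metric separately, rather than averaging them, is what stops a gain in relevance from masking a loss in faithfulness.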

Rolling out without betting the floor

We shipped behind a feature flag to a small group of senior agents first, with the system's answers reviewed before wider release. The traces and evaluation scores from that period built both the trust and the evidence to expand. Only once the numbers held across a representative range of real questions did we widen access to the full support floor, and even then with monitoring on faithfulness and latency so any regression would surface immediately rather than in a customer complaint.

The results

The assistant went from abandoned to genuinely relied upon. Support handling time fell by 40% compared to the prior quarter's average, measured over the first 60 days of full-floor deployment, as agents trusted the answers and reused them directly, citations and all. Every response now carries traceable citations, which satisfied the compliance team's audit requirement and removed the blocker that had kept the tool away from customer-facing work. Faithfulness reached 95% on a held-out labelled evaluation set — each generated claim was checked for entailment against the cited source documents by an independent grader, not self-assessed by the system — and the evaluation harness catches regressions before they reach production, so the team can upgrade models and refine prompts without fear.

Latency targets were also met: median (p50) time to first token stays under 200 ms on standard queries, keeping the tool usable for agents on live calls. Just as important as the headline numbers was the change in posture. The team stopped arguing about whether the assistant was good enough and started reading the scorecard. New engineers could improve the system safely, because the gate protected them from shipping a regression. The assistant became something the organisation could build on rather than a liability it tolerated.
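
A latency target like the one above can be checked with a few lines over recorded measurements. The sketch below assumes time-to-first-token samples in milliseconds are already being collected. The sample values and the function name are hypothetical; only the 200 ms budget comes from the text.

```python
import statistics


def p50_within_budget(ttft_ms, budget_ms=200.0):
    """Return (p50, ok) where ok is True if the median is under the budget."""
    p50 = statistics.median(ttft_ms)
    return p50, p50 < budget_ms


# Hypothetical time-to-first-token samples, including one slow outlier.
samples = [120.0, 180.0, 150.0, 450.0, 170.0]

p50, ok = p50_within_budget(samples)
print(f"p50 = {p50} ms, within budget: {ok}")
```

A median is insensitive to the slow outlier in this sample, so a team monitoring live-call usability would likely track a tail percentile alongside it.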

What made it work

Nothing about this engagement was exotic. It was ordinary engineering discipline applied to a probabilistic system: fix retrieval before prompts, measure faithfulness as a number, enforce access control as a security boundary, cite everything, and roll out on evidence. The difference between the prototype that was abandoned and the system that is relied upon was not a better model — it was the discipline around the model. That is the work, and it is the work that makes AI trustworthy enough to put in front of an enterprise's customers.

DerbaTech took our RAG prototype from a flaky demo to a system our enterprise customers actually trust — with citations, evals and the latency we needed.
VP of Engineering · Series B Fintech
