The Production Gap: Why Demos Don't Deploy
Authors: Chris Brown - The Build Paradox, Mike Daly - Insurtech World
Published: 03/03/2026
The demo worked brilliantly. The pilot impressed stakeholders. The vendor’s case studies are compelling. Have you been in that position?
But now someone’s asking when you can roll it out to production.
This is where most AI initiatives in insurance stall. Not because the technology failed, but because nobody planned for what production requires in a regulated, evidence-dependent, litigation-exposed industry. BCG’s 2024 Build for the Future study found that only 7% of insurance companies managed to scale in a meaningful way [1]. A Bain & Company analysis of generative AI in claims reached a similar conclusion: most insurers remain in pilot or limited-use stages, held back by data security concerns, insufficient in-house expertise, and accuracy concerns [2]. A separate Roots survey of insurance executives found that while nearly 90% identify AI as a top strategic initiative, only one in five have AI solutions running in production [3].
The industry saw much the same pattern when it tried to adapt RPA to its problems, so the findings above should hardly surprise us.
The gap between “impressive demo” and “defensible in production” is where most AI programmes in insurance are stuck right now. Not because of the technology itself, but because of the surrounding requirements.
What Production Actually Requires
The FCA has been clear: it won’t introduce AI-specific regulations but expects existing frameworks to apply [4]. That means the burden is on you to demonstrate compliance through accountability under the Senior Managers & Certification Regime (SM&CR), alignment with the Consumer Duty, and operational resilience.
Having led integration programmes, including ISO 42001 implementation and LLM-based classification systems, I can outline the key requirements for any UK insurer deploying agentic AI in claims.
Infrastructure that actually works. Not a demo environment, but production-grade deployment with resilience, proper versioning, and comprehensive logging. Every input, prompt, and output needs to be captured with timestamps and user context. Monitoring that detects degradation before customers do. Fallback procedures for when AI is unavailable, with manual processes that staff can execute. This sounds basic, but most POCs skip it entirely.
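To make the logging requirement concrete, here is a minimal sketch of an append-only interaction record. Every field name and the `log_sink` interface are illustrative assumptions, not a prescribed schema; the point is simply that each call captures input, prompt, output, model version, timestamp, and user context in one immutable record.

```python
import json
import uuid
from datetime import datetime, timezone

def log_ai_interaction(user_id, claim_id, model_version,
                       prompt, model_input, model_output, log_sink):
    """Append one immutable AI interaction record to an append-only sink.

    All field names here are illustrative. In production the sink would be
    write-once storage (e.g. an append-only log store), not a mutable file.
    """
    record = {
        "interaction_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,            # who triggered the interaction
        "claim_id": claim_id,          # business context
        "model_version": model_version,
        "prompt": prompt,              # the exact prompt sent
        "input": model_input,          # the exact input payload
        "output": model_output,        # the exact model response
    }
    log_sink.write(json.dumps(record) + "\n")  # one JSONL line per interaction
    return record["interaction_id"]
```

One JSONL line per interaction keeps the format trivially greppable and replayable years later, which matters more than elegance when the audience is an auditor rather than a developer.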
Model risk management. A model inventory documenting what AI exists, where, what it does, and who owns it. Independent validation, not just the team that built it reviewing their own work. Model tiering by materiality and risk, with governance proportionate to each tier. Documentation of design decisions, assumptions, and known limitations. Data lineage showing the sources of the training data and the assumptions embedded in it.
Governance with teeth. Prompts versioned with full history and rollback capability. Sign-off processes for prompt changes that include business and compliance review, not just code review. Testing frameworks for both internal validation and client acceptance testing. Named accountability under SM&CR, with a specific senior manager owning AI governance rather than a committee. Consumer Duty alignment documented, including how outcomes and fair value are evidenced for AI-assisted journeys [5].
Bias and fairness controls. Claims and pricing decisions can lead to discriminatory outcomes, even when no one intended them. Consumer Duty raises the bar on avoiding foreseeable harm and delivering good outcomes, and the Equality Act 2010 provides a hard legal backstop when protected characteristics are affected, directly or indirectly [6]. So, treat fairness as a production requirement: pre-deployment bias testing; checking performance and routing outcomes across relevant segments (for example, age bands, disability indicators where known, and location); and ongoing monitoring, as data and claimant behaviour shift over time. The FCA has been explicit that it is actively watching how AI intersects with insurance outcomes. In Treasury Committee evidence, David Geale noted the FCA has not found evidence of systemic bias in current insurance pricing practices to date, but also signalled further work on AI’s interaction with pricing models as part of the FCA’s 2026 insurance priorities [7, Q173–Q174]. In January 2026, Sheldon Mills set out the FCA’s long-term review into AI and retail financial services, with recommendations due to the FCA Board in summer 2026 [8]. The direction of travel is clear: you should assume scrutiny will increase, not decrease.
Third-party provider governance. If you’re using foundation models from OpenAI, Anthropic, Google, or Microsoft, you’re building critical infrastructure on services you don’t control. Due diligence on their data handling, model update policies, and security practices. Data residency questions: Where does your UK customer data reside when it reaches your API? Monitoring for vendor model changes affecting your system behaviour. Exit strategy if you need to migrate.
Operational resilience. Business continuity plans for when AI degrades or fails. Human override protocols that are documented, trained, and friction-free, enabling handlers to challenge AI decisions. Model drift detection, as distinct from general performance monitoring.
Each of these requires time, expertise, and budget. None are optional for regulated deployment. All are invisible in the demo.
The FCA has been direct about where responsibility sits. David Geale, Executive Director for Payments and Digital Finance, told the Treasury Committee that FCA supervisors will not be examining individual AI models: “They are not coders, and they will not go into that” [7, Q139]. Instead, the FCA will assess whether firms designed products with clear outcomes in mind, tested those outcomes, and can demonstrate compliance. When pressed on whether the FCA could detect flawed data entering an AI system, Geale was equally candid: “Genuinely, I do not think that we would be able to spot a rogue piece of data ourselves” [7, Q143].
The implication is clear. The regulator will not catch your production issues for you. The burden of building defensible AI infrastructure falls entirely on the firm.
Non-Determinism Meets Regulatory Evidence
Here’s the fundamental challenge. LLMs are non-deterministic by design.
Same prompt. Same model. Same input. Different output.
This isn’t a bug to be fixed. It’s fundamental to how these systems work. The capability that makes LLMs useful for complex reasoning is inseparable from their unpredictability.
For most applications, you can tune this variability to acceptable bounds. Temperature settings constrain randomness. Structured outputs enforce formats. Guardrails bound behaviour.
For insurance claims decisions, “acceptable bounds” is a much smaller target.
The FCA expects you to demonstrate how customer outcomes were determined. When a coverage decision is challenged, you need to explain exactly what inputs were considered, what logic was applied, and why that outcome was appropriate for that specific customer.
“The AI decided, and it might decide differently if we ran it again” isn’t an answer a regulator will accept. It’s not an answer you’d want to defend in court. It’s not an answer that will satisfy a customer who believes they’ve been treated unfairly.
The Moving Target Meets Long-Tail Accountability
Your AI system’s behaviour can change without any deployment on your part.
When your model provider releases an update (and they will, without asking your permission), your carefully tuned prompts may behave differently. Edge cases you’d handled start falling through. Outputs that passed testing start failing in production.
This isn’t a one-time migration problem. It’s the permanent operational reality of building on foundation models you don’t control.
In most industries, you detect the problem, fix it, and move on. In insurance, you have a different challenge: long-tail accountability.
A claim handled today might be disputed in 2027. Litigated in 2029. Reviewed by regulators in 2030. You need to be able to explain and defend that decision years after it was made.
If your AI’s behaviour has shifted three times since then, through model updates you didn’t control, can you reproduce what it did and why? Can you demonstrate that the decision was appropriate based on the information available at the time?
Most AI architectures can’t answer these questions. They weren’t designed with this requirement in mind.
What you need: immutable logs of every input, prompt, model version, parameters, and output. Version-pinned models wherever providers allow it, recognising that providers deprecate versions and version-pinning buys time rather than permanence. Prompt version control with the same rigour as production code, with every change tracked and every version recoverable. A complete audit trail that can demonstrate exactly what your system did at the time of any given decision, even if the model can no longer reproduce that output today. Retention policies that match your regulatory and litigation exposure.
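One way to make those logs tamper-evident, so you can show in 2030 that nothing was altered after the fact, is a simple hash chain in which each entry commits to the one before it. This is a sketch under stated assumptions, not a prescribed design; field names are illustrative, and production systems would layer this on write-once storage.

```python
import hashlib
import json

def append_decision_record(chain, record):
    """Append a decision record to a tamper-evident chain.

    Each entry stores the SHA-256 hash of the previous entry plus its own
    payload, so editing any earlier record breaks verification of every
    entry after it. Illustrative sketch only.
    """
    prev_hash = chain[-1]["entry_hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"record": record,
                  "prev_hash": prev_hash,
                  "entry_hash": entry_hash})
    return chain

def verify_chain(chain):
    """Recompute every hash; return False if any record was altered."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
            return False
        prev = entry["entry_hash"]
    return True
```

The design choice worth noting: verification needs nothing but the chain itself, so an external reviewer can independently confirm integrity without trusting your systems.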
If you can’t show exactly what your AI did on a specific claim two years ago, what went in, what came out, and what model and parameters produced it, you’re not ready for production in claims.
Testing: The Underestimated Problem
With conventional software, you define inputs and expected outputs. If the function returns what you expect, the test passes. Deterministic. Repeatable. Well understood.
LLMs don’t work that way. The same prompt with the same input can produce different outputs. Your test matrix explodes. You end up sampling, which means accepting that some failures will reach production.
Even with well-designed testing frameworks covering both internal validation and client acceptance testing, the conversation with compliance can be difficult:
“What’s your error rate?”
“We measured 97% accuracy on classification against our test dataset.”
“What happens in the 3% failure cases?”
“Depends on the failure. Some are caught by handlers who reclassify them. Some might reach customers before being corrected.”
“Can you guarantee it won’t get worse over time?”
“No guarantee. We can monitor, detect degradation, and respond, but the model behaviour can shift.”
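Sampling-based testing of a non-deterministic classifier can be sketched as a small stability harness: run the same case repeatedly and measure how often the system agrees with itself. The `classify` callable stands in for whatever wraps your model; the function name and the notion of a "modal output" are illustrative, not an established metric.

```python
from collections import Counter

def sample_stability(classify, case, n_runs=20):
    """Run one input through a classifier n times and report stability.

    `classify` is any callable wrapping the model. For a deterministic
    system agreement is 1.0; non-determinism shows up as more than one
    distinct output. Illustrative sketch only.
    """
    outputs = Counter(classify(case) for _ in range(n_runs))
    top_label, top_count = outputs.most_common(1)[0]
    return {
        "modal_output": top_label,            # most frequent answer
        "agreement": top_count / n_runs,      # share of runs that agree
        "distinct_outputs": len(outputs),     # how many answers appeared
    }
```

Run across a regression suite, the agreement figure gives compliance something quantitative to interrogate, which is more defensible than a single-run pass/fail.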
Securing compliance sign-off for probabilistic systems is a conversation most regulated industries aren’t ready for, but one they can see coming. In insurance, that conversation also involves your legal team, your reinsurers, and potentially the FCA.
The Helpful Assistant Problem
AI assistants are designed to be helpful. In claims, that’s the problem.
An AI trying to be helpful might offer coverage where none exists because the customer’s situation seems sympathetic. It might accept evidence that should be challenged because challenging feels unhelpful. It might miss fraud indicators because the claim narrative is coherent and the AI is optimised for helpfulness rather than suspicion. It might overlook vulnerability indicators because they weren’t explicitly stated. It might make commitments that create problems: statements that could be interpreted as estoppel, coverage confirmations that weren’t intended as such.
The guardrails can’t be behavioural. Telling the AI “please check for fraud” or “remember to consider vulnerability” doesn’t reliably produce the behaviour you need.
The guardrails must be architectural. The system design must constrain the AI’s actions, the decisions it can make autonomously, and the situations requiring human judgment, regardless of what the AI deems appropriate.
In practice, architectural guardrails mean the AI never operates in open-ended mode. Structured output schemas constrain responses to predefined decision paths rather than allowing free-text responses. Hard-coded business rules gate its actions: no coverage confirmation above a defined threshold without human review, no payment authorisation regardless of confidence score, no direct customer communication without handler approval. Claim types with fraud indicators, vulnerability signals, or policy ambiguity route automatically to human review, not because the AI was asked to flag them, but because the workflow makes it structurally impossible for those claims to proceed without human judgment. Separate validation steps check the primary agent’s output against policy terms before anything reaches the customer. The AI’s authority is bounded by system design, not by how well you wrote the prompt.
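Those architectural guardrails can be sketched as a routing gate that inspects the model’s structured output after the fact. The decision labels, flag fields, and the autonomy limit below are all illustrative assumptions; the point is that the rules live in deterministic code the model cannot talk its way past.

```python
from dataclasses import dataclass, field

@dataclass
class Draft:
    """Structured output from the AI agent (illustrative schema)."""
    claim_id: str
    decision: str                      # e.g. "approve", "decline", "refer"
    amount: float                      # proposed settlement value
    fraud_flags: list = field(default_factory=list)
    vulnerability_flags: list = field(default_factory=list)

def route(draft, autonomy_limit=1000.0):
    """Hard-coded routing gate applied AFTER the model produces a draft.

    These rules inspect the structured output, not the prompt, so no
    amount of persuasive model text can bypass them. Thresholds are
    illustrative.
    """
    if draft.fraud_flags or draft.vulnerability_flags:
        return "human_review"          # structural routing, not prompt-based
    if draft.decision != "approve":
        return "human_review"          # only routine approvals may automate
    if draft.amount > autonomy_limit:
        return "human_review"          # no high-value autonomy
    return "auto_proceed"
```

Note the asymmetry: the gate can only ever narrow the AI’s authority, never extend it, which is the property a compliance reviewer will want to see.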
This is harder to design than it sounds. The same flexibility that makes LLMs useful makes them hard to constrain reliably.
The Explainability Requirement
Insurance decisions get challenged. When they do, you need to explain how the decision was made.
Not in general terms (“the AI assessed the evidence and determined coverage applied”). In specific terms (“these inputs were considered, this logic was applied, these factors were weighted, and this outcome was reached for this specific claim”).
Most LLM-based systems can’t produce this explanation. They can produce post-hoc rationalisations, plausible-sounding explanations generated after the decision. But generating an explanation for why you did something isn’t the same as documenting what you actually did.
For regulatory purposes, for litigation purposes, for customer complaint purposes, you need the actual chain of reasoning. Not a reconstructed narrative.
When the Treasury Committee pressed Geale and Rusu on how the FCA would assess whether a firm’s AI decision-making had sufficient transparency and explainability, the answer reinforced how much it depends on the firm’s own architecture. The FCA’s approach is outcomes-based: firms are expected to design products with clear outcomes in mind, test whether those outcomes are being delivered, and demonstrate that they can evidence the results [7, Q139, Q163]. The regulator will look at what the product was designed for, what comes out the other end, and what controls are in place. If a firm cannot explain how its AI reached a specific decision, the existing regulatory framework already addresses that, with no new rules required.
In practice, this means requiring each agent in your system to output its process, the data it used, and the assumptions it made. This is not optional. Without that audit trail, you have a black box making coverage decisions, which is indefensible to regulators.
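A minimal version of that requirement is a fail-closed validator on each agent’s output: any response that does not document its own process is rejected before it reaches the next step. The field names below are illustrative assumptions, not a standard schema.

```python
# Illustrative audit fields every agent response must carry.
REQUIRED_FIELDS = {
    "decision",           # what the agent concluded
    "inputs_considered",  # the data it actually used
    "rules_applied",      # the policy terms or rules it invoked
    "assumptions",        # anything it inferred rather than read
    "model_version",      # which model produced this
}

def validate_agent_output(output: dict) -> dict:
    """Reject any agent response that fails to document its own process.

    Fails closed: an incomplete audit record raises rather than slipping
    through to the customer. Field names are illustrative.
    """
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        raise ValueError(f"Agent output missing audit fields: {sorted(missing)}")
    return output
```

Because the check runs at the workflow boundary rather than inside the prompt, an agent that “forgets” to explain itself simply cannot progress a claim.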
Why Projects Stall
David Geale put it plainly to the Treasury Committee: “‘I did not understand it’ is not a defence, because you should understand what you are deploying, or you should understand what you are seeking to achieve” [7, Q160].
Yet the joint Bank of England and FCA survey found that only 34% of management feel they have a full grasp of the complexities of the AI models their firms use [7, Q131]. Most current AI workloads sit in low-materiality areas: Know Your Customer (KYC) checks, anti-money laundering, and document processing. When firms begin deploying AI in higher-stakes decisions such as claims, the governance gap will widen.
Here’s the uncomfortable truth. Most of the infrastructure work required for production isn’t exciting.
Building deployment pipelines, implementing logging, setting up prompt versioning, running penetration tests, and documenting sign-off processes. None of this makes for a compelling demo. None of it impresses stakeholders as much as a working AI classification.
So, it gets deprioritised. The resource is allocated to the next feature. The production infrastructure work sits in the backlog, while the demo is being shown to additional executives.
Then someone asks, “When can we go live?” The honest answer is, “We haven’t built the infrastructure to operate this safely in production.”
The AI works. The governance framework gets designed. But the production deployment work, the unglamorous infrastructure that makes it defensible, keeps slipping because there is always something more urgent.
That’s not a technology failure. It’s an organisational failure to understand what production-ready means in a regulated industry.
What Production-Ready Actually Means
Before deploying any agentic AI in claims, you need clear answers to these questions:
Explainability: Can you produce a clear, human-readable explanation of why the AI took a specific action on a specific claim? Not in general terms. For this claim, with this customer, at this time. In a format a court would accept.
Reproducibility: If you replay the same inputs through your system in six months, will you get consistent outputs? If the model has been updated, can you reproduce what the system would have done at the time in question?
Auditability: Are all AI actions logged with sufficient detail for FCA review? Can you demonstrate the complete chain from customer input to system output to claim outcome? Can you produce that audit trail five years from now?
Accountability: When the AI makes a mistake, who is responsible? Geale was unequivocal: somebody has to be accountable for the activities, including the deployment of the technology [7, Q141]. How does your governance framework assign accountability for AI-influenced decisions? Have your lawyers signed off on that allocation?
Override mechanisms: Can handlers override AI decisions when their judgment differs? Is this easy to do, or does the system create friction that discourages disagreement? What happens to the audit trail when a decision is overridden?
Drift detection: How will you know when AI behaviour has changed? What monitoring detects degradation before customers do? Before regulators do?
If you can’t answer these questions clearly, you’re not ready for production, regardless of how impressive the demo was.
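A drift check of the kind those questions demand can be sketched as a comparison between a sign-off baseline and a rolling window of live outcomes, for example the rate at which handlers agree with the AI’s classification. Both the metric and the tolerance below are illustrative assumptions; real monitoring would track several signals, not one.

```python
def drift_alert(baseline_rate, window_outcomes, tolerance=0.05):
    """Flag drift when live performance falls materially below baseline.

    `baseline_rate` is the agreement rate measured at sign-off;
    `window_outcomes` is a recent window of 1s (handler agreed with the
    AI) and 0s (handler overrode it). Tolerance is illustrative.
    """
    if not window_outcomes:
        return False                   # no data is handled upstream
    window_rate = sum(window_outcomes) / len(window_outcomes)
    return (baseline_rate - window_rate) > tolerance
```

The useful property is that the alert fires on handler behaviour, a signal you control and log, rather than on the vendor telling you the model changed, which they may not.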
The Vendor Question
When a vendor shows you an impressive demo, ask them:
“How do we explain a specific decision to a regulator?”
“What happens when you update the model, and our prompts behave differently?”
“Can we reproduce the exact system state from a decision made eighteen months ago?”
“What’s our liability exposure if the AI makes a coverage determination that’s later challenged?”
Watch how they answer. If they pivot to discussing accuracy rates and customer satisfaction scores, they haven’t built for regulated-industry deployment.
The vendors who understand insurance will have answers. The vendors who built a general-purpose solution and added “insurance” to their marketing won’t.
Next in series: The Measurement Problem – the double standard between human and AI accuracy, and what mature measurement looks like.
References
4. Financial Conduct Authority, “AI and the FCA: our approach”, September 2025.