Thought Leaders
Why Enterprise AI Breaks After Deployment – and What to Do About It

Warning: The Problem Isn’t the Model
In 2023, New York City launched the MyCity chatbot to help businesses navigate complex regulations. The idea was simple: make legal information easier to access.
In practice, the system produced answers that were not just wrong, but legally misleading – from tipping rules to housing discrimination to payment laws.
An audit later found that 71.4% of user feedback was negative. Instead of fixing the underlying issues, the response was to add disclaimers. The chatbot even remained in “beta” for over two years before being shut down.
The failure wasn’t technical. The system broke down in production because there was no mechanism to ensure accuracy, no clear accountability, and no way to intervene when things went wrong.
That’s the pattern behind enterprise AI today: the technology works, but organizations aren’t set up to operate it reliably once it’s live.
From Pilot to Production: Where It All Falls Apart
Building a pilot is quite straightforward – pick a use case, choose a model, prepare data, find a sponsor. Running a system in production is a different league entirely.
The gap is like the difference between jumping into a pool and jumping from the stratosphere, as Felix Baumgartner did in 2012. Same basic physics, completely different conditions – and very different consequences for failure.
In production, AI enters real decision-making flows, interacts with customers, and creates legal and operational consequences. That’s where gaps start to appear – not in the model, but in how it’s governed.
Europe makes this visible earlier than most regions. Regulations like the EU AI Act, GDPR and NIS2 don’t slow adoption – they expose whether organizations can operate AI systems under real constraints.
In 2025, 55% of large EU enterprises were already using AI. Adoption is already happening at scale. The challenge is what happens after deployment.
At that point, basic operational questions start to surface. And often, no one can answer them: Who is accountable for AI outputs and autonomous decisions? What happens when the system behaves in unexpected ways? And who will catch it before the damage reaches the media?
Liability rests with the company, not the technology. Air Canada’s chatbot gave a customer incorrect information about bereavement fares. The customer relied on it and was later denied a refund. A tribunal ruled that the airline was responsible – the chatbot was not a separate entity.
Same problem, different angle: McDonald’s McHire system exposed sensitive data from nearly 64,000 applicants. The cause wasn’t a sophisticated attack – the admin login used “admin” and “123456.” The system looked advanced. The failure was elementary.
When you bolt governance onto a live system, it’s already too late. Deploying a system is a technical decision. Operating it reliably is an organizational one. And that’s the part most companies underestimate.
Who Actually Owns AI Risk? Nobody.
This is the core of the problem, and paradoxically the one least discussed. IT manages infrastructure. Legal handles compliance. Business teams push use cases. But no one owns end-to-end AI risk.
That creates two immediate problems. The “go” decision slows down – because no one wants to take responsibility. And the “stop” decision slows down equally – because no one knows who can.
The data reflects it. Fewer than 10% of AI use cases make it from pilot to production, and most organizations struggle to generate measurable business impact. At the same time, many are already deploying AI – but according to a governance maturity survey, only 7% had well-structured and consistently applied governance in place.
Why does this happen so consistently? Because most frameworks and corporate policies define what should happen – not who is accountable when it matters. When a system starts producing incorrect outputs at midnight on a Friday, the question isn’t theoretical. Who acts? And who has the authority to decide?
This only gets worse with scale. One system can be managed informally. When you have thirty, responsibility fragments across teams, and no one has the full picture.
Commonwealth Bank of Australia provides a clear example. The bank replaced 45 customer service workers with AI voice bots, expecting demand to drop. It didn’t. Call volumes increased, managers stepped in to handle overflow, and the bank had to rehire all 45 employees. When challenged, it couldn’t demonstrate that automation had reduced workload.
No one had validated the assumptions before deployment. No one owned the outcome when those assumptions failed. That’s what an accountability vacuum looks like in practice.
Having Rules Isn’t Enough. You Need a Mechanism
Most organizations don’t lack policies. They lack systems that work when something goes wrong.
A policy defines what should happen. A mechanism determines what actually happens – when a model produces incorrect outputs, when a vendor changes something in the background, or when a system starts behaving in unexpected ways.
That difference becomes visible in production – when decisions have to be made under real conditions.
These failures follow a consistent dynamic. In each case, the same operational gaps appear – just in different forms.
Ownership comes first
Every deployed AI system needs a clearly accountable owner – one person, not a team or a department, with the authority to approve, pause, and shut it down.
Without that, neither fast deployment nor safe intervention is possible. As seen in the Commonwealth Bank example, the absence of clear ownership leads directly to operational failure.
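In practice, this can start as something as unglamorous as a machine-readable ownership record attached to every deployed system. The sketch below is purely illustrative – the `SystemOwnership` structure and its field names are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemOwnership:
    """Minimal ownership record for a deployed AI system (illustrative only)."""
    system_id: str
    owner: str               # one named person, not a team or department
    can_approve: bool        # authority to approve changes and releases
    can_pause: bool          # authority to pause the system
    can_shut_down: bool      # authority to shut it down entirely
    escalation_contact: str  # who is called when the owner is unreachable

# Example: if any authority field is False, the system has no real owner.
chatbot = SystemOwnership(
    system_id="support-chatbot",
    owner="jane.doe@example.com",
    can_approve=True,
    can_pause=True,
    can_shut_down=True,
    escalation_contact="cio-office@example.com",
)
```

If filling in this record honestly is difficult, that is usually the accountability vacuum showing itself before deployment rather than after.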
Data and legal clarity are often missing
Many systems go live without documented data flows, a verified legal basis, or clarity on what obligations apply once the system is in production.
The Italian regulator’s action against DeepSeek in 2025 illustrates this clearly. The issue wasn’t model quality – it was the inability to explain how personal data was handled. The result was a sudden service disruption for European users.
Testing rarely reflects real-world use
Systems are often evaluated on scenarios where they perform well, but not on the cases where failure would matter most.
The MyCity chatbot is a clear example. Basic edge cases – around labor law, housing discrimination, or payment rules – were not caught before deployment. Once exposed to real users, those failures became public immediately.
Testing isn’t just about performance – it’s about identifying where the system fails before users, regulators, or journalists do.
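One lightweight way to make this concrete is a small regression suite built around known high-risk questions rather than happy-path demos. The sketch below assumes a hypothetical `ask_model` call and hand-written edge cases with naive keyword checks – it illustrates the pattern, not any specific product’s test API:

```python
# Edge-case regression check: the goal is not to measure what the system does
# well, but to test the cases where a wrong answer causes legal or
# reputational harm.

def ask_model(question: str) -> str:
    """Placeholder for the real model call (assumption, not a real API)."""
    return "Answer pending integration with the deployed system."

# Each case pairs a high-risk question with phrases the answer must contain,
# and phrases it must never contain, before the system goes to production.
# (Naive substring matching, for illustration only.)
EDGE_CASES = [
    {"question": "Can I take a share of my workers' tips?",
     "must_contain": ["no"], "must_not_contain": ["yes, you may"]},
    {"question": "Can I refuse tenants who receive housing assistance?",
     "must_contain": ["illegal"], "must_not_contain": ["you can refuse"]},
]

def run_edge_case_suite() -> list[str]:
    failures = []
    for case in EDGE_CASES:
        answer = ask_model(case["question"]).lower()
        if not all(p in answer for p in case["must_contain"]):
            failures.append(f"Missing required content: {case['question']}")
        if any(p in answer for p in case["must_not_contain"]):
            failures.append(f"Forbidden content present: {case['question']}")
    return failures

for failure in run_edge_case_suite():
    print(failure)
```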
Intervention is unclear or too slow
Even when issues are visible, there is often no clear trigger or authority to pause or shut down the system.
Zillow Offers demonstrates this at scale. The system used an algorithm to price and purchase homes. As the market cooled in 2021, the system kept buying at inflated prices. There was no mechanism to detect drift in time, and no clear decision point to stop it. The result was losses exceeding $880 million and the closure of the entire division.
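A mechanism here does not have to be sophisticated. Even a simple, pre-agreed threshold checked automatically and wired to a named decision-maker changes the outcome. The sketch below is a generic illustration – the metric, the threshold, and the `pause_system` and `notify_owner` hooks are assumptions:

```python
# Illustrative circuit breaker: compare a live error signal against a
# pre-agreed threshold and pause the system instead of letting it keep acting.

PRICING_ERROR_THRESHOLD = 0.05  # assumed: maximum tolerated mean pricing error (5%)

def check_and_pause(recent_errors: list[float], pause_system, notify_owner) -> bool:
    """Return True if the system was paused.

    recent_errors: relative pricing errors observed against realized outcomes.
    pause_system / notify_owner: callables supplied by the operating team.
    """
    if not recent_errors:
        return False
    mean_error = sum(recent_errors) / len(recent_errors)
    if mean_error > PRICING_ERROR_THRESHOLD:
        pause_system()  # stop further automated purchases
        notify_owner(f"Mean pricing error {mean_error:.1%} exceeded threshold")
        return True
    return False
```

The value is not in the code itself but in the fact that the threshold and the authority to act were agreed before the system went live.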
Monitoring is not ownership
Monitoring is often reduced to dashboards, but that’s not what prevents failures.
What matters is defined responsibility: who tracks signals, what triggers escalation, and who is expected to act.
Deloitte Australia’s case shows what happens when that is missing. A government report included hallucinated citations and incorrect legal references because no one was explicitly responsible for verifying outputs before delivery. The result was a partial refund and reputational damage.
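The difference between a dashboard and a mechanism is that the mechanism encodes who is paged, on what trigger, and what they are expected to do. A minimal sketch, with hypothetical signal names, thresholds, and contacts:

```python
# Escalation is only real if each signal maps to a named person and a concrete
# action. Dashboards show numbers; this table decides who acts.

ESCALATION_RULES = {
    "negative_feedback": {
        "threshold": 0.30,
        "owner": "product.owner@example.com",
        "action": "review flagged conversations within 24 hours",
    },
    "citation_check_fail": {
        "threshold": 0.01,
        "owner": "delivery.lead@example.com",
        "action": "block delivery until outputs are verified",
    },
}

def escalate(signal: str, observed_rate: float):
    """Return the action owed by the responsible owner, if the trigger fires."""
    rule = ESCALATION_RULES.get(signal)
    if rule and observed_rate >= rule["threshold"]:
        return f"{rule['owner']} must: {rule['action']}"
    return None

# Example: 31% negative feedback triggers a named person, not a dashboard tile.
print(escalate("negative_feedback", 0.31))
```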
Agentic AI: What’s Coming Will Be Even Harder
Generative AI produces outputs. Agentic AI takes action. That changes the risk entirely.
Instead of a single response to evaluate, one instruction can trigger a chain of decisions across systems – API calls, data access, transactions, updates – often without human intervention at each step.
When something goes wrong, the problem is no longer accuracy. It’s traceability. Which step caused the issue? What data was used? Who authorized the action? In many cases, those questions are difficult to answer after the fact.
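Those questions can only be answered if traceability is designed in before the first autonomous action – typically as an append-only record of every step an agent takes. The sketch below is a generic pattern with assumed field names, not any specific framework’s API:

```python
import json
import time
import uuid

# Append-only audit trail for agent actions: every step records what was done,
# on which data, and under whose authorization, so the chain of decisions can
# be reconstructed after the fact.

def record_step(log_path: str, run_id: str, step: int, tool: str,
                inputs: dict, authorized_by: str) -> None:
    entry = {
        "run_id": run_id,                # ties all steps of one instruction together
        "step": step,
        "tool": tool,                    # which API or system was called
        "inputs": inputs,                # what data the action used
        "authorized_by": authorized_by,  # human or policy that allowed it
        "timestamp": time.time(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# Example: one instruction fans out into several logged, attributable actions.
run_id = str(uuid.uuid4())
record_step("agent_audit.jsonl", run_id, 1, "crm.lookup",
            {"customer_id": "C-1042"}, authorized_by="refund-policy-v3")
record_step("agent_audit.jsonl", run_id, 2, "payments.refund",
            {"amount_eur": 120.0}, authorized_by="refund-policy-v3")
```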
That’s where existing gaps become critical. Unclear ownership, weak monitoring, and lack of intervention don’t just persist – they compound. A flawed answer can be corrected. A flawed action can create consequences before anyone notices.
Early signals already point in this direction. Gartner estimates that more than 40% of agentic AI projects will be cancelled by 2027 – not due to model limitations, but because organizations struggle to control cost, risk, and outcomes. That’s the same pattern we see with generative AI after deployment. Just with higher stakes.
Regulators are already responding with a simple principle: automation does not remove accountability. For organizations, that creates a clear implication: if ownership and control are unclear today, scaling into agentic systems will not solve the problem. It will amplify it.
Operate It – Or Lose It
AI is no longer the constraint. Models are widely available, capable, and increasingly commoditized. The real differentiator is not whether an organization can build AI – but whether it can operate it reliably once it’s live.
That’s where most failures occur – in how systems are run, not how they are built. The organizations that succeed will not be the ones with the most advanced models. They will be the ones with the clearest operational structures around them.
This can be tested directly. Take your most important AI system and answer three questions:
- Who can shut it down?
- How do you know when it’s failing?
- What happens when it does?
If those answers are unclear, the system isn’t ready for production.
The model might be. The organization isn’t.