Why AI Agents Pass QA and Still Fail in Production

Continual learning is becoming the engineering discipline for making agents improve after deployment, without breaking what already worked.
An AI agent can pass every pre-launch eval and still fail in production a week later. That is not a contradiction. The eval set reflects what the team knew to test before launch. Production is where the missing cases show up: strange phrasing, missing context, tool edge cases, impatient users, conflicting policies, and workflows no benchmark designer imagined.
Users correct the agent, and it still disappoints them. Then the session ends, the log is stored, and the next user meets essentially the same system.
This is why continual learning is becoming central to agent engineering. It is not a feature of one product. It is a category of methods for making agents improve from experience while preserving what already works. Classic continual learning research framed the problem as learning over time without catastrophic forgetting. Agents make that problem wider. The thing that changes may be a model, but it may also be a prompt, a tool, a skill, a workflow, or memory.
That distinction matters, because most agent failures are not solved by reaching first for model training.
The fine-tuning reflex is too narrow
When teams talk about making an AI system better, the default plan often sounds like this: collect failures, label better answers, fine-tune the model. That instinct is understandable. Supervised fine-tuning, Direct Preference Optimization, Group Relative Policy Optimization, and parameter-efficient methods such as LoRA are useful tools when the model itself needs to change.
But many production failures are not model-weight failures. They are system failures.
The agent may rely on stale memory, skip a required confirmation, call a tool with the wrong argument, or route a case through the wrong workflow. Often, the problem is not the base model’s capability. It is the context, memory, tool interface, or workflow wrapped around it.
A modern agent has several layers. The model reasons and generates. The harness around it defines the prompts, tools, skills, code, routing, and workflow. Memory carries facts and learned procedures across sessions. Continual learning is the discipline of deciding which layer should change, how small the change can be, and how to verify that the change actually helped.
Sometimes the right fix is a memory write. Sometimes it is a prompt edit. Sometimes it is a tool wrapper, a routing rule, or a workflow patch. Fine-tuning should remain available, but it should not be the first answer to every failure.
Benchmarks are useful, but production rarely gives you one
There is exciting work on optimizing the agent harness itself. Methods such as GEPA, Meta-Harness, and related prompt or workflow optimization approaches treat the agent as a system that can be mutated and tested. They can propose edits to prompts or other harness components, run candidates, and keep the versions that score better.
That is the right direction. It moves improvement out of the narrow frame of “update the weights” and into the broader frame of “improve the agent.”
But there is a catch: these methods usually assume a benchmark. They need a task that can be run repeatedly and an evaluator that says whether candidate A is better than candidate B. Without that, optimization becomes guesswork with better tooling.
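Stripped of specifics, the loop these methods share can be sketched in a few lines. This is a minimal illustration, not the GEPA or Meta-Harness algorithm. The names `select_best` and `toy_score` and the example prompts are hypothetical, invented here for the sketch. What it shows is that the evaluator is the load-bearing piece: remove `score` and there is nothing left to compare.

```python
from typing import Callable, Iterable


def select_best(baseline: str, candidates: Iterable[str],
                score: Callable[[str], float]) -> str:
    """Keep whichever harness variant scores highest; ties favor the earlier variant."""
    best, best_score = baseline, score(baseline)
    for candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


def toy_score(prompt: str) -> float:
    # Hypothetical evaluator. A real one would run the agent on benchmark tasks
    # and grade the results; this stand-in only checks for one instruction.
    return 1.0 if "confirm" in prompt.lower() else 0.0


baseline_prompt = "Book the flight the user asks for."
candidate_prompts = [
    "Book the flight the user asks for, quickly.",
    "Confirm the travel date, then book the flight.",
]
best_prompt = select_best(baseline_prompt, candidate_prompts, toy_score)
print(best_prompt)
```

The selection logic is trivial. All of the difficulty lives in producing a `score` function that can be trusted, which is exactly what production teams tend to lack.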
A benchmark is not what most teams have in production.
What they have are logs. They have traces, user corrections, support tickets, thumbs-down events, escalation notes, and occasional expert feedback. Those signals are valuable, but they are not yet a benchmark. They tell you something happened. They do not automatically tell you how to replay it, what success should look like, or how to score a proposed fix.
That gap is where many continual-learning efforts stall. The team has experience, but not yet a learning environment.
Logs are not lessons
A production log records one path through an interaction. A user asked for a flight. The agent searched. The user said the date was wrong. That is evidence of a failure, but it is not enough to learn from.
The log does not define the counterfactual. Should the agent have asked for confirmation? Should it have inferred the date from earlier context? Should it have called a different tool? Should it have refused to proceed until the ambiguity was resolved? A human may know the answer after reading the trace, but the system does not get that structure for free.
For continual learning to work, a raw failure has to be turned into something replayable. That means a task the agent can face again, a user or simulator that recreates the relevant pattern, tools the agent can call, and evaluators that define success. The evaluator might check the final answer, the tool calls, a policy boundary, latency, cost, or all of the above.
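One way to picture such a replayable environment is as a small data structure. The sketch below is an assumption about shape, not a reference design: `Trace`, `ReplayCase`, the tool names `confirm_date` and `search_flights`, and both toy agents are hypothetical, and a scripted list of user turns stands in for a real user simulator.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """What one run of the agent produced."""
    final_answer: str
    tool_calls: List[str] = field(default_factory=list)


Agent = Callable[[List[str]], Trace]
Evaluator = Callable[[Trace], bool]


@dataclass
class ReplayCase:
    name: str
    user_turns: List[str]              # scripted stand-in for a user simulator
    evaluators: Dict[str, Evaluator]   # each one defines a piece of "success"

    def grade(self, agent: Agent) -> Dict[str, bool]:
        trace = agent(self.user_turns)
        return {label: check(trace) for label, check in self.evaluators.items()}


def confirms_before_search(trace: Trace) -> bool:
    calls = trace.tool_calls
    return ("confirm_date" in calls and "search_flights" in calls
            and calls.index("confirm_date") < calls.index("search_flights"))


def returns_answer(trace: Trace) -> bool:
    return bool(trace.final_answer.strip())


flight_case = ReplayCase(
    name="ambiguous_flight_date",
    user_turns=["I need a flight to Lisbon next Friday."],
    evaluators={"confirms_before_search": confirms_before_search,
                "returns_answer": returns_answer},
)


def old_agent(turns: List[str]) -> Trace:
    # Reproduces the logged failure: searches without confirming the date.
    return Trace("Here are some flights.", ["search_flights"])


def patched_agent(turns: List[str]) -> Trace:
    return Trace("Here are flights for the confirmed date.",
                 ["confirm_date", "search_flights"])


old_result = flight_case.grade(old_agent)
patched_result = flight_case.grade(patched_agent)
print(old_result, patched_result)
```

The old agent still returns an answer, so a check on the final answer alone would have passed it. The tool-call evaluator is what captures the actual failure.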
This is the less visible part of the work, but it is the part that makes improvement real. Once a failure becomes a replayable environment, you can ask a concrete question: did the proposed change actually fix the behavior?
Without that step, teams are mostly patching from memory.
David Silver and Richard Sutton have described a coming era of experience, where agents learn primarily from interaction with the world rather than from static human data. For enterprise agents, that vision depends on turning messy production experience into environments that can be replayed, scored, and reused.
Experience alone is not enough. It has to be made testable.
Regression is the hidden cost
Even when a failure becomes testable, the hardest part remains: fixing it without breaking something else.
Anyone who has maintained a complex agent has seen this pattern. You add an instruction so the agent escalates aggressive refund requests. Now it escalates routine refunds that should be handled quickly. You reduce tool calls in one workflow. Now another workflow skips a required check. You correct a stale memory. Now the agent overgeneralizes the correction to a different product line.
Each patch makes sense locally. The system still drifts globally.
This is the agent version of catastrophic forgetting. In neural networks, the phrase usually refers to new training overwriting older capabilities. In agents, the failure is broader and often harder to see. Forgetting can happen in prompts, tools, memory, routing, and workflow. It shows up not as a clean metric on a training curve, but as a user saying: “This used to work.”
That is why regression control cannot be a final review step. It has to be inside the learning loop itself.
The goal is not simply to maximize performance on the newest failure. The goal is to improve the new case while preserving the old ones. Every fix that works should become part of the agent’s growing memory of what must keep working. In practice, that means old failures become regression tests. The agent’s history becomes a constraint, not just an archive.
This is where continual learning becomes more like serious software engineering than prompt tinkering. A change is not good because it sounds better. It is good because it improves a measured behavior and does not regress the behaviors the system had already earned.
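A regression gate inside the loop can be sketched as follows. This is an illustrative assumption about how such a gate might look, not a description of any particular product. The function `gate`, the toy agents, and the string-matching cases are hypothetical, and real cases would be graded replay environments rather than one-line checks.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]
Case = Callable[[Agent], bool]   # True when the agent passes the case


def gate(current: Agent, candidate: Agent, new_case: Case,
         regression_suite: Dict[str, Case]) -> Tuple[bool, List[str]]:
    """Accept a candidate only if it fixes the new case and breaks nothing
    that the current agent already passes."""
    regressions = [name for name, case in regression_suite.items()
                   if case(current) and not case(candidate)]
    accepted = new_case(candidate) and not regressions
    return accepted, regressions


# Toy agents for the refund example (hypothetical behavior).
def current_agent(message: str) -> str:
    return "refund issued"


def overeager_agent(message: str) -> str:
    return "escalated"           # escalates everything, including routine refunds


def targeted_agent(message: str) -> str:
    return "escalated" if "lawyer" in message.lower() else "refund issued"


def routine_refund(agent: Agent) -> bool:
    return agent("Please refund my order.") == "refund issued"


def aggressive_refund(agent: Agent) -> bool:
    return agent("Refund me now or I call my lawyer.") == "escalated"


suite: Dict[str, Case] = {"routine_refund": routine_refund}

overeager_verdict = gate(current_agent, overeager_agent, aggressive_refund, suite)
targeted_verdict = gate(current_agent, targeted_agent, aggressive_refund, suite)

if targeted_verdict[0]:
    # The fixed failure joins the suite and constrains every later change.
    suite["aggressive_refund"] = aggressive_refund

print(overeager_verdict, targeted_verdict, sorted(suite))
```

Both candidates fix the new failure. Only one of them survives the gate, and the case it fixed then becomes part of what the next change has to preserve.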
What practical continual learning requires
A production-ready continual learning loop needs four properties.
First, failures must be replayable. A one-off failure is an anecdote. A replayable, graded environment is a test. Until the agent can face the same pattern again, nobody can prove the fix worked.
Second, diagnosis has to be holistic. The fix may belong in the model, but it may also belong in memory, the prompt, the tool layer, or the workflow. The best fix is usually the smallest durable change that explains the failure.
Third, learning has to be lifelong. The agent should not improve this week by quietly undoing last week’s hard-won behavior. Prior successes should become constraints during optimization, not surprises after deployment.
Fourth, the loop has to be efficient. If every improvement requires a quarterly retraining project, the system will never keep up with production. The loop has to try cheap fixes first, escalate only when needed, and keep verification close to the change.
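The second and fourth properties together suggest an escalation ladder: try the cheapest layer first and stop at the first fix that verifies. The sketch below assumes each candidate fix comes with a rough relative cost and a callable that applies it and runs the replay and regression checks. `CandidateFix`, the cost numbers, and the layer names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CandidateFix:
    layer: str                             # e.g. "memory_write", "prompt_edit"
    cost: int                              # hypothetical relative cost; lower is cheaper
    apply_and_verify: Callable[[], bool]   # True if replay and regression checks pass


def cheapest_verified_fix(fixes: List[CandidateFix]) -> Optional[str]:
    """Try fixes in order of cost and return the first layer that verifies."""
    for fix in sorted(fixes, key=lambda f: f.cost):
        if fix.apply_and_verify():
            return fix.layer
    return None


attempted: List[str] = []


def attempt(layer: str, passes: bool) -> Callable[[], bool]:
    # Stand-in for applying a fix and running its checks; records the attempt.
    def run() -> bool:
        attempted.append(layer)
        return passes
    return run


fixes = [
    CandidateFix("model_finetune", 100, attempt("model_finetune", True)),
    CandidateFix("memory_write", 1, attempt("memory_write", False)),
    CandidateFix("prompt_edit", 2, attempt("prompt_edit", True)),
]
chosen = cheapest_verified_fix(fixes)
print(chosen, attempted)
```

In this toy run the memory write fails verification, the prompt edit passes, and the fine-tuning job is never started, even though it would also have worked.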
None of this means agents should update themselves blindly. It means the opposite. Improvement should become measurable. Every change should have a test, a before-and-after score, and a regression check.
That is what turns continual learning from a vague aspiration into an engineering discipline.
The future of agents will not be defined only by larger context windows, stronger base models, or more tools. Those will matter. But the more important question for enterprises is what happens after deployment.
When the agent fails tomorrow, can the system turn that failure into a test? Can it route the fix to the right layer? Can it prove the fix helped? Can it prove nothing else broke?
If the answer is no, the agent is not really learning from production. It is accumulating risk.
The agents that matter next will do something better. They will compound.












