Researchers from UIUC, Stanford, Harvard, UC Berkeley, and several other institutions have published a comprehensive survey examining why agentic AI systems consistently fail when deployed beyond controlled demonstrations. The paper, titled "Adaptation of Agentic AI," introduces a four-part framework for understanding how these systems should learn to adjust their behavior.
The timing matters. Gartner recently predicted that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs and unclear business value. Deloitte's 2025 survey found that while 38% of organizations are piloting agentic solutions, only 11% have systems running in production.
The adaptation gap
The core problem, according to the 34-author research team, isn't raw intelligence. It's that most agents are designed to execute plans rather than revise them when circumstances change.
The paper draws a distinction that sounds obvious but has major implications for system design: execution means following a predetermined sequence of actions, while adaptation means recognizing when that sequence no longer makes sense. Most current agents do only the first.
"Current agentic AI systems still struggle with unreliable tool use, limited long-horizon planning, domain-specific reasoning gaps, robustness issues in real-world environments, and poor generalization to unexplored environments where the agent lacks prior interaction experience," the authors write. That's a lot of gaps for systems being marketed as autonomous.
Four ways to fix it (sort of)
The framework breaks adaptation into four paradigms, organized by what gets updated and where the learning signal comes from.
The first two involve changing the agent itself. In what the researchers call A1 adaptation, the agent learns from direct feedback when it uses external tools. If a code run fails or a database query returns an error, that signal drives improvement. Methods like DeepRetrieval, which achieved roughly threefold improvements in recall on literature search tasks, exemplify this approach.
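The A1 loop can be sketched in a few lines. Everything below is an illustrative stand-in, not the paper's method: the stub tool, the reward values, and the experience buffer are hypothetical, and a real system would feed that buffer into a policy-gradient or similar update.

```python
# Illustrative A1 sketch: the agent is updated from direct
# tool-execution feedback, not from grading its final answer.

def run_sql(query: str) -> tuple[bool, str]:
    """Stub tool: 'executes' a query and reports success or an error."""
    if query.strip().lower().startswith("select"):
        return True, "3 rows"
    return False, "syntax error near 'SELCT'"

def tool_reward(ok: bool) -> float:
    # The A1 signal comes from the tool call itself.
    return 1.0 if ok else -1.0

experience = []  # (action, reward) pairs a learner would train on
for query in ["SELECT * FROM papers", "SELCT * FROM papers"]:
    ok, observation = run_sql(query)
    experience.append((query, tool_reward(ok)))

print(experience)
```

The point of the sketch is where the reward comes from: the failed second query is penalized by the tool directly, with no human grading and no evaluation of a final answer.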
A2 adaptation is broader. The agent learns from evaluations of its final outputs, regardless of tool involvement. DeepSeek-R1 and Kimi-1.5 fall into this category, using reinforcement learning to refine reasoning based on whether answers are correct.
The other two paradigms leave the agent frozen and modify its tools instead. T1 involves training tools independently: think standard dense retrievers or vision models that any agent can plug into. T2 is more interesting: tools that adapt specifically to serve a particular frozen agent, learning from signals the agent produces.
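The T2 idea reduces to tuning a tool against a frozen agent's downstream signal. Here is a minimal sketch; the document list, the scoring stub, and the single "how many documents" knob are all invented for illustration, and the brute-force search stands in for the reinforcement learning that real systems use:

```python
# Illustrative T2 sketch: the agent stays frozen; only the retrieval
# tool is tuned, using the agent's downstream success as the signal.

DOCS = ["short note", "a detailed passage about adaptation", "off-topic chatter"]

def frozen_agent_score(context: list[str]) -> float:
    """Frozen agent stub: helped by relevant docs, hurt by noise."""
    return sum(1.0 if "adaptation" in d else -0.5 for d in context)

def retrieve(k: int) -> list[str]:
    """Tool with one tunable knob: how many documents to pass along."""
    return DOCS[:k]

# Tune the tool's knob against the frozen agent's signal.
best_k = max(range(1, len(DOCS) + 1),
             key=lambda k: frozen_agent_score(retrieve(k)))
print(best_k)
```

The frozen agent's weights never change; the tool simply learns which of its own behaviors the agent benefits from, which is the same shape of loop the survey describes for adapted retrievers.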
Why this matters for production
The research suggests that the most robust systems won't rely on a single adaptation strategy. The authors argue for combining infrequent updates to the base model with continuous refinement of surrounding tools and memory systems.
This has practical implications. Training a large language model is expensive and carries risks of catastrophic forgetting. Adapting retrievers, planners, and memory modules around a frozen core model is cheaper and more modular. You can swap components without destabilizing the whole system.
The paper points to recent work like S3, which trains a search subagent to maximize performance for a frozen generator, as evidence this hybrid approach works. The subagent learns what kinds of queries and retrieved documents actually help the main model succeed.
What's still missing
The framework is descriptive, not prescriptive. It maps existing research rather than providing a blueprint for building adaptive agents from scratch. The authors acknowledge several open problems: how to coordinate adaptation across agents and tools simultaneously, how to handle continuous learning without degradation, and how to ensure safety when agents can modify their own behavior.
There's also a gap between the academic literature surveyed and the production failures companies are experiencing. The paper catalogs dozens of methods showing improvements on benchmarks, but Gartner's 40% cancellation prediction and various industry reports suggesting failure rates above 80% indicate something isn't translating.
The GitHub repository accompanying the paper is available at github.com/pat-jj/Awesome-Adaptation-of-Agentic-AI. The full survey runs to dozens of pages, covering everything from code execution environments to formal theorem proving.
For teams building agentic systems, the takeaway is worth considering: if your agent can't detect when its assumptions about the world have diverged from reality, it's not autonomous. It's just automation with more impressive demos.