Anthropic published a large-scale study on Tuesday analyzing millions of human-agent interactions across Claude Code and its public API, and the headline finding isn't about what agents are doing. It's about what they're not being allowed to do.
The research post, authored by a twenty-person team including Miles McCain, Saffron Huang, and Deep Ganguli, introduces what the company calls a "deployment overhang": the autonomy that models can handle exceeds the autonomy they actually exercise in practice. Between October 2025 and January 2026, the longest Claude Code sessions (the 99.9th percentile) nearly doubled in duration, from under 25 minutes to over 45 minutes. But the median turn still lasts about 45 seconds. Most people are barely scratching the surface.
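That gap between the median and the extreme tail is worth pausing on, since the two summarize completely different parts of the distribution. Here's a minimal sketch with synthetic, hypothetical durations (a short-task bulk plus a small number of long runs; none of these numbers come from Anthropic's data) showing how the median tracks the bulk while the 99.9th percentile tracks the rare long sessions:

```python
import random

random.seed(0)

# Synthetic session durations in seconds: a large bulk of short work
# plus a small number of very long runs (a heavy-tailed mix).
durations = [random.expovariate(1 / 45) for _ in range(100_000)]  # bulk: ~45 s scale
durations += [random.uniform(1500, 2700) for _ in range(200)]     # rare 25-45 min runs

def percentile(values, q):
    """Nearest-rank percentile: sort, then pick the value at rank q."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(q * (len(ordered) - 1))))
    return ordered[rank]

median = percentile(durations, 0.50)   # dominated by the short bulk
tail = percentile(durations, 0.999)    # dominated by the rare long runs

print(f"median: {median:.1f} s, p99.9: {tail:.1f} s")
```

The point of the toy example: the tail can double while the median barely moves, which is exactly the shape of Anthropic's reported numbers.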
The trust curve
The more interesting story is in how behavior shifts over time. New Claude Code users employ full auto-approve (letting Claude run without manual sign-off on each action) about 20% of the time. By 750 sessions, that figure climbs past 40%. So far, so predictable. But here's the wrinkle: experienced users also interrupt Claude more frequently, not less. Interrupt rates climb from about 5% of turns for newcomers to roughly 9% for veterans.
That's a real pattern, not a contradiction. New users babysit each step. Experienced users let Claude run free and yank the leash when something looks wrong. It is a shift from pre-approval to active monitoring, and the data is pretty clean on this. The researchers note that Claude Code's default settings push new users toward step-by-step approval anyway (auto-approve is off by default), so some of the trend is just people discovering the settings menu. Still, the gradual ramp suggests genuine trust accumulation, not just configuration drift.
When the agent stops itself
This is the part that caught my attention. On complex tasks, Claude Code asks for clarification more than twice as often as humans bother to interrupt it. The model is, in a sense, self-limiting. Of Claude's self-initiated pauses, 35% present users with a choice between approaches, and another 21% involve gathering diagnostic info. Only 13% are what you'd call genuine "I don't understand the request" clarifications.
The human side looks different. When users do interrupt, the top reason (32% of cases) is to provide missing technical context or corrections. 17% of interruptions amount to "you're being slow or doing too much." And 7% are essentially "thanks, I'll take it from here."
Anthropic frames this as evidence that training models to recognize their own uncertainty is a useful safety property. Which, fine, but I'm not sure what to make of the numbers yet. Claude might be asking questions at the wrong times, or asking unnecessary ones, or its pausing behavior might just be an artifact of how Plan Mode works. The researchers, to their credit, flag all of this. "It's important not to overstate this finding," they write, which is refreshing honesty from a company touting its own product's safety features.
What the API shows (and doesn't)
The public API data paints a broader but blurrier picture. Software engineering accounts for nearly 50% of all agentic tool calls. Everything else (business intelligence, customer service, sales, finance, e-commerce) sits in the single digits. Anthropic analyzed these tool calls using Clio, their privacy-preserving analytics tool that uses Claude itself to classify and cluster usage patterns without humans reading raw conversations.
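Anthropic hasn't published Clio's internals, but the basic shape of a pipeline like the one described (model-generated labels, then aggregation, with humans only ever seeing counts) can be sketched. Everything below is a hypothetical illustration: the `classify` function is a keyword stand-in for the actual Claude call, and the category names and `min_cluster` threshold are invented for the example.

```python
from collections import Counter

# Stand-in for the model call: Clio uses Claude itself to label each
# tool call. This keyword matcher is purely a placeholder.
def classify(tool_call: str) -> str:
    text = tool_call.lower()
    if any(k in text for k in ("git", "pytest", "refactor", "compile")):
        return "software engineering"
    if any(k in text for k in ("invoice", "ledger")):
        return "finance"
    return "other"

def aggregate(tool_calls: list[str], min_cluster: int = 2) -> dict[str, int]:
    """Label every call, then release only cluster counts above a
    minimum size, so rare (potentially identifying) clusters are
    suppressed and no raw text leaves the pipeline."""
    counts = Counter(classify(c) for c in tool_calls)
    return {label: n for label, n in counts.items() if n >= min_cluster}

calls = [
    "run pytest on the auth module",
    "git rebase feature branch",
    "refactor payment handler",
    "generate invoice summary",
]
print(aggregate(calls))  # only aggregate labels, never the calls themselves
# → {'software engineering': 3}  (the lone finance call is suppressed)
```

The minimum-cluster-size step is the privacy lever: a category with one member is close to identifying, so it never leaves the pipeline.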
The company reports that 80% of tool calls appear to have some safeguard in place and 73% appear to involve a human in the loop. Only 0.8% of actions look irreversible. These are Claude-generated classifications, though, and the researchers acknowledge Claude tends to overestimate human involvement. So treat that 73% figure as an upper bound.
The risk-autonomy scatter plot in the paper is mostly clustered in the low-risk, low-to-moderate-autonomy quadrant. The upper-right corner (high risk, high autonomy) is sparse but not empty. Some of the highest-risk clusters involve things like "implement API key exfiltration backdoors" and "relocate metallic sodium containers in laboratory settings," which sounds alarming until you consider that many of these are almost certainly red-teaming exercises or evaluations. Anthropic can't tell the difference from the API data alone, and they say so.
The METR comparison
There's an unavoidable comparison to METR's widely cited task-length research, which estimated that Claude Opus 4.5 could complete tasks with 50% success that would take a human nearly five hours. Anthropic's 99.9th percentile turn duration is around 42 minutes. But the two numbers are measuring fundamentally different things. METR captures what a model can do in an idealized setting with zero human interaction. Anthropic's data captures what happens when real people use the tool, pause it, redirect it, and sometimes just close the laptop and go to lunch.
The gap between the two is the deployment overhang. METR says the capability ceiling is five hours of human-equivalent work. Real-world usage barely touches 45 minutes at the extreme tail. As Latent Space noted, the two datasets tell a "somewhat different but directionally similar story" about increasing autonomy.
What's driving the increase?
The growth in turn duration is smooth across model releases, which the researchers find interesting. If longer autonomous runs were purely about better models, you'd expect sharp jumps when a new version launches. Instead the curve is gradual, suggesting that user trust-building and increasingly ambitious task selection matter as much as raw capability gains. The trend did flatten and even dip in mid-January, which Anthropic attributes partly to the Claude Code user base doubling between January and mid-February (diluting the power-user signal) and partly to people returning from holiday break with more constrained work tasks.
The policy upshot
Anthropic is using this data to argue against mandating specific oversight patterns like requiring humans to approve every agent action. Their position: such requirements "will create friction without necessarily producing safety benefits." The evidence does support this, at least for coding. Experienced users develop their own monitoring strategies that don't look like step-by-step approval but do involve active intervention.
Whether this generalizes beyond software engineering is an open question the researchers don't try to answer. Coding is unusually amenable to oversight because you can run the code and see if it works. In law, medicine, or finance, verifying an agent's output may require the same expertise as producing it. The Claude Code data is overwhelmingly about programming, and the API data, while broader, is analyzed at the level of isolated tool calls rather than full workflows.
Anthropic internally saw Claude Code's success rate on the hardest tasks double between August and December, while human interventions per session dropped from 5.4 to 3.3. Better outcomes with less hand-holding. But that's internal usage at an AI company, which is about as favorable a test population as you'll find.
The study is a first attempt at something the industry badly needs: empirical data about how agents actually behave in deployment, not just what they score on benchmarks. The methodology has real limitations (single provider, Claude-generated classifications, coding-dominated data). But the core insight, that there's a growing gap between agent capability and user-granted autonomy, is one worth tracking as agents move into domains where the stakes are higher than a broken unit test.