Anthropic published a report on June 16 arguing that the thing separating successful AI-assisted coding from failed attempts isn't whether you can write code. It's whether you understand the problem you're trying to solve. The claim rests on a privacy-preserving analysis of roughly 400,000 Claude Code sessions from about 235,000 people between October 2025 and April 2026.
What the data actually says
The headline split is clean enough to be suspicious, so it's worth stating plainly. In a typical session, people make around 70% of the planning decisions: what to build and what counts as finished. Claude makes about 80% of the execution decisions: which files to touch, what code to write, which commands to run. The Anthropic report frames this as a division of labor where humans decide what and the agent decides how.
Expertise here is not a job title. The researchers built a classifier that rates how precisely someone frames instructions, what they ask Claude to verify, and whether the user corrects Claude or the other way around. By that definition an accountant who knows exactly which reconciliation rules a script must enforce counts as an expert, while a senior engineer asking their first Rust question does not.
The payoff scales with that rating. Novice sessions set off about five Claude actions and 600 words per prompt. Expert sessions trigger roughly 12 actions and 3,200 words, more than twice the activity and more than five times the output. The trend held inside every kind of work and every band of task value, which is the part that makes it harder to wave away.
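For readers who want to check the multiples, here is a minimal sketch that uses only the rounded figures quoted above. The variable names are illustrative and do not come from the report.

```python
# Rounded per-prompt figures as quoted in the article (not raw report data).
novice = {"actions": 5, "words": 600}
expert = {"actions": 12, "words": 3200}

# Ratio of expert to novice activity and output per prompt.
action_ratio = expert["actions"] / novice["actions"]
word_ratio = expert["words"] / novice["words"]

print(f"actions per prompt: {action_ratio:.1f}x")
print(f"words per prompt:   {word_ratio:.1f}x")
```

Because the inputs are rounded, the ratios are only approximate.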
The success gap is real, and lopsided
Anthropic uses a strict bar it calls verified success, meaning a classifier judged the session successful and found hard evidence like a commit, a passing test suite, or explicit user confirmation. Novice sessions cleared that bar 15% of the time. Intermediate and expert sessions landed between 28% and 33%.
Most of that jump happens early. The distance from novice to intermediate is much wider than intermediate to expert, which the report reads as good news: a working grasp of a domain captures most of the benefit, and deep mastery adds only a little more. The recovery numbers are starker. When a session hit trouble, 19% of novice sessions were abandoned with zero lines of code written, against 5% to 7% for everyone else.
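The verified-success bar described above is a two-part test, which a short sketch makes concrete. This is an assumption-laden illustration, not Anthropic's implementation: the `Session` fields and the `is_verified_success` function are hypothetical names, and the three evidence types are only the examples the article lists.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical fields; the report's real classifier outputs are not public here.
    judged_successful: bool       # classifier's overall verdict on the session
    has_commit: bool = False      # a commit was made
    tests_passed: bool = False    # a test suite passed
    user_confirmed: bool = False  # the user explicitly said it worked


def is_verified_success(s: Session) -> bool:
    """Both conditions must hold: a positive verdict AND some hard evidence."""
    hard_evidence = s.has_commit or s.tests_passed or s.user_confirmed
    return s.judged_successful and hard_evidence


# A positive verdict with no evidence does not clear the bar.
print(is_verified_success(Session(judged_successful=True)))
# A positive verdict plus explicit user confirmation does.
print(is_verified_success(Session(judged_successful=True, user_confirmed=True)))
```

Under this reading, a user's explicit confirmation alone is enough evidence to satisfy the second condition.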
Does your profession matter?
Less than you'd think, at least on coding tasks. People in software jobs reached verified success about 30% of the time overall versus 26% for everyone else. Among sessions that actually produced code, it was 34% versus 29%. Every one of the ten largest occupations landed within seven points of software engineers.
Management occupations scored highest on verified success, slightly above software engineers. The report floats two explanations: management skills that transfer to directing an agent, or the duller possibility that managers just say "thanks, that worked" more often, which is exactly the kind of confirmation the success classifier is looking for. The second reading is the honest one.
That caveat matters more than Anthropic lets on. A success metric that partly rests on people verbally confirming success will reward whoever talks to the agent like a boss approving deliverables. Worth keeping in mind before treating the occupation ranking as gospel.
The work changed too
Over the seven months, the share of sessions spent fixing broken code dropped from 33% to 19%. Operating software climbed from 14% to 21%. Writing and data analysis each roughly doubled. People are pointing the tool at more than bug-squashing now.
Anthropic also estimates the typical session grew about 27% more valuable, judged against freelance job postings. The company is upfront that this figure is coarse and built on a fuzzy match between sessions and gig listings, so it's useful for tracking direction over time, not as a dollar amount anyone should quote at a budget meeting. The original framing of the work put this closer to 25%, but the report itself says 27%.
All of this comes with a flashing asterisk: every classification depends on a model reading a transcript, no human checks individual sessions, and the analysis excludes headless and automated usage that makes up a large slice of real activity. Anthropic says measuring that excluded chunk is next on its list. The report promises updates as models and users shift. The telling signal will be whether returns to expertise start to fall, which would mean the model is supplying the judgment users currently bring themselves.




