Developers who leaned on AI assistants to write code understood that code significantly less well than those who struggled through it themselves, according to a randomized controlled trial published by Anthropic. The gap amounts to nearly two letter grades on a follow-up quiz, and the steepest decline showed up in debugging skills.
The kicker: AI users didn't even finish much faster. They saved about two minutes on average, not enough to reach statistical significance.
What the study actually tested
Fifty-two junior software engineers learned Trio, a Python library for asynchronous programming. Half got access to an AI assistant that could see their code and generate complete solutions on demand. The other half worked with documentation and web search only.
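For readers who haven't used it, Trio structures concurrency around "nurseries" that own the tasks they spawn. A minimal sketch of what participants' code would have looked like (the function names here are illustrative, not taken from the study materials):

```python
import trio

async def greet(name):
    await trio.sleep(0.1)  # simulate a bit of async work
    print(f"hello, {name}")

async def main():
    # A nursery owns its child tasks and waits for all of them to finish.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(greet, "alice")
        nursery.start_soon(greet, "bob")

trio.run(main)
```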
Both groups then took a 14-question quiz covering concepts they'd just used. The Anthropic blog post reports the AI-assisted group averaged 50%, while the hand-coding group hit 67%. Debugging questions showed the widest gap, which is awkward given that catching AI-generated bugs is supposedly the human's job in an AI-assisted workflow.
The participants weren't novices. All had been using Python at least weekly for over a year and had prior experience with AI coding tools. They just hadn't used Trio before.
Not all AI use is equal
Here's where the study gets more interesting than a simple "AI bad" headline. The researchers watched screen recordings of every participant and identified six distinct patterns of AI interaction.
Three patterns correlated with quiz scores below 40%: full delegation to AI, progressive reliance that started small and escalated, and iterative debugging where participants kept asking AI to check and fix their code. These participants finished fastest but learned least.
Three patterns correlated with scores above 65%: asking for explanations after generating code, requesting code with explanations bundled in, and asking only conceptual questions while writing code independently. The last approach turned out to be the second-fastest overall, trailing only full delegation.
The difference wasn't whether participants used AI. It was whether they stayed mentally engaged while using it.
The debugging problem
Participants who coded without AI encountered more errors. That sounds like a point in AI's favor until you consider what those errors taught them.
The control group ran into Trio-specific errors: a RuntimeWarning when a coroutine wasn't awaited, or a TypeError when passing a coroutine object where Trio expected an async function. Resolving these meant actually understanding how Trio handles concurrency.
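To make those two failure modes concrete, here is a sketch with a hypothetical `fetch` coroutine standing in for the participants' code; the buggy calls are left commented out so the snippet runs cleanly:

```python
import trio

async def fetch(label):
    await trio.sleep(0.1)  # stand-in for real async work
    return label

async def main():
    async with trio.open_nursery() as nursery:
        # Correct: pass the async function and its arguments separately;
        # Trio calls it for you inside the nursery.
        nursery.start_soon(fetch, "ok")

        # Mistake 1: calling fetch() here hands start_soon a coroutine
        # object, and Trio raises a TypeError explaining it expected an
        # async function, not a coroutine.
        # nursery.start_soon(fetch("oops"))

        # Mistake 2: calling fetch() without awaiting it never runs the
        # body; Python emits "RuntimeWarning: coroutine 'fetch' was
        # never awaited" when the unused object is garbage-collected.
        # fetch("oops")

trio.run(main)
```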
The AI group encountered fewer errors overall, and almost none of the Trio-specific variety. The code just worked. But then they couldn't explain why.
What this doesn't tell us
The study measured immediate comprehension, not long-term retention. Whether a 17-percentage-point quiz gap predicts career-long skill deficits remains unknown. The tasks also took under an hour total, a far cry from the months developers typically spend learning a new tool on the job.
The researchers note that their chat-based AI setup differs from agentic coding tools like Claude Code, where "the impacts on skill development are likely to be more pronounced." Less interaction means even less opportunity to engage.
And there's an incentive effect baked into the design: participants knew a quiz was coming. Real-world developers optimizing for a deadline, not learning, might behave differently.
Why this matters now
The finding lands as AI coding tools have become standard. Previous research from the same team found AI can speed up some tasks by 80%. Other studies show novice developers benefit most from AI assistance.
But if novices are the ones most likely to use AI aggressively, and aggressive AI use undermines learning, the math gets uncomfortable for engineering managers thinking about workforce development.
The researchers suggest intentional design choices: learning modes that force explanation, workflows that preserve cognitive engagement. Both Claude Code and ChatGPT already offer such features. Whether developers actually use them is another question.
The study preregistration is available on OSF. Code and annotated transcripts are on GitHub.