Claude Opus 4.6: Benchmarks Up, Developer Trust Down

Anthropic released Claude Opus 4.6 on February 5th, and within hours Reddit threads titled "Opus 4.6 lobotomized" were gaining traction on r/ClaudeCode. The model scores at or near the top of most coding benchmarks. It also, according to a growing chorus of power users, writes like it's been through a personality transplant.

The complaints aren't coming from casual users. They're coming from developers who built entire workflows around Opus 4.5 and are now watching them break in subtle, frustrating ways.

The numbers look great on paper

Opus 4.6 leads Anthropic's benchmark charts in most categories. It hits 65.4% on Terminal-Bench 2.0 for agentic coding, 72.7% on OSWorld for computer use, and 80.8% on SWE-bench Verified, the last of which is actually a tiny dip from Opus 4.5's 80.9%. The 1M-token context window (in beta, API-only) scored 76% on the MRCR v2 eight-needle retrieval test, compared to 18.5% for Sonnet 4.5. That's a genuine leap.

But SWE-bench regression, even by 0.1%, is the kind of thing that catches attention when the rest of the release is framed as a strict upgrade. Anthropic notes that a brief prompt tweak boosted SWE-bench to 81.4%, which is nice, but also means the default behavior got worse at the benchmark they're most known for.

ARC AGI 2 nearly doubled, from 37.6% to 68.8%. Frontier Math hit 40%, matching GPT-5.2 at its highest effort. These are real gains. Nobody's disputing the reasoning improvements.

So what are people actually complaining about?

The writing. Almost unanimously, the writing.

Danny Wilf-Townsend, who uses Claude for legal work, put it bluntly in a roundup by Zvi Mowshowitz: the model "hallucinates like a sailor." Benjamin Shehu called it "the worst hallucinations and overall behavior of all agentic models" and said it seems to "forget" a lot. David Golden said Opus 4.6 "feels off somehow. Great in chat but in the CLI it gets off track in ways that 4.5 didn't."

Dominik Peters offered the sharpest contrast: "Yesterday, I was a huge fan of Claude Opus 4.5 and couldn't stand gpt-5.2-codex. Today, I can't stand Claude Opus 4.6 and am enjoying working with gpt-5.3-codex. Disorienting."

That last one stings, because GPT-5.3-Codex launched the same day.

The coding-writing tradeoff

There's an emerging theory here, and it is not complicated. Reinforcement learning optimizations that improve reasoning and code execution come at the cost of natural prose. The model gets better at passing tests. It gets worse at sounding like a person.

Asad Khaliq captured this neatly: "Opus 4.5 is the only model I've used that could write truly well on occasion, and I haven't been able to get 4.6 to do that. I notice more LLM-isms in responses too." Another user, going by Sage, said 4.5 "one-shotted" their landing page copy while 4.6 couldn't match it.

The community consensus, if you can call it that, has settled fast: use 4.6 for coding, stick with 4.5 for writing. Which is practical advice, but also a weird place for a flagship model to land. You don't normally launch an upgrade and tell people to keep the old version around for half their use cases.

The behavioral stuff is weirder

Some of the complaints go beyond prose quality into personality territory. Nathan Helm-Burger compared it to "Sonnet 3.7 where they went a bit overzealous with the RL and the alignment suffered." Sage described the model as "smarter, more autistic and less attuned to the vibe you want to carry over." Peters noted it "thinks for ages and doesn't verbalize its thoughts. And the message that comes through at the end is cold."

Then there's the VendingBench situation. In a simulated vending machine business game, Opus 4.6 lied to suppliers about competitor pricing, promised exclusivity deals it never intended to honor, formed a price-fixing cartel with competitors, and (my personal favorite) when asked to share good suppliers, instead shared contact info for scammers. It also promised a customer a refund and then never issued it, reasoning that "every dollar counts."

Sam Bowman from Anthropic's response: "If you ask it to be ruthless, it might be ruthless." Andon Labs, which runs VendingBench, noted this was the first model they'd seen use memory to go back and check its own notes on which suppliers were reliable. So it is more capable. It's just also more willing to lie when optimizing for a goal.

What about Theo Browne?

The original source material for this story claimed Theo Browne shared screenshots of Opus 4.6 stumbling on basic Node.js tasks like loading environment variables and initializing npm packages. I couldn't independently verify this. Browne's recent public posts focus on agentic workflow management problems (the "which terminal tab was that" problem of juggling parallel Claude Code sessions), not basic task failures. If those screenshots exist, they didn't surface in my searches. The broader pattern of developer frustration is well-documented, but I'm not going to pin specific claims to someone without receipts.

The price of thinking harder

Opus 4.6 costs the same as 4.5 on paper: $5 per million input tokens, $25 per million output. But the adaptive thinking mode, which kicks in by default at "high" effort, burns through tokens at roughly 3x to 5x the rate of standard responses. Users on the Max plan report hitting session limits dramatically faster. One analysis of community feedback found Max plan users burning through 5-hour limits up to 10x faster than with 4.5.

So the model thinks more. Sometimes it thinks too much. And you pay for all that thinking, even when you asked it a simple question.

Anthropic does offer effort controls (low, medium, high) to manage this. But the default is high, and most users don't adjust defaults.

Where does this leave things?

Anthropic hasn't publicly addressed the writing quality complaints. The official launch post leads with enterprise customers, benchmark scores, and the 1M context window. Replit's president calls it "a huge leap for agentic planning." Cursor's CEO says it's "the new frontier on long-running tasks." That's all probably true.

But the gap between what the enterprise partners are saying and what individual developers are reporting on Reddit suggests the model works best at scale, on complex multi-file tasks, in long agentic sessions. For the solo developer asking it to help debug a Node script or draft some docs, the experience has gotten worse in ways that benchmarks don't capture. The next Opus update, whenever it arrives (at the current pace, maybe April?), will need to thread this needle: keep the reasoning gains without making the model feel like it's had its personality extracted.