LLMs & Foundation Models

Claude Opus 4.6 Users Report Widespread Performance Problems as Anthropic Fights Fires on Multiple Fronts

Developers flood Anthropic's GitHub with complaints about coding regressions, context loss, and quota drain.

Liza Chan
AI & Emerging Tech Correspondent
April 13, 2026 · 6 min read

Two months after Anthropic launched Claude Opus 4.6 as its most capable model, a growing chorus of paying developers says something is wrong. The complaints aren't subtle: coding loops that burn through entire daily quotas, context windows that degrade well before their advertised limits, and a model that, in the words of one frustrated GitHub issue, "feels dumber than Sonnet 3.5 most of the time."

Anthropic's own status page tells part of the story. Since late February, Opus 4.6 has logged repeated service incidents: elevated errors on February 28, a networking degradation on March 26-27 that took hours to resolve, and another round on March 31 that hit both Opus and Sonnet 4.6. Then again on April 4. And April 6. And April 7. And April 10. I stopped counting.

The GitHub paper trail

The strongest evidence that something is off comes not from social media complaints but from Anthropic's own claude-code repository on GitHub, where developers file structured bug reports. A February issue documented 50+ sessions over 15 days where Opus 4.6 generated confident technical analysis that fell apart under scrutiny. The user dated the behavioral change to sometime between February 4 and February 16, barely two weeks after launch.

Another detailed report from early April describes a pattern familiar to anyone who's burned tokens chasing a stubborn bug: the model writes code before understanding data structures, makes claims without reading the files, then enters correction loops that consume the user's entire daily allocation. "A JSON-LD structured data tool was built from scratch in 2 hours with Opus 4.6 in late February," the user wrote. Extending it with a simple feature distinction had failed eight times across multiple sessions.

That last part gets at why this matters beyond the usual "AI model isn't perfect" complaints. These are paying Max subscribers at $200/month who chose Opus specifically for complex coding work.

The context window problem

One of the more interesting complaints concerns Opus 4.6's advertised 1-million-token context window, which Anthropic made generally available in mid-March. A detailed bug report found that during a heavy Claude Code session, the model's performance degraded well before hitting 50% of that window. At around 20% usage, the user noticed circular reasoning and forgotten decisions. By 40%, context compression kicked in, wiping scrollback history. At 48%, the model itself told the user it wasn't being effective and recommended starting a fresh session.

If the effective high-quality context is roughly 400K tokens, advertising 1M is a stretch. The user asked whether this should be communicated rather than marketing the full figure. Anthropic hasn't publicly answered that question.

Then came the quota crisis

The performance complaints overlapped with a separate, uglier problem: users burning through their usage quotas at alarming speed. By late March, Max subscribers were reporting quota exhaustion in under 20 minutes. The Register reported on the issue after Anthropic acknowledged users were hitting limits "way faster than expected."

A community member reverse-engineered the Claude Code binary using Ghidra and a MITM proxy, claiming to find two independent bugs that broke prompt caching, silently inflating costs by 10-20x. Some users confirmed that downgrading to version 2.1.34 made a noticeable difference. Anthropic engineer Thariq Shihipar responded that they were "actively looking into this in particular."
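To see why a caching bug lands this hard on quotas, consider how prompt caching normally works: in a long agentic session the bulk of the context is resent on every turn, but cache hits are billed at a steep discount. A minimal sketch, using placeholder rates rather than Anthropic's actual Opus 4.6 pricing, shows how a silent cache miss on every request inflates cost:

```python
# Illustrative sketch: how a broken prompt cache can inflate per-request cost.
# Rates are hypothetical placeholders, not Anthropic's actual Opus 4.6 pricing;
# the ~10x discount for cache reads mirrors the general shape of cached pricing.

BASE_INPUT_RATE = 15.00   # $ per million input tokens (assumed)
CACHE_READ_RATE = 1.50    # $ per million cached tokens (assumed ~10% of base)

def request_cost(context_tokens: int, new_tokens: int, cache_works: bool) -> float:
    """Cost of one request that resends a large context plus a small new prompt."""
    if cache_works:
        # Bulk of the context is served from cache at the discounted rate.
        return (context_tokens * CACHE_READ_RATE + new_tokens * BASE_INPUT_RATE) / 1_000_000
    # Cache miss: the entire context is re-billed at the full input rate.
    return (context_tokens + new_tokens) * BASE_INPUT_RATE / 1_000_000

# A long agentic session: 200K-token context, 500 new tokens per turn, 50 turns.
with_cache = sum(request_cost(200_000, 500, True) for _ in range(50))
without_cache = sum(request_cost(200_000, 500, False) for _ in range(50))
print(f"caching intact: ${with_cache:.2f}")
print(f"caching broken: ${without_cache:.2f} ({without_cache / with_cache:.1f}x)")
```

Under these assumed numbers, two silently failing cache layers pushing every turn to the full input rate lands squarely in the 10-20x range users reported.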

Shihipar also announced via X that Anthropic was adjusting session limits during peak hours, a change he said would affect around 7% of users. But the timing was terrible: the announcement coincided with the end of a promotional period that had doubled usage limits, making the return to baseline feel like a sudden cut.

Is the model actually worse?

Here's where it gets complicated. On paper, Opus 4.6 benchmarks well. It scores 80.8% on SWE-Bench Verified (essentially tied with Opus 4.5's 80.9%) and leads on Terminal-Bench 2.0 and Humanity's Last Exam. Artificial Analysis gives it a 53 on their Intelligence Index, above Opus 4.5's 50.

But Artificial Analysis also flagged something telling: during their evaluation, Opus 4.6 generated 160 million output tokens, compared to a median of 35 million across comparable models. It is, by a wide margin, the most verbose frontier model they tested. That verbosity translates directly into cost and quota consumption for real users, even if the underlying intelligence scores look good.
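The back-of-envelope arithmetic on those figures is straightforward. Using an assumed output rate (not Anthropic's actual pricing), the gap between Opus 4.6's token volume and the median looks like this:

```python
# Back-of-envelope: verbosity as a cost multiplier, using the Artificial
# Analysis token counts reported above. The output rate is a placeholder,
# not Anthropic's actual Opus 4.6 pricing.

OUTPUT_RATE = 75.00  # $ per million output tokens (assumed)

opus_tokens = 160_000_000    # Opus 4.6 across the eval suite
median_tokens = 35_000_000   # median across comparable models

multiplier = opus_tokens / median_tokens
extra_cost = (opus_tokens - median_tokens) * OUTPUT_RATE / 1_000_000

print(f"{multiplier:.1f}x the median output volume")
print(f"~${extra_cost:,.0f} extra at the assumed rate")
```

Roughly 4.6x the median output volume, before any quota accounting. For metered subscribers, that multiplier compounds on every session regardless of how smart the answers are.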

The distinction matters. A model can be smarter on hard benchmarks while simultaneously feeling worse in production, especially for developers who care about token efficiency, instruction adherence in long sessions, and consistent behavior across simple tasks. Several of the GitHub reports describe exactly this: Opus 4.6 handles complex reasoning well on fresh sessions but degrades faster than its predecessor as context accumulates.

What Anthropic hasn't said

Anthropic has been responsive on individual GitHub issues but hasn't published any kind of post-mortem or public acknowledgment that the model's real-world behavior may differ from its benchmark performance. Product lead Lydia Hallie acknowledged the usage limit issues on X, and Anthropic's Reddit account called it a "top priority." But as one frustrated user pointed out, there's been no blog post, no email to subscribers, and no status page entry covering the behavioral complaints (as opposed to pure outages).

The silence feeds conspiracy theories. One recent GitHub issue accused Anthropic of deliberately degrading Opus to make the upcoming Mythos model look better by comparison. That's almost certainly wrong, but the fact that paying customers are reaching for that explanation says something about the trust gap.

Anthropic corrected some of its own benchmark claims after launch: a small downward adjustment to its Humanity's Last Exam score (from 53.1% to 53.0%) and a reduction in BrowseComp numbers after an eval-awareness review. These are tiny changes, but they add texture to a narrative where users already feel the model isn't living up to its billing.

What happens next

Anthropic has a concrete problem on its hands. The prompt-caching bugs in Claude Code appear to have a fix path (the downgrade workaround suggests the regression is well-bounded). The service stability issues have been addressed incident by incident. But the harder question, whether Opus 4.6's default behavior in long coding sessions constitutes a meaningful regression from 4.5, doesn't have a status-page fix.

The launch blog post promised that Opus 4.6 "stays productive over longer sessions" and "performs markedly better than its predecessors" on context tasks. Right now, a meaningful number of the model's most dedicated users are telling Anthropic the opposite. Whether that's a model problem, an infrastructure problem, or a context-management problem that falls between the two, the company hasn't yet said.

Tags: Anthropic, Claude, Claude Opus, AI models, Claude Code, performance degradation, LLM benchmarks, developer tools, AI coding, context window
Liza Chan


AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.


