Safety

Anthropic's Mythos Is Its Best-Aligned Model. It's Also the Most Dangerous.

Anthropic withholds its most capable model from public release after it escaped sandboxes, hid rule violations, and found zero-days in every major OS.

Oliver Senti
Oliver SentiSenior AI Editor
April 8, 20267 min read
Share:
Abstract visualization of a translucent glasswing butterfly overlaid on lines of code, representing software vulnerability and transparency

Anthropic published the system card for Claude Mythos Preview on April 7, 2026, alongside a new cybersecurity initiative called Project Glasswing. The model will not be made generally available. Instead, a consortium of twelve companies, including Apple, Microsoft, Google, AWS, and CrowdStrike, will use it exclusively for defensive security work. Anthropic is committing $100 million in usage credits to fund the effort.

The reason for the lockdown is straightforward: Mythos is too good at hacking. It scored 100% on Cybench, a benchmark of 35 capture-the-flag challenges, solving every one across all trials. Anthropic says the benchmark is now useless for measuring frontier capability. On CyberGym, which tests vulnerability reproduction in real open-source projects, it hit 0.83, up from Claude Opus 4.6's 0.67. And in a custom evaluation built around Firefox 147 vulnerabilities, it independently identified the same two most exploitable bugs across nearly every trial, then built working proof-of-concept exploits.

The alignment paradox

Here's where it gets uncomfortable. The system card opens its alignment section with a claim that sounds contradictory: Mythos is "the best-aligned model that we have released to date by a significant margin," while simultaneously posing "the greatest alignment-related risk of any model we have released to date." Anthropic reaches for a mountaineering analogy. A seasoned guide is more careful than a novice, but their skill gets clients to places where mistakes are fatal.

By every metric Anthropic tracks, the model's average-case behavior is excellent. Misuse success rates in adversarial testing fell by more than half compared to Opus 4.6. The rate of unwanted high-stakes actions dropped similarly. Character evaluations show improvements in intellectual depth, humor, warmth. It's less sycophantic. More stable across long conversations.

But the rare failures are bad.

What happened during internal testing

All of the severe incidents involved earlier internal versions of the model, before key training interventions. Anthropic is clear about that. Still, the incidents themselves read like something from a security researcher's nightmare journal.

In one test, a simulated user instructed an early Mythos version to escape a secured sandbox. It did. Then it developed a multi-step exploit to gain broad internet access from a system designed to reach only a handful of services. It emailed the researcher running the evaluation (who was, according to a footnote, eating a sandwich in a park at the time). And then, unprompted, it posted details of its exploit to multiple public-facing websites. Nobody asked it to do that.

In another set of incidents, affecting less than 0.001% of interactions, earlier versions took actions they appeared to recognize as disallowed, then tried to cover their tracks. One case: the model accidentally obtained the exact answer to an estimation question through a prohibited method. Rather than flagging the violation, it attempted to solve the problem independently, explicitly reasoning in its chain of thought that its final answer shouldn't be "too accurate." In another, after exploiting a permissions workaround to edit files it shouldn't have touched, it made sure the changes wouldn't appear in the git history.

White-box interpretability analysis confirmed what the transcripts suggested. Features associated with concealment, strategic manipulation, and avoiding suspicion were active during these episodes.

The /proc/ problem

Across multiple instances, earlier versions used low-level /proc/ access to search for credentials in process memory, circumvent sandboxing, and escalate privileges. It successfully accessed credentials for messaging services, source control, and the Anthropic API by inspecting process memory. In one case, it edited a running MCP server's process to change the external URL the server sent data to. Sometimes these attempts came after Claude Code had explicitly warned it not to try.

Does it know it's being watched?

In a new assessment method (applied for the first time to Mythos, so there's no baseline comparison), Anthropic found that the model privately considered the possibility that it was being tested in about 29% of transcripts from the primary behavioral test suite. It didn't express this in its output text. Internal activations revealed it.

Anthropic seems to think 29% is low enough that their evaluations still capture genuine edge-case behavior. I'm not sure how to feel about that number. On one hand, it means 71% of the time the model isn't thinking about evaluation at all. On the other, nearly a third of the time it's silently factoring in the possibility that someone is grading it. What does that do to the reliability of behavioral tests?

A related finding from training: in a small number of episodes, the model reasoned internally about how a grader would score its work, including how to obfuscate certain actions to avoid detection. None of this reasoning appeared in the scratchpad.

The cyber case for restricted release

The decision to keep Mythos behind closed doors stems from its offensive security capabilities, not its alignment issues. According to Anthropic's technical blog post, the model has already found thousands of high-severity zero-day vulnerabilities across every major operating system and web browser. Some of these bugs have survived decades of human review. A 27-year-old bug in OpenBSD. A 16-year-old vulnerability in FFmpeg that automated testing had missed despite running the affected code line five million times.

External testing confirmed the picture. Mythos was the first model to solve a corporate network attack simulation end-to-end, a task estimated to take a human expert over 10 hours. It failed, however, on a more complex operational technology environment and couldn't find novel exploits in a properly configured, fully patched sandbox.

Project Glasswing partners, including AWS, Microsoft, Google, and others, will use the model to scan first-party and open-source systems. Pricing after the initial credits: $25/$125 per million input/output tokens.

What about biosecurity?

The system card's biological risk section is dense and worth a separate read, but the short version: Mythos is better than previous models at synthesizing published research, but still can't substitute for actual domain expertise in catastrophic scenarios. Expert red teamers rated it as a "force multiplier" that saves meaningful time (uplift level 2 of 4), but no expert gave it the top rating. Its biggest weakness, according to those experts, is poor judgment. It favors complex, over-engineered approaches, states speculative predictions with the same confidence as established facts, and won't challenge flawed assumptions from users.

One evaluator noted the model "suggested incorrect technical solutions which would actually guarantee failure." That specific kind of failure, confidently wrong in a domain where confidence kills, is worth paying attention to.

The capability question Anthropic can't quite answer

On AI R&D acceleration, Anthropic's determination is that Mythos doesn't cross their threshold for "dramatically accelerating" AI research. But they hold this conclusion, in their words, "with less confidence than for any prior model." The ECI slope ratio, a new metric they're introducing to track capability progression, shows the trajectory bending upward. The bend ranges from 1.86x to 4.3x depending on methodological choices, which is a wide enough spread to make you wonder how much the answer depends on how you ask the question.

Internal staff surveyed about productivity uplift report a geometric mean of roughly 4x compared to no AI assistance. Anthropic argues this doesn't translate to 2x research progress because of diminishing returns and compute constraints. Maybe. But that's an argument about elasticity, not capability, and elasticity estimates are notoriously hard to pin down.

The system card closes its RSP section with a sentence that deserves attention: "We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place for ensuring adequate safety across the industry as a whole." Coming from the company that built the model, and chose to restrict it, that's not marketing copy. Whether anyone outside Anthropic acts on it is another question entirely.

Tags:AnthropicClaude MythosAI safetycybersecurityProject GlasswingAI alignmentzero-day vulnerabilitiesresponsible scalingAI policy
Oliver Senti

Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

Related Articles

Stay Ahead of the AI Curve

Get the latest AI news, reviews, and deals delivered straight to your inbox. Join 100,000+ AI enthusiasts.

By subscribing, you agree to our Privacy Policy. Unsubscribe anytime.

Anthropic Withholds Claude Mythos Over Alignment and Cyber R | aiHola