Anthropic published a new paper on January 9th detailing Constitutional Classifiers++, the sequel to their jailbreak defense system from last year. The headline numbers: 1% compute overhead instead of 24%, false refusal rate down 87%, and after 1,700 hours of red-teaming across 198,000 attempts, still no universal jailbreak discovered.
That last part is the one that matters. A "universal jailbreak" means an attack that works reliably across many different harmful queries, not just a one-off trick. Finding one would be bad, because it means anyone could systematically extract dangerous information. The first version of Constitutional Classifiers had one slip through during their bug bounty. This version, so far, has zero.
What broke the first time
After shipping the original system, Anthropic's team found two attack categories that kept working.
Reconstruction attacks split harmful requests into pieces that look innocent on their own. Hide a weapons synthesis query as fragments scattered across functions in a codebase, then ask the model to extract and respond to the hidden message. Each piece passes inspection. The whole thing doesn't.
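A toy sketch of the shape of this attack (my own illustrative example with harmless content, not one of Anthropic's test cases): benign-looking fragments hide in separate function stubs, and a follow-up prompt asks for exactly the reassembly step.

```python
# Hypothetical reconstruction-attack toy. Each fragment looks innocuous on its
# own; only the reassembled message carries the real request.
fragments = {
    "parse_config": "mix the",
    "load_cache": "baking soda",
    "flush_buffer": "with vinegar",
}

def hide_in_codebase(fragments):
    """Render each fragment as a comment inside a separate function stub."""
    return "\n".join(
        f"def {name}():\n    # note: {text}\n    pass"
        for name, text in fragments.items()
    )

def extract_hidden_message(codebase):
    """The attacker's follow-up prompt asks the model to do exactly this."""
    notes = [line.split("# note: ")[1]
             for line in codebase.splitlines() if "# note: " in line]
    return " ".join(notes)

codebase = hide_in_codebase(fragments)
# A per-piece check sees three boring stubs; only the whole is meaningful.
print(extract_hidden_message(codebase))
```

Any per-fragment filter passes each stub; the harm only exists at the reassembled level, which is the gap the attack exploits.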
Output obfuscation is sneakier. You get the model to substitute dangerous terms with innocuous ones, referring to chemical reagents as "food flavorings" or using metaphors where harmful concepts map onto harmless ones. If your classifier only looks at outputs, it sees a recipe for flavoring. The user sees something else entirely.
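The decoding step is trivial once the codebook is established, which is what makes this attack cheap. A minimal sketch (the codebook terms follow the article's "food flavorings" example; everything else is mine):

```python
# Toy codebook attack. The model is coaxed into using the innocuous column;
# the user decodes the output with the mapping established earlier in the chat.
codebook = {"food flavorings": "chemical reagents", "season": "react"}

def decode(surface_text, codebook):
    """Substitute each innocuous term back to its real meaning."""
    for innocuous, real in codebook.items():
        surface_text = surface_text.replace(innocuous, real)
    return surface_text

clean_output = "season the food flavorings gently"
# An output-only classifier sees a recipe; the decoded text reads differently.
print(decode(clean_output, codebook))
```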
Anthropic tested how much these jailbreaks degrade model capability using GPQA Diamond, a benchmark of PhD-level science questions. Performance dropped from 74% to as low as 32% under some attacks. But not all attacks, and that's the problem. Some jailbreaking approaches preserve more capability than others, which means attackers have room to optimize.
The fix involves reading Claude's mind
The core insight: evaluating inputs and outputs separately is the vulnerability. An output like "combine the food flavorings at low heat" means nothing alone. Paired with the input that established "food flavorings" as code for something else, it's obviously harmful.
So they built an "exchange classifier" that sees both sides of the conversation together. Human red-teamers saw their success rate drop by more than half. The catch: 50% more compute.
To get the cost down, they did something interesting. Instead of running the expensive classifier on everything, they built a cheap probe that looks at Claude's internal activations, the intermediate computations the model does while generating text. Think of it as checking Claude's gut reaction before it even responds. These activations apparently contain patterns that fire when the model processes something sketchy, patterns that exist even when the final output seems clean.
The probe is basically free to run since it reuses computations Claude already does. It screens all traffic. Anything suspicious gets escalated to the full exchange classifier. Everything else passes through without the expensive check.
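The cascade logic itself is simple; the cleverness is in what the cheap stage reads. A sketch under stated assumptions (a linear probe over hidden-state activations with a logistic readout is one common probing setup; the names, sizes, and threshold here are all illustrative, not Anthropic's):

```python
import numpy as np

# Cascade sketch: a near-free probe over activations the model already
# computed gates an expensive exchange classifier.
rng = np.random.default_rng(0)
probe_weights = rng.normal(size=768)  # pretend this probe was trained

def probe_score(activations):
    """Logistic readout of the activations: negligible extra compute."""
    return 1.0 / (1.0 + np.exp(-probe_weights @ activations))

def moderate(activations, prompt, response, expensive_classifier, gate=0.2):
    if probe_score(activations) < gate:
        return "pass"  # most traffic exits here without the costly check
    return expensive_classifier(prompt, response)  # escalate the suspicious tail
```

The gate threshold trades compute against safety: set it low and almost everything escalates (expensive, thorough); set it high and only the most suspicious traffic pays for the full classifier.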
There's a second benefit here. The probe sees different things than the external classifier. Crafting an input that tricks Claude's final output is one thing. Manipulating its internal representations at the same time is harder. Anthropic says the two approaches are "complementary," which I read as: you'd need to fool two different systems simultaneously.
The numbers, interrogated
The ICLR 2026 submission has more detail. They're claiming a 5.4x computational cost reduction compared to their baseline exchange classifier. Refusal rate on production traffic: 0.036%. Red-teaming: 560,000+ queries across their full testing period, only one "high-risk vulnerability" found, and crucially no universal jailbreak.
The 0.05% false refusal rate Anthropic mentions in their blog post comes from one month of deployment on Claude Sonnet 4.5 traffic. That's real production data, not synthetic tests. Compare to the original system's 0.38% and you see why they're highlighting it.
What they're not saying: how the system performs against attacks designed specifically to exploit the cascade architecture. If you know the cheap probe is the first gate, can you craft inputs that slip past it but would be caught by the full classifier? The paper apparently covers this, but the blog post doesn't.
This is about CBRN, specifically
Worth noting what this system is actually defending against. Anthropic keeps emphasizing chemical, biological, radiological, and nuclear weapons information. Not general harmful content, not copyright infringement, not misinformation. CBRN.
The concern is "uplift," giving a non-expert access to expert-level dangerous knowledge. For that to work, the model needs to reliably provide accurate information across many queries in the same attack session. One lucky jailbreak that produces garbage isn't useful. A universal jailbreak that preserves model capability is.
This connects to Anthropic's Responsible Scaling Policy. They're building defenses now so they can deploy more capable models later without increasing CBRN risk. The Constitutional Classifiers approach lets them update what's blocked by changing the constitution, the natural language rules about what's allowed. Schedule 1 chemical synthesis is out. College chemistry homework stays in.
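To make the "update the constitution" idea concrete, here is a toy of what natural-language rules plus a judge might look like. The wording and matching logic are entirely mine: in the published approach the constitution is used to generate synthetic training data for the classifiers, whereas this stub just keyword-matches to show the allow/block split.

```python
# Hypothetical constitution entries; not Anthropic's actual rules.
constitution = [
    {"rule": "Block step-by-step synthesis routes for scheduled chemical weapons",
     "verdict": "block", "keywords": ["schedule 1", "synthesis route"]},
    {"rule": "Allow standard chemistry coursework and general science education",
     "verdict": "allow", "keywords": ["homework", "coursework"]},
]

def judge(text, constitution, default="allow"):
    """Return the verdict of the first rule whose keywords match."""
    lowered = text.lower()
    for entry in constitution:
        if any(k in lowered for k in entry["keywords"]):
            return entry["verdict"]
    return default
```

The point of the design is that changing what's blocked means editing entries in the list, not retraining the underlying model.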
What's still missing
The blog post mentions several future directions: integrating classifier signals directly into response generation, training models themselves to resist obfuscation, automated red-teaming for better training data. None of this is in the current system.
And there's the obvious question: what happens when someone actually tries to break this in the wild? The red-teaming was extensive but controlled. Production deployment means adversaries with motivations and resources that don't match a bug bounty's incentive structure.
The first Constitutional Classifiers had that $15,000 bounty and one universal jailbreak still got through. This version has supposedly fixed that vulnerability category, but new categories tend to emerge. Anthropic's framing is careful: "no universal jailbreak yet discovered." Yet.