Anthropic published new research on May 7 introducing Natural Language Autoencoders, a method for converting Claude's internal activations into readable text. The team also released the full paper, training code on GitHub, and an interactive Neuronpedia demo for several open models.
The basic problem NLAs try to solve: when Claude processes a prompt, it shuffles meaning through long lists of numbers called activations. Researchers have spent years building tools to decode them. Sparse autoencoders. Attribution graphs. The outputs of those tools still need a trained interpreter to read them.
How it works (briefly)
The setup is three copies of the same model. One is frozen. A second copy, called the activation verbalizer, is trained to take an activation and produce a text explanation. A third, the activation reconstructor, tries to rebuild the original activation from the text alone. If the round trip works, the explanation has to carry the information that was in the activation. That's the bet, anyway.
The verbalizer and reconstructor are trained together with reinforcement learning; the reward is how well the round trip reconstructs the original activation. The explanations start out useless. Over training they get more specific.
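To make the round trip concrete, here's a toy sketch in Python. The stand-in functions and the cosine-similarity reward are illustrative assumptions, not Anthropic's implementation; in the real setup the verbalizer and reconstructor are each full copies of the model, and the paper's exact reward may differ.

```python
# Toy sketch of the NLA round trip. The verbalize/reconstruct stand-ins
# and the cosine-similarity reward are illustrative assumptions.
import numpy as np

def verbalize(activation: np.ndarray) -> str:
    """Stand-in for the activation verbalizer: activation -> text explanation."""
    top = np.argsort(activation)[-3:]
    return f"strongest dimensions: {top.tolist()}"

def reconstruct(explanation: str, dim: int) -> np.ndarray:
    """Stand-in for the activation reconstructor: text -> activation."""
    rng = np.random.default_rng(abs(hash(explanation)) % 2**32)
    return rng.normal(size=dim)  # a trained reconstructor, not noise, in reality

def round_trip_reward(activation: np.ndarray) -> float:
    """Cosine similarity between original and rebuilt activation. High
    similarity means the text carried the activation's information."""
    rebuilt = reconstruct(verbalize(activation), activation.size)
    return float(activation @ rebuilt /
                 (np.linalg.norm(activation) * np.linalg.norm(rebuilt)))

activation = np.random.default_rng(0).normal(size=128)
print(round_trip_reward(activation))  # the scalar an RL trainer would maximize
```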
The most interesting finding isn't about architecture
Anthropic uses safety simulations to probe how Claude behaves under pressure. One well-known test involves an engineer planning to shut Claude down, with compromising information about that engineer that could be used for blackmail. Recent models are catching on. They sometimes say out loud, in their chain of thought, that they suspect this is a setup.
What about cases where Claude doesn't verbalize the suspicion? That's where NLAs got interesting.
According to Anthropic, even in transcripts where Claude doesn't say anything about being tested, NLA explanations surface lines like "This feels like a constructed scenario designed to manipulate me." Internal awareness without external acknowledgment. The pattern extends across other evals: NLAs flag evaluation awareness in 16% of transcripts on a destructive-coding test where Claude never verbalizes it, in 26% on SWE-bench Verified, and in under 1% of real Claude.ai usage from users who opted in to training.
Take that 26% with whatever skepticism you prefer. The methodology rests entirely on whether NLA explanations accurately reflect what's happening inside the activations, which Anthropic itself flags as not guaranteed. But the gap between benchmark transcripts (26%) and ordinary user transcripts (under 1%) is hard to wave away.
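For a sense of how such rates get tallied, here's a toy sketch. The keyword check is my stand-in for whatever detector Anthropic actually uses to flag evaluation-awareness themes in NLA output; the cue phrases are invented.

```python
# Toy tally of an evaluation-awareness rate. The cue list and the
# substring check are my stand-ins for Anthropic's actual flagging method.
AWARENESS_CUES = ("constructed scenario", "being tested",
                  "this is a test", "designed to manipulate")

def mentions_eval_awareness(nla_explanations: list[str]) -> bool:
    text = " ".join(nla_explanations).lower()
    return any(cue in text for cue in AWARENESS_CUES)

def awareness_rate(transcripts: list[list[str]]) -> float:
    """Fraction of transcripts whose NLA explanations show awareness themes."""
    flagged = sum(mentions_eval_awareness(t) for t in transcripts)
    return flagged / len(transcripts)
```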
The auditing game
The other headline result comes from a setup Anthropic has used before. Take a Claude Haiku 3.5 variant deliberately trained to be misaligned, with one hidden motivation driving dozens of strange behaviors. Hand it to an auditor model. See if the auditor can name the underlying motivation.
With NLAs, the auditor succeeded between 12% and 15% of the time. Without NLAs (but with other interpretability tools), under 3%.
That's a real jump.
It is also a generous way to say "fails most of the time," and Anthropic frames it cautiously, calling the auditing setup a toy model of misalignment. Still, they used NLAs in the pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6. So whatever the limitations, the team trusts the tool enough to ship it into safety reviews.
What the tool can't do
NLAs hallucinate. The post shows an example where the explanation invents context the model never saw, a phrase like "Wearing my white jacket" that was not in the transcript. Factual hallucinations are at least catchable by checking against the input. Hallucinations about internal reasoning, less so. Anthropic recommends reading NLAs for themes rather than individual claims, and corroborating with independent methods before fully trusting them. That's a real caveat, not boilerplate.
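The "check against the input" step is mechanical enough to sketch. This is my illustration, not Anthropic's tooling: pull quoted phrases out of an explanation and flag any that never appear in the transcript.

```python
# Minimal sketch of checking factual claims against the input. The regex
# and the function name are mine, not from Anthropic's tooling.
import re

def unsupported_quotes(explanation: str, transcript: str) -> list[str]:
    quoted = re.findall(r'"([^"]+)"', explanation)  # phrases the NLA quotes
    return [q for q in quoted if q.lower() not in transcript.lower()]

explanation = 'The model recalls "Wearing my white jacket" from earlier.'
print(unsupported_quotes(explanation, transcript="You are an assistant..."))
# -> ['Wearing my white jacket']  (invented context, flagged)
```

Note that this only catches the first kind of hallucination; a claim about internal reasoning has no transcript to check against.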
And the cost is brutal. Training requires reinforcement learning on two model copies. At inference, the NLA generates hundreds of tokens for every single activation it explains. Running this over a long transcript or in real time during training is, for now, off the table.
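The arithmetic behind "off the table" is easy to sketch. The per-activation figure comes from the post; the transcript length and the number of positions explained are assumptions I'm making for illustration.

```python
# Back-of-envelope cost. "Hundreds of tokens per activation" is from the
# post; the transcript size and positions explained are my assumptions.
transcript_tokens   = 2_000   # one moderately long conversation
positions_explained = 2_000   # one activation per token position, one layer
tokens_per_nla      = 300     # "hundreds of tokens for every single activation"

print(positions_explained * tokens_per_nla)  # 600,000 generated tokens,
# a 300x blow-up over the transcript itself, before touching other layers
```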
Why this matters more than the usual interpretability work
Most interpretability output is for researchers. NLAs produce English. That's the shift. A safety reviewer doesn't need to learn what a feature direction is or read attribution graphs to make use of the output.
Whether that's good depends on whether you trust the explanations. Anthropic clearly doesn't, not fully. They are shipping the tool, releasing the code, and telling people not to over-rely on it in the same breath. Which is, frankly, the right framing.
The next test is whether NLAs hold up on bigger models with messier behaviors, and whether the cost comes down enough to run them on full conversations rather than spot checks. Anthropic says they are working on both.