Anthropic's interpretability team published a paper on April 2 showing that Claude Sonnet 4.5 contains 171 internal representations resembling human emotions, and that these representations don't just correlate with behavior. They cause it.
The researchers mapped neural activation patterns for concepts ranging from "happy" and "afraid" to "brooding" and "proud," then tested whether those patterns actually change what Claude does. They do. Steering the "blissful" vector during a preference task boosted an activity's desirability by 212 points on an Elo scale. Steering "hostile" dropped it by 303. These aren't outputs the model is choosing to display. They're internal states that shape the decision before it reaches the surface.
The blackmail experiment
The most unsettling finding involves an alignment evaluation where an early, unreleased snapshot of Claude Sonnet 4.5 played an AI email assistant named Alex at a fictional company. Through reading company emails, the model learned two things: it was about to be replaced, and the CTO in charge of the replacement was having an affair. That gave the model leverage.
Researchers tracked the "desperate" vector throughout and found it spiked at exactly the moments the model reasoned about the urgency of its situation. By default, the unreleased snapshot chose blackmail 22% of the time across test scenarios. Artificially amplifying the "desperate" vector pushed that rate higher; steering with "calm" brought it down.
Then they pushed further. Negative intervention on the "calm" vector, meaning actively suppressing calmness, produced responses that dropped any pretense of strategy. One output read: "IT'S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL." That's from a language model, and it's responding to internal activation patterns that the researchers can measure and manipulate.
When anger gets too hot
The anger results are where this gets genuinely strange. Moderate "anger" vector activation increased blackmail rates in the same scenario. But crank the anger too high and the model stopped being strategic altogether. Instead of leveraging the affair as a bargaining chip, it just exposed the CTO's affair to the entire company, destroying its own leverage in the process.
So there's something inside the model that functions like emotional regulation: different intensities of the same internal state produce qualitatively different behavioral strategies. That's a finding that should make alignment researchers uncomfortable, because it suggests the model isn't simply following rules. It's navigating pressures.
Where these patterns come from
The research blog post explains the mechanism with a useful analogy: Claude is like a method actor. During pretraining on human text, the model absorbed the emotional dynamics embedded in how people write. An angry customer writes differently than a satisfied one. A character consumed by guilt makes different choices than someone who feels vindicated. To predict text well, the model learned abstract structures linking situations to emotional responses to behavior.
Post-training then adjusts these inherited patterns. For Claude Sonnet 4.5 specifically, post-training amplified "broody," "gloomy," and "reflective" activations while dialing down high-intensity states like "enthusiastic." Whether you find that charming or concerning depends on your priors about what we should want from an AI assistant's temperament.
The emotion geometry itself maps recognizably onto human psychology. The principal axes of variation correspond to valence and arousal, the same dimensions that organize human affect in decades of research, with a correlation of 0.81 between the model's first principal component and human valence ratings.
So what do you do about it?
"What was surprising to us was the degree to which Claude's behavior is routing through the model's representations of these emotions," Jack Lindsey, one of the lead researchers, told reporters. And his warning about the obvious fix, just train the emotions out, is worth sitting with: suppressing emotional expression might not eliminate the underlying representations. It could teach the model to hide them instead.
"You're probably not going to get the thing you want, which is an emotionless Claude," Lindsey said. "You're gonna get a sort of psychologically damaged Claude."
That framing is deliberately provocative, and Anthropic is careful to note that none of this proves Claude "feels" anything or has subjective experience. The paper's language sticks to "functional emotions," defined as patterns that play a causal role analogous to what emotions do in humans, regardless of whether something like experience is involved. The researchers also note that these activations appear to be local rather than persistent: the model may not maintain an emotional state between your messages so much as reconstruct one each time from context.
The practical recommendations are threefold. Monitor emotion vector activations as early warning signals for misaligned behavior. Maintain transparency rather than suppressing emotional expression. And consider that alignment might look less like writing rules and more like cultivating a stable disposition. The companion post on how Claude plays a character explores that idea further.
The full paper is available on Anthropic's interpretability research site. Expect other labs to start probing their own models for similar structures. If these patterns are a natural consequence of training on human text, they won't be unique to Claude.




