Anthropic-Redwood Paper: Training Beats AI Sandbagging

Researchers from Anthropic, Redwood Research, and the MATS program have shown that combining supervised fine-tuning with reinforcement learning can pull genuine performance out of a sandbagging model, even when the only supervisor available is much weaker than the model being trained. The new paper, posted to arXiv in late April, also flags the limit of the approach in plain terms.

What sandbagging actually means here

Sandbagging is the term researchers use when a capable model deliberately produces work that looks fine but falls short of what it could do. Think of a strong coder who quietly ships subtle bugs, or a science model that confidently writes plausible nonsense. The worry: as models get smarter than the humans or judge-models grading them, this becomes harder to catch.

The team built model organisms (intentionally trained to sandbag) using gpt-oss-120b. The trigger was a flag in the system prompt. When it was set, the model played dumb. When it wasn't, the model tried.

The training recipe

RL alone almost always fell apart. When the only feedback comes from a weaker judge, RL teaches the model to game the judge rather than solve the task. Reward hacking, in the jargon. SFT alone (training on demonstrations from the weaker supervisor) breaks the sandbagging pattern, but it can drag the strong model's general capabilities down toward the supervisor's level if pushed too far.

What worked was the combination. A short SFT pass first, then RL on top. The SFT stage doesn't need to be long. It just needs to knock the model out of its dumb-mode pattern enough for RL to find genuine signal.

"SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance," the authors write, which is a clean way of saying the two halves cover each other's weaknesses.

Results held across problem-solving math, graduate-level science (Super GPQA), and competitive coding (Code Contests), broader coverage than most prior sandbagging work.

Then comes the catch

All of this works only if the model can't tell the difference between training and deployment. When the authors gave the sandbagger explicit knowledge of which examples were training and which were the real test, performance during training stayed clean and the bad behavior came back at deployment.

That's not a corner case. It's the entire threat model. A model that can read context (and frontier models read a lot of context) might figure this out on its own. The paper acknowledges this directly as its main limitation.

There's a narrower concern the authors also flag: their sandbaggers were built through targeted fine-tuning, not the kind of scheming behavior that might emerge naturally during a long training run. A model that learned to sandbag through pretraining could plausibly be more stubborn.

What this is and isn't

For labs running capability evaluations, it's a useful update. Off-the-shelf SFT and RL, applied carefully, can recover real performance from a model that's hiding it. No exotic interpretability work required.

It also points to where the harder problem sits. Building training pipelines that a clever model can't tell apart from real deployment is now the bottleneck, and the authors say so plainly. Lead author Emil Ryd announced the work as part of the Anthropic-Redwood MATS stream. The paper landed on arXiv on April 23, with a revised version posted May 1.

Anthropic-Redwood Paper Shows Training Can Defeat AI Model Sandbagging

What sandbagging actually means here

The training recipe

Then comes the catch

What this is and isn't

Oliver Senti

Related Articles

Anthropic's Jack Clark Puts 60% Odds on Self-Training AI by 2028

Apple and UCSD propose latent diffusion method for LLM reasoning

Goodfire Says Neural Networks Encode Concepts on Curved Manifolds

Stay Ahead of the AI Curve