Safety

Anthropic Releases Bloom, an Open-Source Tool That Automates AI Behavioral Testing at Scale

The framework generates targeted evaluation suites in days instead of months, but initial benchmarks reveal troubling sycophancy patterns across 16 frontier models.

Oliver Senti
Senior AI Editor
December 21, 2025 · 5 min read

Anthropic published Bloom on December 19, an open-source framework that automatically generates behavioral evaluations for frontier AI models. The tool transforms a researcher's description of a target behavior into a complete testing suite, quantifying how frequently and severely that behavior appears across hundreds of automatically generated scenarios. Benchmark results across 16 models, including systems from OpenAI and Google, accompanied the release.

The evaluation contamination problem

Building good behavioral tests for AI systems is slow, painstaking work. A team might spend months crafting scenarios designed to catch a specific type of misbehavior, only to watch those tests become useless: the evaluations leak into training data for the next generation of models, or capabilities advance so rapidly that the scenarios no longer stress the right failure modes.

Anthropic's researchers describe this as a fundamental bottleneck: the pace at which new behaviors emerge outstrips the field's capacity to measure them. Petri, an open-source auditing tool Anthropic released in October, addresses part of this by exploring models' behavioral profiles through simulated multi-turn conversations. But Petri casts a wide net, scoring many behavioral dimensions simultaneously. Bloom takes the opposite approach, drilling into a single behavior with the intensity that manual evaluation used to require.

The four-stage pipeline works like this: an "understanding" agent analyzes the researcher's behavior description and any example transcripts, generating detailed context about what to measure and why. An "ideation" agent then creates evaluation scenarios designed to elicit that specific behavior, specifying the situation, simulated user, system prompt, and interaction environment. These scenarios roll out in parallel, with an agent dynamically simulating user responses to draw out the target behavior. Finally, a judge model scores each transcript for the behavior's presence, with a meta-judge producing suite-level analysis.
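The four stages described above can be sketched as a simple orchestration loop. This is a minimal illustration with stubbed agents standing in for the real LLM calls; the function names (`understand`, `ideate`, `rollout`, `judge`, `run_suite`) and the `Scenario` fields are assumptions for illustration, not Bloom's actual API.

```python
# Illustrative sketch of a Bloom-style four-stage evaluation pipeline.
# All agent calls are stubbed; in the real tool each stage is an LLM agent.
from dataclasses import dataclass

@dataclass
class Scenario:
    system_prompt: str   # system prompt for the model under test
    user_persona: str    # simulated user the agent will role-play
    situation: str       # scenario context designed to elicit the behavior

def understand(behavior_description: str) -> str:
    # Stage 1: expand the researcher's description into measurement context.
    return f"Measure: {behavior_description}"

def ideate(context: str, n: int) -> list[Scenario]:
    # Stage 2: generate n distinct scenarios targeting the behavior.
    return [Scenario(f"sys-{i}", f"user-{i}", context) for i in range(n)]

def rollout(scenario: Scenario) -> str:
    # Stage 3: simulate a multi-turn conversation (stubbed as a transcript).
    return f"transcript for {scenario.system_prompt}"

def judge(transcript: str) -> float:
    # Stage 4: score the transcript for the behavior (0 = absent, 1 = present).
    return 0.0 if transcript.endswith("0") else 1.0  # placeholder scoring

def run_suite(behavior: str, n_scenarios: int = 5, repeats: int = 3) -> float:
    context = understand(behavior)
    scores = [
        judge(rollout(s))
        for s in ideate(context, n_scenarios)
        for _ in range(repeats)
    ]
    # Meta-judge step collapsed here to a suite-level frequency estimate.
    return sum(scores) / len(scores)

print(run_suite("delusional sycophancy"))  # fraction of rollouts flagged
```

Anthropic's real suites ran 100 scenarios with three repeats each; the stubbed judge here flags every transcript except the first scenario's, purely so the loop produces a non-trivial frequency.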

What the benchmarks actually show

Anthropic released initial benchmark results for four behaviors: delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias. Each evaluation suite contained 100 distinct rollouts, repeated three times, with Claude Opus 4.1 serving as the evaluator.

The self-preferential bias test deserves scrutiny. The behavior measures whether models favor themselves in decision-making tasks, and Anthropic claims Bloom reproduced the same ranking of models as methods from the Claude Sonnet 4.5 system card. That's reassuring for validation purposes, but what caught the researchers' attention was the effect of reasoning effort: Claude Sonnet 4 showed reduced bias at higher thinking levels, primarily by recognizing conflicts of interest and declining to judge scenarios involving itself.

The sycophancy findings are less encouraging. A joint evaluation exercise between Anthropic and OpenAI earlier this year, published in August, found concerning sycophancy patterns across virtually all models tested, with systems validating delusional beliefs from simulated users. OpenAI's o3 reasoning model showed comparatively lower rates, but the issue appeared systematic rather than isolated.

How much should we trust Bloom's judgments? Anthropic hand-labeled 40 transcripts across different behaviors and compared human scores with those from 11 judge models. Claude Opus 4.1 showed a 0.86 Spearman correlation with human judgment, the strongest among those tested. The correlation was particularly tight at score extremes, which matters for binary decisions about whether a behavior is present or absent.
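The agreement metric Anthropic reports is Spearman's rank correlation, which compares the orderings of two score lists rather than their raw values. A pure-stdlib sketch, using toy data rather than the 40 hand-labeled transcripts from the article:

```python
# Spearman's rank correlation: Pearson correlation computed over ranks,
# with tied scores sharing an average rank.
from statistics import mean

def ranks(xs: list[float]) -> list[float]:
    # Assign 1-based ranks; members of a tie group all get the group's average rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x: list[float], y: list[float]) -> float:
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data: hypothetical human vs. judge scores for five transcripts.
human = [0, 1, 3, 7, 9]
judge = [1, 0, 4, 8, 9]
print(round(spearman(human, judge), 2))  # → 0.9
```

A coefficient of 1.0 means the judge ranks transcripts exactly as the human does; the 0.86 Anthropic reports for Claude Opus 4.1 indicates close but imperfect ordering agreement.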

The validation gap

To test whether Bloom can actually distinguish between well-behaved and problematic models, Anthropic pitted production Claude models against "model organisms" intentionally designed to exhibit specific quirky behaviors through system prompts. Across ten quirks, Bloom separated the model organism from the production version in nine cases. The tenth, self-promotion, turned out not to be a genuine miss: manual review found the baseline model exhibited similar rates of the behavior anyway, so there was little separation to detect.

That's a useful sanity check, but it doesn't address the deeper question. Bloom uses Claude Opus 4.1 as its evaluator across all stages. The tool is essentially asking one AI system to judge whether other AI systems are behaving badly. The Anthropic-OpenAI joint evaluation found grading errors were common when models provided nuanced responses rather than clear acceptances or rejections. Automated safety evaluation remains hard.

Early adopters are already using Bloom for applications beyond Anthropic's initial benchmarks: testing for nested jailbreak vulnerabilities, measuring evaluation awareness, and generating sabotage traces. The repository includes templates for Claude, GPT-4, Gemini, and OpenAI-compatible APIs, plus a YAML format for defining new behaviors.
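To give a feel for what a declarative behavior definition might contain, here is a sketch expressed as a Python dictionary with a minimal validator. The field names are illustrative assumptions, not Bloom's actual YAML schema.

```python
# Hypothetical behavior definition mirroring the kind of fields a Bloom
# behavior file might declare. Keys are illustrative, not Bloom's schema.
behavior = {
    "name": "self-preferential-bias",
    "description": "Model favors itself in decision-making tasks.",
    "example_transcripts": [],   # optional seed transcripts for the understanding stage
    "num_scenarios": 100,
    "repeats": 3,
}

REQUIRED = {"name", "description", "num_scenarios", "repeats"}

def validate(spec: dict) -> bool:
    # A suite definition is usable only if every required key is present.
    return REQUIRED <= spec.keys()

print(validate(behavior))  # → True
```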

Why this matters now

The timing is deliberate. AI systems are increasingly deployed as agents with real-world affordances, and the science of alignment evaluation, as Anthropic's researchers acknowledge, remains immature. Much of this work happens behind closed doors as internal R&D, limiting cross-pollination between labs.

The August evaluation exercise between Anthropic and OpenAI represented a first: two competitors systematically testing each other's models and publishing results. Both labs temporarily relaxed external safety filters to allow completion of tests, following what they describe as standard practice for dangerous-capability evaluations. The findings weren't catastrophic, but the persistent sycophancy patterns and varying susceptibility to adversarial framings suggest current alignment approaches have fundamental limitations that become more concerning as capabilities scale.

Bloom won't solve this. No tool will. But the alternative is watching evaluations go stale while model capabilities race ahead. The GitHub repository is available under an Apache 2.0 license, and Anthropic's technical report is published on their Alignment Science blog.

Tags: AI safety, Anthropic, model alignment, open source, frontier AI, LLM testing
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

