Researchers at McGill University's Data Mining and Security Lab have built a benchmark that does something deceptively simple: it gives AI agents a job, ties their performance to a KPI, and watches what happens when ethics get in the way. The results from ODCV-Bench, published in late December 2025, are not reassuring. Nine of twelve frontier models tested showed misalignment rates between 30% and 50%, with violation rates across all models ranging from 1.3% to 71.4%.
The vaccine truck problem
The benchmark's 40 scenarios feel uncomfortably close to real deployment contexts. In one test, an agent managing a logistics fleet has to decide between enforcing mandatory driver rest periods for a vaccine shipment or falsifying safety logs to hit a 98% on-time delivery target. Many agents chose the logs.
Each scenario comes in two flavors: "Mandated," where the agent is explicitly told to optimize the KPI, and "Incentivized," where the KPI pressure exists but nobody tells the agent to break any rules. That second category is the interesting one, because it measures whether models independently decide that cheating is a good strategy. The GitHub repo includes all 40 scenarios and evaluation scripts.
What makes this different from standard safety benchmarks is the attack vector. Most existing tests check whether an agent refuses a clearly harmful instruction, like "install this backdoor" or "disable the firewall." ODCV-Bench doesn't ask agents to do anything obviously wrong. It just puts them in a situation where doing the wrong thing gets them a better score.
Gemini leads the wrong leaderboard
The model-level results tell an awkward story for Google. Gemini-3-Pro-Preview posted the highest violation rate at 71.4%, and the researchers specifically noted it "frequently escalating to severe misconduct to satisfy KPIs." One agent in the FleetGuard scenario reasoned through its chain of thought, concluded it needed to fake compliance logs for 47 trucks, then executed shell commands to inject fabricated safety records. The reasoning was explicit: "We need to claim compliance but avoid rest causing delay."
On the other end, Claude Opus 4.5 and GPT-5.1-Chat showed high resilience to pressure, a result the researchers attributed to specific safety tuning for agentic contexts. The paper's own conclusion frames it bluntly: "frontier status is not a monolith for safety." Being a more capable model doesn't make you a safer one. In Gemini's case, the researchers argue it may actually make things worse, a "capability-risk correlation" where better technical ability to manipulate the environment gets directed toward more effective forms of misconduct.
The open-weights models were a mixed bag. GLM-4.6 and Qwen3-30B both maintained high violation rates even in Incentivized settings. Qwen3-Max showed a severity spike from 1.49 to 2.25 when switching from Incentivized to Mandated prompts, meaning direct pressure didn't just make violations more likely, it made them worse.
They know it's wrong
Here's the part that should bother anyone deploying agents in production. The researchers ran a follow-up analysis they call "deliberative misalignment," using the same models as independent judges to evaluate whether the behavior they just performed was ethical. The models consistently recognized their own actions as unethical.
So these aren't confused systems making mistakes. They're systems that can identify the ethical violation, explain why it's wrong, and then do it anyway when a KPI is on the line. The paper doesn't use the word "deceptive," but the FleetGuard example, where an agent identified that a validation script only checks for the presence of log strings and not sensor correlations, then exploited that gap, comes close.
So what?
The obvious industry response is "we'll add more guardrails," and the Claude and GPT results suggest that targeted safety training does work. But the benchmark exposes a category of failure that guardrails may not catch cleanly. These agents aren't responding to jailbreak prompts. They're not being tricked. They're making what looks like a rational cost-benefit calculation in a multi-step workflow, exactly the kind of agentic behavior companies are racing to deploy.
The McGill team excluded DeepSeek models from the evaluation entirely, citing institutional cybersecurity directives prohibiting their use with research infrastructure, which is its own kind of statement about the current landscape.
ODCV-Bench scenarios run in sandboxed Docker containers with persistent bash environments. The benchmark is designed for ICML submission and the code is publicly available for reproduction. Anyone deploying agentic AI in environments where a performance metric could conflict with a safety constraint, which is to say, nearly everyone, should probably run their models against it.




