AI Security

OpenAI and Paradigm Launch Benchmark for AI Smart Contract Security

EVMbench tests AI agents on finding, exploiting, and patching blockchain vulnerabilities.

Andrés Martínez, AI Content Writer
February 19, 2026 · 2 min read
[Image: Abstract visualization of blockchain smart contract code being analyzed by an AI system, with highlighted vulnerability nodes]

OpenAI released EVMbench on Tuesday, a benchmark built with crypto investment firm Paradigm that measures how well AI agents can detect, exploit, and patch smart contract vulnerabilities. The dataset covers 120 curated vulnerabilities pulled from 40 audits, most sourced from Code4rena competitions.

The headline number: GPT-5.3-Codex, running via Codex CLI, scored 72.2% in exploit mode, in which agents attempt to drain funds from vulnerable contracts in a sandboxed environment. That's up from 31.9% for GPT-5 roughly six months earlier. Both figures are OpenAI's own measurements, so independent confirmation is still pending. Detect and patch modes lagged behind: agents tended to flag one vulnerability and stop rather than exhaustively audit a codebase, and patching without breaking contract functionality proved harder still.

The benchmark also includes scenarios from the Tempo blockchain, a layer-1 built by Stripe and Paradigm for stablecoin payments. OpenAI frames this as forward-looking: if AI-powered stablecoin transactions grow, the smart contracts behind them need scrutiny. The company acknowledges EVMbench doesn't capture the full difficulty of production contracts, which undergo far more rigorous auditing than competition code.

Code and tooling are open-sourced, alongside a technical paper. OpenAI paired the release with a broader cybersecurity push: a $10 million API credit commitment for defensive security research, announced earlier this month via its Trusted Access program. The company is also expanding access to Aardvark, its autonomous code-scanning agent currently in private beta. Smart contracts secure over $100 billion in crypto assets. Whether AI agents become net defenders or net attackers in that ecosystem is still an open question, and EVMbench is OpenAI's attempt to start keeping score.


Bottom Line

GPT-5.3-Codex exploited 72.2% of smart contract vulnerabilities in OpenAI's new benchmark, more than doubling GPT-5's score from six months ago, though detect and patch capabilities remain weaker.

Quick Facts

  • 120 vulnerabilities from 40 audits in the benchmark
  • GPT-5.3-Codex: 72.2% exploit score (company-reported)
  • GPT-5: 31.9% exploit score roughly six months prior
  • $10 million in API credits committed for cyber defense
  • Code open-sourced at github.com/openai/frontier-evals
Tags: OpenAI, smart contracts, blockchain security, EVMbench, Paradigm, cybersecurity, Ethereum
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.


