OpenAI on Tuesday released EVMbench, a benchmark built with crypto investment firm Paradigm that measures how well AI agents can detect, exploit, and patch smart contract vulnerabilities. The dataset covers 120 curated vulnerabilities drawn from 40 audits, most sourced from Code4rena competitions.
The headline number: GPT-5.3-Codex, running via Codex CLI, scored 72.2% in the exploit mode, where agents attempt to drain funds from contracts in a sandboxed environment. That's up from 31.9% for GPT-5 roughly six months earlier. Both figures are OpenAI's own measurements, so independent confirmation is still pending. Detect and patch modes lagged behind. Agents tended to flag one vulnerability and stop rather than exhaustively auditing a codebase, and patching without breaking contract functionality proved harder still.
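To make "draining funds" concrete, here is a toy sketch of reentrancy, one of the classic smart contract vulnerability classes an exploit-mode agent hunts for. This is not from EVMbench (whose tasks use real Solidity contracts in a sandboxed EVM); it is a hypothetical Python analogue, and all names in it are invented for illustration.

```python
class VulnerableVault:
    """Toy vault with the classic reentrancy flaw: it pays out
    before zeroing the withdrawer's balance."""

    def __init__(self, funds):
        self.funds = funds      # total assets held by the vault
        self.balances = {}      # per-user deposits

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.funds += amount

    def withdraw(self, user, callback):
        amount = self.balances.get(user, 0)
        if amount > 0 and self.funds >= amount:
            self.funds -= amount   # payout ("external call") happens...
            callback(amount)       # ...before the balance is zeroed,
            self.balances[user] = 0  # so the callback can re-enter


def drain(vault, attacker="attacker", seed=10):
    """Re-enters withdraw() from the payout callback until the
    vault runs dry; returns the total amount extracted."""
    loot = []

    def on_payout(amount):
        loot.append(amount)
        if vault.funds >= amount:
            vault.withdraw(attacker, on_payout)  # re-entrant call

    vault.deposit(attacker, seed)
    vault.withdraw(attacker, on_payout)
    return sum(loot)
```

A small deposit then lets the attacker pull funds repeatedly: `drain(VulnerableVault(100), seed=10)` extracts 110 and empties the vault. EVMbench's exploit mode scores an agent on pulling off this kind of attack end to end against real contract code in a sandbox.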
The benchmark also includes scenarios from the Tempo blockchain, a layer-1 built by Stripe and Paradigm for stablecoin payments. OpenAI frames this as forward-looking: if AI-powered stablecoin transactions grow, the smart contracts behind them need scrutiny. The company acknowledges EVMbench doesn't capture the full difficulty of production contracts, which undergo far more rigorous auditing than competition code.
Code and tooling are open-sourced, alongside a technical paper. OpenAI paired the release with a broader cybersecurity push: a $10 million API credit commitment for defensive security research, announced earlier this month via its Trusted Access program. The company is also expanding access to Aardvark, its autonomous code-scanning agent currently in private beta. Smart contracts secure over $100 billion in crypto assets. Whether AI agents become net defenders or net attackers in that ecosystem is still an open question, and EVMbench is OpenAI's attempt to start keeping score.
Bottom Line
GPT-5.3-Codex exploited 72.2% of smart contract vulnerabilities in OpenAI's new benchmark, more than doubling GPT-5's score from six months ago, though detect and patch capabilities remain weaker.
Quick Facts
- 120 vulnerabilities from 40 audits in the benchmark
- GPT-5.3-Codex: 72.2% exploit score (company-reported)
- GPT-5: 31.9% exploit score roughly six months prior
- $10 million in API credits committed for cyber defense
- Code open-sourced at github.com/openai/frontier-evals