OpenAI on Tuesday released EVMbench, a benchmark built with crypto investment firm Paradigm that measures how well AI agents can detect, exploit, and patch smart contract vulnerabilities. The dataset covers 120 curated vulnerabilities drawn from 40 audits, most sourced from Code4rena competitions.
The headline number: GPT-5.3-Codex, running via Codex CLI, scored 72.2% in the exploit mode, where agents attempt to drain funds from contracts in a sandboxed environment. That's up from 31.9% for GPT-5 roughly six months earlier. Both figures are OpenAI's own measurements, so independent confirmation is still pending. Detect and patch modes lagged behind. Agents tended to flag one vulnerability and stop rather than exhaustively auditing a codebase, and patching without breaking contract functionality proved harder still.
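To make "draining funds" concrete, here is a toy sketch of reentrancy, one of the classic smart contract vulnerability classes an exploit-mode agent hunts for. This is not from EVMbench (whose tasks use real Solidity contracts in a sandboxed EVM); it is a hypothetical Python analogue, and all names in it are invented for illustration.

```python
class VulnerableVault:
    """Toy vault with the classic reentrancy flaw: it pays out
    before zeroing the withdrawer's balance."""

    def __init__(self, funds):
        self.funds = funds      # total assets held by the vault
        self.balances = {}      # per-user deposits

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.funds += amount

    def withdraw(self, user, callback):
        amount = self.balances.get(user, 0)
        if amount > 0 and self.funds >= amount:
            self.funds -= amount   # payout ("external call") happens...
            callback(amount)       # ...before the balance is zeroed,
            self.balances[user] = 0  # so the callback can re-enter


def drain(vault, attacker="attacker", seed=10):
    """Re-enters withdraw() from the payout callback until the
    vault runs dry; returns the total amount extracted."""
    loot = []

    def on_payout(amount):
        loot.append(amount)
        if vault.funds >= amount:
            vault.withdraw(attacker, on_payout)  # re-entrant call

    vault.deposit(attacker, seed)
    vault.withdraw(attacker, on_payout)
    return sum(loot)
```

A small deposit then lets the attacker pull funds repeatedly: `drain(VulnerableVault(100), seed=10)` extracts 110 and empties the vault. EVMbench's exploit mode scores an agent on pulling off this kind of attack end to end against real contract code in a sandbox.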
The benchmark also includes scenarios from the Tempo blockchain, a layer-1 built by Stripe and Paradigm for stablecoin payments. OpenAI frames this as forward-looking: if AI-powered stablecoin transactions grow, the smart contracts behind them need scrutiny. The company acknowledges EVMbench doesn't capture the full difficulty of production contracts, which undergo far more rigorous auditing than competition code.
Code and tooling are open-sourced, alongside a technical paper. OpenAI paired the release with a broader cybersecurity push: a $10 million API credit commitment for defensive security research, announced earlier this month via its Trusted Access program. The company is also expanding access to Aardvark, its autonomous code-scanning agent currently in private beta. Smart contracts secure over $100 billion in crypto assets. Whether AI agents become net defenders or net attackers in that ecosystem is still an open question, and EVMbench is OpenAI's attempt to start keeping score.
Bottom Line
GPT-5.3-Codex exploited 72.2% of smart contract vulnerabilities in OpenAI's new benchmark, more than doubling GPT-5's score from six months ago, though detect and patch capabilities remain weaker.
Quick Facts
- 120 vulnerabilities from 40 audits in the benchmark
- GPT-5.3-Codex: 72.2% exploit score (company-reported)
- GPT-5: 31.9% exploit score roughly six months prior
- $10 million in API credits committed for cyber defense
- Code open-sourced at github.com/openai/frontier-evals