
Every Frontier LLM Scores Zero on Meta's New ProgramBench

Meta, Stanford, and Harvard researchers built a coding benchmark every frontier LLM fails. Claude Opus 4.7 included.

Oliver Senti, Senior AI Editor
May 6, 2026 · 3 min read
[Image: abstract visualization of a compiled binary being deconstructed into source code modules and architectural diagrams]

Researchers behind SWE-bench just shipped a new coding benchmark, and every frontier model in their first batch scored exactly zero. ProgramBench, released by a team from Meta Superintelligence Labs, Stanford, and Harvard, gives an agent a compiled binary plus its documentation and asks it to rebuild the program from scratch. No source code. No internet. No decompilation tools.

What the agent actually has to do

The setup is brutally minimal. An agent gets the executable with run-only permissions, reads the docs, and has to architect a complete codebase that reproduces the binary's behavior, including picking the language and writing the build script. That's it. The technical paper describes 200 tasks ranging from small utilities like jq and ripgrep up to FFmpeg, SQLite, and the PHP interpreter.

Submitted programs get graded against more than 248,000 behavioral tests generated by agent-driven fuzzing. To pass a task, every test for that task has to pass. That's the bar.
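
In practice, each behavioral test boils down to differential testing: run the reference binary and the rebuilt program on the same fuzz-generated input and compare what the grader treats as observable behavior. The paper's exact test format isn't reproduced in this article, so the sketch below, with its illustrative behaviors_match helper, paths, and comparison rules, is an assumption about how such a check could look rather than the benchmark's actual harness.

    import subprocess

    def behaviors_match(reference_bin, candidate_bin, argv, stdin_bytes=b""):
        """Differential behavioral test: run both programs on one fuzzed input.

        Illustrative sketch only; the real ProgramBench harness and its
        comparison rules are not described in detail in this article.
        """
        def run(binary):
            return subprocess.run(
                [binary, *argv],
                input=stdin_bytes,
                capture_output=True,
                timeout=30,
            )

        ref, cand = run(reference_bin), run(candidate_bin)
        # Treat exit code and stdout as the observable behavior; stderr is
        # ignored in this sketch.
        return ref.returncode == cand.returncode and ref.stdout == cand.stdout

    # Hypothetical usage: one of many fuzz-generated cases for a jq-like task.
    # ok = behaviors_match("./reference/jq", "./rebuilt/jq",
    #                      ["-c", ".items[0]"], b'{"items": [1, 2, 3]}')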

The zero, with caveats

Every model on the public leaderboard sits at 0% fully resolved. Claude Opus 4.7 leads the table at zero. GPT 5.4 ties it. So do Gemini 3.1 Pro, Sonnet 4.6, and the rest: nine identical zeros down the list.

Headline numbers obscure what's actually happening, though. ProgramBench reports a secondary "almost resolved" metric, where a solution passes at least 95% of the behavioral tests. Opus 4.7 hits 3% there. Opus 4.6 hits 2.5%. Sonnet 4.6 hits 1%. The rest stay at zero even on this looser metric. So the models aren't all failing equally; they're just all failing the strictest version of the test.
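
Put differently, the two leaderboard numbers come from the same per-task scores with different thresholds: a task counts as fully resolved only at 100% of its behavioral tests, and as almost resolved at 95% or better. A back-of-the-envelope sketch of that bookkeeping follows; the official scoring code may differ in details.

    def leaderboard_metrics(pass_fractions, almost_threshold=0.95):
        """Turn per-task test pass fractions into the two leaderboard rates.

        pass_fractions: fraction of behavioral tests passed on each task,
        e.g. [1.0, 0.98, 0.05, ...]. Back-of-the-envelope sketch based on the
        metric definitions above, not the official scoring code.
        """
        n = len(pass_fractions)
        fully = sum(f == 1.0 for f in pass_fractions) / n
        almost = sum(f >= almost_threshold for f in pass_fractions) / n
        return fully, almost

    # Example: clearing 95% of tests on 6 of 200 tasks while acing none of them
    # gives 0% fully resolved and 3% almost resolved, Opus 4.7's reported line.
    # leaderboard_metrics([0.98] * 6 + [0.3] * 194)  ->  (0.0, 0.03)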

Drill into individual tasks and you see real partial progress. Models reach 98% on jq's behavioral tests, 91% on Brotli, 67% on SQLite. Then 5% on FFmpeg and PHP. The architecture-heavy stuff buries them.

Why architecture is the wall

This is the angle that matters. Most coding benchmarks hand the model a scaffold: function signatures, file layouts, a natural-language spec. ProgramBench gives none of that. The agent has to decide what abstractions to introduce, how to split modules, what interfaces to expose. Free-form repository design, not autocomplete.

The authors went out of their way to make cheating impossible. Sandboxed containers with no internet. Binaries marked execute-only so an agent can't run strings or objdump. In early trials without these locks, models just cloned the source from GitHub or pulled it through a package manager, which says something about how today's coding agents actually solve problems when nobody's watching.
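
On Linux, execute-only comes down to ordinary file permissions: a binary whose mode has only the execute bits set can still be run, but a non-root sandbox user can't open it for reading, which is what blocks strings, objdump, or simply copying the bytes out. A minimal sketch of the idea, assuming the harness rather than the agent owns the file and a hypothetical path inside the container:

    import os
    import subprocess

    binary = "/task/reference_bin"        # hypothetical path inside the sandbox

    # Done by the harness (the file's owner), before the agent takes over.
    os.chmod(binary, 0o111)               # --x--x--x: execute bits only, no read

    subprocess.run([binary, "--help"])    # the agent can still run the program

    try:
        with open(binary, "rb") as f:     # ...but cannot read its bytes, so
            f.read()                      # strings/objdump/copying all fail
    except PermissionError:
        print("read blocked: binary is execute-only")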

What's next

A submission portal is "coming soon," per the project page. The benchmark uses mini-SWE-agent as its harness, the same minimal scaffold used by SWE-bench Verified, so other groups can plug in their own models. The GitHub repo hosts the task definitions and evaluation code. Sonnet 4.5 runs reportedly cost up to $5,000 per evaluation, which is its own data point about how saturated this space isn't.

Tags: ProgramBench, LLM benchmark, SWE-bench, AI coding, Meta AI, Claude Opus 4.7, GPT-5, code generation, coding agents
Oliver Senti, Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
