Google LEAP Solves All 12 Putnam 2025 Problems in Lean

A team from Google Cloud AI Research and DeepMind published a paper on June 2nd describing LEAP, an agentic system that wraps a general-purpose language model around the Lean compiler and gets it to write formal, machine-checkable proofs. The headline result: it solved all 12 problems from the 2025 Putnam Competition, the brutal annual undergraduate math contest where the 2025 median score was 2 out of 120. The technical report credits Gemini-3.1-pro as the backend, with no fine-tuned prover doing the heavy lifting.

That last part is the actual claim worth chewing on. The whole field has spent the last two years assuming you need a specialized model, something trained directly on Lean corpora, to do this. AlphaProof, DeepSeek-Prover-V2, Goedel-Prover-V2, Seed-Prover: all fine-tuned, all built on the premise that a general chatbot can't be trusted with a kernel-verified proof. LEAP's pitch is that the premise was wrong, or at least that it was measuring the wrong thing.

What's the trick, then?

Instead of asking the model to one-shot a complete proof (which, the authors note, falls apart fast on anything long), LEAP builds the proof as a graph. It drafts an informal blueprint, decomposes the theorem into supporting lemmas, and stores everything as an AND-OR directed acyclic graph so that lemmas proved in one branch can get reused in another rather than rediscovered. The Lean compiler sits in the loop the whole time, rejecting bad steps and feeding errors back for correction. The name unpacks to LLM-in-Lean Environment Agentic Prover, which tells you roughly everything about the design philosophy.
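To make the graph idea concrete, here is a minimal sketch of an AND-OR DAG in which one lemma is shared by two alternative proof plans. It is not LEAP's code, which hasn't been released. The `Node` class, the `is_proved` function, and every lemma name are hypothetical, chosen only for illustration. An OR node holds if any child holds (alternative strategies), an AND node holds only if all children hold (every lemma in a plan), and a cache ensures a shared lemma is checked once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                      # "LEAF", "AND", or "OR"
    children: list = field(default_factory=list)
    proved: bool = False           # only meaningful for LEAF nodes


def is_proved(node, cache, leaf_checks):
    """Evaluate a node, reusing cached results for shared sub-goals."""
    if node.name in cache:
        return cache[node.name]    # already settled in another branch
    if node.kind == "LEAF":
        leaf_checks.append(node.name)   # record each real check
        result = node.proved
    elif node.kind == "AND":
        result = all(is_proved(c, cache, leaf_checks) for c in node.children)
    elif node.kind == "OR":
        result = any(is_proved(c, cache, leaf_checks) for c in node.children)
    else:
        raise ValueError(f"unknown node kind: {node.kind}")
    cache[node.name] = result
    return result


# Hypothetical lemmas: one shared by both plans, one failed, one proved.
shared = Node("lemma_bound", "LEAF", proved=True)
stuck = Node("lemma_induction", "LEAF", proved=False)
done = Node("lemma_parity", "LEAF", proved=True)

plan_a = Node("plan_a", "AND", [shared, stuck])   # fails: one lemma is stuck
plan_b = Node("plan_b", "AND", [shared, done])    # succeeds, reusing `shared`
goal = Node("goal", "OR", [plan_a, plan_b])

cache, leaf_checks = {}, []
goal_ok = is_proved(goal, cache, leaf_checks)
print(goal_ok)        # True
print(leaf_checks)    # ['lemma_bound', 'lemma_induction', 'lemma_parity']
```

The shared lemma appears under both plans but shows up only once in `leaf_checks`, which is the reuse the paragraph above describes. In a real system the leaf check would be a compiler call rather than a stored boolean.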

And here's the number that makes the architecture argument land. On the paper's new benchmark, Lean-IMO-Bench (more on it below), the same general LLM that scores under 10% writing proofs one-shot jumps to 70% once it's running inside LEAP. The bottleneck, in other words, wasn't whether Gemini understands Lean. It was the absence of structured back-and-forth with a verifier.
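The back-and-forth with a verifier can be sketched as a simple propose-check-repair loop. This is a toy under loud assumptions: `prove_with_feedback`, `toy_verify`, and `make_toy_proposer` are hypothetical names, the "verifier" only string-matches against a fixed set of tactic names instead of calling Lean, and the "model" is a canned draft with two typos. It shows the control flow only, not LEAP's actual loop.

```python
# Stand-in for the compiler: accepts only these exact strings.
VALID_STEPS = {"intro n", "induction n", "simp", "omega"}


def toy_verify(steps):
    """Return None if every step is accepted, else an error message."""
    for i, step in enumerate(steps):
        if step not in VALID_STEPS:
            return f"step {i}: unknown tactic {step!r}"
    return None


def make_toy_proposer():
    """Stand-in for the model: repairs whichever step the error names."""
    draft = ["intro n", "inducton n", "smp", "omega"]   # two deliberate typos
    fixes = {"inducton n": "induction n", "smp": "simp"}

    def propose(feedback):
        if feedback is not None:
            # Parse the index out of "step <i>: ..."
            index = int(feedback.split(":")[0].split()[1])
            draft[index] = fixes[draft[index]]
        return list(draft)

    return propose


def prove_with_feedback(propose, verify, max_rounds=5):
    """Loop until the verifier accepts or the round budget runs out."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate, round_no
    return None, max_rounds


result = prove_with_feedback(make_toy_proposer(), toy_verify)
print(result)   # (['intro n', 'induction n', 'simp', 'omega'], 3)
```

A one-shot attempt here is just the first round, which fails. The draft only goes through because each error message tells the proposer which step to fix.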

The benchmark question

Putnam 2025 is a strong demo, but it's worth being skeptical about saturation. So the team also built Lean-IMO-Bench, 60 IMO-style problems formalized in Lean, chosen specifically for short statements with long, non-routine, structurally nasty proofs. The kind of thing where the formalization is easy to state and miserable to actually do.

On that benchmark LEAP hit 70% across difficulty tiers. Specialized automated theorem proving (ATP) models managed 5%. And Aristotle, Harmonic's gold-medal-caliber IMO system, came in at 48% in the authors' own evaluation. I'd flag that these are the paper's own numbers, run on a benchmark the same authors designed, so take the head-to-head with the usual grain of salt. Still, 5% to 70% is a big enough gap that methodology quibbles don't erase it.

One thing the chatter around this paper has muddled: the 48% figure is Aristotle on Lean-IMO-Bench, not Putnam. On Putnam 2025 itself, LEAP's claimed 12/12 matches two other systems, the closed-source Axiom and Numina, both of which also report perfect scores. Aristotle's Putnam number isn't the comparison being drawn here.

The irony nobody at Google addressed

Buried in the introduction, the authors take a clean shot at their closed competitors. Axiom and Numina, they write, claim strong Putnam results but remain closed and "scientifically unverifiable." Fair enough. Reproducibility matters, especially in a field where reinforcement learning is notoriously good at finding ways to fake a certified proof.

But LEAP's own framework code isn't released either. What they did publish is the set of Lean solutions, sitting in a GitHub repo, plus the benchmark resources. Anyone can re-run those proofs through the Lean compiler and confirm they check out, which is a real form of verifiability that Axiom and Numina apparently don't offer. So it's not nothing. But you can't rebuild LEAP from what's on offer, which makes the criticism land a little awkwardly. Glass houses, etc.

Granted, releasing verified proof artifacts is a meaningfully higher bar than the competitors cleared. I just think the paper would read better if it acknowledged the asymmetry instead of leaning on it.

Research-level, not just contest math

The part I found most interesting got the least space in the abstract. LEAP autonomously formalized a verified proof for a subproblem in Knuth's work on Hamiltonian decomposition of even-order Cayley graphs, and also touched Erdős Problem 457. Contest problems have known answers somewhere. Formalizing a piece of an open combinatorial question is a different kind of task, and the fact that an agentic loop around a general model can produce a kernel-checked artifact there is the result that actually points somewhere.

Whether this scales is the open question. The Putnam and IMO benchmarks reward a particular shape of problem. Research mathematics mostly does not look like that.

The Lean solutions and Lean-IMO-Bench are public now, so the 12/12 Putnam claim can be independently checked by anyone with a Lean install. The framework itself stays behind the curtain. If Google wants the "general LLMs beat specialized provers" argument to stick, releasing the agent is the obvious next move, and the one everyone's going to be asking for.

Oliver Senti
