Google researchers published a paper this month on LEAP, an agentic system that gets general-purpose language models to write formal, machine-checked proofs in Lean. The headline result: it formally solved all twelve problems from the 2025 William Lowell Putnam Mathematical Competition, the undergraduate contest where the 2025 median score was 2 out of 120.
Why this is harder than it sounds
Natural-language proofs are notoriously hard to verify automatically. They skip steps. They hide assumptions. A formal proof, written in a language like Lean and checked by a compiler, gives you a correctness guarantee instead, but writing one is brutal work. The field has mostly been dominated by specialized models fine-tuned for Lean.
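For readers who have never seen one, here is roughly what "checked by a compiler" means in practice. This is a toy Lean 4 statement chosen for illustration, not anything from the LEAP paper; the theorem name `toy_add_comm` is made up, while `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- Toy example, not taken from the paper.
-- Lean accepts this file only if the term after `:=` really proves the statement.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the proof term were wrong or incomplete, the file would fail to compile, which is the guarantee a natural-language proof cannot offer.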
LEAP takes a different route. The technical paper describes an agentic framework that runs on general-purpose models (with Gemini 3.1 Pro as the backend) rather than a Lean-specialized prover. The system sketches a high-level proof blueprint, structures it as a dependency graph, then generates Lean code and recursively fixes errors using compiler feedback. The solutions are posted on GitHub.
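The blueprint-then-repair idea can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not LEAP's actual code, which the paper does not release: every name here (`prove_blueprint`, `toy_draft`, `toy_check`) is hypothetical, the "model" and the "compiler" are stand-in functions, and the repair step is a flat retry loop rather than whatever recursion the real system uses.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Optional, Set, Tuple

# All names below are hypothetical; none come from the LEAP paper.
# draft(lemma, proved_dependencies, last_error) -> candidate Lean source
Draft = Callable[[str, List[str], Optional[str]], str]
# check(source) -> (compiled?, error message)
Check = Callable[[str], Tuple[bool, str]]


def prove_blueprint(deps: Dict[str, Set[str]], draft: Draft, check: Check,
                    max_repairs: int = 3) -> Dict[str, str]:
    """Prove lemmas in dependency order, retrying each with checker feedback."""
    proved: Dict[str, str] = {}
    # deps maps each lemma to the lemmas it relies on, so dependencies come first.
    for name in TopologicalSorter(deps).static_order():
        context = [proved[d] for d in sorted(deps.get(name, ()))]
        error: Optional[str] = None
        for _ in range(max_repairs + 1):
            code = draft(name, context, error)
            ok, message = check(code)
            if ok:
                proved[name] = code
                break
            error = message  # feed the failure back into the next draft
        else:
            raise RuntimeError(f"gave up on {name}: {error}")
    return proved


def toy_draft(name: str, context: List[str], error: Optional[str]) -> str:
    """Stand-in for a model call: leaves a gap first, fills it after an error."""
    body = "sorry" if error is None else "trivial"
    return f"theorem {name} : True := {body}"


def toy_check(code: str) -> Tuple[bool, str]:
    """Stand-in for the Lean compiler: rejects any proof that still has a gap."""
    if "sorry" in code:
        return False, "declaration uses 'sorry'"
    return True, ""


if __name__ == "__main__":
    blueprint = {"main": {"a", "b"}, "b": {"a"}, "a": set()}
    for lemma, code in prove_blueprint(blueprint, toy_draft, toy_check).items():
        print(lemma, "->", code)
```

The point of the ordering is that by the time the loop reaches the main theorem, every lemma it leans on has already passed the checker, so a failure can be pinned to one node of the graph instead of the whole proof.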
The numbers, and what they actually compare against
Here the source framing needs a correction. LEAP did not stand alone on Putnam. The paper itself says the 12-out-of-12 result matches two other systems, the closed Axiom and Numina, both of which also cleared all twelve. So perfect Putnam performance, while real, is no longer the rare feat it was a year ago.
The sharper comparison is Lean-IMO-Bench, a new benchmark the authors built from 60 IMO-style problems formalized in Lean and picked for non-routine, structurally messy proofs. On that benchmark LEAP hit a 70% solve rate. General LLMs on their own land under 10%. Specialized prover models manage around 5%. Aristotle, the Harmonic system that reached gold-medal level at IMO 2025, scored 48% here. That gap, more than the Putnam clean sweep, is the part of the story worth paying attention to.
One concrete flex: LEAP formally verified a key subproblem in Knuth's work on Hamiltonian decomposition of even-order Cayley graphs, generating more than 5,000 lines of Lean 4 to do it. It also formalized a proof for Erdős Problem 457.
About that irony
The authors take a swing at closed competitors. They call Axiom and Numina scientifically unverifiable because both stayed closed-source with no public access. Fair enough.
But the LEAP framework code itself does not appear to be open either. The paper releases the generated Lean proofs, which is something, since anyone with the Lean compiler can recheck those. Full reproducibility of the system that produced them is a different matter, and that part stays inside Google for now.
The paper went up on arXiv on June 2, 2026. The Lean-IMO-Bench resources are listed as available, so independent benchmarking against the 48% Aristotle figure is the obvious next test.




