AI Benchmarks

GPT-5.2 Clears 40% on FrontierMath as AI Nibbles at Research-Level Problems

OpenAI's newest model jumps nearly 10 points over its predecessor, while mathematicians stay cautiously unimpressed

Oliver Senti
Senior AI Editor
January 25, 2026 · 4 min read

OpenAI's GPT-5.2, released December 11, 2025, now solves 40.3% of problems on FrontierMath Tiers 1-3, up from GPT-5.1's 31%. The benchmark, maintained by Epoch AI, consists of 350 problems designed to take professional mathematicians hours or days to complete.

The numbers, and what they don't tell you

The headline figure is real. GPT-5.2 Thinking hit 40.3% on the main benchmark, while GPT-5.2 Pro pushed even further on Tier 4, the hardest subset, reaching 29.2%. A year ago, frontier models struggled to clear 2% on the lower tiers. Greg Brockman called it evidence that AI is approaching the capability for scientific breakthroughs.

But the benchmark methodology matters here. OpenAI's internal evaluations differ from Epoch AI's independent runs. Different scaffolding, different compute budgets, potentially different problem subsets. Epoch AI notes explicitly that their implementation differs from what OpenAI used for their o3 evaluations, and they weren't involved in running OpenAI's numbers. The difference could be meaningful.

What's harder to dispute: progress is fast. Six months ago, 4% on Tier 4 was state of the art.

The proof that actually worked

OpenAI is leaning hard into a specific case study. Researchers asked GPT-5.2 Pro to tackle an open problem in statistical learning theory, a question first posed at a 2019 conference: does collecting more data reliably improve model performance? The model proposed a proof without being given intermediate steps or a strategy. Human researchers verified it, and external experts reviewed it.

According to OpenAI's writeup, the model then extended the result to higher-dimensional settings on its own.

This is genuinely unusual. The standard AI math story involves massive scaffolding, human hints, iteration. Here the claim is that GPT-5.2 Pro produced a verifiable proof from a cold prompt.

Terence Tao is paying attention

The broader math community is watching with a mix of interest and suspicion. Terence Tao has been maintaining a wiki page tracking AI contributions to Erdős problems, the famous collection of open questions posed by Paul Erdős over decades.

Just this week, GPT-5.2 Pro apparently solved Erdős Problem 281. Tao's initial reaction was positive. Then someone found a prior solution buried in a 1936 paper, by Erdős himself. The AI proof was still different from the literature proof, and Tao noted the model avoided common mistakes with limit interchanges, but the "previously unsolved" framing collapsed.

This keeps happening. The problems that fall to AI tend to be from the "long tail" of under-explored questions, not the famous hard ones. Tao's wiki page is careful to distinguish between these categories.

What the mathematicians actually think

The Malliavin-Stein experiment from September 2025 offers a more measured perspective. Researchers deliberately tested GPT-5 on an open problem in probability theory. It worked, they got new results, but the process raised concerns.

Their worry isn't that AI can't do math. It's that the workflow feels disorienting. The model produces technically correct arguments that lack the intuition-building struggle that training mathematicians normally requires. PhD students who rely on AI assistance might never develop the sense for what works and what doesn't.

And then there's Sébastien Bubeck's viral post from August 2025, where GPT-5 Pro improved a known bound in convex optimization. Mathematicians had nuanced reactions. The math checked out. But several pointed out this was synthesis of known techniques, not the kind of creative leap that defines hard problems.

The reliability problem

Epoch AI published an analysis estimating that, even pooling repeated runs across all their models, less than 70% of FrontierMath problems are solvable by current AI. The solve-rate curve is flattening: models that can solve some problems tend to solve the same problems. Web search helps on about 5% of questions, the ones where obscure mathematical knowledge matters.

GPT-5.2 Pro solved only one problem that no other model had touched.
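The pooled estimate Epoch AI describes amounts to unioning the problems that any model solves across repeated runs, then dividing by the benchmark size. A minimal sketch of that calculation, using made-up model names and solve sets (illustrative data, not real benchmark results):

```python
# Hypothetical sketch of a pooled "ever solved" ceiling estimate:
# union the problems any model solves in any run, as in Epoch AI's
# analysis. All data below is illustrative, not real results.
runs = {
    "model_a": [{"p1", "p2"}, {"p1", "p3"}],  # solved-problem sets per run
    "model_b": [{"p2"}, {"p2", "p4"}],
}
total_problems = 10

solved_ever = set()
for model_runs in runs.values():
    for solved in model_runs:
        solved_ever |= solved  # pool solves across runs and models

ceiling = len(solved_ever) / total_problems
print(f"{ceiling:.0%} of problems solved by at least one run")  # 40%
```

With this framing, a flattening curve means each additional run or model adds few new problems to `solved_ever`, even as average per-run scores climb.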

This matches what researchers report. The models are getting better at the problems they can do. Completely novel reasoning, the kind that opens new fields, remains elusive.

Where this leaves us

The benchmark trajectory is clear. FrontierMath Tiers 1-3 will probably saturate within a year or two at this pace. Epoch AI is already building a new benchmark focused on problems with automatically verifiable solutions, specifically because they expect this one to run out of signal.

The interesting question isn't whether AI will hit 50% or 60% on FrontierMath. It's whether any of these problems actually matter to working mathematicians. The ones falling to AI are mostly low-hanging fruit that hadn't been picked because nobody cared enough to try.

The hard stuff, the Millennium Prize problems, the questions that would reshape entire fields, those remain untouched. For now.

Tags: OpenAI, GPT-5.2, FrontierMath, AI benchmarks, mathematical reasoning, Epoch AI

Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

