Michael Truell, CEO of Cursor, announced this week that his company's autonomous agent system produced a novel solution to Problem Six of the First Proof challenge, a set of ten research-level math problems assembled by eleven prominent mathematicians. The solution, Truell claims, yields stronger results than the official human-written proof.
The kicker: Cursor is a code editor. Not a math lab, not a research institute. A company that makes software for programmers. And the agent harness it used here is the same one that built a web browser from scratch a few weeks earlier.
What First Proof actually is
First Proof dropped on February 5 as an attempt to cut through the fog of AI-for-math hype. The problems were contributed by researchers from Stanford, Harvard, Yale, Berkeley, Columbia, and other institutions, including 2014 Fields Medalist Martin Hairer. Each problem is a lemma, the kind of minor theorem a working mathematician might hand to a talented graduate student. They're not famous open problems. They're the grunt work of real research, extracted from the authors' own unpublished papers.
The design was deliberate. Daniel Spielman, a Yale professor on the team, noted that most published AI-math results come from the companies building the models. "It comes across as a bit of an advertisement," he told Scientific American. First Proof was supposed to be different: problems that couldn't be found in any training data, evaluated by mathematicians with no financial stake in the outcome.
The results, released on February 14, were messy. None of the AI systems solved all ten. The First Proof team's own preliminary tests found that the best publicly available models could only handle two. But once companies and independent researchers got involved, things got complicated fast.
The autonomy question
This is where it gets interesting, and where Cursor's claim stands apart.
OpenAI's chief scientist Jakub Pachocki announced that their internal models produced six solutions with "a high chance of being correct." But mathematicians quickly poked holes in at least one. And Pachocki later acknowledged that some solutions involved human guidance. The Problem 6 attempt, for instance, was prompted with specific strategic hints about using a "BSS barrier type argument," guidance that originated from reviewing the model's earlier attempts. Not exactly autonomous.
Google DeepMind's Aletheia agent, powered by Gemini 3 Deep Think, claimed six solutions (Problems 2, 5, 7, 8, 9, and 10) with zero human intervention. Their methodology is arguably the most transparent: raw prompts and outputs are published, and the pipeline guaranteed no human touching the math at any point. But they didn't crack Problem 6.
Cursor's approach? Four days of fully autonomous operation. No nudging, no hints. The run was set up and overseen by Shengtong Zhang, a Stanford PhD student in mathematics, and Wilson Lin, the Cursor engineer behind the FastRender browser project, using the company's multi-agent coordination harness.
Does the math check out?
Daniel Litt, a mathematician at the University of Toronto and one of the more careful voices on AI-math capabilities, wrote in a recent blog post that Cursor's Problem 6 solution "seems more clearly autonomous" than OpenAI's attempt, and that "experts seem to vouch for" it. That's a measured endorsement from someone who has been consistently skeptical of AI-math hype, and who recently admitted he has been "not correctly calibrated" about model capabilities.
Truell's specific claim, that the solution "yields stronger results than the official, human-written solution," is harder to independently verify. The original source material suggests the agent's approach differs from the human proof and improves a constant in the bound. But I haven't seen a detailed public writeup of the Cursor proof, and the First Proof team hasn't officially graded any submissions from this first round.
So: experts vouch for correctness. The autonomy claim is plausible given the harness. The "better than human" framing is the part that needs more scrutiny.
The harness
What caught my attention is the infrastructure claim. Cursor says this is the same system that coordinated hundreds of GPT-5.2 agents to write a Rust-based web browser in a week. That project, FastRender, produced over a million lines of code using a hierarchical system: planners break down work, workers execute tasks independently, and judge agents decide whether to continue or iterate. The whole thing runs without human code review.
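Cursor hasn't published the harness internals, but the planner/worker/judge pattern it describes can be sketched in a few lines. The following is a minimal, purely illustrative Python mock-up of that control loop; every name here (`plan`, `work`, `judge`, `run`) is my own invention, and the stubbed functions stand in for actual model calls.

```python
from dataclasses import dataclass

# Hypothetical sketch of a planner/worker/judge coordination loop,
# in the spirit of Cursor's described harness. Not Cursor's API:
# all names and logic here are illustrative stubs.

@dataclass
class Task:
    goal: str
    attempts: int = 0

def plan(goal: str) -> list[Task]:
    # A planner agent would decompose the goal into subtasks;
    # here we fake a fixed three-way split.
    return [Task(f"{goal}: step {i}") for i in range(1, 4)]

def work(task: Task) -> str:
    # A worker agent would call a model to attempt the subtask;
    # this stub just records the attempt number.
    task.attempts += 1
    return f"result of ({task.goal}), attempt {task.attempts}"

def judge(result: str) -> bool:
    # A judge agent decides: accept the result, or send the worker
    # back to iterate. Stub: reject first attempts, accept retries.
    return "attempt 1" not in result

def run(goal: str, max_rounds: int = 5) -> list[str]:
    accepted = []
    for task in plan(goal):
        for _ in range(max_rounds):
            result = work(task)
            if judge(result):  # the continue-vs-iterate decision
                accepted.append(result)
                break
    return accepted
```

The point of the structure is that no human sits in the loop: the judge's accept/iterate decision replaces code review, which is exactly what makes the autonomy claim auditable (and, in the browser case, what critics blamed for the rough edges).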
The browser itself was imperfect. Critics pointed out it couldn't compile from a clean checkout, and one developer called the output "AI slop." Simon Willison, who had predicted AI-built browsers wouldn't arrive until 2029, said he was "very surprised." Fair to say opinions vary.
But applying that same coordination layer to mathematics is a different kind of test entirely. Writing code that compiles (or doesn't) is one thing. Producing a proof that domain experts accept as correct and novel is something else. The fact that Cursor didn't build a specialized math reasoning system, that they just pointed their coding infrastructure at a math problem, is either a sign that general-purpose agent coordination is more powerful than expected, or a sign we should wait for more rigorous verification before drawing conclusions.
Probably both.
The bigger picture, briefly
Mohammed Abouzaid, the Stanford mathematician who initiated First Proof, observed something telling about the AI-generated proofs that did work: they feel like 19th-century mathematics. Valid, but lacking the conceptual framework that modern mathematicians build. The AI can close out lemmas. It can't (yet) ask the right questions or develop new frameworks.
Litt puts it more starkly. He now expects to lose a bet he made in 2025 that AI wouldn't produce research comparable to the best human mathematics by 2030, at 3:1 odds. And he's a guy who has spent years pushing back against hype.
The First Proof team is planning a second round with tighter controls, with details expected around March 14. Until then, Cursor's claim sits in an awkward middle ground: endorsed by credible mathematicians, produced by a genuinely autonomous system, but not yet formally verified by the challenge organizers. The proof exists. Whether it's better than the human version in any way that matters to mathematicians is a question the next round might be better equipped to answer.