François Chollet spent years telling the AI industry that large language models couldn't truly reason. He built a benchmark to prove it: a set of visual puzzles with colored squares on grids that humans solve almost instantly but algorithms could not crack. For five years, ARC-AGI stood as a rebuke to the scaling hypothesis. Then, in December 2024, OpenAI's o3-preview blew past the human baseline on the original test. Now GPT-5.2 has done the same to the harder sequel.
The benchmark nobody was supposed to beat
The Abstraction and Reasoning Corpus began as a thought experiment. Chollet, who created the Keras deep learning library at Google before leaving in late 2024 to start an AGI-focused venture, published his paper "On the Measure of Intelligence" in 2019 with a core claim: existing AI benchmarks measured the wrong things. Systems could ace tests through memorization and pattern matching without demonstrating anything resembling human-like reasoning.
ARC-AGI tasks are deliberately simple for people. You see two or three examples of an input grid transforming into an output grid: colored squares shifting, rotating, or changing according to an unstated rule. Then you get a new input and have to predict what the output should be. The average person solves about 75% of ARC-AGI-2 tasks on their first try, often in under three minutes per puzzle.
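The task format described above fits in a few lines of code. The sketch below mirrors the shape of the public dataset's JSON schema (train/test pairs of integer grids, with cell values as colors); the "mirror left-to-right" rule is invented here purely for illustration, as real tasks are far more varied.

```python
# An ARC-like task: each grid is a list of rows; cells are color codes.
# The unstated rule here is "mirror the grid left-to-right" — a made-up
# example, much simpler than actual ARC-AGI-2 tasks.
task = {
    "train": [
        {"input": [[0, 1], [2, 0]], "output": [[1, 0], [0, 2]]},
        {"input": [[3, 0, 0]],      "output": [[0, 0, 3]]},
    ],
    "test": [
        {"input": [[0, 5], [6, 0]]}   # solver must infer: [[5, 0], [0, 6]]
    ],
}

def mirror(grid):
    """The hidden rule a solver would need to infer from the examples."""
    return [row[::-1] for row in grid]

# The inferred rule must reproduce every demonstrated pair.
all(mirror(p["input"]) == p["output"] for p in task["train"])  # → True
```

The solver never sees the rule stated anywhere; it must be inferred from the two or three demonstration pairs alone, which is what makes the tasks trivial for humans and historically hard for machines.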
For AI systems, this was supposed to be devastating.
The original 2019 paper was explicit about the limitations Chollet expected: "As far as we know, ARC cannot be approached by any existing machine learning technique, including neural network-based approaches, due to its focus on broad generalization and learning from few examples."
What actually happened
When ARC-AGI launched in 2020, the winning Kaggle submission scored 21%. For four years, nobody got past 55% on the hidden test set. Then o3-preview arrived.
OpenAI presented the results during their "12 Days of OpenAI" event in December 2024. At low compute, o3-preview scored 75.7%. With 172 times more compute (generating 1024 candidate solutions per puzzle and selecting the most frequent plausible answer), it hit 87.5%. The cost was staggering: roughly $17-20 per task in low-compute mode, with estimates running into hundreds of thousands of dollars for the high-compute configuration.
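The high-compute trick of sampling many candidates and keeping the most frequent answer is essentially consensus voting. A minimal sketch (the function name and grid encoding are illustrative, not OpenAI's actual implementation):

```python
from collections import Counter

def select_answer(candidates):
    """Pick the most frequent candidate output grid from many samples.

    `candidates` is a list of predicted grids (lists of lists), e.g.
    1024 independent samples for one puzzle. Identical predictions are
    counted by converting each grid to a hashable tuple-of-tuples.
    """
    keyed = [tuple(map(tuple, grid)) for grid in candidates]
    best_key, _count = Counter(keyed).most_common(1)[0]
    return [list(row) for row in best_key]

# Example: three of four samples agree, so their answer wins.
samples = [[[1, 0]], [[1, 0]], [[0, 1]], [[1, 0]]]
select_answer(samples)  # → [[1, 0]]
```

The intuition is that correct solutions tend to be rediscovered across independent samples, while errors scatter; the 172x compute multiplier buys exactly this redundancy.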
Chollet acknowledged the result but pointed toward ARC-AGI-2, which launched in March 2025. The new version uses larger grids, more objects, and puzzles requiring combinations of multiple reasoning patterns. When it first went public, frontier models scored under 5%.
By December 2025, GPT-5.2 hit 54.2% on the semi-private evaluation set. A six-person startup called Poetiq matched that score using a different approach entirely.
The Poetiq question
Poetiq's result complicates the story Chollet has been telling. The startup didn't train new models. Instead, they built what they call a "meta-system" that orchestrates existing LLMs through iterative refinement loops. Take a puzzle, generate a candidate solution, evaluate whether it works on the given examples, refine based on feedback, repeat.
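The loop described above can be sketched in a few lines. This is a hedged reconstruction from Poetiq's public description, not their actual code: `propose` stands in for an LLM call and `verify` for checking a candidate against the puzzle's demonstration pairs.

```python
def refine(puzzle, propose, verify, max_iters=5):
    """Iterative refinement: propose a solver, check it on the puzzle's
    demonstration pairs, and feed failures back into the next attempt.

    `propose(puzzle, feedback)` returns a candidate solver (input grid
    -> output grid); `verify(puzzle, solver)` returns the demo pairs the
    solver gets wrong (empty list means success).
    """
    feedback = None
    candidate = None
    for _ in range(max_iters):
        candidate = propose(puzzle, feedback)
        failures = verify(puzzle, candidate)
        if not failures:
            return candidate        # solves all demonstrated examples
        feedback = failures         # refine against observed mistakes
    return candidate                # best effort once the budget runs out

# Toy demonstration: the rule is "add 1 to every cell". A stub `propose`
# guesses identity first, then corrects itself after seeing failures —
# a stand-in for what an LLM would do with error feedback.
puzzle = {"train": [([[1]], [[2]]), ([[3]], [[4]])]}

def verify(p, solver):
    return [(i, o) for i, o in p["train"] if solver(i) != o]

def propose(p, feedback):
    if feedback is None:
        return lambda g: g
    return lambda g: [[c + 1 for c in row] for row in g]

solver = refine(puzzle, propose, verify)
solver([[9]])  # → [[10]]
```

The key property is that no model weights change: all adaptation happens in the loop around the model, which is why Poetiq could swap in new LLMs within hours.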
The system scored 54% on ARC-AGI-2's semi-private test at $30.57 per task, roughly half the cost of Google's Gemini 3 Deep Think, which scored 45.1% at $77.16 per task. Poetiq integrated Gemini 3 and GPT-5.1 within hours of those models' release without any retraining. Their GitHub repository is public.
What Poetiq demonstrated is that much of the "intelligence" being measured on ARC-AGI might be unlockable through clever prompt engineering and system architecture rather than fundamental model improvements. The ARC Prize Foundation now categorizes their result separately as a "Model Refinement" rather than a base model score.
The human baseline problem
The claimed 100% human score on ARC-AGI-2's leaderboard is misleading. It means every task was solved by at least two people out of roughly nine who attempted it. The average human test-taker actually scored around 60% across all tasks they tried. Some puzzles stumped seven or eight out of nine participants.
A detailed analysis on LessWrong noted that over 300 ARC-AGI-2 tasks were solved by only two participants, meaning the benchmark may include puzzles that only a small fraction of people can actually crack. The benchmark's designers explicitly acknowledge this creates "ambiguity" that requires AI systems to submit two guesses per puzzle.

So when GPT-5.2 hits 54%, it's not underperforming the average human. It's roughly matching the average test-taker's performance while still failing on the puzzles that trip up most humans too.
The synthetic data arms race
The real story behind the benchmark improvements isn't some breakthrough in reasoning architecture. It's synthetic data.
NVIDIA's team won the ARC Prize 2025 Kaggle competition with a 4-billion-parameter model that achieved 24% on the private evaluation, far less than GPT-5.2's 54% but under severely constrained compute budgets. Their approach: generate millions of synthetic ARC-like puzzles, train on every conceivable pattern variation, then perform test-time training where the model fine-tunes itself on each new puzzle's examples before attempting an answer.
The NVIDIA blog post is admirably direct about the strategy: "Heavyweight LLM methods, chain-of-thought, tool use, even RL-agents, couldn't fit within Kaggle's runtime. So NVARC moved all complex reasoning offline into a synthetic data pipeline, and trained smaller models capable of running fast during evaluation."
OpenAI confirmed that o3 was trained on 75% of the ARC-AGI public training set. It's reasonable to assume frontier models from Google and OpenAI have consumed vast quantities of synthetic reasoning examples.
What this means for "AGI"
Chollet himself maintains that solving ARC doesn't mean achieving AGI, a point he makes repeatedly despite the benchmark's name including those three letters. He views progress on ARC as evidence that test-time adaptation, systems that can modify their own behavior while running, works better than scaling alone.
"The big update has been that now we have models that can actually adapt at test time to something they've never seen before," Chollet said in a recent talk. He's shortened his personal AGI timeline from ten years to five, not because scaling is working but because reasoning models represent a genuine architectural shift.
The distinction matters. ARC puzzles test a narrow slice of what "intelligence" might mean: visual pattern recognition, rule inference from sparse examples, basic geometric and numerical reasoning. They explicitly exclude language, cultural knowledge, and the vast corpus of learned facts that LLMs excel at retrieving.
A system that scores 54% on ARC-AGI-2 still fails on roughly half the puzzles humans find trivial. GPT-5.2 still makes errors on tasks a child could solve, suggesting the underlying mechanism differs fundamentally from human cognition even when outputs appear similar.
Version 3 and the interactive pivot
Chollet is already building ARC-AGI-3, scheduled for full release in 2026. The format changes entirely: instead of static input-output pairs, systems will play interactive mini-games where they must discover the rules through trial and error.
The preview is live at arcprize.org. You control a colored square on a grid, pressing arrow keys to move, and have to figure out how to reach a goal state. There are no instructions. The mechanics (walls, teleporters, conditional movement) must be learned through exploration.
Early results are brutal. AI systems have scored zero points on most preview games. Humans clear them in minutes, often finding the theoretical minimum number of moves.
The shift to interactivity addresses a criticism implicit in how current models beat ARC-AGI-2. Static puzzles can be solved through exhaustive code generation and verification. Given enough compute, you can try millions of candidate programs until one produces correct outputs for the training examples. Interactive environments require something closer to real-time learning, maintaining state, forming hypotheses, and revising them based on immediate feedback.
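The generate-and-verify strategy described above amounts to enumerating candidate programs and keeping the first one that reproduces every demonstrated pair. A minimal sketch with an invented four-primitive DSL (real systems search millions of programs over far richer operation sets):

```python
from itertools import product

# A toy DSL of grid transformations; illustrative only.
PRIMITIVES = {
    "identity":  lambda g: g,
    "flip_h":    lambda g: [row[::-1] for row in g],
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def search(train_pairs, depth=2):
    """Enumerate compositions of primitives up to `depth`; return the
    first program that reproduces every demonstrated input→output pair."""
    names = list(PRIMITIVES)
    for n in range(1, depth + 1):
        for combo in product(names, repeat=n):
            def prog(g, combo=combo):
                for name in combo:
                    g = PRIMITIVES[name](g)
                return g
            if all(prog(i) == o for i, o in train_pairs):
                return combo
    return None

# One demo pair whose rule is a horizontal mirror.
demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
search(demos)  # → ('flip_h',)
```

This works on static puzzles because verification is free: candidate programs can be checked exhaustively against the training pairs. An interactive environment offers no such oracle, which is exactly the property ARC-AGI-3 is betting on.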
Whether this will remain "hard for AI" longer than a few years is anyone's guess.
The cost problem nobody talks about
GPT-5.2's ARC-AGI-2 results came at roughly $1.90 per task in its standard "Thinking" configuration. OpenAI's premium "Pro X-High" configuration costs about $15.72 per task for 54.2% accuracy.
Human test-takers in ARC Prize's study were paid $5 per task solved.
The efficiency metric matters more than the raw score for most practical applications. A system that matches human accuracy at three times the cost isn't "human-level" in any economically meaningful sense. Poetiq's refinement approach cuts costs but remains expensive compared to just asking a person.
The ARC Prize Foundation now requires all submissions to report cost per task alongside accuracy, making the benchmark as much about efficiency as capability.
An honest assessment
ARC-AGI accomplished what it set out to do: force AI research to confront limitations in generalization and novel reasoning. The benchmark hasn't been "solved" in any final sense; it's been partially overcome through a combination of better models, massive synthetic training datasets, and clever test-time compute strategies.
The progression from 5% to 54% in roughly a year tells us something about the pace of capability improvement, but not necessarily about progress toward general intelligence. We've gotten very good at building systems that can solve visual puzzles through sophisticated pattern matching and program search. Whether that constitutes "reasoning" in the way Chollet originally meant remains contested.
The forthcoming ARC-AGI-3 may prove more durable. Interactive environments resist the generate-and-verify approach that cracked static puzzles. But the AI community has consistently underestimated how quickly any fixed target will fall once sufficient resources are directed at it.
Chollet's own pivot to a five-year AGI timeline suggests even the skeptics are running out of safe hedges.