
Google's Gemini 3 Deep Think Hits 84.6% on ARC-AGI-2, a Benchmark That Stumped AI a Year Ago

A reasoning benchmark that frontier models could barely crack at launch just got blown past. But at $13.62 per task, the bill matters.

Oliver Senti, Senior AI Editor
February 15, 2026 · 5 min read

Google DeepMind released an upgraded Gemini 3 Deep Think reasoning mode on February 12, scoring 84.6% on ARC-AGI-2, the abstract reasoning benchmark that François Chollet's ARC Prize Foundation designed to resist exactly this kind of rapid saturation. The score was independently verified by the Foundation. For context, when ARC-AGI-2 launched in March 2025, frontier AI reasoning systems were scoring in the single digits.

Less than a year. That is how long the benchmark lasted.

What 84.6% actually means (and what it doesn't)

The conventional threshold for considering a benchmark "saturated" sits around 80%. Deep Think blew past it. And the gap between Google's model and its competitors is not small: Claude Opus 4.6 in thinking mode managed 68.8%, OpenAI's GPT-5.2 hit 52.9%, and Google's own Gemini 3 Pro (the non-reasoning variant) landed at 31.1%. The original ARC-AGI-1? Deep Think scored 96% on that. It is, for all practical purposes, done.

But here's the thing that should give you pause. The ARC Prize Foundation itself flagged something uncomfortable during the verification process. Their results analysis noted that Gemini 3 Deep Think was using correct ARC color mappings in its reasoning chain, even though the verification harness never mentioned ARC tasks or color formats. The Foundation's assessment: ARC data is "well represented" in the model's training set, enough to make correct inferences just from the structure of 2D integer arrays.
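
For readers who haven't seen one, here is roughly what an ARC task looks like in the public ARC-AGI JSON format. This is a made-up minimal example; the integer-to-color mapping in the comments is the conventional palette from the ARC testing interface. A model that volunteers that mapping when shown nothing but bare integer grids has, almost by definition, seen ARC material before.

```python
# A minimal ARC-style task (hypothetical example). Grids are plain 2D
# integer arrays; nothing in the structure names ARC or any colors.
# The ARC interface conventionally renders 0=black, 1=blue, 2=red,
# 3=green, 4=yellow, 5=grey, 6=magenta, 7=orange, 8=azure, 9=maroon.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[0, 2], [2, 0]], "output": [[2, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 3], [3, 0]]},  # solver must infer: [[3, 0], [0, 3]]
    ],
}
```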

Is the model actually reasoning, or is it recognizing? The Foundation says they can't tell how much this new form of overfitting contributes to the score. And that's a pretty significant caveat buried under the headline number.

The cost problem

Deep Think isn't a separate model. It is a reasoning mode that throws more compute at inference time, letting Gemini 3 explore multiple solution paths before answering. The ARC Prize Foundation verified the score at $13.62 per task on ARC-AGI-2. Compare that to the human baseline: participants in the Foundation's controlled study averaged about 60% accuracy at roughly $17 per task.
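
Google hasn't published the mechanics, but the general shape of inference-time scaling is well understood. Here is a minimal sketch assuming a self-consistency-style approach (sample several independent reasoning paths, keep the consensus answer). `model.solve` is a hypothetical stand-in for one reasoning rollout, not a real API, and this is not a claim about how Deep Think actually works:

```python
from collections import Counter

def deep_think_sketch(task, model, n_paths=8):
    # Sample several independent reasoning rollouts. Every extra path
    # burns more tokens, which is where a $13.62-per-task bill comes from.
    answers = [model.solve(task, temperature=1.0) for _ in range(n_paths)]
    # Keep the most common answer (self-consistency voting).
    return Counter(map(str, answers)).most_common(1)[0][0]
```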

So Deep Think beats humans on score and (barely) on cost. Sounds like a clean win. But consider: the Foundation's Kaggle competition winner, using a small-model approach, scored 24% on the private eval for a fraction of the cost. And Poetiq, a team of ex-DeepMind researchers, hit 54% on ARC-AGI-2 using a refinement harness wrapped around Gemini 3 Pro at $30 per task. The base Gemini 3 Pro model they built on costs $0.81 per task at its 31.1% score.
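
A rough way to line these up is cost per solved task: price per attempt divided by solve rate. A quick sketch using only the figures above (it treats accuracy as a flat solve rate and ignores that the approaches solve different tasks):

```python
# Expected cost per *solved* task = cost per attempt / solve rate.
approaches = {
    "Deep Think":          (13.62, 0.846),
    "Human baseline":      (17.00, 0.600),
    "Poetiq + Gemini Pro": (30.00, 0.540),
    "Gemini 3 Pro (base)": (0.81,  0.311),
}
for name, (cost, acc) in approaches.items():
    print(f"{name:21s} ${cost / acc:6.2f} per solved task")
# Deep Think ~ $16.10, humans ~ $28.33, Poetiq ~ $55.56, base Pro ~ $2.60
```

By that crude measure, Deep Think undercuts the human baseline more comfortably than the raw per-task prices suggest, but the base model is still six times cheaper per solved task. The premium buys the tasks the base model cannot solve at all.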

Nobody has disclosed what running Deep Think costs at production volume. No enterprise customer has talked publicly about inference bills. At $13.62 per puzzle, the math gets awkward fast if you're trying to deploy this for anything beyond benchmarks.

The other numbers

ARC-AGI-2 is the headline, but Google dropped results across several benchmarks simultaneously. On Humanity's Last Exam (no tools), Deep Think scored 48.4%, beating Claude Opus 4.6 at 40.0% and GPT-5.2 at 34.5%. On Codeforces, a competitive programming benchmark, it reached 3,455 Elo, which puts it in "Legendary Grandmaster" territory. Claude Opus 4.6 sits at 2,352 for comparison.

Then there are the Olympiad results. According to Google's blog post, Deep Think achieved gold-medal-level performance on the written sections of both the 2025 International Physics Olympiad and the International Chemistry Olympiad, as well as on the 2025 International Math Olympiad. Plus 50.5% on the CMT-Benchmark, which tests advanced theoretical physics.

These are all self-reported by Google, and I haven't seen independent verification beyond the ARC-AGI-2 numbers. Take them with appropriate seasoning.

So is ARC-AGI-2 dead?

Sort of. The ARC Prize Foundation was already working on ARC-AGI-3 before Deep Think dropped, planning to introduce interactive and agentic elements to the benchmark. Their technical report says the 2D-grid puzzle format has provided a "useful scientific canary" for reasoning progress but that benchmark design needs to evolve.

There is a real question about what high ARC-AGI-2 scores actually tell us. The Foundation pointed out that Deep Think used 138,000 reasoning tokens to solve a task that base Gemini 3 Pro handled in 96 tokens. More thinking tokens correlate with higher scores, but they also mean the model is brute-forcing its way through solution space rather than exhibiting the kind of efficient generalization the benchmark was supposed to measure.
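
The scale of that gap is worth making concrete, using only the figures quoted in this piece:

```python
# Ratio of the quoted reasoning-token budgets (one anecdotal task),
# alongside the ratio of the verified per-task costs, for scale.
deep_think_tokens, base_tokens = 138_000, 96
deep_think_cost, base_cost = 13.62, 0.81
print(f"token ratio: {deep_think_tokens / base_tokens:,.0f}x")  # 1,438x
print(f"cost ratio:  {deep_think_cost / base_cost:.1f}x")       # 16.8x
# The two ratios measure different things, so they need not agree --
# but both point at a model spending vastly more to get its answer.
```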

And the contamination issue looms. If the model has effectively memorized ARC-style patterns during training, then a high score on ARC-AGI-2 measures something different from what Chollet intended. The whole point of the benchmark was to test fluid intelligence on genuinely novel tasks. If the tasks aren't novel to the model anymore, the signal degrades.

Who can try it

The updated Deep Think is available now in the Gemini app for Google AI Ultra subscribers. Google is also opening API access through an early access program for researchers and enterprise users. General API availability hasn't been announced, and pricing details beyond the ARC verification cost remain unclear.

Google's move here is clear enough: position Gemini as the reasoning leader heading into 2026. Whether the benchmarks justify that framing depends a lot on how much you trust the benchmarks themselves. The ARC Prize Foundation, which has every reason to celebrate its benchmark being taken seriously by Google, is the one raising the biggest red flags about what the scores mean. That tension tells you more than the numbers do.

Tags: Google DeepMind, Gemini 3, ARC-AGI-2, AI benchmarks, reasoning AI, ARC Prize, Deep Think, inference compute, François Chollet
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
