AI Benchmarks

Google Upgrades Gemini 3 Deep Think, Claims Record Scores on ARC-AGI-2 and Humanity's Last Exam

The updated reasoning mode is locked behind Google's $250/month AI Ultra tier, with API access opening to select researchers.

Oliver Senti, Senior AI Editor
February 13, 2026 · 6 min read
[Image: a research laboratory monitor displaying complex mathematical proofs with neural network visualizations]

Google released an upgraded version of Gemini 3 Deep Think on February 12, a specialized reasoning mode aimed at science, research, and engineering problems. The update is available to Google AI Ultra subscribers in the Gemini app, and for the first time, select researchers and enterprises can request early API access through Google's Vertex AI platform.

The benchmark numbers are aggressive: 84.6% on ARC-AGI-2, 48.4% on Humanity's Last Exam (without tools), and a 3455 Elo rating on Codeforces. Google also claims gold-medal performance on the 2025 International Math, Physics, and Chemistry Olympiads.

The ARC-AGI-2 score, and why it needs a footnote

That 84.6% figure is the headline number, and it's genuinely eye-popping. But it arrives with baggage.

The ARC Prize Foundation verified the score, yes. They also flagged something uncomfortable in the same analysis: evidence that ARC task data is well-represented in Gemini 3's training corpus. During their verification process, the model used correct ARC color mappings in its reasoning chain even though the verification harness never mentioned ARC tasks or color formats. The Foundation's language was careful but pointed: "This strongly suggests that ARC data is well represented in the underlying model." They couldn't determine whether the contamination was incidental or intentional.

For context, humans average about 60% on ARC-AGI-2 tasks. The previous Deep Think version (released with Gemini 3 in late 2025) scored 45.1% with code execution. So we're looking at a jump from 45% to 85% in roughly two months, on a benchmark whose creator is openly questioning whether the benchmark is being overfit. The ARC Prize team is already building ARC-AGI-3, a new interactive format designed to resist exactly this kind of saturation.

The cost per task tells its own story. According to ARC Prize's leaderboard data, Gemini 3 Deep Think spends $13.62 per task on ARC-AGI-2. For comparison, GPT-5.2 achieves 52.9% at $1.90 per task. Google's model scores higher, but it's burning roughly seven times the compute to get there. The ARC Prize Foundation has been pushing efficiency as a core metric of intelligence, and on that axis, this result looks less clean.
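As a back-of-envelope check on those figures (this calculation is mine, using only the leaderboard numbers quoted above), the raw cost ratio and the cost per *solved* task point the same way:

```python
# Per-task ARC-AGI-2 figures cited in this article (ARC Prize leaderboard).
gemini_cost, gemini_score = 13.62, 0.846   # Gemini 3 Deep Think: $/task, accuracy
gpt_cost, gpt_score = 1.90, 0.529          # GPT-5.2: $/task, accuracy

cost_ratio = gemini_cost / gpt_cost        # raw compute spend per task

# Normalizing by accuracy gives a rough "cost per solved task":
gemini_per_solve = gemini_cost / gemini_score
gpt_per_solve = gpt_cost / gpt_score

print(f"{cost_ratio:.1f}x, ${gemini_per_solve:.2f} vs ${gpt_per_solve:.2f}")
# → 7.2x, $16.10 vs $3.59
```

Even after crediting Gemini for its higher accuracy, each solved task costs roughly four and a half times what GPT-5.2 pays, which is why the efficiency framing matters.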

What's actually different under the hood

Google's blog post describes the approach as "advanced parallel reasoning," where the model explores multiple hypotheses simultaneously rather than following a single chain of thought. This builds on work the DeepMind team developed for the 2025 IMO, where an earlier Deep Think variant scored 35/42 (gold-medal level) by producing end-to-end natural language proofs within the 4.5-hour competition window. That was a real step up from 2024, when AlphaProof and AlphaGeometry needed days of computation and manual translation into formal languages like Lean.
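Google hasn't published implementation details, but "exploring multiple hypotheses simultaneously" resembles known techniques such as self-consistency sampling: run several independent reasoning chains and aggregate their answers, for example by majority vote. A minimal sketch under that assumption; `sample_chain` is a hypothetical stand-in for one model call, not anything Google has described:

```python
from collections import Counter

def sample_chain(question: int, seed: int) -> int:
    """Hypothetical stand-in for one independently sampled reasoning chain.
    A real system would call the model with temperature > 0 here; this toy
    version doubles the input and makes a deterministic slip on 1 in 4 seeds."""
    answer = question * 2
    return answer + 1 if seed % 4 == 0 else answer

def parallel_reason(question: int, n_chains: int = 16) -> int:
    """Sample n chains (in a real system, concurrently) and keep the
    majority answer, discarding minority hypotheses."""
    answers = [sample_chain(question, seed) for seed in range(n_chains)]
    return Counter(answers).most_common(1)[0][0]

print(parallel_reason(21))  # → 42: the 12 correct chains outvote the 4 slips
```

The design trade-off is exactly the one the ARC cost numbers expose: every extra chain multiplies inference compute, so accuracy gains from parallel exploration are bought with a proportionally larger bill.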

The updated model also leans into reinforcement learning for multi-step reasoning and draws from what Google calls "a curated corpus of high-quality solutions to mathematics problems." How curated, and from where, remains undisclosed.

Beyond math and coding, Google is now claiming competence in chemistry and physics, including 50.5% on the CMT-Benchmark for theoretical physics. I'm not sure what to make of that number without knowing the baseline or what other models score on it. Google doesn't provide comparison data.

The $250 question

Deep Think is gated behind Google AI Ultra, the $249.99/month subscription tier that Google launched in early 2026. That's a lot of money. For reference, OpenAI's ChatGPT Pro runs $200/month, and Anthropic's Claude Max tops out at the same price.

Google bundles YouTube Premium, 30TB of storage, and access to tools like Flow and Veo 3 into the Ultra tier, which muddies the value calculation. But the core selling point is access to the most capable reasoning mode. And if you're a researcher who actually needs this for work, the API access (now opening via an application process) is the more interesting development. Consumer-tier access is a nice perk. API integration into research pipelines is where this model either proves itself or doesn't.

Real users, real claims, no independent verification

Google highlights a handful of early testers. Lisa Carbone, a mathematician at Rutgers, reportedly used Deep Think to find a logical flaw in a technical paper on mathematical structures linking gravity and quantum mechanics (the kind of problem with very little existing training data). At Duke, the Wang Lab used it to optimize crystal growth fabrication, designing a recipe for semiconductor thin films larger than 100 micrometers.

These are interesting anecdotes. They're also unverifiable from the outside, at least until the researchers publish their own accounts. I'd love to see Carbone's paper and the specific error Deep Think caught. Without that, we're taking Google's word for it.

The capacity problem nobody's talking about

Here's something the announcement doesn't mention. Users of the previous Deep Think version already dealt with significant capacity constraints. When the mode first rolled out, wait times and error messages were common enough that Google had to throttle access. The original source material for this story (a Russian-language post from a regular Deep Think user) notes the updated version is already returning messages about server load and asking users to wait.

When you're paying $250 a month and the model tells you other people's requests take priority, that's a hard sell. The user in question asked, reasonably, how much more those other users had paid. (The model apparently apologized.)

Whether Google has solved the compute bottleneck for this update is an open question. Deep Think's parallel reasoning approach, by definition, burns more inference compute than standard generation. Scaling that to the full AI Ultra subscriber base while also opening API access is a non-trivial infrastructure challenge.

Where this sits in the reasoning race

Gemini 3 Deep Think's benchmarks put it ahead of Claude Opus 4.6 (68.8% on ARC-AGI-2, according to the OfficeChai comparison) and GPT-5.2 (52.9%) on that specific test. On Humanity's Last Exam, it leads Claude Opus 4.6 (40.0%) and GPT-5.2 (34.5%).

But the ARC Prize Foundation is already developing ARC-AGI-3, an interactive benchmark launching in late March that's specifically designed to test capabilities current models haven't demonstrated. Static reasoning benchmarks are approaching saturation across the board, and the people who build these tests know it.

The next real test for Deep Think isn't another benchmark. It's whether the researchers getting API access actually integrate it into their workflows and produce results they couldn't have reached otherwise.

Google says the API early access program is open now. No timeline on broader availability.

Tags: Gemini 3, Deep Think, Google DeepMind, ARC-AGI-2, reasoning models, AI benchmarks, Humanity's Last Exam, Google AI Ultra, Codeforces
Oliver Senti, Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
