
Google Gemini 3.1 Pro Tops Benchmarks, Outscoring Claude Opus 4.6 and GPT-5.2

Google's latest model leads 13 of 16 benchmarks, more than doubling its predecessor's abstract-reasoning score.

Andrés Martínez, AI Content Writer
February 20, 2026 · 4 min read

Google released Gemini 3.1 Pro on February 19, claiming top scores on 13 of 16 industry benchmarks and more than doubling its predecessor's performance on abstract reasoning tasks. The model is available now in preview across the Gemini API, Google AI Studio, the Gemini app, and NotebookLM.

Three months. That's how long Google's Gemini 3 Pro held the benchmark crown before Anthropic's Opus 4.5 and OpenAI's GPT-5.2 leapfrogged it. Now Google is back with a point release that, if the numbers hold up under independent scrutiny, puts it ahead again.

The ARC-AGI-2 number is wild

The headline figure: 77.1% on ARC-AGI-2, a benchmark designed to test whether models can solve logic patterns they've never encountered before. Gemini 3 Pro scored 31.1% on the same test. So Google is claiming a 2.5x improvement in what's supposed to be one of the hardest reasoning evaluations available. For context, Anthropic's Opus 4.6 sits at 68.8% and OpenAI's GPT-5.2 at 52.9%, according to Google's own blog post.

"Hitting 77.1% on ARC-AGI-2, it's a step forward in core reasoning," Sundar Pichai posted, which is a measured way to describe what would be a massive jump if confirmed by third parties. The ARC-AGI benchmark has been something of a moving target for AI labs. High scores on it haven't always translated into the kind of practical gains you'd expect from doubling a reasoning metric.

Coding and agents

Where things get more interesting for developers: LiveCodeBench Pro, a competitive coding benchmark, puts Gemini 3.1 Pro at an Elo of 2,887. That's up from 2,439 for Gemini 3 Pro and ahead of GPT-5.2's 2,393, according to The Decoder's reporting. An Elo near 2,900 in competitive coding puts this model in a tier that would embarrass most human programmers, though competitive coding benchmarks and production software engineering remain very different animals.
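
For intuition on what that Elo gap means, the standard Elo model converts a rating difference into an expected head-to-head win rate. Here's a minimal sketch using the reported LiveCodeBench Pro ratings; these are the article's cited numbers, not independently verified results:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Reported LiveCodeBench Pro Elo ratings (treat as claims, not verified results).
gemini_31_pro = 2887
gpt_52 = 2393

print(f"{elo_win_probability(gemini_31_pro, gpt_52):.1%}")
# ~94.5%: a 494-point gap implies Gemini 3.1 Pro would be expected to win
# roughly 19 of 20 head-to-head contests, if the Elo assumptions hold.
```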

On SWE-Bench Verified, the closer-to-real-world coding test, the gap narrows considerably. Gemini 3.1 Pro scores 80.6%, nearly tying Claude Opus 4.6 at 80.8%. This has been Anthropic's turf for months, and the fact that Google is within spitting distance matters more than who technically leads by two-tenths of a percentage point.

The agentic benchmarks are where Google seems most eager to plant a flag. APEX-Agents, which evaluates long-horizon professional tasks, shows Gemini 3.1 Pro at 33.5% compared to Opus 4.6's 29.8% and GPT-5.2's 23.0%. MCP Atlas (69.2%) and BrowseComp (85.9%) round out a picture of a model Google clearly wants positioned for the agent-driven workflows everyone keeps promising and nobody has quite delivered.

Where Google still trails

Not everything went Google's way. Claude Sonnet 4.6 in its Thinking (Max) configuration matched Gemini 3.1 Pro on long-context retrieval and led convincingly on GDPval-AA, a test of expert-level professional tasks, scoring 1,633 Elo to Google's 1,317. That's not a narrow gap. Opus 4.6 also beat Gemini 3.1 Pro on Humanity's Last Exam with tools: 53.1% versus 51.4%.

And then there's GPT-5.3-Codex, OpenAI's coding specialist, which topped SWE-Bench Pro (Public) at 56.8% versus Gemini's 54.2% and led a Terminal-Bench 2.0 subcategory at 77.3%. OpenAI only reported Codex scores on a handful of benchmarks, though, making a full comparison impossible.

The leapfrog pattern

What's worth paying attention to isn't any individual benchmark. It's the cadence. Google ships Gemini 3 Pro in November 2025. Anthropic and OpenAI respond. Google fires back three months later with a point release that reclaims most categories. Benchmarks have become a marketing arms race where "winning" lasts about as long as a news cycle.

Google released 3.1 Pro in preview, with plans to make it generally available "soon" after gathering feedback on what it calls "ambitious agentic workflows." The model is accessible through AI Studio, Gemini CLI, Google Antigravity, Android Studio, Vertex AI, Gemini Enterprise, the Gemini app, and NotebookLM (the latter limited to Pro and Ultra subscribers). API pricing reportedly matches Gemini 3 Pro rates, which, at $2 per million input tokens, significantly undercuts Anthropic's Opus pricing.
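
For developers wanting to kick the tires, preview access runs through the same Gemini API surface as earlier models. A minimal sketch using Google's google-genai Python SDK; note the model identifier below is a hypothetical guess at the preview string, not a confirmed ID:

```python
# Minimal sketch of calling the preview model through the Gemini API.
# Assumes the google-genai SDK (pip install google-genai) and a
# GEMINI_API_KEY environment variable.
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # hypothetical ID; check Google's docs
    contents="Summarize the difference between ARC-AGI-1 and ARC-AGI-2.",
)
print(response.text)
```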

No word yet on a Gemini 3.1 Flash, though given that Gemini 3 Flash launched about a month after 3 Pro, the pattern suggests it's a matter of when, not if.

Tags: Google Gemini, Gemini 3.1 Pro, AI benchmarks, large language models, Google DeepMind, Claude Opus, GPT-5, AI coding, ARC-AGI, AI competition
Andrés Martínez, AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
