Artificial Analysis released version 4.0 of its Intelligence Index on Monday, and the results are humbling. Top models that previously scored in the 70s now max out around 50. OpenAI's GPT-5.2 with extended reasoning claims the top spot at 50 points, followed by Anthropic's Claude Opus 4.5 at 49 and Google's Gemini 3 Pro at 48.
A two-point spread at the top. That's the state of frontier AI in early 2026.
The saturation problem
The overhaul addresses something the industry has quietly acknowledged for months: traditional benchmarks stopped being useful. When every major model scores above 90% on MMLU-Pro, what's the point? You can't differentiate between systems, and enterprises trying to pick a model for deployment are flying blind.
So Artificial Analysis gutted three staple evaluations (MMLU-Pro, AIME 2025, and LiveCodeBench) that AI companies have been citing in press releases for years. In their place: tests designed to measure whether these systems can do actual work.
The recalibration is deliberate. Dropping top scores from 73 to 50 restores headroom for future improvement. Whether that headroom gets filled, well, we'll see.
What's actually being tested now
The new index weights four categories equally: Agents, Coding, Scientific Reasoning, and General. Each contributes 25% to the final score. The interesting additions are buried in the details.
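Since each category contributes exactly 25%, the composite is just an equal-weight average of the four category scores. A minimal sketch (the per-category scores below are hypothetical placeholders, not published figures):

```python
# Equal-weight composite: each of the four categories contributes 25%.
CATEGORY_WEIGHTS = {
    "Agents": 0.25,
    "Coding": 0.25,
    "Scientific Reasoning": 0.25,
    "General": 0.25,
}

def intelligence_index(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (0-100 scale)."""
    return sum(CATEGORY_WEIGHTS[name] * score
               for name, score in category_scores.items())

# Hypothetical example: four category scores averaging to a composite of 50.
print(intelligence_index(
    {"Agents": 52.0, "Coding": 55.0, "Scientific Reasoning": 40.0, "General": 53.0}
))  # 50.0
```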
GDPval-AA tests models on economically valuable tasks across 44 occupations and 9 industries. Models get shell access and web browsing through an agentic harness called Stirrup, then produce deliverables: documents, presentations, spreadsheets, diagrams. The kind of stuff people actually get paid to make. GPT-5.2 leads here with an Elo of 1442; Claude Opus 4.5's non-thinking variant scored 1403.
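Those ratings come from pairwise comparisons of deliverables. The standard Elo update for such a comparison looks like this; Artificial Analysis hasn't published its exact K-factor or update schedule, so treat this as a sketch of the general method:

```python
# Standard Elo rating update for one pairwise comparison between two
# models' deliverables. K-factor and schedule are assumptions; the
# benchmark's exact parameters are not public.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B implied by the ratings gap."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison."""
    ea = expected_score(rating_a, rating_b)
    sa = 1.0 if a_won else 0.0
    delta = k * (sa - ea)
    return rating_a + delta, rating_b - delta
```

Plugging in the published ratings, `expected_score(1442, 1403)` comes out near 0.56: a 39-point gap implies the leader's deliverables would be preferred in roughly 56% of head-to-head comparisons. Close, in other words.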
Then there's CritPt, and this is where things get interesting. More than 50 physicists from over 30 institutions created 71 research-level problems spanning condensed matter physics, astrophysics, high-energy physics, and a bunch of other fields I'd need a PhD to explain properly. These aren't textbook questions. They're the kind of warm-up exercises a hands-on PI might give a junior grad student.
The best model? GPT-5.2 at 11.5% accuracy. The frontier of AI reasoning hits a wall when you ask it to actually do physics research.
The hallucination trade-off
AA-Omniscience measures factual recall across 6,000 questions while explicitly penalizing hallucination. The results expose an uncomfortable pattern.
Google's two Gemini entries lead on raw accuracy, at 54% and 51%, but they also hallucinate at rates of 88% and 85%. That means when they don't know something, they overwhelmingly make something up rather than admitting uncertainty.
Anthropic's Claude models show the opposite profile. Lower accuracy, but Claude 4.5 Haiku recorded the lowest hallucination rate of any model at just 26%. Claude 4.5 Sonnet and Claude Opus 4.5 came in at 48%. GPT-5.2 sits somewhere in the middle with 51% accuracy and the second-lowest hallucination rate.
The Omniscience Index penalizes wrong answers as much as it rewards correct ones. A score of 0 means a model answers correctly and incorrectly at the same rate. Only four models scored positive. Most frontier systems are, quite literally, more likely to confidently lie than give you the right answer.
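The scoring described above amounts to a correct-minus-incorrect tally, with abstentions scoring zero. A sketch in that spirit (the exact normalization Artificial Analysis uses is an assumption here, as is this definition of hallucination rate):

```python
# Correct-minus-incorrect scoring: +1 per right answer, -1 per wrong
# answer, 0 for abstaining. Normalization to a per-100-questions scale
# is an assumption, not the benchmark's published formula.
def omniscience_style_score(correct: int, incorrect: int, abstained: int) -> float:
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Of the questions a model did not get right, the share it
    answered wrongly instead of abstaining (assumed definition)."""
    return incorrect / (incorrect + abstained)

# Illustrative numbers only: out of 100 questions, 54 right, 40 wrong,
# 6 abstentions. High accuracy, but nearly every miss is a confident
# wrong answer rather than an "I don't know."
print(omniscience_style_score(54, 40, 6))   # 14.0
print(round(hallucination_rate(40, 6), 2))  # 0.87
```

Under this scheme a model that abstains whenever it is unsure keeps its score from going negative, which is exactly the trade-off the Claude results illustrate.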
For anyone deploying these in regulated industries, that's not a fun number to explain to compliance.
The cost question
Running the full Intelligence Index 4.0 evaluation suite on GPT-5.2 costs approximately $2,930, with $2,361 of that going to reasoning costs. Gemini 3 Pro Preview? Under $1,000. Claude 4.5 Sonnet? Also under $1,000.
The most capable model is also the most expensive to run by a factor of three. Whether that premium is worth a two-point lead on a benchmark, especially one where all three top models are essentially tied within measurement error, is a question enterprises will have to answer for themselves.
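The back-of-envelope math is simple enough to write down. Treating "under $1,000" as exactly $1,000 for illustration (an upper bound, not a published figure):

```python
# Cost per index point, using the article's figures. The $1,000 entry
# is an illustrative upper bound ("under $1,000"), not an exact cost.
eval_cost_usd = {"GPT-5.2": 2930, "Gemini 3 Pro": 1000}
index_score = {"GPT-5.2": 50, "Gemini 3 Pro": 48}

cost_per_point = {m: eval_cost_usd[m] / index_score[m] for m in eval_cost_usd}
print(cost_per_point)  # GPT-5.2 ~58.6 $/point vs Gemini 3 Pro at most ~20.8
```

Roughly $58.60 per point versus at most about $20.80: about a threefold premium for a two-point lead.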
What this doesn't tell you
The Intelligence Index is text-only and English-only. Artificial Analysis benchmarks image, speech, and multilingual performance separately. So if your use case involves any of that, these numbers only tell part of the story.
And the GDPval benchmark uses Gemini 3 Pro as the grading model for pairwise comparisons. Whether that introduces any systematic bias toward models that happen to produce outputs Gemini prefers is an open question. Artificial Analysis freezes Elo scores at the time of evaluation to prevent gaming, but the grader choice is worth noting.
The CritPt benchmark also has an interesting disclosure: its dataset curation involved adversarial selection of questions based on tests with GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and several o1 variants. The authors explicitly discourage direct comparison with models not used in curation, as the dataset may be biased against them.
Where this leaves things
The three-way tie at the top confirms what many suspected: no single company has pulled ahead on general intelligence. OpenAI, Anthropic, and Google are essentially neck-and-neck, trading leads depending on which specific capability you care about.
The benchmark overhaul itself is probably more significant than the current rankings. By measuring what models can actually produce rather than what trivia they can recall, Intelligence Index 4.0 shifts the conversation toward practical utility. That's useful.
But the CritPt results are the real story here. Frontier AI systems have gotten very good at pattern matching across massive training corpora. They have not gotten good at the kind of rigorous, creative reasoning that actual research requires. An 11.5% accuracy rate on entry-level physics problems isn't a gap that scaling alone is likely to close.
The benchmarks have room to climb again. Whether the models can climb with them is the question for 2026.




