AI Research

AI Is Accelerating Nearly Twice as Fast as Before. The Problem: We're Losing Our Ability to Measure It

Epoch AI's year-end analysis finds frontier models improving at 15 points per year, up from 8. But their own benchmarking team admits comparing scores has become deeply problematic.

Liza Chan, AI & Emerging Tech Correspondent

December 26, 2025 · 5 min read

[Illustration: an ascending progress line on a warped, unstable graph grid, representing AI capability acceleration alongside measurement uncertainty]

Frontier AI capabilities nearly doubled their rate of improvement around April 2024, according to new analysis from Epoch AI published this week. Using their Epoch Capabilities Index (ECI), researchers found that the best models advanced from roughly 8 points per year before the breakpoint to 15 points per year after. The acceleration coincides with the rise of reasoning models and an industry-wide pivot toward reinforcement learning.

But buried in Epoch AI's year-end review is a more unsettling finding: the benchmarks used to measure this progress are becoming less trustworthy by the month.

The acceleration is real, but the denominator keeps shifting

Yafah Edelman and Jaeho Lee, the Epoch AI researchers behind the capabilities analysis, fitted a two-segment linear model to frontier ECI scores spanning December 2021 through December 2025. The breakpoint model beat a simple linear fit on both AIC and BIC metrics, with an Akaike Evidence Ratio of 12 and a Bayes factor of 5.2. The 90% confidence interval for the speedup factor ranges from 1.4x to 3.3x.
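The model-selection step described above can be sketched in a few lines: fit a single line and the best two-segment line to a capability series, then compare Gaussian AIC values. This is an illustrative reconstruction, not Epoch AI's actual pipeline; the parameter counts and breakpoint search are my assumptions.

```python
import numpy as np

def fit_linear(t, y):
    """Ordinary least-squares line fit; returns residual sum of squares."""
    A = np.vstack([t, np.ones_like(t)]).T
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

def aic(rss, n, k):
    """Gaussian AIC up to a constant: n*log(RSS/n) + 2k, k = fitted params."""
    return n * np.log(rss / n) + 2 * k

def compare_breakpoint(t, y, breakpoints):
    """AIC of a single line vs. the best two-segment fit over candidate
    breakpoints.  Lower AIC wins; the gap yields an evidence ratio."""
    n = len(t)
    aic_single = aic(fit_linear(t, y), n, 3)  # slope, intercept, noise var

    best_rss = np.inf
    for bp in breakpoints:
        left, right = t < bp, t >= bp
        if left.sum() < 2 or right.sum() < 2:
            continue
        best_rss = min(best_rss, fit_linear(t[left], y[left])
                                 + fit_linear(t[right], y[right]))
    # 6 params: two slopes, two intercepts, breakpoint, noise variance
    return aic_single, aic(best_rss, n, 6)
```

On data with a genuine slope change (say 8 points/year jumping to 15), the two-segment fit pays its extra-parameter penalty and still wins, which is the shape of the evidence Epoch reports.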

The METR Time Horizon benchmark, which measures the length of task models can complete with 50% reliability, corroborates the pattern with its own roughly 40% acceleration, detected in October 2024.

What's driving this? The researchers point to reasoning models and RL scaling. But labs including OpenAI and Anthropic said earlier in 2025 that their RL scaling rate couldn't be sustained for more than one to two years before hitting compute infrastructure limits, which means the current acceleration might be running on borrowed time.

The benchmark problem nobody wants to solve

In a separate Gradient Updates post published the same day, Epoch AI researchers Florian Brand and Jean-Stanislas Denain laid out the benchmarking mess in uncomfortable detail.

Take GPQA-Diamond, a multiple-choice benchmark that should be straightforward to run. Epoch catalogued four different implementations with different prompt templates, system messages, and default temperatures. EleutherAI's harness uses temperature 0.0 for API models; OpenAI's simple-evals uses 0.5; OpenAI's gpt-oss evals use 1.0. Different prompts instruct models to format answers differently. Everyone runs the same named benchmark, but nobody is running the same test.
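To make the fragmentation concrete, here is a sketch of three harness configurations that differ only in decoding temperature and prompt template. The temperatures are the ones reported above; the config names and prompt wording are hypothetical stand-ins, not the real harnesses' templates.

```python
# Three "implementations" of the same named benchmark.  Temperatures match
# the article; the prompts are illustrative, not the actual harness templates.
HARNESS_CONFIGS = {
    "eleutherai_harness": {
        "temperature": 0.0,
        "template": "Question: {q}\nAnswer with the letter of the correct choice.",
    },
    "openai_simple_evals": {
        "temperature": 0.5,
        "template": "{q}\n\nThink step by step, then end with 'Answer: X'.",
    },
    "gpt_oss_evals": {
        "temperature": 1.0,
        "template": "{q}\nGive your final answer in \\boxed{{}}.",
    },
}

def build_request(harness, question, model="some-model"):
    """Assemble a chat-style request; the 'same' question yields three
    different payloads depending on which harness is chosen."""
    cfg = HARNESS_CONFIGS[harness]
    return {
        "model": model,
        "temperature": cfg["temperature"],
        "messages": [{"role": "user", "content": cfg["template"].format(q=question)}],
    }
```

Any score comparison across harnesses is implicitly comparing these payloads, not just the models behind them.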

For non-agentic benchmarks like GPQA-Diamond, these differences wash out statistically. The scores reported by AI developers match independent runs within reasonable variance. But agentic benchmarks are a different story.

On SWE-bench Verified, a coding benchmark, switching the scaffold (the software that operates the agent) produces up to 11% difference for GPT-5 and up to 15% for Kimi K2 Thinking. That's larger than most capability gaps between models.

API providers are the real wildcards

The Epoch team ran several open models across multiple API providers while keeping everything else constant. The differences were stark.

For GLM-4.6, some providers returned rate limit errors that got scored as failures. Others returned empty or truncated responses despite not hitting token limits. SiliconFlow, Friendly, and Cerebras enforced lower max_tokens limits than advertised. Newer models fared worse: providers struggled more with recently released models than with established ones like Qwen3.

The pattern appears across model releases. When new models drop, bugs proliferate. Providers scramble to fix inference issues over subsequent weeks. But benchmarking organizations want scores immediately, often before deployments stabilize.

MiniMax reported a 23-percentage-point difference on tau-bench between their native API and the standard ChatCompletions API. That's not measurement noise; that's the difference between a mediocre model and a frontier one.

The scaffolding problem

OpenAI's recent o3 and o4-mini evaluations on SWE-bench Verified ran only 477 of the 500 problems due to infrastructure challenges. Creating and maintaining execution environments for agentic benchmarks is hard enough that even well-resourced labs drop samples.

The choice of tools matters enormously. An agent using Claude Code gets different capabilities than one using OpenHands. Prompts differ. Tool access differs. The scaffold becomes an inseparable part of what's being measured.

This leaves evaluators with an uncomfortable tradeoff. Use a standardized scaffold and you get reproducible comparisons but potentially underelicit frontier capabilities. Use the best available scaffold per model and you lose the ability to compare directly. Both approaches have obvious failure modes.

Where this leaves the field

Epoch AI's top 10 list from 2025 contains a revealing admission about their own benchmarking hub: they aim to provide independent evaluations because the AI community often relies on claims from AI labs. But independence doesn't solve the underlying methodology fragmentation.

Stanford researchers presented work at NeurIPS in December showing that roughly 1 in 20 benchmarks contain significant flaws, including labeling errors, ambiguities, and biases. The European Commission's Joint Research Centre published a paper earlier this year titled "Can We Trust AI Benchmarks?" The answer, increasingly, is "with difficulty."

The Epoch Capabilities Index itself acknowledges interpretation problems. ECI scores work like Elo ratings: absolute values are meaningless, only comparisons matter. The scale is arbitrary, calibrated so Claude 3.5 Sonnet equals 130 and GPT-5 equals 150. There's no maximum achievable score.
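The Elo analogy can be made concrete with the standard chess-Elo logistic, under which only rating differences carry information. Note the caveat: ECI defines its own scale, so the 400-point constant below is chess's convention, not ECI's actual mapping.

```python
def elo_expected_score(delta):
    """Expected score for a player rated `delta` points above an opponent,
    using the standard chess-Elo logistic (not ECI's actual formula)."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

# Only the gap matters: shifting every rating by a constant changes nothing,
# which is why an absolute value like "GPT-5 = 150" has no standalone meaning.
gap = elo_expected_score(150 - 130)        # the GPT-5 vs Claude 3.5 Sonnet gap
shifted = elo_expected_score(1150 - 1130)  # same gap under an arbitrary offset
assert gap == shifted
```

The same invariance holds for ECI: the calibration points (130 and 150) are arbitrary anchors, and no score is a percentage of anything.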

This doesn't make the acceleration finding wrong. But it does mean the 8-to-15 points per year framing requires trusting that the underlying benchmark suite captures something real about capabilities, and that API instabilities and scaffold variations haven't systematically biased the measurements over time.

Frontier AI is probably improving faster than it was two years ago. The reasoning model push appears to be working. But the infrastructure for verifying these claims is straining under the weight of rapid releases, fragmented implementations, and unreliable model serving. The numbers are getting bigger. Whether they're getting more meaningful is a separate question.

Tags: artificial intelligence, machine learning, benchmarks, Epoch AI, AI research, frontier models, reasoning models, AI evaluation
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.
