METR Says Claude Mythos Preview Tops Its 16-Hour Benchmark Ceiling

Anthropic's preview model lands above the range METR considers reliable. The benchmark itself may be the problem.

Liza Chan, AI & Emerging Tech Correspondent
May 11, 2026, 3 min read
An analog chart recorder with its ink pen pressed against the upper printed limit of the paper roll.

METR added Claude Mythos Preview to its task-completion time horizon tracker on Friday and stapled a disclaimer to its own chart: anything above 16 hours can't be trusted with the current task suite. Mythos lands above that line.

What the number actually says

The nonprofit, which runs AI risk evaluations for frontier labs, put Mythos at a 50% time horizon of at least 16 hours, with a 95% confidence interval stretching from 8.5 to 55 hours. That spread says more about METR's task set than about Mythos. Translated into plain terms, the model has roughly even odds of finishing a task that would take a human expert two eight-hour working days. Maybe one. Maybe almost seven. The graph can't tell you which.
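The conversion from hours to working days can be checked with a few lines of Python. This is a minimal sketch: the three hour figures are the ones reported above, while the eight-hour workday and the names `HOURS_PER_WORKDAY` and `to_workdays` are assumptions made for illustration.

```python
# Assumption: one working day is 8 hours; the article does not state a figure.
HOURS_PER_WORKDAY = 8.0

def to_workdays(hours: float, hours_per_day: float = HOURS_PER_WORKDAY) -> float:
    """Convert a task length in hours to working days."""
    return hours / hours_per_day

# Hour figures as reported: 95% interval bounds and the 16-hour point estimate.
reported_hours = {"ci_low": 8.5, "point": 16.0, "ci_high": 55.0}

workdays = {label: to_workdays(h) for label, h in reported_hours.items()}

for label, days in workdays.items():
    print(f"{label}: {days:.1f} working days")
```

Under that assumption the interval runs from just over one working day to just under seven, which is the spread the chart cannot narrow.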

The methodology comes from METR's original paper on time horizons, updated this year to a larger 228-task suite. Tasks are drawn from RE-Bench, HCAST, and a set of shorter novel software problems. Only five of those 228 tasks are estimated at 16 hours or longer. That's the source of the wide confidence interval and the warning notice METR pinned to the chart.
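To put the five-in-228 figure in proportion, here is a short arithmetic sketch. The two counts come from the paragraph above; the variable names are illustrative only.

```python
# Counts as reported for the updated task suite.
TOTAL_TASKS = 228
LONG_TASKS = 5  # tasks estimated at 16 hours or longer

long_share = LONG_TASKS / TOTAL_TASKS
shorter_tasks = TOTAL_TASKS - LONG_TASKS

print(f"Long tasks: {long_share:.1%} of the suite")
print(f"Tasks under 16 hours: {shorter_tasks}")
```

Roughly one task in 45 sits at or beyond the 16-hour mark, so any estimate in that region rests on very few data points.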

The benchmark's problem, not just Mythos's

This isn't really a story about Mythos. It's a story about what happens when a model lands in a region of a graph where the graph stops working. METR has been transparent about this. The May 8 update explicitly tells readers not to overread the exact figure, because there aren't enough long tasks to support precise comparisons at this end of the curve.

The task distribution is also narrow. Software engineering, machine learning, cybersecurity. Three domains, none of which involve stakeholder meetings, vague success criteria, or work that has to happen with other humans in the room. So even a clean 16-hour reading wouldn't mean Mythos can handle two days of generic knowledge work. It would mean the model can handle two days of well-specified coding under conditions designed for automatic scoring.

Some pushback

Critics of the AI-progress-is-exponential narrative have already noted that the 50% threshold is doing a lot of work here. Gary Marcus argued this week that the corresponding 80% reliability number for the same models is much lower, and the trend less dramatic. The 50% line is "the model gets it right half the time," which for a multi-day autonomous task is the kind of half that costs money to clean up.
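Why a stricter reliability bar shortens the horizon can be sketched with a logistic success curve over the logarithm of task length, which is the general shape time-horizon estimates of this kind rely on. Everything below is an assumption for illustration: the function names, the parameterization, and especially the slope value are hypothetical and are not METR's fitted numbers. Only the 16-hour 50% figure comes from the article.

```python
import math

def success_probability(task_hours: float, h50: float, slope: float) -> float:
    """Logistic success curve in log2(task length); equals 0.5 when task_hours == h50."""
    return 1.0 / (1.0 + math.exp(-slope * (math.log2(h50) - math.log2(task_hours))))

def horizon_at(p: float, h50: float, slope: float) -> float:
    """Task length (hours) at which the curve above predicts success probability p."""
    logit = math.log(p / (1.0 - p))
    return h50 * 2.0 ** (-logit / slope)

H50_HOURS = 16.0  # 50% time horizon reported in the article
SLOPE = 1.0       # hypothetical steepness per doubling of task length; not a METR figure

h50_check = horizon_at(0.5, H50_HOURS, SLOPE)
h80 = horizon_at(0.8, H50_HOURS, SLOPE)

print(f"50% horizon: {h50_check:.1f} hours")
print(f"80% horizon under the assumed slope: {h80:.1f} hours")
```

Whatever slope is chosen, the 80% horizon on a curve like this falls below the 50% horizon, and a flatter curve widens the gap. That is the mechanism behind the critics' point, though the actual size of the gap depends on METR's fitted data, which this sketch does not reproduce.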

Anthropic has framed the result more favorably, claiming the Mythos snapshot exceeded the next best model by more than 2x on METR's 80%-reliability time horizon. The company hasn't shared the full 80% breakdown, and Mythos remains a preview rather than a public release.

What comes next

METR has said it's working on longer tasks. Until that suite ships, every frontier release that lands above 16 hours will trigger the same disclaimer. The more durable takeaway from this update may be that the rate at which frontier models are improving has, at least for now, outrun the rate at which the people measuring them can build new yardsticks. Whether that's a benchmark problem or a capabilities problem depends on which side of the chart you're standing on.

Tags: AI, METR, Claude Mythos, Anthropic, AI benchmarks, AI safety, time horizon, AI evaluation, frontier models
Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.

METR: Claude Mythos Tops 16-Hour Benchmark Ceiling | aiHola