METR added Claude Mythos Preview to its task-completion time horizon tracker on Friday and stapled a disclaimer to its own chart: anything above 16 hours can't be trusted with the current task suite. Mythos lands above that line.
What the number actually says
The nonprofit, which runs AI risk evaluations for frontier labs, put Mythos at a 50% time horizon of at least 16 hours: the length of task, measured in expert human time, that the model finishes with roughly even odds. The 95% confidence interval stretches from 8.5 to 55 hours, and that spread says more about METR's task set than about Mythos. Sixteen hours is about two working days for a human expert. But 8.5 hours is roughly one day, and 55 is almost a week, and the graph can't tell you which end of the interval is closer to the truth.
The methodology comes from METR's original paper on time horizons, updated this year to a larger 228-task suite. Tasks are drawn from RE-Bench, HCAST, and a set of shorter novel software problems. Only five of those 228 tasks are estimated at 16 hours or longer. That's the source of the wide confidence interval and the warning notice METR pinned to the chart.
The benchmark's problem, not just Mythos's
This isn't really a story about Mythos. It's a story about what happens when a model lands in a region of a graph where the graph stops working. METR has been transparent about this. The May 8 update explicitly tells readers not to overread the exact figure, because there aren't enough long tasks to support precise comparisons at this end of the curve.
The task distribution is also narrow. Software engineering, machine learning, cybersecurity. Three domains, none of which involve stakeholder meetings, vague success criteria, or work that has to happen with other humans in the room. So even a clean 16-hour reading wouldn't mean Mythos can handle two days of generic knowledge work. It would mean the model can handle two days of well-specified coding under conditions designed for automatic scoring.
Some pushback
Critics of the AI-progress-is-exponential narrative have already noted that the 50% threshold is doing a lot of work here. Gary Marcus argued this week that the corresponding 80% reliability number for the same models is much lower, and the trend less dramatic. The 50% line is "the model gets it right half the time," which for a multi-day autonomous task is the kind of half that costs money to clean up.
Anthropic has framed the result more favorably, claiming the Mythos snapshot exceeded the next best model by more than 2x on METR's 80% benchmark. The company hasn't shared the full 80% breakdown, and Mythos remains a preview rather than a public release.
What comes next
METR has said it's working on longer tasks. Until that suite ships, every frontier release that lands above 16 hours will trigger the same disclaimer. The more durable takeaway from this update may be that the rate at which frontier models are improving has, at least for now, outrun the rate at which the people measuring them can build new yardsticks. Whether that's a benchmark problem or a capabilities problem depends on which side of the chart you're standing on.