Researchers from the University of Oxford, Stanford, the Allen Institute for AI and Sakana AI released a benchmark this month testing a question the "AI scientist" crowd would rather not dwell on: can a language model actually see what's coming in science, or does it just recognize what already happened? Their technical paper, posted to arXiv on May 21, says mostly the latter.
The trick: send the model back in time
The problem with asking an LLM to predict a 2025 breakthrough is obvious. It probably read about the breakthrough during training. So the team built a benchmark called CUSP that clamps down on this. For any given event, the model is only allowed to reason from information available before that event's earliest publication date, pinned via DOI timestamps across Crossref, arXiv and other APIs.
Strip away the hindsight and you're left with genuine forecasting. The benchmark code is public, and the full dataset sits on Hugging Face.
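For a sense of what that cutoff rule looks like in practice, here is a minimal sketch in Python. It is not the benchmark's code: the `Document` class, the function names and the dates are all made up for illustration, and it assumes each source reports a single date per event. The idea is simply to take the earliest date any source reports and hide everything published on or after it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Hypothetical record type; the real dataset schema may differ.
    title: str
    published: date


def earliest_publication(dates_by_source: dict[str, date]) -> date:
    """Return the earliest date any source reports for the event."""
    return min(dates_by_source.values())


def visible_context(docs: list[Document], cutoff: date) -> list[Document]:
    """Keep only documents published strictly before the cutoff."""
    return [d for d in docs if d.published < cutoff]


# Invented timestamps for one event, as two sources might report them.
event_dates = {
    "crossref": date(2025, 3, 14),
    "arxiv": date(2025, 1, 9),
}
cutoff = earliest_publication(event_dates)

corpus = [
    Document("Earlier related work", date(2024, 11, 2)),
    Document("The preprint itself", date(2025, 1, 9)),
    Document("Later news coverage", date(2025, 3, 20)),
]
allowed_titles = [d.title for d in visible_context(corpus, cutoff)]
print(cutoff, allowed_titles)
```

Taking the minimum across sources is the conservative choice: if a preprint appeared months before the journal version, the preprint date is the one that keeps hindsight out.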
CUSP draws on 4,760 scientific events spanning biology, chemistry, physics, medicine and AI, sourced from Nature, Science, Cell and community AI leaderboards. Those events generate 17,429 forecasting tasks across four dimensions: is this advance feasible, what mechanism underlies it, design a solution, and when will it land.
So how did the models do?
Fine at recognizing plausible directions. Bad at the part that matters. The paper reports GPT-5.4 hitting about 82% on multiple-choice mechanism questions, where the job is picking the right technical approach from competing options. That's the recognition task, and it's the one closest to retrieval.
Feasibility is where things fall apart. Asked whether a specific advance would actually be realized, the models hovered near coin-flip territory. The paper frames this bluntly: models fail to predict whether scientific advances will happen at all.
Timing was its own mess. Predictions overshot reality by roughly 14 months at the median, according to the authors. Models consistently guessed that things would take longer than they did, which is a strange failure mode if you think LLMs are supposed to be relentless techno-optimists.
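To make "overshot at the median" concrete, here is one way such a number could be computed. This is a sketch under assumptions: the article doesn't spell out the paper's exact procedure, the `months_late` helper is hypothetical, and the dates below are invented rather than taken from the dataset. A positive error means the model predicted a later date than the real one.

```python
from datetime import date
from statistics import median


def months_late(actual: date, predicted: date) -> int:
    """Signed error in whole months; positive means predicted too late."""
    return (predicted.year - actual.year) * 12 + (predicted.month - actual.month)


# (actual date, predicted date) pairs, invented for illustration only.
pairs = [
    (date(2024, 3, 1), date(2025, 3, 1)),  # 12 months too late
    (date(2024, 6, 1), date(2024, 4, 1)),  # 2 months too early
    (date(2023, 9, 1), date(2025, 9, 1)),  # 24 months too late
]

errors = [months_late(actual, predicted) for actual, predicted in pairs]
median_error = median(errors)
print(errors, median_error)
```

Using the median rather than the mean keeps one wildly late guess from dominating the summary.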
The detail that complicates the easy explanation
You'd expect a model to do better on events from before its training cutoff, since it might have absorbed the relevant context. It barely mattered. Performance was largely insensitive to whether an event fell before or after the cutoff, which means "it just hadn't seen the data yet" doesn't explain the gap.
Giving models more information helped, but it didn't bring them level with a full-information setting. And here's the uncomfortable bit: the remaining gap got wider for the most-cited, highest-impact discoveries. The science that matters most is the science these systems are worst at anticipating.
One more wrinkle the authors flag: the models were systematically overconfident, with response biases that make their stated uncertainty close to useless. A forecaster that's both wrong and sure of itself is worse than one that admits it doesn't know.
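That last claim can be checked with a standard scoring rule. The sketch below uses the Brier score, a common calibration measure. The article doesn't say which metric the paper uses, so treat this as a swapped-in illustration, and the probabilities and outcomes as invented. It compares a forecaster that says 90% every time and is right half the time against one that always says 50%.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated probability and outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def overconfidence_gap(probs: list[float], outcomes: list[int]) -> float:
    """Average stated confidence minus actual accuracy; positive means overconfident."""
    confidences = [max(p, 1 - p) for p in probs]
    hits = [int((p >= 0.5) == bool(o)) for p, o in zip(probs, outcomes)]
    return sum(confidences) / len(probs) - sum(hits) / len(probs)


# Invented outcomes: the advance happened in two of four cases.
outcomes = [1, 0, 1, 0]

sure_and_wrong = [0.9, 0.9, 0.9, 0.9]   # always 90% confident it will happen
admits_doubt = [0.5, 0.5, 0.5, 0.5]     # always says it doesn't know

print(brier_score(sure_and_wrong, outcomes))        # about 0.41
print(brier_score(admits_doubt, outcomes))          # 0.25
print(overconfidence_gap(sure_and_wrong, outcomes)) # about 0.4
```

Both forecasters get the same number of calls right, but the confident one scores worse because every miss is penalized for the certainty attached to it.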
What this actually means
The headline framing of an "autonomous AI scientist" leans hard on the idea that these systems can chart where research should go next. Sakana AI, one of the paper's collaborators, has been among the loudest voices building toward that vision, which makes its name on a result this skeptical a little surprising.
The takeaway from the authors is restrained and probably correct: today's frontier models work better as interpreters of results already in hand than as forecasters of what's coming. Access to prior knowledge, it turns out, is not the same thing as foresight.
The dataset and evaluation code are already live for anyone who wants to run their own models against the benchmark. The real test will be whether the next generation of frontier releases moves the feasibility numbers off the coin flip.




