A consortium of researchers from Cornell, MIT, Stanford, and other institutions has released a benchmark that evaluates large language models on scientific discovery tasks rather than textbook knowledge. The framework, called Scientific Discovery Evaluation (SDE), tests whether AI can propose hypotheses, design experiments, and interpret results across biology, chemistry, materials science, and physics.
The gap between knowing and discovering
The SDE results paint a less flattering picture of frontier models than benchmarks like MMLU or GPQA. Those tests ask multiple-choice questions about established science. SDE asks models to do the work of science: generate testable hypotheses from observations, propose simulation parameters, interpret unexpected results.
Many models score above 85% on MMLU's science questions; on SDE's scenario-based tasks they perform considerably worse. The paper reports "a consistent performance gap relative to general science benchmarks," though the authors don't give specific comparative scores in the abstract.
What's more interesting is the scaling finding. Larger models don't proportionally close that gap. The researchers describe "diminishing returns of scaling up model sizes and reasoning" on discovery tasks, which suggests the problem isn't just compute.
How it works
Domain experts defined research projects of "genuine interest," which presumably means projects they'd actually want to work on, not retrofitted puzzles. These projects decompose into modular scenarios. Each scenario generates questions that tie directly to the research context.
The two-level structure matters. Question-level accuracy tells you whether a model can handle individual research subtasks. Project-level performance tells you whether those subtasks compose into useful scientific work.
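That two-level split can be sketched in code. The structure below is a hypothetical reconstruction for illustration only: the field names, the example project, and the 0/1 grading are assumptions, not the paper's actual schema or scoring rules.

```python
from statistics import mean

# Hypothetical sketch of SDE's two-level structure (names and data invented):
# a project decomposes into scenarios, each scenario yields graded questions.
project = {
    "name": "catalyst screening",  # invented example, not from the paper
    "scenarios": {
        "hypothesis_generation": [1, 0, 1, 1],
        "experiment_design":     [0, 1, 0, 0],
        "result_interpretation": [1, 1, 0, 1],
    },
}

def question_accuracy(project):
    """Flat accuracy over every question in every scenario."""
    answers = [a for qs in project["scenarios"].values() for a in qs]
    return mean(answers)

def scenario_accuracies(project):
    """Per-scenario accuracy: can the model handle each research subtask?"""
    return {name: mean(qs) for name, qs in project["scenarios"].items()}

print(question_accuracy(project))    # overall share of questions answered
print(scenario_accuracies(project))  # per-subtask breakdown
```

The point of keeping both views is exactly the one the paper makes: the flat number tells you about subtask competence, while project-level evaluation asks whether those subtasks compose into a usable research outcome, and the two can diverge.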
The finding that jumps out: models sometimes succeed at the project level even when their constituent scenario scores are low. The researchers attribute this to "guided exploration and serendipity in discovery," which is a polite way of saying that sometimes you stumble into the right answer for the wrong reasons.
No clear winner
The benchmark tested models from multiple providers and found that "large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated." Put differently, which model wins depends on which project you're looking at.
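To make that concrete, here's a minimal sketch of a per-project leaderboard; the model names and scores are entirely invented, purely to show how the winner flips between projects.

```python
# Invented per-project scores (not from the paper) illustrating how
# "best model" changes depending on which project you evaluate.
scores = {
    "project_A": {"model_x": 0.61, "model_y": 0.48},
    "project_B": {"model_x": 0.35, "model_y": 0.57},
}

# Pick the highest-scoring model separately for each project.
best = {proj: max(models, key=models.get) for proj, models in scores.items()}
print(best)  # a different winner on each project
```

A single global average would hide this: ranking by mean score picks one "best" model, while per-project ranking shows no model dominates.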
That's bad news if you wanted a single recommendation. It's worse news for anyone hoping current LLMs are approaching "general scientific superintelligence." The authors explicitly state that all models remain distant from that target.
The paper lists 56 authors, which is either a sign of thorough domain coverage or a collaboration that got out of hand. Given the four-discipline scope and the claim that domain experts designed genuine research projects, the former interpretation seems more charitable.
SDE is available for reproducible evaluation. The framework "charts practical paths to advance their development toward scientific discovery," which is benchmark-speak for: here's what you should optimize for if you want AI that can actually do research.