AI Research

Sakana AI's AI Scientist Lands in Nature, but the Fine Print Tells a Different Story

The system that automates ML research got its Nature paper. Peer review forced the team to walk back key claims.

Oliver Senti
Senior AI Editor
March 26, 2026 · 6 min read
[Image: Abstract visualization of an autonomous AI system generating scientific research papers through interconnected neural pathways and experimental data flows]

Sakana AI's AI Scientist, the autonomous research system that can generate hypotheses, run experiments, and write papers without human intervention, has been published in Nature. The paper, which dropped on March 25, consolidates roughly eighteen months of work by a team spanning Sakana's Tokyo lab, the University of British Columbia, the Vector Institute, and the University of Oxford. It is one of the first autonomous AI research tools to survive Nature's peer-review gauntlet.

But surviving peer review and emerging unchanged are two different things. And the Nature version of this story is considerably more cautious than the one Sakana told in August 2024.

What actually happened at ICLR

The headline result: AI Scientist-v2 submitted three fully AI-generated papers to a workshop at ICLR 2025, and one was accepted. The accepted paper investigated compositional regularization in neural network training. It scored 6.33 on average from reviewers (individual scores of 6, 7, and 6), putting it roughly in the top 45% of submissions to that workshop. Sakana withdrew it before publication.

Here's the thing, though. This was a workshop paper, not a main conference submission. Workshop acceptance rates at venues like ICLR run around 60-70%, compared to 20-30% for the main track. Sakana's own co-founder David Ha acknowledged the paper isn't at the level of the best human-produced work accepted at the same conference. And the team's internal assessment concluded that none of the three generated papers met their own bar for a main conference publication.

The workshop in question, ICBINB ("I Can't Believe It's Not Better"), focuses on negative results and limitations of deep learning methods. It's a legitimate venue, but it's not exactly where you'd go to show off a system's best work. Whether that was strategic or coincidental is left as an exercise for the reader.

Nature made them tone it down

The original preprint from 2024 described AI Scientist as automating the "entire" scientific process and called it "the beginning of a new era in scientific discovery." Nature's reviewers apparently weren't having it. The published version walks back claims about full automation, acknowledges that humans helped filter the most promising outputs, and expands considerably on limitations and ethical considerations.

That's peer review working as intended, frankly. The 2024 preprint drew immediate scrutiny from researchers who felt the claims outran the evidence. Cong Lu, one of the lead authors and a postdoctoral fellow at UBC, compared the system's abilities to those of an early PhD student: some surprisingly creative ideas, vastly outnumbered by bad ones.

The independent audit is less flattering

A team led by Joeran Beel published an independent evaluation that paints a rougher picture. Five out of twelve proposed experiments (42%) failed due to coding errors. Several ideas that the system classified as novel turned out to be well-established techniques, like micro-batching for SGD. In one case, an experiment designed to optimize energy efficiency reported accuracy improvements while actually consuming more resources.

The generated manuscripts cited a median of five papers each, and most citations were outdated. Some papers contained hallucinated numerical results. Beel's team described the output quality as resembling "a rushed undergraduate paper."

And yet. The speed is hard to ignore: a full paper for roughly 3.5 hours of human involvement and $6 to $15 in compute. That's not nothing, even if the output needs significant cleanup.

The scaling law claim

The most interesting new result in the Nature paper might be the scaling analysis. Sakana built an Automated Reviewer that ensembles five independent AI reviews to evaluate paper quality. They benchmarked it against thousands of human decisions from the OpenReview dataset and report 69% balanced accuracy, which they say is comparable to human reviewer agreement. (The NeurIPS 2021 consistency experiment found that human reviewers themselves disagree at similar rates, so this bar is arguably not that high.)
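To make those numbers concrete, here's a minimal Python sketch of the recipe as described: reduce each of five independent AI reviews to an accept/reject vote, take a majority, and score the ensemble with balanced accuracy against human decisions. The function names and toy data are mine, not Sakana's; their actual reviewer produces full written reviews, not bare votes.

```python
# Minimal sketch of ensemble review + balanced accuracy scoring.
# Toy data and function names are illustrative, not Sakana's code.
from statistics import mean

def ensemble_decision(votes):
    """Majority vote over independent per-paper reviews (True = accept)."""
    return sum(votes) > len(votes) / 2

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so the accept/reject imbalance in real
    conference data doesn't inflate the score."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return mean(recalls)

# Hypothetical human accept/reject decisions and five AI votes per paper.
human_decisions = [True, False, False, True, False]
ai_votes = [
    [True, True, False, True, True],     # majority -> accept
    [False, False, True, False, False],  # majority -> reject
    [True, False, False, False, True],   # majority -> reject
    [True, True, True, False, True],     # majority -> accept
    [False, True, True, False, False],   # majority -> reject
]

predictions = [ensemble_decision(v) for v in ai_votes]
print(balanced_accuracy(human_decisions, predictions))  # 1.0 on this toy data
```

Balanced accuracy is the right lens here because most submissions are rejected: a "reviewer" that rejects everything scores high on plain accuracy but only 50% on the balanced version.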

Using this reviewer, they found a clear correlation: stronger foundation models produce higher-quality papers. Better base model in, better science out. Sakana's blog post frames this as a "scaling law of science," which is a bold claim for what amounts to a correlation across a handful of model generations evaluated by an AI grader. But the direction of the trend is consistent, and if it holds, the implication is straightforward. As models get cheaper and more capable, the quality gap between AI-generated and human research narrows on its own. No architectural breakthroughs required.

I'm not sure I buy the framing, but the data point is worth watching.

The flood problem

Jevin West, a computational social scientist at the University of Washington, called AI Scientist "a remarkable technological achievement" in Nature's coverage. But his concern is practical: what happens when conferences already drowning in submissions get hit with a firehose of $15 papers?

The timing is uncomfortable. A study of ICLR 2025 reviews found that roughly 21% were fully AI-generated. Not assisted by AI. Fully written by LLMs. Now we have AI writing papers and AI writing reviews of papers. The feedback loop writes itself.

Sakana, to their credit, watermarked all AI-generated submissions and withdrew the accepted one. They've called for community norms around AI-generated science. But norms and $15 in compute costs don't exactly create a high barrier.

What the system can't do

The system is currently limited to computational experiments in machine learning. No wet labs, no field work, no physical apparatus. Jeff Clune has mentioned conversations with biologists about in silico experiments, and the open-source codebase theoretically allows adaptation to other domains. But the gap between running a training loop and conducting a controlled biological experiment is measured in orders of magnitude, not incremental improvements.

The system also still struggles with deep methodological rigor, can hallucinate citations, and occasionally duplicates figures in appendices. These are not edge cases; they are current baseline behavior.

So what is this, exactly?

Jennifer Listgarten at UC Berkeley has argued in Nature Biotechnology that AI is not about to produce breakthroughs across multiple scientific domains, in part because most fields lack the vast public datasets that made NLP and computer vision tractable. AI Scientist lives in the one domain where data is abundant and experiments are cheap to run: machine learning research about machine learning.

That's simultaneously the system's greatest strength and its most obvious limitation. It can iterate on ML techniques at a pace no human lab can match. But the resulting papers, at least so far, land at the workshop level. Interesting negative results. Preliminary findings. The kind of work that starts conversations rather than settles them.

The Nature publication is a legitimate milestone for Sakana and the broader field of AI-assisted research. But read past the headline and the picture is more complicated. A system that produces workshop-quality ML papers at $15 a pop is impressive engineering. Whether it's "the dawn of a new era" for science, as Sakana's blog puts it, depends on assumptions about scaling that haven't been tested yet.

The ICLR 2026 submission deadline is approaching. I'd bet money someone submits an AI Scientist paper to the main track this time. Whether it gets in will tell us more than any Nature publication can.

Tags: AI Scientist, Sakana AI, Nature, automated research, peer review, machine learning, ICLR, AI-generated papers, scientific discovery
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
