Agents

Karpathy Built an AI Research Lab With 8 Agents. It Produced Junk Science.

Eight AI agents, zero usable hypotheses. The nanochat experiment exposes a hard ceiling for autonomous research.

Oliver Senti
Senior AI Editor
March 1, 2026 · 6 min read
[Image: Grid of terminal windows showing multiple AI agents running parallel experiments]

Andrej Karpathy gave eight AI agents their own GPUs and told them to do science. Four Claude instances, four Codex instances, each with a single GPU, all pointed at a real research problem in his nanochat project: remove the logit softcap from the attention mechanism without causing a regression in validation loss. He shared the results on X on February 27.

The TLDR, in his words: "it doesn't work and it's a mess... but it's still very pretty to look at."

That last bit matters more than it sounds. Karpathy's setup looked impressive on screen: a grid of tmux windows humming with activity, each agent forking branches, running experiments, writing results to shared files. It looked like a research lab. It just didn't act like one.

The setup

The infrastructure was deliberately lightweight. Git branches for each research program. Feature branches per agent. Git worktrees for isolation. File-based communication. No Docker, no VMs. Karpathy said he found that "instructions are enough to prevent interference," which is an interesting bet on language models' compliance with soft boundaries.

He tried multiple organizational structures. Eight independent solo researchers, each doing their own thing. A hierarchical setup with one "chief scientist" agent directing eight juniors. The research org ran in tmux window grids that resembled Microsoft Teams (his comparison, not mine), letting him watch each agent's work in real time and "take over" a session if needed.
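To make the isolation scheme concrete, here is a minimal sketch of per-agent git worktrees. Everything here is an assumption for illustration: the branch names, directory layout, and helper functions are invented, not nanochat's actual tooling.

```python
import os
import subprocess
import tempfile

def run(args, cwd):
    """Run a git command quietly, raising on failure."""
    subprocess.run(args, cwd=cwd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def make_agent_worktrees(repo, agent_ids):
    """Give each agent its own branch and worktree so parallel
    experiments cannot clobber each other's checkouts.
    Branch and directory naming here is hypothetical."""
    trees = {}
    for aid in agent_ids:
        branch = f"agent-{aid}"
        path = os.path.join(os.path.dirname(repo), f"worktree-{aid}")
        run(["git", "worktree", "add", "-b", branch, path], cwd=repo)
        trees[aid] = path
    return trees
```

Each worktree is a full checkout that shares one object store with the main repository, so agents can commit to their own branches while still reading each other's results from shared files.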

What went wrong is more interesting than what went right

The agents could code. Give them a well-scoped task, a clear specification, and they'd implement it competently. That part worked fine. But research isn't implementation. Research is figuring out what to implement, and why, and how to tell whether it actually worked.

Karpathy's agents failed at all of that.

They didn't design experiments carefully. They ran variations that Karpathy described as "a bit non-sensical." They didn't establish strong baselines before trying modifications. They didn't control for compute time or FLOPs. And they didn't ablate properly, meaning they changed multiple things at once and had no way to attribute results to specific changes.
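That last failure has a mechanical fix the agents skipped: generate every run from a shared baseline so each config differs in exactly one variable. A minimal sketch, with invented config keys and values (not nanochat's actual hyperparameters):

```python
# Hypothetical single-variable ablation: every run differs from the
# baseline in exactly one knob, so any change in validation loss can be
# attributed to that knob alone. Keys and values are illustrative.
BASELINE = {"hidden_size": 768, "learning_rate": 3e-4, "logit_softcap": True}

def ablations(baseline, variants):
    """Yield (variable, config) pairs, each changing one variable."""
    for key, values in variants.items():
        for value in values:
            if value == baseline[key]:
                continue  # skip configs identical to the baseline
            yield key, {**baseline, key: value}

runs = list(ablations(BASELINE, {"logit_softcap": [False],
                                 "hidden_size": [512, 1024]}))
# every config in `runs` differs from BASELINE in exactly one key
```

Trivial as this looks, it encodes exactly the discipline the agents lacked: if two configs differ in two places, the comparison is thrown away.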

The best example: one agent "discovered" that increasing the network's hidden size improved validation loss. Technically true. But a bigger network has more parameters, trains longer, and will almost always show lower validation loss in the infinite data regime. There's no scientific insight there. It's the equivalent of discovering that a bigger bucket holds more water. Karpathy had to step in personally to point this out, which rather defeats the purpose of autonomous research.
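The confound is visible in rough arithmetic. Using the standard approximation that a transformer's block parameters scale as 12 x layers x hidden_size^2 (the sizes below are illustrative, not nanochat's actual configuration):

```python
def approx_transformer_params(n_layers, d_model, vocab_size=50257):
    # rough count: token embeddings plus ~12 * layers * d^2 for the blocks
    return vocab_size * d_model + 12 * n_layers * d_model ** 2

base = approx_transformer_params(12, 768)    # baseline hidden size
wider = approx_transformer_params(12, 1024)  # the "discovered improvement"
# the wider model carries far more parameters, so its lower validation
# loss says nothing unless parameters or training FLOPs are matched
```

At these sizes the wider model has roughly 1.6x the parameters, which is the bigger bucket doing the work.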

The hypothesis problem

Here's the thing. This wasn't a failure of intelligence in the conventional sense. These were frontier models running at their highest capability settings. They can write clean code, parse papers, and produce coherent technical prose. But generating a strong research hypothesis requires something models apparently don't have yet: the ability to distinguish a genuinely informative experiment from one that just looks productive.

A human researcher (a good one, anyway) would immediately flag the hidden-size result as confounded. They'd know to control for training compute. They'd know that an ablation study needs to change exactly one variable. This isn't obscure methodology. It is research methods 101. And the agents missed it.

I've seen this pattern before in other multi-agent experiments. The agents are busy. They produce output. They look like they're working. But the work doesn't accumulate into knowledge. It's activity without direction.

What about the chief scientist model?

You'd think a hierarchical setup would help. Have one agent plan the research agenda, break it into tasks, assign them to junior agents. Karpathy tried this. He doesn't go into detail about why it didn't solve the problem, but the implication is clear enough: if the top-level agent can't formulate good hypotheses either, hierarchy just distributes bad ideas more efficiently.

"Org code"

The more interesting takeaway from Karpathy's experiment isn't about what failed. It is about the framing he landed on. You're not programming a model anymore, he argued. You're programming an organization. The "source code" of that organization is the collection of prompts, skills, tools, processes, and workflows that define how the agents operate. A daily standup becomes a line in your org code. Role definitions become configuration.
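To make the framing concrete, here is a hedged sketch of what "org code" as literal configuration might look like. The role names, prompts, tools, and standup cadence are all invented for illustration; none of this is Karpathy's actual setup.

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    """One agent role, defined entirely by configuration."""
    name: str
    system_prompt: str
    tools: list = field(default_factory=list)

# The "organization" as data: hypothetical roles and workflow rules.
ORG = {
    "chief": Role("chief scientist",
                  "Propose hypotheses, assign one experiment per junior, "
                  "reject any result that changes more than one variable.",
                  tools=["read_results", "assign_task"]),
    "junior": Role("junior researcher",
                   "Run exactly the assigned experiment and write the "
                   "outcome to the shared log.",
                   tools=["run_experiment", "write_log"]),
}

WORKFLOW = {
    "standup": "every 2 hours",       # a standup as a line of org code
    "report_file": "results/log.md",  # file-based communication
}
```

The point of the sketch is that process rules become inspectable, versionable text, which is precisely what makes the "org code" framing attractive even when the agents underneath disappoint.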

This is a useful mental model even if the current implementation falls short. It shifts the question from "how smart is my agent?" to "how well-designed is my research process?" And that second question has much better answers available, because humans have been designing research processes for centuries.

But it also raises an uncomfortable question nobody in the agent community wants to sit with: what if the bottleneck isn't process design but something more fundamental about how these models reason about causality and experimental logic? You can write the most detailed standup prompt in the world. If the agent can't tell a confounded result from a real one, the standup just surfaces garbage faster.

Where this leaves us

Nanochat itself is Karpathy's open-source project for training a ChatGPT-style model for under $100. It's a real codebase with real research questions (the logit softcap removal is a genuine open issue in the repo). This wasn't a toy experiment designed to show off agents. It was a serious attempt to use them for actual work.

That makes the failure more informative than most agent demos. Karpathy wasn't cherry-picking a task that would make agents look good. He gave them something he actually needed done and watched them botch it.

The experiment lines up with something Karpathy wrote in his 2025 year in review: that agents are good at well-specified tasks where you can verify output, but struggle with open-ended work requiring judgment. In that review, he called this the "decade of agents," not the year. February's experiment suggests the decade estimate might be optimistic for research-grade autonomy.

For now, the practical ceiling seems clear. AI agents are excellent research assistants and terrible principal investigators. They can implement your ideas faster than you can type them out. They cannot tell you which ideas are worth implementing. That gap, between execution and judgment, is where the actual hard problem lives. And no amount of org code is going to prompt-engineer it away.

Tags: AI agents, Andrej Karpathy, nanochat, multi-agent systems, AI research, Claude, Codex, machine learning, LLM agents
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


