
Stanford Study Finds Multi-Agent AI Systems Lose Their Edge When You Control for Compute

A new paper argues single-agent LLMs beat multi-agent setups on reasoning tasks once token budgets are equalized.

Liza Chan, AI & Emerging Tech Correspondent
April 8, 2026

A pair of Stanford researchers just poked a hole in one of the more popular assumptions in AI engineering right now: that throwing more agents at a problem makes it smarter. Dat Tran and Douwe Kiela, in a paper published April 2, tested single-agent versus multi-agent LLM architectures on multi-hop reasoning tasks while holding reasoning-token budgets constant across both setups. The single-agent systems won. Consistently.

The finding is awkward timing for the multi-agent hype cycle. Gartner reported a 1,445% surge in enterprise inquiries about multi-agent systems between Q1 2024 and Q2 2025. Frameworks like LangGraph and CrewAI are competing for developer mindshare. And here comes a relatively simple controlled experiment suggesting the gains people are seeing might just be... more compute in disguise.

The information theory angle

The paper's core theoretical claim leans on the Data Processing Inequality, a well-established concept from information theory. The argument: when you split a reasoning task across multiple agents, each handoff between agents loses information. A single agent with access to the full context and the same token budget should, in principle, be more information-efficient. No coordination overhead. No lossy context transfers.
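The inequality itself is compact. If the original task passes through a chain of agents, each downstream output can only preserve or lose information about the task, never gain it. A minimal statement of the idea (notation mine, not the paper's):

```latex
% Markov chain: task $X$ -> agent 1's output $Y$ -> agent 2's output $Z$
X \to Y \to Z
\quad \Longrightarrow \quad
I(X; Z) \le I(X; Y)
```

Every handoff is one more processing step on the right side of that chain, which is why, all else equal, the single agent with direct access to $X$ starts with an information advantage.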

That's the theory. But the authors don't stop there, and the empirical work is where things get interesting (and a bit uncomfortable for multi-agent advocates).

Three model families, same result

Tran and Kiela tested across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5. When reasoning tokens were held constant, single-agent systems matched or beat every multi-agent architecture they tried. The paper doesn't cherry-pick one model and call it a day. Three families, same conclusion.

But I want to flag something the authors themselves surface: the scope here is multi-hop reasoning specifically. Whether these results generalize to code generation, tool use, or open-ended creative tasks is an open question the paper doesn't attempt to answer. And it shouldn't, honestly. The multi-hop reasoning lens is narrow enough to be useful.

The Gemini 2.5 budget control problem

Here's where it gets really messy. The paper identifies what it calls "significant artifacts" in API-based budget control, particularly with Gemini 2.5. When you tell a model's API to limit its thinking tokens to a certain number, does it actually comply? According to this research, not reliably. And if the API is leaking extra compute to multi-agent runs without researchers noticing, the apparent advantages of multi-agent setups get inflated artificially.

This is the kind of methodological finding that doesn't make headlines but probably should. If your multi-agent system looks 15% better than a single agent, and 10 points of that come from sloppy token accounting by the provider's API, you don't have a 15% improvement. You have a 5% improvement and a measurement error.
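The defensive move here is cheap: audit reported token usage against the budget you requested before trusting any cross-architecture comparison. A minimal sketch, where `audit_budget`, the run tuples, and the tolerance are my illustrative assumptions rather than anything from the paper or a real provider SDK:

```python
# Hypothetical audit: did the API actually honor the requested
# reasoning-token budget? Run data here is made up for illustration.

def audit_budget(runs, requested_budget, tolerance=0.05):
    """Return runs whose reported reasoning tokens exceed the
    requested budget by more than `tolerance` (a fraction)."""
    violations = []
    for run_id, used_tokens in runs:
        if used_tokens > requested_budget * (1 + tolerance):
            violations.append((run_id, used_tokens))
    return violations

# Example: a 1,024-token budget, with one run leaking extra compute.
runs = [("run-1", 1010), ("run-2", 1500), ("run-3", 980)]
print(audit_budget(runs, requested_budget=1024))  # → [('run-2', 1500)]
```

Anything the audit flags should be excluded (or re-binned by actual usage) before claiming one architecture beat another at "the same" budget.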

When multi-agent does make sense

The paper isn't saying multi-agent is always wrong. It predicts, and finds some evidence, that multi-agent architectures become competitive when a single agent's effective context utilization degrades. Long contexts, messy inputs, situations where a single model starts losing track of what it's read. Anthropic's own engineering blog made a similar point earlier this year: context pollution, parallelizable tasks, and specialization are the three scenarios where multiple agents consistently help.

So the real takeaway isn't "multi-agent bad." It's more like: before you spin up an orchestrator with five sub-agents, check whether one agent with the same token budget solves your problem. Because it might. And it'll cost less.
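That check is easy to operationalize: score each architecture only on runs that actually stayed near the target budget, so extra compute can't masquerade as architectural skill. A sketch under my own assumed data shapes (field names and numbers are illustrative, not from the paper):

```python
# Disciplined baseline: compare architectures only on runs whose
# token usage landed within a tolerance band around the target budget.

def accuracy_at_budget(results, budget, tolerance=0.10):
    """Mean accuracy over runs within `tolerance` of `budget` tokens;
    None if no run qualifies."""
    in_budget = [r["correct"] for r in results
                 if abs(r["tokens"] - budget) <= budget * tolerance]
    return sum(in_budget) / len(in_budget) if in_budget else None

single = [{"tokens": 2000, "correct": 1}, {"tokens": 2100, "correct": 1}]
multi  = [{"tokens": 2050, "correct": 0}, {"tokens": 3400, "correct": 1}]

# The multi-agent run that "won" blew past the 2,000-token budget,
# so it doesn't count toward the matched comparison.
print(accuracy_at_budget(single, budget=2000))  # → 1.0
print(accuracy_at_budget(multi, budget=2000))   # → 0.0
```

In this toy example the multi-agent system's only correct answer cost 70% more tokens, which is exactly the kind of apples-to-oranges win the paper is warning about.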

Who's behind this

Kiela is an interesting figure here. He's the CEO of Contextual AI, an adjunct professor at Stanford, former head of research at Hugging Face, and one of the co-authors of the original RAG paper from his time at Meta's FAIR lab. He's not some outsider lobbing grenades at the multi-agent crowd. He's someone who builds production AI systems and has a track record in evaluation methodology going back years.

That doesn't make the paper automatically right. But it does make it harder to dismiss.

What this means for the framework wars

The multi-agent framework market is booming. LangGraph, CrewAI, OpenAI's Agents SDK, Microsoft's AutoGen (now the Microsoft Agent Framework). Billions of dollars in enterprise spend heading toward multi-agent orchestration. This paper won't stop any of that, and it shouldn't. Real production systems have constraints the paper doesn't model: organizational boundaries, latency requirements, different models for different sub-tasks.

But it should make teams more disciplined about their baselines. If you can't show your multi-agent system beats a single agent at the same compute budget, you haven't built a better architecture. You've built a more expensive one.

Tags: multi-agent systems, LLM, Stanford, test-time compute, AI research, reasoning, single-agent, Douwe Kiela
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.

