Agents

Single AI Agents vs Multi-Agent Systems: When Skills Beat Teams

New research quantifies when skill-based agents work, and the cognitive threshold where selection breaks down.

Oliver Senti
Senior AI Editor
January 11, 2026 · 6 min read
[Illustration: an AI agent choosing between an organized skill library and a chaotic overflow of options]

A single AI agent equipped with the right skills can match the performance of multi-agent systems while cutting token usage and latency in half. But there's a hard ceiling on how many skills the agent can reliably choose from, according to new research from the University of British Columbia.

The paper, published January 8 by researcher Xiaoxiao Li, tackles a question that's becoming increasingly relevant as companies deploy AI agents in production: do you really need multiple specialized agents talking to each other, or can one capable agent just pick the right tool for the job?

The cost problem nobody talks about

Multi-agent systems have dominated recent AI research. The appeal is obvious: specialized agents collaborate to solve complex problems, each contributing domain expertise. But this coordination comes with overhead that practitioners rarely discuss publicly.

When agents pass messages back and forth, those tokens add up. Each handoff requires context to be re-encoded. Synchronization delays compound. The research shows that on standard benchmarks like GSM8K and HumanEval, a single agent with compiled skills achieved equivalent accuracy while using 54% fewer tokens and finishing 50% faster.

The gains came from a straightforward question: what if you could internalize all those agent behaviors as selectable skills rather than running them as separate processes that need to communicate?

What makes this different from tools

You might wonder how this differs from giving an LLM access to a calculator or search API. The distinction matters.

Tools are atomic operations, things like fetching a webpage or running a calculation. Skills, as defined in this work, are "schema-bounded operations" that include not just what to do but how to reason about it. A skill bundles a semantic descriptor for selection, an execution policy governing behavior, and optionally an external backend.

This maps directly to how Anthropic introduced Agent Skills in late 2025, though the research treats skills more abstractly. The core idea is similar: package expertise into discoverable, invocable units that an agent can load dynamically.
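As a concrete illustration, a skill in this sense can be modeled as a small schema. This is a sketch under our own naming, not the paper's or Anthropic's actual format:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Skill:
    """A schema-bounded operation, not just an atomic tool call."""
    name: str
    descriptor: str                      # semantic description the agent selects on
    policy: str                          # execution policy: how to reason while performing it
    backend: Optional[Callable] = None   # optional external backend, e.g. a calculator

def selection_menu(skills: list[Skill]) -> str:
    """Render the library as the menu an agent chooses from."""
    return "\n".join(f"- {s.name}: {s.descriptor}" for s in skills)
```

The key difference from a plain tool is the `policy` field: the skill carries its own reasoning instructions, not just an endpoint to call.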

The compilation trick

The paper frames this as a "compilation" problem. Take a multi-agent system, decompose each agent's capabilities into discrete skills, assign execution backends, and internalize the communication topology as input-output constraints within the skills themselves.

In practice, this meant converting a three-agent pipeline (decomposer, solver, verifier for math problems) into a single agent that invokes those behaviors sequentially via structured output sections. The GSM8K pipeline went from three API calls to one. Token usage dropped from 1,407 to 616 per problem.
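A minimal sketch of what that compilation might look like in prompt form. The section headings and wording here are illustrative, not taken from the paper:

```python
# Three former agents become three structured output sections of one call.
# The pipeline's communication topology (decomposer -> solver -> verifier)
# survives as an ordering constraint between sections.
COMPILED_PROMPT = """Solve the problem using three sections, in order.

### Decompose
Break the problem into sub-steps.

### Solve
Work through each sub-step from the decomposition above.

### Verify
Check the result against the original problem, then output FINAL: <answer>.

Problem: {problem}"""

def build_compiled_call(problem: str) -> str:
    """One API call standing in for a three-agent pipeline."""
    return COMPILED_PROMPT.format(problem=problem)
```

Each section consumes the output of the one before it, which is exactly the "internalized communication topology" the paper describes.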

Not everything compiles cleanly. The researchers identify specific conditions: agent interactions must be serializable with no information loss, agents can't maintain private state, and they need to share the same underlying model. Debate architectures where agents argue opposing positions? Those don't compile. Parallel sampling where you take the best of multiple independent attempts? That breaks the model too.

Where it falls apart

Here's where the cognitive science comes in, and where the research gets genuinely interesting.

As skill libraries grow, selection accuracy doesn't degrade gradually. It holds steady until a critical threshold, then drops sharply. The researchers observed this phase transition pattern across libraries ranging from 5 to 200 skills.

At around 50 to 100 skills, depending on the model, accuracy remained above 95%. Beyond that, performance collapsed to roughly 20% at 200 skills. The curve fits a sigmoidal decay function with a capacity parameter κ that resembles working memory limits in human cognition.
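The paper's exact fitted function isn't reproduced here, but a generic sigmoidal decay with a capacity parameter κ captures the reported shape. The parameter values below are illustrative only:

```python
import math

def selection_accuracy(n_skills: float, kappa: float = 85.0,
                       steepness: float = 0.08, floor: float = 0.20) -> float:
    """Sigmoidal decay: near-perfect accuracy below the capacity kappa,
    collapsing toward a floor well beyond it (parameters illustrative)."""
    return floor + (1.0 - floor) / (1.0 + math.exp(steepness * (n_skills - kappa)))
```

With these illustrative parameters, accuracy stays above 95% for small libraries and flattens out near 20% by 200 skills, mirroring the reported curve.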

The reference to Hick's Law from 1952 isn't arbitrary. That research established that human choice reaction time scales logarithmically with the number of options, but breaks down entirely beyond about 8 choices. The parallel suggests LLMs may face analogous constraints when selecting among semantically similar actions.

It's not just about library size

What's more telling is the confusability finding. When skills were semantically distinct, selection accuracy stayed at 100% even with 20 skills in the library. But adding just one semantically similar "competitor" skill, something that sounds related but does something different, dropped accuracy by 7-30%.

Two competitors per skill caused 17-63% degradation. The semantic overlap, not the raw count, drove the failures.

This has immediate implications for anyone building skill-based systems. You can't just keep adding capabilities. If your PDF extraction skill sounds too similar to your document parsing skill, the agent won't reliably distinguish them. The researchers suggest that skill descriptors need to emphasize unique characteristics rather than generic capabilities.
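One way to act on this is a confusability audit before each new skill lands in the library. A production audit would likely compare embedding similarity of descriptors; the token-overlap check below is a deliberately simple stand-in:

```python
def overlap(a: str, b: str) -> float:
    """Jaccard overlap of descriptor tokens (a crude proxy for the
    embedding cosine similarity a real audit would use)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def confusable_pairs(descriptors: dict[str, str], threshold: float = 0.5):
    """Flag skill pairs whose descriptors overlap enough to risk misselection."""
    names = list(descriptors)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if overlap(descriptors[a], descriptors[b]) >= threshold]
```

Flagged pairs are candidates for the paper's advice: merge them, or rewrite their descriptors to emphasize unique characteristics rather than generic capabilities.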

Does hierarchy help?

The paper tests whether organizing skills into hierarchical categories mitigates the scaling problem. The results are clear: when flat selection fails at large library sizes, hierarchical routing recovers substantial accuracy.

For GPT-4o-mini, hierarchy improved accuracy by 37-40% absolute at library sizes exceeding the capacity threshold. The mechanism mirrors chunking theory from cognitive science: instead of one overwhelming decision among 120 skills, the agent makes two tractable decisions, first picking a category, then selecting within a small cluster.

This aligns with established human-computer interaction research on menu design, where 4-8 items per level has long been the recommendation.
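The two-stage decision needs little machinery to sketch. Here `choose` stands in for the model's selection call, and the word-matching chooser and category layout are toy stand-ins of our own:

```python
def route(choose, categories: dict[str, list[str]]) -> str:
    """Hierarchical routing: one small decision among categories,
    then one small decision within the chosen cluster."""
    category = choose(list(categories))     # decision 1: a handful of categories
    return choose(categories[category])     # decision 2: a small cluster of skills

def word_match_chooser(query: str):
    """Toy stand-in for LLM selection: pick the option sharing
    the most words with the query."""
    qwords = set(query.lower().replace("_", " ").split())
    def choose(options: list[str]) -> str:
        return max(options, key=lambda o: len(qwords & set(o.lower().replace("_", " ").split())))
    return choose
```

The point of the structure is that neither call ever sees more options than the capacity threshold allows.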

What didn't matter

The researchers also tested whether complex execution policies, verbose instructions describing how to perform each skill, affected selection accuracy. They expected that longer policies would consume processing bandwidth and lower effective capacity.

That hypothesis didn't hold. Simple, medium, and complex policies showed overlapping accuracy curves. Policy length appears orthogonal to selection difficulty, at least within the ranges tested.

Practical implications

For practitioners, the guidelines are relatively straightforward. Monitor library size relative to your model's apparent capacity, somewhere around 50-100 skills for GPT-class models. Audit semantic overlap before adding new skills. Merge or differentiate capabilities rather than accumulating near-duplicates.

When libraries must be large, implement hierarchical routing with confusability-aware grouping. Each decision stage should involve fewer options than the capacity threshold.

The researchers also note that stronger models show higher capacity thresholds and better resistance to confusability. For inherently large or overlapping skill sets, model capability investment translates directly to accuracy gains.

The bigger picture

This work arrives at an interesting moment. Anthropic published Agent Skills as an open standard in December, with enterprise management features and a partner ecosystem. The industry is clearly betting on skill-based agent architectures.

But the scaling research suggests limits that aren't immediately obvious from small-scale experiments. A skill library that works brilliantly with 20 capabilities might fail unpredictably at 80. The failure mode, a phase transition rather than gradual decline, makes capacity planning difficult.

The cognitive science framing is provocative. If LLM skill selection genuinely exhibits bounded capacity analogous to human decision-making, that has implications beyond agent design. It suggests fundamental constraints on how these models process and retrieve semantic information under choice pressure.

The paper is explicitly a technical report with acknowledged limitations: synthetic skill libraries, selection-only evaluation without end-to-end task measurement, and coverage of only two OpenAI models. The phase transition pattern needs replication across architectures before it becomes actionable science rather than suggestive observation.

Still, for anyone building skill-based agents at scale, there's a concrete takeaway: test selection accuracy explicitly as libraries grow, and don't assume that success with 30 skills implies success with 90.
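That kind of regression test is cheap to set up: pair each probe query with the skill it should select, then track accuracy as the library grows. In this sketch, `select` is whatever call your agent uses to pick a skill:

```python
def measure_selection(select, library, trials) -> float:
    """Fraction of probe queries for which the selector picks the expected skill.
    Run at several library sizes to see whether a cliff is approaching."""
    hits = sum(select(query, library) == expected for query, expected in trials)
    return hits / len(trials)
```

Running this at, say, 30, 60, and 90 skills makes the phase transition visible before it reaches production: a sharp drop between two sizes, rather than a gentle slope, is the failure mode to plan around.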

Tags: artificial intelligence, LLM agents, multi-agent systems, AI research, cognitive science, agent skills, AI efficiency, machine learning, AI agents, skill selection
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


