On February 5, 2026, Anthropic released Agent Teams, an experimental Claude Code feature that lets multiple AI sessions work in parallel on the same project, each with its own context window, coordinating through a shared task list and direct messaging. The feature shipped alongside Claude Opus 4.6, and Anthropic brought receipts: a blog post from researcher Nicholas Carlini detailing how 16 agents autonomously wrote a 100,000-line C compiler in Rust over two weeks.
The compiler can build Linux 6.9 on x86, ARM, and RISC-V. Cost: roughly $20,000 in API fees across 2,000 sessions.
How it actually works
The documentation describes a surprisingly simple setup. One session acts as team lead, spawning "teammates" that each get their own Claude Code instance. The lead assigns tasks, teammates claim them from a shared list, and everyone can message everyone else directly. It is not a hierarchy so much as a flat coordination layer with one node that has spawning privileges.
Enable it with a single environment variable: CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1. Then tell Claude, in natural language, what team you want. "Create an agent team with one teammate on UX, one on architecture, one playing devil's advocate." Claude handles the rest.
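In practice, enabling it is a one-liner before launch. The variable name is the documented one; the team description is just plain conversation once the session starts:

    # Opt in to the experimental feature for this shell session,
    # then launch Claude Code as usual and describe the team you
    # want at the prompt, in natural language.
    export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
    claude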
There are two display modes. In-process mode keeps everything in your terminal (Shift+Up/Down switches between teammates). Split-pane mode gives each agent its own tmux or iTerm2 pane, which looks impressive but works in fewer environments. The lead can be locked into a "delegate mode" where it only orchestrates and never writes code itself, a detail that suggests Anthropic noticed leads getting distracted.
But the limitations list is telling. No session resumption: if you close the terminal mid-task, your teammates are gone and the lead will try to message ghosts. No nested teams. One team per session. Teammates sometimes forget to mark tasks complete, blocking downstream work. Shutdown is slow because each teammate finishes its current operation before dying.
These are the kinds of bugs you get from a system that was built and shelved, then shipped quickly.
The feature was hiding in plain sight
And in fact, that's more or less what happened. Developers found the full TeammateTool implementation buried in Claude Code's binary weeks before the official launch. A developer named kieranklaassen ran strings on the Claude Code binary in late January and found 13 defined operations: spawnTeam, discoverTeams, requestJoin, assignTask, broadcastMessage, voteOnDecision, and more. Fully implemented, schema-defined, feature-flagged off. Mike Kelly built a tool called claude-sneakpeek that bypassed the flags, letting early adopters try the feature before Anthropic was ready.
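The technique itself is ordinary binary spelunking. Something along these lines would surface the operation names; the path and grep pattern here are illustrative, not kieranklaassen's exact command:

    # Dump printable strings from the Claude Code binary and grep for
    # the hidden tool's operation names. strings works on any file,
    # so this is worth trying even if the binary is a bundled script.
    strings "$(which claude)" | grep -E 'spawnTeam|assignTask|broadcastMessage'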
This is a pattern with Anthropic. The community builds workarounds for obvious product gaps (claude-flow, ccswarm, and oh-my-claudecode all offered multi-agent orchestration), Anthropic absorbs the best patterns into the product, and the third-party tools become redundant overnight. Steve Yegge's "Beads" concept became Claude Code's Tasks feature. Now the community-built swarm tools have become TeammateTool. If you're building middleware for Claude Code, you might want to consider how disposable your work is.
The compiler stunt
Carlini's blog post is the more interesting read. He didn't use the polished Agent Teams feature that shipped to users. His setup was cruder: a bash loop that restarts Claude in a Docker container whenever it finishes a task, a bare Git repo mounted to each container, and a lock-file system where agents claim tasks by writing text files to a current_tasks/ directory. No orchestration agent at all. Each Claude instance just picks the "next most obvious" problem.
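Carlini didn't publish the script itself, but his description implies a loop of roughly this shape. The image name, repo path, and prompt below are placeholders, not his actual configuration:

    # Sketch of the harness as described: each pass starts a fresh
    # Claude session in a container with the shared bare Git repo
    # mounted, and the loop restarts it whenever the session exits.
    while true; do
      docker run --rm -v /srv/ccc.git:/repo.git claude-worker \
        claude -p "Clone /repo.git, claim a task by writing a lock file in current_tasks/, then fix the next most obvious problem and push"
    done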
The coordination mechanism is Git itself. Agents push, pull, and resolve merge conflicts on their own. Conflicts are frequent, Carlini notes, "but Claude is smart enough to figure that out."
What's more interesting is where this approach broke down. When agents needed to compile the Linux kernel (one giant interdependent task rather than hundreds of independent test cases), all 16 agents would hit the same bug, fix it, and overwrite each other. Parallelism became useless.
Carlini's fix was clever: use GCC as an oracle. Compile most of the kernel with GCC, compile a random subset with Claude's compiler, and if the kernel boots, the problem isn't in Claude's subset. This let agents debug different files in parallel again. It's the kind of insight that still required a human: Carlini designed the harness, even though the agents did the actual coding.
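Mechanically, the oracle can be as thin as a compiler wrapper the kernel build is pointed at. This sketch is a reconstruction of the idea, not Carlini's code; the ccc name, sampled_files.txt, and the crude argument handling are all assumptions:

    #!/bin/bash
    # cc-oracle: route a pre-chosen random subset of files to the
    # agent-written compiler (here "ccc") and everything else to GCC.
    # If the resulting kernel boots, the bug isn't in the sampled subset.
    src="${@: -1}"   # crude: assume the source file is the last argument
    if grep -qxF "$src" sampled_files.txt; then
      exec ccc "$@"  # compiler under test
    else
      exec gcc "$@"  # trusted oracle
    fi

Pointing the kernel build at the wrapper (make CC=cc-oracle) and giving each agent its own sample file would restore the per-file parallelism Carlini describes.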
He also learned to design his test infrastructure specifically for LLM quirks. Don't print thousands of lines of output (it pollutes the context window). Include a --fast flag that runs a 1-10% random sample so agents don't burn hours on full test suites. Put the word ERROR on the same line as the reason, because Claude will grep for it. Pre-compute summary statistics so the model doesn't have to.
"I had to constantly remind myself that I was writing this test harness for Claude and not for myself," Carlini wrote.
What the compiler can't do
The 99% pass rate on the GCC torture tests sounds impressive, and it is. The compiler also builds SQLite, Redis, FFmpeg, PostgreSQL, and QEMU. It runs Doom, which Carlini calls "the developer's ultimate litmus test."
But the gaps are instructive. The compiler lacks a 16-bit x86 code generator needed to boot Linux out of real mode, so it calls out to GCC for that step. It has no assembler or linker of its own (those were in progress when Carlini published). The generated code is less efficient than GCC with all optimizations disabled. And Carlini is candid that the compiler has "nearly reached the limits of Opus's abilities." New features kept breaking existing functionality, and some limitations proved impossible for the model to overcome.
The $20,000 price tag also deserves scrutiny. That covers API compute only, not the engineering time Carlini spent designing the harness, writing test suites, watching for failure modes, and intervening when agents got stuck. This was not "set it and forget it." It was a human architect directing a team of tireless but occasionally confused workers.
The competitive context
Anthropic shipped Agent Teams and Opus 4.6 on February 5. OpenAI released GPT-5.3-Codex the same day, claiming 77.3% on Terminal-Bench 2.0 (Anthropic claims 65.4% for Opus 4.6, though different evaluation configurations make direct comparison tricky). OpenAI's model is 25% faster than its predecessor and, in a detail that reads like science fiction, helped debug its own training run.
The simultaneous launches were almost certainly not coincidental. Both companies are running Super Bowl ads this weekend. Both are fighting for the same developer wallet. According to VentureBeat, Claude Code hit $1 billion in annual run rate revenue in November 2025, six months after general availability. Anthropic's enterprise adoption has surged from near-zero in early 2024 to roughly 40% of companies using it in production, per an Andreessen Horowitz survey.
The race isn't about which model writes better functions anymore. It's about which platform can orchestrate autonomous, multi-day coding workflows. Agent Teams is Anthropic's bet that the answer involves letting multiple agents collaborate rather than making one agent smarter.
Who this is actually for
Right now? Power users with big token budgets and patience for experimental software. Agent Teams burns through tokens fast (every teammate is a separate billing session), and the coordination overhead can cancel out parallel gains for simple tasks. Anthropic's own docs recommend sticking with single sessions or subagents for sequential work, same-file edits, or anything with lots of dependencies.
The sweet spot is tasks that decompose cleanly: a feature spanning frontend, backend, and tests where each teammate owns different files. Debugging with competing hypotheses, where three agents test three theories simultaneously. Code reviews where separate agents check security, performance, and style. Cross-layer refactors where nobody steps on anybody else's changes.
For everything else, one Claude is still better than five.