Grok 4.20 Beta: xAI's Multi-Agent Model Finally Ships

xAI has quietly rolled out Grok 4.20 in beta on grok.com, ending a saga of missed deadlines that stretches back to at least early December 2025, when Elon Musk posted that the model was "coming out in 3 or 4 weeks." It wasn't. And then it still wasn't for another two months.

The beta introduces a multi-agent inference setup where up to four agents process a single request simultaneously, with one agent designated as the "leader" coordinating the others. If this sounds familiar, it should. Grok 4 Heavy, launched back in July 2025, already used a multi-agent architecture that Musk described as working "like a study group." The 4.20 version appears to formalize and constrain that approach, capping the agent count at four, where Heavy reportedly deployed up to 32.

The long road to a point release

The timeline here is worth unpacking because it tells you something about how xAI actually operates versus how Musk says it operates. On December 6, 2025, Musk quoted a post about Grok 4.1's OpenRouter numbers and dropped the "3 or 4 weeks" line. That would have put the release around late December or early January 2026.

January came and went. In late January, xAI acknowledged a delay, blaming power uptime issues at its Memphis data center caused by cold weather and, somewhat remarkably, construction equipment damaging power lines. The new target was mid-February. Even that slipped.

Meanwhile, Grok 4.20 checkpoints were showing up everywhere under codenames. According to Grokipedia's tracking page, the model appeared as "Theta-hat" and "Slateflow" on LMArena, as "Pearl" and "Obsidian" in DesignArena, and, most notably, as the anonymous "Mystery Model" that dominated Alpha Arena's stock-trading competition. That last appearance got enough of Musk's attention that he publicly confirmed the model's identity in a post.

What the trading stunt actually showed

xAI made a lot of noise about Grok 4.20's performance in Alpha Arena Season 1.5, a simulated stock-trading competition run by Nof1.ai. The numbers: a 12.11% aggregate return over two weeks, turning a simulated $10,000 into roughly $11,211. Four Grok 4.20 instances occupied the top spots on the leaderboard. Models from OpenAI and Google lost money.
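The headline figures are easy to sanity-check. A minimal sketch of the arithmetic, assuming the 12.11% is a simple return on the starting balance (the function name is made up for illustration):

```python
def final_value(start: float, return_pct: float) -> float:
    """Ending balance after a simple percentage return on the starting balance."""
    return start * (1 + return_pct / 100)


# Figures from the article: $10,000 starting balance, 12.11% aggregate return.
value = final_value(10_000, 12.11)
print(f"${value:,.0f}")
```

The result matches the roughly $11,211 reported, so the two numbers are at least consistent with each other.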

Sounds impressive. But Alpha Arena is a niche evaluation run by a small outfit, not an industry-standard benchmark. The trading environment uses real market data, sure, but the strategies and conditions are specific to the competition format. And Grok 4.20's edge came largely from ingesting X's data firehose (something like 68 million English posts per day, according to NextBigFuture's analysis) for real-time sentiment signals. When your model has exclusive access to a social media platform's full data stream and your competitors don't, the comparison is less "which model reasons better" and more "which model has better data."

Nobody outside xAI has run independent general-capability benchmarks on the released version yet. TestingCatalog noted on February 15 that there's "no expectation for it to push SOTA benchmarks up." If that holds, the real story is the multi-agent architecture, not raw intelligence gains.

Four agents, one leader

The multi-agent setup is the genuinely interesting part. Rather than throwing as many as 32 agents at a problem like Grok 4 Heavy (which costs $300 a month for a reason), the 4.20 beta constrains things. Four agents. One leader. The agents communicate with each other during inference, and the leader synthesizes the final response.
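xAI has not published how the leader actually coordinates or synthesizes, so the following is only a toy sketch of the general leader-plus-workers pattern, not xAI's implementation. Everything in it is an assumption made for illustration: the stub agents, the majority-vote synthesis, and the rule that the leader's own draft wins ties. It also leaves out the agent-to-agent communication the beta reportedly has.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_AGENTS = 4  # the cap reported for the 4.20 beta

# Hypothetical type: an agent maps a prompt to an answer string.
Agent = Callable[[str], str]


def make_stub_agent(answer: str) -> Agent:
    """Stand-in for a real model call; always returns a fixed answer."""
    def agent(prompt: str) -> str:
        return answer
    return agent


def leader_synthesize(drafts: List[str]) -> str:
    """Assumed synthesis rule: most common draft wins; ties go to the earliest draft."""
    counts = Counter(drafts)
    best = max(counts.values())
    return next(d for d in drafts if counts[d] == best)


def run_request(prompt: str, agents: List[Agent]) -> str:
    """Run all agents in parallel; agents[0] is treated as the leader."""
    if not 1 <= len(agents) <= MAX_AGENTS:
        raise ValueError(f"expected 1 to {MAX_AGENTS} agents, got {len(agents)}")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves input order, so the leader's draft stays first.
        drafts = list(pool.map(lambda agent: agent(prompt), agents))
    return leader_synthesize(drafts)


agents = [make_stub_agent(a) for a in ["A", "B", "B", "C"]]
print(run_request("example prompt", agents))
```

The point of the sketch is the cost shape: a fixed, small number of parallel calls plus one synthesis step, rather than an open-ended pool of agents.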

This is a design choice worth watching. OpenAI's approach with o3 and its successors has been to pour compute into longer chains of thought within a single model. Google's Gemini Enterprise has a tournament-style evaluation system. xAI is betting on a kind of structured collaboration, which, if it works, could offer a better accuracy-to-cost ratio than simply scaling inference time in a single model.

Whether it actually works at the 4-agent scale is something I cannot assess from the available information. The mathematical contribution that UC Irvine professor Paata Ivanisvili documented, where an early Grok 4.20 build helped refine bounds on dyadic square functions, suggests the model can handle serious technical reasoning. But a single anecdote from a math professor isn't a benchmark.

SpaceX in the room

There is a large elephant here. On February 2, SpaceX acquired xAI in what became the largest merger in history, valuing the combined entity at $1.25 trillion. xAI was burning roughly $1 billion per month at the time of the deal, according to Bloomberg. SpaceX generates about $8 billion in annual profit.

Musk's stated rationale was orbital data centers, which, okay. But the financial reality is simpler: xAI needed SpaceX's cash flow and IPO trajectory to keep the lights on. Grok 4.20's training delays from power infrastructure problems at the Memphis facility look different when you know the company was simultaneously negotiating its own absorption into a rocket company.

The timing of this beta, barely two weeks after the SpaceX deal closed, feels like xAI clearing a backlog item. The model was supposed to ship months ago. Now it is here, under new corporate ownership, and nobody at xAI has published a blog post about it yet. The last model announcement on xAI's news page is still Grok 4.1 from November 17, 2025.

What's actually next

The broader context is that Grok 4.20 lands in a crowded month. OpenAI shipped GPT-5.3-Codex on February 5, and the Spark variant followed on February 12. Anthropic released Claude Sonnet 5 on February 3. xAI is not setting the pace here; it is catching up on a delayed point release while its competitors are shipping entirely new model generations.

Separately, TestingCatalog found evidence that xAI is building Grok Build, a coding IDE with parallel agent support (up to 8 agents) and an arena mode for ranking agent outputs. Perplexity may already be running Grok 4.20 under the hood for its upcoming Gamma search mode, internally nicknamed "ASI." Neither feature is publicly available.

Grok 5, a 6-trillion-parameter model, is reportedly two to four months out. That would be the real test. A point release with a multi-agent wrapper, while technically interesting, does not change the competitive picture. The question for xAI, now operating as a division of SpaceX, is whether it can ship Grok 5 before the IPO window closes and before the $1 billion monthly burn rate becomes someone else's problem.