xAI's Colossus supercomputer is running at about 11% training efficiency, according to an internal memo from company president Michael Nicolls, first reported by The Information. Nicolls reportedly called the number "embarrassingly low" and conceded the company is "clearly behind" its rivals.
What 11% actually means
The metric is MFU, or Model FLOPs Utilization: the share of a cluster's theoretical peak compute that translates into useful training work, independent of raw uptime. Industry benchmarks from Lambda put well-tuned frontier runs in the 35% to 45% range. Nicolls set an internal target of 50%, above even that band, which is aspirational by any standard.
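In rough form, the metric is a straight ratio. A minimal sketch, using Nvidia's commonly quoted ~989 TFLOP/s dense BF16 peak for an H100 and a hypothetical per-chip throughput chosen to land at 11% (xAI has published neither number):

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs Utilization: useful training FLOP/s over theoretical peak."""
    return achieved_tflops / peak_tflops

# Hypothetical figures: ~989 TFLOP/s dense BF16 peak per H100,
# of which ~109 TFLOP/s is doing useful training work.
print(f"{mfu(109.0, 989.0):.0%}")  # -> 11%
```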
The practical translation: xAI is burning roughly three to four times the wall-clock time, or three to four times the chips, to train a model the same size as a competitor's. Whatever the headline GPU count on the Colossus cluster, most of that theoretical compute is not turning into Grok.
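The multiple falls straight out of the MFU gap, taking Lambda's 35% to 45% band for well-tuned runs at face value:

```python
# Useful work a competitor extracts per chip-hour at the benchmark
# MFU band, relative to 11%.
for competitor_mfu in (0.35, 0.45):
    print(f"{competitor_mfu:.0%} vs 11% -> {competitor_mfu / 0.11:.1f}x")
# 35% vs 11% -> 3.2x
# 45% vs 11% -> 4.1x
```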
About that GPU count
Worth flagging: a Russian-language summary of the leaked memo pegs xAI's fleet at 500,000 GPUs. Verifiable reporting still puts the operational figure around 200,000 Nvidia chips at the Memphis facility, with public ambitions for a million and a recently purchased third building toward that goal. Inflating the fleet size inflates the absolute waste that 11% implies. The actual ratio is bad enough on its own.
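To see why the fleet number changes the optics, compare the idle peak throughput each count would imply. A back-of-envelope sketch, reusing the ~989 TFLOP/s H100 stand-in from above (the real cluster mixes chip generations, so treat these as illustrative only):

```python
PEAK_TFLOPS = 989.0  # stand-in per-GPU peak; actual hardware mix varies

for fleet in (200_000, 500_000):
    idle_eflops = fleet * PEAK_TFLOPS * (1 - 0.11) / 1e6  # TFLOP/s -> EFLOP/s
    print(f"{fleet:,} GPUs at 11% MFU -> ~{idle_eflops:.0f} EFLOP/s of peak unused")
# 200,000 -> ~176 EFLOP/s unused
# 500,000 -> ~440 EFLOP/s unused
```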
The same summary blamed the low number on HBM bandwidth bottlenecks and on network synchronization across tens of thousands of GPUs, then alleged a wider industry practice of rerunning finished workloads to inflate utilization metrics. Neither claim turns up in the actual reporting. Treat accordingly.
The Cursor footnote isn't really a footnote
The leaked memo surfaced in the same stretch that SpaceX, which absorbed xAI earlier this year, announced a rental arrangement with Cursor. The coding startup will use tens of thousands of Colossus GPUs to train its next model, Composer 2.5. Bundled into the deal: a $60 billion option for SpaceX to acquire Cursor outright, with a $10 billion breakup fee if it walks.
Cursor needs compute it doesn't have. xAI has compute it can't keep busy. The blunt translation: if Grok can't make full use of its own iron, somebody else's model will. Leadership churn around the cluster fits the same read. Heinrich Küttler is out as infrastructure lead. Jake Palmer now runs physical infrastructure. Daniel Dueri, pulled from SpaceX, runs compute.
What to watch
Musk posted on X recently that "xAI was not built right first time around" and is being "rebuilt from the foundations up." That's a rare public admission, even by Musk standards. The next concrete signals: whether Cursor's Composer 2.5 ships on Colossus, whether xAI follows with a Grok model worth the iron, and whether MFU climbs anywhere near Nicolls's 50% target ahead of the SpaceX IPO. Musk's trial against OpenAI's Sam Altman is set to open in Oakland in the coming days.