Xiaomi MiMo UltraSpeed Hits 1,000 Tokens Per Second


Xiaomi's MiMo team just shipped UltraSpeed, a serving mode for its trillion-parameter flagship that decodes past 1,000 tokens per second. The notable part isn't the speed. It's the hardware. UltraSpeed, which the team built with inference partner TileRT, runs on a single commodity 8-GPU node with no custom silicon.

That matters because companies like Cerebras and Groq spent years and hundreds of millions chasing this exact threshold with custom chips, wafer-scale ones in Cerebras's case. Xiaomi got there with software, on hardware you can rent from a cloud provider today.

The speed comes from two tricks stacked together: FP4 quantization applied only to the model's expert layers, and DFlash speculative decoding, which proposes a full block of tokens in one pass instead of one at a time. Xiaomi calls this a first at the 1T scale, and says its demos peak near 1,200 tokens/sec. Those are the company's own numbers. No independent third-party verification is public yet.
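To make the block-proposal idea concrete, here is a minimal Python sketch of generic greedy speculative decoding. It is not DFlash itself, whose internals aren't described here. The functions `target_next` and `draft_block` are hypothetical toy stand-ins for the large model and the drafter, and the per-position verification loop simulates what a real system would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Toy 'large model' (hypothetical): counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_block(ctx: List[Token], k: int) -> List[Token]:
    """Toy drafter (hypothetical): proposes k tokens, but always gets '7' wrong."""
    out, last = [], ctx[-1]
    for _ in range(k):
        guess = (last + 1) % 10
        if guess == 7:  # deliberate mistake so that rejection is exercised
            guess = 0
        out.append(guess)
        last = guess
    return out


def greedy_decode(next_token: Callable[[List[Token]], Token],
                  prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(next_token, propose_block, prompt: List[Token],
                       max_new_tokens: int, block_size: int = 4
                       ) -> Tuple[List[Token], int]:
    """Greedy block speculative decoding. Returns (tokens, target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        proposal = propose_block(tokens, block_size)
        passes += 1  # one (simulated) batched verification pass per block
        accepted: List[Token] = []
        for guess in proposal:
            correct = next_token(tokens + accepted)
            if guess == correct:
                accepted.append(guess)
            else:
                accepted.append(correct)  # keep the target's token, drop the rest
                break
        else:
            # Whole block accepted: the same pass also yields one extra token.
            accepted.append(next_token(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


out, passes = speculative_decode(target_next, draft_block, [0], 12)
print(out, passes)
```

In this toy run the drafter gets one guess wrong, yet the output is identical to plain one-token-at-a-time decoding while using 3 verification passes instead of 12. The speedup in practice depends on how often the drafter's guesses are accepted.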

For anyone wanting to poke at it, the FP4 weights are open-sourced as a checkpoint on Hugging Face, with select TileRT modules on GitHub. The catch on the API: it costs 3x the standard MiMo-V2.5-Pro rate for roughly 10x the speed.
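For readers opening the FP4 checkpoint, it may help to see what 4-bit floating-point weights look like numerically. The sketch below assumes the common E2M1 value grid with one scale per block of weights. The article doesn't say which FP4 variant or block size Xiaomi uses, and the layer names are hypothetical. It also shows the "experts only" idea by leaving non-expert layers untouched.

```python
from typing import Dict, List, Tuple, Union

# Magnitudes representable by an E2M1 4-bit float (sign bit + 3 bits = 16 codes).
# Assumption: the checkpoint's exact FP4 variant is not stated in the article.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(weights: List[float]) -> Tuple[List[float], float]:
    """Map a block of weights to FP4 grid values plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / FP4_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        # Nearest representable magnitude after scaling, with the sign restored.
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        codes.append(-mag if w < 0 else mag)
    return codes, scale


def dequantize_block(codes: List[float], scale: float) -> List[float]:
    """Recover approximate weights from grid values and the block scale."""
    return [c * scale for c in codes]


Layer = Union[List[float], Tuple[List[float], float]]


def quantize_experts_only(layers: Dict[str, List[float]]) -> Dict[str, Layer]:
    """Quantize only layers whose (hypothetical) name marks them as experts."""
    out: Dict[str, Layer] = {}
    for name, weights in layers.items():
        if ".experts." in name:
            out[name] = quantize_block(weights)
        else:
            out[name] = weights  # attention, norms, etc. keep their precision
    return out


model = {
    "layer0.attention.q_proj": [0.12, -0.07, 0.31, 0.02],
    "layer0.experts.0.up_proj": [0.6, -0.3, 0.1, 0.0],
}
quantized = quantize_experts_only(model)
codes, scale = quantized["layer0.experts.0.up_proj"]
print(codes, dequantize_block(codes, scale))
```

Each quantized weight needs only 4 bits plus a share of the block's scale, which is where the memory and bandwidth savings come from. The cost is rounding error on any weight that doesn't land on the grid.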

Access is gated and short. A limited application-based trial runs June 9 through June 23, with enterprise users and pro developers getting priority. Quality still trails frontier models from Anthropic and OpenAI, so this is a speed play, not a reasoning upgrade.

Bottom Line

UltraSpeed runs a 1-trillion-parameter model at over 1,000 tokens/sec on a standard 8-GPU node, no custom chips required.

Quick Facts

Model: MiMo-V2.5-Pro-UltraSpeed, 1.02 trillion parameters (42B active)
Speed: over 1,000 tokens/sec, peaking near 1,200 (company-reported)
Hardware: single standard 8-GPU node, no custom silicon
Pricing: 3x standard MiMo-V2.5-Pro rate for ~10x speed
Trial window: June 9 to June 23, application-based
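A rough back-of-the-envelope calculation shows why 4-bit weights matter for fitting 1.02 trillion parameters on one node. The sketch below counts weight storage only. It ignores quantization scales, KV cache, and activations, and the 95% expert share and 16-bit precision for non-expert layers are hypothetical values chosen for illustration, not figures from Xiaomi.

```python
def weight_gb(total_params: float, expert_fraction: float,
              expert_bits: int, other_bits: int) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    total_bits = expert_params * expert_bits + other_params * other_bits
    return total_bits / 8 / 1e9


TOTAL_PARAMS = 1.02e12  # from the Quick Facts above

# Uniform precision across all weights, for reference.
for bits in (16, 8, 4):
    print(f"all weights at {bits}-bit: {weight_gb(TOTAL_PARAMS, 1.0, bits, bits):.0f} GB")

# Hypothetical split: 95% of parameters in FP4 expert layers, the rest at 16-bit.
mixed = weight_gb(TOTAL_PARAMS, 0.95, 4, 16)
print(f"experts at 4-bit, rest at 16-bit: {mixed:.1f} GB, {mixed / 8:.1f} GB per GPU on 8 GPUs")
```

Under those assumptions the weights shrink from roughly 2 TB at 16-bit to under 600 GB. The real footprint depends on details the announcement doesn't spell out.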

Tags: Xiaomi, MiMo, AI inference, open source, FP4 quantization, TileRT, LLM

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
