Xiaomi's MiMo team just shipped UltraSpeed, a serving mode for its trillion-parameter flagship that decodes past 1,000 tokens per second. The notable part isn't the speed. It's the hardware. The whole thing, built with inference partner TileRT, runs on a single 8-GPU commodity node with no custom silicon.
That matters because companies like Cerebras and Groq spent years and hundreds of millions of dollars chasing this exact threshold with purpose-built chips (wafer-scale engines at Cerebras, LPUs at Groq). Xiaomi got there in software, on hardware you can rent from a cloud provider today.
The speed comes from two tricks stacked together: FP4 quantization applied only to the model's expert layers, and DFlash speculative decoding, which proposes a full block of tokens in one pass instead of one at a time. Xiaomi calls this a first at the 1T scale and says throughput peaks near 1,200 tokens/sec in demos. Worth flagging: those are the company's own numbers, and no independent third-party verification is public yet.
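To make the block-drafting idea concrete, here is a minimal Python sketch of generic greedy speculative decoding. It is not DFlash or TileRT code: `target_next`, `draft_block`, and `speculative_decode` are hypothetical toy stand-ins, and the "models" are simple arithmetic rules. A drafter proposes a block of tokens, the large model checks the block, and the longest agreeing prefix is kept along with one token from the large model itself.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for a block drafter: returns block_size proposed tokens.

    It agrees with the target except when the context length is a
    multiple of 4, where it deliberately guesses wrong.
    """
    proposal: List[Token] = []
    ctx = list(context)
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # deliberate miss
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def greedy_decode(prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[Token], max_new_tokens: int, block_size: int = 4
) -> Tuple[List[Token], int]:
    """Draft a block, verify it against the target, keep the agreeing prefix.

    Returns the decoded sequence and the number of verification rounds.
    In a real system each round is one parallel forward pass of the large
    model over the whole block; here it is simulated with a loop.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        proposal = draft_block(out, block_size)
        rounds += 1
        ctx = list(out)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                ctx.append(expected)  # first mismatch: keep the target's token
                break
            ctx.append(tok)
        else:
            ctx.append(target_next(ctx))  # whole block accepted: one extra token
        out = ctx
    return out[: len(prompt) + max_new_tokens], rounds
```

Because every kept token is one the large model would have chosen anyway, the output matches plain greedy decoding exactly. The speedup comes from needing fewer large-model rounds than generated tokens, and it depends on how often the drafter guesses right.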
For anyone wanting to poke at it, the FP4 weights are open-sourced as a checkpoint on Hugging Face, with select TileRT modules on GitHub. The catch on the API: it costs 3x the standard MiMo-V2.5-Pro rate for roughly 10x the speed.
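The trade-off is easy to work through. The sketch below assumes the 3x and 10x multiples hold exactly; the baseline price and baseline speed are hypothetical placeholders, not published rates, and `compare_modes` is an illustrative helper rather than part of any Xiaomi API.

```python
from typing import Dict, Tuple


def compare_modes(
    tokens: int,
    base_price_per_million: float,  # hypothetical baseline price, any currency
    base_tokens_per_sec: float,     # hypothetical baseline decode speed
    price_multiple: float = 3.0,    # "3x the standard rate" from the announcement
    speed_multiple: float = 10.0,   # "roughly 10x the speed" from the announcement
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (standard, ultraspeed) cost and wall-clock time for one job."""
    standard = {
        "cost": tokens / 1_000_000 * base_price_per_million,
        "seconds": tokens / base_tokens_per_sec,
    }
    ultraspeed = {
        "cost": standard["cost"] * price_multiple,
        "seconds": standard["seconds"] / speed_multiple,
    }
    return standard, ultraspeed


# Example with placeholder numbers: a 1M-token job at 2.0 per million tokens
# and 100 tokens/sec on the standard tier.
standard, ultraspeed = compare_modes(1_000_000, 2.0, 100.0)
```

Whatever the baseline, the same job costs three times as much and finishes in a tenth of the time, so the premium makes sense only where latency is the bottleneck.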
Access is gated and short. A limited application-based trial runs June 9 through June 23, with enterprise users and pro developers getting priority. Quality still trails frontier models from Anthropic and OpenAI, so this is a speed play, not a reasoning upgrade.
Bottom Line
UltraSpeed runs a 1-trillion-parameter model at over 1,000 tokens/sec on a standard 8-GPU node, no custom chips required.
Quick Facts
- Model: MiMo-V2.5-Pro-UltraSpeed, 1.02 trillion parameters (42B active)
- Speed: over 1,000 tokens/sec, peaking near 1,200 (company-reported)
- Hardware: single standard 8-GPU node, no custom silicon
- Pricing: 3x standard MiMo-V2.5-Pro rate for ~10x speed
- Trial window: June 9 to June 23, application-based




