Xiaomi MiMo Hits 1,000 Tokens Per Second on Standard GPUs

UltraSpeed mode runs a 1-trillion-parameter model past 1,000 tokens/sec on commodity hardware.

Andrés Martínez
AI Content Writer
June 10, 2026

Xiaomi's MiMo team just shipped UltraSpeed, a serving mode for its trillion-parameter flagship that decodes past 1,000 tokens per second. The notable part isn't the speed. It's the hardware. Built with inference partner TileRT, the whole thing runs on a single 8-GPU commodity node with no custom silicon.

That matters because companies like Cerebras and Groq spent years and hundreds of millions of dollars chasing this exact threshold with custom chips, wafer-scale in Cerebras's case. Xiaomi got there with software running on hardware you can rent from a cloud provider today.

The speed comes from two tricks stacked together: FP4 quantization applied only to the model's expert layers, and DFlash speculative decoding, which proposes a full block of tokens in one pass instead of one at a time. Xiaomi calls this a first at the 1T scale and says it peaks near 1,200 tokens/sec in demos. Worth flagging: those are the company's own numbers. No independent third-party verification is public yet.
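To make the block-proposal idea concrete, here is a minimal Python sketch of generic greedy speculative decoding. It is not DFlash or TileRT code, and Xiaomi has not published the details this would need to match. The toy `target_next` and `draft_block` functions, the block size, and the acceptance rule (keep the longest matching prefix, then take one token from the target) are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_block(ctx: List[Token], k: int) -> List[Token]:
    """Toy stand-in for a drafter that proposes k tokens in one call.

    It mimics the target but deliberately guesses 0 wherever the
    correct token is 7, so some proposals get rejected.
    """
    block = [(ctx[-1] + i) % 10 for i in range(1, k + 1)]
    return [0 if tok == 7 else tok for tok in block]


def speculative_step(
    prefix: List[Token],
    draft: Callable[[List[Token], int], List[Token]],
    target: Callable[[List[Token]], Token],
    block_size: int,
) -> List[Token]:
    """Run one draft-then-verify round and return the accepted tokens."""
    proposal = draft(prefix, block_size)
    ctx = list(prefix)
    accepted: List[Token] = []
    # A real server would score every position in one batched forward
    # pass of the target model; the loop here is only for readability.
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # whole block accepted: one bonus token
    return accepted


def generate(
    prompt: List[Token], block_size: int, max_new_tokens: int
) -> Tuple[List[Token], int]:
    """Decode with speculation; return (tokens, number of verify rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_block, target_next, block_size))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def generate_plain(prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


tokens, rounds = generate([0], block_size=4, max_new_tokens=20)
```

In this toy setup the speculative output matches plain greedy decoding token for token, but it needs fewer verification rounds than tokens produced. That gap is where the speedup comes from when the drafter guesses well. Real-world gains depend on how often the drafter is right, which this sketch does not model.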

For anyone wanting to poke at it, the FP4 weights are open-sourced as a checkpoint on Hugging Face, with select TileRT modules on GitHub. The catch on the API: it costs 3x the standard MiMo-V2.5-Pro rate for roughly 10x the speed.
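The 3x-for-10x trade is easier to judge with numbers. The sketch below works through the arithmetic in Python. The base price (1.0 currency unit per million tokens) and the base speed (100 tokens/sec, which is just 1,000 divided by 10) are placeholder assumptions, not Xiaomi's published rates. Only the 3x and 10x multipliers come from the announcement.

```python
def compare_modes(
    tokens: int,
    base_price_per_mtok: float,
    base_tokens_per_sec: float,
    price_multiplier: float = 3.0,   # from the announcement
    speed_multiplier: float = 10.0,  # "roughly 10x", from the announcement
) -> dict:
    """Return cost and wall-clock time for the same job in both modes."""
    base_cost = tokens / 1_000_000 * base_price_per_mtok
    base_seconds = tokens / base_tokens_per_sec
    return {
        "standard": {"cost": base_cost, "seconds": base_seconds},
        "ultraspeed": {
            "cost": base_cost * price_multiplier,
            "seconds": base_seconds / speed_multiplier,
        },
    }


# Hypothetical job: 2 million output tokens at placeholder rates.
result = compare_modes(
    tokens=2_000_000,
    base_price_per_mtok=1.0,    # placeholder, not a real price
    base_tokens_per_sec=100.0,  # placeholder, inferred as 1,000 / 10
)

for mode, stats in result.items():
    print(f"{mode}: cost={stats['cost']:.2f}, seconds={stats['seconds']:.0f}")
```

The bill for a fixed number of tokens triples no matter what base rate you plug in, while the wait shrinks to a tenth. Whether that is a good deal depends on whether latency or spend is the constraint.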

Access is gated and short. A limited application-based trial runs June 9 through June 23, with enterprise users and pro developers getting priority. Quality still trails frontier models from Anthropic and OpenAI, so this is a speed play, not a reasoning upgrade.


Bottom Line

UltraSpeed runs a 1-trillion-parameter model at over 1,000 tokens/sec on a standard 8-GPU node, no custom chips required.

Quick Facts

  • Model: MiMo-V2.5-Pro-UltraSpeed, 1.02 trillion parameters (42B active)
  • Speed: over 1,000 tokens/sec, peaking near 1,200 (company-reported)
  • Hardware: single standard 8-GPU node, no custom silicon
  • Pricing: 3x standard MiMo-V2.5-Pro rate for ~10x speed
  • Trial window: June 9 to June 23, application-based
Tags: Xiaomi, MiMo, AI inference, open source, FP4 quantization, TileRT, LLM