OpenBMB dropped VoxCPM2 this week, a 2-billion-parameter text-to-speech model that skips the usual token-based approach entirely. The model weights are live on Hugging Face under an Apache 2.0 license. Trained on over 2 million hours of multilingual speech data, VoxCPM2 covers 30 languages and outputs 48kHz audio, the sample rate used in studio production.
The headline feature: you can describe a voice in plain text (gender, age, tone, emotion) and VoxCPM2 will generate it from scratch. No reference audio needed. It also does zero-shot voice cloning from a short audio clip, with optional style controls for emotion and pacing. The GitHub repo includes streaming synthesis support and fine-tuning scripts that work with as little as 5 to 10 minutes of audio.
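That 5-to-10-minute fine-tuning floor is small enough to sanity-check with the standard library. A minimal sketch; the helper names and the 48kHz mono WAV assumption are mine, not from the repo:

```python
import io
import wave

def wav_duration_seconds(data: bytes) -> float:
    """Duration of a WAV file, given its raw bytes."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.getnframes() / w.getframerate()

def total_minutes(clips: list[bytes]) -> float:
    """Total length of a set of clips, in minutes."""
    return sum(wav_duration_seconds(c) for c in clips) / 60.0

# Synthesize a silent 30-second, 48kHz mono clip for illustration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)          # mono
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(48_000)     # VoxCPM2's output rate
    w.writeframes(b"\x00\x00" * 48_000 * 30)

clips = [buf.getvalue()] * 12  # twelve 30-second clips
print(total_minutes(clips))    # 6.0 -- inside the 5-10 minute window
```

Anything in the 5-to-10-minute range would pass; the actual fine-tuning scripts in the repo handle their own data loading.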
Built on the MiniCPM-4 backbone, VoxCPM2 uses a diffusion-autoregressive architecture that models speech directly in continuous space rather than first quantizing it into discrete tokens. OpenBMB claims this preserves acoustic detail that tokenizer-based systems lose. Benchmark results on Seed-TTS-eval and other standard tests are self-reported, and independent validation hasn't surfaced yet.
The earlier VoxCPM 1.0 reported a real-time factor of 0.17 on an RTX 4090, meaning audio is generated roughly six times faster than it plays back. VoxCPM2 fine-tuning supports both full SFT and LoRA, and a live demo is available on Hugging Face Spaces.
Bottom Line
VoxCPM2 is a 2B open-source TTS model covering 30 languages with voice design and cloning in a single unified architecture, though benchmark claims remain self-reported.
Quick Facts
- 2 billion parameters
- 30 languages supported
- 48kHz audio output
- Trained on 2M+ hours of speech data
- Apache 2.0 license
- Benchmarks are company-reported, not independently verified