The Qwen team just dropped the full Qwen3-TTS model lineup on GitHub and Hugging Face. Five models in total, all Apache 2.0 licensed: Base and CustomVoice variants in both 0.6B and 1.7B parameter classes, plus a 1.7B VoiceDesign model. The release includes their 12Hz tokenizer, which Qwen says compresses audio at roughly half the framerate of typical speech tokenizers while delivering better reconstruction quality.
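The framerate claim is easy to put in concrete terms: a discrete tokenizer emits one frame of tokens per tick, so halving the framerate halves the sequence length the language model has to generate. A quick sketch of that arithmetic (the 4-codebooks-per-frame figure is a hypothetical illustration, not a number from the release):

```python
import math

def tokens_for_audio(seconds: float, frame_rate_hz: float, codebooks_per_frame: int) -> int:
    """Total discrete tokens for a clip: frames times codebooks per frame."""
    frames = math.ceil(seconds * frame_rate_hz)
    return frames * codebooks_per_frame

# 10 seconds of speech, assuming (hypothetically) 4 codebooks per frame:
print(tokens_for_audio(10, 12, 4))  # 12Hz tokenizer -> 480 tokens
print(tokens_for_audio(10, 25, 4))  # typical 25Hz tokenizer -> 1000 tokens
print(tokens_for_audio(10, 50, 4))  # typical 50Hz tokenizer -> 2000 tokens
```

Shorter token sequences mean faster autoregressive generation and longer audio within the same context window, which is why the framerate matters beyond reconstruction quality.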
Voice cloning works from 3 seconds of reference audio across 10 languages. The VoiceDesign model takes a different approach: describe the voice you want in natural language and it generates it. Qwen claims their 1.7B-VoiceDesign model beats GPT-4o-mini-tts and Mimo-Audio-7B-Instruct on InstructTTS-Eval benchmarks, though these are self-reported figures. On the MiniMax TTS multilingual test set, their Base model reportedly achieves lower word error rates than ElevenLabs and MiniMax across most tested languages.
The architecture uses a discrete multi-codebook language model rather than the LM+DiT approach common in recent TTS systems; Qwen claims this avoids information bottlenecks and cascading errors between stages. Streaming generation is supported, with first-packet latency under 300ms for real-time applications. Fine-tuning documentation is included in the repo.
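The low tokenizer framerate also shapes the streaming budget: at 12Hz, each token frame covers roughly 83ms of audio, so only a handful of frames are needed before the first packet can ship. A back-of-envelope sketch (the 3-frame first-packet size is a hypothetical choice, not a documented Qwen setting):

```python
FRAME_RATE_HZ = 12  # the release's tokenizer framerate

def first_packet_audio_ms(frames_in_first_packet: int) -> float:
    """Audio duration (ms) carried by the first streamed packet."""
    return frames_in_first_packet * 1000 / FRAME_RATE_HZ

# A hypothetical 3-frame first packet carries 250ms of audio,
# which fits inside the sub-300ms first-packet latency target.
print(first_packet_audio_ms(3))
```

In other words, the model only needs to decode a few frames before meaningful audio can start playing, which is what makes the sub-300ms figure plausible for an autoregressive system.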
The Bottom Line: This gives developers a full open-source TTS stack with voice cloning and natural language voice control, removing dependency on closed APIs for these capabilities.
QUICK FACTS
- 5 models released: 0.6B and 1.7B variants of Base and CustomVoice, plus a 1.7B VoiceDesign
- 10 languages supported: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- 3-second voice cloning via Base models
- 12Hz tokenizer (vs typical 25Hz or 50Hz)
- Apache 2.0 license
- vLLM day-0 support included