
Qwen3-TTS Open-Sources Full Model Lineup with Voice Cloning

Alibaba releases weights for 0.6B and 1.7B TTS models under Apache 2.0.

Andrés Martínez, AI Content Writer
January 22, 2026

The Qwen team just dropped the full Qwen3-TTS model lineup on GitHub and Hugging Face. Five models in total, all Apache 2.0 licensed: Base and CustomVoice variants at both 0.6B and 1.7B parameters, plus a 1.7B VoiceDesign model. The release includes their 12Hz tokenizer, which encodes audio at roughly half the frame rate of a typical 25Hz speech tokenizer (and about a quarter of a 50Hz one) while claiming better reconstruction quality.
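To make the frame-rate comparison concrete, here is a small arithmetic sketch. It assumes the nominal 12Hz figure quoted above and the 25Hz and 50Hz comparison rates; the function name `frames_for_clip` is made up for illustration and is not part of any Qwen tooling.

```python
import math


def frames_for_clip(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover a clip of the given duration."""
    return math.ceil(duration_s * frame_rate_hz)


clip_s = 10.0  # an arbitrary example clip length in seconds
for rate in (12, 25, 50):
    frames = frames_for_clip(clip_s, rate)
    print(f"{rate:>2} Hz -> {frames} frames for {clip_s:.0f} s of audio")
```

Note that this counts frames, not total tokens. A multi-codebook tokenizer emits several codes per frame, and the number of codebooks is not stated here.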

Voice cloning works from 3 seconds of reference audio across 10 languages. The VoiceDesign model takes a different approach: describe the voice you want in natural language and the model synthesizes it. Qwen claims their 1.7B-VoiceDesign model beats GPT-4o-mini-tts and MiMo-Audio-7B-Instruct on the InstructTTS-Eval benchmark, though these are self-reported figures. On the MiniMax TTS multilingual test set, their Base model reportedly achieves lower word error rates than ElevenLabs and MiniMax across most tested languages.
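For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference text and a transcript, divided by the reference length. For TTS it is commonly measured by transcribing the synthesized audio with a speech recognizer and comparing the result to the input text. The sketch below covers only the scoring step, under simplifying assumptions (lowercasing and whitespace splitting as the only normalization). It is not the evaluation pipeline Qwen used, which the article does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and the result can exceed 1.0 when the transcript contains many inserted words.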

The architecture uses a discrete multi-codebook language model rather than the LM+DiT approach common in recent TTS systems. They claim this avoids information bottlenecks and cascading errors. Streaming generation is supported with first-packet latency under 300ms for real-time applications. Fine-tuning documentation is included in the repo.
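First-packet latency is the time between issuing a request and receiving the first chunk of audio. One way to check a figure like the 300ms claim on your own hardware is sketched below. The `fake_tts_stream` generator is a hypothetical stand-in for a real streaming call, not the Qwen3-TTS API; swap in whatever iterator of audio chunks your setup exposes.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_packet_latency(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total number of chunks)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, count


def fake_tts_stream(n_chunks: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields silent byte chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulates per-chunk generation time
        yield b"\x00" * 320


latency, n = first_packet_latency(fake_tts_stream())
print(f"first packet after {latency * 1000:.1f} ms, {n} chunks total")
```

Because generators run lazily, the timer starts before any synthesis work happens, so the measurement includes the model's time to produce the first chunk.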

The Bottom Line: This gives developers a full open-source TTS stack with voice cloning and natural language voice control, removing dependency on closed APIs for these capabilities.


QUICK FACTS

  • 5 models released: 0.6B and 1.7B variants of Base and CustomVoice, plus a 1.7B VoiceDesign
  • 10 languages supported: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • 3-second voice cloning via Base models
  • 12Hz tokenizer (vs typical 25Hz or 50Hz)
  • Apache 2.0 license
  • vLLM day-0 support included

Tags: Qwen3-TTS, open source, voice cloning, TTS, Alibaba, speech synthesis, Apache 2.0