Mistral AI released Voxtral TTS on Thursday, an open-weight text-to-speech model built to run on edge devices. The architecture is unusually compact: a 3.4-billion-parameter transformer decoder, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec developed in-house. Total memory footprint sits around 3 GB.
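A back-of-the-envelope check on those figures, as a rough sketch: the parameter counts come from the article, but the article does not say what numeric precision the weights use, so the assumption that the roughly 3 GB footprint is weights only (with 1 GB taken as 1e9 bytes) is ours. The function names are illustrative and not part of any Mistral tooling.

```python
# Parameter counts as reported in the article.
PARAMS = {
    "transformer_decoder": 3.4e9,
    "flow_matching_acoustic_transformer": 390e6,
    "neural_audio_codec": 300e6,
}


def total_params(params=PARAMS):
    """Sum of all component parameter counts."""
    return sum(params.values())


def weight_gb(bits_per_param, params=PARAMS):
    """Weights-only size in GB (1 GB = 1e9 bytes); ignores activations and caches."""
    return total_params(params) * bits_per_param / 8 / 1e9


def implied_bits_per_param(footprint_gb, params=PARAMS):
    """Average bits per parameter if the whole footprint were weights (assumption)."""
    return footprint_gb * 1e9 * 8 / total_params(params)


if __name__ == "__main__":
    print(f"total parameters: {total_params() / 1e9:.2f}B")
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_gb(bits):.2f} GB")
    print(f"bits/param implied by 3 GB: {implied_bits_per_param(3.0):.2f}")
```

Under these assumptions the three components total about 4.09 billion parameters, which would not fit in 3 GB at 16 bits per parameter, so the reported footprint points to some form of reduced precision. That is an inference from the arithmetic, not a detail Mistral has confirmed here.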
The model clones a voice from under five seconds of reference audio and supports nine languages, preserving the speaker's accent even when switching between them. Pierre Stock, Mistral's VP of science, described a demo in which he fed the model his own French-accented voice and had it generate German speech that retained his vocal characteristics. Time to first audio is 90 milliseconds, with generation running at roughly 6x real-time speed, per TechCrunch's report.
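To put the speed figures in concrete terms, here is a minimal sketch of what they would mean for clip lengths a listener might care about. The 90 ms and 6x numbers are from the report; the simple model (generation time equals clip length divided by the speedup, and streaming playback starts at first audio) is an assumption for illustration, as are the function names.

```python
# Figures reported in the article (via TechCrunch); the timing model is an assumption.
TIME_TO_FIRST_AUDIO_S = 0.090
REALTIME_SPEEDUP = 6.0


def generation_seconds(audio_seconds, speedup=REALTIME_SPEEDUP):
    """Compute time needed to synthesize a clip of the given length."""
    return audio_seconds / speedup


def listener_wait_seconds(audio_seconds, streaming=True):
    """Delay before playback can start.

    Streaming: playback starts as soon as the first audio arrives.
    Non-streaming: the listener waits for the whole clip to be generated.
    """
    if streaming:
        return TIME_TO_FIRST_AUDIO_S
    return generation_seconds(audio_seconds)


if __name__ == "__main__":
    for clip in (5, 30, 120):
        print(
            f"{clip:>3}s clip: generated in {generation_seconds(clip):.2f}s, "
            f"wait {listener_wait_seconds(clip):.3f}s streaming vs "
            f"{listener_wait_seconds(clip, streaming=False):.2f}s non-streaming"
        )
```

Because generation outpaces playback by a wide margin in this model, a streamed clip should not stall once it starts, though real-world behavior depends on hardware the report does not specify.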
Mistral is targeting ElevenLabs directly. In the company's own human evaluations, listeners preferred Voxtral TTS over ElevenLabs Flash v2.5 62.8% of the time on standard voices and 69.9% of the time on voice customization. Those numbers are self-reported, and independent testing hasn't confirmed them yet. Stock called audio "maybe the only future interface" for AI models, a bold framing from a company that shipped its Voxtral Transcribe 2 speech-to-text models only weeks ago.
The play here is control, not just quality. Every major TTS competitor runs a proprietary API; Mistral is handing over the weights. Enterprises can run the model on their own servers, on phones, or, according to Stock, even on a smartwatch. For Mistral, valued at $13.8 billion after its $2 billion Series C, the release completes a full speech-to-speech pipeline that never has to touch a third-party server.
Bottom Line
Voxtral TTS gives enterprises an open-weight TTS model that fits in about 3 GB of memory and claims to beat ElevenLabs Flash v2.5 in human preference tests, though those results are company-reported.
Quick Facts
- 3.4B-parameter transformer decoder + 390M flow-matching transformer + 300M audio codec
- 90 ms time-to-first-audio, roughly 6x real-time generation speed
- 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
- 62.8% preference vs ElevenLabs Flash v2.5 on standard voices (company-reported)
- Voice cloning from under 5 seconds of reference audio




