Mistral Voxtral TTS: Open-Weight Model Takes on ElevenLabs

Mistral AI released Voxtral TTS on Thursday, an open-weight text-to-speech model built to run on edge devices. The architecture is unusually compact: a 3.4-billion-parameter transformer decoder, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec developed in-house. Total memory footprint sits around 3 GB.
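As a rough sanity check on those figures, the sketch below adds up the three reported parameter counts and asks how many bits per parameter a 3 GB footprint would imply. It assumes that the 3 GB figure refers to model weights only and that "GB" means 10^9 bytes; the article states neither, and the function names are illustrative.

```python
# Parameter counts as reported in the article.
PARAMS = {
    "transformer_decoder": 3.4e9,
    "flow_matching_acoustic_transformer": 390e6,
    "neural_audio_codec": 300e6,
}

# "Around 3 GB" per the article. Assumption: decimal gigabytes, weights only.
REPORTED_FOOTPRINT_BYTES = 3e9


def total_params(params=PARAMS):
    """Sum the parameter counts of all components."""
    return sum(params.values())


def weight_bytes(bits_per_param, params=PARAMS):
    """Bytes needed to store every parameter at a given precision."""
    return total_params(params) * bits_per_param / 8


def implied_bits_per_param(footprint_bytes=REPORTED_FOOTPRINT_BYTES, params=PARAMS):
    """Average bits per parameter implied by a given footprint."""
    return footprint_bytes * 8 / total_params(params)


if __name__ == "__main__":
    print(f"total parameters: {total_params() / 1e9:.2f}B")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_bytes(bits) / 1e9:.1f} GB")
    print(f"implied precision: {implied_bits_per_param():.1f} bits per parameter")
```

Under these assumptions the components total about 4.09 billion parameters, which would need roughly 8.2 GB at 16 bits per weight. A 3 GB footprint works out to a little under 6 bits per parameter, which would be consistent with some form of weight quantization, though the article does not say how the figure was measured.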

The model clones a voice from under five seconds of reference audio across nine languages, preserving the speaker's accent even when switching languages. Pierre Stock, Mistral's VP of science, described a demo where he fed the model his own French-accented voice and had it generate German speech that retained his vocal characteristics. Latency is 90 milliseconds to first audio, with generation running at roughly 6x real-time speed, per TechCrunch's report.
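To put the speed figure in concrete terms, here is a minimal sketch that converts a real-time factor into wall-clock generation time. It assumes "6x real-time" means six seconds of audio produced per second of compute, which is the usual reading but is not spelled out in the report; the function name is hypothetical.

```python
def generation_seconds(audio_seconds, realtime_factor=6.0):
    """Wall-clock seconds needed to synthesize a clip of the given length.

    Assumption: a real-time factor of N means N seconds of audio are
    produced per second of compute.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    # A one-minute clip at the reported 6x speed.
    print(generation_seconds(60.0))  # prints 10.0
```

Because the model generates faster than playback under this reading, a streaming client could start playing shortly after the 90-millisecond first-audio mark without waiting for the full clip.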

Mistral is targeting ElevenLabs directly. In the company's own human evaluations, Voxtral TTS scored a 62.8% listener preference over ElevenLabs Flash v2.5 on standard voices and 69.9% on voice customization. Those numbers are self-reported, and independent testing hasn't confirmed them yet. Stock called audio "maybe the only future interface" for AI models, a bold framing given the company just shipped its Voxtral Transcribe 2 speech-to-text models weeks ago.

The play here is control, not just quality. Every major TTS competitor runs a proprietary API; Mistral is handing over the weights. Enterprises can run the model on their own servers, on phones, or, according to Stock, even on a smartwatch. For a company valued at $13.8 billion after its $2 billion Series C, the release completes a full speech-to-speech pipeline that never touches a third-party server.

Bottom Line

Voxtral TTS gives enterprises a full open-weight TTS model that fits in 3 GB of RAM and claims to beat ElevenLabs Flash v2.5 in human preference tests, though those results are company-reported.

Quick Facts

3.4B-parameter transformer decoder + 390M flow-matching transformer + 300M audio codec
90 ms time-to-first-audio, 6x real-time generation speed
9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
62.8% preference vs ElevenLabs Flash v2.5 on standard voices (company-reported)
Voice cloning from under 5 seconds of reference audio

Tags: Mistral AI, Voxtral TTS, text-to-speech, open source, ElevenLabs, voice AI, edge AI

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
