Mistral Releases Open-Weight Text-to-Speech Model Challenging ElevenLabs

Voxtral TTS runs on 3 GB of RAM and clones voices from five seconds of audio.

Andrés Martínez
AI Content Writer
March 26, 2026

Mistral AI released Voxtral TTS on Thursday, an open-weight text-to-speech model built to run on edge devices. The architecture is unusually compact: a 3.4-billion-parameter transformer decoder, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec developed in-house. Total memory footprint sits around 3 GB.
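As a rough back-of-the-envelope check on those figures, the sketch below sums the three component sizes and divides the reported footprint by the total. It assumes, which the announcement as reported here does not state, that the roughly 3 GB covers the weights of all three components and that 1 GB means 10^9 bytes. The variable names are illustrative only.

```python
# Component sizes as reported in the article (parameter counts).
components = {
    "transformer_decoder": 3.4e9,
    "flow_matching_acoustic_transformer": 390e6,
    "neural_audio_codec": 300e6,
}

# Assumption: the ~3 GB footprint is weights only, with 1 GB = 1e9 bytes.
footprint_bytes = 3e9

total_params = sum(components.values())
bytes_per_param = footprint_bytes / total_params
bits_per_param = bytes_per_param * 8

print(f"total parameters: {total_params / 1e9:.2f}B")
print(f"implied storage: {bits_per_param:.1f} bits per parameter")
```

Under these assumptions the total comes to about 4.09 billion parameters at just under 6 bits each. At 16 bits per parameter the same weights would need roughly 8.2 GB, so the reported footprint is consistent with some form of reduced-precision storage, though the exact format is not given here.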

The model clones a voice from under five seconds of reference audio across nine languages, preserving the speaker's accent even when switching languages. Pierre Stock, Mistral's VP of science, described a demo where he fed the model his own French-accented voice and had it generate German speech that retained his vocal characteristics. Latency is 90 milliseconds to first audio, with generation running at roughly 6x real-time speed, per TechCrunch's report.
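To put the latency figures in concrete terms, here is a minimal sketch that estimates how long a clip takes to synthesize. It assumes the reported 90 ms and 6x figures hold uniformly and, as a simplification, that total time is the first-audio latency plus the clip length divided by the speed factor. The function name `estimated_generation_seconds` is hypothetical and not part of any Mistral API.

```python
def estimated_generation_seconds(
    audio_seconds: float,
    first_audio_latency_s: float = 0.090,  # 90 ms, as reported
    realtime_factor: float = 6.0,          # "roughly 6x", as reported
) -> float:
    """Rough wall-clock estimate for synthesizing a clip of the given length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Simplifying assumption: startup latency plus steady-state generation time.
    return first_audio_latency_s + audio_seconds / realtime_factor


for clip in (5, 30, 60):
    print(f"{clip:>2} s of speech: ~{estimated_generation_seconds(clip):.2f} s to generate")
```

By this estimate a 60-second clip would take about 10 seconds to produce. Because generation runs faster than playback, a streaming setup could in principle start playing audio after the initial 90 ms without stalling.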

Mistral is targeting ElevenLabs directly. In the company's own human evaluations, Voxtral TTS scored a 62.8% listener preference over ElevenLabs Flash v2.5 on standard voices and 69.9% on voice customization. Those numbers are self-reported, and independent testing hasn't confirmed them yet. Stock called audio "maybe the only future interface" for AI models, a bold framing given the company just shipped its Voxtral Transcribe 2 speech-to-text models weeks ago.

The play here is control, not just quality. Every major TTS competitor runs a proprietary API; Mistral is handing over the weights. Enterprises can run the model on their own servers, on phones, or, according to Stock, even on a smartwatch. For a company valued at $13.8 billion after its $2 billion Series C, this completes a full speech-to-speech pipeline that never touches a third-party server.


Bottom Line

Voxtral TTS gives enterprises a full open-weight TTS model that fits in 3 GB of RAM and claims to beat ElevenLabs Flash v2.5 in human preference tests, though those results are company-reported.

Quick Facts

  • 3.4B-parameter transformer decoder + 390M flow-matching transformer + 300M audio codec
  • 90 ms time-to-first-audio, roughly 6x real-time generation speed
  • 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
  • 62.8% preference vs ElevenLabs Flash v2.5 on standard voices (company-reported)
  • Voice cloning from under 5 seconds of reference audio