Google released Gemini 3.1 Flash TTS on Tuesday, a text-to-speech model that treats voice generation less like synthesis and more like audio directing. The core pitch: over 200 audio tags that let developers control vocal style, pacing, and delivery by embedding natural language commands directly in the input text. Tags like [whispers], [enthusiasm], and [slow] steer the output mid-sentence.
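To make the inline-tag idea concrete, here is a minimal sketch of how a developer might assemble tagged input text before sending it to the model. It only builds and inspects strings; it does not call any Google API. The helper names `tag` and `list_tags` are hypothetical, and the only tags used are the three quoted above.

```python
import re

# Matches bracketed tags such as [whispers] or [slow].
# Assumption: tags are lowercase words in square brackets, as in the
# examples quoted in the article. The full tag grammar is not known here.
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def tag(name: str, text: str) -> str:
    """Prefix a span of text with an inline audio tag (hypothetical helper)."""
    return f"[{name}] {text}"


def list_tags(prompt: str) -> list[str]:
    """Return the tag names found in a prompt, in order of appearance."""
    return TAG_PATTERN.findall(prompt)


prompt = " ".join([
    "Welcome back to the show.",
    tag("whispers", "Don't tell anyone, but we have a surprise."),
    tag("enthusiasm", "Today's guest is finally here!"),
])

print(prompt)
print(list_tags(prompt))
```

The resulting string is what would be passed as the input text; listing the tags before sending is a cheap way to catch typos in a long script.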
The model scored an Elo rating of 1,211 on the Artificial Analysis TTS leaderboard, which ranks models through blind human preference comparisons. That puts it second overall, behind Inworld TTS 1.5 Max (Elo 1,215) and ahead of ElevenLabs' Eleven v3. Google calls it its "most controllable" TTS yet, though the four-point Elo gap with Inworld is slim enough to be noise. Worth noting: these scores come from third-party blind tests, not Google's own benchmarks.
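For a sense of how small a four-point gap is, the sketch below converts the two ratings into an expected head-to-head preference rate. It assumes the standard logistic Elo formula with a scale of 400; Artificial Analysis may compute its ratings differently, so treat the number as illustrative only. The function name `elo_expected_score` is ours.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of comparisons won by A over B under standard Elo.

    Assumption: the classic logistic formula with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article.
inworld = 1215
gemini = 1211

p_inworld = elo_expected_score(inworld, gemini)
print(f"Expected preference for the leader: {p_inworld:.1%}")
```

Under that assumption, the leader would be preferred in roughly 50.6% of blind comparisons, which is close to a coin flip.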
The developer guide reveals a surprisingly theatrical prompting approach. Scene direction sets the environment, speaker profiles define character voices, and the whole thing can be exported as API code for consistent deployment. Native multi-speaker dialogue is built in, so developers building podcasts or audiobooks don't need to stitch separate voice calls together. It supports 70+ languages with the same tag controls, and all output carries SynthID watermarking.
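The scene-plus-speakers structure can be sketched as a small prompt builder. The layout below (the `Scene:`, `Speakers:`, and `Dialogue:` labels) is an assumed format for illustration, not the one from Google's developer guide, and `Speaker` and `build_script` are hypothetical names. The code only produces text and makes no API call.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    name: str
    profile: str  # free-text description of the character's voice


def build_script(scene: str, speakers: list[Speaker],
                 lines: list[tuple[str, str]]) -> str:
    """Assemble one multi-speaker prompt from a scene, profiles, and dialogue.

    Assumption: the section labels are illustrative, not an official format.
    """
    known = {s.name for s in speakers}
    for name, _ in lines:
        if name not in known:
            raise ValueError(f"unknown speaker: {name}")

    parts = [f"Scene: {scene}", "Speakers:"]
    parts += [f"- {s.name}: {s.profile}" for s in speakers]
    parts.append("Dialogue:")
    parts += [f"{name}: {text}" for name, text in lines]
    return "\n".join(parts)


script = build_script(
    scene="A late-night radio studio, quiet and close-miked.",
    speakers=[
        Speaker("Ana", "warm, unhurried host"),
        Speaker("Ben", "energetic guest who talks fast"),
    ],
    lines=[
        ("Ana", "[slow] Welcome back, everyone."),
        ("Ben", "[enthusiasm] Thanks for having me!"),
    ],
)
print(script)
```

Keeping the whole conversation in one prompt mirrors the point above: a single request carries both voices, so there is nothing to stitch together afterward.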
The model is available now via the Gemini API, Google AI Studio, and Vertex AI. Pricing wasn't disclosed in detail, though Artificial Analysis placed it in its "most attractive quadrant" for cost-to-quality ratio.
Bottom Line
Gemini 3.1 Flash TTS ranks second on the independent Artificial Analysis TTS leaderboard with an Elo of 1,211, four points behind the leader.
Quick Facts
- Elo score: 1,211 on Artificial Analysis TTS leaderboard (ranked #2)
- 200+ audio tags for controlling voice style, pacing, delivery
- 70+ languages supported
- Native multi-speaker dialogue built in
- Available via Gemini API, Google AI Studio, Vertex AI, Google Vids
- All output watermarked with SynthID




