Google Launches Gemini 3.1 Flash TTS With 200+ Audio Tags

Abstract waveform visualization representing AI-generated speech with colorful audio frequency patterns

Google released Gemini 3.1 Flash TTS on Tuesday, a text-to-speech model that treats voice generation less like synthesis and more like audio directing. The core pitch: over 200 audio tags that let developers control vocal style, pacing, and delivery by embedding natural language commands directly in the input text. Tags like [whispers], [enthusiasm], and [slow] steer the output mid-sentence.

The model scored an Elo rating of 1,211 on the Artificial Analysis TTS leaderboard, which ranks models through blind human preference comparisons. That puts it second overall, behind Inworld TTS 1.5 Max (Elo 1,215) and ahead of ElevenLabs' Eleven v3. Google calls it their "most controllable" TTS yet, though the Elo gap with Inworld is slim enough to be noise. Worth noting: these scores come from third-party blind tests, not Google's own benchmarks.

The developer guide reveals a surprisingly theatrical prompting approach. Scene direction sets the environment, speaker profiles define character voices, and the whole thing can be exported as API code for consistent deployment. Native multi-speaker dialogue is built in, so developers building podcasts or audiobooks don't need to stitch separate voice calls together. It supports 70+ languages with the same tag controls, and all output carries SynthID watermarking.

Available now via the Gemini API, Google AI Studio, and Vertex AI. Pricing wasn't disclosed in detail, though Artificial Analysis placed it in its "most attractive quadrant" for cost-to-quality ratio.

Bottom Line

Gemini 3.1 Flash TTS ranks second on the independent Artificial Analysis TTS leaderboard with an Elo of 1,211, four points behind the leader.

Quick Facts

Elo score: 1,211 on Artificial Analysis TTS leaderboard (ranked #2)
200+ audio tags for controlling voice style, pacing, delivery
70+ languages supported
Native multi-speaker dialogue built in
Available via Gemini API, Google AI Studio, Vertex AI, Google Vids
All output watermarked with SynthID

Tags:Googletext-to-speechGeminispeech synthesisGoogle AI StudioDeepMindvoice AI

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

Google Launches Gemini 3.1 Flash TTS With 200+ Voice Control Tags

Bottom Line

Quick Facts

Andrés Martínez

Related Articles

NVIDIA Releases Audio Flamingo Next, an Open Audio-Language Model

ChatGPT Market Share Slides to 68% as Gemini and Claude Chip Away at OpenAI

Alibaba Releases Qwen3.6-35B-A3B Open Multimodal MoE Model

Stay Ahead of the AI Curve