Google: Gemini 3.1 Flash TTS Preview

google/gemini-3.1-flash-tts-preview

Gemini 3.1 Flash TTS Preview is a text-to-speech model from Google and a generational step up from Gemini 2.5 Flash TTS. It takes text input and produces audio output across 70+ languages, nearly 3× the language coverage of its predecessor.

The headline addition is a system of 200+ inline audio tags (e.g. [whispers], [laughs], [excited]) that let developers steer delivery, emotion, and pacing mid-sentence, alongside a "director's chair" workflow in Google AI Studio for defining per-character Audio Profiles and scene-level context. The model supports up to two speakers with independent voice and style configuration per speaker, outputs PCM audio at 24 kHz / 16-bit mono, and automatically watermarks all output with SynthID. The context window is 32k tokens.
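Because the output is raw PCM rather than a playable file, a client usually has to wrap it in a container before saving it. The following is a minimal sketch of that step using only the Python standard library. The sample rate, bit depth, and channel count come from the description above; the helper names are hypothetical, and the byte order (signed little-endian, as WAV expects) is an assumption that this listing does not confirm.

```python
import io
import wave

# Audio format taken from the model description.
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1            # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes.

    Assumes the samples are signed 16-bit little-endian (not stated in the listing).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Return the playback length of a raw PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

At this format, one second of audio occupies 24,000 × 2 = 48,000 bytes, so `pcm_duration_seconds` is a quick way to sanity-check a response before writing it to disk.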

Modalities: text input, audio output

Input Price: $1/M tokens

Output Price: $20/M tokens

Context: 8K

Weekly Tokens: 84.1M

Released: Apr 24, 2026
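The listed rates make it straightforward to estimate the cost of a request. The sketch below assumes that billing is linear in token counts at the listed per-million prices and that the input and output token counts are known (for example, from a usage report). The function name is hypothetical and not part of any API.

```python
from decimal import Decimal

# Listed prices in USD per one million tokens.
INPUT_USD_PER_M = Decimal("1")
OUTPUT_USD_PER_M = Decimal("20")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate request cost in USD, assuming simple linear per-token pricing."""
    total = INPUT_USD_PER_M * input_tokens + OUTPUT_USD_PER_M * output_tokens
    return total / TOKENS_PER_M
```

For example, a request with 2,000 input tokens and 50,000 output tokens comes to $0.002 + $1.00 = $1.002 under these assumptions. Output tokens dominate the bill because they cost 20× as much as input tokens.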
