Best Text-to-Speech Models

Model rankings updated July 2026 based on real usage data.

Text-to-speech models turn written text into spoken audio for assistants, narration, accessibility tools, voiceovers, learning apps, and customer support workflows. Browse top TTS models on OpenRouter and compare voices, latency, pricing, and provider capabilities to find the best speech model for your product.

Top Text-to-Speech Models on OpenRouter

Google: Gemini 3.1 Flash TTS Preview

181M tokens

Gemini 3.1 Flash TTS Preview is a text-to-speech model from Google and a substantial generational step up from Gemini 2.5 Flash TTS. It takes text input and produces audio output across 70+ languages, nearly 3× the language coverage of its predecessor.

The headline addition is a system of 200+ inline audio tags (e.g. [whispers], [laughs], [excited]) that let developers steer delivery, emotion, and pacing mid-sentence, alongside a "director's chair" workflow in Google AI Studio for defining per-character Audio Profiles and scene-level context. The model supports up to two speakers, each with independent voice and style configuration, outputs PCM audio at 24 kHz / 16-bit mono, and automatically watermarks all output with SynthID. The context window is 32k tokens.
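Raw PCM carries no header, so most players will not open it directly. As a minimal sketch, the snippet below wraps a PCM byte string in a WAV container using the 24 kHz / 16-bit mono parameters stated above. The function names are hypothetical, and the code assumes the samples are little-endian signed integers, which the listing does not confirm.

```python
import io
import wave

# Parameters taken from the model description: 24 kHz, 16-bit, mono.
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2
CHANNELS = 1


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container (hypothetical helper).

    Assumes little-endian signed 16-bit samples, the layout WAV expects.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at the stated format."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

At this format one second of audio is 48,000 bytes, which is a quick sanity check on any response body before writing it to disk.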

by google | 33K context | $1/M input tokens | $20/M output tokens

hexgrad: Kokoro 82M

65.1M tokens

Kokoro 82M is a lightweight, open-weight text-to-speech model from hexgrad. It converts text to speech across 8 languages (American and British English, Spanish, French, Hindi, Italian, Japanese, Portuguese, and Chinese) using 54 preset voices organized by language and gender. At 82M parameters, it is well-suited for multilingual TTS deployments where footprint and cost efficiency matter.

by hexgrad | 4K context | $0.62/M input tokens | $0/M output tokens

xAI: Grok Voice TTS 1.0

5.43M tokens

Grok Voice TTS 1.0 is a text-to-speech model from xAI. It converts text into spoken audio across 20+ languages with automatic language detection, and offers five built-in voices (Eve, Ara, Rex, Sal, Leo) covering a range of tones. Inline speech tags allow control over pauses, emphasis, pitch, speed, and vocal style. Output is available in MP3, WAV, PCM, μ-law, and A-law formats at sample rates from 8 kHz to 48 kHz, with up to 15,000 characters per request.
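Longer scripts have to be split to fit the 15,000-character request limit. The following is a rough sketch of one way to do that, breaking at sentence boundaries where possible. The helper name is hypothetical, and the code assumes the limit counts Python string characters (including any inline speech tags), which the listing does not specify.

```python
import re

MAX_CHARS = 15_000  # per-request limit stated in the model description


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars (hypothetical helper).

    Prefers breaks after sentence-ending punctuation; a single sentence
    longer than max_chars is cut at the limit as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one request.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio segments concatenated in order. A hard split in the middle of a sentence may produce an audible seam, so it is worth keeping sentences well under the limit.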

by x-ai | 15K context | $15/M input tokens | $0/M output tokens

Microsoft: MAI-Voice-2

2.99M tokens

MAI-Voice-2 is an expressive text-to-speech model from Microsoft. It is suited for conversational assistants, media narration, accessibility, education, and other long-form voice applications. It supports 15 languages across 18 locales, fine-grained control of tone and delivery, multi-speaker generation, and voice prompting from short audio clips without fine-tuning. The model prioritizes naturalness and expressivity over latency-critical generation.

by microsoft | $22/M input tokens | $0/M output tokens

Deepgram: Aura-2

343K tokens

Aura-2 is a multilingual text-to-speech model from Deepgram. It supports Deepgram’s canonical Aura-2 voice catalog for speech synthesis across multiple languages.

by deepgram | $30/M input tokens | $0/M output tokens

Mistral: Voxtral Mini TTS

234K tokens

Voxtral Mini TTS is Mistral's text-to-speech model featuring zero-shot voice cloning and multilingual support. It converts text input into natural-sounding audio output.

by mistralai | 4K context | $16/M input tokens | $0/M output tokens

MiniMax: Speech 2.8 HD

58K tokens

MiniMax Speech 2.8 HD is a text-to-speech model from MiniMax. It is suited for applications that generate spoken audio from text and accepts arbitrary MiniMax voice IDs.

by minimax | $100/M input tokens | $0/M output tokens

Canopy Labs: Orpheus 3B

49K tokens

Orpheus 3B is an English text-to-speech model from Canopy Labs, fine-tuned for natural prosody and expressive delivery. It offers 7 preset voices and is suited for narration, voice assistants, and interactive applications where naturalistic speech is a priority.

by canopylabs | 4K context | $7/M input tokens | $0/M output tokens

Sesame: CSM 1B

37K tokens

CSM 1B is a conversational speech model from Sesame. It accepts text input and produces English speech output, with voice options spanning conversational and read-speech styles. At 1B parameters, it is suited for dialogue-oriented applications such as voice assistants and interactive agents.

by sesame | 4K context | $7/M input tokens | $0/M output tokens

MiniMax: Speech 2.8 Turbo

34K tokens

MiniMax Speech 2.8 Turbo is a text-to-speech model from MiniMax. It is suited for applications that generate spoken audio from text and accepts arbitrary MiniMax voice IDs.

by minimax | $60/M input tokens | $0/M output tokens
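To compare the per-million-token prices listed on this page for a given workload, a small calculator along these lines may help. It is only a sketch: the dictionary keys and function name are made up for illustration, the prices are copied from the listings above, and how each model counts input and output tokens is not stated here, so the token counts are whatever your usage reports show.

```python
# (input USD per 1M tokens, output USD per 1M tokens), copied from this page.
PRICES_PER_MILLION = {
    "Gemini 3.1 Flash TTS Preview": (1.00, 20.00),
    "Kokoro 82M": (0.62, 0.00),
    "Grok Voice TTS 1.0": (15.00, 0.00),
    "MAI-Voice-2": (22.00, 0.00),
    "Aura-2": (30.00, 0.00),
    "Voxtral Mini TTS": (16.00, 0.00),
    "Speech 2.8 HD": (100.00, 0.00),
    "Orpheus 3B": (7.00, 0.00),
    "CSM 1B": (7.00, 0.00),
    "Speech 2.8 Turbo": (60.00, 0.00),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Estimate the cost of one workload (hypothetical helper)."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def rank_by_cost(input_tokens: int, output_tokens: int = 0) -> list[tuple[str, float]]:
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = [
        (name, estimate_cost_usd(name, input_tokens, output_tokens))
        for name in PRICES_PER_MILLION
    ]
    return sorted(costs, key=lambda pair: pair[1])
```

Because Gemini 3.1 Flash TTS Preview is the only listed model that bills output tokens, its ranking depends heavily on how much audio a request produces, so it is worth passing a realistic output count rather than leaving it at zero.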