OpenRouter
© 2026 OpenRouter, Inc

Best Text-to-Speech Models

Model rankings updated April 2026 based on real usage data.

Text-to-speech models turn written text into spoken audio for assistants, narration, accessibility tools, voiceovers, learning apps, and customer support workflows. Browse top TTS models on OpenRouter and compare voices, latency, pricing, and provider capabilities to find the best speech model for your product.

Top Text-to-Speech Models on OpenRouter

Google: Gemini 3.1 Flash TTS Preview

22.4M tokens

Gemini 3.1 Flash TTS Preview is a text-to-speech model from Google and a substantial generational step up from Gemini 2.5 Flash TTS. It takes text input and produces audio output across 70+ languages, nearly 3× the language coverage of its predecessor.

The headline addition is a system of 200+ inline audio tags (e.g. [whispers], [laughs], [excited]) that let developers steer delivery, emotion, and pacing mid-sentence, alongside a "director's chair" workflow in Google AI Studio for defining per-character Audio Profiles and scene-level context. It supports up to two speakers with independent voice and style configuration per speaker, outputs PCM audio at 24 kHz / 16-bit mono, and automatically watermarks all output with SynthID. Context window is 32k tokens.
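
Because the output is described as raw PCM at 24 kHz / 16-bit mono, it usually needs a container before ordinary audio players will open it. The following is a minimal sketch using Python's standard `wave` module. It assumes the bytes are signed little-endian samples (the byte order is not stated in the description above), and the helper names `pcm_to_wav_bytes` and `duration_seconds` are hypothetical, not part of any OpenRouter or Google SDK.

```python
import io
import wave

# Format constants taken from the model description: 24 kHz, 16-bit, mono.
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1            # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container.

    Assumption: samples are signed little-endian, which is what WAV expects.
    """
    frame_size = SAMPLE_WIDTH_BYTES * CHANNELS
    if len(pcm) % frame_size != 0:
        raise ValueError("PCM length is not a whole number of 16-bit frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at the stated format."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

At this format one second of audio is 48,000 bytes, so `duration_seconds` gives a quick sanity check on a response before it is written to disk.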

by google · 8K context · $1/M input tokens · $20/M output tokens

OpenAI: GPT-4o Mini TTS

4.01M tokens

GPT-4o Mini TTS is OpenAI's cost-efficient text-to-speech model. It converts text input into natural-sounding audio output, supporting a variety of voices and tones.

by openai · 4K context · $0.60/M input tokens · $0/M output tokens

hexgrad: Kokoro 82M

1.37M tokens

Kokoro 82M is a lightweight, open-weight text-to-speech model from hexgrad. It converts text to speech across 8 languages (American and British English, Spanish, French, Hindi, Italian, Japanese, Portuguese, and Chinese) using 54 preset voices organized by language and gender. At 82M parameters, it is well-suited for multilingual TTS deployments where footprint and cost efficiency matter.

by hexgrad · 4K context · $0.62/M input tokens · $0/M output tokens

Mistral: Voxtral Mini TTS

109K tokens

Voxtral Mini TTS is Mistral's text-to-speech model featuring zero-shot voice cloning and multilingual support. It converts text input into natural-sounding audio output.

by mistralai · 4K context · $16/M input tokens · $0/M output tokens

Canopy Labs: Orpheus 3B

42K tokens

Orpheus 3B is an English text-to-speech model from Canopy Labs, fine-tuned for natural prosody and expressive delivery. It offers 7 preset voices and is suited for narration, voice assistants, and interactive applications where naturalistic speech is a priority.

by canopylabs · 4K context · $7/M input tokens · $0/M output tokens

Sesame: CSM 1B

22K tokens

CSM 1B is a conversational speech model from Sesame. It accepts text input and produces English speech output, with voice options spanning conversational and read-speech styles. At 1B parameters, it is suited for dialogue-oriented applications such as voice assistants and interactive agents.

by sesame · 4K context · $7/M input tokens · $0/M output tokens

Zyphra: Zonos v0.1 Transformer

3K tokens

Zonos v0.1 Transformer is a text-to-speech model from Zyphra built on a pure transformer architecture. It offers the same American and British English voice coverage as the Hybrid variant, and is suited for deployments where a transformer-only inference stack is preferred.

by zyphra · 4K context · $7/M input tokens · $0/M output tokens

Zyphra: Zonos v0.1 Hybrid

2K tokens

Zonos v0.1 Hybrid is a text-to-speech model from Zyphra built on a hybrid architecture. It produces English speech output with coverage across American and British accents in male and female voices. It is suited for English-language voice applications requiring accent and gender variety.

by zyphra · 4K context · $7/M input tokens · $0/M output tokens
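
To compare the listed prices for a given workload, the per-million-token rates can be turned into a rough cost estimate. This is only a sketch: it assumes billing is linear in token count with no minimums or rounding, and it does not model how each provider counts tokens for text or audio, which the listing does not state. The `PRICES` table and the `estimate_cost_usd` helper are hypothetical names introduced for illustration; the figures are the ones shown in the listing.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the listing.
PRICES = {
    "Gemini 3.1 Flash TTS Preview": (Decimal("1"), Decimal("20")),
    "GPT-4o Mini TTS": (Decimal("0.60"), Decimal("0")),
    "Kokoro 82M": (Decimal("0.62"), Decimal("0")),
    "Voxtral Mini TTS": (Decimal("16"), Decimal("0")),
    "Orpheus 3B": (Decimal("7"), Decimal("0")),
    "CSM 1B": (Decimal("7"), Decimal("0")),
    "Zonos v0.1 Transformer": (Decimal("7"), Decimal("0")),
    "Zonos v0.1 Hybrid": (Decimal("7"), Decimal("0")),
}

TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int = 0) -> Decimal:
    """Linear estimate: tokens times the per-million rate, input plus output."""
    in_price, out_price = PRICES[model]
    total = in_price * input_tokens + out_price * output_tokens
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Example workload: 10,000 input tokens and 50,000 output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost_usd(name, 10_000, 50_000)}")
```

`Decimal` is used instead of floats so that small per-request amounts add up without binary rounding error when totals are accumulated across many calls.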