Best Speech-to-Text and Transcription Models

Model rankings updated July 2026 based on real usage data.

Speech-to-text models convert spoken audio into text for transcription, captions, voice notes, meeting summaries, call analysis, and speech-driven applications. This collection helps you compare the best transcription models on OpenRouter by accuracy, speed, language support, and cost for your audio workflows.

Top Speech-to-Text Models on OpenRouter

OpenAI: GPT-4o Mini Transcribe

186M tokens

GPT-4o Mini Transcribe is OpenAI's smaller, cost-efficient speech-to-text model built on GPT-4o Mini audio capabilities. It's priced per token (input and output), making it suitable for high-volume transcription workflows that benefit from token-level billing transparency at a lower cost point.

by openai | 128K context | $1.25/M input tokens | $5/M output tokens
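To make per-token billing concrete, here is a minimal sketch of the arithmetic, using the listed rates of $1.25 per million input tokens and $5 per million output tokens. The function name `transcription_cost` and the token counts in the example are hypothetical. How many tokens a given minute of audio produces is not stated here, so the counts are assumptions chosen only to illustrate the formula.

```python
def transcription_cost(input_tokens: int, output_tokens: int,
                       input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of a job billed per million input and output tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical job: 2M input tokens and 0.5M output tokens (assumed counts)
# at the listed GPT-4o Mini Transcribe rates.
cost = transcription_cost(2_000_000, 500_000, 1.25, 5.00)
print(f"${cost:.2f}")  # prints $5.00
```

The same function applies to any entry in this list that is priced per million tokens; only the two rate arguments change.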

OpenAI: GPT-4o Transcribe

68.1M tokens

GPT-4o Transcribe is OpenAI's high-quality speech-to-text model built on GPT-4o audio capabilities. It's priced per token (input and output), making it suitable for workflows that benefit from token-level billing transparency.

by openai | 128K context | $2.50/M input tokens | $10/M output tokens

Mistral: Voxtral Mini Transcribe

9.84M tokens

Voxtral Mini Transcribe is Mistral's speech-to-text model, derived from the Voxtral Mini family. It accepts audio input and returns transcribed text via the standard transcription API. It is suited for transcribing meetings, voice notes, podcasts, and other spoken content.

by mistralai | $3,000/M input tokens | $0/M output tokens

Deepgram: Nova-3

Nova-3 is Deepgram's general-purpose speech-to-text model, with monolingual and multilingual transcription support.

by deepgram | $4,300/M input tokens | $0/M output tokens

Microsoft: MAI-Transcribe 1.5

MAI-Transcribe 1.5 is a multilingual speech-to-text model from Microsoft AI. It is suited for captions, call transcription, subtitling, accessibility, and other voice-enabled applications, with reliable transcription across 43 languages, diverse accents, and noisy real-world audio. It supports automatic language identification and keyword biasing for domain-specific terminology, and improves long-form transcription speed over MAI-Transcribe-1. Speaker diarization is not supported.

by microsoft | $360,000/M input tokens | $0/M output tokens

NVIDIA: Parakeet TDT 0.6B v3

Parakeet TDT 0.6B v3 is NVIDIA's 600M-parameter multilingual speech-to-text model built on the FastConformer-TDT architecture. Trained on the Granary dataset (670,000+ hours of audio), it supports automatic language detection across all official EU languages and achieves a 6.34% average word error rate on the HuggingFace Open ASR Leaderboard. It returns transcribed text with punctuation and segment timestamps.

by nvidia | $1,500/M input tokens | $0/M output tokens
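Several entries in this list quote a word error rate (WER). As a rough illustration of what that number measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the only normalization applied is lowercasing and whitespace splitting. Published benchmarks typically apply their own text normalization, so this will not reproduce leaderboard figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion of a reference word
                curr[j - 1] + 1,             # insertion of a hypothesis word
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")  # prints 33.3%
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.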

Qwen: Qwen3 ASR Flash

Qwen3-ASR-Flash is Alibaba's automatic speech recognition service, built on the Qwen3-Omni foundation and trained on tens of millions of hours of multimodal speech data. The model handles 11 languages: Chinese (with Cantonese, Sichuanese, Minnan, and Wu dialects), English, Arabic, French, German, Spanish, Italian, Portuguese, Russian, Japanese, and Korean. Automatic language detection means no manual configuration is needed for mixed-language audio.

The model is designed for difficult acoustic conditions: it transcribes lyrics over background music, handles noisy and far-field recordings, filters silence and non-speech audio, and accepts arbitrary context text (names, jargon, domain terminology) to bias recognition toward specific vocabulary.

by qwen | $35/M input tokens | $0/M output tokens

Google: Chirp 3

Chirp 3 is Google's latest multilingual speech-to-text model. It offers enhanced transcription accuracy across 24 generally available (GA) languages and 77+ preview languages, with support for automatic language detection, automatic punctuation, and a built-in denoiser for cleaner audio processing.

by google | $16,000/M input tokens | $0/M output tokens

OpenAI: Whisper Large V3 Turbo

Whisper Large V3 Turbo is an optimized version of OpenAI's Whisper Large V3 speech recognition model, designed for speed and cost efficiency. It supports transcription across 99+ languages with a 12% word error rate, and accepts common audio formats including mp3, mp4, wav, webm, flac, and ogg. It achieves real-time speed factors of up to 216x, making it well-suited for latency-sensitive and high-throughput transcription workloads.

by openai | $40,000/M input tokens | $0/M output tokens
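A real-time speed factor of 216x means that, in the best case, 216 seconds of audio are processed per second of compute. The sketch below turns that into a processing-time estimate. The function name `best_case_processing_seconds` is hypothetical, and because the quoted figure is an upper bound ("up to"), the result is a lower bound on processing time that ignores upload, queueing, and network latency.

```python
def best_case_processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimate processing time as audio duration divided by the speed factor."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# A one-hour recording at the quoted peak factor of 216x.
seconds = best_case_processing_seconds(60 * 60, 216)
print(f"{seconds:.1f} s")  # prints 16.7 s
```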

OpenAI: Whisper Large V3

Whisper Large V3 is OpenAI's open-source automatic speech recognition model offering both audio transcription and translation. It supports 99+ languages and accepts common audio formats including mp3, mp4, wav, webm, flac, and ogg. With 1,550M parameters, it achieves a 10.3% word error rate and is well-suited for noise-robust, multilingual transcription in demanding conditions. It supports timestamps at word and segment granularity.

by openai | $1,500/M input tokens | $0/M output tokens