Model rankings updated April 2026 based on real usage data.
Speech-to-text models convert spoken audio into text for transcription, captions, voice notes, meeting summaries, call analysis, and speech-driven applications. This collection helps you compare the best transcription models on OpenRouter by accuracy, speed, language support, and cost for your audio workflows.
GPT-4o Transcribe is OpenAI's high-quality speech-to-text model built on GPT-4o audio capabilities. It's priced per token (input and output), making it suitable for workflows that benefit from token-level billing transparency.
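As a rough illustration of per-token billing, the sketch below estimates the cost of one request from its input and output token counts. The function name and the per-million-token prices are hypothetical placeholders, not published rates, and how audio maps to input tokens is not covered here.

```python
def token_cost(input_tokens: int, output_tokens: int,
               input_price_per_million: float,
               output_price_per_million: float) -> float:
    """Estimate the cost of one request under per-token pricing.

    Prices are expressed per one million tokens; the caller supplies them,
    since the real rates are not stated in this description.
    """
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices, for illustration only.
example = token_cost(input_tokens=500_000, output_tokens=100_000,
                     input_price_per_million=2.0,
                     output_price_per_million=10.0)
```

Input and output tokens are kept as separate parameters because they are billed separately, so the two rates can differ.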
Whisper is OpenAI's open-source automatic speech recognition model, available via API as whisper-1. It supports transcription and translation across 50+ languages for audio files up to 25 MB, and accepts formats including mp3, mp4, wav, and webm. It is priced per minute of audio, billed to the nearest second.
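The limits and pricing model described for whisper-1 can be checked before uploading with a small pre-flight sketch like the one below. The helper names are hypothetical, "25 MB" is read as 25 MiB (the exact byte limit is an assumption), the extension set contains only the formats named above and is not exhaustive, and the per-minute price is a placeholder supplied by the caller.

```python
from pathlib import Path

# Assumptions for illustration, based on the description above.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # "25 MB" interpreted as 25 MiB
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".webm"}  # not exhaustive


def check_audio_file(path: str, size_bytes: int) -> list[str]:
    """Return the problems likely to block an upload (empty list if none)."""
    problems = []
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, over the {MAX_UPLOAD_BYTES}-byte limit"
        )
    return problems


def estimate_cost(duration_seconds: float, price_per_minute: float) -> float:
    """Estimate cost under per-minute pricing billed to the nearest second."""
    # Round half up to whole seconds; the provider's exact rounding rule
    # for half-second values is an assumption.
    billed_seconds = int(duration_seconds + 0.5)
    return billed_seconds / 60 * price_per_minute
```

For example, with a hypothetical price of 0.01 per minute, a 90.4-second clip is billed as 90 seconds, or 1.5 minutes of audio.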