Best Audio Generation Models

Model rankings updated July 2026 based on real usage data.

Audio generation models create audio output from text or other prompts, powering use cases like music generation, sound design, voice-enabled assistants, and multimodal applications that respond with audio. This collection highlights some of the best audio generation models available on OpenRouter, making it easier to compare quality, pricing, and latency across providers through a single API.

Top Audio Generation Models on OpenRouter

OpenAI: GPT Audio Mini

211M tokens

A cost-efficient version of GPT Audio. The new snapshot features an upgraded decoder for more natural sounding voices and maintains better voice consistency. Input is priced at $0.60 per million tokens and output is priced at $2.40 per million tokens.

by openai | 128K context | $0.60/M input tokens | $2.40/M output tokens | $0.60/M audio tokens

OpenAI: GPT Audio

50M tokens

The gpt-audio model is OpenAI's first generally available audio model. The new snapshot features an upgraded decoder for more natural sounding voices and maintains better voice consistency. Audio is priced at $32 per million input tokens and $64 per million output tokens.

by openai | 128K context | $2.50/M input tokens | $10/M output tokens | $32/M audio tokens
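To make the per-million-token prices above easier to compare, here is a minimal sketch that estimates the text-token cost of a request for the two OpenAI models. The dictionary keys and the function name `estimate_text_cost` are illustrative assumptions, not official model identifiers or part of any API. Audio tokens are billed at their own rates and are left out of this estimate.

```python
from decimal import Decimal

# USD per million text tokens, copied from the listings above.
# The keys are illustrative labels, not confirmed API model identifiers.
TEXT_PRICES = {
    "gpt-audio-mini": {"input": Decimal("0.60"), "output": Decimal("2.40")},
    "gpt-audio": {"input": Decimal("2.50"), "output": Decimal("10")},
}

ONE_MILLION = Decimal(1_000_000)


def estimate_text_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of text tokens only (audio tokens are excluded)."""
    price = TEXT_PRICES[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / ONE_MILLION


if __name__ == "__main__":
    for name in TEXT_PRICES:
        cost = estimate_text_cost(name, input_tokens=1_000_000, output_tokens=500_000)
        print(f"{name}: ${cost}")
```

Using `Decimal` rather than floats keeps small currency amounts exact when they are summed over many requests.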

Google: Lyria 3 Pro Preview

26.9M tokens

Full-length songs are priced at $0.08 per song. Lyria 3 is Google's family of music generation models, available through the Gemini API. With Lyria 3, you can generate high-quality, 48kHz stereo audio from text prompts or from images. These models deliver structural coherence, including vocals, timed lyrics, and full instrumental arrangements. Lyria 3 Pro can generate full-length songs with verses, choruses, and bridges.

by google | 1.05M context | $0/M input tokens | $0/M output tokens

Google: Lyria 3 Clip Preview

8.13M tokens

30-second clips are priced at $0.04 per clip. Lyria 3 is Google's family of music generation models, available through the Gemini API. With Lyria 3, you can generate high-quality, 48kHz stereo audio from text prompts or from images. These models deliver structural coherence, including vocals, timed lyrics, and full instrumental arrangements. Lyria 3 Clip can generate short clips, loops, and previews.

by google | 1.05M context | $0/M input tokens | $0/M output tokens
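The Lyria 3 models are listed at $0 per token and are instead billed per generated item, so a budget estimate is a simple count-based calculation. The sketch below assumes the flat prices quoted in the descriptions above ($0.08 per full-length song, $0.04 per 30-second clip). The function name `lyria_batch_cost` is hypothetical.

```python
from decimal import Decimal

# Flat per-item prices in USD, taken from the model descriptions above.
PRICE_PER_SONG = Decimal("0.08")  # Lyria 3 Pro, full-length song
PRICE_PER_CLIP = Decimal("0.04")  # Lyria 3 Clip, 30-second clip


def lyria_batch_cost(songs: int = 0, clips: int = 0) -> Decimal:
    """Estimate the USD cost of generating a batch of songs and clips."""
    if songs < 0 or clips < 0:
        raise ValueError("counts must be non-negative")
    return PRICE_PER_SONG * songs + PRICE_PER_CLIP * clips


if __name__ == "__main__":
    print(f"5 songs + 10 clips: ${lyria_batch_cost(songs=5, clips=10)}")
```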