# Multimodal Capabilities

OpenRouter supports multiple input and output modalities beyond text, allowing you to send images, PDFs, audio, and video files to compatible models, or generate speech from text through our unified API. This enables rich multimodal interactions for a wide variety of use cases.

## Supported Modalities

### Images

Send images to vision-capable models for analysis, description, OCR, and more. OpenRouter supports multiple image formats and both URL-based and base64-encoded images.

[Learn more about image inputs →](/docs/features/multimodal/images)

### Image Generation

Generate images from text prompts using AI models with image output capabilities. OpenRouter supports various image generation models that can create high-quality images based on your descriptions.

[Learn more about image generation →](/docs/features/multimodal/image-generation)

### PDFs

Process PDF documents with any model on OpenRouter. Our intelligent PDF parsing system extracts text and handles both text-based and scanned documents.

[Learn more about PDF processing →](/docs/features/multimodal/pdfs)

### Audio

Send audio files to speech-capable models for transcription, analysis, and processing, or receive audio responses from models with audio output capabilities. OpenRouter supports common audio formats for both input and output.

[Learn more about audio →](/docs/features/multimodal/audio)

### Video

Send video files to video-capable models for analysis, description, object detection, and action recognition. OpenRouter supports multiple video formats for comprehensive video understanding tasks.

[Learn more about video inputs →](/docs/features/multimodal/videos)

### Video Generation

Generate videos from text prompts using AI models with video output capabilities. OpenRouter supports an asynchronous video generation API with configurable resolution, aspect ratio, duration, and optional reference images.

[Learn more about video generation →](/docs/features/multimodal/video-generation)

### Text-to-Speech

Generate speech audio from text using a dedicated OpenAI-compatible endpoint. OpenRouter supports multiple TTS providers and voices with output in MP3 or PCM format.

[Learn more about text-to-speech →](/docs/features/multimodal/tts)

### Speech-to-Text

Transcribe audio into text using a dedicated endpoint. OpenRouter supports multiple STT providers and models, returning structured JSON with transcribed text and usage statistics.

[Learn more about speech-to-text →](/docs/features/multimodal/stt)

## Getting Started

Most multimodal inputs use the same `/api/v1/chat/completions` endpoint with the `messages` parameter. Different content types are specified in the message content array:

* **Images**: Use the `image_url` content type
* **PDFs**: Use the `file` content type with PDF data
* **Audio**: Use the `input_audio` content type
* **Video**: Use the `video_url` content type

You can combine multiple modalities in a single request, and the number of files you can send varies by provider and model.

**Text-to-Speech** uses a separate dedicated endpoint at `/api/v1/audio/speech`. See the [TTS documentation](/docs/features/multimodal/tts) for details.

**Speech-to-Text** uses a separate dedicated endpoint at `/api/v1/audio/transcriptions`. See the [STT documentation](/docs/features/multimodal/stt) for details.
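As a rough sketch of the request shape (the model slug and image URL below are placeholders, not recommendations), a chat completion that mixes text with an image part looks like this; PDF, audio, and video parts slot into the same `content` array using their respective content types:

```typescript
// Sketch only: one request mixing a text part and an image part.
// The model slug and image URL are placeholders.
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'google/gemini-2.5-flash', // placeholder: any vision-capable model
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image in one sentence.' },
          { type: 'image_url', image_url: { url: 'https://example.com/photo.jpg' } },
        ],
      },
    ],
  }),
});

const completion = await response.json();
console.log(completion.choices[0].message.content);
```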
## Model Compatibility

Not all models support every modality. OpenRouter automatically filters available models based on your request content:

* **Vision models**: Required for image processing
* **File-compatible models**: Can process PDFs natively or through our parsing system
* **Audio-capable models**: Required for audio input processing
* **Video-capable models**: Required for video input processing

Use our [Models page](https://openrouter.ai/models) to find models that support your desired input modalities.

## Input Format Support

OpenRouter supports both **direct URLs** and **base64-encoded data** for multimodal inputs:

### URLs (Recommended for public content)

* **Images**: `https://example.com/image.jpg`
* **PDFs**: `https://example.com/document.pdf`
* **Audio**: Not supported via URL (base64 only)
* **Video**: Provider-specific (e.g., YouTube links for Gemini on AI Studio)

### Base64 Encoding (Required for local files)

* **Images**: `data:image/jpeg;base64,{base64_data}`
* **PDFs**: `data:application/pdf;base64,{base64_data}`
* **Audio**: Raw base64 string with format specification
* **Video**: `data:video/mp4;base64,{base64_data}`

URLs are more efficient for large files as they don't require local encoding and reduce request payload size. Base64 encoding is required for local files or when the content is not publicly accessible.

**Note for video URLs**: Video URL support varies by provider. For example, Google Gemini on AI Studio only supports YouTube links. See the [video inputs documentation](/docs/features/multimodal/videos) for provider-specific details.

## Frequently Asked Questions

**Can I combine multiple modalities in one request?**

Yes! You can send text, images, PDFs, audio, and video in the same request. The model will process all inputs together.

**How are the different modalities priced?**

* **Images**: Typically priced per image or as input tokens
* **PDFs**: Free text extraction, paid OCR processing, or native model pricing
* **Audio input**: Priced as input tokens based on duration
* **Audio output**: Priced as completion tokens
* **Video**: Priced as input tokens based on duration and resolution

**Which models support video?**

Video support varies by model. Use the [Models page](/models?fmt=cards\&input_modalities=video) to filter for video-capable models. Check each model's documentation for specific video format and duration limits.

**How does video generation work?**

Video generation uses an asynchronous API at `/api/v1/videos`. You submit a prompt, receive a job ID, then poll until the video is ready to download. See the [video generation documentation](/docs/features/multimodal/video-generation) for details.

**How does text-to-speech work?**

Text-to-speech uses a dedicated endpoint at `/api/v1/audio/speech`. Send text and receive a raw audio byte stream. The endpoint is compatible with the OpenAI Audio Speech API, so you can use OpenAI client libraries. See the [TTS documentation](/docs/features/multimodal/tts) for details.

**How does speech-to-text work?**

Speech-to-text uses a dedicated endpoint at `/api/v1/audio/transcriptions`. Send base64-encoded audio and receive a JSON response with the transcribed text and usage statistics. See the [STT documentation](/docs/features/multimodal/stt) for details.

# Image Inputs

Requests with images to multimodal models are available via the `/api/v1/chat/completions` API with a multi-part `messages` parameter. The `image_url` can either be a URL or a base64-encoded image. Note that multiple images can be sent in separate content array entries. The number of images you can send in a single request varies per provider and per model.
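As a minimal sketch (the model slug and file paths are placeholders), the request below sends one image by URL and one local image as a base64 data URL:

```typescript
import { readFile } from 'node:fs/promises';

// Sketch only: one image by URL, one from disk as a base64 data URL.
// The model slug and file paths are placeholders.
const bytes = await readFile('./photo.jpg');
const dataUrl = `data:image/jpeg;base64,${bytes.toString('base64')}`;

const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'google/gemini-2.5-flash', // placeholder: any vision-capable model
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'What do these two images have in common?' },
          { type: 'image_url', image_url: { url: 'https://example.com/remote.jpg' } },
          { type: 'image_url', image_url: { url: dataUrl } },
        ],
      },
    ],
  }),
});

console.log((await response.json()).choices[0].message.content);
```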
Due to how the content is parsed, we recommend sending the text prompt first, then the images. If the images must come first, we recommend putting them in the system prompt.

OpenRouter supports both **direct URLs** and **base64-encoded data** for images:

* **URLs**: More efficient for publicly accessible images as they don't require local encoding
* **Base64**: Required for local files or private images that aren't publicly accessible

### Using Image URLs

To send a publicly accessible image, pass its URL directly in the `url` field of an `image_url` content part, as in the sketch above.

### Using Base64 Encoded Images

For locally stored images, encode the file as a base64 data URL (for example `data:image/jpeg;base64,...`) and send it in the same `image_url` field, as shown for the second image in the sketch above.

Supported image content types are:

* `image/png`
* `image/jpeg`
* `image/webp`
* `image/gif`

# Image Generation

OpenRouter supports image generation via the [Chat Completions](/docs/api/api-reference/chat/send-chat-completion-request) and [Responses](/docs/api/reference/responses/overview) endpoints. You can find the supported models, their capabilities, and pricing by filtering our [model list by image output](https://openrouter.ai/models?output_modalities=image).

## Model Discovery

You can find image generation models in several ways:

### Via the API

Use the `output_modalities` query parameter on the [Models API](/docs/api-reference/models/get-models) to programmatically discover image generation models:

```bash
# List only image generation models
curl "https://openrouter.ai/api/v1/models?output_modalities=image"

# List models that support both text and image output
curl "https://openrouter.ai/api/v1/models?output_modalities=text,image"
```

See [Models - Query Parameters](/docs/guides/overview/models#query-parameters) for the full list of supported modality values.

### On the Models Page

Visit the [Models page](/models) and filter by output modalities to find models capable of image generation. Look for models that list `"image"` in their output modalities.

### In the Chatroom

When using the [Chatroom](/chat), click the **Image** button to automatically filter and select models with image generation capabilities. If no image-capable model is active, you'll be prompted to add one.

## API Usage

To generate images, send a request to the `/api/v1/chat/completions` endpoint with the `modalities` parameter. The value depends on the model's capabilities:

* **Models that output both text and images** (e.g., Gemini): Use `modalities: ["image", "text"]`
* **Models that only output images** (e.g., Sourceful, Flux): Use `modalities: ["image"]`

### Basic Image Generation

A basic request needs only a text prompt and the `modalities` value for your model; the sketch after the aspect ratio list below shows a complete request body.

### Image Configuration Options

Some image generation models support additional configuration through the `image_config` parameter.

#### Aspect Ratio

Set `image_config.aspect_ratio` to request specific aspect ratios for generated images.

**Supported aspect ratios:**

* `1:1` → 1024×1024 (default)
* `2:3` → 832×1248
* `3:2` → 1248×832
* `3:4` → 864×1184
* `4:3` → 1184×864
* `4:5` → 896×1152
* `5:4` → 1152×896
* `9:16` → 768×1344
* `16:9` → 1344×768
* `21:9` → 1536×672

**Extended aspect ratios** (supported by [`google/gemini-3.1-flash-image-preview`](/models/google/gemini-3.1-flash-image-preview) only):

* `1:4` → Tall, narrow format ideal for scrolling carousels and vertical UI elements
* `4:1` → Wide, short format for hero banners and horizontal layouts
* `1:8` → Extra-tall format for notification headers and narrow vertical spaces
* `8:1` → Extra-wide format for wide-format banners and panoramic layouts
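For example, a complete request for a widescreen image might look like the sketch below (the model slug is one of the compatible models listed later on this page; any image-capable model that accepts `image_config` follows the same shape):

```typescript
// Sketch only: generate a 16:9 image. The model slug is illustrative.
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'google/gemini-2.5-flash-image',
    modalities: ['image', 'text'], // use ["image"] for image-only models
    messages: [
      { role: 'user', content: 'A lighthouse on a rocky coast at sunset, soft painterly style' },
    ],
    image_config: { aspect_ratio: '16:9' },
  }),
});

const completion = await response.json();
// Generated images are returned on the assistant message as base64 data URLs.
const images = completion.choices?.[0]?.message?.images ?? [];
```

The other `image_config` options described below combine with `aspect_ratio` in the same object.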
#### Image Size

Set `image_config.image_size` to control the resolution of generated images.

**Supported sizes:**

* `1K` → Standard resolution (default)
* `2K` → Higher resolution
* `4K` → Highest resolution
* `0.5K` → Lower resolution, optimized for efficiency (supported by [`google/gemini-3.1-flash-image-preview`](/models/google/gemini-3.1-flash-image-preview) only)

You can combine both `aspect_ratio` and `image_size` in the same request.

#### Strength (Recraft only)

Set `image_config.strength` to control how much the output image differs from the input image during image-to-image generation. This parameter only applies when input images are provided in `messages`. It is only supported by Recraft models.

* **Range**: `0.0` to `1.0`
* **Default**: `0.2`
* Lower values produce outputs closer to the input image; higher values allow more creative deviation.

**Example:**

```json
{
  "image_config": {
    "strength": 0.7
  }
}
```

#### Text Layout (Recraft V3 only)

Use `image_config.text_layout` to place text at specific positions on the generated image. Each entry specifies the text to render and a bounding box defined by four corner points in normalized coordinates (0 to 1).

This parameter is only supported by Recraft V3 (`recraft/recraft-v3`) for both text-to-image and image-to-image requests. Recraft V4 and V4 Pro do not support `text_layout`.

Each text layout entry is an object with:

* `text` (required): The text string to render
* `bbox` (required): Array of 4 `[x, y]` coordinate pairs defining the bounding box corners (top-left, top-right, bottom-right, bottom-left), with values from 0 to 1

**Example:**

```json
{
  "image_config": {
    "text_layout": [
      {
        "text": "Hello",
        "bbox": [[0.3, 0.45], [0.6, 0.45], [0.6, 0.55], [0.3, 0.55]]
      },
      {
        "text": "World",
        "bbox": [[0.35, 0.6], [0.65, 0.6], [0.65, 0.7], [0.35, 0.7]]
      }
    ]
  }
}
```

#### Style (Recraft V3 only)

Use `image_config.style` to apply a specific artistic style to the generated image.

This parameter is only supported by Recraft V3 (`recraft/recraft-v3`). Recraft V4 and V4 Pro do not support styles.

See the [full list of available styles](https://www.recraft.ai/docs/api-reference/styles#list-of-styles) in Recraft's documentation. Note that vector styles are not supported.

**Example:**

```json
{
  "image_config": {
    "style": "Photorealism"
  }
}
```

#### RGB Colors (Recraft only)

Use `image_config.rgb_colors` to specify a color palette that influences the generated image. Each color is a `[r, g, b]` array of three integers (0 to 255).

This parameter is supported by Recraft models for both text-to-image and image-to-image requests.

**Example:**

```json
{
  "image_config": {
    "rgb_colors": [
      [255, 0, 0],
      [0, 128, 0]
    ]
  }
}
```

#### Background RGB Color (Recraft only)

Use `image_config.background_rgb_color` to set a specific background color for the generated image. The value is a `[r, g, b]` array of three integers (0 to 255).

This parameter is supported by Recraft models for both text-to-image and image-to-image requests.

**Example:**

```json
{
  "image_config": {
    "background_rgb_color": [0, 0, 255]
  }
}
```

You can combine `rgb_colors` and `background_rgb_color` in the same request:

```json
{
  "image_config": {
    "rgb_colors": [[255, 0, 0]],
    "background_rgb_color": [255, 255, 255]
  }
}
```

#### Font Inputs (Sourceful only)

Use `image_config.font_inputs` to render custom text with specific fonts in generated images. The text you want to render must also be included in your prompt for best results.

This parameter is only supported by Sourceful models (`sourceful/riverflow-v2-fast` and `sourceful/riverflow-v2-pro`).
Each font input is an object with: * `font_url` (required): URL to the font file * `text` (required): Text to render with the font **Limits:** * Maximum 2 font inputs per request * Additional cost: \$0.03 per font input **Example:** ```json { "image_config": { "font_inputs": [ { "font_url": "https://example.com/fonts/custom-font.ttf", "text": "Hello World" } ] } } ``` **Tips for best results:** * Include the text in your prompt along with details about font name, color, size, and position * The `text` parameter should match exactly what's in your prompt - avoid extra wording or quotation marks * Use line breaks or double spaces to separate headlines and sub-headers when using the same font * Works best with short, clear headlines and sub-headlines #### Super Resolution References (Sourceful only) Use `image_config.super_resolution_references` to enhance low-quality elements in your input image using high-quality reference images. The output image will match the size of your input image, so use larger input images for better results. This parameter is only supported by Sourceful models (`sourceful/riverflow-v2-fast` and `sourceful/riverflow-v2-pro`) when using image-to-image generation (i.e., when input images are provided in `messages`). **Limits:** * Maximum 4 reference URLs per request * Only works with image-to-image requests (ignored when there are no images in `messages`) * Additional cost: \$0.20 per reference **Example:** ```json { "image_config": { "super_resolution_references": [ "https://example.com/reference1.jpg", "https://example.com/reference2.jpg" ] } } ``` **Tips for best results:** * Supply an input image where the elements to enhance are present but low quality * Use larger input images for better output quality (output matches input size) * Use high-quality reference images that show what you want the enhanced elements to look like ### Streaming Image Generation Image generation also works with streaming responses: ## Response Format When generating images, the assistant message includes an `images` field containing the generated images: ```json { "choices": [ { "message": { "role": "assistant", "content": "I've generated a beautiful sunset image for you.", "images": [ { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..." } } ] } } ] } ``` ### Image Format * **Format**: Images are returned as base64-encoded data URLs * **Types**: Typically PNG format (`data:image/png;base64,`) * **Multiple Images**: Some models can generate multiple images in a single response * **Size**: Image dimensions vary by model capabilities ## Model Compatibility Not all models support image generation. To use this feature: 1. **Check Output Modalities**: Ensure the model has `"image"` in its `output_modalities` 2. **Set Modalities Parameter**: Use `["image", "text"]` for models that output both, or `["image"]` for image-only models 3. 
**Use Compatible Models**: Examples include:

   * `google/gemini-3.1-flash-image-preview` (supports extended aspect ratios and 0.5K resolution)
   * `google/gemini-2.5-flash-image`
   * `black-forest-labs/flux.2-pro`
   * `black-forest-labs/flux.2-flex`
   * `sourceful/riverflow-v2-standard-preview`
   * Other models with image generation capabilities

## Best Practices

* **Clear Prompts**: Provide detailed descriptions for better image quality
* **Model Selection**: Choose models specifically designed for image generation
* **Error Handling**: Check for the `images` field in responses before processing
* **Rate Limits**: Image generation may have different rate limits than text generation
* **Storage**: Consider how you'll handle and store the base64 image data

## Troubleshooting

**No images in response?**

* Verify the model supports image generation (`output_modalities` includes `"image"`)
* Ensure you've set the `modalities` parameter correctly: `["image", "text"]` for models that output both, or `["image"]` for image-only models
* Check that your prompt is requesting image generation

**Model not found?**

* Use the [Models page](/models) to find available image generation models
* Filter by output modalities to see compatible models

# PDF Inputs

OpenRouter supports PDF processing through the `/api/v1/chat/completions` API. PDFs can be sent as **direct URLs** or **base64-encoded data URLs** in the messages array, via the file content type. This feature works on **any** model on OpenRouter.

**URL support**: Send publicly accessible PDFs directly without downloading or encoding

**Base64 support**: Required for local files or private documents that aren't publicly accessible

PDFs also work in the chatroom for interactive testing.

When a model supports file input natively, the PDF is passed directly to the model. When the model does not support file input natively, OpenRouter will parse the file and pass the parsed results to the requested model. You can send both PDFs and other file types in the same request.

## Plugin Configuration

To configure PDF processing, use the `plugins` parameter in your request. OpenRouter provides several PDF processing engines with different capabilities and pricing:

```typescript
{
  plugins: [
    {
      id: 'file-parser',
      pdf: {
        engine: 'cloudflare-ai', // or 'mistral-ocr' or 'native'
      },
    },
  ],
}
```

## Pricing

OpenRouter provides several PDF processing engines:

1. `mistral-ocr`: Best for scanned documents or PDFs with images (priced per 1,000 pages).
2. `cloudflare-ai`: Converts PDFs to markdown using Cloudflare Workers AI (free).
3. `native`: Only available for models that support file input natively (charged as input tokens).

The `"pdf-text"` engine is deprecated and automatically redirected to `"cloudflare-ai"`. Existing requests using `"pdf-text"` will continue to work.

If you don't explicitly specify an engine, OpenRouter will default first to the model's native file processing capabilities, and if that's not available, we fall back to one of the parsing engines above.

## OCR Image Limits

When the `mistral-ocr` engine extracts images from a PDF, OpenRouter requests at most **8 images per PDF** from Mistral via the OCR API's `image_limit` parameter, and forwards no more than 8 images per request to the downstream model. Surplus images are dropped while all extracted text is preserved in full.
This cap exists because per-prompt image limits vary significantly across providers — some reject requests with more than 8 images outright, and even providers with higher caps often fail with context-length errors when a long PDF emits one image per page. Capping at 8 keeps requests within the limits of every supported provider. If your downstream model does not accept image input at all, OCR-extracted images are stripped entirely and only the parsed text is forwarded. ## Using PDF URLs For publicly accessible PDFs, you can send the URL directly without needing to download and encode the file: PDF URLs work with all processing engines. For Mistral OCR, the URL is passed directly to the service. For other engines, OpenRouter fetches the PDF and processes it internally. ## Using Base64 Encoded PDFs For local PDF files or when you need to send PDF content directly, you can base64 encode the file: ## Skip Parsing Costs When you send a PDF to the API, the response may include file annotations in the assistant's message. These annotations contain structured information about the PDF document that was parsed. By sending these annotations back in subsequent requests, you can avoid re-parsing the same PDF document multiple times, which saves both processing time and costs. Here's how to reuse file annotations: When you include the file annotations from a previous response in your subsequent requests, OpenRouter will use this pre-parsed information instead of re-parsing the PDF, which saves processing time and costs. This is especially beneficial for large documents or when using the `mistral-ocr` engine which incurs additional costs. ## File Annotations Schema When OpenRouter parses a PDF, the response includes file annotations in the assistant message. Here is the TypeScript type for the annotation schema: ```typescript type FileAnnotation = { type: 'file'; file: { hash: string; // Unique hash identifying the parsed file name?: string; // Original filename (optional) content: ContentPart[]; // Parsed content from the file }; }; type ContentPart = | { type: 'text'; text: string } | { type: 'image_url'; image_url: { url: string } }; ``` The `content` array contains the parsed content from the PDF, which may include text blocks and images (as base64 data URLs). The `hash` field uniquely identifies the parsed file content and is used to skip re-parsing when you include the annotation in subsequent requests. ## Response Format The API will return a response in the following format: ```json { "id": "gen-1234567890", "provider": "DeepInfra", "model": "google/gemma-3-27b-it", "object": "chat.completion", "created": 1234567890, "choices": [ { "message": { "role": "assistant", "content": "The document discusses...", "annotations": [ { "type": "file", "file": { "hash": "abc123...", "name": "document.pdf", "content": [ { "type": "text", "text": "Parsed text content..." }, { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } } ] } } ] } } ], "usage": { "prompt_tokens": 1000, "completion_tokens": 100, "total_tokens": 1100 } } ``` ## Error Responses with Parsed Annotations If OpenRouter successfully parses your PDF but every inference provider then fails to generate a completion, the error response still includes the parsed annotations under `error.metadata.file_annotations`. The shape matches the success-path `FileAnnotation` documented above, so you can hand the same array straight back to OpenRouter on a retry to skip re-parsing. 
This applies to the `mistral-ocr` and `cloudflare-ai` engines, which parse the PDF before sending it to a model. The `native` engine doesn't produce annotations because the file is forwarded directly to the model.

```json
{
  "error": {
    "code": 502,
    "message": "Provider returned an error",
    "metadata": {
      "file_annotations": [
        {
          "type": "file",
          "file": {
            "hash": "abc123...",
            "name": "document.pdf",
            "content": [
              {
                "type": "text",
                "text": "Parsed text content..."
              }
            ]
          }
        }
      ]
    }
  }
}
```

When you read annotations from both the success and error paths, dedupe by `file.hash` — the hash is stable across both shapes for the same parsed file:

```typescript
function isFileAnnotation(value: unknown): value is FileAnnotation {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as { type?: unknown; file?: { hash?: unknown } };
  return (
    candidate.type === 'file' &&
    typeof candidate.file?.hash === 'string'
  );
}

function extractFileAnnotations(response: unknown): FileAnnotation[] {
  if (typeof response !== 'object' || response === null) return [];
  const root = response as {
    choices?: Array<{ message?: { annotations?: unknown[] } }>;
    error?: { metadata?: { file_annotations?: unknown[] } };
  };
  const fromMessage = root.choices?.[0]?.message?.annotations ?? [];
  const fromError = root.error?.metadata?.file_annotations ?? [];
  const seen = new Set<string>();
  const out: FileAnnotation[] = [];
  for (const a of [...fromMessage, ...fromError]) {
    if (isFileAnnotation(a) && !seen.has(a.file.hash)) {
      seen.add(a.file.hash);
      out.push(a);
    }
  }
  return out;
}
```

# Audio

OpenRouter supports both sending audio files to compatible models and receiving audio responses via the API. This guide covers how to work with audio inputs and outputs.

## Audio Inputs

Send audio files to compatible models for transcription, analysis, and processing. Audio input requests use the `/api/v1/chat/completions` API with the `input_audio` content type. Audio files must be base64-encoded and include the format specification.

**Note**: Audio files must be **base64-encoded**; direct URLs are not supported for audio content.

You can search for models that support audio input by filtering to audio input modality on our [Models page](/models?fmt=cards\&input_modalities=audio).

### Sending Audio Files

To send an audio file, base64-encode the raw bytes and pass them in an `input_audio` content part along with the audio `format`.

### Supported Audio Input Formats

Supported audio formats vary by provider. Common formats include:

* `wav` - WAV audio
* `mp3` - MP3 audio
* `aiff` - AIFF audio
* `aac` - AAC audio
* `ogg` - OGG Vorbis audio
* `flac` - FLAC audio
* `m4a` - M4A audio
* `pcm16` - PCM16 audio
* `pcm24` - PCM24 audio

**Note:** Check your model's documentation to confirm which audio formats it supports. Not all models support all formats.

## Audio Output

OpenRouter supports receiving audio responses from models that have audio output capabilities. To request audio output, include the `modalities` and `audio` parameters in your request.

You can search for models that support audio output by filtering to audio output modality on our [Models page](/models?fmt=cards\&output_modalities=audio).

### Requesting Audio Output

To receive audio output, set `modalities` to `["text", "audio"]` and provide the `audio` configuration with your desired voice and format; a request sketch appears under Streaming Chunk Format below.

### Streaming Chunk Format

Audio output requires streaming (`stream: true`).
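As a rough sketch (the model slug and voice are illustrative; check the model's page for the voices and formats it actually supports), a streamed audio request looks like this:

```typescript
// Sketch only: requesting streamed audio output. Model slug and voice are illustrative.
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'openai/gpt-4o-audio-preview', // illustrative audio-capable model
    modalities: ['text', 'audio'],
    audio: { voice: 'alloy', format: 'pcm16' },
    stream: true,
    messages: [{ role: 'user', content: 'Say a short greeting.' }],
  }),
});
// The response body is a server-sent event stream; each chunk may carry
// base64 audio data and a transcript under choices[0].delta.audio.
```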
Audio data and transcript are delivered incrementally via the `delta.audio` field in each chunk: ```json { "choices": [ { "delta": { "audio": { "data": "", "transcript": "Hello" } } } ] } ``` ### Audio Configuration Options The `audio` parameter accepts the following options: | Option | Description | | -------- | ---------------------------------------------------------------------------------------------------------------------------------- | | `voice` | The voice to use for audio generation (e.g., `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`). Available voices vary by model. | | `format` | The audio format for the output (e.g., `wav`, `mp3`, `flac`, `opus`, `pcm16`). Available formats vary by model. | # Video Inputs OpenRouter supports sending video files to compatible models via the API. This guide will show you how to work with video using our API. OpenRouter supports both **direct URLs** and **base64-encoded data URLs** for videos: * **URLs**: Efficient for publicly accessible videos as they don't require local encoding * **Base64 Data URLs**: Required for local files or private videos that aren't publicly accessible **Important:** Video URL support varies by provider. OpenRouter only sends video URLs to providers that explicitly support them. For example, Google Gemini on AI Studio only supports YouTube links (not Vertex AI). **API Only:** Video inputs are currently only supported via the API. Video uploads are not available in the OpenRouter chatroom interface at this time. ## Video Inputs Requests with video files to compatible models are available via the `/api/v1/chat/completions` API with the `video_url` content type. The `url` can either be a URL or a base64-encoded data URL. Note that only models with video processing capabilities will handle these requests. You can search for models that support video by filtering to video input modality on our [Models page](/models?fmt=cards\&input_modalities=video). ### Using Video URLs Here's how to send a video using a URL. 
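The sketch below uses a placeholder model slug and video URL:

```typescript
// Sketch only: sending a video by URL. The model slug and URL are placeholders,
// and URL support depends on the provider (see the note below).
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'google/gemini-2.5-flash', // placeholder: any video-capable model
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Summarize what happens in this video.' },
          { type: 'video_url', video_url: { url: 'https://www.youtube.com/watch?v=your-video-id' } },
        ],
      },
    ],
  }),
});

console.log((await response.json()).choices[0].message.content);
```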
Note that for Google Gemini on AI Studio, only YouTube links are supported.

### Using Base64 Encoded Videos

For locally stored videos, you can send them using base64 encoding as data URLs.

## Supported Video Formats

OpenRouter supports the following video formats:

* `video/mp4`
* `video/mpeg`
* `video/mov`
* `video/webm`

## Common Use Cases

Video inputs enable a wide range of applications:

* **Video Summarization**: Generate text summaries of video content
* **Object and Activity Recognition**: Identify objects, people, and actions in videos
* **Scene Understanding**: Describe settings, environments, and contexts
* **Sports Analysis**: Analyze gameplay, movements, and tactics
* **Surveillance**: Monitor and analyze security footage
* **Educational Content**: Analyze instructional videos and provide insights

## Best Practices

### File Size Considerations

Video files can be large, which affects both upload time and processing costs:

* **Compress videos** when possible to reduce file size without significant quality loss
* **Trim videos** to include only relevant segments
* **Consider resolution**: Lower resolutions (e.g., 720p vs 4K) reduce file size while maintaining usability for most analysis tasks
* **Frame rate**: Lower frame rates can reduce file size for videos where high temporal resolution isn't critical

### Optimal Video Length

Different models may have different limits on video duration:

* Check model-specific documentation for maximum video length
* For long videos, consider splitting into shorter segments
* Focus on key moments rather than sending entire long-form content

### Quality vs. Size Trade-offs

Balance video quality with practical considerations:

* **High quality** (1080p+, high bitrate): Best for detailed visual analysis, object detection, text recognition
* **Medium quality** (720p, moderate bitrate): Suitable for most general analysis tasks
* **Lower quality** (480p, lower bitrate): Acceptable for basic scene understanding and action recognition

## Provider-Specific Video URL Support

Video URL support varies significantly by provider:

* **Google Gemini (AI Studio)**: Only supports YouTube links (e.g., `https://www.youtube.com/watch?v=...`)
* **Google Gemini (Vertex AI)**: Does not support video URLs - use base64-encoded data URLs instead
* **Other providers**: Check model-specific documentation for video URL support

## Troubleshooting

**Video not processing?**

* Verify the model supports video input (check `input_modalities` includes `"video"`)
* If using a video URL, confirm the provider supports video URLs (see Provider-Specific Video URL Support above)
* For Gemini on AI Studio, ensure you're using a YouTube link, not a direct video file URL
* If the video URL isn't working, try using a base64-encoded data URL instead
* Check that the video format is supported
* Verify the video file isn't corrupted

**Large file errors?**

* Compress the video to reduce file size
* Reduce video resolution or frame rate
* Trim the video to a shorter duration
* Check model-specific file size limits
* Consider using a video URL (if supported by the provider) instead of base64 encoding for large files

**Poor analysis results?**

* Ensure video quality is sufficient for the task
* Provide clear, specific prompts about what to analyze
* Consider if the video duration is appropriate for the model
* Check if the video content is clearly visible and well-lit

# Video Generation

OpenRouter supports video generation from text prompts (and optional reference images) via a dedicated asynchronous
API. You can find the supported models, their capabilities, and pricing by filtering our [model list by video output](https://openrouter.ai/models?output_modalities=video). Adding video generation to an app? The [Video Generation Cookbook](/docs/cookbook/video-generation/choose-video-model) breaks this workflow into step-by-step recipes for choosing a model, submitting text-to-video jobs, using images, passing provider options, and handling webhooks. For reusable agent knowledge across projects, install the [openrouter-video skill](https://github.com/OpenRouterTeam/skills/tree/main/skills/openrouter-video). ## Model Discovery You can find video generation models in several ways: ### Via the Video Models API Use the dedicated video models endpoint to list all available video generation models along with their supported parameters: ```bash curl "https://openrouter.ai/api/v1/videos/models" ``` The response returns a `data` array where each model includes: ```json { "data": [ { "id": "google/veo-3.1", "canonical_slug": "google/veo-3.1", "name": "Google: Veo 3.1", "description": "...", "created": 1719792000, "supported_resolutions": ["720p", "1080p"], "supported_aspect_ratios": ["16:9", "9:16", "1:1"], "supported_sizes": ["1280x720", "1920x1080"], "pricing_skus": { "per-video-second": "0.50", "per-video-second-1080p": "0.75" }, "allowed_passthrough_parameters": ["output_config"] } ] } ``` | Field | Description | | -------------------------------- | --------------------------------------------------------------------------------- | | `id` | Model slug to use in generation requests | | `canonical_slug` | Permanent model identifier | | `supported_resolutions` | List of supported output resolutions (e.g., `720p`, `1080p`) | | `supported_aspect_ratios` | List of supported aspect ratios (e.g., `16:9`, `9:16`) | | `supported_sizes` | List of supported pixel dimensions (e.g., `1280x720`) | | `pricing_skus` | Pricing information per SKU | | `allowed_passthrough_parameters` | Provider-specific parameters that can be passed through via the `provider` option | Use this endpoint to check which resolutions, aspect ratios, and passthrough parameters are supported by each model before submitting a generation request. ### Via the Models API You can also use the `output_modalities` query parameter on the [Models API](/docs/api-reference/models/get-models) to discover video generation models: ```bash # List only video generation models curl "https://openrouter.ai/api/v1/models?output_modalities=video" ``` ### On the Models Page Visit the [Models page](/models) and filter by output modalities to find models capable of video generation. Look for models that list `"video"` in their output modalities. ## How It Works Unlike text or image generation, video generation is **asynchronous** because generating video takes significantly longer. The workflow is: 1. **Submit** a generation request to `POST /api/v1/videos` 2. **Receive** a job ID and polling URL immediately 3. **Poll** the polling URL (`GET /api/v1/videos/{jobId}`) until the status is `completed` 4. 
**Download** the video from the content URL (`GET /api/v1/videos/{jobId}/content`) ## API Usage ### Submitting a Video Generation Request ### Request Parameters | Parameter | Type | Required | Description | | ------------------ | ------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------- | | `model` | string | Yes | The model to use for video generation (e.g., `google/veo-3.1`) | | `prompt` | string | Yes | Text description of the video to generate | | `duration` | integer | No | Duration of the generated video in seconds | | `resolution` | string | No | Resolution of the output video (e.g., `720p`, `1080p`) | | `aspect_ratio` | string | No | Aspect ratio of the output video (e.g., `16:9`, `9:16`) | | `size` | string | No | Exact pixel dimensions in `WIDTHxHEIGHT` format (e.g., `1280x720`). Interchangeable with `resolution` + `aspect_ratio` | | `frame_images` | array | No | Images for first/last frames (image-to-video) | | `input_references` | array | No | Reference images for style guidance (reference-to-video) | | `generate_audio` | boolean | No | Whether to generate audio alongside the video. Defaults to `true` for models that support audio output | | `seed` | integer | No | Seed for deterministic generation (not guaranteed by all providers) | | `callback_url` | string | No | URL to receive a webhook notification when the job completes. Overrides the workspace-level default callback URL if set. Must be HTTPS | | `provider` | object | No | Provider-specific passthrough configuration | ### Supported Resolutions * `480p` * `720p` * `1080p` * `1K` * `2K` * `4K` ### Supported Aspect Ratios * `16:9` — Widescreen landscape * `9:16` — Vertical/portrait * `1:1` — Square * `4:3` — Standard landscape * `3:4` — Standard portrait * `21:9` — Ultra-wide * `9:21` — Ultra-tall ### Using Images There are two ways to provide images, each triggering a different generation mode: * **`frame_images`** — Specifies first or last frame images for **image-to-video** generation. Each entry must include a `frame_type` of `first_frame` or `last_frame`. * **`input_references`** — Provides style or content reference images for **reference-to-video** generation. The model uses these as visual guidance rather than exact frames. If both fields are provided, `frame_images` takes precedence and the request is treated as image-to-video. #### Image-to-Video (frame\_images) ```json { "model": "alibaba/wan-2.7", "prompt": "A character walking through a forest", "frame_images": [ { "type": "image_url", "image_url": { "url": "https://example.com/first-frame.png" }, "frame_type": "first_frame" } ], "resolution": "1080p" } ``` #### Reference-to-Video (input\_references) ```json { "model": "alibaba/wan-2.7", "prompt": "A colossal solar flare beside a planet", "input_references": [ { "type": "image_url", "image_url": { "url": "https://example.com/style-ref.png" } } ], "resolution": "1080p" } ``` ### Provider-Specific Options You can pass provider-specific options using the `provider` parameter. 
Options are keyed by provider slug, and only the options for the matched provider are forwarded: ```json { "model": "google/veo-3.1", "prompt": "A time-lapse of a flower blooming", "provider": { "options": { "google-vertex": { "parameters": { "personGeneration": "allow", "negativePrompt": "blurry, low quality" } } } } } ``` Use the [Video Models API](#via-the-video-models-api) to check which passthrough parameters each model supports via the `allowed_passthrough_parameters` field. ## Response Format ### Submit Response (202 Accepted) When you submit a video generation request, you receive an immediate response with the job details: ```json { "id": "abc123", "polling_url": "https://openrouter.ai/api/v1/videos/abc123", "status": "pending" } ``` ### Poll Response When polling the job status, the response includes additional fields as the job progresses: ```json { "id": "abc123", "generation_id": "gen-1234567890-abcdef", "polling_url": "https://openrouter.ai/api/v1/videos/abc123", "status": "completed", "unsigned_urls": [ "https://openrouter.ai/api/v1/videos/abc123/content?index=0" ], "usage": { "cost": 0.25, "is_byok": false } } ``` ### Job Statuses | Status | Description | | ------------- | ----------------------------------------------- | | `pending` | The job has been submitted and is queued | | `in_progress` | The video is being generated | | `completed` | The video is ready to download | | `failed` | The generation failed (check the `error` field) | ### Downloading the Video Once the job status is `completed`, the `unsigned_urls` array contains URLs to download the generated video content. You can also use the content endpoint directly: ```bash curl "https://openrouter.ai/api/v1/videos/{jobId}/content?index=0" \ -H "Authorization: Bearer $OPENROUTER_API_KEY" \ --output video.mp4 ``` The `index` query parameter defaults to `0` and can be used if the model generates multiple video outputs. ## Webhooks Instead of polling for job status, you can receive a webhook notification when a video generation job completes. There are two ways to configure a callback URL: 1. **Per-request**: Pass `callback_url` in the request body. This takes priority over the workspace default. 2. **Workspace default**: Set a default callback URL in your [workspace settings](/workspaces). This applies to all video generation requests that don't specify their own `callback_url`. ### Webhook Payload When a job reaches a terminal state, a POST request is sent to the callback URL with an event envelope. Each delivery also carries an `X-OpenRouter-Idempotency-Key` header of the form `-` for safe retry deduplication. 
`video.generation.completed`: ```json { "type": "video.generation.completed", "created_at": "2026-04-24T12:00:00.000Z", "data": { "id": "abc123", "status": "completed", "generation_id": "gen-xyz789", "model": "google/veo-3.1", "unsigned_urls": [ "https://openrouter.ai/api/v1/videos/abc123/content?index=0" ], "usage": { "cost": 0.5, "is_byok": false } } } ``` `video.generation.failed`: ```json { "type": "video.generation.failed", "created_at": "2026-04-24T12:00:00.000Z", "data": { "id": "abc123", "status": "failed", "generation_id": "gen-xyz789", "model": "google/veo-3.1", "error": "Content policy violation" } } ``` `video.generation.cancelled`: ```json { "type": "video.generation.cancelled", "created_at": "2026-04-24T12:00:00.000Z", "data": { "id": "abc123", "status": "cancelled", "generation_id": "gen-xyz789", "model": "google/veo-3.1", "error": "Job was cancelled" } } ``` `video.generation.expired`: ```json { "type": "video.generation.expired", "created_at": "2026-04-24T12:00:00.000Z", "data": { "id": "abc123", "status": "expired", "generation_id": "gen-xyz789", "model": "google/veo-3.1", "error": "Job exceeded maximum time to live" } } ``` `generation_id` and `model` in `data` may be `null` when a job fails before those values are assigned (e.g. an early validation failure). ### Signing Secret You can configure a signing secret in your [workspace settings](/workspaces) to verify that webhook payloads are authentically from OpenRouter. When a signing secret is configured, each webhook delivery includes an `X-OpenRouter-Signature` header. The signature includes a timestamp and an HMAC hash: ``` X-OpenRouter-Signature: t=1234567890,v1=a1b2c3d4... ``` ### Verifying Signatures To verify the signature on your webhook receiver: 1. Extract the timestamp (`t`) and signature hash (`v1`) from the header 2. Construct the signed payload: `{timestamp},{raw_request_body}` (joined with a comma) 3. Compute the HMAC-SHA256 of the signed payload using your signing secret as the key 4. Compare the hex-encoded result with the `v1` value ```typescript import crypto from 'crypto'; const FIVE_MINUTES_IN_SECONDS = 300; function verifyWebhookSignature( rawBody: string, signatureHeader: string, secret: string, ): boolean { const parts = signatureHeader.split(','); const timestamp = parts.find((p) => p.startsWith('t='))?.slice(2); const hash = parts.find((p) => p.startsWith('v1='))?.slice(3); if (!timestamp || !hash) { return false; } // Reject timestamps older than 5 minutes to prevent replay attacks const age = Math.floor(Date.now() / 1000) - Number(timestamp); if (Number.isNaN(age) || age > FIVE_MINUTES_IN_SECONDS) { return false; } const signedPayload = `${timestamp},${rawBody}`; const expected = crypto .createHmac('sha256', secret) .update(signedPayload) .digest('hex'); if (expected.length !== hash.length) { return false; } return crypto.timingSafeEqual( Buffer.from(expected), Buffer.from(hash), ); } ``` Use the **raw request body** (the exact bytes received) for verification. Parsing and re-serializing JSON may change key ordering or number formatting, which will cause verification to fail. ## Best Practices * **Detailed Prompts**: Provide specific, descriptive prompts for better video quality. Include details about motion, camera angles, lighting, and scene composition * **Appropriate Resolution**: Higher resolutions take longer to generate and cost more. Choose the resolution that fits your use case * **Polling Interval**: Use a reasonable polling interval (e.g., 30 seconds) to avoid excessive API calls. 
Video generation typically takes 30 seconds to several minutes depending on the model and parameters * **Error Handling**: Always check the job status for `failed` state and handle the `error` field appropriately * **Reference Images**: When using reference images, ensure they are high quality and relevant to the desired video output ## Zero Data Retention Video generation is **not eligible** for [Zero Data Retention (ZDR)](/docs/guides/features/zdr). Because video generation is asynchronous, the generated video output must be retained by the provider for a short period of time so that it can be retrieved after generation is complete. This temporary retention is inherent to the async polling workflow and cannot be bypassed. If you have ZDR enforcement enabled (either via [account settings](/settings/privacy) or the per-request `zdr` parameter), video generation requests will not be routed. ## Troubleshooting **Job stays in `pending` for a long time?** * Video generation can take several minutes depending on the model, resolution, and server load * Continue polling at regular intervals **Generation failed?** * Check the `error` field in the poll response for details * Verify the model supports video generation (`output_modalities` includes `"video"`) * Ensure your prompt is appropriate and within model guidelines * Check that any reference images are accessible and in supported formats **Model not found?** * Use the [Video Models API](#via-the-video-models-api) or the [Models page](/models) to find available video generation models * Verify the model slug is correct (e.g., `google/veo-3.1`) # Text-to-Speech OpenRouter supports text-to-speech (TTS) via a dedicated `/api/v1/audio/speech` endpoint that is compatible with the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech). Send text and receive a raw audio byte stream in your chosen format. ## Model Discovery You can find TTS models in several ways: ### Via the API Use the `output_modalities` query parameter on the [Models API](/docs/api-reference/models/get-models) to discover TTS models: ```bash # List only TTS models curl "https://openrouter.ai/api/v1/models?output_modalities=speech" ``` ### On the Models Page Visit the [Models page](/models) and filter by output modalities to find models capable of speech synthesis. Look for models that list `"speech"` in their output modalities. ## API Usage Send a `POST` request to `/api/v1/audio/speech` with the text you want to synthesize. The response is a raw audio byte stream — not JSON — so you can pipe it directly to a file or audio player. ### Basic Example ### Request Parameters | Parameter | Type | Required | Description | | ----------------- | ------ | -------- | -------------------------------------------------------------------------------------------------------------------------------- | | `model` | string | Yes | The TTS model to use (e.g., `openai/gpt-4o-mini-tts-2025-12-15`, `mistralai/voxtral-mini-tts-2603`) | | `input` | string | Yes | The text to synthesize into speech | | `voice` | string | Yes | Voice identifier. Available voices vary by model — check each model's page on the [Models page](/models) for supported voices | | `response_format` | string | No | Audio output format: `mp3` or `pcm`. Defaults to `pcm` | | `speed` | number | No | Playback speed multiplier. Only used by models that support it (e.g., OpenAI TTS). Ignored by other providers. 
Defaults to `1.0` | | `provider` | object | No | Provider-specific passthrough configuration | ### Provider-Specific Options You can pass provider-specific options using the `provider` parameter. Options are keyed by provider slug, and only the options for the matched provider are forwarded: ```json { "model": "openai/gpt-4o-mini-tts-2025-12-15", "input": "Hello world", "voice": "alloy", "provider": { "options": { "openai": { "instructions": "Speak in a warm, friendly tone." } } } } ``` ## Response Format The TTS endpoint returns a **raw audio byte stream**, not JSON. The response includes the following headers: | Header | Description | | ----------------- | --------------------------------------------------------------------------------------- | | `Content-Type` | The MIME type of the audio. `audio/mpeg` for `mp3` format, `audio/pcm` for `pcm` format | | `X-Generation-Id` | The unique generation ID for the request, useful for tracking and debugging | ### Output Formats | Format | Content-Type | Description | | ------ | ------------ | --------------------------------------------------------------------------------- | | `mp3` | `audio/mpeg` | Compressed audio, smaller file size. Good for storage and playback | | `pcm` | `audio/pcm` | Uncompressed raw audio. Lower latency, suitable for real-time streaming pipelines | ## Pricing TTS models are priced **per character** of input text. Pricing varies by model and provider. You can check the per-character cost for each model on the [Models page](/models) or via the [Models API](/docs/api-reference/models/get-models). ## OpenAI SDK Compatibility The TTS endpoint is fully compatible with the OpenAI SDK. You can use the OpenAI client libraries by pointing them at OpenRouter's base URL: ## Best Practices * **Choose the right format**: Use `mp3` for storage and general playback. Use `pcm` for real-time streaming pipelines where latency matters * **Voice selection**: Different providers offer different voices. Check the model's documentation or experiment with available voices to find the best fit for your use case * **Input length**: For very long texts, consider splitting the input into smaller segments and concatenating the audio output. This can improve reliability and reduce latency for the first audio chunk * **Speed parameter**: The `speed` parameter is only supported by certain providers (e.g., OpenAI). It is silently ignored by providers that don't support it ## Troubleshooting **Empty or corrupted audio file?** * Verify the `response_format` matches how you're saving the file (e.g., don't save `pcm` output with a `.mp3` extension) * Check the response status code — non-200 responses return JSON error bodies, not audio **Model not found?** * Use the [Models page](/models) to find available TTS models * Verify the model slug is correct (e.g., `openai/gpt-4o-mini-tts-2025-12-15`, not `gpt-4o-mini-tts`) **Voice not available?** * Available voices vary by provider. Check the provider's documentation for supported voice identifiers * Each model has its own set of voices — check the model's page on the [Models page](/models) for the full list # Speech-to-Text OpenRouter supports speech-to-text (STT) via a dedicated `/api/v1/audio/transcriptions` endpoint. Send base64-encoded audio and receive a JSON response with the transcribed text and usage statistics. 
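As a rough sketch (the model slug and file path are placeholders), a transcription request looks like this:

```typescript
import { readFile } from 'node:fs/promises';

// Sketch only: transcribing a local WAV file. Model slug and path are placeholders.
const audio = await readFile('./recording.wav');

const response = await fetch('https://openrouter.ai/api/v1/audio/transcriptions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'openai/whisper-1', // placeholder STT model
    input_audio: {
      data: audio.toString('base64'), // raw base64, not a data URI
      format: 'wav',
    },
  }),
});

const { text, usage } = await response.json();
console.log(text, usage);
```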
## Model Discovery You can find STT models in several ways: ### Via the API Use the `output_modalities` query parameter on the [Models API](/docs/api-reference/models/get-models) to discover STT models: ```bash # List only STT models curl "https://openrouter.ai/api/v1/models?output_modalities=transcription" ``` ### On the Models Page Visit the [Models page](/models) and filter by output modalities to find models capable of audio transcription. You can also browse the [Speech-to-Text collection](/collections/speech-to-text-models) for a curated list. ## API Usage Send a `POST` request to `/api/v1/audio/transcriptions` with a JSON body containing base64-encoded audio. The response is JSON with the transcribed text and optional usage statistics. ### Basic Example ### Request Parameters | Parameter | Type | Required | Description | | -------------------- | ------ | -------- | ------------------------------------------------------------------------------------- | | `model` | string | Yes | The STT model to use (e.g., `openai/whisper-1`) | | `input_audio` | object | Yes | Audio data to transcribe | | `input_audio.data` | string | Yes | Base64-encoded audio data (raw bytes, not a data URI) | | `input_audio.format` | string | Yes | Audio format (e.g., `wav`, `mp3`, `flac`, `m4a`, `ogg`, `webm`, `aac`) | | `language` | string | No | ISO-639-1 language code (e.g., `"en"`, `"ja"`). Auto-detected if omitted | | `temperature` | number | No | Sampling temperature between 0 and 1. Lower values produce more deterministic results | | `provider` | object | No | Provider-specific passthrough configuration | ### Provider-Specific Options You can pass provider-specific options using the `provider` parameter. Options are keyed by provider slug, and only the options for the matched provider are forwarded: ```json { "model": "openai/whisper-large-v3", "input_audio": { "data": "UklGRiQA...", "format": "wav" }, "provider": { "options": { "groq": { "prompt": "Expected vocabulary: OpenRouter, API, transcription" } } } } ``` ## Response Format The STT endpoint returns a JSON response with the transcribed text: ```json { "text": "Hello, this is a test of speech-to-text transcription.", "usage": { "seconds": 9.2, "total_tokens": 113, "input_tokens": 83, "output_tokens": 30, "cost": 0.000508 } } ``` ### Response Fields | Field | Type | Description | | --------------------- | ------ | -------------------------------------------- | | `text` | string | The transcribed text | | `usage.seconds` | number | Duration of the input audio in seconds | | `usage.total_tokens` | number | Total number of tokens used (input + output) | | `usage.input_tokens` | number | Number of input tokens billed | | `usage.output_tokens` | number | Number of output tokens generated | | `usage.cost` | number | Total cost of the request in USD | ### Response Headers | Header | Description | | ----------------- | ----------------------------------------------------------------------- | | `X-Generation-Id` | Unique generation ID for the request, useful for tracking and debugging | ## Supported Audio Formats Supported audio formats vary by provider. 
Common formats include: | Format | MIME Type | Description | | ------ | ------------ | ---------------------------------------- | | `wav` | `audio/wav` | Uncompressed audio, highest quality | | `mp3` | `audio/mpeg` | Compressed audio, widely compatible | | `flac` | `audio/flac` | Lossless compressed audio | | `m4a` | `audio/mp4` | MPEG-4 audio | | `ogg` | `audio/ogg` | Ogg Vorbis audio | | `webm` | `audio/webm` | WebM audio, common in browser recordings | | `aac` | `audio/aac` | Advanced Audio Coding | ## Pricing STT models use different pricing strategies depending on the provider: * **Duration-based** (e.g., OpenAI Whisper): Priced per second of audio input * **Token-based** (e.g., newer OpenAI models): Priced per input/output token, similar to text models You can check the cost for each model on the [Models page](/models) or via the [Models API](/docs/api-reference/models/get-models). The `usage.cost` field in the response shows the actual cost for each request. ## BYOK (Bring Your Own Key) STT supports [BYOK](/docs/guides/overview/auth/byok), allowing you to use your own provider API keys. When configured, requests are routed directly to the provider using your key, and OpenRouter charges only its platform fee rather than the per-usage model cost. ## Playground You can test STT models directly in the browser using the [OpenRouter Playground](/playground). Navigate to any STT model's page and use the playground tab to upload an audio file and see the transcription result. ## Differences from Audio Input OpenRouter supports two ways to process audio: 1. **Speech-to-Text** (this page): A dedicated `/api/v1/audio/transcriptions` endpoint optimized for transcription. Returns structured JSON with the transcribed text and usage data. Best for converting audio to text. 2. **Audio input via Chat Completions** ([Audio docs](/docs/features/multimodal/audio)): Send audio as part of a `/api/v1/chat/completions` request using the `input_audio` content type. The model processes the audio alongside text and responds conversationally. Best for audio analysis, question answering about audio content, or combining audio with other modalities. ## Best Practices * **Choose the right format**: WAV provides the best quality for transcription. MP3 and other compressed formats work well but may slightly reduce accuracy for borderline audio * **File size**: For very long audio files, consider splitting them into smaller segments. The upstream provider timeout is 60 seconds, so very large files may time out * **Base64 encoding**: Audio must be sent as base64-encoded data (raw bytes, not a data URI). Most programming languages have built-in base64 encoding utilities ## Troubleshooting **Empty or incorrect transcription?** * Verify the audio format matches the `format` field in your request * Ensure the audio quality is sufficient for transcription **Request timing out?** * Large audio files may exceed the 60-second timeout. Split long recordings into smaller segments * Compressed formats (MP3, AAC) produce smaller payloads and transfer faster **Model not found?** * Use the [Models page](/models) or the [Models API](/docs/api-reference/models/get-models) with `output_modalities=transcription` to find available STT models * Verify the model slug is correct (e.g., `openai/whisper-1`, not `whisper-1`) **Authentication error?** * Ensure you're using a valid API key from [your OpenRouter dashboard](/settings/keys) * The STT endpoint uses the same authentication as the Chat Completions API