> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://openrouter.ai/docs/guides/best-practices/llms.txt.
> For full documentation content, see https://openrouter.ai/docs/guides/best-practices/llms-full.txt.

# Best Practices

# Latency and Performance

OpenRouter is designed with performance as a top priority and is heavily optimized to add as little latency as possible to your requests.

## Minimal Overhead

OpenRouter adds minimal latency to your requests. This is achieved through:

* Edge computing using Cloudflare Workers to stay as close as possible to your application
* Efficient caching of user and API key data at the edge
* Optimized routing logic that minimizes processing time

## Performance Considerations

### Cache Warming

When OpenRouter's edge caches are cold (typically during the first 1-2 minutes of operation in a new region), you may experience slightly higher latency as the caches warm up. This normalizes once the caches are populated.

### Credit Balance Checks

To maintain accurate billing and prevent overages, OpenRouter performs additional database checks when:

* A user's credit balance is low (single-digit dollars)
* An API key is approaching its configured credit limit

OpenRouter expires caches more aggressively under these conditions to ensure proper billing, which increases latency until additional credits are added.

### Model Fallback

When using [model routing](/docs/routing/auto-model-selection) or [provider routing](/docs/guides/routing/provider-selection), if the primary model or provider fails, OpenRouter will automatically try the next option. A failed initial completion adds latency to that specific request. OpenRouter tracks provider failures and attempts to intelligently route around unavailable providers so that this latency is not incurred on every request.

## Best Practices

To achieve optimal performance with OpenRouter:

1. **Maintain Healthy Credit Balance**
   * Set up auto-topup with a higher threshold and amount
   * This helps avoid forced credit checks and reduces the risk of hitting zero balance
   * Recommended minimum balance: \$10-20 to ensure smooth operation

2. **Use Provider Preferences**
   * If you have specific latency requirements (whether time to first token or time to last), there are [provider routing](/docs/guides/routing/provider-selection) features to help you achieve your performance and cost goals.

# Prompt Caching

To save on inference costs, you can enable prompt caching on supported providers and models. Most providers automatically enable prompt caching, but note that some (see Alibaba and Anthropic below) require you to enable it on a per-message basis.

When using caching (whether automatically in supported models or via the `cache_control` property), OpenRouter uses provider sticky routing to maximize cache hits — see [Provider Sticky Routing](#provider-sticky-routing) below for details.

## Provider Sticky Routing

To maximize cache hit rates, OpenRouter uses **provider sticky routing** to route your subsequent requests to the same provider endpoint after a cached request. This works automatically with both implicit caching (e.g. OpenAI, DeepSeek, Gemini 2.5) and explicit caching (e.g. Anthropic `cache_control` breakpoints).

**How it works:**

* After a request that uses prompt caching, OpenRouter remembers which provider served your request.
* Subsequent requests for the same model are routed to the same provider, keeping your cache warm.
* Sticky routing only activates when the provider's cache read pricing is cheaper than regular prompt pricing, ensuring you always benefit from cost savings. * If the sticky provider becomes unavailable, OpenRouter automatically falls back to the next-best provider. * Sticky routing is not used when you specify a manual [provider order](/docs/api-reference/provider-preferences) via `provider.order` — in that case, your explicit ordering takes priority. **Sticky routing granularity:** Sticky routing is tracked at the account level, per model, and per conversation. OpenRouter identifies conversations by hashing the first system (or developer) message and the first non-system message in each request, so requests that share the same opening messages are routed to the same provider. This means different conversations naturally stick to different providers, improving load-balancing and throughput while keeping caches warm within each conversation. ## Inspecting cache usage To see how much caching saved on each generation, you can: 1. Click the detail button on the [Activity](/activity) page 2. Use the `/api/v1/generation` API, [documented here](/docs/api/api-reference/generations/get-generation) 3. Check the `prompt_tokens_details` object in the [usage response](/docs/cookbook/administration/usage-accounting) included with every API response The `cache_discount` field in the response body will tell you how much the response saved on cache usage. Some providers, like Anthropic, will have a negative discount on cache writes, but a positive discount (which reduces total cost) on cache reads. ### Usage object fields The usage object in API responses includes detailed cache metrics in the `prompt_tokens_details` field: ```json { "usage": { "prompt_tokens": 10339, "completion_tokens": 60, "total_tokens": 10399, "prompt_tokens_details": { "cached_tokens": 10318, "cache_write_tokens": 0 } } } ``` The key fields are: * `cached_tokens`: Number of tokens read from the cache (cache hit). When this is greater than zero, you're benefiting from cached content. * `cache_write_tokens`: Number of tokens written to the cache. This appears on the first request when establishing a new cache entry. ## OpenAI Caching price changes: * **Cache writes**: no cost * **Cache reads**: (depending on the model) charged at 0.25x or 0.50x the price of the original input pricing [Click here to view OpenAI's cache pricing per model.](https://platform.openai.com/docs/pricing) Prompt caching with OpenAI is automated and does not require any additional configuration. There is a minimum prompt size of 1024 tokens. [Click here to read more about OpenAI prompt caching and its limitation.](https://platform.openai.com/docs/guides/prompt-caching) ## Grok Caching price changes: * **Cache writes**: no cost * **Cache reads**: charged at {GROK_CACHE_READ_MULTIPLIER}x the price of the original input pricing [Click here to view Grok's cache pricing per model.](https://docs.x.ai/docs/models#models-and-pricing) Prompt caching with Grok is automated and does not require any additional configuration. ## Moonshot AI Caching price changes: * **Cache writes**: no cost * **Cache reads**: charged at {MOONSHOT_CACHE_READ_MULTIPLIER}x the price of the original input pricing Prompt caching with Moonshot AI is automated and does not require any additional configuration. 
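Because caching with the providers above is applied automatically, the quickest way to confirm it is working is to send the same large prompt twice and compare the cache fields in the returned usage object (see "Inspecting cache usage" above). The following is a minimal sketch; the model slug and the document are placeholders, and the endpoint is the standard OpenRouter chat completions API.

```typescript
// Sketch: send the same large, stable prompt twice and compare cache metrics.
// On the second call, usage.prompt_tokens_details.cached_tokens should be greater
// than zero, provided the prompt meets the provider's minimum cacheable size and
// sticky routing kept the request on the same provider.
const bigDoc = "LARGE STABLE REFERENCE TEXT ".repeat(500); // placeholder document

async function complete(doc: string) {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o", // placeholder: any model with automatic caching
      messages: [
        { role: "system", content: `Reference document:\n${doc}` },
        { role: "user", content: "Summarize the key points." },
      ],
    }),
  });
  const data = await res.json();
  return data.usage?.prompt_tokens_details;
}

console.log("first call:", await complete(bigDoc));  // cold cache
console.log("second call:", await complete(bigDoc)); // cached_tokens should be > 0
```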
## Groq Caching price changes: * **Cache writes**: no cost * **Cache reads**: charged at {GROQ_CACHE_READ_MULTIPLIER}x the price of the original input pricing Prompt caching with Groq is automated and does not require any additional configuration. Currently available on Kimi K2 models. [Click here to view Groq's documentation.](https://console.groq.com/docs/prompt-caching) ## Alibaba Qwen Caching price changes for explicit caching: * **Cache writes**: charged at {ALIBABA_CACHE_WRITE_MULTIPLIER}x the price of the original input pricing * **Cache reads**: charged at {ALIBABA_CACHE_READ_MULTIPLIER}x the price of the original input pricing Alibaba prompt caching requires explicit cache breakpoints. Add `cache_control: { "type": "ephemeral" }` to content blocks you want to cache, using the same syntax as Anthropic explicit caching. Cache writes use a 5-minute TTL. Alibaba explicit caching is available on `deepseek/deepseek-v3.2`, `qwen/qwen3-max`, `qwen/qwen-plus`, `qwen/qwen3.6-plus`, `qwen/qwen3-coder-plus`, and `qwen/qwen3-coder-flash`. Snapshot endpoints, including `qwen/qwen3.5-plus-02-15` and `qwen/qwen3.5-flash-02-23`, do not support explicit caching. ### Example ```json { "model": "qwen/qwen3-coder-plus", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Use the reference below when answering." }, { "type": "text", "text": "HUGE TEXT BODY", "cache_control": { "type": "ephemeral" } }, { "type": "text", "text": "Summarize the main implementation details." } ] } ] } ``` ## Anthropic Claude Caching price changes: * **Cache writes (5-minute TTL)**: charged at {ANTHROPIC_CACHE_WRITE_MULTIPLIER}x the price of the original input pricing * **Cache writes (1-hour TTL)**: charged at 2x the price of the original input pricing * **Cache reads**: charged at {ANTHROPIC_CACHE_READ_MULTIPLIER}x the price of the original input pricing There are two ways to enable prompt caching with Anthropic: * **Automatic caching**: Add a single `cache_control` field at the top level of your request. The system automatically applies the cache breakpoint to the last cacheable block and advances it forward as conversations grow. Best for multi-turn conversations. * **Explicit cache breakpoints**: Place `cache_control` directly on individual content blocks for fine-grained control over exactly what gets cached. There is a limit of four explicit breakpoints. It is recommended to reserve the cache breakpoints for large bodies of text, such as character cards, CSV data, RAG data, book chapters, etc. **Automatic caching** (top-level `cache_control`) is only supported when requests are routed to the **Anthropic** provider directly. Amazon Bedrock and Google Vertex AI currently do not support top-level `cache_control` — when it is present, OpenRouter will only route to the Anthropic provider and exclude Bedrock and Vertex endpoints. Explicit per-block `cache_control` breakpoints work across all Anthropic-compatible providers including Bedrock and Vertex. By default, the cache expires after 5 minutes, but you can extend this to 1 hour by specifying `"ttl": "1h"` in the `cache_control` object. 
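As a rough illustration of the automatic approach as an actual API call (a sketch only; the system prompt here is a placeholder, and the fuller JSON request examples appear in the sections below), a request with a top-level `cache_control` and a 1-hour TTL might look like this:

```typescript
// Sketch: Anthropic automatic caching via a single top-level cache_control.
// The cache breakpoint is applied to the last cacheable block and advances
// automatically as the conversation grows.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.6",
    cache_control: { type: "ephemeral", ttl: "1h" }, // top level = automatic caching
    messages: [
      { role: "system", content: "You are a historian. Reference text: <large, stable prompt here>" },
      { role: "user", content: "What triggered the collapse?" },
    ],
  }),
});

const data = await res.json();
// cache_write_tokens on the first request, cached_tokens on subsequent ones
console.log(data.usage?.prompt_tokens_details);
```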
[Click here to read more about Anthropic prompt caching and its limitation.](https://platform.claude.com/docs/en/build-with-claude/prompt-caching) ### Supported models The following Claude models support prompt caching (both automatic and explicit): * Claude Opus 4.7 * Claude Opus 4.6 * Claude Opus 4.5 * Claude Opus 4.1 * Claude Opus 4 * Claude Sonnet 4.6 * Claude Sonnet 4.5 * Claude Sonnet 4 * Claude Sonnet 3.7 (deprecated) * Claude Haiku 4.5 * Claude Haiku 3.5 ### Minimum token requirements Each model has a minimum cacheable prompt length: * **4096 tokens**: Claude Opus 4.7, Claude Opus 4.6, Claude Opus 4.5, Claude Haiku 4.5 * **2048 tokens**: Claude Sonnet 4.6, Claude Haiku 3.5 * **1024 tokens**: Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7 Prompts shorter than these minimums will not be cached. ### Cache TTL Options OpenRouter supports two cache TTL values for Anthropic: * **5 minutes** (default): `"cache_control": { "type": "ephemeral" }` * **1 hour**: `"cache_control": { "type": "ephemeral", "ttl": "1h" }` The 1-hour TTL is useful for longer sessions where you want to maintain cached content across multiple requests without incurring repeated cache write costs. The 1-hour TTL costs more for cache writes (2x base input price vs 1.25x for 5-minute TTL) but can save money over extended sessions by avoiding repeated cache writes. The 1-hour TTL for explicit cache breakpoints is supported across all Claude model providers (Anthropic, Amazon Bedrock, and Google Vertex AI). ### Examples #### Automatic caching (recommended for multi-turn conversations) With automatic caching, add `cache_control` at the top level of the request. The system automatically caches all content up to the last cacheable block: ```json { "model": "anthropic/claude-sonnet-4.6", "cache_control": { "type": "ephemeral" }, "messages": [ { "role": "system", "content": "You are a historian studying the fall of the Roman Empire. You know the following book very well: HUGE TEXT BODY" }, { "role": "user", "content": "What triggered the collapse?" } ] } ``` As the conversation grows, the cache breakpoint automatically advances to cover the growing message history. Automatic caching with 1-hour TTL: ```json { "model": "anthropic/claude-sonnet-4.6", "cache_control": { "type": "ephemeral", "ttl": "1h" }, "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the meaning of life?" } ] } ``` #### Explicit cache breakpoints (fine-grained control) System message caching example (default 5-minute TTL): ```json { "messages": [ { "role": "system", "content": [ { "type": "text", "text": "You are a historian studying the fall of the Roman Empire. You know the following book very well:" }, { "type": "text", "text": "HUGE TEXT BODY", "cache_control": { "type": "ephemeral" } } ] }, { "role": "user", "content": [ { "type": "text", "text": "What triggered the collapse?" 
} ] } ] } ```

User message caching example with 1-hour TTL:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Given the book below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": { "type": "ephemeral", "ttl": "1h" }
        },
        {
          "type": "text",
          "text": "Name all the characters in the above book"
        }
      ]
    }
  ]
}
```

## DeepSeek

Caching price changes:

* **Cache writes**: charged at the same price as the original input pricing
* **Cache reads**: charged at {DEEPSEEK_CACHE_READ_MULTIPLIER}x the price of the original input pricing

Prompt caching with DeepSeek is automated and does not require any additional configuration.

## Google Gemini

### Implicit Caching

Gemini 2.5 Pro and 2.5 Flash models now support **implicit caching**, providing automatic caching similar to OpenAI's. Implicit caching works seamlessly — no manual setup or additional `cache_control` breakpoints required.

Pricing Changes:

* No cache write or storage costs.
* Cached tokens are charged at {GOOGLE_CACHE_READ_MULTIPLIER}x the original input token cost.

Note that the TTL is on average 3-5 minutes, but will vary. There is a minimum of {GOOGLE_CACHE_MIN_TOKENS_2_5_FLASH} tokens for Gemini 2.5 Flash, and {GOOGLE_CACHE_MIN_TOKENS_2_5_PRO} tokens for Gemini 2.5 Pro, for requests to be eligible for caching.

[Official announcement from Google](https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/)

To maximize implicit cache hits, keep the initial portion of your message arrays consistent between requests. Push variations (such as user questions or dynamic context elements) toward the end of your prompt/requests.

### Pricing Changes for Cached Requests:

* **Cache Writes:** Charged at the input token cost plus 5 minutes of cache storage, calculated as follows:

  ```
  Cache write cost = Input token price + (Cache storage price × (5 minutes / 60 minutes))
  ```

* **Cache Reads:** Charged at {GOOGLE_CACHE_READ_MULTIPLIER}× the original input token cost.

### Supported Models and Limitations:

Only certain Gemini models support caching. Please consult Google's [Gemini API Pricing Documentation](https://ai.google.dev/gemini-api/docs/pricing) for the most current details.

Cache writes have a 5-minute Time-to-Live (TTL) that does not update. After 5 minutes, the cache expires and a new cache must be written.

Gemini models typically have a 4096-token minimum for a cache write to occur. Cached tokens count towards the model's maximum token usage. Gemini 2.5 Pro has a minimum of {GOOGLE_CACHE_MIN_TOKENS_2_5_PRO} tokens, and Gemini 2.5 Flash has a minimum of {GOOGLE_CACHE_MIN_TOKENS_2_5_FLASH} tokens.

### How Gemini Prompt Caching works on OpenRouter:

OpenRouter simplifies Gemini cache management, abstracting away complexities:

* You **do not** need to manually create, update, or delete caches.
* You **do not** need to manage cache names or TTL explicitly.

### How to Enable Gemini Prompt Caching:

Gemini caching in OpenRouter requires you to insert `cache_control` breakpoints explicitly within message content, similar to Anthropic. We recommend using caching primarily for large content pieces (such as CSV files, lengthy character cards, retrieval augmented generation (RAG) data, or extensive textual sources).

There is no limit on the number of `cache_control` breakpoints you can include in your request. OpenRouter will use only the last breakpoint for Gemini caching across normal message content.
Including multiple breakpoints is safe and can help maintain compatibility with Anthropic, but only the final one will be used for Gemini.

Gemini has a single `systemInstruction` field, and cached Gemini content treats that `systemInstruction` as immutable. On OpenRouter, this means `cache_control` inside the first `system` or `developer` message can cache the normalized system prompt, but it cannot preserve an uncached dynamic tail inside that same message. If you need part of your prompt to stay dynamic, move that dynamic content into a later `user` message instead of appending it after a cached block in the first `system` message.

### Examples:

#### System Message Caching Example

```json
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a historian studying the fall of the Roman Empire. Below is an extensive reference book:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY HERE",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What triggered the collapse?"
        }
      ]
    }
  ]
}
```

This pattern works when the cached system content is stable across requests. If you need a dynamic prompt segment, place it in a later `user` message rather than as uncached trailing content in the first `system` message.

#### User Message Caching Example

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Based on the book text below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY HERE",
          "cache_control": { "type": "ephemeral" }
        },
        {
          "type": "text",
          "text": "List all main characters mentioned in the text above."
        }
      ]
    }
  ]
}
```

# Uptime Optimization

OpenRouter continuously monitors the health and availability of AI providers to ensure maximum uptime for your applications, and routes requests based on this real-time feedback.

## How It Works

OpenRouter tracks response times, error rates, and availability across all providers in real-time. This data helps us make intelligent routing decisions and provides transparency about service reliability.

## Uptime Example: Claude 4 Sonnet

## Uptime Example: Llama 3.3 70B Instruct

## Customizing Provider Selection

While our smart routing helps maintain high availability, you can also customize provider selection using request parameters. This gives you control over which providers handle your requests while still benefiting from automatic fallback when needed.

Learn more about customizing provider selection in our [Provider Routing documentation](/docs/guides/routing/provider-selection).

# Reasoning Tokens

For models that support it, the OpenRouter API can return **Reasoning Tokens**, also known as thinking tokens. OpenRouter normalizes the different ways of customizing the amount of reasoning tokens that the model will use, providing a unified interface across different providers.

Reasoning tokens provide a transparent look into the reasoning steps taken by a model. Reasoning tokens are considered output tokens and are charged accordingly. They are included in the response by default if the model decides to output them, and appear in the `reasoning` field of each message unless you choose to exclude them.

While most models and providers make reasoning tokens available in the response, some (like the OpenAI o-series) do not.
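As a small sketch of what this looks like in practice (assuming the standard OpenRouter chat completions endpoint; the model slug is one used elsewhere in these docs, and the `reasoning` parameter is covered in detail in the next section):

```typescript
// Sketch: enable reasoning with default settings and read the reasoning text
// back from the message, when the provider returns it.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.5",
    messages: [{ role: "user", content: "Which is larger: 9.11 or 9.9?" }],
    reasoning: { enabled: true },
  }),
});

const data = await res.json();
const message = data.choices[0].message;
console.log("Reasoning:", message.reasoning); // may be absent for models that withhold it
console.log("Answer:", message.content);
```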
## Controlling Reasoning Tokens You can control reasoning tokens in your requests using the `reasoning` parameter: ```json { "model": "your-model", "messages": [], "reasoning": { // One of the following (not both): "effort": "high", // Can be "xhigh", "high", "medium", "low", "minimal" or "none" (OpenAI-style) "max_tokens": 2000, // Specific token limit (Anthropic-style) // Optional: Default is false. All models support this. "exclude": false, // Set to true to exclude reasoning tokens from response // Or enable reasoning with the default parameters: "enabled": true // Default: inferred from `effort` or `max_tokens` } } ``` The `reasoning` config object consolidates settings for controlling reasoning strength across different models. See the Note for each option below to see which models are supported and how other models will behave. ### Max Tokens for Reasoning Currently supported by:
* Gemini thinking models
* Anthropic reasoning models (via the `reasoning.max_tokens` parameter)
* Some Alibaba Qwen thinking models (mapped to `thinking_budget`)

For Alibaba, support varies by model — please check the individual model descriptions to confirm whether `reasoning.max_tokens` (via `thinking_budget`) is available.
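For example, a request that sets an explicit reasoning budget on an Anthropic model might look like the sketch below (assumptions: the standard OpenRouter chat completions endpoint, a model slug borrowed from the caching examples above, and an overall `max_tokens` large enough to leave room for the final answer):

```typescript
// Sketch: reserve up to ~2000 tokens for reasoning on an Anthropic model.
// On models that only support effort levels, OpenRouter derives an effort level
// from this budget instead (see "Reasoning Effort Level" below).
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.6",
    messages: [
      { role: "user", content: "Outline a three-step plan to refactor a legacy module." },
    ],
    max_tokens: 4000,                // must be higher than the reasoning budget
    reasoning: { max_tokens: 2000 }, // explicit reasoning token budget
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);
```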
For models that support reasoning token allocation, you can control it like this: * `"max_tokens": 2000` - Directly specifies the maximum number of tokens to use for reasoning For models that only support `reasoning.effort` (see below), the `max_tokens` value will be used to determine the effort level. ### Reasoning Effort Level Currently supported by OpenAI reasoning models (o1 series, o3 series, GPT-5 series) and Grok models * `"effort": "xhigh"` - Allocates the largest portion of tokens for reasoning (approximately 95% of max\_tokens) * `"effort": "high"` - Allocates a large portion of tokens for reasoning (approximately 80% of max\_tokens) * `"effort": "medium"` - Allocates a moderate portion of tokens (approximately 50% of max\_tokens) * `"effort": "low"` - Allocates a smaller portion of tokens (approximately 20% of max\_tokens) * `"effort": "minimal"` - Allocates an even smaller portion of tokens (approximately 10% of max\_tokens) * `"effort": "none"` - Disables reasoning entirely For models that only support `reasoning.max_tokens`, the effort level will be set based on the percentages above. ### Excluding Reasoning Tokens If you want the model to use reasoning internally but not include it in the response: * `"exclude": true` - The model will still use reasoning, but it won't be returned in the response Reasoning tokens will appear in the `reasoning` field of each message. ### Enable Reasoning with Default Config To enable reasoning with the default parameters: * `"enabled": true` - Enables reasoning at the "medium" effort level with no exclusions. ### Examples #### Basic Usage with Reasoning Tokens #### Using Max Tokens for Reasoning For models that support direct token allocation (like Anthropic models), you can specify the exact number of tokens to use for reasoning: #### Excluding Reasoning Tokens from Response If you want the model to use reasoning internally but not include it in the response: #### Advanced Usage: Reasoning Chain-of-Thought This example shows how to use reasoning tokens in a more complex workflow. It injects one model's reasoning into another model to improve its response quality: ## Preserving Reasoning To preserve reasoning context across multiple turns, you can pass it back to the API in one of two ways: 1. **`message.reasoning`** (string): Pass the plaintext reasoning as a string field on the assistant message 2. **`message.reasoning_details`** (array): Pass the full reasoning\_details block Use `reasoning_details` when working with models that return special reasoning types (such as encrypted or summarized) - this preserves the full structure needed for those models. For models that only return raw reasoning strings, you can use the simpler `reasoning` field. You can also use `reasoning_content` as an alias - it functions identically to `reasoning`. Preserving reasoning is currently supported by these proprietary models:
* All OpenAI reasoning models (o1 series, o3 series, GPT-5 series and newer)
* All Anthropic reasoning models (Claude 3.7 series and newer)
* All Gemini reasoning models
* All xAI reasoning models

And these open source models:

* Alibaba: Qwen3.5 and newer
* MiniMax: MiniMax M2 and newer
* MoonShot: Kimi K2 Thinking and newer
* NVIDIA: Nemotron 3 Nano and newer
* Prime Intellect: INTELLECT-3
* Xiaomi: MiMo-V2-Flash and newer
* Z.ai: GLM 4.5 and newer

Note: Z.ai models support standard interleaved thinking only; the preserved thinking feature is currently not supported.
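For example, a follow-up turn that echoes the previous assistant message with its `reasoning_details` intact might look like the sketch below (the prompts are placeholders; the key point is that the assistant message is passed back unmodified and in order):

```typescript
// Sketch: preserve reasoning across turns by passing the assistant message back
// exactly as it was returned, including its reasoning_details array.
const endpoint = "https://openrouter.ai/api/v1/chat/completions";
const headers = {
  Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
  "Content-Type": "application/json",
};

const firstTurn = await fetch(endpoint, {
  method: "POST",
  headers,
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.5",
    messages: [{ role: "user", content: "Compare merge sort and quicksort, then pick one." }],
    reasoning: { max_tokens: 2000 },
  }),
}).then((r) => r.json());

// Keep content and reasoning_details exactly as returned; do not reorder or edit them.
const assistantMessage = firstTurn.choices[0].message;

const secondTurn = await fetch(endpoint, {
  method: "POST",
  headers,
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.5",
    messages: [
      { role: "user", content: "Compare merge sort and quicksort, then pick one." },
      assistantMessage,
      { role: "user", content: "Now justify that choice for very small inputs." },
    ],
    reasoning: { max_tokens: 2000 },
  }),
}).then((r) => r.json());

console.log(secondTurn.choices[0].message.content);
```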
The `reasoning_details` functionality works identically across all supported reasoning models. You can easily switch between OpenAI reasoning models (like `openai/gpt-5.2`) and Anthropic reasoning models (like `anthropic/claude-sonnet-4.5`) without changing your code structure.

Preserving reasoning blocks is useful specifically for tool calling. When a model like Claude invokes a tool, it pauses construction of its response to await external information. When tool results are returned, the model continues building that existing response. This necessitates preserving reasoning blocks during tool use, for a couple of reasons:

**Reasoning continuity**: The reasoning blocks capture the model's step-by-step reasoning that led to tool requests. When you post tool results, including the original reasoning ensures the model can continue its reasoning from where it left off.

**Context maintenance**: While tool results appear as user messages in the API structure, they're part of a continuous reasoning flow. Preserving reasoning blocks maintains this conceptual flow across multiple API calls.

When providing `reasoning_details` blocks, the entire sequence of consecutive reasoning blocks must match the outputs generated by the model during the original request; you cannot rearrange or modify the sequence of these blocks.

### Example: Preserving Reasoning Blocks with OpenRouter and Claude

For more detailed information about thinking encryption, redacted blocks, and advanced use cases, see [Anthropic's documentation on extended thinking](https://docs.anthropic.com/en/docs/build-with-claude/tool-use#extended-thinking). For more information about OpenAI reasoning models, see [OpenAI's reasoning documentation](https://platform.openai.com/docs/guides/reasoning#keeping-reasoning-items-in-context).

## Reasoning Details API Shape

When reasoning models generate responses, the reasoning information is structured in a standardized format through the `reasoning_details` array. This section documents the API response structure for reasoning details in both streaming and non-streaming responses.

### reasoning\_details Array Structure

The `reasoning_details` field contains an array of reasoning detail objects. Each object in the array represents a specific piece of reasoning information and follows one of three possible types. The location of this array differs between streaming and non-streaming responses.

* **Non-streaming responses**: `reasoning_details` appears in `choices[].message.reasoning_details`
* **Streaming responses**: `reasoning_details` appears in `choices[].delta.reasoning_details` for each chunk

#### Common Fields

All reasoning detail objects share these common fields:

* `id` (string | null): Unique identifier for the reasoning detail
* `format` (string): The format of the reasoning detail, with possible values:
  * `"unknown"` - Format is not specified
  * `"openai-responses-v1"` - OpenAI responses format version 1
  * `"azure-openai-responses-v1"` - Azure OpenAI responses format version 1
  * `"xai-responses-v1"` - xAI responses format version 1
  * `"anthropic-claude-v1"` - Anthropic Claude format version 1 (default)
  * `"google-gemini-v1"` - Google Gemini format version 1
* `index` (number, optional): Sequential index of the reasoning detail

#### Reasoning Detail Types

**1.
Summary Type (`reasoning.summary`)** Contains a high-level summary of the reasoning process: ```json { "type": "reasoning.summary", "summary": "The model analyzed the problem by first identifying key constraints, then evaluating possible solutions...", "id": "reasoning-summary-1", "format": "anthropic-claude-v1", "index": 0 } ``` **2. Encrypted Type (`reasoning.encrypted`)** Contains encrypted reasoning data that may be redacted or protected: ```json { "type": "reasoning.encrypted", "data": "eyJlbmNyeXB0ZWQiOiJ0cnVlIiwiY29udGVudCI6IltSRURBQ1RFRF0ifQ==", "id": "reasoning-encrypted-1", "format": "anthropic-claude-v1", "index": 1 } ``` **3. Text Type (`reasoning.text`)** Contains raw text reasoning with optional signature verification: ```json { "type": "reasoning.text", "text": "Let me think through this step by step:\n1. First, I need to understand the user's question...", "signature": "sha256:abc123def456...", "id": "reasoning-text-1", "format": "anthropic-claude-v1", "index": 2 } ``` ### Response Examples #### Non-Streaming Response In non-streaming responses, `reasoning_details` appears in the message: ```json { "choices": [ { "message": { "role": "assistant", "content": "Based on my analysis, I recommend the following approach...", "reasoning_details": [ { "type": "reasoning.summary", "summary": "Analyzed the problem by breaking it into components", "id": "reasoning-summary-1", "format": "anthropic-claude-v1", "index": 0 }, { "type": "reasoning.text", "text": "Let me work through this systematically:\n1. First consideration...\n2. Second consideration...", "signature": null, "id": "reasoning-text-1", "format": "anthropic-claude-v1", "index": 1 } ] } } ] } ``` #### Streaming Response In streaming responses, `reasoning_details` appears in delta chunks as the reasoning is generated: ```json { "choices": [ { "delta": { "reasoning_details": [ { "type": "reasoning.text", "text": "Let me think about this step by step...", "signature": null, "id": "reasoning-text-1", "format": "anthropic-claude-v1", "index": 0 } ] } } ] } ``` **Streaming Behavior Notes:** * Each reasoning detail chunk is sent as it becomes available * The `reasoning_details` array in each chunk may contain one or more reasoning objects * For encrypted reasoning, the content may appear as `[REDACTED]` in streaming responses * The complete reasoning sequence is built by concatenating all chunks in order ## Legacy Parameters For backward compatibility, OpenRouter still supports the following legacy parameters: * `include_reasoning: true` - Equivalent to `reasoning: {}` * `include_reasoning: false` - Equivalent to `reasoning: { exclude: true }` However, we recommend using the new unified `reasoning` parameter for better control and future compatibility. ## Provider-Specific Reasoning Implementation ### Anthropic Models with Reasoning Tokens The latest Claude models, such as [anthropic/claude-3.7-sonnet](https://openrouter.ai/anthropic/claude-3.7-sonnet), support working with and returning reasoning tokens. You can enable reasoning on Anthropic models **only** using the unified `reasoning` parameter with either `effort` or `max_tokens`. **Note:** The `:thinking` variant is no longer supported for Anthropic models. Use the `reasoning` parameter instead. #### Reasoning Max Tokens for Anthropic Models When using Anthropic models with reasoning: * When using the `reasoning.max_tokens` parameter, that value is used directly with a minimum of 1024 tokens. 
* When using the `reasoning.effort` parameter, the `budget_tokens` are calculated based on the `max_tokens` value. The reasoning token allocation is capped at 128,000 tokens maximum and 1024 tokens minimum. The formula for calculating the `budget_tokens` is:

  `budget_tokens = max(min(max_tokens * {effort_ratio}, 128000), 1024)`

  The `effort_ratio` is 0.95 for xhigh effort, 0.8 for high effort, 0.5 for medium effort, 0.2 for low effort, and 0.1 for minimal effort.

**Important**: `max_tokens` must be strictly higher than the reasoning budget to ensure there are tokens available for the final response after thinking.

Please note that reasoning tokens are counted as output tokens for billing purposes. Using reasoning tokens will increase your token usage but can significantly improve the quality of model responses.

#### Example: Streaming with Anthropic Reasoning Tokens

### Google Gemini 3 Models with Thinking Levels

Gemini 3 models (such as [google/gemini-3.1-pro-preview](https://openrouter.ai/google/gemini-3.1-pro-preview) and [google/gemini-3-flash-preview](https://openrouter.ai/google/gemini-3-flash-preview)) use Google's `thinkingLevel` API instead of the older `thinkingBudget` API used by Gemini 2.5 models.

OpenRouter maps the `reasoning.effort` parameter directly to Google's `thinkingLevel` values:

| OpenRouter `reasoning.effort` | Google `thinkingLevel` |
| ----------------------------- | ---------------------- |
| `"minimal"`                   | `"minimal"`            |
| `"low"`                       | `"low"`                |
| `"medium"`                    | `"medium"`             |
| `"high"`                      | `"high"`               |
| `"xhigh"`                     | `"high"` (mapped down) |

When using `thinkingLevel`, the actual number of reasoning tokens consumed is determined internally by Google. There are no publicly documented token limit breakpoints for each level. For example, setting `effort: "low"` might result in several hundred reasoning tokens depending on the complexity of the task. This is expected behavior and reflects how Google implements thinking levels internally.

If a model doesn't support a specific effort level (for example, if a model only supports `low` and `high`), OpenRouter will map your requested effort to the nearest supported level.

#### Using max\_tokens with Gemini 3

If you specify `reasoning.max_tokens` explicitly, OpenRouter will pass it through as `thinkingBudget` to Google's API. However, for Gemini 3 models, Google internally maps this budget value to a `thinkingLevel`, so you will not get precise token control. The actual token consumption is still determined by Google's `thinkingLevel` implementation, not by the specific budget value you provide.

#### Example: Using Thinking Levels with Gemini 3
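A minimal sketch (assuming the standard OpenRouter chat completions endpoint; the prompt is illustrative):

```typescript
// Sketch: reasoning.effort is mapped to Google's thinkingLevel for Gemini 3 models.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "google/gemini-3-flash-preview",
    messages: [{ role: "user", content: "Outline a migration plan from REST to gRPC." }],
    reasoning: { effort: "low" }, // sent to Google as thinkingLevel: "low"
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);
// Token consumption for the chosen level is determined internally by Google.
console.log(data.usage);
```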