Gemini 2.5 Flash API - Pricing, Quickstart & Provider Comparison
OpenRouter
What Is Gemini 2.5 Flash?
Gemini 2.5 Flash is Google’s primary model for high-volume, latency-sensitive tasks that require reasoning. It’s the first Flash-class model with built-in thinking, a hybrid reasoning mode you can toggle on or off at will. That distinction makes it meaningfully different from 2.0 Flash and worth evaluating against models that cost significantly more.
Key Capabilities
Gemini 2.5 Flash accepts text, code, images, audio, video, and documents as input. Two constraints apply to document inputs in production: the maximum file size is 50MB per document (larger files must be split into sub-50MB chunks before submission), and the only supported document MIME types are application/pdf and text/plain.
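The two document constraints above can be checked before a file is ever uploaded. The following is a minimal sketch, not part of any SDK: check_document, MAX_DOCUMENT_BYTES, and SUPPORTED_DOCUMENT_MIME_TYPES are hypothetical names, the MIME type is guessed from the file extension only, and reading "50MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
import mimetypes
import os
from typing import List, Optional

# Limits quoted in this guide. Treating "50MB" as binary megabytes is an assumption.
MAX_DOCUMENT_BYTES = 50 * 1024 * 1024
SUPPORTED_DOCUMENT_MIME_TYPES = {"application/pdf", "text/plain"}


def check_document(path: str, size_bytes: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []

    # Guess the MIME type from the file extension (the content is not inspected).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type not in SUPPORTED_DOCUMENT_MIME_TYPES:
        problems.append(f"unsupported MIME type: {mime_type}")

    # Fall back to the size on disk when the caller does not supply one.
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_DOCUMENT_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_DOCUMENT_BYTES}")

    return problems
```

Returning a list rather than raising on the first failure lets a caller report both problems at once, for example an oversized file that is also in an unsupported format.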
What it does not support: audio generation, image generation, and the Live API. If you need image generation, use Gemini 2.5 Flash Image, which is a separate model.
What “Thinking” Means in Practice
The thinking budget is a parameter that controls how much internal reasoning the model performs before generating a response. Thinking is built into the model and runs at inference time. Setting the budget to 0 disables it entirely, producing the fastest and cheapest output. Setting it to -1 enables dynamic mode, where the model adjusts reasoning depth based on prompt complexity. On Google’s direct API, -1 is the default. Via OpenRouter, thinking is off unless you explicitly request it (see Configuring via OpenRouter below). Higher fixed budgets increase output quality on complex tasks at the cost of additional latency and token spend; thinking tokens are billed at the output rate.
Gemini 2.5 Flash API Pricing
The table below shows per-million-token rates across the three access methods. Pricing data is sourced from ai.google.dev/gemini-api/docs/pricing and openrouter.ai/google/gemini-2.5-flash. Verify OpenRouter and Vertex AI numbers against their live pages before relying on them; rates change without notice.
Verification date: May 2026
| Provider | Input $/1M | Output $/1M (incl. thinking) | Cache Read | Cache Storage | Audio Input |
|---|---|---|---|---|---|
| Google AI Studio (paid) | $0.30 | $2.50 | $0.03 | $1.00/M/hr | $1.00 |
| Vertex AI | See Vertex AI pricing | See Vertex AI pricing | See Vertex AI pricing | See Vertex AI pricing | See Vertex AI pricing |
| OpenRouter | $0.30 | $2.50 | $0.03 | Verify on live page | $1.00 |
Google AI Studio’s paid tier and OpenRouter carry the same per-token rates for text input and output as of May 2026. Where they differ is in what wraps the API call.
OpenRouter sits between your code and 3 Google providers (AI Studio, Vertex Global, Vertex). If one goes down, your requests reroute to a healthy one. No code changes.
Your integration isn’t welded to Gemini. Change the model string and you’re calling Claude, GPT-4o, Llama, or any of 300+ models. Same base URL, same SDK, same API key. Swap models in seconds without rewriting your client.
Billing collapses into one dashboard: one invoice, one API key, across every model and provider. No juggling separate accounts with Google, Anthropic, and OpenAI.
For teams shipping to production, OpenRouter layers on enterprise controls (provisioning, per-key spend limits, usage analytics, team management). Guardrails and content filtering are configurable per request, so you can enforce safety policies without building your own moderation stack. Prompt logging and observability come baked into the dashboard for debugging production traffic.
OpenRouter charges a 5.5% platform fee on pay-as-you-go (PAYG) credit purchases. That covers the failover, routing, billing, and tooling above. Google AI Studio is the direct path with no intermediary fee, but you’re on your own for failover, model portability, and cross-provider billing. Vertex AI pricing differs; check the Vertex AI pricing page for current rates before plugging them into production cost estimates.
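To see what the platform fee means for a budget, the arithmetic can be sketched as below. This assumes the 5.5% is added on top of the credit amount at purchase time, which is an interpretation rather than something this guide confirms; purchase_total and effective_rate are hypothetical helper names, and OpenRouter's own pricing page is the place to confirm how the fee is actually applied.

```python
# Fee quoted in this guide for pay-as-you-go credit purchases.
PLATFORM_FEE_RATE = 0.055


def purchase_total(credits_usd: float) -> float:
    """Amount charged for a credit purchase, ASSUMING the fee is added on top."""
    return credits_usd * (1 + PLATFORM_FEE_RATE)


def effective_rate(list_rate_per_million: float) -> float:
    """Per-million-token rate once the fee is folded in, under the same assumption."""
    return list_rate_per_million * (1 + PLATFORM_FEE_RATE)


if __name__ == "__main__":
    # List rates from the pricing table above: $0.30 input, $2.50 output.
    print(f"Cost of $100 in credits: ${purchase_total(100):.2f}")
    print(f"Effective input rate:   ${effective_rate(0.30):.4f} per 1M tokens")
    print(f"Effective output rate:  ${effective_rate(2.50):.4f} per 1M tokens")
```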
For real-time Gemini 2.5 Flash pricing and uptime across providers, including live cache rates and effective pricing by provider, see the OpenRouter model page. For caching strategies that reduce repeated context costs, see cache pricing details.
Thinking Token Billing
Thinking tokens are billed at the same rate as output tokens. At budget 0, there is no thinking cost. At the maximum budget (24,576 tokens), thinking overhead can exceed the cost of the visible response itself. To estimate the cost for a given workload, multiply your expected thinking tokens by the output rate and add them to your standard output token cost.
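The estimate described above can be written as a small helper. This is a sketch using the Google AI Studio paid rates from the pricing table ($0.30 input, $2.50 output per million tokens); estimate_request_cost is a hypothetical name, and the token counts in the usage lines are illustrative, not measurements.

```python
# Per-million-token rates in USD, taken from the pricing table in this guide.
INPUT_RATE_PER_MILLION = 0.30
OUTPUT_RATE_PER_MILLION = 2.50


def estimate_request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Thinking tokens are billed at the output rate, so they are added to the
    visible output tokens before the output rate is applied.
    """
    input_cost = input_tokens / 1_000_000 * INPUT_RATE_PER_MILLION
    output_cost = (output_tokens + thinking_tokens) / 1_000_000 * OUTPUT_RATE_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Same prompt and visible response, with thinking off versus at the maximum budget.
    without_thinking = estimate_request_cost(2_000, 500)
    with_max_thinking = estimate_request_cost(2_000, 500, thinking_tokens=24_576)
    print(f"budget 0:      ${without_thinking:.5f}")
    print(f"budget 24,576: ${with_max_thinking:.5f}")
```

The second figure is an upper bound for that request shape: the budget caps thinking tokens, so actual usage may come in lower.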
Free Access Options
Google AI Studio provides a free tier with rate limits. On the free tier, your prompts and responses are used to improve Google’s products; see the terms of service for the full data usage policy. If your use case involves user data or requires data not to be used for model training, you must use the paid tier.
OpenRouter does not include Gemini 2.5 Flash in its free tier. A minimum $5 credit balance is required.
Vertex AI provides $300 in trial credits for new Google Cloud accounts, which can be applied toward Gemini 2.5 Flash usage during the evaluation.
API Quickstart: First Request in Under 5 Minutes
The OpenRouter path requires no Google Cloud account and works with any OpenAI-compatible SDK. The Google direct path requires a Google account and the google-genai SDK. For additional SDK examples and configuration options, see the OpenRouter quickstart.
Step 1: Get Your API Key
OpenRouter path: Get your OpenRouter API key. No Google Cloud account required.
Google direct path: Get a key at aistudio.google.com/apikey.
Step 2: Set the Base URL (OpenRouter Path)
The OpenRouter base URL is https://openrouter.ai/api/v1. All three code examples below use this endpoint.
Step 3: Make Your First Request
cURL:
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer <your-openrouter-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Explain the difference between attention mechanisms in transformers."}]
  }'
Python (OpenAI SDK):
from openai import OpenAI

# Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<your-openrouter-key>",
)

response = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Explain the difference between attention mechanisms in transformers."}],
)
print(response.choices[0].message.content)
TypeScript (OpenAI SDK):
import OpenAI from "openai";

// Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: "<your-openrouter-key>",
});

const response = await client.chat.completions.create({
  model: "google/gemini-2.5-flash",
  messages: [{ role: "user", content: "Explain the difference between attention mechanisms in transformers." }],
});
console.log(response.choices[0].message.content);
Google Direct Path
If you already have a Google AI Studio API key and prefer the direct path with no intermediary:
from google import genai

# The google-genai client talks to Google AI Studio directly.
client = genai.Client(api_key="<your-google-api-key>")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain the difference between attention mechanisms in transformers.",
)
print(response.text)
The direct path uses the google-genai SDK, which is not OpenAI-compatible. Switching from OpenRouter to the direct path requires changing both your client library and request structure. There is no provider failover on the direct path.
Thinking Budget: Control Reasoning Quality and Cost
The thinking budget is the most important configuration decision you’ll make with this model. Set it wrong and you either overpay for reasoning you don’t need or leave accuracy on the table for tasks that require it. For the full parameter reference, see configure the thinking budget.
Budget Levels and Trade-offs
Set the thinkingBudget parameter in your request config. The range is 0 to 24,576 tokens.
Budget 0: Thinking disabled. Fastest response, lowest cost, no reasoning overhead. Use for high-volume classification, extraction, and summarization where structured reasoning is unnecessary.
Budget -1 (dynamic): The model auto-selects its reasoning depth based on prompt complexity. This is the default on Google’s direct API. Via OpenRouter, you must explicitly set reasoning.max_tokens to -1 to get dynamic mode; omitting the reasoning config disables thinking. Recommended for most workloads that need reasoning: it avoids paying for heavy reasoning on simple prompts while engaging it when the task requires it.
Budget 1,024 to 8,192: Moderate to heavy reasoning. Use for multi-step analysis, structured coding tasks, and research-style questions.
Budget 24,576 (maximum): Maximum reasoning depth, maximum cost. Use for complex math, scientific problems, and hard coding challenges where accuracy justifies the overhead.
Critical Constraints
Two constraints will produce errors in production if you aren’t aware of them before writing your first request:
- thinkingBudget and thinkingLevel cannot be used in the same request. thinkingBudget is for Gemini 2.5 series models; thinkingLevel is for Gemini 3 series models. Using both returns a 400 error.
- Structured JSON output and Search Grounding are mutually exclusive. You cannot enable both in the same request.
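Both constraints can be caught client-side before a request is sent. The sketch below assumes your application assembles its settings in a plain dict; find_config_conflicts is a hypothetical helper, and the structured_json_output and search_grounding keys are placeholder flag names, not real API fields.

```python
from typing import List


def find_config_conflicts(config: dict) -> List[str]:
    """Return human-readable conflicts; an empty list means none were found."""
    conflicts = []

    # thinkingBudget (Gemini 2.5 series) and thinkingLevel (Gemini 3 series)
    # must not appear together in one request.
    if "thinkingBudget" in config and "thinkingLevel" in config:
        conflicts.append("thinkingBudget and thinkingLevel cannot be combined")

    # Placeholder flags standing in for however your code enables these features.
    if config.get("structured_json_output") and config.get("search_grounding"):
        conflicts.append("structured JSON output and Search Grounding are mutually exclusive")

    return conflicts


if __name__ == "__main__":
    bad = {"thinkingBudget": 1024, "thinkingLevel": "high"}
    for problem in find_config_conflicts(bad):
        print("config error:", problem)
```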
Configuring via OpenRouter
Use the extra_body parameter with the reasoning key to set the thinking budget through OpenRouter’s API:
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<your-openrouter-key>",
)

response = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Solve this step by step: if f(x) = 3x^2 + 2x - 5, find all roots."}],
    # extra_body passes OpenRouter-specific fields through the OpenAI SDK.
    extra_body={"reasoning": {"max_tokens": 8192}},
)
print(response.choices[0].message.content)
To disable thinking entirely, set reasoning.max_tokens to 0. To use dynamic mode, set reasoning.max_tokens to -1.
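One way to keep budget values consistent across a codebase is to build the extra_body payload in one place and reject out-of-range values early. The following is a sketch under the limits stated in this guide (0 to 24,576, or -1 for dynamic); reasoning_extra_body and BUDGET_PRESETS are hypothetical names, and the preset values are illustrative picks from the budget levels described earlier.

```python
MAX_THINKING_BUDGET = 24_576
DYNAMIC_BUDGET = -1

# Illustrative presets based on the budget levels described in this guide.
BUDGET_PRESETS = {
    "classification": 0,             # thinking disabled
    "general": DYNAMIC_BUDGET,       # model picks its own reasoning depth
    "analysis": 8_192,               # moderate to heavy reasoning
    "hard_math": MAX_THINKING_BUDGET,
}


def reasoning_extra_body(budget: int) -> dict:
    """Build the extra_body dict for an OpenRouter request with a thinking budget."""
    if budget != DYNAMIC_BUDGET and not (0 <= budget <= MAX_THINKING_BUDGET):
        raise ValueError(
            f"thinking budget must be -1 or between 0 and {MAX_THINKING_BUDGET}, got {budget}"
        )
    return {"reasoning": {"max_tokens": budget}}


if __name__ == "__main__":
    # Pass the result as extra_body=... to client.chat.completions.create(...).
    print(reasoning_extra_body(BUDGET_PRESETS["analysis"]))
```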
Cross-Provider Performance
OpenRouter routes Gemini 2.5 Flash through three Google providers and tracks real-time throughput, Time to First Token (TTFT), end-to-end latency, and uptime for each. The differences between providers are significant enough to affect the choice of provider for latency-sensitive workloads.
All numbers below require live verification against openrouter.ai/google/gemini-2.5-flash.
Performance by Provider
Source: OpenRouter live model page.
| Provider | Avg Throughput | Avg TTFT | Avg E2E Latency | Uptime |
|---|---|---|---|---|
| Google Vertex (Global) | ~75 tok/s | ~0.63s | Verify on live page | Verify on live page |
| Google AI Studio | Verify on live page | Verify on live page | Verify on live page | Verify on live page |
| Google Vertex | Verify on live page | Verify on live page | Verify on live page | Verify on live page |
The Vertex Global provider shows the highest throughput in recent data. AI Studio historically shows the best uptime. Standard Vertex shows the highest latency of the three. When you route through OpenRouter without specifying a provider, it automatically distributes traffic to the healthiest option based on real-time signals.
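If automatic routing is not what you want, a request can express a provider preference. The sketch below builds a payload using the provider object from OpenRouter's provider routing controls (order and allow_fallbacks); provider_extra_body is a hypothetical helper, and the provider slug in the usage line is an assumption that should be checked against the live model page before use.

```python
from typing import Sequence


def provider_extra_body(order: Sequence[str], allow_fallbacks: bool = True) -> dict:
    """Build an extra_body dict that asks OpenRouter to try providers in a given order."""
    if not order:
        raise ValueError("order must list at least one provider slug")
    return {
        "provider": {
            "order": list(order),
            # With fallbacks disabled, the request fails instead of rerouting
            # when the listed providers are unavailable.
            "allow_fallbacks": allow_fallbacks,
        }
    }


if __name__ == "__main__":
    # "google-ai-studio" is an assumed slug; confirm the exact value on OpenRouter.
    # Pass the result as extra_body=... to client.chat.completions.create(...).
    print(provider_extra_body(["google-ai-studio"], allow_fallbacks=False))
```

Leaving allow_fallbacks enabled keeps the failover behavior described earlier while still expressing a preference; disabling it trades resilience for predictability.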
For real-time Gemini 2.5 Flash pricing and uptime, see the OpenRouter model page.
Gemini 2.5 Flash vs Flash Lite vs Pro
Choose based on your workload requirements:
Use Gemini 2.5 Flash for most agentic and reasoning workloads. It’s the default recommendation when you need thinking capability without incurring Pro-level costs.
Use Gemini 2.5 Flash Lite for high-volume classification, extraction, or translation tasks where thinking isn’t required and cost per request is the primary constraint. Thinking is disabled by default on Flash Lite.
Use Gemini 2.5 Pro for complex reasoning tasks where accuracy justifies a 5 to 10x cost premium over Flash: frontier mathematics, hard coding challenges, and multi-step scientific analysis.
Technical Specifications
The table below summarizes the key specifications for Gemini 2.5 Flash. For the authoritative version, see the Google AI for Developers model page (updated 2026-04-01) and the Vertex AI docs (updated 2026-04-03).
| Property | Value |
|---|---|
| Model ID | gemini-2.5-flash |
| OpenRouter model string | google/gemini-2.5-flash |
| Context window | 1,048,576 tokens |
| Max output | 65,536 tokens |
| Input types | Text, images, video, audio, code, documents (PDF and text/plain only, 50MB max) |
| Output types | Text |
| Thinking budget range | 0 to 24,576 tokens (default: dynamic / -1) |
| Knowledge cutoff | January 2025 |
| GA release | June 17, 2025 |
| Discontinuation | October 16, 2026 |
| Supported capabilities | Function calling, structured outputs, code execution, Search Grounding, Batch API, context caching (implicit and explicit), file search, URL context |
| Not supported | Audio generation, image generation, Live API, thinkingLevel parameter |
Deprecation notice: Gemini 2.5 Flash is scheduled for discontinuation on October 16, 2026, on Vertex AI. If you’re building for production use cases that extend beyond that date, plan a migration to a successor model and monitor ai.google.dev/gemini-api/docs/models for updates.
Frequently Asked Questions
Is Gemini 2.5 Flash free to use?
Google AI Studio provides a free tier with rate limits. On the free tier, your prompts and responses are used to improve Google’s products; see the terms of service before using it with user data. OpenRouter does not include Gemini 2.5 Flash in its free tier; a minimum $5 credit balance is required. Vertex AI provides $300 in trial credits for new Google Cloud accounts.
What is the thinking budget in Gemini 2.5 Flash?
The thinkingBudget parameter (range: 0 to 24,576 tokens, or -1 for dynamic) controls how much internal reasoning the model performs before responding. Budget 0 disables thinking: fastest and cheapest. Budget -1 enables dynamic mode: the model auto-adjusts based on prompt complexity. On Google’s direct API, -1 is the default. Via OpenRouter, thinking is off unless you explicitly request it (e.g. extra_body={"reasoning": {"max_tokens": -1}} for dynamic, or any positive budget). Higher fixed budgets improve output quality on complex tasks but increase latency and cost, billed at the output token rate.
How does Gemini 2.5 Flash compare to GPT-4o?
Flash supports a 1M-token context window, versus 128K for GPT-4o, and includes configurable thinking not available in GPT-4o. Flash’s per-token pricing is lower. GPT-4o has broader third-party ecosystem support and a longer production track record. This guide does not include head-to-head benchmark results for the two models; use the OpenRouter rankings for current third-party evaluation data.
Can I use Gemini 2.5 Flash for image generation?
No. Gemini 2.5 Flash outputs text only. Image input is supported; the model can process and reason about images. For image generation, use Gemini 2.5 Flash Image, a separate model with its own pricing.
What providers serve Gemini 2.5 Flash on OpenRouter?
Three: Google AI Studio, Google Vertex Global, and Google Vertex. OpenRouter routes to the healthiest provider automatically based on real-time throughput and uptime data. You can pin to a specific provider using OpenRouter’s provider routing controls.
What is the difference between Gemini 2.5 Flash and Flash Lite?
Flash includes configurable thinking (budget 0 to 24,576) and higher-quality output. Flash Lite is optimized for ultra-low latency and cost, with thinking disabled by default (though it can be enabled). Use Flash when reasoning capability matters; use Lite for high-volume tasks where cost per request is the primary constraint.