Gemini 2.5 Flash API - Pricing, Quickstart & Provider Comparison


What Is Gemini 2.5 Flash?

Gemini 2.5 Flash is Google’s primary model for high-volume, latency-sensitive tasks that require reasoning. It’s the first Flash-class model with built-in thinking, a hybrid reasoning mode you can toggle on or off at will. That distinction makes it meaningfully different from 2.0 Flash and worth evaluating against models that cost significantly more.

Key Capabilities

Gemini 2.5 Flash accepts text, code, images, audio, video, and documents as input. Two constraints apply to document inputs in production. First, the maximum file size is 50MB per document; larger files must be split into chunks under 50MB before submission. Second, supported document MIME types are limited to application/pdf and text/plain.
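Both document constraints can be checked client-side before you upload anything. The following is a minimal sketch, not an official validator: the helper name check_document is hypothetical, and it assumes "50MB" means 50 × 1024 × 1024 bytes, which this guide does not confirm.

```python
import mimetypes

# Assumption: "50MB" is read as 50 MiB; the exact byte limit is not stated in this guide.
MAX_DOCUMENT_BYTES = 50 * 1024 * 1024
ALLOWED_DOCUMENT_MIME_TYPES = {"application/pdf", "text/plain"}


def check_document(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the document passes both checks."""
    problems = []
    # Guess the MIME type from the file extension only (no content sniffing).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type not in ALLOWED_DOCUMENT_MIME_TYPES:
        problems.append(f"unsupported MIME type: {mime_type}")
    if size_bytes > MAX_DOCUMENT_BYTES:
        problems.append(
            f"file is {size_bytes} bytes; split it into chunks under {MAX_DOCUMENT_BYTES} bytes"
        )
    return problems
```

In practice you would pass os.path.getsize(path) as size_bytes and refuse to submit any file for which the returned list is non-empty.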

What it does not support: audio generation, image generation, and the Live API. If you need image generation, use Gemini 2.5 Flash Image, which is a separate model.

What “Thinking” Means in Practice

The thinking budget is a parameter that controls how much internal reasoning the model performs before generating a response. The reasoning is built into the model and runs at inference time. Setting the budget to 0 disables it entirely, producing the fastest and cheapest output. Setting it to -1 enables dynamic mode, where the model adjusts reasoning depth based on prompt complexity. On Google’s direct API, -1 is the default. Via OpenRouter, thinking is off unless you explicitly request it (see Configuring via OpenRouter below). Higher fixed budgets increase output quality on complex tasks at the cost of additional latency and token spend; thinking tokens are billed at the output rate.

Gemini 2.5 Flash API Pricing

The table below shows per-million-token rates (USD) across the three access methods. Pricing data is sourced from ai.google.dev/gemini-api/docs/pricing and openrouter.ai/google/gemini-2.5-flash. Rates can change without notice, so verify the OpenRouter and Vertex AI numbers against their live pages before relying on them.

Verification date: May 2026

| Provider | Input $/1M | Output $/1M (incl. thinking) | Cache Read | Cache Storage | Audio Input |
| --- | --- | --- | --- | --- | --- |
| Google AI Studio (paid) | $0.30 | $2.50 | $0.03 | $1.00/1M tokens/hr | $1.00 |
| Vertex AI | See Vertex AI pricing | See Vertex AI pricing | See Vertex AI pricing | See Vertex AI pricing | See Vertex AI pricing |
| OpenRouter | $0.30 | $2.50 | $0.03 | Verify on live page | $1.00 |

Google AI Studio’s paid tier and OpenRouter carry the same per-token rates for text input and output as of May 2026. Where they differ is in what is wrapped around the API call.

OpenRouter sits between your code and three Google providers (AI Studio, Vertex Global, and Vertex). If one goes down, your requests reroute to a healthy one. No code changes.

Your integration isn’t welded to Gemini. Change the model string and you’re calling Claude, GPT-4o, Llama, or any of 300+ models. Same base URL, same SDK, same API key. Swap models in seconds without rewriting your client.

Billing collapses into one dashboard: one invoice, one API key, across every model and provider. No juggling separate accounts with Google, Anthropic, and OpenAI.

For teams shipping to production, OpenRouter layers on enterprise controls (provisioning, per-key spend limits, usage analytics, team management). Guardrails and content filtering are configurable per request, so you can enforce safety policies without building your own moderation stack. Prompt logging and observability come baked into the dashboard for debugging production traffic.

OpenRouter charges a 5.5% platform fee on pay-as-you-go (PAYG) credit purchases. That covers the failover, routing, billing, and tooling above. Google AI Studio is the direct path with no intermediary fee, but you’re on your own for failover, model portability, and cross-provider billing. Vertex AI pricing differs; check the Vertex AI pricing page for current rates before plugging them into production cost estimates.
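To see what the platform fee does to your effective rate, you can fold it into the list price. This is a rough sketch under two simplifying assumptions that are not stated above: the 5.5% is added on top of the credit amount you buy, and every credit is eventually spent on tokens. The function names are hypothetical.

```python
PLATFORM_FEE = 0.055  # 5.5% fee on PAYG credit purchases, per the text above


def cash_needed_for_credits(credits_usd: float, fee: float = PLATFORM_FEE) -> float:
    """Cash outlay needed to buy a given amount of credits, fee included."""
    return credits_usd * (1 + fee)


def effective_rate(list_rate_per_million: float, fee: float = PLATFORM_FEE) -> float:
    """Per-million-token rate once the purchase fee is spread across the credits."""
    return list_rate_per_million * (1 + fee)


# Example: cash needed for $100 of credits, and the effective output rate.
print(f"${cash_needed_for_credits(100):.2f}")
print(f"${effective_rate(2.50):.4f} per 1M output tokens")
```

Compare the effective rate against the direct Google AI Studio rate when deciding whether the failover and tooling are worth the fee for your volume.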

For real-time Gemini 2.5 Flash pricing and uptime across providers, including live cache rates and effective pricing by provider, see the OpenRouter model page. For caching strategies that reduce repeated context costs, see cache pricing details.

Thinking Token Billing

Thinking tokens are billed at the same rate as output tokens. At budget 0, there is no thinking cost. At the maximum budget (24,576 tokens), thinking overhead can exceed the cost of the visible response itself. To estimate the cost for a given workload, multiply your expected thinking tokens by the output rate and add them to your standard output token cost.
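The estimate described above can be sketched in a few lines. The rates are copied from the pricing table in this guide and should be re-verified before use; the function name and the example token counts are illustrative assumptions.

```python
# USD per 1M tokens, taken from the pricing table in this guide; verify before use.
INPUT_RATE = 0.30
OUTPUT_RATE = 2.50


def estimate_request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate one request's cost in USD; thinking tokens bill at the output rate."""
    input_cost = input_tokens / 1_000_000 * INPUT_RATE
    output_cost = (output_tokens + thinking_tokens) / 1_000_000 * OUTPUT_RATE
    return input_cost + output_cost


# Same hypothetical request with thinking disabled and with the maximum budget fully used.
without_thinking = estimate_request_cost(2_000, 500)
with_max_thinking = estimate_request_cost(2_000, 500, thinking_tokens=24_576)
print(f"{without_thinking:.5f} vs {with_max_thinking:.5f} USD")
```

With a 500-token visible response, a fully used maximum budget dominates the bill, which is why budget 0 or dynamic mode is usually the better starting point for high-volume traffic.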

Free Access Options

Google AI Studio provides a free tier with rate limits. On the free tier, your prompts and responses are used to improve Google’s products; see the terms of service for the full data usage policy. If your use case involves user data or requires data not to be used for model training, you must use the paid tier.

OpenRouter does not include Gemini 2.5 Flash in its free tier. A minimum $5 credit balance is required.

Vertex AI provides $300 in trial credits for new Google Cloud accounts, which can be applied toward Gemini 2.5 Flash usage during evaluation.

API Quickstart: First Request in Under 5 Minutes

The OpenRouter path requires no Google Cloud account and works with any OpenAI-compatible SDK. The Google direct path requires a Google account and the google-genai SDK. For additional SDK examples and configuration options, see the OpenRouter quickstart.

Step 1: Get Your API Key

OpenRouter path: Get your OpenRouter API key. No Google Cloud account required.

Google direct path: Get a key at aistudio.google.com/apikey.

Step 2: Set the Base URL (OpenRouter Path)

The OpenRouter base URL is https://openrouter.ai/api/v1. All three code examples below use this endpoint.

Step 3: Make Your First Request

cURL:

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer <your-openrouter-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Explain the difference between attention mechanisms in transformers."}]
  }'

Python (OpenAI SDK):

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<your-openrouter-key>",
)

response = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Explain the difference between attention mechanisms in transformers."}]
)

print(response.choices[0].message.content)

TypeScript (OpenAI SDK):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: "<your-openrouter-key>",
});

const response = await client.chat.completions.create({
  model: "google/gemini-2.5-flash",
  messages: [{ role: "user", content: "Explain the difference between attention mechanisms in transformers." }],
});

console.log(response.choices[0].message.content);

Google Direct Path

If you already have a Google AI Studio API key and prefer the direct path with no intermediary:

from google import genai

client = genai.Client(api_key="<your-google-api-key>")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain the difference between attention mechanisms in transformers.",
)

print(response.text)

The direct path uses the google-genai SDK, which is not OpenAI-compatible. Switching from OpenRouter to the direct path requires changing both your client library and request structure. There is no provider failover on the direct path.

Thinking Budget: Control Reasoning Quality and Cost

The thinking budget is the most important configuration decision you’ll make with this model. Set it wrong and you either overpay for reasoning you don’t need or leave accuracy on the table for tasks that require it. For the full parameter reference, see configure the thinking budget.

Budget Levels and Trade-offs

Set the thinkingBudget parameter in your request config. The range is 0 to 24,576 tokens, with -1 reserved for dynamic mode.

Budget 0: Thinking disabled. Fastest response, lowest cost, no reasoning overhead. Use for high-volume classification, extraction, and summarization where structured reasoning is unnecessary.

Budget -1 (dynamic): The model auto-selects its reasoning depth based on prompt complexity. This is the default on Google’s direct API. Via OpenRouter, you must explicitly set reasoning.max_tokens to -1 to get dynamic mode; omitting the reasoning config disables thinking. Recommended for most workloads that need reasoning; it avoids paying for heavy reasoning on simple prompts while engaging it when the task requires it.

Budget 1,024 to 8,192: Moderate to heavy reasoning. Use for multi-step analysis, structured coding tasks, and research-style questions.

Budget 24,576 (maximum): Maximum reasoning depth, maximum cost. Use for complex math, scientific problems, and hard coding challenges where accuracy justifies the overhead.
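One way to keep budget choices consistent across a codebase is to map task categories to the levels above and build the request body from that map. The sketch below targets Google's direct REST endpoint and assumes the body layout generationConfig.thinkingConfig.thinkingBudget; the task category names and the 4,096 midpoint are hypothetical choices, not recommendations from Google.

```python
# Hypothetical task categories mapped to the budget levels described above.
BUDGET_PRESETS = {
    "classification": 0,   # thinking disabled
    "general": -1,         # dynamic mode
    "analysis": 4096,      # a midpoint in the 1,024 to 8,192 band
    "hard_math": 24576,    # maximum
}


def build_generate_content_body(prompt: str, task: str) -> dict:
    """Build a JSON body for a generateContent request with a thinking budget.

    Assumption: the REST field layout is generationConfig.thinkingConfig.thinkingBudget.
    """
    budget = BUDGET_PRESETS[task]  # raises KeyError for an unknown task category
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {"thinkingConfig": {"thinkingBudget": budget}},
    }
```

Centralizing the presets makes it easy to lower a budget for a whole workload class once you have measured where the extra reasoning stops paying off.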

Critical Constraints

Two constraints will produce errors in production if you aren’t aware of them before writing your first request:

  1. thinkingBudget and thinkingLevel cannot be used in the same request. thinkingBudget is for Gemini 2.5 series models. thinkingLevel is for Gemini 3 series models. Using both returns a 400 error.

  2. Structured JSON output and Search Grounding are mutually exclusive. You cannot enable both in the same request.
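Both constraints are cheap to catch before a request leaves your service. A minimal sketch follows; the config dict and its key names (structured_output, search_grounding) are hypothetical stand-ins for whatever your client library uses.

```python
def validate_request_config(config: dict) -> None:
    """Raise ValueError if a config combines options the two constraints above forbid.

    The key names are hypothetical; map them to your own client's request structure.
    """
    if "thinkingBudget" in config and "thinkingLevel" in config:
        raise ValueError("thinkingBudget and thinkingLevel cannot be used in the same request")
    if config.get("structured_output") and config.get("search_grounding"):
        raise ValueError("structured JSON output and Search Grounding are mutually exclusive")
```

Calling this in your request-building path turns a 400 from the API into a local error with a clearer message.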

Configuring via OpenRouter

Use the extra_body parameter with the reasoning key to set the thinking budget through OpenRouter’s API:

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<your-openrouter-key>",
)

response = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Solve this step by step: if f(x) = 3x^2 + 2x - 5, find all roots."}],
    extra_body={"reasoning": {"max_tokens": 8192}}
)

print(response.choices[0].message.content)

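To disable thinking entirely, set reasoning.max_tokens to 0. To use dynamic mode, set it to -1. Note that this is the max_tokens field inside the reasoning object, not the top-level max_tokens parameter that caps the response length.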
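If you set the budget in more than one place, a small builder keeps the values inside the documented range. This is a sketch with a hypothetical helper name; it produces the same extra_body shape used in the OpenAI SDK call shown above.

```python
MAX_THINKING_BUDGET = 24576


def reasoning_extra_body(budget: int) -> dict:
    """Return an extra_body dict carrying the thinking budget for OpenRouter.

    budget: 0 disables thinking, -1 requests dynamic mode, 1..24576 sets a fixed budget.
    """
    if budget != -1 and not 0 <= budget <= MAX_THINKING_BUDGET:
        raise ValueError(f"budget must be -1 or between 0 and {MAX_THINKING_BUDGET}")
    return {"reasoning": {"max_tokens": budget}}
```

Pass the result as extra_body=reasoning_extra_body(8192) in client.chat.completions.create.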

Cross-Provider Performance

OpenRouter routes Gemini 2.5 Flash through three Google providers and tracks real-time throughput, Time to First Token (TTFT), end-to-end latency, and uptime for each. The differences between providers are significant enough to affect the choice of provider for latency-sensitive workloads.

The figures below are a point-in-time snapshot. Verify them against openrouter.ai/google/gemini-2.5-flash before using them in capacity or latency planning.

Performance by Provider

Source: OpenRouter live model page.

| Provider | Avg Throughput | Avg TTFT | Avg E2E Latency | Uptime |
| --- | --- | --- | --- | --- |
| Google Vertex (Global) | ~75 tok/s | ~0.63s | Verify on live page | Verify on live page |
| Google AI Studio | Verify on live page | Verify on live page | Verify on live page | Verify on live page |
| Google Vertex | Verify on live page | Verify on live page | Verify on live page | Verify on live page |

The Vertex Global provider shows the highest throughput in recent data. AI Studio historically shows the best uptime. Standard Vertex shows the highest latency of the three. When you route through OpenRouter without specifying a provider, it automatically distributes traffic to the healthiest option based on real-time signals.

For real-time Gemini 2.5 Flash pricing and uptime, see the OpenRouter model page.

Gemini 2.5 Flash vs Flash Lite vs Pro

Choose based on your workload requirements:

Use Gemini 2.5 Flash for most agentic and reasoning workloads. It’s the default recommendation when you need thinking capability without incurring Pro-level costs.

Use Gemini 2.5 Flash Lite for high-volume classification, extraction, or translation tasks where thinking isn’t required and cost per request is the primary constraint. Thinking is disabled by default on Flash Lite.

Use Gemini 2.5 Pro for complex reasoning tasks where accuracy justifies a 5 to 10x cost premium over Flash: frontier mathematics, hard coding challenges, and multi-step scientific analysis.

Technical Specifications

The table below is a quick reference for Gemini 2.5 Flash. For the authoritative version, see the Google AI for Developers model page (updated 2026-04-01) and the Vertex AI docs (updated 2026-04-03).

| Property | Value |
| --- | --- |
| Model ID | gemini-2.5-flash |
| OpenRouter model string | google/gemini-2.5-flash |
| Context window | 1,048,576 tokens |
| Max output | 65,536 tokens |
| Input types | Text, images, video, audio, code, documents (PDF and text/plain only, 50MB max) |
| Output types | Text |
| Thinking budget range | 0 to 24,576 tokens (default: dynamic / -1) |
| Knowledge cutoff | January 2025 |
| GA release | June 17, 2025 |
| Discontinuation | October 16, 2026 |
| Supported capabilities | Function calling, structured outputs, code execution, Search Grounding, Batch API, context caching (implicit and explicit), file search, URL context |
| Not supported | Audio generation, image generation, Live API, thinkingLevel parameter |

Deprecation notice: Gemini 2.5 Flash is scheduled for discontinuation on October 16, 2026, on Vertex AI. If you’re building for production use cases that extend beyond that date, plan a migration to a successor model and monitor ai.google.dev/gemini-api/docs/models for updates.
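If you track model lifecycles in code, the discontinuation date can drive a simple planning check. A minimal sketch with hypothetical function names; the date is the one given in the deprecation notice above and should be re-checked against Google's model page.

```python
from datetime import date

DISCONTINUATION_DATE = date(2026, 10, 16)  # from the deprecation notice above


def days_until_discontinuation(today: date) -> int:
    """Days remaining before the scheduled discontinuation (negative once it has passed)."""
    return (DISCONTINUATION_DATE - today).days


def needs_migration_plan(project_end: date) -> bool:
    """True if the project is expected to run on or past the discontinuation date."""
    return project_end >= DISCONTINUATION_DATE
```

For example, days_until_discontinuation(date.today()) could feed a CI warning once the remaining window drops below your migration lead time.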

Frequently Asked Questions

Is Gemini 2.5 Flash free to use?

Google AI Studio provides a free tier with rate limits. On the free tier, your prompts and responses are used to improve Google’s products; see the terms of service before using it with user data. OpenRouter does not include Gemini 2.5 Flash in its free tier; a minimum $5 credit balance is required. Vertex AI provides $300 in trial credits for new Google Cloud accounts.

What is the thinking budget in Gemini 2.5 Flash?

The thinkingBudget parameter (range: 0 to 24,576 tokens, or -1 for dynamic) controls how much internal reasoning the model performs before responding. Budget 0 disables thinking: fastest and cheapest. Budget -1 enables dynamic mode: the model auto-adjusts based on prompt complexity. On Google’s direct API, -1 is the default. Via OpenRouter, thinking is off unless you explicitly request it (e.g. extra_body={"reasoning": {"max_tokens": -1}} for dynamic, or any positive budget). Higher fixed budgets improve output quality on complex tasks but increase latency and cost, billed at the output token rate.

How does Gemini 2.5 Flash compare to GPT-4o?

Flash supports a 1M-token context window, versus 128K for GPT-4o, and includes configurable thinking not available in GPT-4o. Flash’s per-token pricing is lower. GPT-4o has broader third-party ecosystem support and a longer production track record. This guide does not include head-to-head benchmark results for the two models; use the OpenRouter rankings for current third-party evaluation data.

Can I use Gemini 2.5 Flash for image generation?

No. Gemini 2.5 Flash outputs text only. Image input is supported; the model can process and reason about images. For image generation, use Gemini 2.5 Flash Image, a separate model with its own pricing.

What providers serve Gemini 2.5 Flash on OpenRouter?

Three: Google AI Studio, Google Vertex Global, and Google Vertex. OpenRouter routes to the healthiest provider automatically based on real-time throughput and uptime data. You can pin to a specific provider using OpenRouter’s provider routing controls.
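Pinning can be expressed as an extra field in the request body. The sketch below assumes OpenRouter's provider routing object accepts an order list and an allow_fallbacks flag, and it uses a hypothetical provider slug; confirm both against OpenRouter's provider routing documentation before relying on them.

```python
def pinned_provider_body(provider: str, allow_fallbacks: bool = False) -> dict:
    """Build an extra_body dict asking OpenRouter to try one provider first.

    Assumption: the provider.order and provider.allow_fallbacks fields behave as
    described in OpenRouter's provider routing docs.
    """
    return {"provider": {"order": [provider], "allow_fallbacks": allow_fallbacks}}


# Hypothetical slug; check the model page for the exact provider identifiers.
extra_body = pinned_provider_body("google-ai-studio")
```

Pass the result as extra_body in client.chat.completions.create. Keep in mind that disabling fallbacks gives up the automatic failover described earlier, so pin only when a specific provider's latency or uptime profile matters more than availability.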

What is the difference between Gemini 2.5 Flash and Flash Lite?

Flash includes configurable thinking (budget 0 to 24,576) and higher-quality output. Flash Lite is optimized for ultra-low latency and cost, with thinking disabled by default (though it can be enabled). Use Flash when reasoning capability matters; use Lite for high-volume tasks where cost per request is the primary constraint.