For instance, the standard request price for Gemini 1.5 Pro is $1.25 per million tokens.
Then, the price for context caching is $4.50 per million tokens per hour. Cached input requests (cache hits) are priced at $0.31 per million tokens.
This creates a peculiar situation. Instead of making a standard request at $1.25 per million tokens, I could first create a 1-minute cache. For a request of, say, 1 million tokens, this 1-minute cache would cost $4.50 / 60 = $0.075. Then, if I make the request, it will definitely be a cache hit, costing $0.31. The total cost would be $0.385. This is significantly lower than the standard request price. It implies that I could make all my requests much cheaper by first creating a 1-minute cache and then making the actual request.
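The arithmetic above can be sketched as follows (prices taken from this thread, and the function names are my own; note this reproduces my reasoning only and may omit charges the real billing applies, such as for cache creation):

```python
# Sketch of the cost comparison above. Prices are as quoted in this thread;
# check the current Gemini pricing page before relying on them.
STANDARD_PER_M = 1.25      # $/1M input tokens, standard request
STORAGE_PER_M_HOUR = 4.50  # $/1M tokens/hour, cache storage
HIT_PER_M = 0.31           # $/1M tokens, cached input (cache hit)

def cached_request_cost(tokens_m: float, ttl_minutes: float) -> float:
    """Cost of creating a short-lived cache, then hitting it once."""
    storage = STORAGE_PER_M_HOUR * tokens_m * (ttl_minutes / 60)
    return storage + HIT_PER_M * tokens_m

def standard_request_cost(tokens_m: float) -> float:
    """Cost of a plain, uncached request."""
    return STANDARD_PER_M * tokens_m

# 1M tokens, 1-minute cache: 4.50/60 + 0.31 ≈ 0.385 vs. 1.25 standard
print(cached_request_cost(1.0, 1), standard_request_cost(1.0))
```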
Secondly, I find the current implementation of context caching very difficult to use, especially in multi-turn sessions. My goal is to have the user’s input and the AI’s response added to a new cache after each turn.
In Anthropic’s Claude, this can be achieved by setting a breakpoint at the latest message, which automatically caches the preceding conversation. I believe this approach is much more flexible. Are there any plans to support this kind of prefix/prompt caching?
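For reference, here is a rough sketch of the Claude-style breakpoint I mean, based on the payload shape in Anthropic's prompt-caching docs (the helper function is my own illustration, and field names may differ from the current API):

```python
# Illustrative only: attach an "ephemeral" cache breakpoint to the last
# message so that everything before it becomes the cached prefix.
def with_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with a cache breakpoint on the last message."""
    marked = [dict(m) for m in messages]
    last = marked[-1]
    # Content must be in block form to carry cache_control metadata.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked

history = [
    {"role": "user", "content": "Earlier question in the conversation."},
    {"role": "assistant", "content": "Previous answer."},
    {"role": "user", "content": "New question."},
]
marked = with_cache_breakpoint(history)
```

After each turn, you would re-mark the newest message, so the cached prefix grows with the conversation.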
Explicit caching lets you upload content once and reuse it across requests. You pay:
- $4.50 per million tokens per hour for storage
- $0.31 per million tokens for cache hits
Standard input tokens cost $1.25 per million. So yes, caching plus one hit comes to ~$0.385 per million tokens, roughly a 69% discount.
However, this assumes that:
- You always hit the cache (which isn’t guaranteed unless you tightly control prompt structure and timing)
- Your cached content is large enough to justify the overhead
- You’re not incurring extra costs from non-cached tokens or output tokens
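To make the hit-rate caveat concrete, here is a sketch (my own helper, using the prices quoted in this thread) of the expected cost per request when only a fraction of requests actually hit the cache:

```python
# Expected $/1M input tokens when some requests miss the cache and fall back
# to the standard rate. Prices are from this thread; verify against the
# current Gemini pricing page.
def expected_cost_per_m(hit_rate: float, storage_minutes: float) -> float:
    """Expected cost per 1M input tokens, amortizing storage over one request."""
    storage = 4.50 * storage_minutes / 60
    return storage + hit_rate * 0.31 + (1 - hit_rate) * 1.25

# With a 1-minute TTL, the advantage shrinks as the hit rate drops:
for rate in (1.0, 0.9, 0.5):
    print(rate, round(expected_cost_per_m(rate, 1), 4))
```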
In addition, Gemini’s caching isn’t optimized for dynamic, evolving conversations. It treats cached content as a static prefix, and there’s no built-in way to append new turns to the cache after each exchange.
This makes it poorly suited to chat-style interactions where context grows turn by turn.
As of now, there’s no public roadmap confirming Gemini will adopt Claude-style prefix caching. I suggest keeping an eye on the release notes for future updates.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Your first request will not be counted as cached, so it will be charged at the normal price, as far as I know. After that, the state is saved by the caching mechanism.
Are there any settings to keep the cache alive? Even if we set the TTL to 1 hour, why not refresh the cache (or at least not kill it) when I hit it at the 55-minute mark? You mentioned Anthropic works this way — does Gemini have this feature? @ruthseki thx!