For instance, the standard request price for Gemini 1.5 Pro is $1.25 per million tokens.
Then, the price for context caching is $4.50 per million tokens per hour. Cached input requests (cache hits) are priced at $0.31 per million tokens.
This creates a peculiar situation. Instead of making a standard request at $1.25 per million tokens, I could first create a 1-minute cache. For a request of, say, 1 million tokens, this 1-minute cache would cost $4.50 / 60 = $0.075. Then, if I make the request, it will definitely be a cache hit, costing $0.31. The total cost would be $0.385. This is significantly lower than the standard request price. It implies that I could make all my requests much cheaper by first creating a 1-minute cache and then making the actual request.
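The arithmetic above can be sketched as follows (prices taken from this thread, and the function names are my own; note this reproduces my reasoning only and may omit charges the real billing applies, such as for cache creation):

```python
# Sketch of the cost comparison above. Prices are as quoted in this thread;
# check the current Gemini pricing page before relying on them.
STANDARD_PER_M = 1.25      # $/1M input tokens, standard request
STORAGE_PER_M_HOUR = 4.50  # $/1M tokens/hour, cache storage
HIT_PER_M = 0.31           # $/1M tokens, cached input (cache hit)

def cached_request_cost(tokens_m: float, ttl_minutes: float) -> float:
    """Cost of creating a short-lived cache, then hitting it once."""
    storage = STORAGE_PER_M_HOUR * tokens_m * (ttl_minutes / 60)
    return storage + HIT_PER_M * tokens_m

def standard_request_cost(tokens_m: float) -> float:
    """Cost of a plain, uncached request."""
    return STANDARD_PER_M * tokens_m

# 1M tokens, 1-minute cache: 4.50/60 + 0.31 ≈ 0.385 vs. 1.25 standard
print(cached_request_cost(1.0, 1), standard_request_cost(1.0))
```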
Secondly, I find the current implementation of context caching very difficult to use, especially in multi-turn sessions. My goal is to have the user’s input and the AI’s response added to a new cache after each turn.
In Anthropic’s Claude, this can be achieved by setting a breakpoint at the latest message, which automatically caches the preceding conversation. I believe this approach is much more flexible. Are there any plans to support this kind of prefix/prompt caching?
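For reference, here is a rough sketch of the Claude-style breakpoint I mean, based on the payload shape in Anthropic's prompt-caching docs (the helper function is my own illustration, and field names may differ from the current API):

```python
# Illustrative only: attach an "ephemeral" cache breakpoint to the last
# message so that everything before it becomes the cached prefix.
def with_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with a cache breakpoint on the last message."""
    marked = [dict(m) for m in messages]
    last = marked[-1]
    # Content must be in block form to carry cache_control metadata.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked

history = [
    {"role": "user", "content": "Earlier question in the conversation."},
    {"role": "assistant", "content": "Previous answer."},
    {"role": "user", "content": "New question."},
]
marked = with_cache_breakpoint(history)
```

After each turn, you would re-mark the newest message, so the cached prefix grows with the conversation.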
Explicit caching lets you upload content once and reuse it across requests. You pay:
- $4.50 per million tokens per hour for storage
- $0.31 per million tokens for cache hits
Standard input tokens cost $1.25 per million. So yes, caching plus one hit comes to ~$0.385 per million tokens, roughly a 69% discount.
However, this assumes that:
- You always hit the cache (which isn’t guaranteed unless you tightly control prompt structure and timing)
- Your cached content is large enough to justify the overhead
- You’re not incurring extra costs from non-cached tokens or output tokens
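To make the hit-rate caveat concrete, here is a sketch (my own helper, using the prices quoted in this thread) of the expected cost per request when only a fraction of requests actually hit the cache:

```python
# Expected $/1M input tokens when some requests miss the cache and fall back
# to the standard rate. Prices are from this thread; verify against the
# current Gemini pricing page.
def expected_cost_per_m(hit_rate: float, storage_minutes: float) -> float:
    """Expected cost per 1M input tokens, amortizing storage over one request."""
    storage = 4.50 * storage_minutes / 60
    return storage + hit_rate * 0.31 + (1 - hit_rate) * 1.25

# With a 1-minute TTL, the advantage shrinks as the hit rate drops:
for rate in (1.0, 0.9, 0.5):
    print(rate, round(expected_cost_per_m(rate, 1), 4))
```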
In addition, Gemini’s caching isn’t optimized for dynamic, evolving conversations. It treats cached content as a static prefix, and there’s no built-in way to append new turns to the cache after each exchange.
This makes it poorly suited to chat-style interactions where context grows turn by turn.
As of now, there’s no public roadmap confirming Gemini will adopt Claude-style prefix caching. I suggest keeping an eye on the release notes for future updates.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Your first request will not be counted as cached, so it will be charged at the normal price, as far as I know. After that, the state is saved by the caching mechanism.
Are there any settings to keep the cache alive? Even if we set the TTL to 1 hour, why not refresh the cache (or at least not kill it) when I hit it at the 55-minute mark? You mentioned Anthropic works this way — does Gemini have this feature? @ruthseki thx!