Hi @speedy ,
Welcome to Google Cloud Community!
Gemini-2.5-pro-preview-03-25 is currently covered by the Pre-GA Offerings Terms. See the release notes for more information.
Note that at Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months.
Also, starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
With regards to your questions:
- Is there any way to reduce this latency?
Yes, there are several approaches, ranging from infrastructure to prompt engineering:
- Vertex AI Provisioned Throughput (Dedicated Resources):
- This is likely your most impactful infrastructure-level solution for consistently lower latency on large prompts.
- Instead of pay-as-you-go on shared resources, you purchase dedicated capacity for the model with guaranteed throughput, measured in tokens per second (TPS).
- This provides more predictable performance and can significantly reduce latency because your requests aren’t contending with others for resources.
- See Purchase Provisioned Throughput for preview models; you may also refer to this article for more information.
- Optimize Output Length (max_output_tokens):
- Total latency is a function of both input and output tokens, so generating very long responses adds to it. Set max_output_tokens as low as your use case allows if you don't need an exhaustive output (see the sketch after this list).
- Caching (If applicable):
- If you expect to send the exact same large prompt multiple times, you could implement a caching layer yourself (also shown in the sketch after this list). However, this is unlikely to help for truly dynamic large-context scenarios.
- Vertex AI also has some internal caching, but for such large, unique prompts, its benefit might be limited for your first call.
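To make the last two points concrete, here is a minimal Python sketch using the Vertex AI SDK that caps max_output_tokens and adds a simple hash-based cache for repeated prompts. The project ID, location, and model ID are placeholders for illustration; substitute your own.

```python
import hashlib

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Placeholders: replace with your project, region, and the model ID you are using.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-002")

_response_cache: dict[str, str] = {}  # in-memory cache keyed by a hash of the prompt


def generate_capped(prompt: str, max_output_tokens: int = 1024) -> str:
    """Call the model with a bounded output length, reusing a cached result
    when the exact same prompt has been sent before."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _response_cache:
        return _response_cache[key]

    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(
            max_output_tokens=max_output_tokens,  # output tokens add to latency, so cap them
            temperature=0.2,
        ),
    )
    _response_cache[key] = response.text
    return _response_cache[key]
```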
- Is this expected for Gemini at this scale?
Yes, to a large extent, this latency is expected for single, monolithic prompts of that size on standard, pay-as-you-go endpoints.
- Computational Cost: Processing hundreds of thousands of tokens is computationally intensive. The attention mechanism at the core of Transformer architectures (which Gemini is based on) scales quadratically with sequence length in its naive form (O(n^2)), so roughly 5x the tokens can mean on the order of 25x the attention work. While models like Gemini 1.5 Pro use more advanced techniques to handle 1M+ tokens, the fundamental work of processing that much information is still substantial.
- Shared Resources: On standard Vertex AI endpoints, you’re typically using shared resources. While Google has massive infrastructure, a request that large will still queue and consume significant resources for a period.
- Is there a recommended best practice to split large prompts or improve runtime performance?
Absolutely. This is crucial for dealing with massive contexts, even with models that can technically handle them in one go. “Can handle” doesn’t always mean “should handle in one monolithic block for optimal performance.”
- Chunking & Iterative Processing (MapReduce-like approach):
- For Summarization/Extraction over large documents:
- Split the document into manageable chunks (e.g., 10K-50K tokens each, experiment with size).
- Process each chunk individually (e.g., “Summarize this chunk,” “Extract key entities from this chunk”).
- Combine the results from the chunks. If you have many summaries, you might do a second pass to summarize the summaries (a minimal sketch of this MapReduce pattern follows after this list).
- Retrieval Augmented Generation (RAG) – If appropriate:
- If your large prompt is essentially a knowledge base you want to query, RAG is often more efficient.
- Instead of stuffing all 500K tokens into the prompt every time (a sketch follows after this list):
  1. Pre-process and embed your large document(s) into a vector database.
  2. When a user query comes in, retrieve only the most relevant snippets (e.g., a few thousand tokens) from the vector database.
  3. Provide these relevant snippets as context to Gemini along with the user's query.
- When this is NOT a fit: If your task requires the model to reason about the interconnections across the entire 500K document simultaneously (e.g., finding very subtle long-range dependencies or writing a novel based on a massive outline).
- Selective Context / Sliding Window (for specific tasks):
- If you’re performing a task like editing or Q&A that only needs local context at any given time, you don’t need to feed the whole document.
- For example, if editing paragraph 500 of a document, provide paragraphs 495-505 as context (a short sketch follows after this list).
- Instruction Placement:
- For very long prompts, models can sometimes pay more attention to the beginning and end of the context. Place your most critical instructions or questions towards the end of the prompt, after the long context; this tendency to weight the end of the context more heavily is sometimes referred to as "recency bias."
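To illustrate the chunking/MapReduce approach above, here is a minimal sketch. It is illustrative only: the project, location, and model ID are placeholders, and it splits by characters for simplicity; in practice you would count tokens (e.g., with model.count_tokens) and split on natural boundaries such as sections or paragraphs.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholders: replace with your project, region, and model ID.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-002")


def split_into_chunks(text: str, chunk_chars: int = 100_000) -> list[str]:
    """Naive splitter: cut the document into roughly equal character chunks."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def map_reduce_summarize(document: str) -> str:
    # "Map" step: summarize each chunk independently.
    partial_summaries = []
    for chunk in split_into_chunks(document):
        response = model.generate_content(
            "Summarize the key points of the following text:\n\n" + chunk
        )
        partial_summaries.append(response.text)

    # "Reduce" step: combine the partial summaries into one final summary.
    response = model.generate_content(
        "Combine these partial summaries into one coherent summary:\n\n"
        + "\n\n".join(partial_summaries)
    )
    return response.text
```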
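Here is a similarly simplified RAG sketch. For illustration it retrieves in memory with cosine similarity rather than a real vector database, and the embedding model name is an assumption; for production, precompute the embeddings and use Vertex AI Vector Search or another vector store. Note that the user's question is placed after the retrieved context, in line with the instruction-placement tip above.

```python
import numpy as np
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingModel

# Placeholders: replace with your project, region, and model IDs.
vertexai.init(project="your-project-id", location="us-central1")
embedder = TextEmbeddingModel.from_pretrained("text-embedding-004")
llm = GenerativeModel("gemini-1.5-pro-002")


def embed(texts: list[str]) -> np.ndarray:
    """Embed texts in small batches (embedding endpoints limit instances per request)."""
    vectors = []
    for i in range(0, len(texts), 5):
        for emb in embedder.get_embeddings(texts[i:i + 5]):
            vectors.append(emb.values)
    return np.array(vectors)


def answer_with_rag(question: str, snippets: list[str], top_k: int = 5) -> str:
    """Retrieve only the most relevant snippets and send a small prompt to Gemini."""
    snippet_vecs = embed(snippets)  # in production, precompute and store these
    query_vec = embed([question])[0]

    # Cosine similarity between the query and every snippet.
    scores = snippet_vecs @ query_vec / (
        np.linalg.norm(snippet_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top_snippets = [snippets[i] for i in np.argsort(scores)[::-1][:top_k]]

    # Critical instruction (the question) goes after the long context.
    prompt = (
        "Use only the context below to answer.\n\nContext:\n"
        + "\n\n".join(top_snippets)
        + f"\n\nQuestion: {question}"
    )
    return llm.generate_content(prompt).text
```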
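Finally, a short sketch of the selective-context / sliding-window idea: pass only the paragraphs around the spot you are working on instead of the whole document.

```python
def local_context(paragraphs: list[str], target_index: int, window: int = 5) -> str:
    """Return only the paragraphs around the one being worked on,
    e.g., roughly paragraphs 495-505 when editing paragraph 500."""
    start = max(0, target_index - window)
    end = min(len(paragraphs), target_index + window + 1)
    return "\n\n".join(paragraphs[start:end])


# Example usage (hypothetical): build a small prompt for editing one paragraph.
# paragraphs = full_document.split("\n\n")
# prompt = (
#     "Rewrite the middle paragraph for clarity. Surrounding context:\n\n"
#     + local_context(paragraphs, target_index=500)
# )
```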
In summary, start by estimating the cost/benefit of provisioned throughput, and in parallel, prototype chunking strategies.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.