Need Guidance on Optimizing API Response Times for Google Gemini 2.5 Pro Inference

Hey Guys!
I am very new to the Apigee and Google developer communities, and I haven't used the service yet.
However, I have an application where I make a REST API call with a multi-modal prompt input (around 1032 tokens for the image and 10-12 tokens for the text) and run inference on the gemini-2.5-pro model for a specific output (a JSON schema of 2-3 tokens).
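For reference, a request like the one described (one image plus a short text prompt, with structured JSON output) could be assembled for the `gemini-2.5-pro` REST `generateContent` endpoint roughly as follows. This is a sketch: the field names follow the public REST API, but the image bytes, prompt, and response schema here are placeholders.

```python
import base64
import json

def build_generate_content_payload(image_bytes: bytes, prompt: str) -> dict:
    """Assemble a generateContent request body: one inline image part plus a
    short text part, constraining the model to a small JSON output schema."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"inlineData": {
                    "mimeType": "image/jpeg",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
                {"text": prompt},
            ],
        }],
        "generationConfig": {
            # Ask for structured JSON output (the tiny schema described above).
            "responseMimeType": "application/json",
            "responseSchema": {
                "type": "OBJECT",
                "properties": {"label": {"type": "STRING"}},
            },
        },
    }

# The payload would then be POSTed (with an API key or OAuth token) to:
# https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent
payload = build_generate_content_payload(b"\xff\xd8fake-jpeg-bytes", "Classify this image.")
print(json.dumps(payload)[:80])
```

Note that the image part dominates the token count, so most of the latency comes from model inference rather than payload size.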

I am seeing the per-request time average out at around 20 seconds.

Here are my questions:

  1. Is there a better way to make API calls to get faster responses (somewhere around 5 seconds; not sure if that is too big an ask)?
  2. What are some resources that I can look into that can guide me on this topic?

Thank you very much for taking the time to help and point me in the right direction!

Hello @Krunal_Bhatt ,

Welcome to the community! Are you planning on using Apigee as part of this workflow at some point? I can think of several differentiators (e.g. semantic/context caching) that would drastically improve performance for specific use cases.

Given your specific question, I would recommend looking into the following:

  • Try different Gemini models to evaluate performance: Flash will be noticeably quicker if your main concern is latency (and given the prompt above, Flash may be better suited).
  • There could be network latency associated with your requests: where is your client located, and which regional endpoint are you calling?
  • Are you using Provisioned Throughput? If not, your requests could be waiting on shared capacity.
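On the regional-endpoint point, the Vertex AI `generateContent` endpoint is region-specific, so a client far from the chosen region pays extra round-trip latency on every call. A small sketch of how the hostname and path change with the region (the project and model values are placeholders):

```python
def vertex_generate_content_url(project: str, region: str, model: str) -> str:
    """Build the regional Vertex AI generateContent URL; requests from a
    client far from `region` incur additional network round-trip time."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/"
        f"publishers/google/models/{model}:generateContent"
    )

# Same project and model, two regions: pick the one closest to your client.
url = vertex_generate_content_url("my-project", "us-central1", "gemini-2.5-pro")
print(url)
print(vertex_generate_content_url("my-project", "europe-west4", "gemini-2.5-flash"))
```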

Best
Matt

Thank you for the response @hartmann !!

I am not sure if I “need” to use Apigee as part of this workflow. Is there any other conventional way (without Apigee) to get faster responses?

  1. To your point about evaluating different models: what is a good way to evaluate a prompt across various models? Is there some benchmark/metric that can be used to evaluate performance? I will certainly look into Flash and its performance. Did you recommend Flash because the prompt is multimodal?

  2. I have checked for network latency; it is minimal. Edit: I just noticed the regional part. I didn't set the region, but I assume that plays a role since the regions are far apart (my assumption is that it defaults to us-central1 in Iowa?).

  3. I am not using Provisioned Throughput. Is that something where I will need to allocate a resource (typically some compute?), or is it just pre-allocated bandwidth?

Regards
Krunal

@Krunal_Bhatt

My pleasure!

Apigee is typically recommended as an AI gateway in architectures pointing to 1-N LLMs/agents (1p, 3p, etc.). You can read more about its feature differentiators (token/rate limiting, AuthN/AuthZ, semantic caching, etc.) here: https://cloud.google.com/solutions/apigee-ai?e=48754805

On the above:

  1. Prompt tuning can be orchestrated through Vertex's evaluation service: Gen AI evaluation service overview  |  Generative AI on Vertex AI  |  Google Cloud Documentation. Through it you can tune different prompts and determine the best performance/responses across 1-N models. I recommended looking into Flash given its lightweight nature, but the evaluation noted above would let you explicitly verify any performance gain.
  2. Understood, just wanted to validate
  3. Provisioned Throughput more or less reserves model capacity specifically for your project (versus sharing pooled resources). You can read more about this use case here: Provisioned Throughput overview  |  Generative AI on Vertex AI  |  Google Cloud Documentation
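On the first point, one quick way to compare candidate models before reaching for a full evaluation framework is a small timing harness. Everything below is a placeholder sketch: `call_model` stands in for whatever function issues your real request, and the stub here only simulates latency.

```python
import statistics
import time

def benchmark(call_model, models, n=5):
    """Time n calls per model and report mean and worst-case wall-clock
    latency. `call_model(model)` is your real request function."""
    results = {}
    for model in models:
        latencies = []
        for _ in range(n):
            start = time.perf_counter()
            call_model(model)
            latencies.append(time.perf_counter() - start)
        results[model] = {
            "mean_s": statistics.mean(latencies),
            "max_s": max(latencies),
        }
    return results

# Stub standing in for the real API call, just to show the output shape.
def fake_call(model):
    time.sleep(0.01 if "flash" in model else 0.02)

report = benchmark(fake_call, ["gemini-2.5-pro", "gemini-2.5-flash"], n=3)
for model, stats in report.items():
    print(model, round(stats["mean_s"], 3))
```

For quality (not just latency), you would also want to check that each model's JSON output validates against your schema; the Vertex evaluation service linked above handles that side more rigorously.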

Hope this helps!
