Hey Guys!
I am very new to Apigee and the Google developer community, and I haven't used the service yet.
However, I have an application where I make a REST API call with a multimodal prompt (around 1,032 tokens for the image and 10-12 tokens for text) and run inference on the gemini-2.5-pro model for a specific output that follows a JSON schema (2-3 tokens).
I am seeing the per-request time average out at around 20 seconds.
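For context, this is roughly how I am measuring that 20-second average (a minimal sketch; `call_gemini` is a placeholder for my actual REST request, not a real function):

```python
import statistics
import time

def measure_latency(request_fn, n_trials=10):
    """Time n_trials calls to request_fn and report mean/p95 latency in seconds."""
    timings = []
    for _ in range(n_trials):
        start = time.perf_counter()
        request_fn()  # placeholder: the actual Gemini REST call goes here
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "p95": statistics.quantiles(timings, n=20)[-1],  # 95th percentile
    }
```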
Here are my questions:
- Is there a better way to make these API calls to get a faster response (somewhere around 5 seconds; not sure if that is too big an ask)?
- What are some resources that I can look into that can guide me on this topic?
Thank you very much for taking the time to help and point me in the right direction!
Hello @Krunal_Bhatt ,
Welcome to the community! Are you planning on using Apigee as part of this workflow at some point? I can think of several differentiators (e.g., semantic/context caching) that would drastically improve performance for specific use cases.
Given your specific question, I would recommend looking into the following:
- Try different Gemini models to evaluate performance. Flash will be noticeably quicker if latency is your main concern (and given the prompt above, Flash may be better suited).
- There could be network latency associated with your requests: where is your client located relative to the regional endpoint you are calling?
- Are you using Provisioned Throughput? If not, your requests could be waiting on shared capacity.
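On the regional point above: the endpoint region is part of the Vertex AI REST URL, so it is worth checking which one your client is actually hitting. A sketch (the project and model values are placeholders; substitute your own):

```python
def vertex_endpoint(project: str, region: str, model: str) -> str:
    """Build the Vertex AI generateContent URL for a given regional endpoint."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/"
        f"publishers/google/models/{model}:generateContent"
    )

# A region close to your client reduces network round-trip time.
url = vertex_endpoint("my-project", "us-central1", "gemini-2.5-pro")
```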
Best
Matt
Thank you for the response @hartmann !!
I am not sure whether I “need” to use Apigee as part of this workflow. Is there a conventional way to get faster responses without Apigee?
- To your point about evaluating different models: what is a good way to evaluate a prompt across various models? Is there some benchmark/metric that can be used to evaluate performance? I will certainly look into Flash and its performance. Did you recommend Flash because the prompt is multimodal?
- I have checked for network latency; it is minimal. Edit: I just noticed the regional part. I didn't set the region, but I assume that plays a part since the regions are far apart (my assumption is it defaults to us-central1 in Iowa?).
- I am not using Provisioned Throughput. Is that something where I need to allocate a resource (typically some compute?), or is it just pre-allocated bandwidth?
Regards
Krunal
@Krunal_Bhatt
My pleasure!
Apigee is typically recommended as an AI gateway in architectures pointing to 1-N LLMs/agents (1p, 3p, etc.). You can read more about its feature differentiators (token/rate limiting, AuthN/AuthZ, semantic caching, etc.) here: https://cloud.google.com/solutions/apigee-ai?e=48754805
On the above:
- Prompt tuning can be orchestrated through Vertex's evaluation service (Gen AI evaluation service overview | Generative AI on Vertex AI | Google Cloud Documentation). Through it, you can tune different prompts and determine the best performance/responses across 1-N models. I recommended looking into Flash given its lightweight nature, but the evaluation noted above would let you quantify any performance difference.
- Understood, just wanted to validate
- Provisioned Throughput more or less reserves model capacity specifically for your project (versus sharing pooled resources). You can read more about this use case here: Provisioned Throughput overview | Generative AI on Vertex AI | Google Cloud Documentation
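For a quick first pass before setting up the full evaluation service, you can simply time the same prompt against a couple of models and compare wall-clock latency. A sketch, where `call_model` is a placeholder for your actual request function:

```python
import time

def compare_latency(call_model, models, n_trials=5):
    """Time n_trials calls per model and return mean latency in seconds for each."""
    results = {}
    for model in models:
        timings = []
        for _ in range(n_trials):
            start = time.perf_counter()
            call_model(model)  # placeholder: send the same prompt to this model
            timings.append(time.perf_counter() - start)
        results[model] = sum(timings) / len(timings)
    return results

# Example: compare_latency(my_request_fn, ["gemini-2.5-pro", "gemini-2.5-flash"])
```

This only measures latency; the evaluation service linked above is what you want for comparing response quality.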
Hope this helps!