We are getting intermittent resource-exhausted errors while using the Gemini 2.5 Flash, Gemini 2.5 Pro, and Gemini 3 models.
We are on the standard pay-as-you-go service, and the error reads: "Error 429, Message: Resource exhausted. Please try again later. Please refer to …vertex-ai/generative-ai/docs/error-code-429 for more details., Status: RESOURCE_EXHAUSTED, Details: "
Based on the documentation, including Standard PayGo | Generative AI on Vertex AI | Google Cloud Documentation and Generative AI on Vertex AI quotas and system limits | Google Cloud Documentation,
I have a hard time understanding which limit we are actually exhausting.
A few specific questions:
- How do I find out which quota we are actually exceeding?
- Following the docs, when I visit Vertex AI → Model Garden monitoring, I don't see any data showing utilization being exceeded.
- Even though we are definitely using Vertex, I am only able to see quotas under service:Generative Language API. All the quotas shown there are well below 10% utilization, so they can't be the source of the resource-exhausted errors.
- Why am I not able to see any quotas when I filter on service:Vertex AI API or service:aiplatform?
@Somak_Dutta: The Error 429 in this case likely indicates that there was heavy concurrent traffic hitting the shared quota pool. I wrote a LinkedIn post about this if you're interested. You can also check out the Retry strategy documentation: Retry strategy | Generative AI on Vertex AI | Google Cloud Documentation.
Thank you @ericdong. We are building a retry strategy, but it would be great to understand exactly which shared quota pool is being exhausted. Is there a way to do that?
Also, I didn't really understand why, despite using Vertex AI, my quota utilization is visible under service:Generative Language API and not service:aiplatform. Is there a specific reason I am unaware of? I went through the Vertex docs rigorously, but I couldn't find any explanation for this (or none that I could find).
We tried using Claude models (e.g. anthropic-claude-opus-4-1) on Vertex AI and immediately received:
Error 429
Quota exceeded for aiplatform.googleapis.com/online_prediction_input_tokens_per_minute_per_base_model
Status: RESOURCE_EXHAUSTED
This happens even with minimal traffic.
We then switched to Gemini models, and they work perfectly under the same project and billing setup.
@Somak_Dutta Yes, you can use Vertex AI Monitoring to look into model usage, identify latency issues, and troubleshoot errors. The public documentation is here: Monitor models | Generative AI on Vertex AI | Google Cloud Documentation.
Instructions specific to your needs:
- Go to the Google Cloud Console and select your project
- Go to Monitoring > Dashboards
- Search for the pre-built dashboard "Vertex AI Model Garden (Monitoring)"
- Select your model and monitor these metrics:
  - Requests per second
  - Error rate
  - Latencies
  - Token counts
- You can also use the filters to filter by Location, response_code, etc.
Hope it helps.
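As a programmatic complement to the dashboard, the same metrics can be queried through the Cloud Monitoring API by passing a filter string to a time-series request. A minimal sketch of building such a filter is below; note that the metric type `aiplatform.googleapis.com/prediction/online/error_count` and the label names used here are assumptions for illustration, so verify them against the Monitoring metrics reference for `aiplatform.googleapis.com` before relying on them:

```python
# Build a Cloud Monitoring time-series filter that isolates 429 responses.
# The metric type and label names below are assumptions for illustration;
# check them against the aiplatform.googleapis.com metrics reference.

def build_429_filter(model_id: str) -> str:
    """Return a Monitoring filter string for 429 errors on one model."""
    parts = [
        'metric.type = "aiplatform.googleapis.com/prediction/online/error_count"',
        'metric.labels.response_code = "429"',
        f'resource.labels.model_user_id = "{model_id}"',
    ]
    return " AND ".join(parts)

print(build_429_filter("gemini-2.5-flash"))
```

The resulting string can be passed as the `filter` of a `list_time_series` request with the `google-cloud-monitoring` client, which lets you break down 429 counts per model rather than eyeballing the dashboard.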
Hi @ericdong ,
We’ve been facing something similar on the Vertex API. It’s been running in one of our production apps for the last few months, and even at low RPM we’ve seen error rates over 55% (as reported on the Vertex website) when using the google.cloud.aiplatform.v1.PredictionService.GenerateContent method.
We’ve tried retries, but they would often time out. I’ve even tried priority calling (the new priority feature in the headers) and was still getting constant 429s.
I’ve been on a paid billing plan for the last few months, and it’s unusable; we constantly fall back to another provider, but we would prefer to use Vertex.
Could you provide any guidance?
Edit: We’re also running on the global endpoint to try to avoid 429s.
@John_Beauce: The only way to guarantee capacity is through Provisioned Throughput (PT). If PT is not an option for you, implement a robust exponential-backoff solution: automatically retry the request after a short, increasing delay. Availability can change quickly, so a retry is often successful.
I believe capacity is being expanded to meet the recent surge in demand.
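For anyone implementing this, here is a minimal jittered exponential-backoff sketch in generic Python. The `call_model` callable stands in for whatever Vertex client call you use, the `ResourceExhausted` class is a placeholder for your client's 429 exception (e.g. `google.api_core.exceptions.ResourceExhausted` in the Python client), and the delay values are illustrative:

```python
import random
import time


class ResourceExhausted(Exception):
    """Stand-in for your client's 429 exception type."""


def call_with_backoff(call_model, max_attempts=5, base_delay=1.0, max_delay=32.0):
    """Retry call_model() on 429-style errors with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call_model()
        except ResourceExhausted:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429 to the caller
            # Cap the exponential delay, then sleep a random fraction of it
            # ("full jitter") so concurrent clients don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The full-jitter choice matters under a shared quota pool: without it, many clients that hit the 429 at the same moment would all retry at the same moment and collide again.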
@ericdong Thanks for your response, appreciate it.
Given this, what exactly is the purpose of the Priority headers? I was getting 429s even when calling with priority, but I thought it was an in-between option for those who didn’t want PT but still wanted to avoid 429s.
@John_Beauce: Your understanding is correct. Priority PayGo is prioritized over Standard and Flex in terms of availability, so it can significantly reduce the chance of 429s compared to Standard PayGo, but it can’t totally eliminate them.
Hi @ericdong, regarding Retry strategy | Generative AI on Vertex AI | Google Cloud Documentation:
I wanted to bring your attention to a specific open issue I raised around January of this year: Why does Workload Identity Federation for using Vertex Backend only applicable when HTTPClient is not used · Issue #660 · googleapis/go-genai · GitHub
This relates to the document you shared earlier. We were using HashiCorp’s open-source go-retryablehttp library to retry specific error codes back when we were using the open Gemini endpoints; however, after moving to Vertex we had to drop it.
Any chance you could take a look at the issue?
Sorry, I realize I might be asking a lot; just in case you get time.