[429 RESOURCE_EXHAUSTED] - Resource Exhausted on Vertex AI Models

Hello everyone,

The project I am working on is currently using the Google GenAI library with Vertex AI to prompt Gemini 2.5 on the paid tier. We already have exponential backoff + retry logic implemented for handling transient errors. However, over the past two days, the retry logic has become insufficient - we’re repeatedly hitting 429 errors and eventually exhausting all retries without ever receiving a successful response from Gemini.

This seems to correlate with the Gemini 3.0 announcements, and we suspect increased usage or demand may be contributing to the issue.

We are considering Provisioned Throughput as a way to mitigate this, but it would introduce additional costs. Before we go down that route, I wanted to ask:

Has anyone found effective strategies or configurations for dealing with persistent 429 errors from Gemini on Vertex AI?

Additionally, we would like to ensure that all requests run only on servers located in the United States.

Any insights on whether this spike in 429 errors is expected or temporary would also be appreciated.

Thanks!

Hello, I face the same issue. Have you found a solution?

Unfortunately, it has been about two months, and we still haven’t found a solid solution that doesn’t require a major architectural change or change to the pricing model.

We’ve implemented changes in our codebase where Gemini is invoked, randomly selecting servers across different regions in an attempt to better distribute load. However, we don’t yet have strong evidence that this has meaningfully improved the situation, as we’re still seeing frequent 429 RESOURCE_EXHAUSTED errors.
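For anyone wanting to try the same thing, here is a rough sketch of region rotation restricted to US locations. It is not necessarily what we run in production. The region list is an assumption (check which regions actually serve your model), `RateLimited` is a hypothetical stand-in for whatever 429 error your client library raises, and `send(region)` is a function you supply that makes the request against that region.

```python
import random

# Assumption: US regions to rotate across; verify model availability in each.
US_REGIONS = ["us-central1", "us-east1", "us-east4", "us-west1", "us-west4"]


class RateLimited(Exception):
    """Hypothetical stand-in for the client library's 429 error."""


def call_in_random_region(send, regions=US_REGIONS, rng=random):
    """Try the given regions in random order until one accepts the request.

    `send(region)` is caller-supplied: it performs the request against that
    region and raises RateLimited on a 429.
    """
    order = list(regions)
    rng.shuffle(order)  # spread load instead of always hitting one region first
    last_error = None
    for region in order:
        try:
            return region, send(region)
        except RateLimited as exc:
            last_error = exc  # this region is saturated; try the next one
    raise RateLimited("all regions rate limited") from last_error
```

Because only `us-` regions are in the list, every attempt stays in the United States. Note that this only helps if the regions do not all run out of capacity at the same time.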

Notably, we didn’t experience this level of instability with Gemini 1.5 Pro. These issues began after we were required to migrate to Gemini 2.5 Flash.

At this point, our next step is to move to Gemini 3.0 and monitor whether it offers better stability and resource handling.

We recommend migrating to Provisioned Throughput (PT). Our data shows that PT significantly stabilizes performance, and reserving this capacity will provide the specific reliability guarantees you require.

Hello @Javin_Liu

Did you try limiting your API calls?

For example, when 429 errors are retried without a cap, the retries pile onto the queue and the total number of calls keeps growing. Instead, you can try to maintain a steady flow of x calls per second (or per minute), with pending retries counted toward that limit.
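To make that concrete, here is a minimal sketch of a limiter that spaces out all calls, retries included. The names `SteadyRateLimiter`, `call_with_budget`, and `RateLimited` are made up for illustration, and `send()` is whatever function performs your actual request; adapt the exception to the 429 error your client library raises.

```python
import threading
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client library's 429 error."""


class SteadyRateLimiter:
    """Spaces calls at least 1 / calls_per_second seconds apart."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / calls_per_second
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def acquire(self):
        """Reserve the next free slot, wait for it, and return the wait time."""
        with self.lock:
            now = self.clock()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        wait = slot - now
        if wait > 0:
            self.sleep(wait)  # sleep outside the lock so others can reserve slots
        return wait


def call_with_budget(limiter, send, max_attempts=5):
    """Call send(); every retry takes a slot from the same limiter."""
    for attempt in range(max_attempts):
        limiter.acquire()
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
```

Since first attempts and retries share one limiter, a burst of 429s slows everything down instead of multiplying the number of requests in flight. You would still want your existing backoff on top of this for long outages.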

If you have already done that, or if you really need to serve high demand, you can use another model as a fallback when 429 errors occur, which lowers the load on the primary model.
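A fallback chain can be as simple as the sketch below. `generate_with_fallback` and `RateLimited` are illustrative names, not part of any library, and `generate(model, prompt)` is a function you provide that makes the real request and raises `RateLimited` on a 429.

```python
class RateLimited(Exception):
    """Hypothetical stand-in for the client library's 429 error."""


def generate_with_fallback(prompt, models, generate):
    """Try each model in order, moving on only when one is rate limited.

    Returns a (model, response) pair so the caller knows which model answered.
    """
    last_error = None
    for model in models:
        try:
            return model, generate(model, prompt)
        except RateLimited as exc:
            last_error = exc  # this model is saturated; try the next one
    raise RateLimited(f"all models rate limited: {models}") from last_error
```

You would call it with your preferred model first, for example `["gemini-2.5-flash", "gemini-2.5-flash-lite"]` (model names here are only an example). Other errors are deliberately not caught, so a bad request does not get silently retried on a different model.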

"Any insights on whether this spike in 429 errors is expected or temporary would also be appreciated."

Since you’re using Pay As You Go, you are on a shared pool of resources that sometimes seems to be overwhelmed by demand.

The premium option, if you need it and can afford it, is Provisioned Throughput, as mentioned by @Mamatha.