Google Vertex not suitable for small production workloads in practice? - Error 429: Resources Exhausted

We’re frequently getting ‘429: RESOURCE_EXHAUSTED’ errors on Vertex AI when making low-volume Gemini queries via the global endpoint. Even with exponential backoff and 10+ retries spread over 10+ minutes, requests often fail completely. We are nowhere near our quota limits. The error message itself is slightly misleading: it is essentially a ‘server is too busy’ error rather than a quota error.
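For reference, the retry logic is essentially textbook exponential backoff with full jitter. A minimal sketch of what we run (`call_model` is a placeholder for whatever actually sends the Vertex request; it raises on a 429 response):

```python
import random
import time


def call_with_backoff(call_model, max_retries=10, base_delay=1.0, max_delay=60.0):
    """Retry `call_model` on errors with exponential backoff and full jitter.

    `call_model` is a placeholder for the function that makes the actual
    Vertex request; it is expected to raise an exception on a 429 response.
    """
    for attempt in range(max_retries):
        try:
            return call_model()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # full jitter: sleep a random amount up to the capped exponential delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Even with this in place (and much longer delays than shown), the requests still fail.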

I’m worried because if this happens in the testing phase, how would it handle bigger production workloads?

I’ve tried or looked into the following things:

  • Using the global endpoint, on the assumption that it routes requests to whichever region is least busy.
  • Provisioned Throughput, but its minimum commitment of around 2,000 EUR per month is not realistic for a workload of this size.
  • Retry windows of up to 30 minutes, but requests still sometimes fail.
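For context on the global vs. regional distinction: as far as I understand the Vertex docs, the two endpoint styles differ only in the host and the location path segment (PROJECT_ID, MODEL_ID, and the example region europe-west4 below are placeholders):

```
# global endpoint: plain host, location "global" in the path
https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent

# regional endpoint: region-prefixed host, matching location in the path
https://europe-west4-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/europe-west4/publishers/google/models/MODEL_ID:generateContent
```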

Really curious to hear what we can do to prevent this. I understand there can be compute capacity problems, but even with large gaps between retries the issue persists.

Also, does the normal Gemini API endpoint (not Vertex) get priority over Vertex? That seems odd, as Vertex runs business-critical solutions, yet the errors seem to happen on Vertex and not on the normal endpoint.

Any help is greatly appreciated!


Hi @Mike_W. A 429 Resource Exhausted usually means shared model capacity is busy, not that you hit your quota. The global endpoint does not guarantee free capacity; it only balances traffic. Try using a specific region instead of global, and add steady rate limiting instead of retry bursts. For guaranteed stability you need Provisioned Throughput; otherwise you are using shared resources. Vertex and the public Gemini API run on different capacity pools, so behavior can differ. If the issue continues, open a Cloud Support ticket with request IDs so Google can check regional saturation.
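To illustrate the "steady rate limiting instead of retry bursts" point: instead of letting retries pile up in bursts, you can pace outgoing requests with a small token bucket. A generic sketch (not Vertex-specific; rate and capacity values are up to you):

```python
import time


class TokenBucket:
    """Simple token bucket: allows at most `rate` requests per second on
    average, with short bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # sleep just long enough for the next token to appear
            time.sleep((1 - self.tokens) / self.rate)
```

Each outgoing Vertex call is then preceded by `bucket.acquire()`, so traffic stays at a steady rate instead of spiking after failures.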

Hi, thanks for responding.

The shared model capacity is indeed the problem I am facing. That was what I was trying to explain :).

Changing the region is currently not possible, as the model is not yet available in another region. (I would have hoped it would be by now, but unfortunately the rollout takes very long.)

I understand Provisioned Throughput would fix this, but as I’ve mentioned, it is not feasible for me unless they lower the minimum commitment.

My retry logic already uses exponential intervals, but it hasn’t helped: even a single query in a 10-minute window can still hit the error.

It seems to me that only bigger businesses can run production workloads on Google Vertex? I was hoping I was missing something.


I’m experiencing a similar problem, and I hope Google can listen to paid users like us.

What’s most frustrating is this contradiction: in real-world usage, free Google AI Studio often appears more reliable for image generation than my paid Vertex AI usage. I understand free usage isn’t unlimited, but it still performs significantly better for me than the paid API path. That feels irrational, especially when Vertex AI is positioned as enterprise-grade.


Facing a similar issue. I get the 429 Resource_exhausted when running a pared-down script from the Windows shell to query a base model (or my own fine-tuned model) on Google Cloud, yet the exact same code runs smoothly on Colab. The only difference I can see is authentication: the Windows shell script uses my pay-as-you-go project’s credentials, of type google.oauth2.credentials.Credentials, whereas Colab uses google.auth.compute_engine.credentials.Credentials tied to an unspecified project.

Any solution to get the script on Windows to reliably return queries? I’m working on a small project, so I don’t have the budget for Provisioned Throughput.


Update on this: the problem no longer occurs. I haven’t changed my code; it simply stopped throwing the exception a few days ago, even when I put it under stress.

I suspect Google fixed something behind the scenes, or perhaps a peak in global use of Google’s servers/services caused it. Either way, it would be helpful to know about such background maintenance or peak-usage events, to avoid countless hours of combing through logs and debugging code. Any pointers in this regard would be appreciated.

Yeah, it is likely that their servers are less busy now.

To be fair, I tried Claude and got similar error messages.

The providers just can’t seem to handle the load.

But as someone mentioned before: to me it would make sense to give priority to Vertex users, as they often run business-critical loads, over the free plans on, for example, AI Studio.

Basically what is more important: someone creating pictures of tung tung sahur (brainrot) or people that are creating business/legal documents?

I hope Google will prefer the (paying) second group.