Provisioned Throughput for Gemini

Hello Everyone,

Before describing the problem, I want to mention that we have already opened several tickets via premium support, and despite the best efforts of the support engineer, nobody has been able to provide us with a solution or a clear explanation of the problem yet.

  • We are using Gemini-1.5-flash on Vertex AI and constantly, unpredictably run into Error 429 (resource exhausted).
  • We know this is happening due to shared resources, so we ordered "Provisioned Throughput".
  • I can see the order in Model Garden; it is active and for the correct region and model.
  • Endpoint:
https://europe-west3-aiplatform.googleapis.com/v1/projects/***/locations/europe-west3/publishers/google/models/gemini-1.5-flash-002:generateContent

Now, according to public documentation and what we learned from support engineers:

  • Provisioned Throughput is used automatically (default)
  • We can control it by adding this header:
X-Vertex-AI-LLM-Request-Type: dedicated

or

X-Vertex-AI-LLM-Request-Type: shared
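To make the setup concrete, here is a minimal sketch of how we build the generateContent request with that header. The endpoint and header name are the ones quoted above; `project_id`, the access token, and the `build_request` helper itself are placeholders, not our production code:

```python
import json
import urllib.request


def build_request(project_id: str, token: str,
                  request_type: str = "dedicated") -> urllib.request.Request:
    """Build a generateContent request that targets Provisioned Throughput.

    request_type: "dedicated" = serve only from Provisioned Throughput
                  (429 instead of overflowing to shared quota);
                  "shared"    = use pay-as-you-go capacity only.
    """
    url = (
        "https://europe-west3-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/europe-west3/"
        "publishers/google/models/gemini-1.5-flash-002:generateContent"
    )
    body = {"contents": [{"role": "user", "parts": [{"text": "ping"}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "X-Vertex-AI-LLM-Request-Type": request_type,
        },
        method="POST",
    )
```

Sending the request is then just `urllib.request.urlopen(build_request(...))` with a real project ID and token.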

Results:
a) Not adding the header:
We randomly run into 429s, even for very small prompts and at fewer than 50 RPM.
b) X-Vertex-AI-LLM-Request-Type: dedicated:
We get a 429 immediately for every single request.

a) & b)
The response header is never present in either case.

According to documentation:
If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This applies only to the generateContent API call.

  {"X-Vertex-AI-LLM-Request-Type": xxx}
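This is how we check for that header on the response. The helper name is ours, and `response_headers` stands for whatever header mapping your HTTP client returns; only the header name itself comes from the documentation:

```python
def served_by_provisioned_throughput(response_headers: dict) -> bool:
    """Return True if the docs' marker header says the call was served
    by Provisioned Throughput ("dedicated")."""
    # HTTP header names are case-insensitive, so normalise before checking.
    normalized = {k.lower(): v for k, v in response_headers.items()}
    return normalized.get("x-vertex-ai-llm-request-type") == "dedicated"
```

In our tests this always returns False, because the header is simply absent from every response we get.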

Conclusion:
We are NOT using Provisioned Throughput.

Has anyone here already used Provisioned Throughput? What are we missing?
