Hello Everyone,
Before describing the problem, I want to mention that we have already opened several tickets via premium support, and despite the support engineers' best efforts, nobody has been able to provide us with a solution or a clear explanation of the problem yet.
- We are using Gemini-1.5-flash on Vertex AI and constantly, unpredictably run into Error 429 (resource exhausted).
- We know this is happening due to shared resources, so we ordered "Provisioned Throughput".
- I can see the order in the Model Garden; it is active and in the correct region for the correct model.
- Endpoint:
https://europe-west3-aiplatform.googleapis.com/v1/projects/***/locations/europe-west3/publishers/google/models/gemini-1.5-flash-002:generateContent
Now, according to public documentation and what we learned from support engineers:
- Provisioned Throughput is used automatically (default)
- We can control it by adding this header:
X-Vertex-AI-LLM-Request-Type: dedicated
or
X-Vertex-AI-LLM-Request-Type: shared
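For reference, this is roughly how we build the request (a minimal sketch, not our production code; the project ID and access token below are placeholders, and in a real call the token would come from `gcloud auth print-access-token` or the google-auth library):

```python
# Sketch: a generateContent request with the traffic-type header set.
# Per the docs: "dedicated" = only Provisioned Throughput (429 once PT quota
# is exhausted), "shared" = only on-demand capacity, omit the header for the
# default (PT first, spillover to shared).

PROJECT_ID = "your-project"   # placeholder
REGION = "europe-west3"
MODEL = "gemini-1.5-flash-002"

url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/"
    f"projects/{PROJECT_ID}/locations/{REGION}/"
    f"publishers/google/models/{MODEL}:generateContent"
)

headers = {
    "Authorization": "Bearer <ACCESS_TOKEN>",   # placeholder
    "Content-Type": "application/json",
    "X-Vertex-AI-LLM-Request-Type": "dedicated",
}

body = {"contents": [{"role": "user", "parts": [{"text": "ping"}]}]}

# Sent e.g. with:  resp = requests.post(url, headers=headers, json=body)
print(url)
```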
Results:
a) Not adding the header
We randomly run into 429s, even for very small prompts and at below 50 RPM.
b) X-Vertex-AI-LLM-Request-Type: dedicated
We get 429 immediately for every single request.
a) & b)
The response header is never present in either case.
According to documentation:
If a request was processed using Provisioned Throughput, the following HTTP header is present in the response (this applies only to the generateContent API call):
{"X-Vertex-AI-LLM-Request-Type": xxx}
Conclusion:
Our requests are NOT being served by Provisioned Throughput.
Has anyone here used Provisioned Throughput already? What are we missing?