Hi team,
We are seeing intermittent spikes of HTTP 429 (RESOURCE_EXHAUSTED) errors from Vertex AI `GenerateContent` and want to confirm whether this is expected behavior.
Environment
- Client: Go `google.golang.org/genai` v1.15.0 (Vertex AI backend)
- Endpoint: PredictionService.GenerateContent
- Models tried: `gemini-2.5-flash-lite`, `gemini-3-flash-preview`
- Region: moved from us-central1 to global (no clear improvement)
- Plan: pay-as-you-go
- Retry policy: retry on 429/500/503 with exponential backoff and jitter, up to 5 retries (6 attempts total)
- Usage: very low, well below project quota limits
Observed behavior
- 1,987 initial requests
- ~4.6% final failure rate (still 429 after all retries)
- Avg latency: ~3.1s
- P99 latency: ~44s
- Errors are spiky (e.g., near-zero for several days, then concentrated spikes on specific days)
We are well below reported project quota limits. This level of transient failure makes the service difficult to ship reliably to users.
Questions
- Is a ~4–5% final 429 failure rate considered normal on the `global` endpoint for pay-as-you-go?
- Can these 429s be caused by shared model capacity even when project quotas are not exceeded?
- What mitigation pattern is recommended for production reliability: region failover per retry, model fallback, provisioned throughput, or other approaches?
- Is there any reliability difference between calling this endpoint via the Go SDK versus the raw HTTP/REST API?
Appreciate any guidance on whether this is expected and how best to mitigate it.