Critical Performance Issues: Gemini Latency Spikes and Unpredictable Inference Times (Vertex AI)

Hello Google Cloud Support Team,

I am reaching out regarding a critical blocker for the launch of our platform, cadrant.ai. Our agentic workflow relies on the Gemini 3 Flash model.

We recently migrated from Google AI Studio to Vertex AI, expecting higher stability and consistent performance for production. However, we are facing severe latency issues that are causing our client projects to crash.

The Issue:

  • Unpredictable Latency: While a standard iteration normally takes 5-10 seconds, we frequently experience spikes where the model takes up to 200 seconds to respond, or sometimes fails to respond entirely.

  • Regional Constraints: Since gemini-3-flash-preview is currently only available via the global endpoint, we have no way to pin our traffic to a specific region (like europe-west9) to potentially reduce latency or improve stability.

  • Resource Exhaustion: We are frequently hitting 429 errors despite being in a paid tier, which seems to be tied to backend availability rather than our specific quota limits.
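
For context, the usual client-side mitigation for intermittent 429s is retrying with exponential backoff and jitter. A minimal sketch follows; `RateLimitError`, `call_with_retry`, and every parameter value are hypothetical stand-ins chosen for illustration, not names from the Vertex AI SDK.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the client raises on HTTP 429."""


def call_with_retry(fn, max_attempts=5, base_s=1.0, cap_s=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(); on RateLimitError, wait a jittered delay and try again.

    The delay before retry number `attempt` is drawn uniformly from
    [0, min(cap_s, base_s * 2**attempt)), the so-called "full jitter" scheme.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            # Out of attempts: surface the error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))
```

Backoff of this kind can smooth over short bursts of 429s, but it does nothing for a request that simply takes 200 seconds to return.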

Our Constraints:

  • As a startup at the launch stage, Provisioned Throughput is currently cost-prohibitive for us. We need the “Pay-as-you-go” model to be reliable enough for a commercial MVP.

  • Our agents require consistent response times to maintain the user experience. Our clients perceive a 200s delay as a system failure, and it causes our generation tasks to time out, so our clients' projects fail.

Questions:

  1. Is there a roadmap for the availability of Gemini 3 Flash in specific regions (non-global) to improve stability?

  2. Are there specific optimizations or request headers we should use, or any way to prioritize our requests, within a Tier 3 / Startup program context?

  3. Is there a recommended “Circuit Breaker” pattern for Vertex AI specifically?
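
To make question 3 concrete, the kind of client-side circuit breaker we have in mind could look like the minimal sketch below. It is purely illustrative: `CircuitBreaker`, `CircuitOpenError`, and the threshold and cooldown values are assumptions, not part of Vertex AI or any Google SDK.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Stop calling a failing backend for a cooldown period.

    closed    -> calls pass through; consecutive failures are counted
    open      -> calls are rejected immediately
    half-open -> after the cooldown, one trial call is allowed through
    """

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("skipping call; breaker is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open state, or too many
            # consecutive failures, (re)opens the breaker.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In this sketch a timeout or 429 raised by the wrapped call counts as a failure, so after a few bad responses the agent fails fast instead of waiting on every request.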

We are fully committed to the Google Cloud ecosystem, but we are currently stuck and cannot launch cadrant.ai under these conditions. Any guidance or internal escalation would be greatly appreciated.

Best regards,

Luca Delanglade
