Frequent 503 “End of TCP stream” with Gemini 2.5 Flash on Vertex AI

Hi Google Cloud / Vertex AI team,

I am using Gemini through Vertex AI, not an AI Studio API key.

Setup:

  • Model: gemini-2.5-flash

  • Location: global

  • Backend: FastAPI

  • Wrapper: LangChain ChatVertexAI

  • Package versions: langchain-google-vertexai==2.1.2, langchain==0.3.25, google-genai==1.47.0

Over the past few days, these 503 errors have become more frequent. Most small routing/extraction/chat calls finish in 1–3 seconds, but intermittently a single call stalls for 5–10 minutes.

Error/warning:

WARNING:langchain_google_vertexai._retry:Retrying langchain_google_vertexai.chat_models._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ServiceUnavailable: 503 End of TCP stream.

Example log:

API: POST /chat/workflows | Status: 200 | Time: 306097.67ms

306097.67ms is around 5.1 minutes.

LangSmith traces show normal calls around 0.9s–2.4s, but bad calls around 585s–603s.

Current setup:

from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(project="<GCP_PROJECT_ID>", location="global", model_name="gemini-2.5-flash", temperature=0.7, max_retries=2)

Questions:

  1. Does 503 End of TCP stream indicate a temporary capacity problem on the Vertex AI / Gemini backend, or a connection being terminated?

  2. Has there been any recent increase in this issue for gemini-2.5-flash on the global endpoint?

  3. Is global still recommended, or should I use a specific region like us-central1?

As a workaround, I am testing direct google-genai calls against Vertex AI, adding a hard timeout, a low retry count, and a concurrency limit, and returning 503/504 to the client instead of waiting several minutes.
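For reference, here is a minimal sketch of the guard I have in mind. It is not the final implementation. The names (guarded_call, CALL_TIMEOUT_S, MAX_ATTEMPTS) and the numeric values are placeholders I made up for illustration, and `call` stands in for whatever async function actually sends the request to Gemini. It uses only the Python standard library.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

# Illustrative values only (assumptions, not tuned recommendations).
CALL_TIMEOUT_S = 20.0   # hard per-attempt timeout
MAX_ATTEMPTS = 2        # low retry count: one try plus one retry


async def guarded_call(
    call: Callable[[], Awaitable[str]],
    semaphore: asyncio.Semaphore,
    *,
    timeout_s: float = CALL_TIMEOUT_S,
    max_attempts: int = MAX_ATTEMPTS,
    backoff_s: float = 0.5,
) -> Tuple[int, Optional[str]]:
    """Run `call` under a concurrency limit, a hard timeout, and a small retry budget.

    Returns (http_status, text): 200 with the result, 504 if the last attempt
    timed out, or 503 if the last attempt raised any other error.
    """
    async with semaphore:  # concurrency limit shared by all callers
        status = 503
        for attempt in range(max_attempts):
            try:
                # wait_for cancels the awaited call once timeout_s elapses.
                text = await asyncio.wait_for(call(), timeout=timeout_s)
                return 200, text
            except asyncio.TimeoutError:
                status = 504  # upstream did not answer in time
            except Exception:
                status = 503  # e.g. ServiceUnavailable from the client library
            if attempt + 1 < max_attempts:
                # Short exponential backoff between attempts.
                await asyncio.sleep(backoff_s * (2 ** attempt))
        return status, None
```

In the FastAPI handler, the returned status would be mapped to the HTTP response, so the worst case is roughly max_attempts × timeout_s plus backoff instead of several minutes. One caveat: asyncio.wait_for can only cancel a call that is genuinely async. If the underlying client call blocks in a thread, the handler returns on time but the thread may keep running.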

Related LangChain issue: