Hi Google Cloud / Vertex AI team,
I am using Gemini through Vertex AI, not an AI Studio API key.
Setup:
Model: gemini-2.5-flash
Location: global
Backend: FastAPI
Wrapper: LangChain ChatVertexAI
Package versions: langchain-google-vertexai==2.1.2, langchain==0.3.25, google-genai==1.47.0
Most small routing/extraction/chat calls finish in 1–3 seconds, but intermittently a single call hangs for 5–10 minutes. In recent days this has been happening more often.
Error/warning:
WARNING:langchain_google_vertexai._retry:Retrying langchain_google_vertexai.chat_models._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ServiceUnavailable: 503 End of TCP stream.
Example log:
API: POST /chat/workflows | Status: 200 | Time: 306097.67ms
306097.67ms is around 5.1 minutes.
LangSmith traces show normal calls around 0.9s–2.4s, but bad calls around 585s–603s.
Current setup:
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(project="<GCP_PROJECT_ID>", location="global", model_name="gemini-2.5-flash", temperature=0.7, max_retries=2)
Questions:
Does 503 End of TCP stream indicate temporary Vertex AI / Gemini backend capacity or connection termination?
Has there been any recent increase in this issue for gemini-2.5-flash on the global endpoint?
Is global still recommended, or should I use a specific region like us-central1?
As a workaround, I am testing the google-genai SDK directly against Vertex AI, with a hard timeout, a low retry count, and a concurrency limit, and returning 503/504 to the client instead of waiting several minutes.
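The workaround described above could look roughly like the following. This is a minimal, stdlib-only sketch and not the google-genai or LangChain API: `call_with_guard`, `UpstreamTimeout`, `UpstreamUnavailable`, and the limit values are hypothetical names and numbers chosen for illustration, and `call_model` stands in for whatever async function actually calls Gemini.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical limits for illustration; tune them for your workload.
HARD_TIMEOUT_S = 15.0   # per-attempt deadline
MAX_ATTEMPTS = 2        # total attempts, including the first
MAX_CONCURRENCY = 8     # suggested size for the shared semaphore


class UpstreamUnavailable(Exception):
    """Every attempt failed with an error; intended to map to HTTP 503."""
    status_code = 503


class UpstreamTimeout(Exception):
    """The final attempt hit the hard timeout; intended to map to HTTP 504."""
    status_code = 504


async def call_with_guard(
    call_model: Callable[[str], Awaitable[str]],
    prompt: str,
    *,
    semaphore: asyncio.Semaphore,
    timeout_s: float = HARD_TIMEOUT_S,
    max_attempts: int = MAX_ATTEMPTS,
    backoff_s: float = 0.5,
) -> str:
    """Run call_model(prompt) with a concurrency cap, a hard timeout, and few retries."""
    last_error: Optional[BaseException] = None
    async with semaphore:  # cap the number of in-flight model calls
        for attempt in range(1, max_attempts + 1):
            try:
                # wait_for cancels the call once the deadline passes
                return await asyncio.wait_for(call_model(prompt), timeout=timeout_s)
            except Exception as exc:  # includes asyncio.TimeoutError
                last_error = exc
            if attempt < max_attempts:
                await asyncio.sleep(backoff_s * attempt)  # short linear backoff
    if isinstance(last_error, asyncio.TimeoutError):
        raise UpstreamTimeout(f"no response within {timeout_s}s") from last_error
    raise UpstreamUnavailable(str(last_error)) from last_error
```

In a FastAPI handler, the two exceptions could be caught and turned into `HTTPException(status_code=exc.status_code)`. Two caveats: the semaphore should be created inside the running event loop (for example at application startup) and shared across requests, and `asyncio.wait_for` only cancels the awaiting coroutine, so a blocking call running in a worker thread would keep running in the background after the deadline.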
Related LangChain issue:
opened 10:49AM - 29 Apr 26 UTC
Labels: bug, vertexai
### Package (Required)
- [ ] langchain-google-genai
- [x] langchain-google-vertexai
- [ ] langchain-google-community
- [ ] Other / not sure / general
### Checked other resources
- [x] I added a descriptive title to this issue
- [x] I searched the LangChain documentation and API reference (linked above)
- [x] I used the GitHub search to find a similar issue and didn't find it
- [x] I am sure this is a bug and not a question or request for help
### Example Code (Python)
```python
import time
from langchain_google_vertexai import ChatVertexAI
GOOGLE_CLOUD_PROJECT = "<PROJECT_ID>"
GOOGLE_CLOUD_LOCATION = "global"
MODEL_NAME = "gemini-2.5-flash"
llm = ChatVertexAI(
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
    model_name=MODEL_NAME,
    temperature=0.0,
    max_retries=2,
    thinking_budget=0,
)
prompt = """
You are a routing assistant.
Return ONLY one integer.
Available workflows:
- 4821: weather query
- 3158: billing issue
- 7603: technical error report
- 1846: product recommendation
- 1: default RAG system
User query:
techincal isseu wheree can i get the report
Your response:
"""
for i in range(10):
    start = time.perf_counter()
    try:
        response = llm.invoke(prompt)
        elapsed = round(time.perf_counter() - start, 2)
        print(f"attempt={i + 1} success elapsed={elapsed}s")
        print("response:", response.content)
    except Exception as exc:
        elapsed = round(time.perf_counter() - start, 2)
        print(f"attempt={i + 1} error elapsed={elapsed}s")
        print("error type:", type(exc).__name__)
        print("error:", exc)
```
### Error Message and Stack Trace (if applicable)
```shell
WARNING:langchain_google_vertexai._retry:Retrying langchain_google_vertexai.chat_models._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ServiceUnavailable: 503 End of TCP stream.
```
Example application log:
```shell
INFO | workflow_service.py | 170 | [Workflow Input] field=['country', 'days', 'data'], extracted=False, workflow_terminated=True, value='', tokens={'input_tokens': 1551, 'output_tokens': 36, 'total_tokens': 1587}
INFO | logging.py | 18 | API: POST /chat/workflows | Status: 200 | Time: 306097.67ms
INFO: 127.0.0.1:54298 - "POST /chat/workflows HTTP/1.1" 200 OK
```
306097.67 ms is about 306 seconds (5.1 minutes).
LangSmith traces show that most ChatVertexAI calls complete normally in around 0.9s–2.4s, but intermittent calls take around 585s–603s.
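To see how often the slow requests occur, the `Time:` field in access-log lines like the one above can be scanned for outliers. The following is a minimal sketch that assumes the log format shown in this report; `slow_requests` and the 30-second threshold are hypothetical, and the second sample line is made up for illustration.

```python
import re

# Matches the application's access-log format shown in this report.
LOG_LINE = re.compile(
    r"API: (?P<method>\S+) (?P<path>\S+) \| Status: (?P<status>\d+) "
    r"\| Time: (?P<ms>\d+(?:\.\d+)?)ms"
)


def slow_requests(lines, threshold_s=30.0):
    """Return (path, status, seconds) for each request slower than threshold_s."""
    slow = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not an access-log line
        seconds = float(match["ms"]) / 1000.0
        if seconds > threshold_s:
            slow.append((match["path"], int(match["status"]), round(seconds, 1)))
    return slow


sample = [
    "INFO | logging.py | 18 | API: POST /chat/workflows | Status: 200 | Time: 306097.67ms",
    # Hypothetical fast request, added only for contrast.
    "INFO | logging.py | 18 | API: POST /chat/workflows | Status: 200 | Time: 1840.12ms",
]
print(slow_requests(sample))
```

Counting the flagged lines per hour or per day would give a rough measure of whether the hangs are becoming more frequent.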
### Description
### What were you trying to do?
I am using `ChatVertexAI` with Vertex AI and `gemini-2.5-flash` inside a FastAPI chat/workflow service.
The calls are usually small routing/extraction/chat calls, and most complete normally in around 1–3 seconds. The long `ChatVertexAI` hangs used to be rare, but in recent days I am seeing them more often in both production and testing.
My setup:
```python
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(
    project="<GCP_PROJECT_ID>",
    location="global",
    model_name="gemini-2.5-flash",
    temperature=0.7,
    max_retries=2,
)
```
Environment:
```
Framework: FastAPI
Vertex AI model: gemini-2.5-flash
Vertex AI location: global
Auth: service account / Vertex AI project
```
Relevant package versions:
```
langchain==0.3.25
langchain-community==0.3.25
langchain-google-vertexai==2.1.2
langchain-openai==0.3.33
google-genai==1.47.0
google-cloud-storage==2.19.0
```