Hi,
We are experiencing a critical performance degradation with the Vertex AI streaming API, which is severely impacting our production application.
Description
When making streaming prediction calls to the Vertex AI API (https://aiplatform.googleapis.com) from our infrastructure in Silicon Valley, USA, we observe abnormally high latency for the first token in the response stream. The time-to-first-token (TTFT) consistently exceeds 17 seconds, whereas it is typically under 2 seconds.
Server address: 142.250.191.42
1. Basic Ping Tests (Connectivity & Baseline Latency)
Run these commands from the affected server/client in Silicon Valley.
(base) [root@usa-gg-test01 ~]# ping aiplatform.googleapis.com
PING aiplatform.googleapis.com (142.250.191.42) 56(84) bytes of data.
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms
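The ping results above only measure ICMP round-trip time; a 17 s TTFT alongside a 2.7 ms ping suggests the delay is not raw network latency. As a further check, a sketch like the following (our own diagnostic, with hypothetical function names, assuming the public endpoint on port 443) splits connection setup into DNS, TCP, and TLS phases to see whether time is lost before the HTTP request is even sent:

```python
# Hypothetical diagnostic sketch (not part of any Google SDK): time the
# DNS, TCP, and TLS phases of connecting to the Vertex AI endpoint.
import socket
import ssl
import time


def ms(start, end):
    """Elapsed time between two time.time() readings, in milliseconds."""
    return (end - start) * 1000


def time_connection(host="aiplatform.googleapis.com", port=443):
    timings = {}

    t0 = time.time()
    # DNS phase: resolve the hostname to an address.
    addr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4]
    timings["dns_ms"] = ms(t0, time.time())

    t0 = time.time()
    # TCP phase: three-way handshake.
    sock = socket.create_connection(addr, timeout=10)
    timings["tcp_ms"] = ms(t0, time.time())

    t0 = time.time()
    # TLS phase: certificate exchange and handshake.
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(sock, server_hostname=host) as tls_sock:
        timings["tls_ms"] = ms(t0, time.time())
    return timings
```

Calling time_connection() from the affected host should show each phase in the low tens of milliseconds if the network path is healthy; if all three are fast, the TTFT delay is almost certainly server-side.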
2. Python Code Test
Using the model: gemini-3-pro-preview
import requests
import json
import time


def stream_gemini_content():
    api_key = 'xxx'
    url = ("https://aiplatform.googleapis.com/v1/publishers/google/models/"
           "gemini-3-pro-preview:streamGenerateContent?alt=sse")
    headers = {
        "x-goog-api-key": api_key,
        "Content-Type": "application/json"
    }
    data = {
        "contents": [{
            "role": "user",
            "parts": [{
                "text": "Tell a 200-word story. Answer directly, without reasoning."
            }]
        }],
        "generationConfig": {
            "thinkingConfig": {
                "includeThoughts": False
            }
        }
    }
    print(f"begin request: {url} ...")
    start_time = time.time()
    first_token_time = None
    try:
        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code != 200:
                print(f"status: {response.status_code}")
                print(response.text)
                return
            print("-" * 50)
            for line in response.iter_lines():
                if not line:
                    continue
                decoded_line = line.decode('utf-8').strip()
                if not decoded_line.startswith("data: "):
                    continue
                json_str = decoded_line[6:]
                if json_str == "[DONE]":
                    break
                try:
                    now = time.time()
                    if first_token_time is None:
                        first_token_time = now
                        print(f"\n[TTFT] first token: {(now - start_time) * 1000:.2f} ms")
                        print("-" * 50)
                    chunk_data = json.loads(json_str)
                    candidates = chunk_data.get("candidates", [])
                    if candidates:
                        content = candidates[0].get("content", {})
                        parts = content.get("parts", [])
                        if parts:
                            text_chunk = parts[0].get("text", "")
                            print(text_chunk, end="", flush=True)
                except json.JSONDecodeError:
                    # Skip malformed SSE chunks instead of silently swallowing
                    # all exceptions.
                    continue
    except requests.RequestException as e:
        print(f"request failed: {e}")
    end_time = time.time()
    print("\n\n" + "-" * 50)
    print(f"total time: {(end_time - start_time) * 1000:.2f} ms")


if __name__ == "__main__":
    stream_gemini_content()
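Along the same lines, it can help to record the arrival time of every chunk, not just the first, to see whether the delay is concentrated in TTFT or spread across the whole stream. A small pure helper, sketched here (chunk_gaps is our own name, not an API), turns a list of collected timestamps into TTFT plus inter-chunk gaps:

```python
# Hypothetical helper: given the request start time and a list of chunk
# arrival timestamps (both from time.time()), compute TTFT and the gap
# between consecutive chunks, all in milliseconds.
def chunk_gaps(start_time, chunk_times):
    if not chunk_times:
        return None, []
    ttft_ms = (chunk_times[0] - start_time) * 1000
    gaps_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
    return ttft_ms, gaps_ms
```

If TTFT dominates while the per-chunk gaps stay small, the slowdown is in request admission or model startup rather than in streaming throughput.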
This test consistently reproduces the slow first token and is blocking our development cycle. Could you please advise on how to resolve this performance issue? Thank you.


