Google Cloud Platform Vertex AI streaming output is extremely slow with the gemini-3-pro-preview model: time-to-first-token exceeds 17 seconds. Are there any effective solutions? (Vertex AI Platform)

hi

We are experiencing a critical performance degradation with the Vertex AI streaming API, which is severely impacting our production application.

Description
When making streaming prediction calls to the Vertex AI API (https://aiplatform.googleapis.com) from our infrastructure in Silicon Valley, USA, we are observing abnormally high latency for the first token in the response stream. The time-to-first-token (TTFT) consistently exceeds 17 seconds, whereas it is typically under 2 seconds.

server address: 142.250.191.42

1. Basic Ping Tests (Connectivity & Baseline Latency)
Run these commands from the affected server/client in Silicon Valley.

(base) [root@usa-gg-test01 ~]# ping aiplatform.googleapis.com
PING aiplatform.googleapis.com (142.250.191.42) 56(84) bytes of data.
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms

2. Python Code Test
Using the model: gemini-3-pro-preview

import requests
import json
import time

def stream_gemini_content():
    api_key = 'xxx'  # API key placeholder
    url = "https://aiplatform.googleapis.com/v1/publishers/google/models/gemini-3-pro-preview:streamGenerateContent?alt=sse"

    headers = {
        "x-goog-api-key": api_key,
        "Content-Type": "application/json"
    }

    data = {
        "contents": [{
            "role": "user",
            "parts": [{
                "text": "请讲一个200字的故事,不要用推理,直接回答。"  # "Tell a 200-word story; no reasoning, answer directly."
            }]
        }],
        "generationConfig": {
            "thinkingConfig": {
                "includeThoughts": False
            }
        }
    }

    print(f"begin requests: {url} ...")

    start_time = time.time()
    first_token_time = None
    last_chunk_time = None  

    try:
        with requests.post(url, headers=headers, json=data, stream=True) as response:

            if response.status_code != 200:
                print(f"status: {response.status_code}")
                print(response.text)
                return

            print("-" * 50)

            for line in response.iter_lines():
                if not line:
                    continue

                decoded_line = line.decode('utf-8').strip()
                if not decoded_line.startswith("data: "):
                    continue

                json_str = decoded_line[6:]
                if json_str == "[DONE]":
                    break

                try:
                    now = time.time()

                    if first_token_time is None:
                        first_token_time = now
                        print(f"\nfirst token TTFT: {(now - start_time) * 1000:.2f} ms")
                        print("-" * 50)
                        last_chunk_time = now  

                    chunk_data = json.loads(json_str)
                    candidates = chunk_data.get("candidates", [])

                    total_elapsed = (now - start_time) * 1000
                    chunk_gap = (now - last_chunk_time) * 1000 if last_chunk_time else 0
                    last_chunk_time = now



                    if candidates:
                        content = candidates[0].get("content", {})
                        parts = content.get("parts", [])
                        if parts:
                            text_chunk = parts[0].get("text", "")
                            print(text_chunk, end="", flush=True)

                except Exception as e:
                    print(f"\n[parse error] {e}")

    except Exception as e:
        print(f"\n[request error] {e}")

    end_time = time.time()
    print("\n\n" + "-" * 50)
    print(f"total time: {(end_time - start_time) * 1000:.2f} ms")


if __name__ == "__main__":
    stream_gemini_content()
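As a side note, the TTFT measurement in the script above can be factored into a small reusable helper that works with any chunk iterator (a sketch; `first_chunk_latency` is a hypothetical name, not part of any SDK):

```python
import math
import time
from typing import Any, Iterable, List, Tuple


def first_chunk_latency(chunks: Iterable[Any]) -> Tuple[float, List[Any]]:
    """Consume a stream and return (seconds until first chunk, all chunks).

    Works with any iterator, e.g. response.iter_lines() from requests or an
    SDK streaming response; returns NaN for the latency if the stream is empty.
    """
    start = time.time()
    ttft = math.nan
    collected: List[Any] = []
    for chunk in chunks:
        if not collected:
            ttft = time.time() - start  # time-to-first-token
        collected.append(chunk)
    return ttft, collected
```

Wrapping `response.iter_lines()` with a helper like this keeps the timing logic out of the parsing loop.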

Our testing runs unacceptably slowly, which is impacting our development cycle. Could you please advise on how to resolve this performance issue? Thank you.

Hello @tomlnux,

I have been messing around with Gemini 3 Pro Preview to answer you.

Here is my first test with Google AI Studio:

Almost 34 seconds! Thinking was set to High.

Now, with Thinking set to Low:

5 seconds! That’s better.

Now with code, using the Google Gen AI library:

from google import genai
from google.genai.types import GenerateContentConfig, ThinkingConfig, ThinkingLevel


client = genai.Client(api_key="YOUR_API_KEY")  # replace with your key

config = GenerateContentConfig(
    thinking_config=ThinkingConfig(
        include_thoughts=False,
        thinking_level=ThinkingLevel.LOW,
    )
)

model = "gemini-3-pro-preview"
response = client.models.generate_content_stream(
    model=model,
    contents="请讲一个200字的故事,不要用推理,直接回答。",  # "Tell a 200-word story; no reasoning, answer directly."
    config=config,
)
for chunk in response:
    print(chunk.text, end="")

Using ThinkingLevel.LOW was way faster than using ThinkingLevel.HIGH.

Last but not least, I think you could get better latency by using a regional endpoint (close to you), but that’s not permitted when you’re not using OAuth2.
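For reference, the regional endpoint follows a fixed URL pattern (a sketch; `build_regional_url` is a hypothetical helper, and the project ID and region below are placeholders — regional endpoints require OAuth2 credentials rather than an API key):

```python
def build_regional_url(region: str, project: str, model: str) -> str:
    # Regional Vertex AI endpoints pin requests to a specific location,
    # which can reduce latency when the region is close to your servers.
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/"
        f"publishers/google/models/{model}:streamGenerateContent?alt=sse"
    )


# Example (placeholders): a region near Silicon Valley for this thread's model.
print(build_regional_url("us-west1", "my-project", "gemini-3-pro-preview"))
```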

Hello,

A similar issue is also discussed in the thread here: Need Guidance on Optimizing API Response times for google gemini 2.5 pro Inference

Though different models, I would recommend taking a look at the next steps to diagnose the performance impact.

Thanks!
Matt

1 Like

Hi tomlnux,

It looks like your time-to-first-token delay is unusually high despite normal network latency, which suggests the bottleneck may be on the Vertex AI side rather than in your infrastructure. A few things you could try: pre-warm the model with a lightweight request, set a smaller maxOutputTokens to get the first token faster, or use the SSE streaming method over a persistent connection rather than initiating a new request each time.
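The pre-warming and connection-reuse ideas can be sketched with `requests.Session`, which keeps the TCP/TLS connection alive between calls (a sketch with placeholder values; `warm_up` is a hypothetical helper):

```python
import requests

# A single Session reuses the underlying TCP/TLS connection across requests,
# so later streaming calls skip the handshake latency.
session = requests.Session()
session.headers.update({
    "x-goog-api-key": "YOUR_API_KEY",  # placeholder
    "Content-Type": "application/json",
})


def warm_up(url: str) -> None:
    """Send a lightweight request to establish the connection early."""
    try:
        session.post(
            url,
            json={"contents": [{"role": "user", "parts": [{"text": "hi"}]}]},
            timeout=10,
        )
    except requests.RequestException:
        pass  # warm-up failures are non-fatal; the real call can still proceed
```

Subsequent `session.post(..., stream=True)` calls then reuse the warmed connection instead of opening a new one.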

1 Like

Thank you very much for your advice/suggestions.

1 Like

Thank you. After testing, setting the thinking level to “low” reduced the request time to under 6 seconds.

1 Like

Hello @tomlnux,

I guess you meant to reply to me :eyes:

I’m very glad that it helped you. You can mark the right answer as a solution, not so much to reward me, but to help other users find a fix to the same problem.

Have a good day :blush:

1 Like