Unexpected pauses or lack of responses from the 'gemini-1.5-flash-002' model in Vertex AI

I have been using the “gemini-1.5-flash-002” model with Vertex AI to generate content for the past few weeks. While it works well initially, it occasionally pauses unexpectedly after processing a certain number of requests.

I attempted to identify a pattern, such as the number of requests or the time elapsed before the pauses occur, but no consistent trend emerged. Sometimes, the model handles around 1,500 requests without issues, while other times, it pauses after approximately 100 requests.

The variation in the number of input tokens between requests is minimal, as the input data is relatively consistent in length.

When the pause occurs, it lasts for about 10 minutes before throwing the following error:


Traceback (most recent call last):
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Internal error encountered."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.67.170:443 {created_time:"2025-01-02T15:39:22.823504078+05:30", grpc_status:13, grpc_message:"Internal error encountered."}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/Documents/other_projects/lab/scripts/other/misc.py", line 58, in <module>
    output = context_based_match(text)
  File "/home/user/Documents/other_projects/lab/scripts/other/misc.py", line 31, in context_based_match
    response = model.generate_content(
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/vertexai/generative_models/_generative_models.py", line 619, in generate_content
    return self._generate_content(
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/vertexai/generative_models/_generative_models.py", line 744, in _generate_content
    gapic_response = self._prediction_client.generate_content(request=request)
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/google/cloud/aiplatform_v1/services/prediction_service/client.py", line 2147, in generate_content
    response = rpc(
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/home/user/Documents/other_projects/lab/.venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InternalServerError: 500 Internal error encountered.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1735812565.706618 32847 init.cc:229] grpc_wait_for_shutdown_with_timeout() timed out.

Here is my sample code:

import base64
import json
from datetime import datetime

import pytz
import vertexai
from google.oauth2 import service_account  # auth via a service account
from vertexai.generative_models import GenerationConfig, GenerativeModel

indian_tz = pytz.timezone("Asia/Kolkata")

cred_in_base64_encoding = "base64-encoded-google-app-credentials"
google_app_creds = json.loads(base64.b64decode(cred_in_base64_encoding).decode("utf-8"))
credentials = service_account.Credentials.from_service_account_info(
    google_app_creds, scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
# Initialize Vertex AI
vertexai.init(
    project=google_app_creds["project_id"],
    location="europe-west3",
    credentials=credentials,
)
model_name = "gemini-1.5-flash-002"
model = GenerativeModel(model_name)

def context_based_match(input_text):
    llm_query = f"LLM prompt with {input_text}"
    # Your response should be in this JSON-format: {{"relevant": boolean}}
    response = model.generate_content(
        llm_query,
        generation_config=GenerationConfig(
            response_mime_type="application/json",
            max_output_tokens=32,
            temperature=0,
            seed=1102,
            response_schema={
                "type": "object",
                "properties": {
                    "relevant": {
                        "type": "boolean",
                    }
                },
                "required": ["relevant"],
            },
        ),
    )
    try:
        return json.loads(response.text)
    except Exception as e:
        print(f"Error: {e}")
        return {"relevant": None}

data = []  # list of input texts
for i, text in enumerate(data, 1):
    output = context_based_match(text)
    print(f"{i} | {datetime.now(indian_tz)} | {output['relevant']} | {text}")

I couldn’t find anything related to this issue in the Google documentation or elsewhere online. Even the “Quotas & System Limits” page doesn’t show any usage statistics for the “gemini-1.5-flash-002” model; I can only see “Online prediction requests per minute per region” statistics.

Any insights would be appreciated.

Hi @urvisism ,

Welcome to Google Cloud Community!

The error you’re encountering, Internal error encountered (StatusCode.INTERNAL), followed by a roughly 10-minute stall, typically points to a server-side problem or an internal limit being exceeded rather than a bug in your client code. Given the intermittent nature of the issue, likely causes include transient problems reaching the Vertex AI service, rate limiting, or resource exhaustion in the underlying infrastructure. Here are some troubleshooting steps:

1. Retry Mechanism: Add retry logic to your code. Instead of failing on the first error, retry several times with exponentially increasing pauses between attempts. This absorbs temporary network or service problems.

2. Request Rate Limiting: Even if you haven’t hit a documented quota, consider adding client-side rate limiting. If your requests exceed undocumented limits of the Gemini service, internal errors can result. Try reducing the number of requests per second or minute.

3. Check Google Cloud Status: Look for any reported outages or service disruptions on the Google Cloud Status Dashboard related to Vertex AI or the region you’re using. If there’s a known issue, waiting for resolution is your best option.

4. Consider Alternative Models (if possible): If your application’s requirements allow, explore using a different Gemini model or even a different large language model altogether. This might avoid the specific issue you’re encountering with gemini-1.5-flash-002.
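Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names `with_retries` and `RateLimiter` are hypothetical, and in production you may prefer the library's own `google.api_core.retry.Retry` with an `if_exception_type(InternalServerError)` predicate instead of hand-rolling the loop.

```python
import random
import time


def with_retries(fn, *, retries=5, base_delay=2.0, max_delay=60.0,
                 retry_on=(Exception,)):
    """Call fn(); on a retryable error, back off exponentially (with jitter) and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids thundering herds


class RateLimiter:
    """Naive client-side limiter: enforce a minimum interval between calls."""

    def __init__(self, max_per_minute):
        self.min_interval = 60.0 / max_per_minute
        self.last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()
```

In your loop you would then do something like `limiter.wait()` followed by `with_retries(lambda: context_based_match(text), retry_on=(google.api_core.exceptions.InternalServerError,))`, so only the 500-class errors from the traceback are retried while genuine bugs still fail fast.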

Since the error is consistently an “Internal Server Error” and doesn’t appear to be caused by your request content, contacting Google Cloud support is the best approach. They can access internal logs and identify the root cause of the intermittent service disruptions affecting your use of the gemini-1.5-flash-002 model. Provide them with the error message, timestamps, and, if possible, some sample requests.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.