Vertex AI Evaluation Issues (429 RESOURCE_EXHAUSTED and TOOL_USE_QUALITY Input Schema Error)

I am writing to inquire about the following two issues encountered while using the Vertex AI Evaluation feature.


1. 429 RESOURCE_EXHAUSTED Errors During Vertex AI Evaluation (Judge model resource exhausted)

Issue Description

When running evaluations using the Vertex AI Python SDK Evaluation API, the following error intermittently occurs for some evaluation cases:

429 RESOURCE_EXHAUSTED
Judge model resource exhausted. Please try again later.

Within the same evaluation request, some cases are evaluated successfully, while others fail with score and explanation remaining None.
In the summary metrics, these failures are counted as error=1.

Environment

  • Location: global

  • Authentication: Application Default Credentials (user credentials)

  • API used: Client().evals.evaluate(...)

  • Metric used: RubricMetric.GENERAL_QUALITY
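For context, the failing call is made roughly like this (the project ID is a placeholder, and the exact import path for `Client`/`types` may vary by SDK version):

```python
def run_general_quality_eval(eval_df):
    # Hypothetical reconstruction of the call described above; "my-project"
    # is a placeholder and constructor details are assumptions.
    from vertexai import Client, types

    client = Client(project="my-project", location="global")
    return client.evals.evaluate(
        dataset=eval_df,
        metrics=[types.RubricMetric.GENERAL_QUALITY],
    )
```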

Mitigation Attempts

  • Implemented batch-level retry logic in addition to the SDK’s default retry behavior

  • However, when retries are enabled, processing time exceeds 1 minute per sample, which is not practical for real-world usage

Questions

In situations where judge model calls partially fail with 429 errors, is there a recommended approach to ensure stable evaluation without relying heavily on retries?

Additionally, when attempting to increase the relevant quota, the Edit quota option is disabled on the Quotas page.
Could you please advise on the correct procedure to request a quota increase for judge model usage in this case?


2. INVALID_ARGUMENT Error Related to tool_usage When Using TOOL_USE_QUALITY Metric

Issue Description

When running evaluations with RubricMetric.TOOL_USE_QUALITY, the evaluation fails with the following error:

400 INVALID_ARGUMENT
Error rendering metric prompt template:
Variable tool_usage is required but not provided.

Details

  • According to the official documentation, the tool_use_quality_v1 metric requires the following inputs:
    prompt, developer_instruction, tool_declarations, and intermediate_events,
    and tool_usage is not documented as a required input.

  • However, during execution, the server-side evaluation logic appears to require the {tool_usage} variable, resulting in the error above.

  • This issue persists even when:

    • intermediate_events are provided in Gemini-compatible function call / function response format

    • The client is initialized with HttpOptions(api_version="v1beta1")

    • The metric is explicitly pinned as RubricMetric.TOOL_USE_QUALITY(version="v1")
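For reference, the pinning attempts described above look roughly like this (a reconstruction, not verified code; the project ID is a placeholder and the constructor signatures are assumptions that may differ across SDK versions):

```python
def build_pinned_client_and_metric():
    # Hypothetical reconstruction of the pinning described above; the
    # Client/HttpOptions signatures are assumptions.
    from vertexai import Client, types

    client = Client(
        project="my-project",  # placeholder
        location="global",
        http_options=types.HttpOptions(api_version="v1beta1"),
    )
    metric = types.RubricMetric.TOOL_USE_QUALITY(version="v1")
    return client, metric
```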

Questions

  1. Is it expected behavior that the TOOL_USE_QUALITY metric internally relies on a legacy prompt-template path requiring {tool_usage}?

  2. If tool_usage is indeed required, could you please provide the officially supported schema and a concrete example?

  3. Is the documented intermediate_events-based usage insufficient for successfully running this metric at the moment?


We would appreciate any root cause analysis, official guidance, or recommended workarounds.

Thank you for your support.

From my point of view, the root cause of your second issue is that the evaluation engine’s internal template for tool use quality expects a variable named {tool_usage}, while the standard EvaluateInstances / tool_use_quality_v1 documentation only covers prompt, tool_declarations, and intermediate_events.

You can explicitly map the tool usage field, or check the API version. Alternatively, you can try the following:

Consider using the PointwiseMetric with a custom rubric that mimics the Tool Use Quality logic. This allows you to define your own variables (like tool_call) and avoid the hardcoded internal template error.

Thank you for your explanation.

After analyzing the SDK library code, we verified that when the data is structured with the following schema, the TOOL_USE_QUALITY metric runs without errors:

df = pd.DataFrame(
    [
        {
            "prompt": "AI and data science advancements in 2024",
            "response": "Successfully finished",
            "intermediate_events": [
                {
                    "content": {
                        "parts": [
                            {
                                "function_call": {
                                    "name": "Senior Research Analyst",
                                    "args": {
                                        "query": "Conduct comprehensive research..."
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    ]
)

However, having to analyze the SDK source code each time or consult with experts to determine the correct schema is quite cumbersome.
We would like to ask whether there is any official documentation that clearly specifies the expected input schema for each metric (including Tool Use Quality), so that we can reference it directly.
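In the meantime, a lightweight pre-flight check can at least catch schema gaps before submitting an evaluation. This sketch assumes the schema we identified above is correct, and is purely illustrative:

```python
def validate_tool_use_rows(rows):
    """Pre-flight check for the TOOL_USE_QUALITY row schema found above.

    Illustrative only; the authoritative schema lives in the SDK/API docs.
    """
    required = ("prompt", "response", "intermediate_events")
    problems = []
    for i, row in enumerate(rows):
        for key in required:
            if key not in row:
                problems.append(f"row {i}: missing '{key}'")
        for j, event in enumerate(row.get("intermediate_events", [])):
            parts = event.get("content", {}).get("parts", [])
            # Each intermediate event is expected to carry a function_call part
            if not any("function_call" in part for part in parts):
                problems.append(f"row {i}, event {j}: no function_call part")
    return problems
```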

Thank you.

Hello @lgeax,

About the 429

The Resource Exhausted (429) errors page may interest you. From the documentation, retry and exponential backoff are the way to go.
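A generic version of that retry-with-exponential-backoff pattern looks like this (a sketch; the broad `Exception` catch stands in for whatever 429/ResourceExhausted error class your SDK raises):

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() on transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in practice, catch the SDK's 429/ResourceExhausted
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, 0.1 * delay))  # add jitter
```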

Note that there is a 30,000 requests per minute (RPM) limit, and several tiers are available in Standard PayGo.

From Troubleshooting Error Code 429 on GenAI:

On the pay-as-you-go quota framework, you have the following options for resolving 429 errors:

The luxury solution would be to set Provisioned Throughput but the prices are insane as it’s oriented for mission-critical workloads. I don’t think that you’re going to need that but if anyone is wondering how much it costs to have a dedicated gemini endpoint… There is a price calculator for that.

That said, which model are you using as a Judge Model?

From experience, pro models may return 429 errors more often than flash ones, as they are used more and take longer to respond, thus consuming more resources, which can lead to instability, especially when there is a shared pool.

If you’re already using flash / flash-lite models… You may have to stick with the documentation :confused:


About tool_usage

I’ve found a (mostly) recent notebook from a Google employee that walks through agent evaluation, including the use of RubricMetric.TOOL_USE_QUALITY.

As the product is still in beta, the first thing that comes to mind would be to upgrade to the latest version. We can even see those commands from the notebook:

%pip install -q google-cloud-aiplatform[adk,agent_engines]
%pip install --upgrade --force-reinstall -q google-cloud-aiplatform[evaluation]

We have already reviewed the notebook you shared
(https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/create_agent_and_run_evaluation.ipynb).

The example code in the notebook prepares only the prompt and generates a dataset for evaluation through the inference process.
However, in our case, we would like to perform evaluation without running inference, using a Bring-Your-Own-Response (BYOR) approach.

  1. For running the TOOL_USE_QUALITY metric without data schema errors, we analyzed the vertexai SDK library and identified the required schema.
    We confirmed that no runtime errors occur, but we would like to ask whether using it in the following way is correct for the TOOL_USE_QUALITY metric.

    df = pd.DataFrame(
        [
            {
                "prompt": "AI and data science advancements in 2024",
                "response": "Successfully finished",
                "intermediate_events": [
                    {
                        "content": {
                            "parts": [
                                {
                                    "function_call": {
                                        "name": "Senior Research Analyst",
                                        "args": {
                                            "query": "Conduct comprehensive research..."
                                        }
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        ]
    )
    
  2. To use built-in metrics (including TOOL_USE_QUALITY and other built-in metrics), do we need to analyze the vertexai SDK library each time to identify the required schema as done above?

Thank you in advance for your guidance and support.


To stabilise your evaluation pipeline, the code should follow this pattern:

import time

import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric

# 1. Initialize with a specific region if us-central1 is crowded
vertexai.init(project="your-project", location="europe-west4")

# 2. Use a custom metric if the built-in TOOL_USE_QUALITY is buggy;
#    this avoids the hardcoded requirement for {tool_usage}
custom_tool_metric = PointwiseMetric(
    metric="custom_tool_use",
    metric_prompt_template="Assess if {tool_call} matches the {reference_schema}...",
)

# 3. Implement manual batching to avoid 429s
def run_batched_eval(data, batch_size=20):
    results = []
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Run eval on the batch here and append to results
        time.sleep(5)  # Small cooldown for the Judge model
    return results

Please feel free to connect

While Google’s documentation for the Vertex AI Rapid Evaluation SDK is still being consolidated, there are specific official references you can use to identify the required schema for each metric.

  1. The Definitive Schema Reference

The most detailed technical documentation for these schemas is actually found in the Vertex AI API Reference for the evaluateInstances method. Because the Python SDK is a wrapper for this REST API, the “Input” objects defined there map directly to the keys required in your Pandas DataFrame or dictionary.

  • Official Link: Vertex AI API Reference: EvaluateInstances
  • Key Section: Look at the metric_inputs field. It lists individual input objects for every supported metric (e.g., toolCallValidInput, groundednessInput, fulfillmentInput).
  2. Required Fields by Metric Type

Based on the official API specs, here is a quick reference for the most common Gen AI metrics:

Metric Category and Required Schema Keys (Columns in your DataFrame):

  • Tool Use Quality: prompt, response, intermediate_events (containing function_call objects)

  • Groundedness: response, context (the source text to check against)

  • Fulfillment: prompt, response

  • Summarization: prompt, response, context (the original text being summarized)

  • Question Answering: prompt, response, reference (the ground truth answer)

  • Safety / Fluency: response (these are often pointwise and only need the output)
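For convenience, the mapping above can also be expressed as a lookup table for programmatic pre-flight checks (the keys below are informal labels taken from the list, not official metric IDs):

```python
# Required DataFrame columns per metric, transcribed from the list above.
REQUIRED_COLUMNS = {
    "tool_use_quality": {"prompt", "response", "intermediate_events"},
    "groundedness": {"response", "context"},
    "fulfillment": {"prompt", "response"},
    "summarization": {"prompt", "response", "context"},
    "question_answering": {"prompt", "response", "reference"},
    "safety": {"response"},
    "fluency": {"response"},
}


def missing_columns(metric, df_columns):
    """Return which required columns are absent for a given metric."""
    return REQUIRED_COLUMNS[metric] - set(df_columns)
```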
  3. Documentation for Tool-Specific Metrics

For tool-related evaluations specifically, Google recently updated the documentation to reflect the “Trajectory” and “Tool Call” metrics. You can find high-level descriptions and requirements here:

  • Model-based metrics overview: This page categorizes metrics and lists their general requirements (e.g., whether they require “Reference” or “Context”).
  • Gen AI Evaluation Service API reference: This page provides a “Requirements” column in its metric table that specifies exactly which fields (like URI, instruction, or context) must be present.

Why the error happened in your case

The specific error you encountered (tool_usage variable required) often stems from a version mismatch where the SDK expects the “v1” schema but the backend is using a “v1beta” template.

Suggestion for the Future: If you ever encounter an error about a “missing variable” in a rubric metric, you can usually bypass the schema headache by using a PointwiseMetric. It allows you to define your own keys so you aren’t locked into the pre-defined (and sometimes poorly documented) internal schemas of the RubricMetric constants.