Hello there!
We’re thrilled to introduce an experimental version of the new Vertex Gen AI Eval SDK for Python!
The Vertex Gen AI Eval SDK for Python is a client-side framework for evaluating generative AI models and applications. This new release introduces a more intuitive workflow, expanded model support, and flexible data handling.
This is an Experimental release. While it isn’t intended for production use, it’s an opportunity for you to try a new approach and for us to gather your valuable feedback.
With that being said, let’s see what’s new!
What’s New?
The new SDK simplifies the evaluation process with a two-step workflow and introduces several new features.
1. Simplified Two-Step Evaluation
The core of the new SDK is a simple two-step process: run_inference() to generate responses and evaluate() to compute metrics.
import pandas as pd
from vertexai import Client, types
# Initialize client
client = Client(project="your-project-id", location="us-central1")
# Create your dataset
prompts_df = pd.DataFrame({"prompt": ["How does AI work?"]})
# Run inference to get responses
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df
)
# Evaluate the responses
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
# Visualize results directly in your notebook
eval_result.show()
2. Side-by-Side Model Comparison
You can now easily compare multiple models in a single run. Just pass a list of inference results to the evaluate() method to get a side-by-side comparison, complete with win-rate calculations.
# Run inference on two different models
candidate_1 = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
candidate_2 = client.evals.run_inference(model="gemini-2.5-pro", src=prompts_df)
# Compare the results
comparison_result = client.evals.evaluate(
    dataset=[candidate_1, candidate_2],
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
comparison_result.show()
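Conceptually, a pairwise win rate is just the fraction of prompts on which one candidate's score beats the other's. As an illustrative sketch in plain Python (not the SDK's internal implementation, which is not documented here):

```python
def win_rate(scores_a: list[float], scores_b: list[float]) -> float:
    """Fraction of examples where candidate A strictly outscores candidate B.

    Ties count as non-wins in this simplified sketch.
    """
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return wins / len(scores_a)

# Candidate A wins on 2 of 3 prompts
print(win_rate([0.9, 0.7, 0.8], [0.6, 0.8, 0.5]))  # ~0.667
```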
3. Native Third-Party Model Support
Evaluate and compare models from providers like OpenAI directly within the SDK. The SDK uses litellm on the backend, so you just need to set your API key as an environment variable.
import os
# Set your third-party model API key
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'
# Run inference on an OpenAI model
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)
# Evaluate the response using Vertex AI metrics
eval_result = client.evals.evaluate(
    dataset=gpt_response,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
eval_result.show()
4. Flexible & Powerful Metrics
The SDK offers a library of pre-built metrics (TEXT_QUALITY, SAFETY, etc.) and makes it easy to create your own.
- LLM-based Metrics: Define custom metrics for nuanced criteria like style or creativity using the LLMMetric class.
- Computation-based Metrics: Use predefined metrics like exact_match, bleu, and rouge_1.
- Custom Functions: Pass your own Python function for complete control over evaluation logic.
Here’s how to create a custom LLM-based metric:
# Define a custom metric for language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "3": "Excellent: Very simple, ideal for a 5-year-old.",
            "2": "Fair: Mix of simple and complex; may be challenging.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)
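A custom function metric, by contrast, is just ordinary Python. As a hedged sketch (the exact callback signature the SDK expects is an assumption here, not something confirmed above), a scorer might take one dataset row and return a numeric score:

```python
def conciseness_score(instance: dict) -> float:
    """Hypothetical custom scorer that rewards concise responses.

    NOTE: the argument shape (a dict with a 'response' key) is an
    assumption about the SDK's custom-function interface.
    """
    response = instance.get("response", "")
    n_words = len(response.split())
    # Full score for responses under 50 words, decaying linearly to 0 at 200.
    if n_words <= 50:
        return 1.0
    return max(0.0, 1.0 - (n_words - 50) / 150)

# Quick local check on a sample row
print(conciseness_score({"response": "AI learns patterns from data."}))  # 1.0
```

Because the function runs client-side, you can unit-test it locally before wiring it into an evaluation run.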
5. Asynchronous Batch Evaluation
For large datasets, you can use the batch_evaluate() method to run evaluation as a long-running, asynchronous job. This is perfect for when you don’t need immediate results and want to offload the computation.
bucket_uri = "gs://your-gcs-bucket/batch_eval_results/"
# Run inference and save results to GCS
inference_result_saved = client.evals.run_inference(
    model="gemini-2.5-flash",
    src="gs://path/to/your/prompts.jsonl",
    config={'dest': bucket_uri}
)
# Start the batch evaluation job
batch_eval_job = client.evals.batch_evaluate(
    dataset=inference_result_saved,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY],
    dest=bucket_uri
)
6. Rich In-Notebook Visualization
Use the .show() method on EvaluationDataset and EvaluationResult objects to render detailed, interactive HTML reports directly in your Colab or Jupyter notebooks.
[Image: Inference results example]
[Image: Evaluation report example]
Try It Out!
We’ve prepared a Colab notebook to help you get started.
You can also find more detailed information in the official documentation.
We Want Your Feedback!
This is your chance to influence the future of model evaluation on Vertex AI. We’d love to hear about your experience with the new SDK.
- What do you like?
- What’s missing?
- Did you run into any issues?
- How can we make it better for your use cases?
Please share your thoughts, questions, and feedback in the discussion below.