Introducing the Experimental Vertex AI Gen AI Eval SDK - We are looking for your feedback!

Hello there!

We’re thrilled to introduce an experimental version of the new Vertex AI Gen AI Eval SDK for Python!

The Vertex AI Gen AI Eval SDK for Python is a client-side framework for evaluating generative AI models and applications. This new release introduces a more intuitive workflow, expanded model support, and flexible data handling.

This is an Experimental release. While it isn’t intended for production use, it’s an opportunity for you to try a new approach and for us to gather your valuable feedback.

With that said, let’s see what’s new!


:sparkles: What’s New?

The new SDK simplifies the evaluation process with a two-step workflow and introduces several new features.

1. Simplified Two-Step Evaluation

The core of the new SDK is a simple two-step process: run_inference() to generate responses and evaluate() to compute metrics.

import pandas as pd
from vertexai import Client, types

# Initialize client
client = Client(project="your-project-id", location="us-central1")

# Create your dataset
prompts_df = pd.DataFrame({"prompt": ["How does AI work?"]})

# Run inference to get responses
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df
)

# Evaluate the responses
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

# Visualize results directly in your notebook
eval_result.show()

2. Side-by-Side Model Comparison

You can now easily compare multiple models in a single run. Just pass a list of inference results to the evaluate() method to get a side-by-side comparison, complete with win-rate calculations.

# Run inference on two different models
candidate_1 = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
candidate_2 = client.evals.run_inference(model="gemini-2.5-pro", src=prompts_df)

# Compare the results
comparison_result = client.evals.evaluate(
    dataset=[candidate_1, candidate_2],
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

comparison_result.show()
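For intuition, the win rates in a side-by-side report come down to simple counting over pairwise preference judgments. Here is a minimal standalone sketch, independent of the SDK, using made-up judgment data (the SDK's own judging is LLM-based and more involved):

```python
# Hypothetical pairwise judgments: which candidate was preferred for each prompt.
judgments = ["candidate_1", "candidate_2", "candidate_2", "tie", "candidate_2"]

def win_rate(judgments, candidate):
    """Fraction of non-tie comparisons won by `candidate`."""
    decided = [j for j in judgments if j != "tie"]
    if not decided:
        return 0.0
    return sum(j == candidate for j in decided) / len(decided)

print(win_rate(judgments, "candidate_2"))  # 0.75
```

Ties are excluded from the denominator here; other conventions (e.g., counting a tie as half a win) are equally valid, so check the report's definition before comparing numbers across tools.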

3. Native Third-Party Model Support

Evaluate and compare models from providers like OpenAI directly within the SDK. The SDK uses litellm on the backend, so you just need to set your API key as an environment variable.

import os

# Set your third-party model API key
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'

# Run inference on an OpenAI model
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# Evaluate the response using Vertex AI metrics
eval_result = client.evals.evaluate(
    dataset=gpt_response,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

eval_result.show()

4. Flexible & Powerful Metrics

The SDK offers a library of pre-built metrics (TEXT_QUALITY, SAFETY, etc.) and makes it easy to create your own.

  1. LLM-based Metrics: Define custom metrics for nuanced criteria like style or creativity using the LLMMetric class.
  2. Computation-based Metrics: Use predefined metrics like exact_match, bleu, and rouge_1.
  3. Custom Functions: Pass your own Python function for complete control over evaluation logic.
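To make the computation-based category concrete, here is what an exact-match style metric does, written as plain Python. This is an illustrative sketch, not the SDK's implementation:

```python
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 if the response equals the reference after trimming whitespace, else 0.0."""
    return float(response.strip() == reference.strip())

def mean_score(pairs):
    """Aggregate per-row scores into a dataset-level mean, as eval frameworks typically do."""
    scores = [exact_match(response, reference) for response, reference in pairs]
    return sum(scores) / len(scores)

pairs = [("Paris", "Paris"), ("paris", "Paris")]
print(mean_score(pairs))  # 0.5
```

Metrics like bleu and rouge_1 follow the same shape: a per-row score function plus a dataset-level aggregation, just with n-gram overlap instead of string equality.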

Here’s how to create a custom LLM-based metric:

# Define a custom metric for language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "3": "Excellent: Very simple, ideal for a 5-year-old.",
            "2": "Fair: Mix of simple and complex; may be challenging.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)

5. Asynchronous Batch Evaluation

For large datasets, you can use the batch_evaluate() method to run evaluation as a long-running, asynchronous job. This is ideal when you don’t need immediate results and want to offload the computation.

bucket_uri = "gs://your-gcs-bucket/batch_eval_results/"

# Run inference and save results to GCS
inference_result_saved = client.evals.run_inference(
    model="gemini-2.5-flash",
    src="gs://path/to/your/prompts.jsonl",
    config={'dest': bucket_uri}
)

# Start the batch evaluation job
batch_eval_job = client.evals.batch_evaluate(
    dataset=inference_result_saved,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY],
    dest=bucket_uri
)
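The JSONL source above is assumed to follow the same shape as the DataFrame example: one JSON object per line with a "prompt" field. A hypothetical prompts file could be written like this (the exact schema accepted by the SDK may support additional fields):

```python
import json

# Hypothetical prompts, one JSON object per line, matching the "prompt" column used earlier.
prompts = [
    {"prompt": "How does AI work?"},
    {"prompt": "Explain transformers simply."},
]

with open("prompts.jsonl", "w") as f:
    for row in prompts:
        f.write(json.dumps(row) + "\n")
```

You would then upload the file to Cloud Storage (e.g., with gsutil cp) and pass its gs:// URI as src.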

6. Rich In-Notebook Visualization

Use the .show() method on EvaluationDataset and EvaluationResult objects to render detailed, interactive HTML reports directly in your Colab or Jupyter notebooks.

[Image: inference results example]

[Image: evaluation report example]

:hammer_and_wrench: Try It Out!

We’ve prepared a Colab notebook to help you get started.

You can also find more detailed information in the official documentation.

:speech_balloon: We Want Your Feedback!

This is your chance to influence the future of model evaluation on Vertex AI. We’d love to hear about your experience with the new SDK.

  • What do you like?
  • What’s missing?
  • Did you run into any issues?
  • How can we make it better for your use cases?

Please share your thoughts, questions, and feedback in the discussion below.


"I apologize for writing in Japanese, but this is my first time participating. I’m quite struck by this post. I’m a freelance engineer, and for various reasons I was commissioned by my current client to develop an internal chatbot (specifically, an agent model); it’s been nearly four months. This post seems highly relevant to my own development work, so it caught my interest.

What need inspired this idea?

The thing is, my client is a heavy ChatGPT user. I actually built a process that combines the Vertex AI API and the Gemini API, including granting Workspace access permissions so it can execute and complete tasks. However, its appeal and power didn’t resonate with the client, who isn’t a developer.

Since I have to meet my client’s requests, I’m now in a difficult position: they are asking me to develop a system that accesses the database via the OpenAI API to respond to their requests."