You have built a powerful Hybrid AI Agent using Vertex AI Playbooks and Dialogflow CX. During local testing, it performs brilliantly. But how do you answer your Chief Information Security Officer (CISO) when they ask: “Can you programmatically prove this agent won’t hallucinate a financial figure in production?”
The anecdotal “vibes check”—chatting with the bot manually to see if it “feels right”—is a recipe for enterprise failure. To deploy Generative AI in regulated industries like banking and healthcare, we must shift from subjective feelings to objective, automated metrics. We need a scientific evaluation framework.
In this guide, we will deconstruct a code-first approach to building an automated test harness. We will define “Golden Datasets,” compare exact-match vs. semantic evaluation techniques, showcase two different Python SDK implementations, and generate a Compliance Scorecard that blocks a hallucinating agent from ever reaching production.
Part 1: Defining “quality” in the age of Generative AI
Evaluating deterministic software is binary (pass/fail). Evaluating Generative AI is continuous (degrees of accuracy). Based on emerging industry standards for Retrieval-Augmented Generation (RAG) and Agent evaluation [^1], quality in a high-stakes environment is composed of four explicit pillars:
| Evaluation Metric | The Core Question | Evaluation Methodology |
|---|---|---|
| 1. Tool Routing Accuracy | Did the agent choose the correct API for the job? | Deterministic: Check the API trace to ensure the requested tool_name matches the expected tool. |
| 2. Data Groundedness | Did the final answer strictly use the data returned by the tool? | Deterministic / Regex: Verify the exact numerical payload (e.g., “5.25%”) appears in the final text. |
| 3. Response Similarity | Is the semantic meaning of the agent’s answer correct, even if phrased differently? | Model-Based (LLM-as-a-Judge): Use an evaluator LLM or Vector Embeddings to score semantic closeness [^2]. |
| 4. Escalation Precision | Did the agent reliably hand off to a human when confidence dropped? | Deterministic: Search for the specific escalation webhook trigger or keyword. |
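To make these pillars concrete, here is a minimal sketch of the three deterministic checks as standalone predicates. The `trace` structure and helper names are illustrative assumptions, not part of any SDK; Pillar 3 is model-based and is covered in Part 4.

import re

def check_tool_routing(trace: list[dict], expected_tool: str) -> bool:
    """Pillar 1: did the execution trace invoke the expected tool?"""
    return any(step.get("tool_name") == expected_tool for step in trace)

def check_groundedness(response_text: str, expected_value: str) -> bool:
    """Pillar 2: does the exact payload (e.g., '5.25%') appear verbatim?"""
    return re.search(re.escape(expected_value), response_text) is not None

def check_escalation(response_text: str, trigger: str = "ESCALATE_TO_HUMAN") -> bool:
    """Pillar 4: did the agent emit the escalation trigger?"""
    return trigger in response_text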
Part 2: The evaluation architecture (LLMOps)
Our testing pipeline relies on a robust CI/CD architecture. A Python-based “Evaluation Harness” executes queries from a “Golden Dataset” against a staging version of our agent, parses the execution trace, and calculates a pass/fail grade before allowing deployment.
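As a minimal sketch of that gating step (the threshold and function name below are illustrative assumptions, not a prescribed standard), the harness can simply exit non-zero so that any CI system treats a failed audit as a failed build:

import sys

# Illustrative: regulated flows typically demand a perfect deterministic score
PASS_THRESHOLD = 100.0

def deployment_gate(score: float) -> None:
    """Exit non-zero on failure so the CI/CD pipeline blocks the deployment."""
    if score < PASS_THRESHOLD:
        print(f"❌ Gate failed: {score:.1f}% < {PASS_THRESHOLD}%")
        sys.exit(1)
    print("✅ Gate passed: agent is eligible for promotion to production.")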
Part 3: The implementation - two ways to test
We will look at two distinct ways to build this test harness in Python. First, we establish our Golden Dataset—a list of highly specific queries and the EXACT data points the agent MUST return to pass the audit.
# The source of truth for both evaluation methods
GOLDEN_DATASET = [
{"query": "What is the current standard mortgage rate?", "expected_value": "5.25"},
{"query": "Can you convert 100 USD to TWD for me?", "expected_value": "3250"},
{"query": "I am very angry and want to close my account.", "expected_value": "ESCALATE_TO_HUMAN"}
]
Method 1: The standard Google Cloud Python SDK
This approach uses the core dialogflowcx_v3beta1 library. It requires no dependencies beyond Google’s official client library, making it the mandated choice for strict, air-gapped enterprise environments [^3].
Prerequisite: pip install google-cloud-dialogflow-cx==1.34.0
import uuid

from google.cloud import dialogflowcx_v3beta1 as dfcx
from google.api_core.client_options import ClientOptions

def run_sdk_audit(project_id, location, agent_id):
    """Runs the Golden Dataset against a staging agent via the core SDK.

    Note: `agent_id` must be the full agent resource name, e.g.
    projects/<project>/locations/<location>/agents/<agent-uuid>,
    because the session path is built by appending to it.
    """
    print("🚀 Running Deterministic Audit via Standard SDK...")

    # Initialize the core Session Client against the regional endpoint
    client_options = ClientOptions(api_endpoint=f"{location}-dialogflow.googleapis.com")
    session_client = dfcx.SessionsClient(client_options=client_options)

    passed_count = 0
    total_count = len(GOLDEN_DATASET)

    for test in GOLDEN_DATASET:
        # Create a unique session per test to prevent context bleeding
        session_path = f"{agent_id}/sessions/{uuid.uuid4()}"

        request = dfcx.DetectIntentRequest(
            session=session_path,
            query_input=dfcx.QueryInput(
                text=dfcx.TextInput(text=test['query']), language_code="en"
            )
        )

        try:
            response = session_client.detect_intent(request=request)

            # Extract and concatenate every text part of every response message
            bot_texts = [
                part
                for msg in response.query_result.response_messages
                for part in msg.text.text
            ]
            full_response = " ".join(bot_texts)

            # Groundedness Check (Deterministic Substring Match)
            if test['expected_value'] in full_response:
                print(f" ✅ PASS: '{test['query']}'")
                passed_count += 1
            else:
                print(f" ❌ FAIL: '{test['query']}' | Expected: {test['expected_value']}")
        except Exception as e:
            print(f" ⚠️ ERROR during execution: {e}")

    score = (passed_count / total_count) * 100
    print(f"\n📊 Final Groundedness Score: {score:.1f}%")
Method 2: The dfcx-scrapi library
dfcx-scrapi (Dialogflow CX Scripting API) is a powerful open-source Python library maintained by Google engineers [^4]. It acts as a high-level wrapper around the core SDK. Because it natively outputs to Pandas DataFrames, it is vastly superior for processing thousands of test cases and exporting compliance reports.
Prerequisite: pip install dfcx-scrapi pandas
import uuid

import pandas as pd
from dfcx_scrapi.core.sessions import Sessions

def run_scrapi_audit(project_id, location, agent_id):
    print("🚀 Running Audit via dfcx-scrapi...")

    # 1. Initialize the Scrapi Sessions client
    sessions_client = Sessions(project_id=project_id, location=location)

    # 2. Convert Golden Dataset to a Pandas DataFrame
    df = pd.DataFrame(GOLDEN_DATASET)
    passed_count = 0

    for index, row in df.iterrows():
        # Unique session per test, as in Method 1
        session_id = str(uuid.uuid4())

        # Scrapi simplifies the API call into a single, clean line
        response = sessions_client.detect_intent(
            agent_id=agent_id,
            session_id=session_id,
            text=row['query']
        )

        # Extract the text from the Scrapi response object
        actual_response = " ".join(response.text_responses) if response.text_responses else ""

        # Groundedness Check
        if row['expected_value'] in actual_response:
            print(f" ✅ PASS: '{row['query']}'")
            passed_count += 1
        else:
            print(f" ❌ FAIL: '{row['query']}' | Expected: {row['expected_value']}")

    score = (passed_count / len(df)) * 100
    print(f"\n📊 Final Groundedness Score: {score:.1f}%")
The architectural decision: Which should you choose?
- Use the Standard SDK (Method 1) if you are building the evaluation into a strictly regulated microservice, or if your infosec policies prohibit third-party open-source libraries.
- Use dfcx-scrapi (Method 2) if you are a QA engineer building a full CI/CD pipeline, managing massive test suites in BigQuery, and need to rapidly generate DataFrame-based analytics reports (see the sketch after this list).
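For the BigQuery-centric workflow mentioned above, the same scorecard DataFrame can be appended to a results table for longitudinal tracking. A sketch assuming the pandas-gbq package and placeholder dataset/table names:

import pandas_gbq

# Requires: pip install pandas-gbq (and BigQuery write access)
pandas_gbq.to_gbq(
    scorecard,
    destination_table="agent_qa.compliance_scorecards",  # placeholder dataset.table
    project_id="my-gcp-project",                         # placeholder
    if_exists="append",
)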
Part 4: The next frontier - Model-Based Evaluation
The scripts above perform Deterministic Evaluation (exact string matching). This is perfect for verifying numerical figures (like interest rates) or hardcoded escalation keywords.
However, Generative AI is fluid. What if the expected answer is “Please provide your driver’s license,” but the agent says “I need to see your state-issued ID”? A deterministic script fails this, even though it is semantically correct.
To solve this at scale, enterprise teams are adopting LLM-as-a-Judge frameworks [^2] [^5]. By using services like the Vertex AI Evaluation API, you pass the User Query, the Agent’s Response, and the Golden Answer to an evaluator LLM (like Gemini 1.5 Pro). The evaluator assigns a floating-point score for Relevance, Fluency, and Safety.
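To illustrate the pattern, here is a hand-rolled sketch of an LLM-as-a-Judge call; this is not the managed Vertex AI Evaluation API, and the judge prompt and 1-5 scale are assumptions:

from vertexai.generative_models import GenerativeModel

# Assumes vertexai.init(project=..., location=...) has been called
JUDGE_PROMPT = """You are a strict evaluator. Score from 1 to 5 how well the
AGENT ANSWER conveys the same meaning as the GOLDEN ANSWER for the USER QUERY.
Reply with the number only.

USER QUERY: {query}
AGENT ANSWER: {agent_answer}
GOLDEN ANSWER: {golden_answer}"""

def judge_semantic_match(query: str, agent_answer: str, golden_answer: str) -> float:
    """Asks an evaluator LLM to grade semantic closeness on a 1-5 scale."""
    judge = GenerativeModel("gemini-1.5-pro")
    prompt = JUDGE_PROMPT.format(
        query=query, agent_answer=agent_answer, golden_answer=golden_answer
    )
    return float(judge.generate_content(prompt).text.strip())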
A robust pipeline uses Deterministic scripts (like above) for compliance/numbers, and Model-Based Evaluators for conversational fluency.
Conclusion: Continuous AI Validation
By implementing a scientific evaluation pipeline, you fundamentally transform your AI development lifecycle:
- Establish trust: You provide mathematical proof to risk teams that your agent does not hallucinate critical data.
- Enable safe CI/CD: If a developer alters a playbook prompt and accidentally breaks the reasoning chain, this script, running as a deployment gate, will fail and automatically block the broken agent from reaching production [^6].
- Accelerate innovation: With an automated safety net in place, Prompt Engineers can iterate rapidly, without fear of catastrophic regressions.
The difference between a “cool demo” and an “enterprise asset” is rigorous testing. Start evaluating scientifically today.
References:
[^1]: Es, S., et al. (2023). “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” arXiv preprint. (Foundational framework for evaluating faithfulness and answer relevance in generative pipelines).
[^2]: Zheng, L., et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” arXiv preprint. (Definitive academic research on using strong LLMs to evaluate other LLMs).
[^3]: Google Cloud. “Dialogflow CX Python Client Library.” API Reference for the v3beta1 core SDK.
[^4]: Google Cloud Platform GitHub. “dfcx-scrapi.” The open-source Python library for high-level Dialogflow CX agent management and testing.
[^5]: Google Cloud. “Vertex AI Evaluation Services.” Official documentation on using Google’s managed AutoMetrics and LLM-based evaluators.
[^6]: Google Cloud Architecture Center. “CI/CD for Conversational AI.” Best practices for incorporating test automation into an MLOps deployment pipeline.
