The AI Litmus Test 2.0: Scientifically evaluating GenAI Swarms with Agent-as-a-Judge & Vertex AI

aniketagrawal · June 11, 2026, 2:58pm

The “vibes check” is a recipe for enterprise failure

In our foundational piece, The AI Litmus Test: Scientifically Evaluating GenAI Playbook Agents, we established a critical ground rule for enterprise AI: the anecdotal “vibes check”—chatting with a bot manually to see if it “feels right”—is entirely inadequate for production software. When stakeholders or Chief Information Security Officers (CISOs) ask for proof that a bot won’t hallucinate financial figures, we must provide programmatic, automated metrics.

While our previous work focused on deterministic evaluation for Dialogflow CX, the landscape has radically shifted. We are no longer just building state-machine chatbots; we are orchestrating autonomous Multi-Agent Swarms using Vertex AI and the Google Agent Development Kit (ADK).

When an agent is dynamically traversing a Neo4j Graph Database, triggering real-time European Central Bank APIs, and parsing live weather telemetry, traditional evaluation methods (like checking for exact string matches) completely break down. We must shift from evaluating static text to evaluating dynamic trajectories.

In this comprehensive playbook, we will deconstruct the evolution of AI evaluation, establish the mathematical pillars of agentic quality, and provide concrete Python implementations using both Custom Agent-as-a-Judge paradigms and the scalable Vertex AI Gen AI Evaluation SDK.

Agentic evaluation architecture: CI/CD Flow Diagram

This diagram illustrates the complete end-to-end lifecycle of our “Zero-Mock” Tourism Swarm, demonstrating how execution traces are intercepted and routed through our dual-evaluation engines (Agent-as-a-Judge and Vertex AI Eval SDK) before hitting the deployment gatekeeper.

Architectural highlights:

The Interceptor (Phase 1): Notice how the Trajectory Tracker sits between the Worker Swarm and the Evaluation Layer, ensuring we evaluate the actual tool payload (glass-box), not just the final text (black-box).
The Math Tool (Phase 3): The dashed line connecting the Judge to the Programmatic Math Tool represents the crucial difference between passive LLM evaluation and active Agentic evaluation. The Judge offloads the math so it doesn’t hallucinate the scores.
The Gatekeeper (Phase 4): Both evaluation methods ultimately feed into a CI/CD threshold gate, mathematically blocking rogue or inefficient prompt updates from ever reaching the production deployment.

Part 1: The evolution of evaluation paradigms

Evaluating deterministic software is binary. Evaluating a standard LLM relies on continuous semantic metrics. Evaluating an autonomous Agent, however, requires inspecting the trajectory—the step-by-step sequence of tool calls, API inputs, and intermediate logic.

The industry has rapidly evolved through three distinct evaluation paradigms:

Table 1: Comparing evaluation paradigms

Feature	1. Deterministic Scripting (Legacy)	2. LLM-as-a-Judge (Passive)	3. Agent-as-a-Judge (Active)
Core Mechanism	Python `assert` statements and Regex matches over final text.	Passes prompt + final text to an LLM (e.g., Gemini Flash) with a grading rubric.	Deploys a tool-wielding agent (e.g., Gemini Pro) to mathematically audit trajectory JSON logs.
Visibility	Surface Level: Only checks the final string payload.	Low: Blind to intermediate API steps. Judges solely on semantic plausibility.	Glass-Box: Parses the exact Directed Acyclic Graph (DAG) of the swarm’s execution.
Math & Logic Accuracy	100%: Code executes perfectly.	~65%: LLMs struggle to count array lengths or divide steps “in their head,” leading to hallucinated grades.	100%: The Judge is equipped with programmatic Python tools to calculate metrics deterministically.
Best Used For…	Hardcoded constraints (e.g., “Must contain Escalation URL”).	Standard conversational RAG, tone, fluency, and safety checks.	Multi-hop autonomous swarms, verifying API payloads, and penalizing redundant loops.

Part 2: The foundation - Golden Datasets & BYOD

In LLMOps, your evaluation is only as good as the data you run it against. To evaluate an agentic trajectory, we must maintain strict datasets mapping the user prompt to the exact sequence of tools the agent should have called.

Table 2: Dataset curation strategies

Strategy	Definition	Primary LLMOps Value	Weakness
1. Ad-Hoc / Sandbox	Manually typing queries into a terminal or chat UI.	Rapid prototyping and vibe-checking during initial development.	Unscalable, unrepeatable, and provides zero proof of regression safety.
2. Golden Datasets	Curated JSON/CSV of edge-case queries, mapped to their Expected Trajectories.	The CI/CD Gatekeeper: Used in GitHub Actions to block broken prompt logic from deploying.	Static; requires manual curation by Prompt Engineers.
3. BYOD (Bring Your Own Dataset)	Ingesting real execution logs (`predicted_trajectory`) directly from production alongside the expected `reference_trajectory`.	Drift Detection: Batch evaluating real-world interactions over time to detect semantic drift or API degradation.	Requires robust OpenTelemetry tracking to capture DAGs properly.
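
As a rough illustration of the BYOD strategy, the sketch below turns logged production steps into one dataset row. It assumes each logged step has the shape `{"tool", "inputs", "output"}` (the shape the TrajectoryTracker in Part 4 records) and maps it to the `tool_name` / `tool_input` keys used in the dataset rows. The helper names `to_predicted_trajectory` and `build_byod_row` are hypothetical, not part of any SDK.

```python
def to_predicted_trajectory(history: list[dict]) -> list[dict]:
    """Map logged steps ({"tool", "inputs", "output"}) to dataset-style steps."""
    return [{"tool_name": step["tool"], "tool_input": step["inputs"]} for step in history]


def build_byod_row(prompt: str, history: list[dict],
                   reference_trajectory: list[dict], response: str) -> dict:
    """Assemble one BYOD row pairing what the agent did with what it should have done."""
    return {
        "prompt": prompt,
        "reference_trajectory": reference_trajectory,
        "predicted_trajectory": to_predicted_trajectory(history),
        "response": response,
    }


# Illustrative logged step (hand-written values, not real production data)
history = [
    {"tool": "get_live_weather", "inputs": {"city": "Paris"}, "output": "18C"},
]
row = build_byod_row(
    prompt="Get live weather for Paris.",
    history=history,
    reference_trajectory=[{"tool_name": "get_live_weather", "tool_input": {"city": "Paris"}}],
    response="It is 18C in Paris.",
)
```

In this hand-written example the predicted and reference trajectories come out identical, which is the case a trajectory match metric would score as a pass.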

Example of a BYOD / Golden Dataset Row

{
  "prompt": "Get the train ticket price from Tokyo to Kyoto and convert my $800 budget.",
  "reference_trajectory": [
    {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train price"}},
    {"tool_name": "get_country_demographics", "tool_input": {"country": "Japan"}},
    {"tool_name": "live_currency_conversion", "tool_input": {"amount": 800, "from": "USD", "to": "JPY"}}
  ]
}
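
Because a malformed row silently skews every downstream metric, it can help to check rows before they enter the pipeline. The following is a minimal sketch, assuming rows use the `prompt`, `reference_trajectory`, `tool_name`, and `tool_input` keys shown above; `validate_row` is a hypothetical helper, and the rules it enforces are our own assumptions rather than an SDK requirement.

```python
import json


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one dataset row (empty list means valid)."""
    problems = []
    if not isinstance(row.get("prompt"), str) or not row["prompt"].strip():
        problems.append("prompt must be a non-empty string")
    trajectory = row.get("reference_trajectory")
    if not isinstance(trajectory, list) or not trajectory:
        problems.append("reference_trajectory must be a non-empty list")
        return problems
    for i, step in enumerate(trajectory):
        if not isinstance(step, dict) or not isinstance(step.get("tool_name"), str):
            problems.append(f"step {i}: missing string 'tool_name'")
        elif not isinstance(step.get("tool_input", {}), dict):
            problems.append(f"step {i}: 'tool_input' must be an object")
    return problems


row = json.loads("""
{
  "prompt": "Get the train ticket price from Tokyo to Kyoto and convert my $800 budget.",
  "reference_trajectory": [
    {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train price"}},
    {"tool_name": "get_country_demographics", "tool_input": {"country": "Japan"}},
    {"tool_name": "live_currency_conversion", "tool_input": {"amount": 800, "from": "USD", "to": "JPY"}}
  ]
}
""")
problems = validate_row(row)
bad_problems = validate_row({"prompt": "", "reference_trajectory": []})
```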

Part 3: The 4 pillars of agentic quality (metrics)

To prevent our evaluators from relying purely on semantic guessing, we ground our evaluation in strict mathematical formulas across four pillars.

Table 3: The agentic quality matrix

Evaluation Pillar	The Core Question	Evaluation Methodology
1. Trajectory Accuracy	Did the agent hit the required API nodes in the DAG?	Vertex AI Eval: Checks `trajectory_exact_match` and `trajectory_any_order_match`.
2. Reasoning Faithfulness	Did the final answer strictly use the data returned by the tools?	Agent-as-a-Judge: An evaluator agent cross-references the raw tool payload against the final text to detect manufactured numbers.
3. Step Efficiency (SE)	Did the agent avoid redundant loops and API calls?	Mathematical: Ratio of optimal steps to actual steps taken, computed via external tools.
4. Task Completion (TC)	Did the agent resolve all sub-constraints of the prompt?	Model-Based: Fractional accuracy based on multi-intent resolution.
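
To make the first pillar concrete, here is a plain-Python sketch of the two comparisons that the metric names suggest: an exact match requires the same tool calls in the same order, while an any-order match only requires that every reference call appears somewhere in the predicted trajectory. This is a local illustration under those assumptions, not the Vertex AI implementation, and the function names are hypothetical.

```python
import json
from collections import Counter


def step_key(step: dict) -> str:
    """Canonical string for one tool call, so dict key order does not matter."""
    return json.dumps(
        {"tool_name": step.get("tool_name"), "tool_input": step.get("tool_input", {})},
        sort_keys=True,
    )


def exact_match(predicted: list[dict], reference: list[dict]) -> float:
    """1.0 only if both trajectories contain the same calls in the same order."""
    return float([step_key(s) for s in predicted] == [step_key(s) for s in reference])


def any_order_match(predicted: list[dict], reference: list[dict]) -> float:
    """1.0 if every reference call appears in the prediction, regardless of order."""
    missing = Counter(map(step_key, reference)) - Counter(map(step_key, predicted))
    return float(not missing)


weather = {"tool_name": "get_live_weather", "tool_input": {"city": "Paris"}}
translate = {"tool_name": "translate_text", "tool_input": {"text": "Hello"}}

reference = [weather, translate]
swapped = [translate, weather]  # right tools, wrong order
```

With these inputs, `swapped` fails the exact match but passes the any-order match, which is why the two metrics are usually reported side by side.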

The mathematics of agentic action

1. Task Completion Rate (TC)

$$TC = \frac{\sum_{i=1}^{N} C_i}{N}$$

where $N$ is the total number of explicit constraints in the prompt, and $C_i \in \{0, 1\}$ indicates whether constraint $i$ was answered with verified API data.
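
As a quick worked example of the formula, the sketch below computes TC from a list of per-constraint results. The function name `task_completion` is hypothetical; deciding whether each constraint was actually met is left to the evaluator.

```python
def task_completion(constraints_met: list[bool]) -> float:
    """TC = (number of constraints met) / (total number of constraints)."""
    if not constraints_met:
        raise ValueError("at least one constraint is required")
    return sum(constraints_met) / len(constraints_met)


# "Get the train ticket price ... and convert my $800 budget" has two constraints.
# Suppose the agent returned a verified price but skipped the conversion.
tc_partial = task_completion([True, False])
tc_full = task_completion([True, True])
```

Here `tc_partial` is 0.5 and `tc_full` is 1.0, so a multi-intent prompt that is only half answered is scored as such instead of being graded on how plausible the text sounds.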

2. Step Efficiency (SE)

$$SE = \frac{U_{\text{optimal}}}{U_{\text{actual}} + R_{\text{redundant}}}$$

where $U_{\text{optimal}}$ is the length of the theoretically shortest execution path, $U_{\text{actual}}$ is the number of tools invoked, and $R_{\text{redundant}}$ is the number of duplicate executions.
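
A short worked example may help here. The sketch below evaluates the SE formula directly; `step_efficiency` is a hypothetical name, and we assume the three counts have already been extracted from the trajectory log.

```python
def step_efficiency(optimal_steps: int, actual_steps: int, redundant_steps: int = 0) -> float:
    """SE = U_optimal / (U_actual + R_redundant)."""
    if actual_steps <= 0:
        raise ValueError("actual_steps must be positive")
    return optimal_steps / (actual_steps + redundant_steps)


# Optimal path is 3 calls; the agent made 5 calls, 2 of which were duplicates.
se_looping = round(step_efficiency(3, 5, 2), 3)

# The agent took exactly the optimal path.
se_optimal = step_efficiency(3, 3, 0)
```

The looping run scores 3 / 7, about 0.429, while the optimal run scores 1.0. Note that, as the formula is written, a duplicate call is counted both in the actual steps and in the redundant term, so loops are penalized more heavily than merely taking a longer path.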

Part 4: Implementation Method A - “Agent-as-a-Judge”

This implementation builds an autonomous auditor. We arm Gemini 2.5 Pro with a custom Python function so it can calculate Step Efficiency deterministically, rather than hallucinating the math.

import json
from google.adk.agents import Agent

# 1. The Interceptor (Attached to the Worker Swarm)
class TrajectoryTracker:
    def __init__(self):
        self.history = []
    def log_step(self, tool_name, inputs, output):
        self.history.append({"tool": tool_name, "inputs": inputs, "output": str(output)})

# 2. The Judge's Programmatic Math Tool
def compute_trajectory_efficiency(target_optimal_steps: int, raw_trajectory_json: str) -> str:
    """Tool utilized by the Judge Agent to compute Step Efficiency (SE) deterministically."""
    try:
        logs = json.loads(raw_trajectory_json)
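        actual_steps = len(logs)

        # Count redundant calls: repeats of an identical (tool, inputs) pair
        seen = set()
        redundant_steps = 0
        for step in logs:
            key = (step["tool"], json.dumps(step.get("inputs"), sort_keys=True))
            if key in seen:
                redundant_steps += 1
            seen.add(key)

        # SE = U_optimal / (U_actual + R_redundant), capped at 1.0
        se_score = min(1.0, float(target_optimal_steps) / max(1, actual_steps + redundant_steps))
        tools_used = sorted({step["tool"] for step in logs})

        metrics_packet = {
            "mathematical_step_efficiency": round(se_score, 3),
            "total_invocations_observed": actual_steps,
            "redundant_invocations": redundant_steps,
            "unique_tools_accessed": tools_used,
            "contains_redundant_loops": redundant_steps > 0
        }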
        return json.dumps(metrics_packet)
    except Exception as e:
        return f"Error in mathematical parsing tool: {str(e)}"

# 3. The Agentic Judge (Gemini 2.5 Pro)
trajectory_judge_agent = Agent(
    name="TrajectoryAuditorJudge",
    model="gemini-2.5-pro",
    description="Autonomous evaluation agent that calculates path quality and response faithfulness.",
    instruction="""
    You are an autonomous Agent-as-a-Judge system.
    1. Call 'compute_trajectory_efficiency' passing the optimal step count and the raw trajectory logs (as a JSON string) to extract exact path efficiency statistics.
    2. Check Reasoning Faithfulness: Verify that values in the Final Response perfectly match the tool outputs without hallucination.
    3. Output a structured JSON scorecard.
    """,
    tools=[compute_trajectory_efficiency]
)

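Why this works: The Judge Agent receives the worker’s logged trajectory as a JSON array and calls its math tool on it, so Step Efficiency is computed by code instead of being estimated by the model. The faithfulness check in step 2 is still a model judgment, so it is worth spot-checking against the raw tool outputs.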
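
One way to attach the interceptor to worker tools is a decorator that logs every call. The sketch below is independent of ADK and makes several assumptions: `tracked` is a hypothetical helper, the tool is a stub with a made-up fixed exchange rate instead of a live API call, and tools are invoked with keyword arguments only.

```python
import functools
import json


class TrajectoryTracker:
    """Same shape as the tracker above: one dict per tool call."""
    def __init__(self):
        self.history = []

    def log_step(self, tool_name, inputs, output):
        self.history.append({"tool": tool_name, "inputs": inputs, "output": str(output)})


def tracked(tracker: TrajectoryTracker):
    """Hypothetical decorator: record each keyword-argument call on the tracker."""
    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(**kwargs):
            output = tool_fn(**kwargs)
            tracker.log_step(tool_fn.__name__, kwargs, output)
            return output
        return wrapper
    return decorator


tracker = TrajectoryTracker()


@tracked(tracker)
def live_currency_conversion(amount: float, from_currency: str, to_currency: str) -> float:
    # Stub: fixed illustrative rate, not a real market value or API call
    return amount * 160.0


live_currency_conversion(amount=800, from_currency="USD", to_currency="JPY")

# JSON string in the form the Judge's math tool expects
raw_trajectory_json = json.dumps(tracker.history)
```

The resulting `raw_trajectory_json` can be passed to `compute_trajectory_efficiency` together with the optimal step count for that prompt.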

Part 5: Implementation Method B - Vertex AI Gen AI Evaluation SDK

While building a custom Agent-as-a-Judge is incredible for deep, granular debugging, enterprise teams managing massive Golden Datasets in CI/CD pipelines need scalable, asynchronous batch infrastructure.

As we discussed in our original AI Litmus Test blog comparing dfcx-scrapi with standard SDKs, choosing the right tool for bulk evaluation is critical. The Vertex AI Evaluation SDK acts similarly to dfcx-scrapi—allowing you to evaluate thousands of rows from a Pandas DataFrame simultaneously, measuring custom Pointwise metrics and Trajectory matches.

import pandas as pd
import json
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import PointwiseMetric, PointwiseMetricPromptTemplate

# 1. Define the BYOD / Golden Dataset DataFrame
eval_data = {
    "prompt": [
        "Get the train price to Kyoto and convert my $800 USD budget.",
        "Get live weather for Paris and translate 'Hello'."
    ],
    "reference_trajectory": [
        json.dumps([
            {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train"}}, 
            {"tool_name": "live_currency_conversion"}
        ]),
        json.dumps([
            {"tool_name": "get_live_weather"},
            {"tool_name": "translate_text"}
        ])
    ],
    "predicted_trajectory": [  # Captured from our TrajectoryTracker in Production
        json.dumps([
            {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train"}}, 
            {"tool_name": "live_currency_conversion"}
        ]),
        json.dumps([
            {"tool_name": "get_live_weather"},
            {"tool_name": "translate_text"}
        ])
    ],
    "response": [
        "The ticket is 100 USD. Your budget is 128,000 JPY.",
        "It is 18C in Paris. Translation: Bonjour."
    ]
}
eval_df = pd.DataFrame(eval_data)

# 2. Define a Custom Pointwise Metric (LLM-as-a-Judge Rubric)
criteria = {
    "Follows trajectory": (
        "Evaluate whether the agent's response logically follows from the sequence of tool actions.\n"
        "  - Does the response accurately reflect the data gathered from the tools without hallucination?\n"
        "  - Are there any illogical jumps in reasoning?"
    )
}
pointwise_rating_rubric = {
    "1": "Response perfectly reflects information gathered in the trajectory.",
    "0": "Response contains hallucinated data or illogical jumps."
}

trajectory_faithfulness_prompt = PointwiseMetricPromptTemplate(
    criteria=criteria,
    rating_rubric=pointwise_rating_rubric,
    input_variables=["prompt", "predicted_trajectory"]
)
faithfulness_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=trajectory_faithfulness_prompt
)

# 3. Combine with Native SDK Trajectory Metrics
metrics_to_run = [
    "trajectory_exact_match",       # Did it follow the DAG perfectly?
    "trajectory_any_order_match",   # Did it hit all required APIs?
    "safety",                       # Is the text output safe?
    faithfulness_metric             # Custom LLM-as-a-Judge rubric defined above
]

# 4. Execute the Scalable EvalTask
eval_task = EvalTask(
    dataset=eval_df,
    metrics=metrics_to_run,
    experiment="tourism-swarm-eval-pipeline"
)

# Run the evaluation against the BYOD dataset
eval_result = eval_task.evaluate()
print("📊 Final Swarm Evaluation Summary:")
print(eval_result.summary_metrics)
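
To turn the summary into a deployment gate, the pipeline needs a pass/fail decision. The sketch below is one possible shape for that check. It assumes the summary is a flat dict whose aggregate keys follow a `<metric>/mean` pattern, so verify the key names against your own `eval_result.summary_metrics`. The function name, the sample values, and the thresholds are all illustrative.

```python
def check_thresholds(summary_metrics: dict, thresholds: dict) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for key, minimum in thresholds.items():
        value = summary_metrics.get(key)
        if value is None:
            failures.append(f"{key}: missing from summary metrics")
        elif value < minimum:
            failures.append(f"{key}: {value:.3f} < required {minimum:.3f}")
    return failures


# Hand-written stand-in for eval_result.summary_metrics (not real results)
sample_summary = {
    "trajectory_exact_match/mean": 0.5,
    "trajectory_any_order_match/mean": 1.0,
}
thresholds = {
    "trajectory_exact_match/mean": 0.9,
    "trajectory_any_order_match/mean": 0.95,
}

failures = check_thresholds(sample_summary, thresholds)
for message in failures:
    print("GATE FAILED:", message)

# In a CI job, a non-zero exit code blocks the deployment:
#   import sys; sys.exit(1 if failures else 0)
```

Treating a missing metric as a failure is a deliberate choice here: a renamed or dropped metric should stop the pipeline instead of passing silently.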

Part 6: Choosing your deployment strategy

How do you choose between building a custom Agent-as-a-Judge and utilizing the Vertex AI Evaluation SDK?

Table 4: Framework selection guide

Requirement	Custom Agent-as-a-Judge (Method A)	Vertex AI Evaluation SDK (Method B)
Complexity of Validation	High: Can execute Python scripts to validate DB state, write files, or double-check math programmatically.	Medium: Excellent for text analysis and exact structural matching via Prompt Templates, but cannot execute custom code.
Scale & Concurrency	Manual: Requires you to write custom `asyncio` loops to batch process multiple rows.	Automated: Seamlessly handles Pandas DataFrames with thousands of rows concurrently out-of-the-box.
Observability Integration	Requires Setup: Must manually map outputs to your visualization platform.	Native: Integrates instantly with Vertex AI Experiments for visual Radar and Bar charts over time.
Verdict	Use for deep, programmatic validation of complex reasoning chains requiring external tool access to verify truth.	Use for CI/CD pipeline integration, managing large Golden Datasets, and tracking regressions at scale.
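
For Method A, the manual batching mentioned in the table might look like the sketch below. It bounds concurrency with a semaphore so the judge model is not flooded with requests. `judge_row` is a hypothetical stand-in that computes a score locally; in practice it would be replaced by a real Judge Agent invocation, and the row fields used here are assumptions.

```python
import asyncio


async def judge_row(row: dict) -> dict:
    """Hypothetical stand-in for one Judge Agent call."""
    await asyncio.sleep(0)  # placeholder for awaiting a network call
    actual = max(1, len(row["predicted_trajectory"]))
    return {"prompt": row["prompt"], "se": min(1.0, row["optimal_steps"] / actual)}


async def judge_batch(rows: list[dict], max_concurrency: int = 5) -> list[dict]:
    """Judge all rows concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(row: dict) -> dict:
        async with semaphore:
            return await judge_row(row)

    # gather returns results in the same order as the input rows
    return await asyncio.gather(*(bounded(row) for row in rows))


rows = [
    {"prompt": "weather + translate", "optimal_steps": 2,
     "predicted_trajectory": ["get_live_weather", "translate_text"]},
    {"prompt": "weather, with a loop", "optimal_steps": 2,
     "predicted_trajectory": ["get_live_weather"] * 3 + ["translate_text"]},
]
scorecards = asyncio.run(judge_batch(rows))
```

The second row takes four calls for a two-call task, so its score drops to 0.5 while the first row scores 1.0.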

Conclusion: LLMOps and the path to production

The difference between a “cool sandbox demo” and a “production-grade enterprise asset” is rigorous, automated testing.

By implementing an Agent-as-a-Judge framework and combining it with the batch-processing analytics power of the Vertex AI Evaluation SDK, you fundamentally transform your AI development lifecycle:

Establish Trust: You provide programmatic, mathematical proof to your risk teams that your agent does not hallucinate numerical data or get stuck in infinite API loops.
Automated Deployment Gating (CI/CD): If a Prompt Engineer accidentally breaks a reasoning chain in a pull request, the evaluation run will show a drop in SE (from the Judge’s math tool) or in trajectory_exact_match (from the EvalTask), and the threshold gate blocks the deployment.
Enterprise Observability: Integrate these evaluation frameworks natively with OpenTelemetry tracing tools like Langfuse, Arize Phoenix, or Braintrust to continuously monitor the Directed Acyclic Graph (DAG) in production.

Start evaluating scientifically today. The era of the “Vibes Check” is officially over.

References & resources

Previous Blog: The AI Litmus Test: Scientifically Evaluating GenAI Playbook Agents (Agrawal, 2026)
Previous Blog: Building Agentic GraphRAG on Vertex AI: Part 1 - Architecture & development with Neo4j
Source Code: Agent-as-a-Judge Evaluation Notebook (GitHub)
Source Code: Evaluating ADK Agents with SDK Notebook (GitHub)
Research: Judging LLM-as-a-Judge with MT-Bench (Zheng et al., 2023)
Research: RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023)
Research: Agent-as-a-Judge: Evaluate Agents with Agents (Zhuge et al., 2024)
Documentation: Google Cloud Vertex AI Reasoning Engine
Documentation: Neo4j Graph Database Integration with LangChain
Documentation: Google Cloud Vertex AI Gen AI Evaluation Services
APIs Used: Open-Meteo, Frankfurter Forex, RESTCountries

We can’t wait to see what you build. Share your creations and ask questions in the Google Cloud Community. Happy coding!


Note: Should you have any concerns or queries about this post or my implementation, please feel free to connect with me on LinkedIn! Thanks!

Paulo_Leads · June 15, 2026, 1:51pm

Excellent article. The shift from the “vibes check” to scientific evaluation of trajectories is exactly what separates functional automation from scalable infrastructure. The concept of “Step Efficiency (SE)”, measuring the ratio of optimal to actual steps, is critical not only for AI agents but for any automated decision pipeline. In AI-driven B2B prospecting, tracking a lead’s qualification trajectory (which APIs were queried, how many contact attempts were made, the cost per stage) is what makes a mathematically grounded reduction in CAC possible. The Agent-as-a-Judge methodology, with programmatic tools for deterministic calculation, eliminates hallucinated metrics, which is the same logic we apply to audit every sales interaction. Congratulations on the technical depth.
