The AI Litmus Test 2.0: Scientifically evaluating GenAI Swarms with Agent-as-a-Judge & Vertex AI

:stop_sign: The “vibes check” is a recipe for enterprise failure

In our foundational piece, The AI Litmus Test: Scientifically Evaluating GenAI Playbook Agents, we established a critical ground rule for enterprise AI: the anecdotal “vibes check”—chatting with a bot manually to see if it “feels right”—is entirely inadequate for production software. When stakeholders or Chief Information Security Officers (CISOs) ask for proof that a bot won’t hallucinate financial figures, we must provide programmatic, automated metrics.

While our previous work focused on deterministic evaluation for Dialogflow CX, the landscape has radically shifted. We are no longer just building state-machine chatbots; we are orchestrating autonomous Multi-Agent Swarms using Vertex AI and the Google Agent Development Kit (ADK).

When an agent is dynamically traversing a Neo4j Graph Database, triggering real-time European Central Bank APIs, and parsing live weather telemetry, traditional evaluation methods (like checking for exact string matches) completely break down. We must shift from evaluating static text to evaluating dynamic trajectories.

In this comprehensive playbook, we will deconstruct the evolution of AI evaluation, establish the mathematical pillars of agentic quality, and provide concrete Python implementations using both Custom Agent-as-a-Judge paradigms and the scalable Vertex AI Gen AI Evaluation SDK.

:world_map: Agentic evaluation architecture: CI/CD Flow Diagram

This diagram illustrates the complete end-to-end lifecycle of our “Zero-Mock” Tourism Swarm, demonstrating how execution traces are intercepted and routed through our dual-evaluation engines (Agent-as-a-Judge and Vertex AI Eval SDK) before hitting the deployment gatekeeper.

:magnifying_glass_tilted_left: Architectural highlights:

  1. The Interceptor (Phase 1): Notice how the Trajectory Tracker sits between the Worker Swarm and the Evaluation Layer, ensuring we evaluate the actual tool payload (glass-box), not just the final text (black-box).
  2. The Math Tool (Phase 3): The dashed line connecting the Judge to the Programmatic Math Tool represents the crucial difference between passive LLM evaluation and active Agentic evaluation. The Judge offloads the math so it doesn’t hallucinate the scores.
  3. The Gatekeeper (Phase 4): Both evaluation methods ultimately feed into a CI/CD threshold gate, mathematically blocking rogue or inefficient prompt updates from ever reaching the production deployment.
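
The Phase 4 gatekeeper can be sketched as a small threshold check. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the metric keys and threshold values are hypothetical, and a real pipeline would read them from its own evaluation output.

```python
from typing import Dict, List, Optional

# Hypothetical metric keys and minimum scores; tune these per project.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "trajectory_exact_match/mean": 0.90,
    "step_efficiency/mean": 0.80,
}

def gate(summary_metrics: Dict[str, float],
         thresholds: Optional[Dict[str, float]] = None) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    failures = []
    for name, minimum in thresholds.items():
        value = summary_metrics.get(name)
        if value is None:
            # A missing metric is treated as a failure, not a silent pass.
            failures.append(f"{name}: missing from evaluation output")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below the required {minimum:.3f}")
    return failures

# Example numbers only, standing in for a real evaluation run.
example_run = {"trajectory_exact_match/mean": 0.95, "step_efficiency/mean": 0.72}
gate_failures = gate(example_run)
```

In a CI job, a non-empty `gate_failures` list would translate into a non-zero exit code so the deployment step never runs.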

:microscope: Part 1: The evolution of evaluation paradigms

Evaluating deterministic software is binary. Evaluating a standard LLM relies on continuous semantic metrics. Evaluating an autonomous Agent, however, requires inspecting the trajectory—the step-by-step sequence of tool calls, API inputs, and intermediate logic.

The industry has rapidly evolved through three distinct evaluation paradigms:

Table 1: Comparing evaluation paradigms

| Feature | 1. Deterministic Scripting (Legacy) | 2. LLM-as-a-Judge (Passive) | 3. Agent-as-a-Judge (Active) |
| --- | --- | --- | --- |
| Core Mechanism | Python assert statements and Regex matches over final text. | Passes prompt + final text to an LLM (e.g., Gemini Flash) with a grading rubric. | Deploys a tool-wielding agent (e.g., Gemini Pro) to mathematically audit trajectory JSON logs. |
| Visibility | Surface Level: Only checks the final string payload. | Low: Blind to intermediate API steps. Judges solely on semantic plausibility. | Glass-Box: Parses the exact Directed Acyclic Graph (DAG) of the swarm’s execution. |
| Math & Logic Accuracy | 100%: Code executes perfectly. | ~65%: LLMs struggle to count array lengths or divide steps “in their head,” leading to hallucinated grades. | 100%: The Judge is equipped with programmatic Python tools to calculate metrics deterministically. |
| Best Used For… | Hardcoded constraints (e.g., “Must contain Escalation URL”). | Standard conversational RAG, tone, fluency, and safety checks. | Multi-hop autonomous swarms, verifying API payloads, and penalizing redundant loops. |
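
To make the first column concrete, here is a minimal sketch of legacy deterministic scripting. The escalation URL and the regex are illustrative assumptions, not values from a real deployment.

```python
import re

# Hypothetical escalation link used only for this illustration.
ESCALATION_URL = "https://example.com/escalate"

def deterministic_checks(final_text: str) -> dict:
    """Legacy-style checks that look only at the final text payload."""
    return {
        "has_escalation_url": ESCALATION_URL in final_text,
        # Matches amounts such as "128,000 JPY" or "900 JPY".
        "mentions_jpy_amount": re.search(r"\b\d{1,3}(,\d{3})*\s?JPY\b", final_text) is not None,
    }

reply = "Your budget is 128,000 JPY. Need help? https://example.com/escalate"
check_results = deterministic_checks(reply)
```

Both checks pass for this reply even if the agent never called a currency API and simply invented the figure, which is the blind spot the trajectory-based paradigms are meant to close.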

:file_cabinet: Part 2: The foundation - Golden Datasets & BYOD

In LLMOps, your evaluation is only as good as the data you run it against. To evaluate an agentic trajectory, we must maintain strict datasets mapping the user prompt to the exact sequence of tools the agent should have called.

Table 2: Dataset curation strategies

| Strategy | Definition | Primary LLMOps Value | Weakness |
| --- | --- | --- | --- |
| 1. Ad-Hoc / Sandbox | Manually typing queries into a terminal or chat UI. | Rapid prototyping and vibe-checking during initial development. | Unscalable, unrepeatable, and provides zero proof of regression safety. |
| 2. Golden Datasets | Curated JSON/CSV of edge-case queries, mapped to their Expected Trajectories. | The CI/CD Gatekeeper: Used in GitHub Actions to block broken prompt logic from deploying. | Static; requires manual curation by Prompt Engineers. |
| 3. BYOD (Bring Your Own Dataset) | Ingesting real execution logs (predicted_trajectory) directly from production alongside the expected reference_trajectory. | Drift Detection: Batch evaluating real-world interactions over time to detect semantic drift or API degradation. | Requires robust OpenTelemetry tracking to capture DAGs properly. |

Example of a BYOD / Golden Dataset Row

{
  "prompt": "Get the train ticket price from Tokyo to Kyoto and convert my $800 budget.",
  "reference_trajectory": [
    {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train price"}},
    {"tool_name": "get_country_demographics", "tool_input": {"country": "Japan"}},
    {"tool_name": "live_currency_conversion", "tool_input": {"amount": 800, "from": "USD", "to": "JPY"}}
  ]
}
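
Because a malformed row quietly corrupts every downstream score, it can help to validate rows before they enter the dataset. The validator below is a minimal sketch: the function name and the exact rules are assumptions layered on the row shape shown above.

```python
REQUIRED_STEP_KEYS = {"tool_name", "tool_input"}

def validate_golden_row(row: dict) -> list:
    """Return a list of schema problems; an empty list means the row looks usable."""
    errors = []
    prompt = row.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        errors.append("prompt must be a non-empty string")
    trajectory = row.get("reference_trajectory")
    if not isinstance(trajectory, list) or not trajectory:
        errors.append("reference_trajectory must be a non-empty list")
        return errors
    for index, step in enumerate(trajectory):
        if not isinstance(step, dict):
            errors.append(f"step {index}: must be an object")
            continue
        missing = REQUIRED_STEP_KEYS - step.keys()
        if missing:
            errors.append(f"step {index}: missing {sorted(missing)}")
        elif not isinstance(step["tool_input"], dict):
            errors.append(f"step {index}: tool_input must be an object")
    return errors

golden_row = {
    "prompt": "Get the train ticket price from Tokyo to Kyoto and convert my $800 budget.",
    "reference_trajectory": [
        {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train price"}},
        {"tool_name": "get_country_demographics", "tool_input": {"country": "Japan"}},
        {"tool_name": "live_currency_conversion", "tool_input": {"amount": 800, "from": "USD", "to": "JPY"}},
    ],
}
row_errors = validate_golden_row(golden_row)
```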

:triangular_ruler: Part 3: The 4 pillars of agentic quality (metrics)

To prevent our evaluators from relying purely on semantic guessing, we ground our evaluation in strict mathematical formulas across four pillars.

Table 3: The agentic quality matrix

| Evaluation Pillar | The Core Question | Evaluation Methodology |
| --- | --- | --- |
| 1. Trajectory Accuracy | Did the agent hit the required API nodes in the DAG? | Vertex AI Eval: Checks trajectory_exact_match and trajectory_any_order_match. |
| 2. Reasoning Faithfulness | Did the final answer strictly use the data returned by the tools? | Agent-as-a-Judge: An evaluator agent cross-references the raw tool payload against the final text to detect manufactured numbers. |
| 3. Step Efficiency (SE) | Did the agent avoid redundant loops and API calls? | Mathematical: Ratio of optimal steps to actual steps taken, computed via external tools. |
| 4. Task Completion (TC) | Did the agent resolve all sub-constraints of the prompt? | Model-Based: Fractional accuracy based on multi-intent resolution. |
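
To build intuition for the two trajectory metrics in the first row, here is a simplified local approximation in plain Python. It is not the SDK's implementation, and the SDK's exact semantics (for example, how extra calls or differing inputs are treated) may differ from what this sketch assumes.

```python
import json
from collections import Counter

def _canonical(step: dict) -> tuple:
    """Reduce a step to a hashable (tool name, serialized input) pair."""
    return (step["tool_name"], json.dumps(step.get("tool_input", {}), sort_keys=True))

def exact_match(predicted: list, reference: list) -> float:
    """1.0 only if both trajectories contain the same calls in the same order."""
    return float([_canonical(s) for s in predicted] == [_canonical(s) for s in reference])

def any_order_match(predicted: list, reference: list) -> float:
    """1.0 if every reference call appears in the prediction, regardless of order."""
    remaining = Counter(_canonical(s) for s in predicted)
    for step in reference:
        key = _canonical(step)
        if remaining[key] == 0:
            return 0.0
        remaining[key] -= 1
    return 1.0

reference = [{"tool_name": "get_live_weather"}, {"tool_name": "translate_text"}]
swapped = [{"tool_name": "translate_text"}, {"tool_name": "get_live_weather"}]
scores = (exact_match(swapped, reference), any_order_match(swapped, reference))
```

A swapped call order fails the exact check but passes the any-order check, which is why the two metrics are usually reported together.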

The mathematics of agentic action

1. Task Completion Rate (TC)

$$TC = \frac{\sum_{i=1}^{N} C_i}{N}$$

Where $N$ represents the total explicit constraints requested, and $C_i \in \{0, 1\}$ represents whether constraint $i$ was answered with verified API data.
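
As a minimal sketch of this formula, the function below takes one boolean per constraint. The constraint labels are hypothetical, and how each boolean gets decided (by a judge model or by code) is outside the scope of the sketch.

```python
def task_completion(constraints: dict) -> float:
    """TC = (number of satisfied constraints) / (total constraints)."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    satisfied = sum(1 for answered in constraints.values() if answered)
    return satisfied / len(constraints)

# The Tokyo to Kyoto prompt carries two explicit constraints (labels are illustrative).
tc_score = task_completion({
    "train_ticket_price": True,   # answered with data from the graph tool
    "budget_conversion": False,   # figure not backed by a currency API call
})
```

Here one of the two constraints is verified, so `tc_score` is 0.5.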

2. Step Efficiency (SE)

$$SE = \frac{U_{\text{optimal}}}{U_{\text{actual}} + R_{\text{redundant}}}$$

Where $U_{\text{optimal}}$ is the theoretically shortest execution path, $U_{\text{actual}}$ is the number of tools invoked, and $R_{\text{redundant}}$ flags duplicate executions.
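
A worked example with assumed numbers (not taken from a real run): suppose the shortest path for a prompt is three tool calls, the agent made four, and one of those four repeated an earlier call. Reading the redundancy term as a count of duplicate executions, which is an interpretation of "flags" above, the formula gives:

```latex
SE = \frac{U_{\text{optimal}}}{U_{\text{actual}} + R_{\text{redundant}}}
   = \frac{3}{4 + 1}
   = 0.6
```

A run with no wasted calls gives 3 / (3 + 0) = 1.0, so the duplicate is penalized twice: once as an extra step and once as a redundancy.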

:laptop: Part 4: Implementation Method A - “Agent-as-a-Judge”

This implementation builds an autonomous auditor. We arm Gemini 2.5 Pro with a custom Python function so it can calculate Step Efficiency deterministically, rather than hallucinating the math.

import json
from google.adk.agents import Agent

# 1. The Interceptor (Attached to the Worker Swarm)
class TrajectoryTracker:
    def __init__(self):
        self.history = []
    def log_step(self, tool_name, inputs, output):
        self.history.append({"tool": tool_name, "inputs": inputs, "output": str(output)})

# 2. The Judge's Programmatic Math Tool
def compute_trajectory_efficiency(target_optimal_steps: int, raw_trajectory_json: str) -> str:
    """Tool utilized by the Judge Agent to compute Step Efficiency (SE) deterministically."""
    try:
        logs = json.loads(raw_trajectory_json)
        actual_steps = len(logs)
        
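        # Count redundant calls: an exact repeat of an earlier (tool, inputs) pair
        seen_calls, redundant_steps = set(), 0
        for step in logs:
            call_signature = (step['tool'], json.dumps(step.get('inputs'), sort_keys=True))
            if call_signature in seen_calls:
                redundant_steps += 1
            seen_calls.add(call_signature)

        # SE = U_optimal / (U_actual + R_redundant), capped at 1.0
        se_score = min(1.0, float(target_optimal_steps) / max(1, actual_steps + redundant_steps))
        tools_used = sorted({step['tool'] for step in logs})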
        
        metrics_packet = {
            "mathematical_step_efficiency": round(se_score, 3),
            "total_invocations_observed": actual_steps,
            "unique_tools_accessed": tools_used,
            "contains_redundant_loops": actual_steps > target_optimal_steps
        }
        return json.dumps(metrics_packet)
    except Exception as e:
        return f"Error in mathematical parsing tool: {str(e)}"

# 3. The Agentic Judge (Gemini 2.5 Pro)
trajectory_judge_agent = Agent(
    name="TrajectoryAuditorJudge",
    model="gemini-2.5-pro",
    description="Autonomous evaluation agent that calculates path quality and response faithfulness.",
    instruction="""
    You are an autonomous Agent-as-a-Judge system.
    1. Call 'compute_trajectory_efficiency' passing the raw trajectory logs to extract exact path efficiency statistics.
    2. Check Reasoning Faithfulness: Verify that values in the Final Response perfectly match the tool outputs without hallucination.
    3. Output a structured JSON scorecard.
    """,
    tools=[compute_trajectory_efficiency]
)

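Why this works: When piped together, the Judge Agent receives the worker’s JSON trajectory, calls its math tool to compute Step Efficiency in code rather than estimating it, and returns a structured scorecard. The efficiency figures are therefore deterministic; the faithfulness check in step 2 of the instruction is still a model judgment and deserves periodic spot checks.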
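
The listing above defines the TrajectoryTracker but does not show how it attaches to a tool. One possible wiring is a decorator, sketched below. The `tracked` helper and the stubbed tool with its fixed rate are assumptions for illustration, not part of ADK or of the original swarm.

```python
import functools
import json

class TrajectoryTracker:
    """Same shape as the tracker above, repeated so this sketch is self-contained."""
    def __init__(self):
        self.history = []
    def log_step(self, tool_name, inputs, output):
        self.history.append({"tool": tool_name, "inputs": inputs, "output": str(output)})

def tracked(tracker: TrajectoryTracker):
    """Hypothetical decorator: records every keyword-argument call to the wrapped tool."""
    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(**kwargs):
            output = tool_fn(**kwargs)
            tracker.log_step(tool_fn.__name__, kwargs, output)
            return output
        return wrapper
    return decorator

tracker = TrajectoryTracker()

@tracked(tracker)
def live_currency_conversion(amount: float, from_currency: str, to_currency: str) -> dict:
    # Stub with a fixed illustrative rate; a real tool would call a live API.
    return {"converted": amount * 160.0, "currency": to_currency}

live_currency_conversion(amount=800, from_currency="USD", to_currency="JPY")
raw_trajectory_json = json.dumps(tracker.history)
```

The resulting `raw_trajectory_json` string has the shape that `compute_trajectory_efficiency` expects as its second argument.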

:bar_chart: Part 5: Implementation Method B - Vertex AI Gen AI Evaluation SDK

While building a custom Agent-as-a-Judge is incredible for deep, granular debugging, enterprise teams managing massive Golden Datasets in CI/CD pipelines need scalable, asynchronous batch infrastructure.

As we discussed in our original AI Litmus Test blog comparing dfcx-scrapi with standard SDKs, choosing the right tool for bulk evaluation is critical. The Vertex AI Evaluation SDK acts similarly to dfcx-scrapi—allowing you to evaluate thousands of rows from a Pandas DataFrame simultaneously, measuring custom Pointwise metrics and Trajectory matches.

import pandas as pd
import json
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import PointwiseMetric, PointwiseMetricPromptTemplate

# 1. Define the BYOD / Golden Dataset DataFrame
eval_data = {
    "prompt": [
        "Get the train price to Kyoto and convert my $800 USD budget.",
        "Get live weather for Paris and translate 'Hello'."
    ],
    "reference_trajectory": [
        json.dumps([
            {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train"}}, 
            {"tool_name": "live_currency_conversion"}
        ]),
        json.dumps([
            {"tool_name": "get_live_weather"},
            {"tool_name": "translate_text"}
        ])
    ],
    "predicted_trajectory": [  # Captured from our TrajectoryTracker in Production
        json.dumps([
            {"tool_name": "query_tourism_graph", "tool_input": {"question": "Tokyo to Kyoto train"}}, 
            {"tool_name": "live_currency_conversion"}
        ]),
        json.dumps([
            {"tool_name": "get_live_weather"},
            {"tool_name": "translate_text"}
        ])
    ],
    "response": [
        "The ticket is 100 USD. Your budget is 128,000 JPY.",
        "It is 18C in Paris. Translation: Bonjour."
    ]
}
eval_df = pd.DataFrame(eval_data)

# 2. Define a Custom Pointwise Metric (LLM-as-a-Judge Rubric)
criteria = {
    "Follows trajectory": (
        "Evaluate whether the agent's response logically follows from the sequence of tool actions.\n"
        "  - Does the response accurately reflect the data gathered from the tools without hallucination?\n"
        "  - Are there any illogical jumps in reasoning?"
    )
}
pointwise_rating_rubric = {
    "1": "Response perfectly reflects information gathered in the trajectory.",
    "0": "Response contains hallucinated data or illogical jumps."
}

trajectory_faithfulness_prompt = PointwiseMetricPromptTemplate(
    criteria=criteria,
    rating_rubric=pointwise_rating_rubric,
    input_variables=["prompt", "predicted_trajectory"]
)
faithfulness_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=trajectory_faithfulness_prompt
)

# 3. Combine with Native SDK Trajectory Metrics
metrics_to_run = [
    "trajectory_exact_match",       # Did it follow the DAG perfectly?
    "trajectory_any_order_match",   # Did it hit all required APIs?
    "safety",                       # Is the text output safe?
    faithfulness_metric             # Custom LLM-as-a-Judge rubric defined above
]

# 4. Execute the Scalable EvalTask
eval_task = EvalTask(
    dataset=eval_df,
    metrics=metrics_to_run,
    experiment="tourism-swarm-eval-pipeline"
)

# Run the evaluation against the BYOD dataset
eval_result = eval_task.evaluate()
print("📊 Final Swarm Evaluation Summary:")
print(eval_result.summary_metrics)
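
One detail to watch when feeding production logs into this DataFrame: the TrajectoryTracker from Method A records steps under the keys "tool" and "inputs", while the dataset rows use "tool_name" and "tool_input". A small adapter bridges the two. This is a minimal sketch; the function name and the sample history are illustrative.

```python
import json

def to_eval_trajectory(history: list) -> list:
    """Map tracker entries ({"tool", "inputs", "output"}) to dataset-style steps."""
    return [
        {"tool_name": step["tool"], "tool_input": step.get("inputs") or {}}
        for step in history
    ]

# Example tracker history; outputs are placeholders, not real tool results.
history = [
    {"tool": "query_tourism_graph", "inputs": {"question": "Tokyo to Kyoto train"}, "output": "<tool output>"},
    {"tool": "live_currency_conversion", "inputs": {"amount": 800, "from": "USD", "to": "JPY"}, "output": "<tool output>"},
]

# Serialized the same way as the predicted_trajectory column above.
predicted_trajectory = json.dumps(to_eval_trajectory(history))
```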

:hammer_and_wrench: Part 6: Choosing your deployment strategy

How do you choose between building a custom Agent-as-a-Judge and utilizing the Vertex AI Evaluation SDK?

Table 4: Framework selection guide

| Requirement | Custom Agent-as-a-Judge (Method A) | Vertex AI Evaluation SDK (Method B) |
| --- | --- | --- |
| Complexity of Validation | High: Can execute Python scripts to validate DB state, write files, or double-check math programmatically. | Medium: Excellent for text analysis and exact structural matching via Prompt Templates, but cannot execute custom code. |
| Scale & Concurrency | Manual: Requires you to write custom asyncio loops to batch process multiple rows. | Automated: Seamlessly handles Pandas DataFrames with thousands of rows concurrently out-of-the-box. |
| Observability Integration | Requires Setup: Must manually map outputs to your visualization platform. | Native: Integrates instantly with Vertex AI Experiments for visual Radar and Bar charts over time. |
| Verdict | Use for deep, programmatic validation of complex reasoning chains requiring external tool access to verify truth. | Use for CI/CD pipeline integration, managing large Golden Datasets, and tracking regressions at scale. |
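
For the "custom asyncio loops" mentioned under Method A, a bounded-concurrency batch runner might look like the sketch below. The `judge_one` coroutine is a stand-in that only counts steps; a real version would invoke the Judge Agent, and the concurrency limit is an arbitrary assumption.

```python
import asyncio

async def judge_one(row: dict) -> dict:
    """Placeholder for a single judge call; swap in the real agent invocation."""
    await asyncio.sleep(0)  # yields control, simulating network I/O
    return {"prompt": row["prompt"], "steps": len(row["predicted_trajectory"])}

async def judge_batch(rows: list, max_concurrency: int = 5) -> list:
    """Judge all rows with at most max_concurrency calls in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(row: dict) -> dict:
        async with semaphore:
            return await judge_one(row)

    # gather returns results in the same order as the input rows.
    return await asyncio.gather(*(bounded(row) for row in rows))

rows = [
    {"prompt": "Get the train price to Kyoto.", "predicted_trajectory": [{"tool": "query_tourism_graph"}]},
    {"prompt": "Get live weather for Paris and translate 'Hello'.",
     "predicted_trajectory": [{"tool": "get_live_weather"}, {"tool": "translate_text"}]},
]
batch_results = asyncio.run(judge_batch(rows))
```

The semaphore keeps the batch from exceeding model quota, which is the main thing a hand-rolled loop has to manage that the SDK handles for you.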

:rocket: Conclusion: LLMOps and the path to production

The difference between a “cool sandbox demo” and a “production-grade enterprise asset” is rigorous, automated testing.

By implementing an Agent-as-a-Judge framework and combining it with the batch-processing analytics power of the Vertex AI Evaluation SDK, you fundamentally transform your AI development lifecycle:

  1. Establish Trust: You provide programmatic, mathematical proof to your risk teams that your agent does not hallucinate numerical data or get stuck in infinite API loops.
  2. Automated Deployment Gating (CI/CD): If a Prompt Engineer accidentally breaks a reasoning chain in a pull request, the evaluation run surfaces the drop in SE (from the Judge’s math tool) or in trajectory_exact_match (from the EvalTask), and the pipeline’s threshold gate blocks the deployment.
  3. Enterprise Observability: Integrate these evaluation frameworks natively with OpenTelemetry tracing tools like Langfuse, Arize Phoenix, or Braintrust to continuously monitor the Directed Acyclic Graph (DAG) in production.

Start evaluating scientifically today. The era of the “Vibes Check” is officially over.

We can’t wait to see what you build. Share your creations and ask questions in the Google Cloud Community. Happy coding!

Let’s keep the conversation going! Share your thoughts, questions, and ideas in the comments.

Note: Should you have any concerns or queries about this post or my implementation, please feel free to connect with me on LinkedIn! Thanks!

Excellent article. The shift from the “vibes check” to scientific trajectory evaluation is exactly what separates functional automation from scalable infrastructure. The concept of “Step Efficiency (SE)”, measuring the ratio of optimal to actual steps, is critical not only for AI agents but for any automated decision pipeline. In AI-driven B2B prospecting, tracking a lead’s qualification trajectory (which APIs were queried, how many contact attempts were made, what each stage cost) is what makes a mathematical reduction in CAC possible. The Agent-as-a-Judge methodology, with programmatic tools for deterministic calculation, eliminates metric hallucination, which is the same logic we apply to audit every commercial interaction. Congratulations on the technical depth.