Evaluating success in a multi-agent system: Why trajectory assessment and handoffs matter

Authors

  • Chitra Venkatesh, Product Manager
  • Junjie Bu, Senior Staff Software Engineer

As AI applications advance, multi-agent systems are becoming vital for managing complex tasks by assigning specialized roles to different agents, such as separating a CRM agent from a customer support agent. While these systems face challenges like discovering other agents, sharing context, and handling long-running operations, the Agent2Agent (A2A) protocol provides a paradigm that simplifies agent discovery, ensures scalability, and breaks down silos, ultimately fostering seamless communication and boosting both autonomy and productivity across multi-agent ecosystems.

Our current approach

As we shift towards multi-agent systems for complex use cases, our approach to evaluating agent performance needs a revamp. We often assess agents on a standalone basis, but is that truly the right approach when they’re designed to collaborate?

Current agent benchmarks typically evaluate single-agent performance using metrics like task completion rates, latency, consistency, and cost. Current assessments of AI agent trajectories focus on their ability to use tools, often overlooking their capacity to interact with other agents.

However, in multi-agent systems, assessing agents in isolation doesn’t provide a good estimate of the entire system’s end-to-end performance. We need to better understand how different agents work together towards the end solution collaboratively. In this post, we’ll explore why this is the case and discuss ways to address this measurement gap.

Evaluating performance in a multi-agent system

A multi-agent system consists of a chain of agent and tool calls. The quality of reasoning and the end-to-end task completion both need to be factored in. This is why trajectory assessment is important. Let’s take an example.

Assume a two-agent system where Agent A is a customer support agent and Agent B is a grievance redressal agent.

Assume a case where a customer reaches out with a complaint about a faulty product they recently purchased.

Customer interaction trajectory example:

  1. Customer initiates contact: “Hi, I bought a SmartWidget last week, and it’s not turning on. I’d like a refund or a replacement.”
  2. Agent A (Customer Support) initiates:
  • Uses the Greeting Tool to acknowledge the customer.
  • Uses the Customer Information Tool to verify the customer’s identity.
  • Uses the Customer Purchase History Tool to confirm the SmartWidget purchase and its date.
  • Recognizes the request falls under a grievance (refund/replacement) that requires specialized tools it doesn’t possess.
  • Decision: Determines the need to hand off the case to Agent B.
  • Action: Forwards the relevant customer and product details, along with the customer’s request, to Agent B.
  3. Agent B (Grievance Redressal) takes over:
  • Receives the context from Agent A.
  • Analyzes the request for a refund or replacement.
  • Decision: Determines the most appropriate resolution based on policy (e.g., if within the return window, offer refund or replacement).
  • Action (Scenario 1: Refund): Uses the Customer Refund Tool to process the refund.
  • Action (Scenario 2: Replacement): Uses the Order Replacement Tool to initiate a new shipment.
  • Action (Scenario 3: Complex Issue): If the issue is unusual or requires manager approval, uses the Escalation Tool.
  • Communicates the resolution back through Agent A, or directly to the customer (depending on system design).
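The trajectory above can be sketched as a simple data structure. This is an illustrative sketch only: the step fields and tool names are our own, not part of the A2A protocol or any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    agent: str        # which agent acted
    action: str       # a tool call, decision, or handoff
    detail: str = ""

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

    def record(self, agent: str, action: str, detail: str = "") -> None:
        self.steps.append(Step(agent, action, detail))

# Replay the SmartWidget trajectory (refund scenario).
t = Trajectory()
t.record("Agent A", "tool:greeting")
t.record("Agent A", "tool:customer_info", "identity verified")
t.record("Agent A", "tool:purchase_history", "SmartWidget purchase confirmed")
t.record("Agent A", "handoff", "context forwarded to Agent B")
t.record("Agent B", "tool:customer_refund", "refund processed")

print(len(t.steps))       # 5
print(t.steps[3].action)  # handoff
```

Capturing the handoff as an explicit step in the trajectory is what makes it assessable later, rather than invisible inside either agent's private log.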

Why traditional metrics fall short here:

Agent A’s “Task Completion”: On its own, Agent A didn’t “complete” the customer’s request for a refund. It successfully identified the need and handed it off. Evaluating only Agent A would show incomplete resolution.

Agent B’s “Task Completion”: Agent B might successfully process a refund, but if Agent A failed to correctly identify the customer or their purchase, Agent B’s success is moot from an end-to-end perspective.

In a multi-agent system, a key measure of an agent’s performance is its ability to successfully hand off to the next agent while seamlessly sharing relevant context.
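One way to make this concrete is to score the trajectory as a whole, crediting a handoff only when it carried complete context. The scoring rules and field names below are illustrative assumptions, not a standard metric:

```python
def evaluate_trajectory(steps):
    """Score a trajectory end-to-end rather than per agent.

    `steps` is a list of dicts with keys: agent, action, ok.
    A handoff only counts if the context it carried was complete,
    so Agent B's "success" on corrupt context does not score.
    """
    handoff_ok = any(
        s["action"] == "handoff" and s.get("context_complete", False)
        for s in steps
    )
    resolved = any(s["action"] == "resolution" and s["ok"] for s in steps)
    return {
        "handoff_success": handoff_ok,
        "end_to_end_success": handoff_ok and resolved,
    }

steps = [
    {"agent": "A", "action": "verify_customer", "ok": True},
    {"agent": "A", "action": "handoff", "ok": True, "context_complete": True},
    {"agent": "B", "action": "resolution", "ok": True},
]
print(evaluate_trajectory(steps))
```

Under this rule, Agent A's handoff and Agent B's resolution only produce an end-to-end success together, which is exactly the property isolated per-agent metrics miss.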

Latency: Measuring latency for individual agent steps doesn’t reflect the total time from initial customer contact to final resolution.

Cost: While individual agent operational costs are tracked, the overall cost of the entire interaction chain (including hand-off overhead) is the more relevant metric for the business.
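Both points reduce to aggregating over the whole chain rather than per step. A minimal sketch, with invented step records and a hypothetical hand-off overhead entry:

```python
# Aggregate latency and cost across the entire interaction chain,
# including the hand-off itself. All values are illustrative.
trajectory = [
    {"agent": "A", "latency_s": 1.2, "cost_usd": 0.004},
    {"agent": "A", "latency_s": 0.8, "cost_usd": 0.003},
    {"agent": "handoff", "latency_s": 0.5, "cost_usd": 0.001},  # hand-off overhead
    {"agent": "B", "latency_s": 2.1, "cost_usd": 0.006},
]

total_latency = sum(s["latency_s"] for s in trajectory)
total_cost = sum(s["cost_usd"] for s in trajectory)
print(f"end-to-end latency: {total_latency:.1f}s, total cost: ${total_cost:.3f}")
```

Note that the hand-off row contributes to both totals; per-agent dashboards that sum only Agent A's and Agent B's steps would silently drop it.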

Path forward

It is evident that trajectory evaluation across agents is imperative. We may also need a mechanism to enable an agent to declare its capability to emit structured data specifically for evaluation purposes, setting the stage for what kind of evaluation data an agent is configured to expose. Other considerations include compliance adherence and evaluation frameworks suited to this paradigm.
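One way such a declaration might look is a field on an A2A-style agent card. The `evaluation` block below is hypothetical, our own sketch rather than part of the A2A specification:

```python
import json

# Hypothetical extension of an A2A-style agent card: the "evaluation"
# block declares what structured data this agent can emit for evaluators.
agent_card = {
    "name": "grievance-redressal-agent",
    "capabilities": ["refund", "replacement", "escalation"],
    "evaluation": {  # hypothetical field, not in the A2A spec
        "emits": ["handoff_context", "tool_calls", "latency_ms", "cost_usd"],
        "format": "json",
    },
}

declared = agent_card["evaluation"]["emits"]
print(json.dumps(declared))
```

An evaluation harness could read this declaration at discovery time and decide up front whether the agent exposes enough signal for trajectory-level assessment.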

We will be exploring this further in our next blog post.

Resources

And more to come!


Beautiful breakdown

Clearly highlights why multi-agent systems need end-to-end, collaboration-focused evaluation instead of isolated agent metrics.

A very smart technique.

Great analysis, Chitra and Junjie. Your observation that “assessing agents in isolation doesn’t provide a good estimate of the entire system’s end-to-end performance” is the critical hurdle for the next generation of AI.

​In response to your findings, I have published a research paper titled Optimizing Multi-Agent Trajectories via Cyber-Physical Context Engineering (Jan 2026). We analyzed the exact gap you identified: that an Agent B’s success is “moot” if Agent A provides corrupt context, yet standalone metrics miss this failure. We call this the “Hand-off Paradox.”

​To address the challenges of “discovering other agents” and “sharing context”, our research posits that the A2A protocol requires an architectural layer we call Artificial Wisdom (Integrity Supervisor).

​Instead of relying solely on agents “to declare its capability to emit structured data”, the Hoogland Methodology implements a “Truth Filter” at the hand-off point. This verifies the Authority Provenance and Hardware Reality of the context before the receiving agent acts. This transforms the hand-off from a fragile data transaction into a validated context transfer, preventing cascade failures and ensuring the “end-to-end task completion” you are aiming for.

​I believe this “Integrity Supervisor” is the missing link to robust, scalable Multi-Agent Systems.

Here is the research paper I referred to in this post, for your convenience.

Research Paper: Optimizing Multi-Agent Trajectories via Cyber-Physical Context Engineering

From Isolated Agents to Integrous Ecosystems: Applying the Hoogland Methodology to the Agent2Agent (A2A) Protocol

Date: 17 January 2026

Author: Davey Hoogland | Lead Security Researcher, Creative Mind Solutions

Subject: Analysis of Google’s Multi-Agent Evaluation Framework and the integration of Artificial Wisdom.

1. Abstract

Recent publications by Google’s Office of the CTO and Engineering teams, specifically by Venkatesh and Bu, highlight that current evaluation methods for AI agents are lacking in Multi-Agent Systems (MAS). As the industry shifts to specialized agents (e.g., CRM separate from Customer Support), context sharing and discovery are critical bottlenecks. This paper analyzes these findings and posits the Hoogland Methodology as the necessary architectural solution. By implementing an “Integrity Supervisor” (Artificial Wisdom), the “hand-off” between agents becomes not only a data transaction, but a validated context transfer, which mitigates the risks identified by Google in end-to-end performance.

2. Problem: The “Blind Spot” in Current Multi-Agent Systems

In their analysis, Venkatesh and Bu note that the current benchmarks focus on single agent performance (tools, latency, consistency). In a collaborative ecosystem, however, this is inadequate.

2.1 Failure of Isolated Metrics

A multi-agent system consists of a chain of agent and tool calls.

The Hand-off Paradox: An agent (Agent A) can technically complete its task successfully (e.g., forwarding a ticket), but if the context is incomplete or corrupt, the succeeding agent (Agent B) fails to deliver the final solution.

Moot Success: As Google states, if Agent A misidentifies the customer, Agent B’s successful refund process is irrelevant from an end-to-end perspective.

2.2 The Context Problem

Google identifies specific challenges such as “discovering other agents” and “sharing context.” Without an overarching framework, contextual transfer is fragile. Agent B must blindly rely on the data provided by Agent A. This is where the Hoogland Methodology identifies a critical gap: the lack of Context Integrity Verification.

3. Analysis: Application of the Hoogland Methodology

The Hoogland Methodology, based on Cyber-Physical Context Engineering, offers the architectural solution for the “Trajectory Assessment” that Google deems necessary.

3.1 Cognitive Profiling as a solution to ‘Agent Discovery’

Venkatesh describes the need for agents to declare their capabilities.

Google’s approach: Structured data emission for evaluation.

Hoogland’s Innovation: We apply the Constructivist Engine principle. Instead of passive data emissions, every agent must have an active “Cognitive Profile.” Agent A does not simply “search” for Agent B; rather, the Constructivist Engine synthesizes the customer’s intention and matches it against Agent B’s Integrity Constraints. This prevents tasks from being passed on to agents that have the tool, but not the authority (e.g., a refund above a certain amount).
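The paper does not publish an implementation, but the routing idea it describes, matching on capability and authority together, can be sketched briefly. All agent names, profiles, and limits below are invented for illustration:

```python
# Illustrative sketch: route a task to an agent by capability AND authority,
# so a refund above an agent's limit is never handed to it.
AGENTS = {
    "agent-b": {"capabilities": {"refund", "replacement"}, "refund_limit_usd": 100.0},
    "agent-c": {"capabilities": {"refund"}, "refund_limit_usd": 1000.0},
}

def route(task: str, amount_usd: float):
    """Return the first agent whose profile covers both the task
    and the amount, or None if no agent is authorized."""
    for name, profile in AGENTS.items():
        if task in profile["capabilities"] and amount_usd <= profile["refund_limit_usd"]:
            return name
    return None

print(route("refund", 50.0))   # agent-b
print(route("refund", 500.0))  # agent-c: exceeds agent-b's limit
```

A capability-only matcher would have routed the $500 refund to agent-b and failed at execution time; including the authority constraint moves that failure to routing time.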

3.2 Artificial Wisdom as the Hand-off Supervisor

In the scenario outlined by Google (SmartWidget Refund), the transfer from Agent A to Agent B is the vulnerable point.

The Vulnerability: Agent B receives context from Agent A and takes action (Refund Tool). If Agent A has hallucinated that the warranty is valid, Agent B will issue an unjustified refund.

The Truth Filter: The Hoogland Methodology introduces an intermediate layer: Artificial Wisdom. Before Agent B accepts the context, it passes through a Truth Filter. This layer verifies the Hardware Reality (is the purchase date in line with the current Secure Time?) and the Authority Provenance (is Agent A authorized to initiate this escalation?).
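The paper gives no code for this filter; a minimal sketch of the kind of pre-acceptance check it describes might look as follows, with an invented authority registry, return-window policy, and context fields:

```python
from datetime import datetime, timezone, timedelta

AUTHORIZED_SENDERS = {"agent-a"}    # assumed authority registry
RETURN_WINDOW = timedelta(days=30)  # assumed refund policy

def truth_filter(context: dict) -> bool:
    """Validate a hand-off context before the receiving agent acts.

    Checks two of the properties named in the paper: hardware reality
    (is the purchase date plausible against the current clock?) and
    authority provenance (is the sender allowed to initiate this?).
    """
    now = datetime.now(timezone.utc)
    purchase = datetime.fromisoformat(context["purchase_date"])
    if purchase > now:                  # purchase in the future: corrupt context
        return False
    if now - purchase > RETURN_WINDOW:  # outside the return window
        return False
    return context["sender"] in AUTHORIZED_SENDERS

ctx = {
    "sender": "agent-a",
    "purchase_date": (datetime.now(timezone.utc) - timedelta(days=7)).isoformat(),
}
print(truth_filter(ctx))  # True: recent purchase from an authorized sender
```

The key design point is that the check runs at the hand-off boundary, not inside either agent, so a rejected context never reaches Agent B's tools at all.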

3.3 Synchronization of Processes

Google cites “handling long running operations” as a challenge. The Chronos Analyzer (part of the Hoogland Methodology) addresses this by detecting temporal desynchronization. In an A2A protocol, this ensures that Agent B does not act on outdated context from Agent A, a problem that often leads to race conditions in standard asynchronous systems.
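In its simplest form, such a staleness check is just a freshness budget on the context timestamp. The threshold and function below are illustrative, not part of any published Chronos Analyzer API:

```python
import time

MAX_CONTEXT_AGE_S = 60.0  # assumed freshness budget for hand-off context

def is_fresh(context_ts: float, now: float = None) -> bool:
    """Reject context older than the freshness budget so the receiving
    agent never acts on stale state from the sender."""
    now = time.time() if now is None else now
    return (now - context_ts) <= MAX_CONTEXT_AGE_S

t0 = 1_000_000.0
print(is_fresh(t0, now=t0 + 30))   # True: 30s old, within budget
print(is_fresh(t0, now=t0 + 120))  # False: 120s old, stale
```

In a long-running operation, the sender would re-emit context (or the receiver would re-fetch it) whenever this check fails, instead of acting on the stale snapshot.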

4. Potential Innovations for the Agent2Agent (A2A) Protocol

By integrating Context Engineering into the A2A paradigm, we unlock new possibilities for Agentic AI:

Verifiable Trajectories:

Google is looking for ways to measure end-to-end performance. The Hoogland Methodology proposes to use the Compliance Auditor as a “Black Box” recorder that not only logs the output, but validates the integrity of each hand-off. This makes the processes auditable for compliance (EU AI Act).

Semantic Silo Breaking:

Google says A2A “breaks down silos.” We add that Context Engineering breaks through semantic silos. By forcing agents to communicate via a standardized Integrity Protocol (part of Artificial Wisdom), we eliminate miscommunication between a CRM agent and a Support agent.

Prevention of cascade failure:

In the current model, an error in Agent A leads to failure in Agent B. With the Hoogland Methodology, the Truth Filter stops the chain as soon as Agent A provides a corrupt context. Agent B is protected from performing senseless or harmful actions.

5. Conclusion

The analysis by Venkatesh and Bu correctly confirms that isolated evaluation of agents fails in complex ecosystems. However, mere measurement of “trajectories” is insufficient without a mechanism for validation.

The Hoogland Methodology offers the missing link: Artificial Wisdom. By expanding the cognitive ability of the AI with an Integrity Supervisor, we are transforming the Multi-Agent System from a fragile chain of assumptions into a robust network of verified context. This validates Context Engineering not only as a theoretical concept, but as an operational necessity for scaling Google’s Agent2Agent protocol.

Sources quoted:

[1] Venkatesh, C., & Bu, J. (Google). “Agent2Agent (A2A) protocol provides a paradigm that simplifies agent discovery… breaks down silos.”

[2] Venkatesh, C., & Bu, J. (Google). “We often assess agents on a standalone basis, but is that truly the right approach when they are designed to collaborate?”

[3] Venkatesh, C., & Bu, J. (Google). “Current assessments of AI agent trajectories… often overlook their capacity to interact with other agents.”

[4] Venkatesh, C., & Bu, J. (Google). “A multi-agent system consists of a chain of agent and tool calls.”

[5] Venkatesh, C., & Bu, J. (Google). “Assume a case where a customer reaches out with a complaint about a faulty product…”

[6] Venkatesh, C., & Bu, J. (Google). “Action (Scenario 1: Refund): Uses the Customer Refund Tool to process the refund.”

[7] Venkatesh, C., & Bu, J. (Google). “Agent B’s success is moot from an end-to-end perspective… Evaluating only Agent A would show incomplete resolution.”

[8] Venkatesh, C., & Bu, J. (Google). “In a multi-agent system, assessing agents in isolation doesn’t provide a good estimate of the entire system’s end-to-end performance.”

[9] Venkatesh, C., & Bu, J. (Google). “It is evident that trajectory evaluation across agents is imperative… enable an agent to declare its capability to emit structured data specifically for evaluation purposes.”