Evaluating success in a multi-agent system: Why trajectory assessment and handoffs matter

Authors

  • Chitra Venkatesh, Product Manager
  • Junjie Bu, Senior Staff Software Engineer

As AI applications advance, multi-agent systems are becoming vital for managing complex tasks by assigning specialized roles to different agents, such as separating a CRM agent from a customer support agent. While these systems face challenges like discovering other agents, sharing context, and handling long-running operations, the Agent2Agent (A2A) protocol provides a paradigm that simplifies agent discovery, ensures scalability, and breaks down silos, ultimately fostering seamless communication and boosting both autonomy and productivity across multi-agent ecosystems.

Our current approach

As we shift towards multi-agent systems for complex use cases, our approach to evaluating agent performance needs a revamp. We often assess agents on a standalone basis, but is that truly the right approach when they’re designed to collaborate?

Current agent benchmarks typically evaluate single-agent performance using metrics like task completion rates, latency, consistency, and cost. Current assessments of AI agent trajectories focus on their ability to use tools, often overlooking their capacity to interact with other agents.

However, in multi-agent systems, assessing agents in isolation doesn’t provide a good estimate of the entire system’s end-to-end performance. We need to better understand how different agents work together towards the end solution collaboratively. In this post, we’ll explore why this is the case and discuss ways to address this measurement gap.

Evaluating performance in a multi-agent system

A multi-agent system consists of a chain of agent and tool calls. The quality of reasoning and the end-to-end task completion both need to be factored in. This is why trajectory assessment is important. Let’s take an example.

Assume a two-agent system where Agent A is a customer support agent and Agent B is a grievance redressal agent.

Assume a case where a customer reaches out with a complaint about a faulty product they recently purchased.

Customer interaction trajectory example:

  1. Customer initiates contact: “Hi, I bought a SmartWidget last week, and it’s not turning on. I’d like a refund or a replacement.”
  2. Agent A (Customer Support) initiates:
  • Uses the Greeting Tool to acknowledge the customer.
  • Uses the Customer Information Tool to verify the customer’s identity.
  • Uses the Customer Purchase History Tool to confirm the SmartWidget purchase and its date.
  • Recognizes the request falls under a grievance (refund/replacement) that requires specialized tools it doesn’t possess.
  • Decision: Determines the need to hand off the case to Agent B.
  • Action: Forwards the relevant customer and product details, along with the customer’s request, to Agent B.
  3. Agent B (Grievance Redressal) takes over:
  • Receives the context from Agent A.
  • Analyzes the request for a refund or replacement.
  • Decision: Determines the most appropriate resolution based on policy (e.g., if within the return window, offer refund or replacement).
  • Action (Scenario 1: Refund): Uses the Customer Refund Tool to process the refund.
  • Action (Scenario 2: Replacement): Uses the Order Replacement Tool to initiate a new shipment.
  • Action (Scenario 3: Complex Issue): If the issue is unusual or requires manager approval, uses the Escalation Tool.
  • Communicates the resolution back through Agent A, or directly to the customer (depending on system design).
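The trajectory above can be sketched as a simple data structure. This is an illustrative sketch only: the step fields and tool names are our own, not part of the A2A protocol or any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    agent: str        # which agent acted
    action: str       # a tool call, decision, or handoff
    detail: str = ""

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

    def record(self, agent: str, action: str, detail: str = "") -> None:
        self.steps.append(Step(agent, action, detail))

# Replay the SmartWidget trajectory (refund scenario).
t = Trajectory()
t.record("Agent A", "tool:greeting")
t.record("Agent A", "tool:customer_info", "identity verified")
t.record("Agent A", "tool:purchase_history", "SmartWidget purchase confirmed")
t.record("Agent A", "handoff", "context forwarded to Agent B")
t.record("Agent B", "tool:customer_refund", "refund processed")

print(len(t.steps))       # 5
print(t.steps[3].action)  # handoff
```

Capturing the handoff as an explicit step in the trajectory is what makes it assessable later, rather than invisible inside either agent's private log.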

Why traditional metrics fall short here:

Agent A’s “Task Completion”: On its own, Agent A didn’t “complete” the customer’s request for a refund. It successfully identified the need and handed it off. Evaluating only Agent A would show incomplete resolution.

Agent B’s “Task Completion”: Agent B might successfully process a refund, but if Agent A failed to correctly identify the customer or their purchase, Agent B’s success is moot from an end-to-end perspective.

In a multi-agent system, a key measure of an agent’s performance is its ability to successfully hand off to the next agent while seamlessly sharing relevant context.
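One way to make this concrete is to score the trajectory as a whole, crediting a handoff only when it carried complete context. The scoring rules and field names below are illustrative assumptions, not a standard metric:

```python
def evaluate_trajectory(steps):
    """Score a trajectory end-to-end rather than per agent.

    `steps` is a list of dicts with keys: agent, action, ok.
    A handoff only counts if the context it carried was complete,
    so Agent B's "success" on corrupt context does not score.
    """
    handoff_ok = any(
        s["action"] == "handoff" and s.get("context_complete", False)
        for s in steps
    )
    resolved = any(s["action"] == "resolution" and s["ok"] for s in steps)
    return {
        "handoff_success": handoff_ok,
        "end_to_end_success": handoff_ok and resolved,
    }

steps = [
    {"agent": "A", "action": "verify_customer", "ok": True},
    {"agent": "A", "action": "handoff", "ok": True, "context_complete": True},
    {"agent": "B", "action": "resolution", "ok": True},
]
print(evaluate_trajectory(steps))
```

Under this rule, Agent A's handoff and Agent B's resolution only produce an end-to-end success together, which is exactly the property isolated per-agent metrics miss.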

Latency: Measuring latency for individual agent steps doesn’t reflect the total time from initial customer contact to final resolution.

Cost: While individual agent operational costs are tracked, the overall cost of the entire interaction chain (including hand-off overhead) is the more relevant metric for the business.
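Both points reduce to aggregating over the whole chain rather than per step. A minimal sketch, with invented step records and a hypothetical hand-off overhead entry:

```python
# Aggregate latency and cost across the entire interaction chain,
# including the hand-off itself. All values are illustrative.
trajectory = [
    {"agent": "A", "latency_s": 1.2, "cost_usd": 0.004},
    {"agent": "A", "latency_s": 0.8, "cost_usd": 0.003},
    {"agent": "handoff", "latency_s": 0.5, "cost_usd": 0.001},  # hand-off overhead
    {"agent": "B", "latency_s": 2.1, "cost_usd": 0.006},
]

total_latency = sum(s["latency_s"] for s in trajectory)
total_cost = sum(s["cost_usd"] for s in trajectory)
print(f"end-to-end latency: {total_latency:.1f}s, total cost: ${total_cost:.3f}")
```

Note that the hand-off row contributes to both totals; per-agent dashboards that sum only Agent A's and Agent B's steps would silently drop it.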

Path forward

It is evident that trajectory evaluation across agents is imperative. We may also need a mechanism to enable an agent to declare its capability to emit structured data specifically for evaluation purposes, setting the stage for what kind of evaluation data an agent is configured to expose. Other considerations include compliance adherence and evaluation frameworks suited to this paradigm.
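One way such a declaration might look is a field on an A2A-style agent card. The `evaluation` block below is hypothetical, our own sketch rather than part of the A2A specification:

```python
import json

# Hypothetical extension of an A2A-style agent card: the "evaluation"
# block declares what structured data this agent can emit for evaluators.
agent_card = {
    "name": "grievance-redressal-agent",
    "capabilities": ["refund", "replacement", "escalation"],
    "evaluation": {  # hypothetical field, not in the A2A spec
        "emits": ["handoff_context", "tool_calls", "latency_ms", "cost_usd"],
        "format": "json",
    },
}

declared = agent_card["evaluation"]["emits"]
print(json.dumps(declared))
```

An evaluation harness could read this declaration at discovery time and decide up front whether the agent exposes enough signal for trajectory-level assessment.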

We will be exploring this further in our next blog post.

Resources

And more to come!


Beautiful breakdown

Clearly highlights why multi-agent systems need end-to-end, collaboration-focused evaluation instead of isolated agent metrics.

A very smart technique.

Great analysis, Chitra and Junjie. Your observation that “assessing agents in isolation doesn’t provide a good estimate of the entire system’s end-to-end performance” is the critical hurdle for the next generation of AI.

​In response to your findings, I have published a research paper titled Optimizing Multi-Agent Trajectories via Cyber-Physical Context Engineering (Jan 2026). We analyzed the exact gap you identified: that an Agent B’s success is “moot” if Agent A provides corrupt context, yet standalone metrics miss this failure. We call this the “Hand-off Paradox.”

​To address the challenges of “discovering other agents” and “sharing context”, our research posits that the A2A protocol requires an architectural layer we call Artificial Wisdom (Integrity Supervisor).

​Instead of relying solely on agents “to declare its capability to emit structured data”, the Hoogland Methodology implements a “Truth Filter” at the hand-off point. This verifies the Authority Provenance and Hardware Reality of the context before the receiving agent acts. This transforms the hand-off from a fragile data transaction into a validated context transfer, preventing cascade failures and ensuring the “end-to-end task completion” you are aiming for.

​I believe this “Integrity Supervisor” is the missing link to robust, scalable Multi-Agent Systems.

Here is the research paper I referred to in this post, for your convenience.

Research Paper: Optimizing Multi-Agent Trajectories via Cyber-Physical Context Engineering

From Isolated Agents to Integrous Ecosystems: Applying the Hoogland Methodology to the Agent2Agent (A2A) Protocol

Date: 17 January 2026

Author: Davey Hoogland | Lead Security Researcher, Creative Mind Solutions

Subject: Analysis of Google’s Multi-Agent Evaluation Framework and the integration of Artificial Wisdom.

1. Abstract

Recent publications by Google’s Office of the CTO and Engineering teams, specifically by Venkatesh and Bu, highlight that current evaluation methods for AI agents are lacking in Multi-Agent Systems (MAS). As the industry shifts to specialized agents (e.g., CRM separate from Customer Support), context sharing and discovery are critical bottlenecks. This paper analyzes these findings and posits the Hoogland Methodology as the necessary architectural solution. By implementing an “Integrity Supervisor” (Artificial Wisdom), the “hand-off” between agents becomes not only a data transaction, but a validated context transfer, which mitigates the risks identified by Google in end-to-end performance.

2. Problem: The “Blind Spot” in Current Multi-Agent Systems

In their analysis, Venkatesh and Bu note that the current benchmarks focus on single agent performance (tools, latency, consistency). In a collaborative ecosystem, however, this is inadequate.

2.1 Failure of Isolated Metrics

A multi-agent system consists of a chain of agent and tool calls.

The Hand-off Paradox: An agent (Agent A) can technically complete its task successfully (e.g., forwarding a ticket), but if the context is incomplete or corrupt, the succeeding agent (Agent B) fails to deliver the final solution.

Moot Success: As Google states, if Agent A misidentifies the customer, Agent B’s successful refund process is irrelevant from an end-to-end perspective.

2.2 The Context Problem

Google identifies specific challenges such as “discovering other agents” and “sharing context.” Without an overarching framework, contextual transfer is fragile. Agent B must blindly rely on the data provided by Agent A. This is where the Hoogland Methodology identifies a critical gap: the lack of Context Integrity Verification.

3. Analysis: Application of the Hoogland Methodology

The Hoogland Methodology, based on Cyber-Physical Context Engineering, offers the architectural solution for the “Trajectory Assessment” that Google deems necessary.

3.1 Cognitive Profiling as a solution to ‘Agent Discovery’

Venkatesh describes the need for agents to declare their capabilities.

Google’s approach: Structured data emission for evaluation.

Hoogland’s Innovation: We apply the Constructivist Engine principle. Instead of passive data emissions, every agent must have an active “Cognitive Profile.” Agent A does not simply “search” for Agent B; rather, the Constructivist Engine synthesizes the customer’s intention and matches it against Agent B’s Integrity Constraints. This prevents tasks from being passed on to agents that have the tool, but not the authority (e.g., a refund above a certain amount).
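The paper does not publish an implementation, but the routing idea it describes, matching on capability and authority together, can be sketched briefly. All agent names, profiles, and limits below are invented for illustration:

```python
# Illustrative sketch: route a task to an agent by capability AND authority,
# so a refund above an agent's limit is never handed to it.
AGENTS = {
    "agent-b": {"capabilities": {"refund", "replacement"}, "refund_limit_usd": 100.0},
    "agent-c": {"capabilities": {"refund"}, "refund_limit_usd": 1000.0},
}

def route(task: str, amount_usd: float):
    """Return the first agent whose profile covers both the task
    and the amount, or None if no agent is authorized."""
    for name, profile in AGENTS.items():
        if task in profile["capabilities"] and amount_usd <= profile["refund_limit_usd"]:
            return name
    return None

print(route("refund", 50.0))   # agent-b
print(route("refund", 500.0))  # agent-c: exceeds agent-b's limit
```

A capability-only matcher would have routed the $500 refund to agent-b and failed at execution time; including the authority constraint moves that failure to routing time.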

3.2 Artificial Wisdom as the Hand-off Supervisor

In the scenario outlined by Google (SmartWidget Refund), the transfer from Agent A to Agent B is the vulnerable point.

The Vulnerability: Agent B receives context from Agent A and takes action (Refund Tool). If Agent A has hallucinated that the warranty is valid, Agent B will issue an unjustified refund.

The Truth Filter: The Hoogland Methodology introduces an intermediate layer: Artificial Wisdom. Before Agent B accepts the context, it passes through a Truth Filter. This layer verifies the Hardware Reality (is the purchase date in line with the current Secure Time?) and the Authority Provenance (is Agent A authorized to initiate this escalation?).
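The paper gives no code for this filter; a minimal sketch of the kind of pre-acceptance check it describes might look as follows, with an invented authority registry, return-window policy, and context fields:

```python
from datetime import datetime, timezone, timedelta

AUTHORIZED_SENDERS = {"agent-a"}    # assumed authority registry
RETURN_WINDOW = timedelta(days=30)  # assumed refund policy

def truth_filter(context: dict) -> bool:
    """Validate a hand-off context before the receiving agent acts.

    Checks two of the properties named in the paper: hardware reality
    (is the purchase date plausible against the current clock?) and
    authority provenance (is the sender allowed to initiate this?).
    """
    now = datetime.now(timezone.utc)
    purchase = datetime.fromisoformat(context["purchase_date"])
    if purchase > now:                  # purchase in the future: corrupt context
        return False
    if now - purchase > RETURN_WINDOW:  # outside the return window
        return False
    return context["sender"] in AUTHORIZED_SENDERS

ctx = {
    "sender": "agent-a",
    "purchase_date": (datetime.now(timezone.utc) - timedelta(days=7)).isoformat(),
}
print(truth_filter(ctx))  # True: recent purchase from an authorized sender
```

The key design point is that the check runs at the hand-off boundary, not inside either agent, so a rejected context never reaches Agent B's tools at all.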

3.3 Synchronization of Processes

Google cites “handling long running operations” as a challenge. The Chronos Analyzer (part of the Hoogland Methodology) addresses this by detecting temporal desynchronization. In an A2A protocol, this ensures that Agent B does not act on outdated context from Agent A, a problem that often leads to race conditions in standard asynchronous systems.
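In its simplest form, such a staleness check is just a freshness budget on the context timestamp. The threshold and function below are illustrative, not part of any published Chronos Analyzer API:

```python
import time

MAX_CONTEXT_AGE_S = 60.0  # assumed freshness budget for hand-off context

def is_fresh(context_ts: float, now: float = None) -> bool:
    """Reject context older than the freshness budget so the receiving
    agent never acts on stale state from the sender."""
    now = time.time() if now is None else now
    return (now - context_ts) <= MAX_CONTEXT_AGE_S

t0 = 1_000_000.0
print(is_fresh(t0, now=t0 + 30))   # True: 30s old, within budget
print(is_fresh(t0, now=t0 + 120))  # False: 120s old, stale
```

In a long-running operation, the sender would re-emit context (or the receiver would re-fetch it) whenever this check fails, instead of acting on the stale snapshot.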

4. Potential Innovations for the Agent2Agent (A2A) Protocol

By integrating Context Engineering into the A2A paradigm, we unlock new possibilities for Agentic AI:

Verifiable Trajectories:

Google is looking for ways to measure end-to-end performance. The Hoogland Methodology proposes to use the Compliance Auditor as a “Black Box” recorder that not only logs the output, but validates the integrity of each hand-off. This makes the processes auditable for compliance (EU AI Act).

Semantic Silo Breaking:

Google says A2A “breaks down silos.” We add that Context Engineering breaks through semantic silos. By forcing agents to communicate via a standardized Integrity Protocol (part of Artificial Wisdom), we eliminate miscommunication between a CRM agent and a Support agent.

Prevention of cascade failure:

In the current model, an error in Agent A leads to failure in Agent B. With the Hoogland Methodology, the Truth Filter stops the chain as soon as Agent A provides a corrupt context. Agent B is protected from performing senseless or harmful actions.

5. Conclusion

The analysis by Venkatesh and Bu correctly confirms that isolated evaluation of agents fails in complex ecosystems. However, mere measurement of “trajectories” is insufficient without a mechanism for validation.

The Hoogland Methodology offers the missing link: Artificial Wisdom. By expanding the cognitive ability of the AI with an Integrity Supervisor, we are transforming the Multi-Agent System from a fragile chain of assumptions into a robust network of verified context. This validates Context Engineering not only as a theoretical concept, but as an operational necessity for scaling Google’s Agent2Agent protocol.

Sources quoted:

[1] Venkatesh, C., & Bu, J. (Google). “Agent2Agent (A2A) protocol provides a paradigm that simplifies agent discovery… breaks down silos.”

[2] Venkatesh, C., & Bu, J. (Google). “We often assess agents on a standalone basis, but is that truly the right approach when they are designed to collaborate?”

[3] Venkatesh, C., & Bu, J. (Google). “Current assessments of AI agent trajectories… often overlook their capacity to interact with other agents.”

[4] Venkatesh, C., & Bu, J. (Google). “A multi-agent system consists of a chain of agent and tool calls.”

[5] Venkatesh, C., & Bu, J. (Google). “Assume a case where a customer reaches out with a complaint about a faulty product…”

[6] Venkatesh, C., & Bu, J. (Google). “Action (Scenario 1: Refund): Uses the Customer Refund Tool to process the refund.”

[7] Venkatesh, C., & Bu, J. (Google). “Agent B’s success is moot from an end-to-end perspective… Evaluating only Agent A would show incomplete resolution.”

[8] Venkatesh, C., & Bu, J. (Google). “In a multi-agent system, assessing agents in isolation doesn’t provide a good estimate of the entire system’s end-to-end performance.”

[9] Venkatesh, C., & Bu, J. (Google). “It is evident that trajectory evaluation across agents is imperative… enable an agent to declare its capability to emit structured data specifically for evaluation purposes.”