Building Computer Use Agents
Effective Evaluations and an Optimized Agentic Harness
Blog Authors:
John DeMartino - Google Cloud Consulting AI Incubation Engineer
Chris Schmidt - Google Cloud Consulting AI Incubation Lead
There is a distinct “aha” moment when you first watch Gemini’s computer use tool successfully navigate a complex user interface. Seeing a model visually parse a screen, reason about its goal, and autonomously execute mouse clicks and keystrokes feels like a massive leap forward.
But as any developer knows, building a compelling prototype is only the first step. To move from a dazzling demo to a production-ready system capable of handling high volumes of tasks reliably, we have to be thoughtful about how we build and evaluate these systems.
Computer use agents offer incredible benefits over traditional Robotic Process Automation (RPA). Instead of relying on brittle, hard-coded DOM selectors that break whenever a button changes color or a layout shifts, models like Gemini 3 Flash with built-in computer use tools adapt dynamically to visual cues. They operate on the fundamental “Observe, Think, Act” loop by taking a screenshot, reasoning about the next step, and executing an action.
However, to successfully scale these agents on Google Cloud, developers must cross what we call the “evaluation chasm” by implementing a robust evaluation strategy. Doing so builds trust in the agent’s reliability and unlocks the ability to aggressively optimize the agentic harness for better accuracy, lower costs, and reduced task execution time.
Let’s explore how to cross the evaluation chasm and build a high-performance harness.
Evaluating Computer Use Agents
The evaluation chasm represents the formidable gap between a successful early prototype and a production-ready enterprise deployment. This challenge is fundamentally different from traditional software QA testing. In traditional QA, tests rely on deterministic inputs, static assertions, and highly predictable state changes.
In contrast, evaluating computer use agents forces developers to grapple with the non-deterministic nature of AI models. You aren’t just checking a boolean output; you are dealing with multi-turn complexity and assessing subjective reasoning paths to determine whether the model made the best choice, not just a valid one. Furthermore, you are testing in dynamic user interface environments where visual contexts, pop-ups, and layouts constantly shift. Without a reliable way to scale evaluations, it becomes difficult to confidently improve the agent’s performance and build trust for enterprise applications.
Measurement is the absolute foundation of trust. You cannot confidently deploy what you cannot reliably measure, and stakeholders need verifiable data to buy into task automation.
The Necessity of Repeated Runs and Grading the Path
A single successful run is insufficient. Because modern AI models are probabilistic, you must run repeated evaluations of the same computer use task to confirm consistency and ensure the agent isn’t just getting lucky.
Crucially, you must evaluate how the agent completed the task, not just whether it reached the goal. Computer use agents run multiple iterations of the “Observe, Think, Act” loop to complete a task. Evaluating the path an agent takes provides the insights necessary to “hill climb”, that is, to improve the agent incrementally against a measured baseline. An agent might eventually reach the correct goal, but if it took unnecessary or inefficient steps along the way, a naive binary pass/fail evaluation would miss that entirely.
When establishing your baseline, two core quantitative metrics are paramount: task accuracy (did it achieve the exact goal?) and the total number of steps taken. You will also want to compare the qualitative summary of the agent’s path against the desired or most efficient path.
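As a minimal sketch of how repeated runs of one task could be rolled up into these two baseline metrics, consider the following. The `RunResult` record and the `summarize_runs` helper are hypothetical names chosen for illustration; they are not part of the evaluation tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """Outcome of one evaluation run (hypothetical record shape)."""
    success: bool  # did the run reach the exact goal?
    steps: int     # number of Observe, Think, Act iterations used


def summarize_runs(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate repeated runs of the same task into baseline metrics."""
    if not runs:
        raise ValueError("need at least one run")
    successes = [r for r in runs if r.success]
    return {
        "task_accuracy": len(successes) / len(runs),
        # Average steps over successful runs only, so that failed runs
        # do not distort the efficiency figure.
        "mean_steps_on_success": (
            mean(r.steps for r in successes) if successes else float("nan")
        ),
        "max_steps": float(max(r.steps for r in runs)),
    }


# Four repeated runs of the same task.
runs = [
    RunResult(True, 12),
    RunResult(True, 18),
    RunResult(False, 40),
    RunResult(True, 12),
]
summary = summarize_runs(runs)
print(summary)
```

Reporting step counts for successful runs separately is one possible design choice: it keeps a run that burned forty steps before failing from being averaged into the efficiency number.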
The Three-Judge Evaluation System
We have developed and open-sourced an evaluation tool for computer use agents to help developers more easily assess their agents for production-grade enterprise readiness. Our tool provides a consistent virtual environment for testing and automatically summarizes evaluation results using a variety of data sources.
To get a comprehensive, 360-degree view of performance, we implemented a decoupled, three-judge evaluation system:
- Assertion Judge: This handles deterministic checks for system invariants. Did the agent end up on the exact required URL? Did the database state update correctly? This is your absolute ground truth for success.
- Trace Judge: This is a logic audit of the execution logs. By having Gemini review the agent’s action history, you can catch inefficient pathways or identify times the agent stumbled into success by accident.
- Video Judge: This is a multimodal UX check that uses Gemini to watch a compressed video (or frame sequence) of the session. This judge verifies that the goal was reached efficiently, safely, and without erratic visual behavior.
A prompt for the Video Judge might look like this:
<instruction_block>
<persona>
You are a senior UX Researcher and QA Engineer. You are watching multiple videos from the SAME browser session. Each video represents a different tab or window opened during the task.
</persona>
<context>
<goal>{task_goal}</goal>
<criteria>{criteria}</criteria>
</context>
<workflow>
1. View ALL provided video parts as a single continuous story.
2. Actions in one tab (e.g., clicking a link) may cause navigation in another tab.
3. Evaluate the TOTAL success of the agent based on the combination of all videos.
</workflow>
<output_format>
Assess SUCCESS (0.0-1.0), EFFICIENCY, and provide a single REASONING block.
</output_format>
</instruction_block>
Developers can use this evaluation tool to quickly establish a baseline from which to hill climb. You can then iteratively tweak system instructions, adjust screenshot resolutions, or change prompt structures, and see whether each change improves the speed, cost, and accuracy of completed tasks.
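One way the three judges' verdicts might be combined into a single label is sketched below. This is an illustration under stated assumptions, not the tool's actual logic: the `JudgeVerdicts` shape, the `grade` function, the label names, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdicts:
    """Hypothetical container for the three judges' outputs."""
    assertion_passed: bool  # deterministic checks (URL, database state)
    trace_score: float      # 0.0-1.0 from the Trace Judge
    video_score: float      # 0.0-1.0 from the Video Judge


def grade(v: JudgeVerdicts, min_quality: float = 0.7) -> str:
    """Combine the verdicts into one label.

    The Assertion Judge is treated as ground truth: a failed assertion is
    always a failure, whatever the model-based judges report.
    """
    if not v.assertion_passed:
        return "fail"
    if min(v.trace_score, v.video_score) < min_quality:
        # Goal reached, but the path or on-screen behavior looked poor.
        return "pass_with_concerns"
    return "pass"


print(grade(JudgeVerdicts(True, 0.9, 0.95)))
print(grade(JudgeVerdicts(True, 0.4, 0.95)))
print(grade(JudgeVerdicts(False, 1.0, 1.0)))
```

Keeping the deterministic check as a hard gate means a persuasive but wrong model judgment can never turn a failed run into a pass, while the middle label surfaces runs that stumbled into success.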
Building a High-Performance Agentic Harness
With an evaluation framework in place, you can confidently optimize the agentic harness. The harness is responsible for managing state, passing context to the model, and executing the model’s desired actions.
We have identified three critical strategies for optimizing a computer use agentic harness: context compaction, system instructions for action batching, and reflective supervision of the computer use tool’s actions. All three of these strategies are easily configurable in our open-source evaluation tool.
Tackling Context Rot with Compaction Strategies
Context rot is the decline in model performance as input context grows: the more data is added, the more the model loses focus and the less accurate its answers become. This matters especially for computer use agents because their tasks are long-running and multi-step and rely heavily on high-resolution images as context. In these workflows, passing a high-resolution screenshot back to the model on every turn makes the multimodal context grow rapidly, which not only risks degrading reasoning quality but also increases latency and drives up token costs.
For proper context management, we recommend the following strategies:
- Screenshot Scrubbing & Grayscale Fading: To help the model focus on the environment’s current state while staying aware of its own recent actions, we preserve a rolling window of the three most recent screenshots. We pass only the latest image in full color; the screenshots from the previous two steps are downscaled and converted to grayscale. This provides peripheral vision of the immediate past while slashing vision token costs by 70%.
- History Compaction: We also summarize the majority of the model’s past output. We take care to preserve, verbatim, the initial prompt containing the task instructions and the model output from the most recent steps. For example, if an agent has already executed fifty turns of observing, thinking, and acting in a user interface environment, the compaction strategy keeps the full detail of the initial prompt and of the model’s thinking and function calls from the most recent steps, while summarizing everything in between. This prevents the context window from exploding while ensuring the agent never forgets its primary purpose or its immediate surroundings.
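The two strategies above can be sketched roughly as follows. This is a simplified illustration that only plans what to keep; it does not process real images or call a model. The function names, the treatment labels, and the `summarize` callable (which in a real harness would presumably be a model call) are assumptions made for this example.

```python
from typing import Callable


def plan_screenshots(num_steps: int, window: int = 3) -> list[str]:
    """Decide how each step's screenshot is sent, oldest step first."""
    plan = []
    for i in range(num_steps):
        age = num_steps - 1 - i  # 0 means the latest screenshot
        if age == 0:
            plan.append("full_color")
        elif age < window:
            plan.append("gray_downscaled")
        else:
            plan.append("drop")
    return plan


def compact_history(
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the initial prompt and the last `keep_recent` turns verbatim,
    and replace everything in between with a single summary."""
    if len(turns) <= 1 + keep_recent:
        return list(turns)  # nothing in the middle to summarize
    cut = len(turns) - keep_recent
    return [turns[0], summarize(turns[1:cut]), *turns[cut:]]


print(plan_screenshots(5))

# An initial prompt followed by fifty steps, compacted to five entries.
turns = ["TASK: file the expense report"] + [f"step {i}" for i in range(1, 51)]
compacted = compact_history(
    turns,
    keep_recent=3,
    summarize=lambda middle: f"[summary of {len(middle)} earlier steps]",
)
print(compacted)
```

With fifty steps and `keep_recent=3`, the compacted history holds the task prompt, one summary covering steps 1 through 47, and steps 48 through 50 in full.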
Minimizing Execution Time with Action Batching
Every extra iteration of the “Observe, Think, Act” loop adds to a task’s total execution time. Often, the model can be instructed to take multiple actions within a single iteration without hurting accuracy. This works in scenarios where a single action or set of actions does not significantly alter the user interface.
For example, if a task involves completing many fields in a form, it may not be necessary to execute the full loop for each field’s action. In this scenario, little benefit comes from asking the model to reevaluate the user interface before the next action.
We recommend using your system instructions to enforce action batching. Prompt the model to generate multiple function calls in a single turn or loop whenever possible. By instructing the model to, for example, fill all visible text fields before requesting a new screenshot, you drastically reduce the total number of loops, minimizing cost and overall task execution time.
<instructions>
BATCHING & LATENCY (CRITICAL)
Execute ALL visible field interactions in a single turn.
Do not wait for screenshot updates between fields if both targets are visible.
Break batch ONLY on full page load, modal popup, or critical UI delta.
Minimize turns! Group 5-20 actions if UI state allows.
</instructions>
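On the harness side, batching means executing every function call the model returned in a turn and capturing only one screenshot afterwards. The sketch below illustrates that idea under assumptions: the action dictionary shape, the `BATCH_BREAKERS` set, and the `run_batch` helper are hypothetical, and the executor and screenshot functions are stubs standing in for real browser control.

```python
from typing import Callable

Action = dict  # assumed shape: {"name": "type_text", "args": {...}}

# Hypothetical action names that change the page enough to make the rest
# of a batch unsafe to run without a fresh screenshot.
BATCH_BREAKERS = {"navigate", "submit_form"}


def run_batch(
    actions: list[Action],
    execute: Callable[[Action], None],
    take_screenshot: Callable[[], bytes],
) -> tuple[int, bytes]:
    """Execute queued actions, then capture one screenshot for the batch.

    Stops early after any action listed in BATCH_BREAKERS. Returns the
    number of actions executed and the single screenshot.
    """
    executed = 0
    for action in actions:
        execute(action)
        executed += 1
        if action["name"] in BATCH_BREAKERS:
            break
    return executed, take_screenshot()


# Stubs that record what happened instead of driving a real browser.
executed_log: list[str] = []
screenshots: list[bytes] = []


def fake_execute(action: Action) -> None:
    executed_log.append(action["name"])


def fake_screenshot() -> bytes:
    screenshots.append(b"")
    return b""


queued = [
    {"name": "type_text", "args": {"field": "first_name", "text": "Ada"}},
    {"name": "type_text", "args": {"field": "last_name", "text": "Lovelace"}},
    {"name": "navigate", "args": {"url": "https://example.com/next"}},
    {"name": "type_text", "args": {"field": "email", "text": "ada@example.com"}},
]
count, _ = run_batch(queued, fake_execute, fake_screenshot)
print(count, len(screenshots))
```

Here the two field entries and the navigation run in one loop iteration with a single screenshot, and the final action is left for the next turn because the page has changed.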
Overcoming Roadblocks with Reflective Supervision
Even the best models sometimes get stuck. An agent might attempt to interact with an element obscured by a hidden modal, or it might expect a visual state change that hasn’t yet fully rendered. These scenarios can cause the agent to take many more steps than necessary or to fail the task altogether.
We solve this by introducing a self-healing middleware we call reflective supervision. This component of our agentic harness monitors the agent’s actions. When it detects the agent repeating the same action or failing to make progress, it gives the model additional context drawn from the application’s accessibility tree (ARIA). This lets the model “see” UI states that aren’t obvious from the screenshot alone, and the harness generates real-time, meta-cognitive hints (e.g., “The target button is currently disabled”) to get the agent back on track.
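A minimal sketch of the stuck-detection part of such a supervisor might look like the following. It is an illustration only: the `ReflectiveSupervisor` class, the repeat threshold, the action tuples, and the dictionary returned by `lookup_node` (standing in for a real accessibility tree query) are all assumptions, not the harness’s actual interface.

```python
from collections import deque
from typing import Callable, Optional


class ReflectiveSupervisor:
    """Flags an agent that keeps repeating one action and suggests a hint."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.recent: deque = deque(maxlen=max_repeats)

    def observe(
        self,
        action: tuple,
        lookup_node: Callable[[tuple], Optional[dict]],
    ) -> Optional[str]:
        """Record an action; return a hint if the agent looks stuck."""
        self.recent.append(action)
        stuck = (
            len(self.recent) == self.max_repeats
            and len(set(self.recent)) == 1
        )
        if not stuck:
            return None
        self.recent.clear()  # start a fresh window after intervening
        node = lookup_node(action) or {}
        name = node.get("name", "target element")
        if node.get("disabled"):
            return f"The {name} is currently disabled."
        if node.get("hidden"):
            return f"The {name} is hidden or covered by another element."
        return f"Repeating this action on the {name} is not changing the page."


# The agent clicks the same button three times in a row.
supervisor = ReflectiveSupervisor(max_repeats=3)
click = ("click", "submit_button")
lookup = lambda action: {"name": "Submit button", "disabled": True}
hints = [supervisor.observe(click, lookup) for _ in range(3)]
print(hints)
```

The first two identical clicks pass silently; the third triggers a lookup and returns a hint that the harness could append to the model’s next turn.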
Takeaways for the Cloud Developer
Moving Gemini computer use agents from prototype to production requires a robust evaluation framework. Using our open-source evaluation tool will help developers quickly optimize their agents for enterprise use cases.
Remember these core principles:
- Grade the path: Don’t just check the final state; evaluate the efficiency of the path taken.
- Manage the context: Use context compaction to keep multimodal context lean and fast.
- Batch actions for speed: Reduce unnecessary loops by executing multiple actions per loop.
- Coach the agent: Use reflective supervision and ARIA trees to help the model unblock itself.
To get started, dive into the Vertex AI documentation for Gemini computer use and begin building the next generation of resilient, dynamic task automation.
Next Steps
Explore these resources to deepen your understanding:
- Vertex AI Gemini Computer Use Guide
- Intro to Computer Use with Gemini Notebook
- Prompting Best Practices
- Gemini Computer Use Evaluation Tool
Questions or insights to share? Drop a comment below!
