Authors: Olivier Zhang, Shirong Liang
Date: December 2025
Abstract
UI testing has always been a critical component of mobile application quality assurance, yet traditional manual review approaches are inefficient and prone to missing issues. This article introduces an intelligent software testing toolkit (sdet-kit) built on Google Gemini Large Language Models. By integrating multiple GenAI capabilities, this tool enables precise UI element localization, intelligent video anomaly detection, and quantitative rendering performance analysis, significantly improving both the efficiency and accuracy of mobile app testing.
Keywords: Gemini, Computer Use, Mobile App Testing, UI Automation, Video Analysis, Large Language Models
1. Introduction: The challenges of mobile app testing
1.1 Pain points of traditional testing
With the rapid development of mobile internet, user expectations for app experience have grown increasingly demanding. For social e-commerce applications, details such as page loading speed, scrolling smoothness, and content rendering priority directly impact user retention. However, traditional UI testing methods face numerous challenges:
- Low efficiency of manual review: Testers need to watch screen recording videos frame by frame, which is time-consuming and leads to fatigue-induced oversights
- Difficulty in issue localization: UI anomalies often occur at millisecond scale, making accurate timing difficult to capture with the naked eye
- Hard-to-quantify performance metrics: Metrics like LCP (Largest Contentful Paint) and first-screen rendering time lack automated measurement methods
- High cost of batch testing: Multi-device, multi-scenario test combinations require substantial human resources
1.2 Opportunities brought by GenAI
The release of Google's Gemini series models provides entirely new solutions to these problems:
- Multimodal understanding: Gemini 3.0 Pro possesses powerful video comprehension capabilities, able to analyze UI changes in screen recordings
- Pixel-level operations: Gemini Computer Use model can precisely identify and locate UI elements
- Reasoning capabilities: Through carefully designed prompts, the model can understand complex UI anomaly definitions and make judgments
- Concurrent processing: API-based approach naturally supports large-scale parallel testing
2. Project background and core value
2.1 Project positioning
sdet-kit is an intelligent testing toolkit designed for SDETs (Software Development Engineers in Test), specifically built for mobile application UI quality assurance.
2.2 Core capabilities
| Capability | Traditional Approach | This Project's Approach |
|---|---|---|
| UI Element Localization | XPath/ID Selectors | Gemini Computer Use pixel-level positioning |
| Anomaly Detection | Manual video review | AI auto-detection of 7 issue types |
| Performance Measurement | Chrome DevTools | Video frame analysis for automatic measurement |
| Test Reports | Manual writing | Auto-generated CSV/Excel |
3. Technical architecture overview
3.1 System architecture diagram
┌──────────────────────────────────────────────────────────────────┐
│                       User Interface Layer                       │
│  ┌────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐    │
│  │  Image UI  │ │Video Anomaly│ │ Render Perf.│ │ Page Load │    │
│  │ Detection  │ │  Detection  │ │  Detection  │ │ Detection │    │
│  └────────────┘ └─────────────┘ └─────────────┘ └───────────┘    │
├──────────────────────────────────────────────────────────────────┤
│                     Application Logic Layer                      │
│  ┌──────────────┐ ┌───────────────┐ ┌──────────────┐             │
│  │   utils.py   │ │render_detector│ │frame_detector│             │
│  │Core Detection│ │Render Analysis│ │Frame Analysis│             │
│  └──────────────┘ └───────────────┘ └──────────────┘             │
├──────────────────────────────────────────────────────────────────┤
│                        AI Services Layer                         │
│  ┌─────────────────────────┐ ┌─────────────────────────────┐     │
│  │     Gemini 3.0 Pro      │ │     Gemini Computer Use     │     │
│  │ - Video Understanding   │ │ - UI Element Localization   │     │
│  │ - Anomaly Detection     │ │ - Android Device Control    │     │
│  │ - Performance Analysis  │ │ - Pixel-level Coordinates   │     │
│  └─────────────────────────┘ └─────────────────────────────┘     │
├──────────────────────────────────────────────────────────────────┤
│                       Infrastructure Layer                       │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐              │
│  │ GCS Storage  │ │    FFmpeg    │ │     ADB      │              │
│  │ Video Cache  │ │Video Process │ │Device Control│              │
│  └──────────────┘ └──────────────┘ └──────────────┘              │
└──────────────────────────────────────────────────────────────────┘
3.2 Data flow
- Video upload: Users upload screen recording videos through the Web UI
- Preprocessing: FFmpeg downsamples video to 1fps (each frame ≈ 33ms of original time)
- Cloud caching: Video uploaded to GCS to obtain URI
- AI analysis: Call Gemini API for multimodal analysis
- Result parsing: Extract detection results in JSON format
- Visualization: Display detection report in Web UI
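Step 5 (result parsing) has to be defensive: despite instructions, models occasionally wrap their JSON in markdown code fences. A minimal parsing sketch (the helper name is illustrative, not the project's actual function):

```python
import json
import re

def parse_detection_json(model_text: str):
    """Extract a JSON array of detection results from raw model output,
    tolerating ```json ... ``` fences the model may add anyway.
    (Illustrative helper, not the toolkit's actual parser.)"""
    # Strip any markdown code fences
    cleaned = re.sub(r"```(?:json)?", "", model_text).strip()
    # Locate the outermost JSON array and parse it
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        return []  # no detections found in the output
    return json.loads(cleaned[start : end + 1])
```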
4. Six core functional modules explained
4.1 Precise UI element detection
Feature description
Based on the Gemini Computer Use model, this module achieves pixel-level localization of UI elements in app screenshots. Users simply describe the target element in natural language, and the model returns precise bounding box coordinates.
Core code
```python
from google import genai
from google.genai import types
from PIL import Image

def detect_ui(image_path, ui_element, model):
    client = genai.Client()
    img = Image.open(image_path)
    width, height = img.size

    system_instructions = """
    Return bounding boxes as a JSON array with labels.
    Never return masks or code fencing. Limit to 25 objects.
    """
    prompt = f"Detect the '{ui_element}' in the image. Return bounding boxes."

    response = client.models.generate_content(
        model=model,  # gemini-2.5-computer-use-preview
        contents=[prompt, img],
        config=types.GenerateContentConfig(
            system_instruction=system_instructions,
            temperature=0.5,
        ),
    )
    # Parse returned coordinates and convert to absolute pixel values
    # ...
```
Technical highlights
- Normalized coordinates: Model returns normalized coordinates in 0-1000 range, facilitating adaptation to different resolutions
- Multi-object detection: Single call can detect multiple similar elements, automatically named by characteristics
- Bounding box visualization: Automatically draws red bounding boxes and green center points on original images
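The normalized-coordinate convention can be made concrete with a small conversion helper (the function name is illustrative; drawing the red box and green center point is left to PIL and omitted here):

```python
def box_to_pixels(box_2d, width, height):
    """Convert a [y_min, x_min, y_max, x_max] bounding box in the model's
    0-1000 normalized space to absolute pixel coordinates plus the center
    point used for the green marker. (Illustrative helper.)"""
    y_min, x_min, y_max, x_max = box_2d
    abs_box = (
        int(x_min / 1000 * width),   # left
        int(y_min / 1000 * height),  # top
        int(x_max / 1000 * width),   # right
        int(y_max / 1000 * height),  # bottom
    )
    center = ((abs_box[0] + abs_box[2]) // 2, (abs_box[1] + abs_box[3]) // 2)
    return abs_box, center
```

Because the coordinates are normalized, the same model output adapts to any screen resolution: only `width` and `height` change.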
4.2 Intelligent video anomaly detection
Feature description
Analyzes app screen recording videos to automatically identify 7 common UI anomaly issues:
| Anomaly Type | Definition | Detection Criteria |
|---|---|---|
| Bottom Loading Delay | Slow content loading when scrolling to bottom | Loading time > 1 second |
| Page Stuttering | Non-smooth display during scrolling/switching | Obvious pauses or frame drops |
| Element Flickering | UI elements unexpectedly flash or disappear | Distinguish normal loading from abnormal flickering |
| Interface Jitter | Unexpected position shifts of page elements | Layout jumping, position mutations |
| Gesture Response Issues | Swipe gestures fail or require multiple attempts | Response doesn't match expectations |
| App Crash | Application unexpectedly exits to system desktop | Distinguish normal return from abnormal exit |
| Timing Anomaly | Unreasonable loading order of page elements | Important content loads after secondary content |
Parallel mode uses ThreadPoolExecutor to initiate 7 API calls simultaneously, each using specialized prompts optimized for specific issues.
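The fan-out can be sketched as follows (prompt texts are abbreviated, and `call_model` stands in for the actual Gemini API call, which is injected here to keep the sketch self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

# One specialized prompt per anomaly type (texts abbreviated for illustration)
ANOMALY_PROMPTS = {
    "bottom_loading_delay": "Detect loading delays when scrolling to the bottom...",
    "page_stuttering": "Detect non-smooth display during scrolling or switching...",
    "element_flickering": "Detect UI elements that unexpectedly flash or disappear...",
    "interface_jitter": "Detect unexpected position shifts of page elements...",
    "gesture_response": "Detect swipe gestures that fail or need multiple attempts...",
    "app_crash": "Detect unexpected exits to the system desktop...",
    "timing_anomaly": "Detect unreasonable loading order of page elements...",
}

def detect_all_anomalies(video_uri, call_model):
    """Launch all 7 specialized detections in parallel and collect results.
    `call_model(video_uri, prompt)` abstracts the Gemini API call."""
    with ThreadPoolExecutor(max_workers=len(ANOMALY_PROMPTS)) as pool:
        futures = {
            name: pool.submit(call_model, video_uri, prompt)
            for name, prompt in ANOMALY_PROMPTS.items()
        }
        # .result() blocks until each call finishes
        return {name: future.result() for name, future in futures.items()}
```

Because each call is independent, wall-clock time is roughly that of the slowest single detection rather than the sum of all seven.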
4.3 Render performance detection
Feature description
Specifically detects page rendering performance after Pull-to-Refresh operations, with core metrics:
- First Slot Render Time: Time for the first feed card in the top-left corner to fully render
- Render Priority: Whether the first slot renders before other areas
Prompt design highlights
```python
SYSTEM_PROMPT = """
You are a professional page performance analyzer specializing in
detecting page rendering performance after pull-to-refresh operations.

Key Definitions:
1. Pull-to-refresh: User gesture dragging the page **downward** from
   the top edge, triggering content reload
2. Top-left quadrant: Upper-left quarter of the feed viewport
3. Rendering complete: Content fully rendered with no blur, placeholders,
   skeleton screens, or loading states

Critical Notes:
- Focus exclusively on pull-to-refresh gestures, **ignore other page loads**
- Video has been processed to 1fps, each frame represents 1 second
"""
```
Key points:
- Clearly defines identification criteria for "pull-to-refresh", excluding other types of page loads
- Uses bold markers (**) to emphasize key constraints
- Explains video preprocessing parameters to help the model understand the time scale
4.4 Page loading performance detection
Feature description
Simultaneously detects two key performance metrics:
- LCP (Largest Contentful Paint): Rendering completion time of the largest content element
- Full content loading: Time when all main content is stably displayed
Video preprocessing
```python
import subprocess

def preprocess_video_to_fps1(video_path):
    temp_30fps_path = video_path.replace(".mp4", "_30fps.mp4")
    temp_1fps_path = video_path.replace(".mp4", "_1fps.mp4")
    # Step 1: Standardize to 30fps
    subprocess.run(["ffmpeg", "-i", video_path, "-r", "30", "-y", temp_30fps_path], check=True)
    # Step 2: Slow down 30x, then sample at 1fps (each frame represents 33ms of original time)
    subprocess.run(["ffmpeg", "-i", temp_30fps_path, "-filter:v", "setpts=30*PTS",
                    "-r", "1", "-y", temp_1fps_path], check=True)
    return temp_1fps_path
```
This preprocessing approach preserves millisecond-level time precision while compressing video to API-acceptable sizes.
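Under this scheme, a timestamp the model reports for the slowed-down 1fps video maps directly back to original time: each second of processed video holds one frame, and each frame covers 1/30 s (≈33ms) of the original recording. A small conversion sketch (the helper name is illustrative):

```python
def timestamp_to_original_ms(mmss: str, source_fps: int = 30) -> float:
    """Map an MM:SS timestamp in the slowed-down 1fps video back to
    original milliseconds: second N of processed video shows frame N,
    and each frame covers 1/source_fps seconds of original footage.
    (Illustrative helper based on the preprocessing described above.)"""
    minutes, seconds = mmss.split(":")
    frame_index = int(minutes) * 60 + int(seconds)
    return frame_index * 1000.0 / source_fps
```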
4.5 Android automation control
Feature description
An intelligent agent based on Gemini Computer Use that can operate Android devices like a human to complete testing tasks.
Architecture design
```python
from google import genai
from google.genai import types

class AndroidAgent:
    def __init__(self):
        self.client = genai.Client()
        self.tools = AndroidTools()  # ADB operation wrapper
        self.config = types.GenerateContentConfig(
            tools=[
                types.Tool(
                    computer_use=types.ComputerUse(
                        environment=types.Environment.ENVIRONMENT_UNSPECIFIED,
                        excluded_predefined_functions=[
                            "open_web_browser", "navigate", ...
                        ],
                    )
                ),
                types.Tool(
                    function_declarations=[
                        types.FunctionDeclaration(
                            name="go_home",
                            description="Return to home screen",
                            parameters={"type": "object", "properties": {}},
                        ),
                        # ...
                    ]
                ),
            ],
            system_instruction="""You are operating an Android phone...""",
        )
```
ADB tool wrapper
```python
import subprocess

class AndroidTools:
    def tap(self, x: int, y: int):
        # Convert normalized (0-1000) coordinates to actual pixels
        actual_x = int(x / 1000 * self.screen_width)
        actual_y = int(y / 1000 * self.screen_height)
        subprocess.run(["adb", "shell", "input", "tap", str(actual_x), str(actual_y)])

    def swipe(self, x1, y1, x2, y2, duration=300):
        # Swipe operation, supports custom duration
        ...

    def screenshot(self) -> bytes:
        result = subprocess.run(["adb", "exec-out", "screencap", "-p"], capture_output=True)
        return result.stdout
```
Agent execution loop:
- Capture current screen screenshot
- Send screenshot and task description to Gemini
- Model returns operations to execute (e.g., click_at, scroll_at)
- Execute operation and capture new screenshot
- Feed execution results back to model, continue loop until task completion
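The loop above can be sketched with the model, screenshot capture, and action execution injected as callables; this keeps the control flow visible without depending on the Gemini SDK or a live device (the callable signatures and the `task_complete` sentinel are illustrative assumptions):

```python
def run_agent_loop(task, get_screenshot, ask_model, execute, max_steps=20):
    """Minimal observe-act loop sketch: screenshot -> model -> action ->
    feedback, repeated until the model signals completion or the step
    budget runs out. All four callables are injected for testability."""
    history = []
    for _ in range(max_steps):
        screenshot = get_screenshot()                 # 1. capture current screen
        action = ask_model(task, screenshot, history) # 2-3. model picks an action
        if action["name"] == "task_complete":         # illustrative stop signal
            return history
        result = execute(action)                      # 4. run it on the device
        history.append((action, result))              # 5. feed results back
    return history
```

The `max_steps` cap is a practical safety net: without it, a confused model could loop on the same screen indefinitely.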
5. Deep dive into GenAI technologies
5.1 Gemini 3.0 Pro's video understanding capabilities
Gemini 3.0 Pro is currently one of the most powerful multimodal large models for video understanding. Its applications in this project:
| Capability | Application Scenario | Technical Implementation |
|---|---|---|
| Long Video Processing | Analyze 10+ second app recordings | GCS URI + VideoMetadata(fps) |
| Temporal Understanding | Detect anomalies at specific moments | Model outputs timestamps in MM:SS format |
| Fine-grained Analysis | Identify subtle UI element changes | 1fps preprocessing preserves millisecond precision |
| Multi-event Tracking | Detect multiple pull-to-refresh operations | JSON array format output |
5.2 Gemini Computer Use's pixel-level operations
The Computer Use model is specifically designed for GUI operations, with core characteristics:
```python
# Model returns normalized coordinates (0-1000)
response = {
    "box_2d": [y_min, x_min, y_max, x_max],  # Note: y comes before x
    "label": "Comment button",
}

# Convert to absolute pixel coordinates
abs_x = int(box[1] / 1000 * width)
abs_y = int(box[0] / 1000 * height)
```
Function calling capability: Model can directly call predefined tool functions
```python
types.Tool(
    computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_UNSPECIFIED
    )
)
# Model will return function calls like click_at(x=450, y=800)
```
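On the client side, each returned function call has to be routed to the matching ADB wrapper. A minimal dispatcher sketch (the object with `.name`/`.args` mimics the SDK's function-call shape, and the action-to-method mapping is an illustrative assumption, not the project's actual loop):

```python
def dispatch_function_call(call, tools):
    """Route a model-returned function call, e.g. click_at(x=450, y=800),
    to the matching AndroidTools method. `call` is any object exposing
    .name and .args; the mapping below is illustrative."""
    handlers = {
        "click_at": lambda args: tools.tap(args["x"], args["y"]),
        "scroll_at": lambda args: tools.swipe(
            args["x"], args["y"], args["x"], args["y"] - args.get("distance", 300)
        ),
        "go_home": lambda args: tools.go_home(),
    }
    handler = handlers.get(call.name)
    if handler is None:
        raise ValueError(f"Unsupported function call: {call.name}")
    return handler(call.args)
```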
5.3 Prompt engineering best practices
Principle 1: Clear role definition
You are a professional mobile application UI testing expert, specializing in detecting UI issues in apps.
Principle 2: Structured problem definition
1. **Bottom Loading Delay**
- Definition: When user scrolls to page bottom, new content loading shows obvious delay
- Manifestation: Loading indicator appears at bottom (e.g., spinner, "Loading" text)
- Focus: Loading duration exceeding 1 second is considered problematic
Principle 3: Strict output format constraints
Output Format (JSON array):
```
[
  {
    "top_left_complete_timestamp": "xx:xx",
    "is_first_loaded": true/false
  }
]
```
Principle 4: Boundary condition specification
Note: If no pull-to-refresh operations detected in video, return empty array []
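Constraining the output format pays off downstream: each entry can be checked mechanically before it enters a report. A validation sketch for the schema shown under Principle 3 (the helper name and exact checks are illustrative):

```python
import re

def validate_result(entry: dict) -> bool:
    """Check one detection entry against the prompt's output contract:
    an 'xx:xx' timestamp string and a boolean flag. (Illustrative check.)"""
    ts = entry.get("top_left_complete_timestamp", "")
    return (
        bool(re.fullmatch(r"\d{2}:\d{2}", ts))
        and isinstance(entry.get("is_first_loaded"), bool)
    )
```

Per the boundary condition above, an empty array is itself a valid response and simply yields nothing to validate.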
6. Conclusion and future outlook
6.1 Project value summary
This project successfully validates the enormous potential of GenAI in the software testing domain:
- Efficiency improvement: Automated detection replaces manual review, saving 80%+ of testing time
- More comprehensive coverage: Simultaneously detects 7 types of issues, avoiding human oversight
- Quantifiable results: Provides performance data accurate to milliseconds
- Strong extensibility: Support for new detection scenarios through prompt modification
6.2 Future outlook
- More anomaly types: Extend detection to white screen, black screen, ANR issues
- Real-time detection: Integration with CI/CD pipeline for automatic testing on each release
- Cross-platform support: Extend to iOS devices and web applications
- Baseline comparison: Establish performance baselines, automatically detect performance regressions
- Intelligent root cause analysis: Not only detect issues but also analyze possible causes