Intelligent UI testing with Gemini: Redefining mobile app quality assurance with GenAI

Authors:
Olivier Zhang
Shirong Liang

Date: December 2025


Abstract

UI testing has always been a critical component of mobile application quality assurance, yet traditional manual review approaches are inefficient and prone to missing issues. This article introduces an intelligent software testing toolkit (sdet-kit) built on Google Gemini Large Language Models. By integrating multiple GenAI capabilities, this tool enables precise UI element localization, intelligent video anomaly detection, and quantitative rendering performance analysis, significantly improving both the efficiency and accuracy of mobile app testing.

Keywords: Gemini, Computer Use, Mobile App Testing, UI Automation, Video Analysis, Large Language Models


1. Introduction: The challenges of mobile app testing

1.1 Pain points of traditional testing

With the rapid development of mobile internet, user expectations for app experience have grown increasingly demanding. For social e-commerce applications, details such as page loading speed, scrolling smoothness, and content rendering priority directly impact user retention. However, traditional UI testing methods face numerous challenges:

  • Low efficiency of manual review: Testers need to watch screen recording videos frame by frame, which is time-consuming and leads to fatigue-induced oversights
  • Difficulty in issue localization: UI anomalies often occur at millisecond scale, making accurate timing difficult to capture with the naked eye
  • Hard-to-quantify performance metrics: Metrics like LCP (Largest Contentful Paint) and first-screen rendering time lack automated measurement methods
  • High cost of batch testing: Multi-device, multi-scenario test combinations require substantial human resources

1.2 Opportunities brought by GenAI

The release of Google’s Gemini series models provides entirely new solutions to these problems:

  • Multimodal understanding: Gemini 3.0 Pro possesses powerful video comprehension capabilities, able to analyze UI changes in screen recordings
  • Pixel-level operations: Gemini Computer Use model can precisely identify and locate UI elements
  • Reasoning capabilities: Through carefully designed prompts, the model can understand complex UI anomaly definitions and make judgments
  • Concurrent processing: API-based approach naturally supports large-scale parallel testing

2. Project background and core value

2.1 Project positioning

sdet-kit is an intelligent testing toolkit designed for SDETs (Software Development Engineers in Test), specifically built for mobile application UI quality assurance.

2.2 Core capabilities

| Capability | Traditional Approach | This Project’s Approach |
|---|---|---|
| UI Element Localization | XPath/ID selectors | Gemini Computer Use pixel-level positioning |
| Anomaly Detection | Manual video review | AI auto-detection of 7 issue types |
| Performance Measurement | Chrome DevTools | Automatic measurement via video frame analysis |
| Test Reports | Manual writing | Auto-generated CSV/Excel |

3. Technical architecture overview

3.1 System architecture diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      User Interface Layer                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚Image UI Det. β”‚ β”‚Video Anomaly β”‚ β”‚Render Perf.  β”‚ β”‚Page Load β”‚β”‚
β”‚  β”‚  Detection   β”‚ β”‚  Detection   β”‚ β”‚  Detection   β”‚ β”‚Detection β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                     Application Logic Layer                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚   utils.py   β”‚ β”‚render_detectorβ”‚ β”‚frame_detectorβ”‚            β”‚
β”‚  β”‚Core Detectionβ”‚ β”‚Render Analysisβ”‚ β”‚Frame Analysisβ”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                       AI Services Layer                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚    Gemini 3.0 Pro       β”‚ β”‚   Gemini Computer Use           β”‚β”‚
β”‚  β”‚  - Video Understanding  β”‚ β”‚  - UI Element Localization      β”‚β”‚
β”‚  β”‚  - Anomaly Detection    β”‚ β”‚  - Android Device Control       β”‚β”‚
β”‚  β”‚  - Performance Analysis β”‚ β”‚  - Pixel-level Coordinates      β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                     Infrastructure Layer                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚  β”‚  GCS Storage β”‚ β”‚    FFmpeg    β”‚ β”‚     ADB      β”‚             β”‚
β”‚  β”‚ Video Cache  β”‚ β”‚Video Process β”‚ β”‚Device Controlβ”‚             β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3.2 Data flow

  1. Video upload: Users upload screen recording videos through the Web UI
  2. Preprocessing: FFmpeg downsamples video to 1fps (each frame β‰ˆ 33ms of original time)
  3. Cloud caching: Video uploaded to GCS to obtain URI
  4. AI analysis: Call Gemini API for multimodal analysis
  5. Result parsing: Extract detection results in JSON format
  6. Visualization: Display detection report in Web UI
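The data flow above can be sketched as a thin orchestration function. The three step implementations (`preprocess`, `upload_to_gcs`, `analyze`) are injected as callables here and are hypothetical names, not the project's actual API; they stand in for the FFmpeg, GCS, and Gemini steps:

```python
import json

def run_detection_pipeline(video_path, preprocess, upload_to_gcs, analyze):
    """Wire the data flow: preprocess -> cloud cache -> AI analysis -> parse.

    The three callables are placeholders for the real steps (FFmpeg
    downsampling, GCS upload, Gemini API call); injecting them keeps
    the flow testable without network access.
    """
    fps1_path = preprocess(video_path)      # 2. FFmpeg downsampling to 1fps
    gcs_uri = upload_to_gcs(fps1_path)      # 3. cache on GCS, obtain URI
    raw_response = analyze(gcs_uri)         # 4. Gemini multimodal analysis
    return json.loads(raw_response)         # 5. detection results as JSON
```

Keeping the Gemini call behind an injected callable also makes it easy to swap models or stub the analysis step in unit tests.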

4. Five core functional modules explained

4.1 Precise UI element detection

Feature description

Based on the Gemini Computer Use model, this module achieves pixel-level localization of UI elements in app screenshots. Users simply describe the target element in natural language, and the model returns precise bounding box coordinates.

Core code

from google import genai
from google.genai import types
from PIL import Image
import json

def detect_ui(image_path, ui_element, model):
    client = genai.Client()
    img = Image.open(image_path)
    width, height = img.size
    
    system_instructions = """
    Return bounding boxes as a JSON array with labels. 
    Never return masks or code fencing. Limit to 25 objects.
    """
    
    prompt = f"Detect the '{ui_element}' in the image. Return bounding boxes."
    
    response = client.models.generate_content(
        model=model,  # e.g. gemini-2.5-computer-use-preview
        contents=[prompt, img],
        config=types.GenerateContentConfig(
            system_instruction=system_instructions,
            temperature=0.5,
        )
    )
    # Parse normalized [y_min, x_min, y_max, x_max] boxes (0-1000 range)
    # and convert them to absolute pixel values
    boxes = json.loads(response.text)
    for item in boxes:
        y0, x0, y1, x1 = item["box_2d"]
        item["abs_box"] = (int(x0 / 1000 * width), int(y0 / 1000 * height),
                           int(x1 / 1000 * width), int(y1 / 1000 * height))
    return boxes

Technical highlights

  • Normalized coordinates: Model returns normalized coordinates in 0-1000 range, facilitating adaptation to different resolutions
  • Multi-object detection: Single call can detect multiple similar elements, automatically named by characteristics
  • Bounding box visualization: Automatically draws red bounding boxes and green center points on original images
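The normalized-coordinate convention can be illustrated with a small helper (a sketch; `to_abs_box` is our name, not the project's). Given a `box_2d` in the model's `[y_min, x_min, y_max, x_max]` 0-1000 format, it returns the absolute pixel corners plus the center point used for the green marker:

```python
def to_abs_box(box_2d, width, height):
    """Convert a normalized [y_min, x_min, y_max, x_max] box (0-1000)
    to absolute pixel corners (x0, y0, x1, y1) plus the box center."""
    y0, x0, y1, x1 = box_2d
    abs_box = (int(x0 / 1000 * width), int(y0 / 1000 * height),
               int(x1 / 1000 * width), int(y1 / 1000 * height))
    center = ((abs_box[0] + abs_box[2]) // 2, (abs_box[1] + abs_box[3]) // 2)
    return abs_box, center
```

Because the model's coordinates are resolution-independent, the same response works unchanged across screenshots of different sizes.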

4.2 Intelligent video anomaly detection

Feature description

Analyzes app screen recording videos to automatically identify 7 common UI anomaly issues:

| Anomaly Type | Definition | Detection Criteria |
|---|---|---|
| Bottom Loading Delay | Slow content loading when scrolling to bottom | Loading time > 1 second |
| Page Stuttering | Non-smooth display during scrolling/switching | Obvious pauses or frame drops |
| Element Flickering | UI elements unexpectedly flash or disappear | Distinguish normal loading from abnormal flickering |
| Interface Jitter | Unexpected position shifts of page elements | Layout jumping, position mutations |
| Gesture Response Issues | Swipe gestures fail or require multiple attempts | Response doesn’t match expectations |
| App Crash | Application unexpectedly exits to system desktop | Distinguish normal return from abnormal exit |
| Timing Anomaly | Unreasonable loading order of page elements | Important content loads after secondary content |

Parallel mode uses ThreadPoolExecutor to initiate 7 API calls simultaneously, each using specialized prompts optimized for specific issues.
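The parallel mode can be sketched with the standard library. `detect_one` below is a placeholder for the per-issue Gemini call with its specialized prompt; the issue-type names are illustrative labels for the seven categories above:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative labels for the seven anomaly categories
ISSUE_TYPES = [
    "bottom_loading_delay", "page_stuttering", "element_flickering",
    "interface_jitter", "gesture_response", "app_crash", "timing_anomaly",
]

def detect_all_parallel(video_uri, detect_one):
    """Fire one specialized detection request per issue type concurrently.

    `detect_one(video_uri, issue_type)` stands in for a Gemini API call
    using that issue's dedicated prompt.
    """
    with ThreadPoolExecutor(max_workers=len(ISSUE_TYPES)) as pool:
        futures = {issue: pool.submit(detect_one, video_uri, issue)
                   for issue in ISSUE_TYPES}
        return {issue: f.result() for issue, f in futures.items()}
```

Since each request is independent and I/O-bound, threads are sufficient here; the total latency approaches that of the slowest single request rather than the sum of all seven.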

4.3 Render performance detection

Feature description

Specifically detects page rendering performance after Pull-to-Refresh operations, with core metrics:

  • First Slot Render Time: Time for the first feed card in the top-left corner to fully render
  • Render Priority: Whether the first slot renders before other areas

Prompt design highlights

SYSTEM_PROMPT = """
You are a professional page performance analyzer specializing in 
detecting page rendering performance after pull-to-refresh operations.

Key Definitions:
1. Pull-to-refresh: User gesture dragging the page **downward** from 
   the top edge, triggering content reload
2. Top-left quadrant: Upper-left quarter of the feed viewport
3. Rendering complete: Content fully rendered with no blur, placeholders, 
   skeleton screens, or loading states

Critical Notes:
- Focus exclusively on pull-to-refresh gestures, **ignore other page loads**
- Video has been processed to 1fps, each frame represents 1 second
"""

Key points:

  • Clearly defines identification criteria for β€œpull-to-refresh”, excluding other types of page loads
  • Uses bold markers (**) to emphasize key constraints
  • Explains video preprocessing parameters to help model understand time scale

4.4 Page loading performance detection

Feature description

Simultaneously detects two key performance metrics:

  1. LCP (Largest Contentful Paint): Rendering completion time of the largest content element
  2. Full content loading: Time when all main content is stably displayed

Video preprocessing

import os
import subprocess
import tempfile

def preprocess_video_to_fps1(video_path):
    temp_dir = tempfile.mkdtemp()
    temp_30fps_path = os.path.join(temp_dir, "norm_30fps.mp4")
    temp_1fps_path = os.path.join(temp_dir, "slow_1fps.mp4")

    # Step 1: Standardize to 30fps
    subprocess.run(["ffmpeg", "-i", video_path, "-r", "30",
                    "-y", temp_30fps_path], check=True)

    # Step 2: Slow down 30x, then sample at 1fps
    # (each output frame covers ~33ms of the original timeline)
    subprocess.run(["ffmpeg", "-i", temp_30fps_path,
                    "-filter:v", "setpts=30*PTS", "-r", "1",
                    "-y", temp_1fps_path], check=True)

    return temp_1fps_path

This preprocessing approach preserves millisecond-level time precision while compressing video to API-acceptable sizes.
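Because each second of the slowed 1fps video corresponds to one 30fps frame of the original (1000/30 ≈ 33ms), a timestamp the model reports on the processed video can be mapped back to original time. A sketch of that conversion (the helper name is ours, not from the project):

```python
def slowed_ts_to_original_ms(timestamp):
    """Map an 'MM:SS' timestamp on the 1fps slowed video back to
    milliseconds of original time: each slowed second is one 30fps
    frame of the original, i.e. 1000/30 ms."""
    minutes, seconds = map(int, timestamp.split(":"))
    frame_index = minutes * 60 + seconds
    return frame_index * 1000.0 / 30.0
```

So a model-reported LCP at "00:12" on the slowed video corresponds to 400ms of real loading time.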

4.5 Android automation control

Feature description

An intelligent agent based on Gemini Computer Use that can operate Android devices like a human to complete testing tasks.

Architecture design

class AndroidAgent:
    def __init__(self):
        self.client = genai.Client()
        self.tools = AndroidTools()  # ADB operation wrapper
        
        self.config = types.GenerateContentConfig(
            tools=[
                types.Tool(
                    computer_use=types.ComputerUse(
                        environment=types.Environment.ENVIRONMENT_UNSPECIFIED,
                        excluded_predefined_functions=[
                            "open_web_browser", "navigate", ...
                        ]
                    )
                ),
                types.Tool(
                    function_declarations=[
                        types.FunctionDeclaration(
                            name="go_home",
                            description="Return to home screen",
                            parameters={"type": "object", "properties": {}}
                        ),
                        # ...
                    ]
                )
            ],
            system_instruction="""You are operating an Android phone..."""
        )

ADB tool wrapper

class AndroidTools:
    def tap(self, x: int, y: int):
        # Convert normalized coordinates to actual pixels
        actual_x = int(x / 1000 * self.screen_width)
        actual_y = int(y / 1000 * self.screen_height)
        subprocess.run(["adb", "shell", "input", "tap", str(actual_x), str(actual_y)])
    
    def swipe(self, x1, y1, x2, y2, duration=300):
        # Swipe operation, supports custom duration
        ...
    
    def screenshot(self) -> bytes:
        result = subprocess.run(["adb", "exec-out", "screencap", "-p"], capture_output=True)
        return result.stdout

Agent execution loop:

  1. Capture current screen screenshot
  2. Send screenshot and task description to Gemini
  3. Model returns operations to execute (e.g., click_at, scroll_at)
  4. Execute operation and capture new screenshot
  5. Feed execution results back to model, continue loop until task completion
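The loop above can be sketched as follows. `model_step` and `dispatch` are stand-ins for the real Gemini call and the `AndroidTools` methods; the history-threading and termination logic is the part being illustrated:

```python
def run_agent_loop(task, screenshot, model_step, dispatch, max_steps=20):
    """Screenshot -> model -> tool call -> feedback, repeated until the
    model stops requesting actions or the step budget runs out.

    model_step(task, image, history) returns a (function_name, args)
    tuple, or None once the model reports the task as complete;
    dispatch(name, args) executes the call (e.g. tap/swipe via ADB).
    """
    history = []
    for _ in range(max_steps):
        image = screenshot()                      # 1. capture current screen
        call = model_step(task, image, history)   # 2-3. ask model for next op
        if call is None:                          # model signals completion
            return history
        name, args = call
        result = dispatch(name, args)             # 4. execute the operation
        history.append((name, args, result))      # 5. feed result back
    return history
```

The `max_steps` budget is a safety valve: if the model loops without converging, the agent stops rather than tapping the device indefinitely.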

5. Deep dive into GenAI technologies

5.1 Gemini 3.0 Pro’s video understanding capabilities

Gemini 3.0 Pro is currently one of the most powerful multimodal large models for video understanding. Its applications in this project:

| Capability | Application Scenario | Technical Implementation |
|---|---|---|
| Long Video Processing | Analyze 10+ second app recordings | GCS URI + VideoMetadata(fps) |
| Temporal Understanding | Detect anomalies at specific moments | Model outputs timestamps in MM:SS format |
| Fine-grained Analysis | Identify subtle UI element changes | 1fps preprocessing preserves millisecond precision |
| Multi-event Tracking | Detect multiple pull-to-refresh operations | JSON array format output |

5.2 Gemini Computer Use’s pixel-level operations

The Computer Use model is specifically designed for GUI operations, with core characteristics:

# Model returns normalized coordinates (0-1000)
response = {
    "box_2d": [y_min, x_min, y_max, x_max],  # Note: y comes before x
    "label": "Comment button"
}

# Convert to absolute pixel coordinates
box = response["box_2d"]
abs_x = int(box[1] / 1000 * width)
abs_y = int(box[0] / 1000 * height)

Function calling capability: Model can directly call predefined tool functions

types.Tool(
    computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_UNSPECIFIED
    )
)
# Model will return function calls like click_at(x=450, y=800)

5.3 Prompt engineering best practices

Principle 1: Clear role definition

You are a professional mobile application UI testing expert, specializing in detecting UI issues in apps.

Principle 2: Structured problem definition

1. **Bottom Loading Delay**
   - Definition: When user scrolls to page bottom, new content loading shows obvious delay
   - Manifestation: Loading indicator appears at bottom (e.g., spinner, "Loading" text)
   - Focus: Loading duration exceeding 1 second is considered problematic

Principle 3: Strict output format constraints

Output Format (JSON array):
[
  {
    "top_left_complete_timestamp": "xx:xx",
    "is_first_loaded": true/false
  }
]

Principle 4: Boundary condition specification

Note: If no pull-to-refresh operations detected in video, return empty array []
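Even with strict format constraints, model output occasionally arrives wrapped in code fences or surrounding prose, so parsing defensively pairs well with the boundary-condition rule above. A minimal sketch (the function name is ours):

```python
import json
import re

def parse_model_json(text):
    """Extract a JSON array from model output, tolerating markdown code
    fences and surrounding prose; fall back to [] per the boundary rule."""
    # Strip ``` / ```json fences if the model added them anyway
    cleaned = re.sub(r"```(?:json)?", "", text).strip()
    # Grab the outermost [...] span in case prose surrounds it
    match = re.search(r"\[.*\]", cleaned, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
```

Returning `[]` on any failure keeps downstream report generation simple: an empty array and "no issues detected" take the same code path.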

6. Conclusion and future outlook

6.1 Project value summary

This project successfully validates the enormous potential of GenAI in the software testing domain:

  1. Efficiency improvement: Automated detection replaces manual review, saving 80%+ of testing time
  2. More comprehensive coverage: Simultaneously detects 7 types of issues, avoiding human oversight
  3. Quantifiable results: Provides performance data accurate to milliseconds
  4. Strong extensibility: Support for new detection scenarios through prompt modification

6.2 Future outlook

  • More anomaly types: Extend detection to white screen, black screen, ANR issues
  • Real-time detection: Integration with CI/CD pipeline for automatic testing on each release
  • Cross-platform support: Extend to iOS devices and web applications
  • Baseline comparison: Establish performance baselines, automatically detect performance regressions
  • Intelligent root cause analysis: Not only detect issues but also analyze possible causes
