Authors: Olivier Zhang, Shirong Liang
Date: December 2025
Abstract
UI testing has always been a critical component of mobile application quality assurance, yet traditional manual review approaches are inefficient and prone to missing issues. This article introduces an intelligent software testing toolkit (sdet-kit) built on Google Gemini Large Language Models. By integrating multiple GenAI capabilities, this tool enables precise UI element localization, intelligent video anomaly detection, and quantitative rendering performance analysis, significantly improving both the efficiency and accuracy of mobile app testing.
Keywords: Gemini, Computer Use, Mobile App Testing, UI Automation, Video Analysis, Large Language Models
1. Introduction: The challenges of mobile app testing
1.1 Pain points of traditional testing
With the rapid development of mobile internet, user expectations for app experience have grown increasingly demanding. For social e-commerce applications, details such as page loading speed, scrolling smoothness, and content rendering priority directly impact user retention. However, traditional UI testing methods face numerous challenges:
- Low efficiency of manual review: Testers need to watch screen recording videos frame by frame, which is time-consuming and leads to fatigue-induced oversights
- Difficulty in issue localization: UI anomalies often occur at millisecond scale, making accurate timing difficult to capture with the naked eye
- Hard-to-quantify performance metrics: Metrics like LCP (Largest Contentful Paint) and first-screen rendering time lack automated measurement methods
- High cost of batch testing: Multi-device, multi-scenario test combinations require substantial human resources
1.2 Opportunities brought by GenAI
The release of Google's Gemini series models provides entirely new solutions to these problems:
- Multimodal understanding: Gemini 3.0 Pro possesses powerful video comprehension capabilities, able to analyze UI changes in screen recordings
- Pixel-level operations: Gemini Computer Use model can precisely identify and locate UI elements
- Reasoning capabilities: Through carefully designed prompts, the model can understand complex UI anomaly definitions and make judgments
- Concurrent processing: API-based approach naturally supports large-scale parallel testing
2. Project background and core value
2.1 Project positioning
sdet-kit is an intelligent testing toolkit designed for SDETs (Software Development Engineers in Test), specifically built for mobile application UI quality assurance.
2.2 Core capabilities
| Capability | Traditional Approach | This Project's Approach |
|---|---|---|
| UI Element Localization | XPath/ID Selectors | Gemini Computer Use pixel-level positioning |
| Anomaly Detection | Manual video review | AI auto-detection of 7 issue types |
| Performance Measurement | Chrome DevTools | Video frame analysis for automatic measurement |
| Test Reports | Manual writing | Auto-generated CSV/Excel |
3. Technical architecture overview
3.1 System architecture diagram
┌──────────────────────────────────────────────────────────────────┐
│                       User Interface Layer                       │
│  ┌────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐    │
│  │  Image UI  │ │Video Anomaly│ │ Render Perf.│ │ Page Load │    │
│  │ Detection  │ │  Detection  │ │  Detection  │ │ Detection │    │
│  └────────────┘ └─────────────┘ └─────────────┘ └───────────┘    │
├──────────────────────────────────────────────────────────────────┤
│                     Application Logic Layer                      │
│  ┌──────────────┐ ┌───────────────┐ ┌──────────────┐             │
│  │   utils.py   │ │render_detector│ │frame_detector│             │
│  │Core Detection│ │Render Analysis│ │Frame Analysis│             │
│  └──────────────┘ └───────────────┘ └──────────────┘             │
├──────────────────────────────────────────────────────────────────┤
│                        AI Services Layer                         │
│  ┌─────────────────────────┐ ┌─────────────────────────────┐     │
│  │     Gemini 3.0 Pro      │ │     Gemini Computer Use     │     │
│  │ - Video Understanding   │ │ - UI Element Localization   │     │
│  │ - Anomaly Detection     │ │ - Android Device Control    │     │
│  │ - Performance Analysis  │ │ - Pixel-level Coordinates   │     │
│  └─────────────────────────┘ └─────────────────────────────┘     │
├──────────────────────────────────────────────────────────────────┤
│                       Infrastructure Layer                       │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐              │
│  │ GCS Storage  │ │    FFmpeg    │ │     ADB      │              │
│  │ Video Cache  │ │Video Process │ │Device Control│              │
│  └──────────────┘ └──────────────┘ └──────────────┘              │
└──────────────────────────────────────────────────────────────────┘
3.2 Data flow
- Video upload: Users upload screen recording videos through the Web UI
- Preprocessing: FFmpeg downsamples video to 1fps (each frame ≈ 33ms of original time)
- Cloud caching: Video uploaded to GCS to obtain URI
- AI analysis: Call Gemini API for multimodal analysis
- Result parsing: Extract detection results in JSON format
- Visualization: Display detection report in Web UI
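Step 5 (result parsing) has to be defensive: despite instructions, models occasionally wrap their JSON in markdown code fences. A minimal parsing sketch (the helper name is illustrative, not the project's actual function):

```python
import json
import re

def parse_detection_json(model_text: str):
    """Extract a JSON array of detection results from raw model output,
    tolerating ```json ... ``` fences the model may add anyway.
    (Illustrative helper, not the toolkit's actual parser.)"""
    # Strip any markdown code fences
    cleaned = re.sub(r"```(?:json)?", "", model_text).strip()
    # Locate the outermost JSON array and parse it
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        return []  # no detections found in the output
    return json.loads(cleaned[start : end + 1])
```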
4. Six core functional modules explained
4.1 Precise UI element detection
Feature description
Based on the Gemini Computer Use model, this module achieves pixel-level localization of UI elements in app screenshots. Users simply describe the target element in natural language, and the model returns precise bounding box coordinates.
Core code
```python
from google import genai
from google.genai import types
from PIL import Image

def detect_ui(image_path, ui_element, model):
    client = genai.Client()
    img = Image.open(image_path)
    width, height = img.size

    system_instructions = """
    Return bounding boxes as a JSON array with labels.
    Never return masks or code fencing. Limit to 25 objects.
    """
    prompt = f"Detect the '{ui_element}' in the image. Return bounding boxes."

    response = client.models.generate_content(
        model=model,  # gemini-2.5-computer-use-preview
        contents=[prompt, img],
        config=types.GenerateContentConfig(
            system_instruction=system_instructions,
            temperature=0.5,
        ),
    )
    # Parse returned coordinates and convert to absolute pixel values
    # ...
```
Technical highlights
- Normalized coordinates: Model returns normalized coordinates in 0-1000 range, facilitating adaptation to different resolutions
- Multi-object detection: Single call can detect multiple similar elements, automatically named by characteristics
- Bounding box visualization: Automatically draws red bounding boxes and green center points on original images
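The normalized-coordinate convention can be made concrete with a small conversion helper (the function name is illustrative; drawing the red box and green center point is left to PIL and omitted here):

```python
def box_to_pixels(box_2d, width, height):
    """Convert a [y_min, x_min, y_max, x_max] bounding box in the model's
    0-1000 normalized space to absolute pixel coordinates plus the center
    point used for the green marker. (Illustrative helper.)"""
    y_min, x_min, y_max, x_max = box_2d
    abs_box = (
        int(x_min / 1000 * width),   # left
        int(y_min / 1000 * height),  # top
        int(x_max / 1000 * width),   # right
        int(y_max / 1000 * height),  # bottom
    )
    center = ((abs_box[0] + abs_box[2]) // 2, (abs_box[1] + abs_box[3]) // 2)
    return abs_box, center
```

Because the coordinates are normalized, the same model output adapts to any screen resolution: only `width` and `height` change.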
4.2 Intelligent video anomaly detection
Feature description
Analyzes app screen recording videos to automatically identify 7 common UI anomaly issues:
| Anomaly Type | Definition | Detection Criteria |
|---|---|---|
| Bottom Loading Delay | Slow content loading when scrolling to bottom | Loading time > 1 second |
| Page Stuttering | Non-smooth display during scrolling/switching | Obvious pauses or frame drops |
| Element Flickering | UI elements unexpectedly flash or disappear | Distinguish normal loading from abnormal flickering |
| Interface Jitter | Unexpected position shifts of page elements | Layout jumping, position mutations |
| Gesture Response Issues | Swipe gestures fail or require multiple attempts | Response doesn't match expectations |
| App Crash | Application unexpectedly exits to system desktop | Distinguish normal return from abnormal exit |
| Timing Anomaly | Unreasonable loading order of page elements | Important content loads after secondary content |
Parallel mode uses ThreadPoolExecutor to initiate 7 API calls simultaneously, each using specialized prompts optimized for specific issues.
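The fan-out can be sketched as follows (prompt texts are abbreviated, and `call_model` stands in for the actual Gemini API call, which is injected here to keep the sketch self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

# One specialized prompt per anomaly type (texts abbreviated for illustration)
ANOMALY_PROMPTS = {
    "bottom_loading_delay": "Detect loading delays when scrolling to the bottom...",
    "page_stuttering": "Detect non-smooth display during scrolling or switching...",
    "element_flickering": "Detect UI elements that unexpectedly flash or disappear...",
    "interface_jitter": "Detect unexpected position shifts of page elements...",
    "gesture_response": "Detect swipe gestures that fail or need multiple attempts...",
    "app_crash": "Detect unexpected exits to the system desktop...",
    "timing_anomaly": "Detect unreasonable loading order of page elements...",
}

def detect_all_anomalies(video_uri, call_model):
    """Launch all 7 specialized detections in parallel and collect results.
    `call_model(video_uri, prompt)` abstracts the Gemini API call."""
    with ThreadPoolExecutor(max_workers=len(ANOMALY_PROMPTS)) as pool:
        futures = {
            name: pool.submit(call_model, video_uri, prompt)
            for name, prompt in ANOMALY_PROMPTS.items()
        }
        # .result() blocks until each call finishes
        return {name: future.result() for name, future in futures.items()}
```

Because each call is independent, wall-clock time is roughly that of the slowest single detection rather than the sum of all seven.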
4.3 Render performance detection
Feature description
Specifically detects page rendering performance after Pull-to-Refresh operations, with core metrics:
- First Slot Render Time: Time for the first feed card in the top-left corner to fully render
- Render Priority: Whether the first slot renders before other areas
Prompt design highlights
```python
SYSTEM_PROMPT = """
You are a professional page performance analyzer specializing in
detecting page rendering performance after pull-to-refresh operations.

Key Definitions:
1. Pull-to-refresh: User gesture dragging the page **downward** from
   the top edge, triggering content reload
2. Top-left quadrant: Upper-left quarter of the feed viewport
3. Rendering complete: Content fully rendered with no blur, placeholders,
   skeleton screens, or loading states

Critical Notes:
- Focus exclusively on pull-to-refresh gestures, **ignore other page loads**
- Video has been processed to 1fps, each frame represents 1 second
"""
```
Key points:
- Clearly defines identification criteria for "pull-to-refresh", excluding other types of page loads
- Uses bold markers (**) to emphasize key constraints
- Explains video preprocessing parameters to help the model understand the time scale
4.4 Page loading performance detection
Feature description
Simultaneously detects two key performance metrics:
- LCP (Largest Contentful Paint): Rendering completion time of the largest content element
- Full content loading: Time when all main content is stably displayed
Video preprocessing
```python
import subprocess

def preprocess_video_to_fps1(video_path):
    temp_30fps_path = video_path.replace(".mp4", "_30fps.mp4")
    temp_1fps_path = video_path.replace(".mp4", "_1fps.mp4")
    # Step 1: Standardize to 30fps
    subprocess.run(["ffmpeg", "-i", video_path, "-r", "30", "-y", temp_30fps_path], check=True)
    # Step 2: Slow down 30x, then sample at 1fps (each frame represents 33ms of original time)
    subprocess.run(["ffmpeg", "-i", temp_30fps_path, "-filter:v", "setpts=30*PTS",
                    "-r", "1", "-y", temp_1fps_path], check=True)
    return temp_1fps_path
```
This preprocessing approach preserves millisecond-level time precision while compressing video to API-acceptable sizes.
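Under this scheme, a timestamp the model reports for the slowed-down 1fps video maps directly back to original time: each second of processed video holds one frame, and each frame covers 1/30 s (≈33ms) of the original recording. A small conversion sketch (the helper name is illustrative):

```python
def timestamp_to_original_ms(mmss: str, source_fps: int = 30) -> float:
    """Map an MM:SS timestamp in the slowed-down 1fps video back to
    original milliseconds: second N of processed video shows frame N,
    and each frame covers 1/source_fps seconds of original footage.
    (Illustrative helper based on the preprocessing described above.)"""
    minutes, seconds = mmss.split(":")
    frame_index = int(minutes) * 60 + int(seconds)
    return frame_index * 1000.0 / source_fps
```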
4.5 Android automation control
Feature description
An intelligent agent based on Gemini Computer Use that can operate Android devices like a human to complete testing tasks.
Architecture design
```python
from google import genai
from google.genai import types

class AndroidAgent:
    def __init__(self):
        self.client = genai.Client()
        self.tools = AndroidTools()  # ADB operation wrapper
        self.config = types.GenerateContentConfig(
            tools=[
                types.Tool(
                    computer_use=types.ComputerUse(
                        environment=types.Environment.ENVIRONMENT_UNSPECIFIED,
                        excluded_predefined_functions=[
                            "open_web_browser", "navigate", ...
                        ],
                    )
                ),
                types.Tool(
                    function_declarations=[
                        types.FunctionDeclaration(
                            name="go_home",
                            description="Return to home screen",
                            parameters={"type": "object", "properties": {}},
                        ),
                        # ...
                    ]
                ),
            ],
            system_instruction="""You are operating an Android phone...""",
        )
```
ADB tool wrapper
```python
import subprocess

class AndroidTools:
    def tap(self, x: int, y: int):
        # Convert normalized (0-1000) coordinates to actual pixels
        actual_x = int(x / 1000 * self.screen_width)
        actual_y = int(y / 1000 * self.screen_height)
        subprocess.run(["adb", "shell", "input", "tap", str(actual_x), str(actual_y)])

    def swipe(self, x1, y1, x2, y2, duration=300):
        # Swipe operation, supports custom duration
        ...

    def screenshot(self) -> bytes:
        result = subprocess.run(["adb", "exec-out", "screencap", "-p"], capture_output=True)
        return result.stdout
```
Agent execution loop:
- Capture current screen screenshot
- Send screenshot and task description to Gemini
- Model returns operations to execute (e.g., click_at, scroll_at)
- Execute operation and capture new screenshot
- Feed execution results back to model, continue loop until task completion
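The loop above can be sketched with the model, screenshot capture, and action execution injected as callables; this keeps the control flow visible without depending on the Gemini SDK or a live device (the callable signatures and the `task_complete` sentinel are illustrative assumptions):

```python
def run_agent_loop(task, get_screenshot, ask_model, execute, max_steps=20):
    """Minimal observe-act loop sketch: screenshot -> model -> action ->
    feedback, repeated until the model signals completion or the step
    budget runs out. All four callables are injected for testability."""
    history = []
    for _ in range(max_steps):
        screenshot = get_screenshot()                 # 1. capture current screen
        action = ask_model(task, screenshot, history) # 2-3. model picks an action
        if action["name"] == "task_complete":         # illustrative stop signal
            return history
        result = execute(action)                      # 4. run it on the device
        history.append((action, result))              # 5. feed results back
    return history
```

The `max_steps` cap is a practical safety net: without it, a confused model could loop on the same screen indefinitely.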
5. Deep dive into GenAI technologies
5.1 Gemini 3.0 Pro's video understanding capabilities
Gemini 3.0 Pro is currently one of the most powerful multimodal large models for video understanding. Its applications in this project:
| Capability | Application Scenario | Technical Implementation |
|---|---|---|
| Long Video Processing | Analyze 10+ second app recordings | GCS URI + VideoMetadata(fps) |
| Temporal Understanding | Detect anomalies at specific moments | Model outputs timestamps in MM:SS format |
| Fine-grained Analysis | Identify subtle UI element changes | 1fps preprocessing preserves millisecond precision |
| Multi-event Tracking | Detect multiple pull-to-refresh operations | JSON array format output |
5.2 Gemini Computer Use's pixel-level operations
The Computer Use model is specifically designed for GUI operations, with core characteristics:
```python
# Model returns normalized coordinates (0-1000)
response = {
    "box_2d": [y_min, x_min, y_max, x_max],  # Note: y comes before x
    "label": "Comment button",
}

# Convert to absolute pixel coordinates
abs_x = int(box[1] / 1000 * width)
abs_y = int(box[0] / 1000 * height)
```
Function calling capability: Model can directly call predefined tool functions
```python
types.Tool(
    computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_UNSPECIFIED
    )
)
# Model will return function calls like click_at(x=450, y=800)
```
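On the client side, each returned function call has to be routed to the matching ADB wrapper. A minimal dispatcher sketch (the object with `.name`/`.args` mimics the SDK's function-call shape, and the action-to-method mapping is an illustrative assumption, not the project's actual loop):

```python
def dispatch_function_call(call, tools):
    """Route a model-returned function call, e.g. click_at(x=450, y=800),
    to the matching AndroidTools method. `call` is any object exposing
    .name and .args; the mapping below is illustrative."""
    handlers = {
        "click_at": lambda args: tools.tap(args["x"], args["y"]),
        "scroll_at": lambda args: tools.swipe(
            args["x"], args["y"], args["x"], args["y"] - args.get("distance", 300)
        ),
        "go_home": lambda args: tools.go_home(),
    }
    handler = handlers.get(call.name)
    if handler is None:
        raise ValueError(f"Unsupported function call: {call.name}")
    return handler(call.args)
```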
5.3 Prompt engineering best practices
Principle 1: Clear role definition
You are a professional mobile application UI testing expert, specializing in detecting UI issues in apps.
Principle 2: Structured problem definition
1. **Bottom Loading Delay**
- Definition: When user scrolls to page bottom, new content loading shows obvious delay
- Manifestation: Loading indicator appears at bottom (e.g., spinner, "Loading" text)
- Focus: Loading duration exceeding 1 second is considered problematic
Principle 3: Strict output format constraints
Output Format (JSON array):
```
[
  {
    "top_left_complete_timestamp": "xx:xx",
    "is_first_loaded": true/false
  }
]
```
Principle 4: Boundary condition specification
Note: If no pull-to-refresh operations detected in video, return empty array []
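Constraining the output format pays off downstream: each entry can be checked mechanically before it enters a report. A validation sketch for the schema shown under Principle 3 (the helper name and exact checks are illustrative):

```python
import re

def validate_result(entry: dict) -> bool:
    """Check one detection entry against the prompt's output contract:
    an 'xx:xx' timestamp string and a boolean flag. (Illustrative check.)"""
    ts = entry.get("top_left_complete_timestamp", "")
    return (
        bool(re.fullmatch(r"\d{2}:\d{2}", ts))
        and isinstance(entry.get("is_first_loaded"), bool)
    )
```

Per the boundary condition above, an empty array is itself a valid response and simply yields nothing to validate.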
6. Conclusion and future outlook
6.1 Project value summary
This project successfully validates the enormous potential of GenAI in the software testing domain:
- Efficiency improvement: Automated detection replaces manual review, saving 80%+ of testing time
- More comprehensive coverage: Simultaneously detects 7 types of issues, avoiding human oversight
- Quantifiable results: Provides performance data accurate to milliseconds
- Strong extensibility: Support for new detection scenarios through prompt modification
6.2 Future outlook
- More anomaly types: Extend detection to white screen, black screen, ANR issues
- Real-time detection: Integration with CI/CD pipeline for automatic testing on each release
- Cross-platform support: Extend to iOS devices and web applications
- Baseline comparison: Establish performance baselines, automatically detect performance regressions
- Intelligent root cause analysis: Not only detect issues but also analyze possible causes