Authors: @Abhay_Ketkar, @abhijithmp
In the era of accelerated computing, GPUs are the engines driving innovation in Artificial Intelligence and Machine Learning. However, with great power comes the critical need for unwavering reliability. At Google Cloud, we understand that our customers entrust us with their most demanding and mission-critical AI workloads. That’s why we’ve implemented a comprehensive, multi-layered approach to GPU hardware and software qualification, designed to proactively prevent issues and ensure a stable platform from day one. Our commitment to reliability isn’t just a checkbox; it’s deeply embedded in the entire lifecycle of every GPU that enters our fleet.
Proactive prevention: Catching issues before they reach you
Our approach is simple: prevent problems before they can ever impact customer workloads. This proactive stance is a cornerstone of our GPU reliability strategy, encompassing several key phases:
- Deep, early partnership with NVIDIA: Our collaboration with NVIDIA begins long before new GPU architectures become widely available. During the New Product Introduction (NPI) phase, Google Cloud engineers work hand-in-hand with NVIDIA, providing continuous feedback on product stability, performance characteristics, and reliability paradigms. This co-engineering approach allows us to influence design considerations and ensures that new technologies are vetted against Google’s high standards from their inception.
- Initial qualification: Stress-testing for perfection: Both newly arrived hardware and systems that have undergone repairs or hardware swaps are subjected to an intense barrage of tests. We leverage the latest diagnostics, often co-developed with NVIDIA, to screen for any hardware defects. This isn’t just about the GPU silicon itself; we validate the entire server ecosystem, with rigorous checks on power delivery, thermal management, high-speed networking (such as NVLink and RoCEv2), and component compatibility. Thermal and mechanical stress tests, along with extended burn-in periods, force out manufacturing flaws and early-life failures, ensuring any weak units are identified and replaced immediately.
- Bill of health: Certifying production readiness: Before the hardware is delivered to customers, it enters the “Bill of Health” phase. This is where we simulate real-world conditions at scale. We execute a demanding suite of storage I/O tests, compute stress tests, and network performance benchmarks. Crucially, we also run representative AI/ML workload simulations, mirroring the complex computational patterns of our customers. This “shake out” process is designed to uncover any latent hardware issues, integration bottlenecks, or performance regressions that might only appear under sustained load. By validating every aspect of the machine, we ensure each unit is hardened and optimized for mission-critical production environments, as detailed in our AI Hypercomputer capacity experience.
- Automated periodic health scans: Our vigilance doesn’t end once a machine is in service. We perform automated, periodic health checks on idle machines within the fleet. These scans help us proactively detect any signs of degradation or emerging issues, allowing for timely remediation before they can affect active customer workloads (a simplified sketch of such a scan appears after this list).
- Always running passive health checks: We leverage telemetry data from various sources, combined with advanced predictive analytics and heuristics, to detect potential issues before they impact customer workloads. This allows us to issue proactive emergent maintenance notifications when hardware repairs are required to ensure continuous service health.
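To make the periodic scan concrete, here is a minimal sketch of what such a check might look like, using the NVIDIA Management Library’s Python bindings (pynvml). The thresholds and the alerting path are illustrative assumptions rather than Google Cloud’s production tooling, but the signals it reads (temperature and volatile ECC error counts) are the kind of telemetry these scans rely on.

```python
# Minimal sketch of a periodic GPU health scan (illustrative only; not
# Google Cloud's production code). Requires the nvidia-ml-py package.
import pynvml

# Hypothetical thresholds chosen for illustration.
MAX_GPU_TEMP_C = 85
MAX_UNCORRECTED_ECC = 0

def scan_idle_gpus():
    """Poll every GPU on the host and flag basic signs of degradation."""
    findings = []
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            if temp > MAX_GPU_TEMP_C:
                findings.append(f"GPU {i}: temperature {temp} C")
            try:
                ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                    handle,
                    pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                    pynvml.NVML_VOLATILE_ECC)
                if ecc > MAX_UNCORRECTED_ECC:
                    findings.append(f"GPU {i}: {ecc} uncorrected ECC errors")
            except pynvml.NVMLError:
                pass  # ECC reporting is not available on every GPU.
    finally:
        pynvml.nvmlShutdown()
    return findings

if __name__ == "__main__":
    for finding in scan_idle_gpus():
        # In a real fleet this would feed an automated remediation workflow.
        print("HEALTH FLAG:", finding)
```

Because a check like this reads NVML counters directly, it is lightweight enough to run frequently on idle machines without interfering with scheduled work.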
A closer look at our testing arsenal
The tests we employ are comprehensive and categorized to cover all bases:
- Foundation tests: We establish a baseline by validating fundamental hardware and software interactions. This includes using tools like HW Field Diags for low-level testing of the GPU and NVSwitch, NVIDIA-SMI for basic GPU status and telemetry, DCGM (Data Center GPU Manager) for in-depth health monitoring and diagnostics, and NCCL (NVIDIA Collective Communications Library) tests to verify the integrity and bandwidth of inter-GPU communication links (a sketch of such a bandwidth check follows this list).
- Representative workload simulations: To ensure reliability under real-world conditions, we go beyond synthetic tests. We deploy a diverse set of custom workloads and actual model training/inference tasks, including Large Language Model (LLM) training, fine-tuning, inference, and computer vision models. These simulations are tailored to the specific GPU architecture, pushing the hardware in ways that mimic customer usage. We also screen for Silent Data Corruptions (SDCs) during these runs to ensure data integrity.
- Dynamic validation lifecycle: Our “burn-in” and qualification standards are not static. They evolve with each new GPU generation and with insights gained from our fleet operations. We continue to invest heavily in custom workloads to improve burn-in. Validation windows can range from hours to several days, ensuring we capture transient issues like thermal instabilities or memory errors that only manifest under prolonged, high-intensity use.
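As an illustration of the NCCL link checks mentioned above, the sketch below launches the all_reduce_perf binary from NVIDIA’s open-source nccl-tests suite on a single node and compares the reported average bus bandwidth against an expected floor. The binary path, GPU count, and bandwidth threshold are assumptions made for the example; a real qualification harness would sweep more collectives and span multiple nodes.

```python
# Illustrative sketch only -- assumes the nccl-tests binaries are built locally.
import re
import subprocess

# Hypothetical values: path to the nccl-tests binary, GPUs per node,
# and a minimum acceptable average bus bandwidth in GB/s.
ALL_REDUCE_PERF = "/opt/nccl-tests/build/all_reduce_perf"
GPUS_PER_NODE = 8
MIN_AVG_BUSBW_GBS = 100.0

def check_allreduce_bandwidth():
    """Run a single-node all-reduce sweep and flag degraded inter-GPU bandwidth."""
    cmd = [
        ALL_REDUCE_PERF,
        "-b", "8",    # start at 8-byte messages
        "-e", "8G",   # sweep up to 8 GiB messages
        "-f", "2",    # double the message size each step
        "-g", str(GPUS_PER_NODE),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # nccl-tests prints a summary line such as "# Avg bus bandwidth : 245.3"
    match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", out)
    if not match:
        raise RuntimeError("could not parse nccl-tests output")
    avg_busbw = float(match.group(1))
    if avg_busbw < MIN_AVG_BUSBW_GBS:
        raise RuntimeError(
            f"bus bandwidth {avg_busbw:.1f} GB/s below floor {MIN_AVG_BUSBW_GBS}")
    return avg_busbw

if __name__ == "__main__":
    print(f"all_reduce avg bus bandwidth: {check_allreduce_bandwidth():.1f} GB/s")
```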
Here is a subset of the testing tools we use and their roles:
| Library | Purpose | Key Detections | Reference Links |
|---|---|---|---|
| HW Field Diags | Perform deep, low-level testing of GPU memory, data transfers, NVL72 domain, and computational engines to identify hardware failures. | GPU Failures, NVL72 Domain Failures, SDC, XIDs, Thermal failures | HW Field Diags |
| NVIDIA-SMI | Basic GPU monitoring and management via NVIDIA drivers. | GPU presence/status, critical XID errors, abnormal power/clock. | Nvidia-SMI documentation |
| DCGM | Comprehensive NVIDIA suite for diagnosing and monitoring GPUs in cluster environments. | PCIe issues, GPU memory errors (ECC), thermal/power violations, NVLink errors. | DCGM Diagnostics |
| NCCL | Validates inter-GPU and multi-node communication links. | Collective timeouts, link training failures, bandwidth degradation. | NCCL User Guide |
| Nemotron | NVIDIA’s native LLM framework used for end-to-end multi-node cluster testing over extended periods. | Workload-specific XIDs, network hangs, performance issues, platform reliability. | Nemotron Guide |
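As an example of how a tool from this table fits into a qualification gate like the initial screening described earlier, the sketch below wraps DCGM’s dcgmi diag CLI as a simple pass/fail check. The run level and the output-parsing heuristic are illustrative assumptions, not our actual fleet tooling.

```python
# Illustrative sketch of a DCGM-based qualification gate (not Google Cloud's
# production tooling). Requires DCGM and the dcgmi CLI on the node.
import subprocess
import sys

def run_dcgm_diag(level: int = 2) -> bool:
    """Run DCGM diagnostics at the given run level (1 = quick, up to 4 = extended)."""
    result = subprocess.run(
        ["dcgmi", "diag", "-r", str(level)],
        capture_output=True,
        text=True,
    )
    # Simplified pass/fail heuristic: a clean exit code and no test reported
    # as "Fail" in the diagnostic summary.
    passed = result.returncode == 0 and "Fail" not in result.stdout
    if not passed:
        # A qualification pipeline would quarantine the node for repair here.
        print(result.stdout)
        print(result.stderr, file=sys.stderr)
    return passed

if __name__ == "__main__":
    sys.exit(0 if run_dcgm_diag(level=2) else 1)
```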
The Google Cloud difference: Reliability by design
Our exhaustive, systematic hardware qualification process, combined with continuous active and passive health checks across the infrastructure lifecycle, is a key differentiator. By investing heavily in proactive prevention, rigorous testing, and deep industry partnerships, Google Cloud minimizes the risk of hardware-related disruptions. This allows our customers to focus on innovation, confident that their AI workloads are running on a foundation of rock-solid, certified, and continuously monitored GPU infrastructure. Your success is our priority, and that starts with reliable hardware.