GCE GPU monitoring guide: Visualizing infrastructure observability, operations and actionable insights

Authors: @abhijithmp @Abhay_Ketkar

Continuing our Google Cloud GPU reliability series published in April, this second installment explores deep-dive monitoring for GPUs on Google Cloud.

As organizations scale training and inference workloads across thousands of nodes, comprehensive telemetry is vital for troubleshooting failures, planning maintenance, and mitigating downtime. Reliability in AI infrastructure directly correlates with model time-to-market and serving performance; thus, partnering with a provider that employs robust infrastructure management is essential. For foundational principles, refer to our previous blog: https://goo.gle/gpu-guide.

This post details the monitoring and observability features offered by GCE to help customers troubleshoot and reduce disruptions across their AI infrastructure. Note: References to VMs also apply to Bare Metal instances on supported GPU types.

Infrastructure observability during runtime

Distributed AI/ML workloads rely on high-bandwidth networking and complex storage layers. Degradation in any component can stall progress. While hard failures are often apparent, “soft” failures require deep visibility into observability metrics across the entire stack for effective debugging.

Google’s Cluster Director platform aggregates raw telemetry from diverse infrastructure sources, categorized into two primary types:

a. Out-of-band metrics:

Out-of-band telemetry is collected at the infrastructure layer and emitted as time-series metrics via the compute.googleapis.com endpoint. The table below summarizes these metrics and their endpoints. Detailed documentation is available here: https://docs.cloud.google.com/monitoring/api/metrics_gcp_c#gcp-compute

| Source | Metrics endpoint | Considerations |
| --- | --- | --- |
| GPU | All metrics under com.googleapis.com/instance/gpu/* | These metrics are specific to the GPU generation, so check the metrics documentation to confirm that a given metric is available for your GPU type. They are offered only for GPU instances that are part of a reservation, not for DWS Flex or Spot VMs. |
| Host machine | machine/machine_status | Shows which machines are healthy in all-capacity mode. It is mainly used to monitor machines that are part of an NVLink domain and to see whether each one is healthy, in active REPAIR, or degraded because of failing health checks. |
| NVLink | com.googleapis.com/instance/gpu/gpu_nvlink* and com.google.com/instance/gpu/nvlink* | NVLink metrics, including key metrics such as bit error rates. |
| NCCL telemetry | com.googleapis.com/instance/gpu/NCCL/* | Metrics captured from NCCL telemetry. |
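Because availability varies by GPU generation, it can help to list the metric descriptors that actually exist in a project before building dashboards. The sketch below only constructs the request URL for the Cloud Monitoring `metricDescriptors.list` REST method with a prefix filter. The prefix `compute.googleapis.com/instance/gpu/` is an assumption based on the endpoint named above, and `my-gpu-project` is a placeholder.

```python
from urllib.parse import urlencode

# Assumption: GPU metrics share this prefix; adjust it to match the
# metric types listed in the metrics documentation.
GPU_METRIC_PREFIX = "compute.googleapis.com/instance/gpu/"


def metric_descriptors_url(project_id: str, prefix: str = GPU_METRIC_PREFIX) -> str:
    """Build a metricDescriptors.list URL that filters on a metric-type prefix."""
    metric_filter = f'metric.type = starts_with("{prefix}")'
    query = urlencode({"filter": metric_filter})
    return (
        "https://monitoring.googleapis.com/v3/"
        f"projects/{project_id}/metricDescriptors?{query}"
    )


if __name__ == "__main__":
    # "my-gpu-project" is a placeholder project ID.
    print(metric_descriptors_url("my-gpu-project"))
```

The resulting URL still has to be called with an OAuth bearer token for an identity that has Monitoring read access; that part is left out to keep the sketch self-contained.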

b. In-band metrics:

In-band metrics are captured from the guest OS via agents such as the Google Cloud Ops agent or DCGM exporter. GPU instances within GKE or Cluster Toolkit are preconfigured for DCGM collection. Manual configuration instructions are available here: https://docs.cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/third-party-nvidia.

We also leverage the NVIDIA NCCL profile plugin for advanced troubleshooting, such as straggler and hang detection. NCCL telemetry collection via the Collective Communication Analyzer (CoMMA) library is enabled by default. Additional details can be found at: https://docs.cloud.google.com/ai-hypercomputer/docs/nccl/comma

Metrics usage options:

Dashboards in Google Cloud UI

GCE provides preconfigured, customizable dashboards for key metrics. Available options depend on the consumption model (e.g., GCE vs. GKE). View the preconfigured dashboards here: https://docs.cloud.google.com/ai-hypercomputer/docs/monitor

For GKE, preconfigured dashboards are available along with JobSet monitoring. For more information on these dashboards, see: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/gpus#monitoring.

Custom dashboards for monitoring domain health

Users can create custom dashboards using Metrics Explorer to aggregate metrics and logs for exhaustive monitoring. Learn more about custom dashboards: https://docs.cloud.google.com/monitoring/charts/dashboards.

For example, a custom dashboard can monitor the health of a sub-block using the machine_status and infra_health metrics.

Alerting and notifications

Configure metric or log-based alerts with custom thresholds. Cloud Monitoring supports various notification channels, including Slack, PagerDuty, and email. Refer to the alerting guide: https://docs.cloud.google.com/monitoring/alerts
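As a rough illustration of a metric-based alert, the sketch below builds a request body shaped like the Cloud Monitoring `AlertPolicy` resource with a single threshold condition. The metric type, resource type, threshold, and notification channel name in the usage example are placeholders, not documented values; check the metrics documentation for the real metric names and their value semantics before using anything like this.

```python
import json


def threshold_alert_policy(display_name, metric_type, resource_type,
                           threshold, duration_s=300, channels=()):
    """Return an AlertPolicy-shaped dict with one 'greater than' condition."""
    condition = {
        "displayName": f"{metric_type} above {threshold}",
        "conditionThreshold": {
            # Threshold conditions filter on a metric type and a resource type.
            "filter": (
                f'metric.type = "{metric_type}" '
                f'AND resource.type = "{resource_type}"'
            ),
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            # The condition must hold for this long before the alert fires.
            "duration": f"{duration_s}s",
            "aggregations": [
                {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MAX"}
            ],
        },
    }
    return {
        "displayName": display_name,
        "combiner": "OR",
        "enabled": True,
        "conditions": [condition],
        "notificationChannels": list(channels),
    }


if __name__ == "__main__":
    policy = threshold_alert_policy(
        display_name="GPU metric threshold (example)",
        # Placeholder metric and resource types, for illustration only.
        metric_type="compute.googleapis.com/instance/gpu/example_metric",
        resource_type="gce_instance",
        threshold=0,
        channels=["projects/my-gpu-project/notificationChannels/1234567890"],
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON could be used as the body of an alert policy creation request, or adapted into whatever infrastructure-as-code tool manages your alerting.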

Cloud observability APIs and metrics forwarding

For third-party solutions (e.g., Prometheus, Datadog), users can leverage Cloud Monitoring APIs (https://docs.cloud.google.com/monitoring/api/v3) or open-source exporters like the Stackdriver exporter to forward metrics.

Monitoring deployment topology, operations and health

Visibility of deployment topology and scheduling

GCE GPU infrastructure is densely deployed, and visibility into this topology is essential for schedulers to perform topology-aware placement for optimal performance. Deployment topology is accessible via GCE APIs and the guest OS.

Here is a sample of the topology information returned by the subblock describe API for an A4X VM:

physicalHostTopology:
   cluster: europe-west1-cluster-jfhb
   block: 3e3056e23cf91a5cb4a8621b6a52c100
   subblock: 1fc18636cbd4abd623553784ca2c174e
   host: 2326279b5ecdfc545fd5e39167698168

Here is the topology for the same VM as seen from inside the guest operating system through the instance’s metadata server:

{
  "cluster": "europe-west1-cluster-jfhb",
  "block": "3e3056e23cf91a5cb4a8621b6a52c100",
  "subBlock": "1fc18636cbd4abd623553784ca2c174e",
  "host": "2326279b5ecdfc545fd5e39167698168"
}

For a detailed understanding of GPU infrastructure topology, see: https://docs.cloud.google.com/ai-hypercomputer/docs/manage/instance-topology
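To show how a scheduler might consume this information, here is a minimal sketch that groups VMs by sub-block from topology JSON shaped like the metadata server sample above. The VM names and the second set of topology IDs are hypothetical, and how the JSON is collected from each VM is left out.

```python
import json
from collections import defaultdict


def group_by_subblock(topologies):
    """Map (cluster, block, subBlock) to the sorted VM names placed there.

    `topologies` maps a VM name to the raw topology JSON string read
    from that VM.
    """
    groups = defaultdict(list)
    for vm_name, raw in topologies.items():
        topo = json.loads(raw)
        key = (topo["cluster"], topo["block"], topo["subBlock"])
        groups[key].append(vm_name)
    return {key: sorted(vms) for key, vms in groups.items()}


if __name__ == "__main__":
    same_subblock = {
        "cluster": "europe-west1-cluster-jfhb",
        "block": "3e3056e23cf91a5cb4a8621b6a52c100",
        "subBlock": "1fc18636cbd4abd623553784ca2c174e",
    }
    # Hypothetical VM names and host IDs, for illustration only.
    sample = {
        "vm-a": json.dumps({**same_subblock, "host": "host-1"}),
        "vm-b": json.dumps({**same_subblock, "host": "host-2"}),
        "vm-c": json.dumps({**same_subblock, "subBlock": "other-subblock", "host": "host-3"}),
    }
    for key, vms in group_by_subblock(sample).items():
        print(key[2], vms)
```

VMs that share a sub-block key are candidates for tightly coupled placement, which is the kind of decision a topology-aware scheduler makes.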

Schedulers and orchestrators can use topology awareness when deploying workloads to achieve optimal performance. For example, GKE together with Kueue uses Topology Aware Scheduling (TAS) for workload placement.

Refer to the documentation on using Topology Aware Scheduling for GKE: https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/schedule-gke-workloads-tas

Visibility of operations and controls

To ensure platform stability, Google performs periodic maintenance. These operations are categorized by their impact and the level of advance notification provided.

  1. Hardware operations

    These operations cover hardware maintenance and repairs. AI infrastructure contains many hardware components that can fail or degrade, so we perform periodic maintenance to repair, replace, or upgrade them. These operations are inherently disruptive because they cannot be performed while a VM is running on the affected infrastructure.

  2. Software operations

    These operations cover software upgrades and fixes. The AI infrastructure software stack ranges from GPU drivers to host kernels to management software. Depending on the component, a subset of these operations can disrupt running VMs and workloads.

Based on the severity of these operations and how early we can detect the need for them, they fall into three buckets:

  1. Planned operations (or maintenance)

    These are upgrades and repairs that we know about in advance and must roll out across clusters. Although they are disruptive, they are infrequent and come with advance notification (on the order of weeks to months) so that customers can plan for the downtime. Customers can see the schedule of these operations and can start them on demand, as shown in the examples later in this section.

  2. Emergent operations (or repairs)

    These are unplanned operations of medium severity. They do not need to be addressed immediately, but they cannot wait until the next planned operation window. For these operations, we still give advance notification (on the order of days) with the schedule, along with the controls needed to initiate them on demand.

  3. Critical failures

    These cover failures that GCE either cannot detect in advance (also known as hosterror, including host crashes, VM kernel panics, and uncorrectable CPU errors) or that are too severe to defer (also known as urgent repairs, including failures such as GPU overheating and critical XID failures). For these critical failures, we cannot provide extended advance notification or controls to initiate the operations on demand.

Below are examples for monitoring and managing each of the above mentioned operations.

  1. Managing planned maintenance operations

Planned maintenance triggers ‘upcomingMaintenance’ notifications via Cloud Logging, the GCE API, and instance metadata. Notifications include start times, duration, and on-demand initiation options. For details, see: https://docs.cloud.google.com/ai-hypercomputer/docs/manage/host-events#view-maintenance.

For certain GPU families, maintenance notifications are sent for the entire deployment topology along with the control to initiate group maintenance across different cluster topologies. Refer to this document for maintenance across the reservation topology - https://docs.cloud.google.com/ai-hypercomputer/docs/manage/host-events-reservations#manage-maintenance.

  2. Managing repairs via emergent operations

When GCE detects symptoms of hardware degradation or software vulnerabilities, we send a similar ‘upcomingMaintenance’ notification, with details of the failing component in the maintenanceReasons field. Emergent-maintenance ‘upcomingMaintenance’ notifications are sent 7 days in advance with ‘type’ = ‘UNSCHEDULED’ and ‘maintenanceReasons’ values that start with ‘FAILURE_’.

Users can manage emergent maintenance via Reservation-level APIs. Google recommends enabling emergent maintenance to gain control over necessary repairs without immediate disruption. Learn more: https://docs.cloud.google.com/ai-hypercomputer/docs/manage/host-events-reservations#emergency-notifications
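The rule described above for telling emergent notifications apart from planned ones can be sketched as a small classifier. Only the `type` and `maintenanceReasons` fields and the `UNSCHEDULED` and `FAILURE_` values come from this post. The sample payloads, the `SCHEDULED` value, and the specific reason strings are illustrative, and treating every non-matching notification as planned is a simplifying assumption.

```python
def classify_maintenance(notification: dict) -> str:
    """Classify an upcomingMaintenance payload as 'emergent' or 'planned'."""
    reasons = notification.get("maintenanceReasons", [])
    is_unscheduled = notification.get("type") == "UNSCHEDULED"
    has_failure_reason = any(r.startswith("FAILURE_") for r in reasons)
    if is_unscheduled and has_failure_reason:
        return "emergent"
    # Simplifying assumption: anything else is handled as planned maintenance.
    return "planned"


if __name__ == "__main__":
    # Illustrative payloads; real notifications carry more fields.
    emergent = {"type": "UNSCHEDULED", "maintenanceReasons": ["FAILURE_GPU"]}
    planned = {"type": "SCHEDULED", "maintenanceReasons": []}
    print(classify_maintenance(emergent))  # emergent
    print(classify_maintenance(planned))   # planned
```

A workload controller could use a check like this to decide whether to checkpoint and trigger the repair early or to wait for a planned window.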

  3. Managing failures

Critical failure notifications arrive after the failure has occurred, so users cannot take preventive measures against them. Instead, customers typically track these failures to see trends over time and to categorize failures by failing component. A standard metric for this is Mean Time Between Failures (MTBF).
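As a worked example, fleet-level MTBF is commonly estimated as total operating time divided by the number of failures in the same window. The sketch below uses VM-hours as operating time and ignores repair downtime, which is a simplifying assumption. The fleet size and failure timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def fleet_mtbf_hours(vm_count, window_start, window_end, failure_times):
    """Estimate MTBF in VM-hours per failure over an observation window.

    Returns None when no failure falls inside the window, because the
    estimate is undefined without at least one failure.
    """
    in_window = [t for t in failure_times if window_start <= t < window_end]
    if not in_window:
        return None
    window_hours = (window_end - window_start).total_seconds() / 3600
    return vm_count * window_hours / len(in_window)


if __name__ == "__main__":
    start = datetime(2026, 4, 1)
    end = start + timedelta(days=30)
    # Hypothetical failure timestamps for a 100-VM fleet.
    failures = [start + timedelta(days=d) for d in (2, 9, 17, 25)]
    print(fleet_mtbf_hours(100, start, end, failures))  # 18000.0
```

Computing the same figure per failure category (for example, per failing component) gives the trend view described above.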

Beyond these critical failures, customers can detect hardware and infrastructure failures while running their workloads, based either on performance degradation or on calculation errors caused by Silent Data Corruptions (SDCs). To manage such user-detected faults, we provide APIs to report the failures and request replacement hardware.

  1. Faulty host reporting: SDCs usually surface at the level of a single VM instance, so users can report the affected hosts as faulty. Depending on the GPU VM family, VMs either wait for the host to be repaired or are moved to spare hosts. More details on host-level fault reporting can be found here: https://docs.cloud.google.com/ai-hypercomputer/docs/manage/report-faulty-host. GKE users have a simpler option: they can mark hosts as faulty by attaching labels to nodes, as documented here: https://docs.cloud.google.com/ai-hypercomputer/docs/manage/manage-gke-clusters#report-faulty-hosts-how-to

  2. Faulty NVLink domain: For GPU VMs based on external multi-node NVLink, such as A4X and later, customers can report entire NVLink domains as faulty. This API is in preview.

Actionable insights: Beyond raw observability and telemetry

Actionable insights transform raw reliability signals into specific recommendations. These are categorized into workload-specific and infrastructure-specific insights.

Workload insights:

These insights facilitate troubleshooting of performance degradation caused by infrastructure issues. Key features include Straggler Node Detection and NCCL Hang Detection for precise fault localization.

In distributed workload runs, failures come in two flavors. The first, a fail-stop failure, is obvious—a component crashes and goes silent. The second is far more insidious: a fail-slow failure. Here, a component doesn’t stop working; it just gets slow. This underperforming node, or “straggler,” continues to participate in computations, but its sluggishness creates a drag on the entire system, turning a minor hardware or software issue into a significant bottleneck that drastically increases the overall training time.

Below are two key features that help address these fail-slow and fail-stop scenarios.

  1. Straggler Node Detection: We utilize NCCL telemetry and heuristics to identify nodes performing inconsistently due to hardware or software issues. These “stragglers” are flagged in metrics (straggler_status), dashboards, and logs. See: https://cloud.google.com/blog/products/compute/stragglers-in-ai-a-guide-to-automated-straggler-detection.

[Figure: Straggler cascade]

[Figure: Algorithm]

  2. NCCL Hang Detection (Preview): Detecting a single NCCL hang among the hundreds of nodes participating in a job run is hard and time-consuming. We are enhancing our fault localization feature to detect NCCL hangs and to surface the faulty nodes in logs, metrics, and dashboards.

    Both Straggler Node Detection and NCCL Hang Detection rely on NCCL telemetry collection.
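To make the fail-slow idea concrete, here is a generic outlier check over per-node timings. It is not the heuristic Google uses for straggler detection, and it assumes you already have a per-node latency series (for example, time spent before entering a collective), which is itself an assumption about what your telemetry exposes. It flags nodes whose median latency sits far above the fleet median, measured in units of the median absolute deviation (MAD).

```python
from statistics import median


def find_stragglers(step_times, threshold=3.0, min_relative_scale=0.01):
    """Return the sorted names of nodes that look unusually slow.

    `step_times` maps a node name to a list of per-step durations in seconds.
    A node is flagged when its median duration exceeds the fleet median by
    more than `threshold` times the robust scale (MAD, with a small floor).
    """
    # One representative latency per node; nodes without samples are skipped.
    node_medians = {n: median(t) for n, t in step_times.items() if t}
    if len(node_medians) < 3:
        return []  # too few nodes to define a meaningful baseline

    fleet_median = median(node_medians.values())
    mad = median(abs(m - fleet_median) for m in node_medians.values())
    # The floor keeps one slow node detectable when all others are identical.
    scale = max(mad, min_relative_scale * fleet_median)
    if scale <= 0:
        return []

    return sorted(
        node for node, m in node_medians.items()
        if (m - fleet_median) / scale > threshold
    )


if __name__ == "__main__":
    # Hypothetical per-step durations in seconds.
    sample = {
        "node-a": [1.0, 1.1, 0.9],
        "node-b": [1.0, 1.2, 1.1],
        "node-c": [0.9, 1.0, 1.0],
        "node-d": [2.5, 2.6, 2.4],
    }
    print(find_stragglers(sample))  # ['node-d']
```

Median and MAD are used instead of mean and standard deviation so that the slow node itself does not distort the baseline it is compared against.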

Infrastructure insights

  1. Health-Aware Node Placement: Predictive ML models analyze telemetry to forecast GPU health degradation. These signals (failure_prediction_status) are integrated with Kubernetes Kueue for reliability-optimized scheduling. Learn more: https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/enable-node-health-prediction.

  2. VM Infrastructure Health: We run a series of continuous, passive health checks and update the health of VM instances based on the results. These results are emitted as a time-series metric (com.googleapis.com/instance/gpu/infra_health), along with values for the different unhealthy categories. An unhealthy value from our health checks does not necessarily mean that your workload will show degradation or failure, so we strongly recommend running additional tests and health checks before reporting these VMs as faulty.

  3. VM Termination Report (Preview): This feature provides a centralized log per VM instance detailing termination causes, including granular root-cause analysis (e.g., networking updates vs. hardware maintenance). Full analysis may take up to 72 hours.

{
  "insertId": "7738270856251265259",
  "jsonPayload": {
    "impactCategory": "INSTANCE_INTERRUPTION",
    "instanceInterruptionReport": {
      "diagnosisTime": "2026-05-07T21:01:06.58668329Z",
      "type": "PLANNED_MAINTENANCE",
      "status": "COMPLETE",
      "summary": "The diagnosis for the instance interruption that occurred at 2026-05-07T20:22:41.676219Z is complete.",
      "interruptionTime": "2026-05-07T20:22:41.676219Z",
      "reasons": [
        {
          "reason": "PLANNED_NETWORK_UPDATE"
        }
      ]
    }
  },
  "resource": {
    "type": "gce_instance",
    "labels": {
      "zone": "us-central1-b",
      "project_id": "455207029971",
      "instance_id": "6857930333645934192"
    }
  },
  "timestamp": "2026-05-07T20:22:41.676219Z",
  "severity": "INFO",
  "logName": "projects/supercomputer-testing/logs/compute.googleapis.com%2Finterruption_report",
  "receiveTimestamp": "2026-05-10T20:22:42.722514194Z"
}
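A report like the one above is straightforward to consume programmatically. The sketch below builds a Cloud Logging filter for the log name shown in the sample and flattens one entry into the fields most useful for tracking. It assumes entries keep the shape of the sample. The embedded entry is a trimmed copy of it, and `my-gpu-project` is a placeholder.

```python
def interruption_log_filter(project_id: str) -> str:
    """Build a Cloud Logging filter that selects interruption report entries."""
    return (
        f'logName="projects/{project_id}/logs/'
        'compute.googleapis.com%2Finterruption_report"'
    )


def summarize_interruption(entry: dict) -> dict:
    """Flatten one interruption report log entry into a small summary."""
    report = entry["jsonPayload"]["instanceInterruptionReport"]
    labels = entry["resource"]["labels"]
    return {
        "instance_id": labels["instance_id"],
        "zone": labels["zone"],
        "type": report["type"],
        "status": report["status"],
        "interruption_time": report["interruptionTime"],
        "reasons": [r["reason"] for r in report.get("reasons", [])],
    }


if __name__ == "__main__":
    # Trimmed copy of the sample entry shown above.
    entry = {
        "jsonPayload": {
            "impactCategory": "INSTANCE_INTERRUPTION",
            "instanceInterruptionReport": {
                "type": "PLANNED_MAINTENANCE",
                "status": "COMPLETE",
                "interruptionTime": "2026-05-07T20:22:41.676219Z",
                "reasons": [{"reason": "PLANNED_NETWORK_UPDATE"}],
            },
        },
        "resource": {
            "type": "gce_instance",
            "labels": {
                "zone": "us-central1-b",
                "instance_id": "6857930333645934192",
            },
        },
    }
    print(interruption_log_filter("my-gpu-project"))
    print(summarize_interruption(entry))
```

Summaries like this can be aggregated by `type` and `reasons` to separate planned interruptions from hardware-related ones when computing reliability trends.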

Feature compatibility matrix

The table below summarizes support for the features above across the GPU VM families currently offered in GCE.

| Feature | Supported GPU instance family | Additional considerations |
| --- | --- | --- |
| Out-of-band telemetry metrics | A3 Mega, A3 Ultra, A4, A4X, A4X Max | Spot VMs are excluded. |
| In-band telemetry metrics | A3 Mega, A3 Ultra, A4, A4X, A4X Max for NCCL telemetry; DCGM collection preconfigured for all GKE and Cluster Toolkit VMs; all VMs with the Ops Agent configured for DCGM collection | |
| Topology visibility | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |
| Planned maintenance experience for software updates | All GPU VM families | Different GPU VM families have different maintenance experience configured based on the |
| Emergent maintenance experience for hardware repairs | A3 Mega, A3 Ultra, A4, A4X, A4X Max | A3 High VMs can get the emergent maintenance experience in preview. |
| Customer reporting of faulty host/infrastructure | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |
| Straggler node detection | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |
| NCCL hang detection (Preview) | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |
| Health prediction aware workload scheduling with Kueue | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |
| VM infrastructure health checks | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |
| VM termination reporting (Preview) | A3 Mega, A3 Ultra, A4, A4X, A4X Max | |

If you use other GPU VMs or would like to participate in the preview features, contact your account team to discuss further.

Most of the points highlighted in this doc were part of the AI Ops at Scale session at 2026 Google Cloud Next. You can refer to the session deck here: https://content-cdn.sessionboard.com/content/p71dktTETGil1cZH0nHy_BRK2-174.pdf
