Since November 11, 2025, our team's Batch jobs always end up like this: the job stays in the scheduled state for a long time with several VM instances already spun up, but no task is actually running. Later it fails with the following error:
This should not happen, because we only requested some c2-standard-30 or e2-highmem-16 machines with the Standard provisioning model in us-central1, which should be very easy to satisfy. Could someone DM me and help us take a look? Thanks!
Hi! There are several ways to approach this problem, but from the information you provided I would start by verifying your job configuration. The gcloud CLI can list which machine types are offered in each zone, e.g.:
gcloud compute machine-types list
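The bare command prints every machine type in every zone, so it can help to narrow it down to the type and region you care about. Below is a minimal sketch, assuming the standard `--filter` and `--format` list flags. The function name `list_zones_for_machine_type` and the `GCLOUD` override variable are made up for illustration and are not part of gcloud. Keep in mind that this only shows where the machine type is offered, not whether capacity is currently available there.

```shell
# GCLOUD is a hypothetical override so the command can be dry-run (e.g. GCLOUD=echo).
GCLOUD="${GCLOUD:-gcloud}"

# Print the zones of a region in which a machine type is offered.
# $1 = machine type (e.g. c2-standard-30), $2 = region (e.g. us-central1)
list_zones_for_machine_type() {
  "$GCLOUD" compute machine-types list \
    --filter="name=${1} AND zone~${2}-" \
    --format="value(zone)"
}
```

For the job described above you would call it as `list_zones_for_machine_type c2-standard-30 us-central1` and again with `e2-highmem-16`.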
Machine type is not the only possible cause of a ZONE_RESOURCE_POOL_EXHAUSTED (ZRPE) error. Resource types that are subject to stockouts include:
- Compute Resources (vCPUs and Memory):
  - Specific VM families (e.g., N1, N2, N2D, E2, C2, C3, M1, M2, M3, A2, A3, G2, etc.).
  - Specific VM shapes (e.g., n2-standard-64, c2-highmem-32).
  - Minimum CPU platforms (e.g., requesting Intel Ice Lake or later).
  - The sheer amount of cores or RAM requested.
- Accelerators:
  - Specific types and counts of GPUs (e.g., NVIDIA T4, V100, A100, H100).
  - TPUs (Tensor Processing Units).
- Storage:
  - Local SSD: Lack of available Local SSD capacity on machines that match the other VM requirements.
  - Persistent Disk (PD): While sometimes manifesting as a PD_STOCKOUT, the inability to create the required PD (due to lack of cell-level capacity for HDD, SSD, or IOPS) can cause the VM creation to fail, sometimes still surfacing as a general ZRPE to the user.
Another good test would be to temporarily create an instance (or a single-VM regional managed instance group) with the same spec. This would tell you whether the region supports that configuration. From your description it seems it does (several instances were spun up).
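A rough sketch of such a probe is below. It assumes the default boot image and disk are acceptable for the test, and it creates a billable VM for a short time before deleting it. The function name `probe_capacity`, the `zrpe-probe-` name prefix, and the `GCLOUD` override variable are hypothetical and only there for illustration. One VM being created does not guarantee capacity for the full number of VMs your job requests, so treat the result as a hint.

```shell
# GCLOUD is a hypothetical override so the function can be dry-run (e.g. GCLOUD=true).
GCLOUD="${GCLOUD:-gcloud}"

# Try to create one VM of the given shape, then delete it again.
# $1 = machine type (e.g. c2-standard-30), $2 = zone (e.g. us-central1-a)
probe_capacity() {
  machine_type="$1"
  zone="$2"
  name="zrpe-probe-$$"   # hypothetical instance name, suffixed with the shell PID

  if "$GCLOUD" compute instances create "$name" \
       --zone="$zone" --machine-type="$machine_type"; then
    echo "created ${machine_type} in ${zone}; deleting probe"
    "$GCLOUD" compute instances delete "$name" --zone="$zone" --quiet
  else
    echo "could not create ${machine_type} in ${zone}" >&2
    return 1
  fi
}
```

Example call: `probe_capacity c2-standard-30 us-central1-a`. If creation fails, the error gcloud prints should tell you whether it was a stockout or something else, such as quota.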
Once you have ruled out these causes, I encourage you to open a support ticket with the details of a recent failure.