Frequent “GCE out of resources.” in europe-west1-{c,d} with GKE Standard (Spot/Preemptible CI nodes) after years of stability

For ~5 years we had no capacity issues in europe-west1. Over the last few weeks, we’ve been hitting daily “GCE out of resources” errors when GKE tries to scale node pools, particularly for our CI cluster, which runs on Spot/Preemptible nodes. We moved from zone europe-west1-d to europe-west1-c, but the problem persists. We’re looking for:

  1. Practical ways to see or anticipate capacity constraints per zone/region, and

  2. Recommendations to make our setup robust so that our GitLab CI runners can always scale (and production remains safe).

  3. The usual suggestion is to go multi-zonal/multi-regional, but how do we know in advance whether a given zone actually has enough Spot/Preemptible capacity in stock?

The Google Cloud status page always shows everything up and healthy, so beyond that we have no visibility into why the preemptible nodes fail to provision, and we haven’t been able to resolve it.
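For reference, the closest thing we’ve found for per-zone visibility is listing which machine types a zone offers. A sketch (zone and filter values are just our setup; note this reflects the catalog, not live Spot capacity):

```shell
# Show which of our machine types are offered per zone in europe-west1.
# This only tells you the type exists in the zone's catalog -- it does
# NOT reflect real-time Spot/Preemptible stock, as far as we can tell.
gcloud compute machine-types list \
  --zones=europe-west1-b,europe-west1-c,europe-west1-d \
  --filter="name:(n2d-standard-8 OR e2-standard-4 OR e2-standard-8)"
```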

Environment (high level)

  • GKE Standard, private cluster, currently zonal in europe-west1 (Belgium).

  • Production & prerelease: fixed (on-demand) node pools.

  • CI cluster: Spot/Preemptible nodes to keep costs low.

  • Typical machine types: N2D (e.g., n2d-standard-8) and E2 (e2-standard-4, e2-standard-8).

  • Over the last few weeks, scale-up attempts have failed with “GCE out of resources” when adding nodes.
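Roughly, the CI pool looks like this if re-created from scratch (cluster/pool names and autoscaling limits are placeholders, not our exact values):

```shell
# Illustrative sketch of the CI Spot node pool -- names/limits are placeholders.
gcloud container node-pools create ci-spot \
  --cluster=ci-cluster \
  --zone=europe-west1-c \
  --spot \
  --machine-type=n2d-standard-8 \
  --enable-autoscaling --min-nodes=0 --max-nodes=10
```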

What changed?

Nothing intentional on our side, other than moving the CI pool from europe-west1-d to europe-west1-c because of the CPU and memory stockout errors.