We have the following ComputeClass in a GKE 1.31.7-gke.1265000 Standard cluster.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: nvidia-l4-1
spec:
  priorities:
  - machineFamily: g2
    gpu:
      type: nvidia-l4
      count: 1
  nodePoolAutoCreation:
    enabled: true
  autoscalingPolicy:
    consolidationDelayMinutes: 10
    consolidationThreshold: 70
    gpuConsolidationThreshold: 100
  activeMigration:
    optimizeRulePriority: true
  whenUnsatisfiable: DoNotScaleUp
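(The class is cluster-scoped; we apply it with a plain kubectl apply, the filename below is just what we call the manifest locally.)
kubectl apply -f computeclass-nvidia-l4-1.yaml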
As soon as we deploy a workload with this class:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
  namespace: bla
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      containers:
      - image: stefanprodan/podinfo:latest
        imagePullPolicy: Always
        name: gpu-example
        resources:
          limits:
            cpu: 200m
            memory: 128Mi
            nvidia.com/gpu: "1"
          requests:
            cpu: 200m
            memory: 128Mi
            nvidia.com/gpu: "1"
      nodeSelector:
        cloud.google.com/compute-class: nvidia-l4-1
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Equal
        value: present
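Before the errors show up, the pod does get scheduled onto the auto-created GPU node, which we verify with something like:
kubectl -n bla get pods -o wide
kubectl get nodes -l cloud.google.com/compute-class=nvidia-l4-1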
The ComputeClass gets unhealthy:
Status:
  Conditions:
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Nodepool nap-g2-standard-8-gpu1-ehtlyb5i has an Crd label but doesn't match any priority rule.
    Reason:                NoRuleMatching
    Status:                True
    Type:                  NodepoolMisconfigured
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Nodepool nap-g2-standard-8-gpu1-ehtlyb5i doesn't match any priority rule and Crd is configured to not scale up in that case
    Reason:                NodepoolWillNeverScaleUp
    Status:                True
    Type:                  NodepoolMisconfigured
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Crd is not healthy.
    Reason:                Health
    Status:                False
    Type:                  Health
Events:                    <none>
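For reference, the status above comes from describing the ComputeClass object itself; assuming the CRD's plural under the cloud.google.com group is computeclasses, something like:
kubectl describe computeclasses.cloud.google.com nvidia-l4-1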
Even though it says NodepoolWillNeverScaleUp, it does scale up and runs the workload.
Also, scaling the app creates new nodepools instead of adding more nodes to the existing nodepool.
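The scale-out in question is nothing more than a replica bump, e.g. (the replica count is arbitrary):
kubectl -n bla scale deployment gpu-example --replicas=3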
What we’re observing is that GPU nodepools remain in the system even after scaling down to 0 nodes. These empty nodepools, along with their associated managed instance groups and templates, continue to accumulate over time. Eventually, this accumulation hits our quota limits, preventing any further scaling operations in the cluster.
The issue appears to be in how GKE handles cleanup of these resources.
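As a stopgap we can delete the empty auto-provisioned pools by hand; a rough sketch (cluster name and location match our repro config below, the pool name is just an example of what NAP generates):
gcloud container node-pools list --cluster example-cluster-1 --region europe-west4
gcloud container node-pools delete nap-g2-standard-8-gpu1-ehtlyb5i --cluster example-cluster-1 --region europe-west4 --quiet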
Hi hprotzek,
Welcome to Google Cloud Community!
Have you tried upgrading your cluster version? Based on your current setup, the cluster version you’re using is outdated and no longer available in any release channel. I recommend upgrading your cluster to GKE version 1.32.2-gke.1400000 or newer.
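For example, the control plane can be moved to a newer version with something like this (cluster name and location are placeholders for your setup):
gcloud container clusters upgrade CLUSTER_NAME --region REGION --master --cluster-version 1.32.2-gke.1400000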
I attempted to reproduce your issue using the same ComputeClass configuration, and the CRD status appeared as “Healthy” in my setup. Here’s a sample of the ComputeClass status output:
If the issue still persists, please feel free to reach out to our Google Cloud Support.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
We tested with 1.31.7-gke.1390000 and 1.32.2-gke.1297002 with the same result. I already opened an issue here, including the steps to reproduce the problem: https://issuetracker.google.com/issues/423939362
@francislouie Could you share the parameters of the test cluster where it worked?
This is the config we used to reproduce the problem:
gcloud beta container --project "<redacted>" clusters create "example-cluster-1" \
  --region "europe-west4" \
  --tier "standard" \
  --no-enable-basic-auth \
  --cluster-version "1.31.7-gke.1390000" \
  --release-channel "stable" \
  --machine-type "e2-medium" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-balanced" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --num-nodes "3" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET \
  --enable-ip-alias \
  --network "<redacted>" \
  --subnetwork "<redacted>" \
  --cluster-secondary-range-name "<redacted>" \
  --services-secondary-range-name "<redacted>" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-ip-access \
  --security-posture=standard \
  --workload-vulnerability-scanning=disabled \
  --enable-dataplane-v2 \
  --no-enable-google-cloud-access \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --binauthz-evaluation-mode=DISABLED \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 1000 \
  --min-memory 1 --max-memory 1000 \
  --enable-autoprovisioning-autorepair \
  --enable-autoprovisioning-autoupgrade \
  --autoprovisioning-max-surge-upgrade 1 \
  --autoprovisioning-max-unavailable-upgrade 0 \
  --enable-managed-prometheus \
  --enable-shielded-nodes \
  --shielded-integrity-monitoring \
  --no-shielded-secure-boot \
  --node-locations "europe-west4-a","europe-west4-b"
Hi @hprotzek,
Unfortunately, I have already deleted the test cluster I used to try to reproduce the issue, as it was also consuming resources on my end. What I recommend is creating a new GKE cluster for testing purposes to prevent any downtime to your workloads.