GKE ComputeClass with GPU

We have the following ComputeClass in a GKE 1.31.7-gke.1265000 Standard cluster.

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: nvidia-l4-1
spec:
  priorities:
    - machineFamily: g2
      gpu:
        type: nvidia-l4
        count: 1
  nodePoolAutoCreation:
    enabled: true
  autoscalingPolicy:
    consolidationDelayMinutes: 10
    consolidationThreshold: 70
    gpuConsolidationThreshold: 100
  activeMigration:
    optimizeRulePriority: true
  whenUnsatisfiable: DoNotScaleUp

As soon as we deploy a workload with this class:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
  namespace: bla
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      containers:
      - image: stefanprodan/podinfo:latest
        imagePullPolicy: Always
        name: gpu-example
        resources:
          limits:
            cpu: 200m
            memory: 128Mi
            nvidia.com/gpu: "1"
          requests:
            cpu: 200m
            memory: 128Mi
            nvidia.com/gpu: "1"
      nodeSelector:
        cloud.google.com/compute-class: nvidia-l4-1
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Equal
        value: present
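
Both manifests are applied with plain kubectl (the file names here are just placeholders for our local copies):

kubectl apply -f computeclass.yaml
kubectl apply -f deployment.yaml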

The ComputeClass becomes unhealthy:

Status:
  Conditions:
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Nodepool nap-g2-standard-8-gpu1-ehtlyb5i has an Crd label but doesn't match any priority rule.
    Reason:                NoRuleMatching
    Status:                True
    Type:                  NodepoolMisconfigured
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Nodepool nap-g2-standard-8-gpu1-ehtlyb5i doesn't match any priority rule and Crd is configured to not scale up in that case
    Reason:                NodepoolWillNeverScaleUp
    Status:                True
    Type:                  NodepoolMisconfigured
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Crd is not healthy.
    Reason:                Health
    Status:                False
    Type:                  Health
Events:                    <none>
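
For reference, the status above is what kubectl reports for the class; we read it with something like:

kubectl describe computeclass nvidia-l4-1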

Even though it says NodepoolWillNeverScaleUp, it does scale up and run the workload.
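
The auto-created GPU node also shows up with the compute-class node label, which we can check with something like:

kubectl get nodes -l cloud.google.com/compute-class=nvidia-l4-1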

Also, scaling the app creates new node pools instead of adding more nodes to the existing node pool.
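
For example, scaling the deployment and then listing the node pools (cluster name and region are placeholders here) shows several nap-g2-... pools instead of one pool with more nodes:

kubectl -n bla scale deployment gpu-example --replicas=3
gcloud container node-pools list --cluster <cluster-name> --region <region>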

What we’re observing is that GPU nodepools remain in the system even after scaling down to 0 nodes. These empty nodepools, along with their associated managed instance groups and templates, continue to accumulate over time. Eventually, this accumulation hits our quota limits, preventing any further scaling operations in the cluster.

The issue appears to be in how GKE handles cleanup of these resources.
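
A manual cleanup looks roughly like this (pool name taken from the status above; cluster name and region are placeholders); deleting a node pool also removes its managed instance group and instance template:

gcloud container node-pools delete nap-g2-standard-8-gpu1-ehtlyb5i --cluster <cluster-name> --region <region>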

Hi hprotzek,

Welcome to Google Cloud Community!

Have you tried upgrading your cluster version? Based on your current setup, the cluster version you’re using is outdated and no longer available in any release channel. I recommend upgrading your cluster to GKE version 1.32.2-gke.1400000 or newer.
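
If you want to try that, the control plane can be upgraded with something like the following (cluster name and location are placeholders; node pools follow via auto-upgrade or a separate upgrade command):

gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.32.2-gke.1400000 --region <region>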

I attempted to reproduce your issue using the same ComputeClass configuration, and the CRD status appeared as "Healthy" in my setup.

If the issue still persists, please feel free to reach out to our Google Cloud Support.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

We tested with 1.31.7-gke.1390000 and 1.32.2-gke.1297002, with the same result. I already opened an issue, including the steps to reproduce the problem: https://issuetracker.google.com/issues/423939362

@francislouie Could you share the parameters of the test cluster where it worked?

This is the config we used to reproduce the problem:

gcloud beta container --project "<redacted>" clusters create "example-cluster-1" \
  --region "europe-west4" \
  --tier "standard" \
  --no-enable-basic-auth \
  --cluster-version "1.31.7-gke.1390000" \
  --release-channel "stable" \
  --machine-type "e2-medium" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-balanced" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --num-nodes "3" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET \
  --enable-ip-alias \
  --network "<redacted>" \
  --subnetwork "<redacted>" \
  --cluster-secondary-range-name "<redacted>" \
  --services-secondary-range-name "<redacted>" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-ip-access \
  --security-posture=standard \
  --workload-vulnerability-scanning=disabled \
  --enable-dataplane-v2 \
  --no-enable-google-cloud-access \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --binauthz-evaluation-mode=DISABLED \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 1000 \
  --min-memory 1 --max-memory 1000 \
  --enable-autoprovisioning-autorepair \
  --enable-autoprovisioning-autoupgrade \
  --autoprovisioning-max-surge-upgrade 1 \
  --autoprovisioning-max-unavailable-upgrade 0 \
  --enable-managed-prometheus \
  --enable-shielded-nodes \
  --shielded-integrity-monitoring \
  --no-shielded-secure-boot \
  --node-locations "europe-west4-a","europe-west4-b"
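
After the cluster is up, we apply the ComputeClass and the Deployment from the first post and scale the workload up and back down; the empty nap-* node pools (plus their instance groups and templates) are then left behind. Roughly:

kubectl -n bla scale deployment gpu-example --replicas=3
kubectl -n bla scale deployment gpu-example --replicas=0
gcloud container node-pools list --cluster example-cluster-1 --region europe-west4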

Hi @hprotzek,

Unfortunately, I have already deleted the test cluster where I tried to reproduce the issue, as it was also consuming resources on my end. What I recommend is creating a new GKE cluster for testing purposes to avoid any downtime to your workloads.