GKE ComputeClass with GPU

We have the following ComputeClass in a GKE 1.31.7-gke.1265000 Standard cluster.

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: nvidia-l4-1
spec:
  priorities:
    - machineFamily: g2
      gpu:
        type: nvidia-l4
        count: 1
  nodePoolAutoCreation:
    enabled: true
  autoscalingPolicy:
    consolidationDelayMinutes: 10
    consolidationThreshold: 70
    gpuConsolidationThreshold: 100
  activeMigration:
    optimizeRulePriority: true
  whenUnsatisfiable: DoNotScaleUp

As soon as we deploy a workload with this class:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
  namespace: bla
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      containers:
      - image: stefanprodan/podinfo:latest
        imagePullPolicy: Always
        name: gpu-example
        resources:
          limits:
            cpu: 200m
            memory: 128Mi
            nvidia.com/gpu: "1"
          requests:
            cpu: 200m
            memory: 128Mi
            nvidia.com/gpu: "1"
      nodeSelector:
        cloud.google.com/compute-class: nvidia-l4-1
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Equal
        value: present
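
Both manifests are applied with plain kubectl (the file names here are just placeholders for our local copies):

kubectl apply -f computeclass.yaml
kubectl apply -f deployment.yaml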

The ComputeClass becomes unhealthy:

Status:
  Conditions:
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Nodepool nap-g2-standard-8-gpu1-ehtlyb5i has an Crd label but doesn't match any priority rule.
    Reason:                NoRuleMatching
    Status:                True
    Type:                  NodepoolMisconfigured
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Nodepool nap-g2-standard-8-gpu1-ehtlyb5i doesn't match any priority rule and Crd is configured to not scale up in that case
    Reason:                NodepoolWillNeverScaleUp
    Status:                True
    Type:                  NodepoolMisconfigured
    Last Transition Time:  2025-06-10T13:34:03Z
    Message:               Crd is not healthy.
    Reason:                Health
    Status:                False
    Type:                  Health
Events:                    <none>
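
For reference, the status above is what kubectl reports for the class; we read it with something like:

kubectl describe computeclass nvidia-l4-1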

Even though it says NodepoolWillNeverScaleUp, it does scale up and run the workload.
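
The auto-created GPU node also shows up with the compute-class node label, which we can check with something like:

kubectl get nodes -l cloud.google.com/compute-class=nvidia-l4-1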

Also, scaling the app creates new node pools instead of adding more nodes to the existing node pool.
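
For example, scaling the deployment and then listing the node pools (cluster name and region are placeholders here) shows several nap-g2-... pools instead of one pool with more nodes:

kubectl -n bla scale deployment gpu-example --replicas=3
gcloud container node-pools list --cluster <cluster-name> --region <region>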

What we’re observing is that GPU nodepools remain in the system even after scaling down to 0 nodes. These empty nodepools, along with their associated managed instance groups and templates, continue to accumulate over time. Eventually, this accumulation hits our quota limits, preventing any further scaling operations in the cluster.

The issue appears to be in how GKE handles cleanup of these resources.
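
A manual cleanup looks roughly like this (pool name taken from the status above; cluster name and region are placeholders); deleting a node pool also removes its managed instance group and instance template:

gcloud container node-pools delete nap-g2-standard-8-gpu1-ehtlyb5i --cluster <cluster-name> --region <region>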

Hi hprotzek,

Welcome to Google Cloud Community!

Have you tried upgrading your cluster version? Based on your current setup, the cluster version you’re using is outdated and no longer available in any release channel. I recommend upgrading your cluster to GKE version 1.32.2-gke.1400000 or newer.
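
If you want to try that, the control plane can be upgraded with something like the following (cluster name and location are placeholders; node pools follow via auto-upgrade or a separate upgrade command):

gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.32.2-gke.1400000 --region <region>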

I attempted to reproduce your issue using the same ComputeClass configuration, and the CRD status appeared as "Healthy" in my setup.

If the issue still persists, please feel free to reach out to our Google Cloud Support.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

We tested with 1.31.7-gke.1390000 and 1.32.2-gke.1297002, with the same result. I already opened an issue, including the steps to reproduce the problem: https://issuetracker.google.com/issues/423939362

@francislouie Could you share the parameters of the test cluster where it worked?

This is the config we used to reproduce the problem:

gcloud beta container --project "<redacted>" clusters create "example-cluster-1" \
  --region "europe-west4" \
  --tier "standard" \
  --no-enable-basic-auth \
  --cluster-version "1.31.7-gke.1390000" \
  --release-channel "stable" \
  --machine-type "e2-medium" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-balanced" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --num-nodes "3" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET \
  --enable-ip-alias \
  --network "<redacted>" \
  --subnetwork "<redacted>" \
  --cluster-secondary-range-name "<redacted>" \
  --services-secondary-range-name "<redacted>" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-ip-access \
  --security-posture=standard \
  --workload-vulnerability-scanning=disabled \
  --enable-dataplane-v2 \
  --no-enable-google-cloud-access \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --binauthz-evaluation-mode=DISABLED \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 1000 \
  --min-memory 1 --max-memory 1000 \
  --enable-autoprovisioning-autorepair \
  --enable-autoprovisioning-autoupgrade \
  --autoprovisioning-max-surge-upgrade 1 \
  --autoprovisioning-max-unavailable-upgrade 0 \
  --enable-managed-prometheus \
  --enable-shielded-nodes \
  --shielded-integrity-monitoring \
  --no-shielded-secure-boot \
  --node-locations "europe-west4-a","europe-west4-b"
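
After the cluster is up, we apply the ComputeClass and the Deployment from the first post and scale the workload up and back down; the empty nap-* node pools (plus their instance groups and templates) are then left behind. Roughly:

kubectl -n bla scale deployment gpu-example --replicas=3
kubectl -n bla scale deployment gpu-example --replicas=0
gcloud container node-pools list --cluster example-cluster-1 --region europe-west4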

Hi @hprotzek,

Unfortunately, I have already deleted the test cluster where I tried to reproduce the issue, as it was also consuming resources on my end. What I recommend is creating a new GKE cluster for testing purposes to avoid any downtime to your workloads.