Until Saturday (Aug 30, 2025) my GKE cluster was running fine with GPU time-sharing enabled: I could schedule 2 pods per NVIDIA L4 GPU node (using maxSharedClientsPerGpu=2). After the cluster auto-upgraded to v1.33.3-gke.1136000, every GPU node advertises only Allocatable: 1 for the GPU resource, and I can no longer co-schedule multiple pods per GPU.
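For reference, here is a minimal sketch of how the time-shared pool is set up and how I check what the nodes advertise. The cluster, pool, zone and machine type below are placeholders, not my real values:

```
# Roughly how the time-shared L4 pool was created; with
# max-shared-clients-per-gpu=2 each node should advertise
# Allocatable: nvidia.com/gpu: 2 (count x clients per GPU).
gcloud container node-pools create l4-timeshare-pool \
  --cluster=my-cluster --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator="type=nvidia-l4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=2,gpu-driver-version=default"

# What each GPU node actually advertises after the upgrade.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4 \
  -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```

Since the upgrade, the second command shows 1 for every node instead of 2.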
Since the same time, I have also been getting events from cluster-autoscaler (which I never saw before) stating "max cluster nvidia-l4 limit reached", which doesn't make sense to me, considering the quota (32) is much higher than the number of GPUs actually running (1).
We seem to have the same problem!
Node pools with GPUs are not scaling up, with events saying: "2 max cluster nvidia-l4 limit reached, 2 not ready for scale-up, 6 max cluster nvidia-tesla-t4 limit reached".
Our project quota is at 3 / 32.
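For anyone comparing numbers, this is roughly how we check both places; as far as I understand, the "max cluster ... limit reached" events refer to the cluster autoscaler's own resource limits (the ones configured for node auto-provisioning), not the Compute Engine quota, so both are worth a look. Region, zone and cluster name are placeholders:

```
# Compute Engine quota and usage for L4 GPUs in the region.
gcloud compute regions describe us-central1 --format=json \
  | grep -B1 -A1 "NVIDIA_L4_GPUS"

# Cluster-level autoscaling resource limits, which appear to be what the
# "max cluster nvidia-l4 limit reached" event is actually checking
# (only present if autoscaling resource limits / NAP are configured).
gcloud container clusters describe my-cluster --zone=us-central1-a \
  --format="yaml(autoscaling.resourceLimits)"
```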

I just spoke to support. The workaround they recommend is to either downgrade the node pool or create a new one on a version lower than 1.33.3.
They apparently have multiple clients hitting this issue, and engineering is working on a fix.
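In case it helps anyone else, here is a rough sketch of the second option (a new time-shared pool pinned below 1.33.3). Names, zone, machine type and the exact patch version are placeholders; pick a version from the valid node versions listed for your location:

```
# List node versions you can still pin a new pool to.
gcloud container get-server-config --zone=us-central1-a \
  --format="yaml(validNodeVersions)"

# Replacement time-shared L4 pool on a pre-1.33.3 version.
# --no-enable-autoupgrade keeps it from being bumped again, though this
# may not be allowed on clusters enrolled in a release channel.
gcloud container node-pools create l4-timeshare-132 \
  --cluster=my-cluster --zone=us-central1-a \
  --node-version=1.32.X-gke.XXXXX \
  --machine-type=g2-standard-8 \
  --accelerator="type=nvidia-l4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=2,gpu-driver-version=default" \
  --no-enable-autoupgrade
```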