Trouble allocating GPUs to GKE cluster

Hi,

I’m unable to allocate GPU resources for a Pod on a GKE cluster because the cluster autoscaler is failing to scale up the Nodes required to run the Pod. When I inspect the Pod, GKE reports that I’m exceeding a GCE quota, so the node never comes up and the Pod stays unscheduled. However, the event doesn’t say which resource I exceeded, and when I go to the Quotas page, I don’t see any resource over its limit.

I’m not sure if there is a bug in my setup, or if Google Cloud has run out of GPUs in my cluster’s region. I tried switching regions from us-west4 to us-west1, but I saw the same error.

Could someone please help point me in the right direction?

Below is the output from kubectl describe pod:

Below is my manifest file:

Depending on your account, it’s possible that you don’t have quota to use any GPUs.
You should check the quota for “GPUs (all regions)”.

My personal project:

One of my work projects:
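
If it’s easier than hunting through the console, you can also pull the project-wide number from the CLI. A rough sketch (substitute your own project ID; the grep just trims the YAML output around the GPUS_ALL_REGIONS metric):

    gcloud compute project-info describe --project MY_PROJECT_ID \
        | grep -B 1 -A 1 GPUS_ALL_REGIONS

The limit/usage pair it prints is the same “GPUs (all regions)” quota shown on the Quotas page.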

Thanks @garisingh. I’ve checked my quotas via the Quotas & System Limits page, and it says that I have 2 GPUs available.

I’m only requesting one GPU when deploying my Pod, yet it says I’m exceeding a GCE quota limit and the Pod fails to start. The Pod runs perfectly (and instantly) when I remove the GPU request from my manifest file.
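
For anyone comparing notes, the request itself is nothing exotic; the minimal shape of a single-T4 request on GKE (per the GKE docs) is roughly the following. The Pod name, container name, image, and command here are placeholders, not my exact file, and the nodeSelector depends on how your node pools are set up:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test                                         # placeholder name
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4    # ask for a node with a T4 attached
      containers:
      - name: cuda-check
        image: nvidia/cuda:11.0.3-base-ubuntu20.04           # placeholder; any image works for the scheduling test
        command: ["sleep", "infinity"]                       # keep the container alive
        resources:
          limits:
            nvidia.com/gpu: 1                                 # the single GPU being requested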

Any other ideas?

Quick rant: getting GPU-accelerated Pods up and running on GKE has been a nightmare because of this issue… I’ve lost so many hours. I’ve carefully followed all of the documentation and double-checked my quotas, but the autoscaler seems completely unreliable, and neither the console nor the Pod events give anything useful for identifying which resource is being exceeded or what else to try.

What GPU type are you trying to use?

Sorry, I see that it’s a T4. Can you check your T4 GPU quota (not the “GPUs (all regions)” quota) to see whether you have quota for T4s in that region?
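
If you want to check it from the CLI, something along these lines should show the regional metric (the grep just trims the YAML around the NVIDIA_T4_GPUS entry):

    gcloud compute regions describe us-west1 \
        | grep -B 1 -A 1 NVIDIA_T4_GPUS

If the limit there is 0, the per-model quota is what’s blocking the scale-up even though “GPUs (all regions)” looks fine.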
