I am unable to create a node pool with N1 machine types and T4 GPUs. I tried changing the region and switching to another project in us-central1, but even then the request could not be fulfilled: only 2 of the 3 requested nodes were created.
I have two questions:
Is there a way to determine which region has the highest GPU availability?
If we manage to allocate nodes with GPUs on GKE, in the event of node restarts due to maintenance/updates performed by GCP itself, would the GPU remain allocated to us, or is there a risk of losing them?
The issue you’re facing is likely due to insufficient GPU quota or resource availability in the selected region or zone. This can happen even in regions like us-central1, which typically have higher resource availability.
Start by verifying your GPU quota (the NVIDIA_T4_GPUS metric in the target region, plus the CPU and persistent-disk quotas for the node pool's machine type) and confirming that T4 GPUs are actually offered in the specific zones your node pool uses. Note that quota and availability are separate things: you can have sufficient quota and still hit a transient capacity shortage (a "stockout") in a given zone, which matches the symptom of 2 out of 3 nodes coming up.
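As a sketch, assuming the gcloud CLI is installed and authenticated against your project (the region name is a placeholder), you can check both from the command line:

```shell
# List the zones where T4 GPUs are offered (this reflects where the
# accelerator type exists, not real-time capacity).
gcloud compute accelerator-types list \
    --filter="name=nvidia-tesla-t4" \
    --format="table(zone, name)"

# Show the T4 quota (limit vs. current usage) for a region; the
# grep context lines pull in the limit/usage fields around the metric.
gcloud compute regions describe us-central1 \
    --format="json(quotas)" | grep -B2 -A2 NVIDIA_T4
```

If the limit is 0 or lower than the node count you need, request a quota increase from the IAM & Admin > Quotas page before retrying the node pool.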
If you have already checked quotas and tried other zones and regions without success, you can contact Google Cloud Support for further assistance on capacity in specific regions.
To find where T4 GPUs are offered, refer to Google Cloud's GPU regions and zones page, which lists the zones in which each GPU model is available. Be aware that this page shows where GPUs are offered, not live capacity; Google Cloud does not publish a real-time per-zone availability checker, so spreading your node pool across several of the listed zones is the practical way to improve your chances.
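There is no official ranking of regions by GPU availability, but as a rough proxy you can count how many zones in each region offer T4s; more zones means more places for GKE to find capacity. A sketch, assuming gcloud and a POSIX shell:

```shell
# Count T4-capable zones per region: take the zone names, strip the
# trailing zone letter to get the region, then tally and sort.
gcloud compute accelerator-types list \
    --filter="name=nvidia-tesla-t4" \
    --format="value(zone)" \
  | cut -d- -f1-2 | sort | uniq -c | sort -rn
```

Regions at the top of this list give a multi-zone node pool the most fallback options, though actual capacity still varies moment to moment.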
Regarding your second question: GPU VMs do not support live migration, so when GCP performs host maintenance or GKE upgrades a node, that node is terminated and recreated, and the replacement node must acquire GPUs again from the zone's capacity at that moment. In practice the replacement usually succeeds, but if the zone is short on T4s at that time, the new node can fail to come up, so there is a risk of temporarily losing GPU capacity. If you need a guarantee, create a capacity reservation for the GPU machine type; reserved capacity stays assigned to your project across node recreations.
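A minimal sketch of reserving T4 capacity, assuming gcloud is authenticated (the reservation name, zone, node count, and machine type are placeholders you would adapt):

```shell
# Reserve capacity for 3 N1 VMs with one T4 GPU each. While the
# reservation exists, this capacity is held for your project even
# if the zone is otherwise sold out.
gcloud compute reservations create t4-reservation \
    --zone=us-central1-a \
    --vm-count=3 \
    --machine-type=n1-standard-4 \
    --accelerator=count=1,type=nvidia-tesla-t4
```

A GKE node pool can then consume this capacity by passing `--reservation-affinity=specific --reservation=t4-reservation` (and a matching machine type and zone) when you create the node pool. Note that reservations are billed whether or not the VMs are running.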