GKE Autopilot Infinite Pod Pending

Hi, I am trying to set up an AI backend on GKE Autopilot with GPUs.

When I set the GPU to nvidia-tesla-t4, the pod gets stuck in Pending forever. When I set the GPU to nvidia-l4 instead, Autopilot reports "scale.up.error.quota.exceeded". My GPU quotas are:

  • t4: 3

  • l4: 1

How should I set this up properly in Kubernetes?

Can you share your Pod manifest?

Here are my Service, Deployment, and PV/PVC YAML:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ckpt-pv
spec:
  # balanced persistent disk
  storageClassName: "standard-rwo"
  capacity:
    storage: 100G
  accessModes:
    - ReadOnlyMany
  claimRef:
    namespace: default
    name: ckpt-pvclaim
  csi:
    driver: pd.csi.storage.gke.io
    # https://cloud.google.com/compute/docs/gpus/create-gpu-vm-accelerator-optimized#limitations
    # regional disk is not supported for nvidia-l4 gpu (g2 vm type)
    volumeHandle: projects/passionboost/zones/us-central1-a/disks/ckpt-zonal
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: default
  name: ckpt-pvclaim
spec:
  storageClassName: "standard-rwo"
  volumeName: ckpt-pv
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100G
---
kind: Service
apiVersion: v1
metadata:
  name: torchserve
  labels:
    app: torchserve
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: /metrics
    prometheus.io/port: '8082'
spec:
  ports:
  - name: preds
    port: 8080
    targetPort: ts
  - name: mdl
    port: 8081
    targetPort: ts-management
  - name: metrics
    port: 8082
    targetPort: ts-metrics
  - name: grpc
    port: 7070
    targetPort: ts-grpc
  selector:
    app: torchserve
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: torchserve
  labels:
    app: torchserve
spec:
  replicas: 1 
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        # cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-accelerator-count: "1"
        # https://cloud.google.com/kubernetes-engine/docs/how-to/gke-zonal-topology#nodeselector-placement
        topology.kubernetes.io/zone: "us-central1-a"
      volumes:
      - name: ckpt-volume
        persistentVolumeClaim:
          claimName: ckpt-pvclaim
          readOnly: true
      containers:
      - name: torchserve
        image: us-central1-docker.pkg.dev/passionboost/autopilot:testing
        command: ["torchserve", "--start", "--models=no-model.mar", "--model-store", "/home/model-server/model-store/", "--ts-config=config.properties"]
        ports:
        - name: ts
          containerPort: 8080
        - name: ts-management
          containerPort: 8081
        - name: ts-metrics
          containerPort: 8082
        - name: ts-grpc
          containerPort: 7070
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - mountPath: /home/model-server/ckpt
            name: ckpt-volume
        resources:
          limits:
            cpu: 4
            memory: 20Gi
            nvidia.com/gpu: 1

When I set the GPU to "nvidia-tesla-t4", I got the following Autopilot log repeatedly, and the pod stays in Pending state forever:

“scale.up.error.out.of.resources”

Maybe you do not have enough storage quota in the T4 case?

For the L4 case, when you go to the quotas page, does it show any L4s in use?

In the T4 case,

Do you mean the disks by "storage"? I don't think so, since I didn't receive any quota-related error messages, and I had already formatted and mounted the target persistent disk.

In the L4 case,

I'll have to double-check that, but I am pretty sure I wasn't using any L4 GPUs elsewhere.

+1, I just used the example manifest from the docs:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-accelerator-count: "1"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1

It's always in Pending state, and this is what I got from the events:

8m34s Warning FailedScaleUp pod/my-gpu-pod Node scale up in zones us-west1-a associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.

I can confirm I have the quota


"GCE out of resources" (scale.up.error.out.of.resources / FailedScaleUp) usually means there is no GPU hardware available in that zone/region at the moment; it's a capacity stockout rather than a quota problem. You'll have to wait until enough GPUs become free, switch to a different region, or try a different GPU type.

https://cloud.google.com/kubernetes-engine/docs/troubleshooting/autopilot-clusters#scaleup-failed-out-of-resources
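For the Deployment earlier in the thread, one way to give the autoscaler more room is to drop the zone pin so any zone in the region with free GPUs can be used. This is only a sketch of the pod template spec, and it assumes the zonal checkpoint disk can be recreated in whichever zone the node ends up in (the current manifest forces us-central1-a because of that disk):

spec:
  nodeSelector:
    # Keep T4, or try the other GPU type if T4s are stocked out.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    # cloud.google.com/gke-accelerator: nvidia-l4   # alternative; still needs enough L4 quota
    cloud.google.com/gke-accelerator-count: "1"
    # topology.kubernetes.io/zone intentionally omitted: without the zone pin,
    # Autopilot can scale up a GPU node in any zone of the region that has
    # capacity. The zonal ckpt disk then has to exist in that same zone.
  containers:
  - name: torchserve
    image: us-central1-docker.pkg.dev/passionboost/autopilot:testing
    resources:
      limits:
        cpu: 4
        memory: 20Gi
        nvidia.com/gpu: 1

If the zonal disk has to stay in us-central1-a, then the remaining options from the troubleshooting page are essentially waiting for capacity in that zone or recreating the cluster and disk in another region.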