GKE Autopilot Infinite Pod Pending

Hi, I am trying to set up an AI backend on GKE Autopilot with GPUs.

When I set the GPU to nvidia-tesla-t4, the pod gets stuck in Pending forever. When I set the GPU to nvidia-l4 instead, Autopilot reports "scale.up.error.quota.exceeded". My GPU quotas are:

  • t4: 3

  • l4: 1

How should I set this up properly in Kubernetes?

Can you share your Pod manifest?

Here are my Service, Deployment, and PV/PVC YAML:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ckpt-pv
spec:
  # balanced persistent disk
  storageClassName: "standard-rwo"
  capacity:
    storage: 100G
  accessModes:
    - ReadOnlyMany
  claimRef:
    namespace: default
    name: ckpt-pvclaim
  csi:
    driver: pd.csi.storage.gke.io
    # https://cloud.google.com/compute/docs/gpus/create-gpu-vm-accelerator-optimized#limitations
    # regional disk is not supported for nvidia-l4 gpu (g2 vm type)
    volumeHandle: projects/passionboost/zones/us-central1-a/disks/ckpt-zonal
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: default
  name: ckpt-pvclaim
spec:
  storageClassName: "standard-rwo"
  volumeName: ckpt-pv
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100G
---
kind: Service
apiVersion: v1
metadata:
  name: torchserve
  labels:
    app: torchserve
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: /metrics
    prometheus.io/port: '8082'
spec:
  ports:
  - name: preds
    port: 8080
    targetPort: ts
  - name: mdl
    port: 8081
    targetPort: ts-management
  - name: metrics
    port: 8082
    targetPort: ts-metrics
  - name: grpc
    port: 7070
    targetPort: ts-grpc
  selector:
    app: torchserve
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: torchserve
  labels:
    app: torchserve
spec:
  replicas: 1 
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        # cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-accelerator-count: "1"
        # https://cloud.google.com/kubernetes-engine/docs/how-to/gke-zonal-topology#nodeselector-placement
        topology.kubernetes.io/zone: "us-central1-a"
      volumes:
      - name: ckpt-volume
        persistentVolumeClaim:
          claimName: ckpt-pvclaim
          readOnly: true
      containers:
      - name: torchserve
        image: us-central1-docker.pkg.dev/passionboost/autopilot:testing
        command: ["torchserve", "--start", "--models=no-model.mar", "--model-store", "/home/model-server/model-store/", "--ts-config=config.properties"]
        ports:
        - name: ts
          containerPort: 8080
        - name: ts-management
          containerPort: 8081
        - name: ts-metrics
          containerPort: 8082
        - name: ts-grpc
          containerPort: 7070
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - mountPath: /home/model-server/ckpt
            name: ckpt-volume
        resources:
          limits:
            cpu: 4
            memory: 20Gi
            nvidia.com/gpu: 1

When I set the GPU to "nvidia-tesla-t4", I got the following Autopilot log repeatedly, and the pod stays in Pending state forever:

“scale.up.error.out.of.resources”

Maybe you do not have enough storage quota in the T4 case?

For the L4 case, when you go to the quotas page, does it show any L4s in use?

In the T4 case,

Do you mean the disks by "storage"? I don't think so, since I didn't receive any quota-related error messages, and I had already formatted and mounted the target persistent disk.

In the L4 case,

I'll have to double-check that, but I am pretty sure I wasn't using any L4 GPUs elsewhere.

+1, I just used the example manifest from the docs:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-accelerator-count: "1"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1

It's always in Pending state, and this is what I got from the events:

8m34s Warning FailedScaleUp pod/my-gpu-pod Node scale up in zones us-west1-a associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.

I can confirm I have the quota


"GCE out of resources" (scale.up.error.out.of.resources / FailedScaleUp) usually means there is no GPU hardware available in that zone/region at the moment; it's a capacity stockout rather than a quota problem. You'll have to wait until enough GPUs become free, switch to a different region, or try a different GPU type.

https://cloud.google.com/kubernetes-engine/docs/troubleshooting/autopilot-clusters#scaleup-failed-out-of-resources
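For the Deployment earlier in the thread, one way to give the autoscaler more room is to drop the zone pin so any zone in the region with free GPUs can be used. This is only a sketch of the pod template spec, and it assumes the zonal checkpoint disk can be recreated in whichever zone the node ends up in (the current manifest forces us-central1-a because of that disk):

spec:
  nodeSelector:
    # Keep T4, or try the other GPU type if T4s are stocked out.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    # cloud.google.com/gke-accelerator: nvidia-l4   # alternative; still needs enough L4 quota
    cloud.google.com/gke-accelerator-count: "1"
    # topology.kubernetes.io/zone intentionally omitted: without the zone pin,
    # Autopilot can scale up a GPU node in any zone of the region that has
    # capacity. The zonal ckpt disk then has to exist in that same zone.
  containers:
  - name: torchserve
    image: us-central1-docker.pkg.dev/passionboost/autopilot:testing
    resources:
      limits:
        cpu: 4
        memory: 20Gi
        nvidia.com/gpu: 1

If the zonal disk has to stay in us-central1-a, then the remaining options from the troubleshooting page are essentially waiting for capacity in that zone or recreating the cluster and disk in another region.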