I created a GKE Autopilot cluster with default parameters:
gcloud container clusters create-auto test-autopilot-gpu --location=europe-west4 --project=wwl-ml
gcloud container clusters get-credentials test-autopilot-gpu --location=europe-west4 --project=wwl-ml
Then I tried to deploy a GPU pod there, as described in https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus :
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
However, the pod remains in the Pending state indefinitely. Here is its description:
Name: my-gpu-pod
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: autopilot.gke.io/resource-adjustment:
               {"input":{"containers":[{"limits":{"nvidia.com/gpu":"1"},"requests":{"nvidia.com/gpu":"1"},"name":"my-gpu-container"}]},"output":{"contain...
             autopilot.gke.io/warden-version: 2.7.52
             cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-06T05:21:52+0000
             cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status: Pending
IP:
IPs: <none>
Containers:
  my-gpu-container:
    Image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    Port: <none>
    Host Port: <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      while true; do sleep 600; done;
    Limits:
      cpu: 9
      ephemeral-storage: 1Gi
      memory: 60Gi
      nvidia.com/gpu: 1
    Requests:
      cpu: 9
      ephemeral-storage: 1Gi
      memory: 60Gi
      nvidia.com/gpu: 1
    Environment: <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x75h9 (ro)
Conditions:
  Type Status
  PodScheduled False
Volumes:
  kube-api-access-x75h9:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-accelerator=nvidia-tesla-a100
                cloud.google.com/gke-accelerator-count=1
Tolerations: cloud.google.com/gke-accelerator=nvidia-tesla-a100:NoSchedule
             cloud.google.com/machine-family:NoSchedule op=Exists
             kubernetes.io/arch=amd64:NoSchedule
             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 3m13s (x3 over 3m31s) gke.io/optimize-utilization-scheduler no nodes available to schedule pods
  Warning FailedScheduling 2m54s gke.io/optimize-utilization-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal TriggeredScaleUp 2m25s cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/wwl-ml/zones/europe-west4-a/instanceGroups/gk3-test-autopilot-gpu-nap-1y8i627v-a8bfd2df-grp 0->1 (max: 1000)}]
  Warning FailedScaleUp 2m cluster-autoscaler Node scale up in zones europe-west4-a associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
  Normal NotTriggerScaleUp 83s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}, 18 node(s) didn't match Pod's node affinity/selector, 2 in backoff after failed scale-up
  Warning FailedScheduling 82s gke.io/optimize-utilization-scheduler 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Cloud Logging shows the following errors:
[
{
"insertId": "de8e1715-c796-4fbf-a79e-27d6e9d39fed@a1",
"jsonPayload": {
"noDecisionStatus": {
"noScaleUp": {
"unhandledPodGroups": [
{
"rejectedMigs": [
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"nodepool": "pool-4",
"zone": "europe-west4-a",
"name": "gk3-test-autopilot-gpu-pool-4-5cf4acab-grp"
}
},
{
"mig": {
"nodepool": "pool-6",
"name": "gk3-test-autopilot-gpu-pool-6-3e8afa7d-grp",
"zone": "europe-west4-a"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"nodepool": "pool-2",
"name": "gk3-test-autopilot-gpu-pool-2-806b23c7-grp",
"zone": "europe-west4-c"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"zone": "europe-west4-c",
"name": "gk3-test-autopilot-gpu-pool-5-6f0ef260-grp",
"nodepool": "pool-5"
}
},
{
"mig": {
"zone": "europe-west4-b",
"nodepool": "pool-3",
"name": "gk3-test-autopilot-gpu-pool-3-fb1922fe-grp"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"mig": {
"zone": "europe-west4-c",
"name": "gk3-test-autopilot-gpu-pool-1-a065df10-grp",
"nodepool": "pool-1"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-6-e5b063f1-grp",
"nodepool": "pool-6",
"zone": "europe-west4-c"
}
},
{
"mig": {
"zone": "europe-west4-c",
"name": "gk3-test-autopilot-gpu-pool-3-811400b8-grp",
"nodepool": "pool-3"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"mig": {
"name": "gk3-test-autopilot-gpu-pool-1-b989b740-grp",
"zone": "europe-west4-b",
"nodepool": "pool-1"
},
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
}
},
{
"mig": {
"name": "gk3-test-autopilot-gpu-pool-2-fd71e860-grp",
"zone": "europe-west4-a",
"nodepool": "pool-2"
},
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"zone": "europe-west4-a",
"name": "gk3-test-autopilot-gpu-pool-5-fca82bbc-grp",
"nodepool": "pool-5"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-4-9a759708-grp",
"zone": "europe-west4-c",
"nodepool": "pool-4"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"nodepool": "pool-6",
"zone": "europe-west4-b",
"name": "gk3-test-autopilot-gpu-pool-6-cbfebf7e-grp"
}
},
{
"mig": {
"nodepool": "pool-1",
"name": "gk3-test-autopilot-gpu-pool-1-0b03ec88-grp",
"zone": "europe-west4-a"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"mig": {
"name": "gk3-test-autopilot-gpu-default-pool-be071803-grp",
"zone": "europe-west4-a",
"nodepool": "default-pool"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"TaintToleration",
"node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}"
]
}
},
{
"mig": {
"zone": "europe-west4-b",
"nodepool": "pool-4",
"name": "gk3-test-autopilot-gpu-pool-4-2992e3f5-grp"
},
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
}
},
{
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
},
"mig": {
"zone": "europe-west4-b",
"nodepool": "pool-5",
"name": "gk3-test-autopilot-gpu-pool-5-3e1c0e68-grp"
}
},
{
"mig": {
"zone": "europe-west4-a",
"name": "gk3-test-autopilot-gpu-pool-3-b0feff4d-grp",
"nodepool": "pool-3"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-2-4c51e47d-grp",
"nodepool": "pool-2",
"zone": "europe-west4-b"
}
}
],
"napFailureReasons": [
{
"messageId": "no.scale.up.nap.pod.zonal.illegal.config",
"parameters": [
"europe-west4-a"
]
},
{
"parameters": [
"europe-west4-b"
],
"messageId": "no.scale.up.nap.pod.zonal.illegal.config"
},
{
"parameters": [
"europe-west4-c"
],
"messageId": "no.scale.up.nap.pod.zonal.illegal.config"
}
],
"podGroup": {
"totalPodCount": 1,
"samplePod": {
"name": "my-gpu-pod",
"namespace": "default"
}
}
}
],
"skippedMigs": [
{
"mig": {
"nodepool": "nap-1y8i627v",
"zone": "europe-west4-a",
"name": "gk3-test-autopilot-gpu-nap-1y8i627v-a8bfd2df-grp"
},
"reason": {
"messageId": "no.scale.up.mig.skipped",
"parameters": [
"in backoff after failed scale-up"
]
}
},
{
"mig": {
"nodepool": "nap-1y8i627v",
"name": "gk3-test-autopilot-gpu-nap-1y8i627v-5ff53eff-grp",
"zone": "europe-west4-b"
},
"reason": {
"messageId": "no.scale.up.mig.skipped",
"parameters": [
"in backoff after failed scale-up"
]
}
}
],
"unhandledPodGroupsTotalCount": 1
},
"measureTime": "1709702512"
}
},
"resource": {
"type": "k8s_cluster",
"labels": {
"location": "europe-west4",
"project_id": "wwl-ml",
"cluster_name": "test-autopilot-gpu"
}
},
"timestamp": "2024-03-06T05:21:52.968881760Z",
"logName": "projects/wwl-ml/logs/container.googleapis.com%2Fcluster-autoscaler-visibility",
"receiveTimestamp": "2024-03-06T05:21:53.323492805Z"
},
{
"insertId": "3f2686b3-403d-49b9-945a-56bb8b5c7f53@a1",
"jsonPayload": {
"noDecisionStatus": {
"noScaleUp": {
"unhandledPodGroups": [
{
"napFailureReasons": [
{
"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
"parameters": [
"europe-west4-a"
]
},
{
"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
"parameters": [
"europe-west4-b"
]
}
],
"podGroup": {
"totalPodCount": 1,
"samplePod": {
"namespace": "default",
"name": "my-gpu-pod"
}
}
}
],
"unhandledPodGroupsTotalCount": 1
},
"measureTime": "1709702555"
}
},
"resource": {
"type": "k8s_cluster",
"labels": {
"cluster_name": "test-autopilot-gpu",
"project_id": "wwl-ml",
"location": "europe-west4"
}
},
"timestamp": "2024-03-06T05:22:35.463795917Z",
"logName": "projects/wwl-ml/logs/container.googleapis.com%2Fcluster-autoscaler-visibility",
"receiveTimestamp": "2024-03-06T05:22:35.936621027Z"
},
{
"insertId": "7683fa55-47a1-4615-ae07-217dc9552d95@a1",
"jsonPayload": {
"noDecisionStatus": {
"noScaleUp": {
"unhandledPodGroupsTotalCount": 1,
"unhandledPodGroups": [
{
"napFailureReasons": [
{
"parameters": [
"europe-west4-a"
],
"messageId": "no.scale.up.nap.pod.zonal.illegal.config"
},
{
"messageId": "no.scale.up.nap.pod.zonal.illegal.config",
"parameters": [
"europe-west4-b"
]
},
{
"parameters": [
"europe-west4-c"
],
"messageId": "no.scale.up.nap.pod.zonal.illegal.config"
}
],
"podGroup": {
"samplePod": {
"namespace": "default",
"name": "my-gpu-pod"
},
"totalPodCount": 1
},
"rejectedMigs": [
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"nodepool": "pool-1",
"zone": "europe-west4-c",
"name": "gk3-test-autopilot-gpu-pool-1-a065df10-grp"
}
},
{
"mig": {
"zone": "europe-west4-c",
"name": "gk3-test-autopilot-gpu-pool-5-6f0ef260-grp",
"nodepool": "pool-5"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"mig": {
"zone": "europe-west4-c",
"nodepool": "pool-4",
"name": "gk3-test-autopilot-gpu-pool-4-9a759708-grp"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"mig": {
"nodepool": "default-pool",
"zone": "europe-west4-a",
"name": "gk3-test-autopilot-gpu-default-pool-be071803-grp"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"TaintToleration",
"node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}"
]
}
},
{
"mig": {
"nodepool": "pool-6",
"zone": "europe-west4-b",
"name": "gk3-test-autopilot-gpu-pool-6-cbfebf7e-grp"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-6-3e8afa7d-grp",
"zone": "europe-west4-a",
"nodepool": "pool-6"
}
},
{
"mig": {
"nodepool": "pool-4",
"zone": "europe-west4-a",
"name": "gk3-test-autopilot-gpu-pool-4-5cf4acab-grp"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-1-0b03ec88-grp",
"nodepool": "pool-1",
"zone": "europe-west4-a"
}
},
{
"mig": {
"zone": "europe-west4-c",
"nodepool": "pool-2",
"name": "gk3-test-autopilot-gpu-pool-2-806b23c7-grp"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"nodepool": "pool-3",
"name": "gk3-test-autopilot-gpu-pool-3-b0feff4d-grp",
"zone": "europe-west4-a"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"zone": "europe-west4-b",
"nodepool": "pool-2",
"name": "gk3-test-autopilot-gpu-pool-2-4c51e47d-grp"
}
},
{
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-2-fd71e860-grp",
"zone": "europe-west4-a",
"nodepool": "pool-2"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"nodepool": "pool-5",
"name": "gk3-test-autopilot-gpu-pool-5-3e1c0e68-grp",
"zone": "europe-west4-b"
}
},
{
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
},
"mig": {
"zone": "europe-west4-b",
"name": "gk3-test-autopilot-gpu-pool-4-2992e3f5-grp",
"nodepool": "pool-4"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"zone": "europe-west4-c",
"nodepool": "pool-3",
"name": "gk3-test-autopilot-gpu-pool-3-811400b8-grp"
}
},
{
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
]
},
"mig": {
"nodepool": "pool-6",
"zone": "europe-west4-c",
"name": "gk3-test-autopilot-gpu-pool-6-e5b063f1-grp"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"name": "gk3-test-autopilot-gpu-pool-5-fca82bbc-grp",
"nodepool": "pool-5",
"zone": "europe-west4-a"
}
},
{
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
},
"mig": {
"zone": "europe-west4-b",
"nodepool": "pool-3",
"name": "gk3-test-autopilot-gpu-pool-3-fb1922fe-grp"
}
},
{
"mig": {
"nodepool": "pool-1",
"name": "gk3-test-autopilot-gpu-pool-1-b989b740-grp",
"zone": "europe-west4-b"
},
"reason": {
"parameters": [
"NodeAffinity",
"node(s) didn't match Pod's node affinity/selector"
],
"messageId": "no.scale.up.mig.failing.predicate"
}
}
]
}
],
"skippedMigs": [
{
"reason": {
"parameters": [
"in backoff after failed scale-up"
],
"messageId": "no.scale.up.mig.skipped"
},
"mig": {
"zone": "europe-west4-b",
"nodepool": "nap-1n277bso",
"name": "gk3-test-autopilot-gpu-nap-1n277bso-88e2c650-grp"
}
},
{
"reason": {
"messageId": "no.scale.up.mig.skipped",
"parameters": [
"in backoff after failed scale-up"
]
},
"mig": {
"zone": "europe-west4-a",
"nodepool": "nap-1n277bso",
"name": "gk3-test-autopilot-gpu-nap-1n277bso-139213a0-grp"
}
}
]
},
"measureTime": "1709702877"
}
},
"resource": {
"type": "k8s_cluster",
"labels": {
"project_id": "wwl-ml",
"location": "europe-west4",
"cluster_name": "test-autopilot-gpu"
}
},
"timestamp": "2024-03-06T05:27:57.627433959Z",
"logName": "projects/wwl-ml/logs/container.googleapis.com%2Fcluster-autoscaler-visibility",
"receiveTimestamp": "2024-03-06T05:27:58.131601669Z"
}
]
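The log entries above are long, so here is a minimal Python sketch that tallies the failure reasons per zone to make them easier to scan. It assumes only the field layout visible in the entries above (jsonPayload.noDecisionStatus.noScaleUp with napFailureReasons, rejectedMigs and skippedMigs). The function name summarize_no_scale_up and the variable sample are hypothetical, and sample is a trimmed copy of the second log entry, not the full export.

```python
from collections import Counter


def summarize_no_scale_up(entries):
    """Count (messageId, zone) pairs across cluster-autoscaler noScaleUp entries."""
    counts = Counter()
    for entry in entries:
        no_scale_up = (
            entry.get("jsonPayload", {})
            .get("noDecisionStatus", {})
            .get("noScaleUp", {})
        )
        for group in no_scale_up.get("unhandledPodGroups", []):
            # In the entries above, NAP failure reasons carry the zone
            # as their first parameter.
            for reason in group.get("napFailureReasons", []):
                zone = (reason.get("parameters") or ["unknown"])[0]
                counts[(reason["messageId"], zone)] += 1
            # Existing node groups (MIGs) that were rejected for this pod.
            for rejected in group.get("rejectedMigs", []):
                key = (rejected["reason"]["messageId"], rejected["mig"]["zone"])
                counts[key] += 1
        # Node groups skipped entirely, e.g. while in backoff.
        for skipped in no_scale_up.get("skippedMigs", []):
            key = (skipped["reason"]["messageId"], skipped["mig"]["zone"])
            counts[key] += 1
    return counts


# Trimmed copy of the second log entry above, for illustration only.
sample = [
    {
        "jsonPayload": {
            "noDecisionStatus": {
                "noScaleUp": {
                    "unhandledPodGroups": [
                        {
                            "napFailureReasons": [
                                {
                                    "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
                                    "parameters": ["europe-west4-a"],
                                },
                                {
                                    "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
                                    "parameters": ["europe-west4-b"],
                                },
                            ]
                        }
                    ]
                }
            }
        }
    }
]

for (message_id, zone), count in sorted(summarize_no_scale_up(sample).items()):
    print(message_id, zone, count)
```

To run it against the full export, the JSON array above could be loaded with json.load and passed to the same function in place of sample.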
I checked that the quota compute.googleapis.com/nvidia_a100_gpus is set to 1, and went through the other possible solutions described at https://cloud.google.com/kubernetes-engine/docs/troubleshooting/autopilot-clusters#scale-up-failed-serial-port-logging . However, nothing worked. Could you point me to a solution for this problem?