My Kubernetes cluster running on GKE Autopilot has had an unhealthy node for a few days. The node has a Ready status, but all the pods running on it have a CreateContainerError status and seem to be stuck in a container image pull loop.
Example:
Normal  Pulled  2m4s (x26987 over 4d1h)  kubelet  Container image "gke.gcr.io/cluster-proportional-autoscaler:v1.8.10-gke.3@sha256:274afbfd520aef0933f1fefabddbb33144700982965f9e3632caabb055e912c6" already present on machine
Something went wrong with the node. I suspect it's because I upgraded Kubernetes and my account ran out of SSD quota during the upgrade. I have since gotten more quota, new nodes were created, and the upgrade completed. It could also be unrelated.
I cordoned the node to mark it unschedulable and manually deleted my pods from it. New pods got scheduled on healthier nodes, so it's not too bad and I could live with one broken node.
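For reference, this is roughly what I ran (the node and pod names below are placeholders, not my real ones):

```
# Mark the node unschedulable so no new pods get scheduled on it
kubectl cordon gk3-my-cluster-default-pool-abcdef12-xyz9

# Delete my own workload pods that were still running on that node
kubectl delete pod my-app-7d9f8b6c5d-abcde -n my-namespace
```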
But I want to clean up. The old pods I deleted were stuck in a Terminating state, but force deleting them made them disappear.
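The force delete that made them disappear was along these lines (again with a placeholder pod name):

```
# Skip the grace period for a pod stuck in Terminating
kubectl delete pod my-app-7d9f8b6c5d-abcde -n my-namespace --grace-period=0 --force
```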
I cannot do the same in the kube-system and gke-gmp-system namespaces. I still see the "managed" pods there with a CreateContainerError status, and they keep pulling container images in a loop. One is also stuck in a Terminating state.
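This is roughly how I check which managed pods are still bound to the broken node (placeholder node name):

```
# List system pods still scheduled on the unhealthy node
kubectl get pods -n kube-system --field-selector spec.nodeName=gk3-my-cluster-default-pool-abcdef12-xyz9
kubectl get pods -n gke-gmp-system --field-selector spec.nodeName=gk3-my-cluster-default-pool-abcdef12-xyz9
```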
I would like to remove this node, so I drained it as the documentation describes, but a few days later it's still there.
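The drain command I used was roughly the one from the Kubernetes docs (placeholder node name):

```
# Evict the remaining pods from the node; DaemonSet-managed pods cannot be evicted, so they are ignored
kubectl drain gk3-my-cluster-default-pool-abcdef12-xyz9 --ignore-daemonsets --delete-emptydir-data
```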
How can I remove this unhealthy node?