Our test and production clusters were both upgraded to 1.30.8, the test cluster on Jan 13th and the production cluster on Feb 3rd. Immediately following the upgrade, we started seeing pods stuck in the Terminating state, as well as Job pods that show as “running” even though the container no longer exists on the host. We initially thought it was due to the changelog item below:
Kubelet: Fix the volume manager didn't check the device mount state in the actual state of the world before marking the volume as detached. It may cause a pod to be stuck in the Terminating state due to the above issue when it was deleted. (#129063) [SIG Node]
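For context, this is roughly how we confirm the mismatch on an affected node (the pod, namespace, and container names below are placeholders, not our real workload names):

# The API server still reports the Job pod as Running (or stuck Terminating):
kubectl get pod <pod-name> -n <namespace> -o wide

# ...but on the node the pod is scheduled to, the sandbox/container is gone:
crictl pods --name <pod-name>
crictl ps -a --name <container-name>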
However, the bug described there was fixed in 1.30.9. We upgraded our test environment to that version, and it took a couple of days for the issue to resurface.
We then upgraded to 1.31.5, and the issue still happens, though a little less frequently.
Meanwhile, our production cluster started exhibiting this issue as soon as it was upgraded to 1.30.8.
I did see an issue on a GitHub project where a GCP customer said that GCP customer support had recommended upgrading GKE to 1.30.9. We did that as well, but we continue to see this issue.
When I SSH into a node, I can see that the error pertains to stopped containers, but deleting the stopped containers yields the same result:
ctr -n k8s.io tasks delete a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b
WARN[0000] DEPRECATION: The `mirrors` property of `[plugins."io.containerd.grpc.v1.cri".registry]` is deprecated since containerd v1.5 and will be removed in containerd v2.1. Use `config_path` instead.
ERRO[0005] unable to delete a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b error="failed to delete task: failed rootfs umount: failed to unmount target /run/containerd/io.containerd.runtime.v2.task/k8s.io/a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b/rootfs: device or resource busy: unknown"
I haven’t been able to pinpoint which process is using that volume mount; lsof shows it’s the containerd process itself.
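For reference, these are roughly the checks I’ve been running against the rootfs path from the error above (the fuser and mountinfo variants are included for completeness):

# Show which processes have the mount point in use:
fuser -vm /run/containerd/io.containerd.runtime.v2.task/k8s.io/a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b/rootfs

# Show which processes' mount namespaces still reference the container ID:
grep -l a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b /proc/*/mountinfo

# lsof against the same path (this is the check that points back at containerd):
lsof /run/containerd/io.containerd.runtime.v2.task/k8s.io/a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b/rootfs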
Restarting the containerd service does seem to clean up the orphaned containers:
Feb 18 20:31:55 gke-applications-applications2-9c570d6b-56mf containerd[2276158]: time="2025-02-18T20:31:55.411309449Z" level=info msg="RemoveContainer for \"a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b\""
Feb 18 20:31:55 gke-applications-applications2-9c570d6b-56mf containerd[2276158]: time="2025-02-18T20:31:55.421683369Z" level=info msg="RemoveContainer for \"a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b\" returns successfully"
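For completeness, the restart-and-verify sequence on an affected node looks roughly like this:

systemctl restart containerd

# Confirm the stale rootfs mount is gone:
mount | grep a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b

# Confirm containerd logged the RemoveContainer for the stuck ID:
journalctl -u containerd --since "10 minutes ago" | grep RemoveContainer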
If this is a known issue with a particular Kubernetes version, would you be able to make a recommendation? Are there any additional logs I can provide to help troubleshoot this issue?
Some containerd issues suggest that a scan of some sort might be causing this. We have the following monitoring tools in place: GKE Enterprise with threat detection (and image scanning), Falco, and Datadog. All of these were in place long before the upgrade.
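In case it is relevant, this is the kind of check I can run to see whether any of those agents are still holding the stuck rootfs mount (the process name patterns are guesses for our particular deployment, not confirmed culprits):

CONTAINER_ID=a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b

# For each candidate agent process, check whether its mount table still references the container:
for pid in $(pgrep -f 'falco|datadog|gke'); do
  if grep -q "$CONTAINER_ID" /proc/"$pid"/mountinfo 2>/dev/null; then
    ps -o pid,comm,args -p "$pid"
  fi
done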