Our test and production clusters were both upgraded to 1.30.8, the test cluster on Jan 13th and the production cluster on Feb 3rd. Immediately following the upgrade, we started seeing pods stuck in the Terminating state, as well as Job pods that show as “running” even though the container no longer exists on the host. We initially thought it was due to the changelog item below:
Kubelet: Fix the volume manager didn't check the device mount state in the actual state of the world before marking the volume as detached. It may cause a pod to be stuck in the Terminating state due to the above issue when it was deleted. (#129063) [SIG Node]
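For context, this is roughly how we confirm the mismatch on an affected node (the pod, namespace, and container names below are placeholders, not our real workload names):

# The API server still reports the Job pod as Running (or stuck Terminating):
kubectl get pod <pod-name> -n <namespace> -o wide

# ...but on the node the pod is scheduled to, the sandbox/container is gone:
crictl pods --name <pod-name>
crictl ps -a --name <container-name>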
However, the bug described there was fixed in 1.30.9. We upgraded our test environment to that version, and it took a couple of days for the issue to resurface.
We then upgraded to 1.31.5, and the issue still happens, though a little less frequently.
Meanwhile, our production cluster started exhibiting this issue as soon as it was upgraded to 1.30.8.
I did see an issue on a GitHub project where a GCP customer said that GCP customer support had recommended upgrading GKE to 1.30.9. We did that as well, but we continue to see this issue.
When I SSH into a node, I can see that the error pertains to stopped containers, but deleting the stopped containers yields the same result:
ctr -n k8s.io tasks delete a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b
WARN[0000] DEPRECATION: The `mirrors` property of `[plugins."io.containerd.grpc.v1.cri".registry]` is deprecated since containerd v1.5 and will be removed in containerd v2.1. Use `config_path` instead.
ERRO[0005] unable to delete a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b error="failed to delete task: failed rootfs umount: failed to unmount target /run/containerd/io.containerd.runtime.v2.task/k8s.io/a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b/rootfs: device or resource busy: unknown"
I haven’t been able to pinpoint which process is using that volume mount; lsof shows it’s the containerd process itself.
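For reference, these are roughly the checks I’ve been running against the rootfs path from the error above (the fuser and mountinfo variants are included for completeness):

# Show which processes have the mount point in use:
fuser -vm /run/containerd/io.containerd.runtime.v2.task/k8s.io/a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b/rootfs

# Show which processes' mount namespaces still reference the container ID:
grep -l a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b /proc/*/mountinfo

# lsof against the same path (this is the check that points back at containerd):
lsof /run/containerd/io.containerd.runtime.v2.task/k8s.io/a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b/rootfs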
Restarting the containerd service does seem to clean up the orphaned containers:
Feb 18 20:31:55 gke-applications-applications2-9c570d6b-56mf containerd[2276158]: time="2025-02-18T20:31:55.411309449Z" level=info msg="RemoveContainer for \"a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b\""
Feb 18 20:31:55 gke-applications-applications2-9c570d6b-56mf containerd[2276158]: time="2025-02-18T20:31:55.421683369Z" level=info msg="RemoveContainer for \"a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b\" returns successfully"
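For completeness, the restart-and-verify sequence on an affected node looks roughly like this:

systemctl restart containerd

# Confirm the stale rootfs mount is gone:
mount | grep a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b

# Confirm containerd logged the RemoveContainer for the stuck ID:
journalctl -u containerd --since "10 minutes ago" | grep RemoveContainer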
If this is a known issue with a particular Kubernetes version, would you be able to make a recommendation? Are there any additional logs I can provide to help troubleshoot this issue?
Some containerd issues suggest that a scan of some sort might be causing this. We have the following monitoring tools in place: GKE Enterprise with threat detection (and image scanning), Falco, and Datadog. All of these were in place long before the upgrade.
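In case it is relevant, this is the kind of check I can run to see whether any of those agents are still holding the stuck rootfs mount (the process name patterns are guesses for our particular deployment, not confirmed culprits):

CONTAINER_ID=a56c5f2d19b3623d5d0ef25ea44622051702a1bcbced012922dca84ae98b4c0b

# For each candidate agent process, check whether its mount table still references the container:
for pid in $(pgrep -f 'falco|datadog|gke'); do
  if grep -q "$CONTAINER_ID" /proc/"$pid"/mountinfo 2>/dev/null; then
    ps -o pid,comm,args -p "$pid"
  fi
done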