pdcsi-node OOM because of fsck on large volumes

JordanP · July 10, 2024, 2:26pm

Hi,

I have some large (>5TB) volumes formatted with an ext4 filesystem. When a pod tries to attach such a volume, an fsck process is spawned at some point, and that process seems to get killed because the gce-pd-driver container in the pdcsi-node pod has a memory limit of 50MB.

INFO 2024-07-10T14:11:01.219800489Z [resource.labels.containerName: gce-pd-driver] For disk restore-us-central1-d8ff-pg-data-pg-main-0-2938 the /dev/* path is /dev/sdb for disk/by-id path /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:01.221497320Z [resource.labels.containerName: gce-pd-driver] For disk restore-us-central1-d8ff-pg-data-pg-main-0-2938, device path /dev/sdb, found serial number restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:01.221534929Z [resource.labels.containerName: gce-pd-driver] Successfully found attached GCE PD “restore-us-central1-d8ff-pg-data-pg-main-0-2938” at device path /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938.
INFO 2024-07-10T14:11:01.221567009Z [resource.labels.containerName: gce-pd-driver] NodePublishVolume check volume path /var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/a001f3bad7990d3afb447355ea6314f7da31e3609310dfee75555b3d2e9f0687/globalmount is mounted false: error
INFO 2024-07-10T14:11:01.221573429Z [resource.labels.containerName: gce-pd-driver] Attempting to determine if disk “/dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938” is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938])
INFO 2024-07-10T14:11:01.287565794Z [resource.labels.containerName: gce-pd-driver] Output: “DEVNAME=/dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938\nTYPE=ext4\n”
INFO 2024-07-10T14:11:01.287604043Z [resource.labels.containerName: gce-pd-driver] Checking for issues with fsck on disk: /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:08.508376947Z [resource.labels.containerName: gce-pd-driver] fsck error fsck from util-linux 2.36.1
ERROR 2024-07-10T14:11:08.508442197Z [resource.labels.containerName: gce-pd-driver] /dev/sdb: recovering journal
ERROR 2024-07-10T14:11:08.508450037Z [resource.labels.containerName: gce-pd-driver] fsck: Warning… fsck.ext4 for device /dev/sdb exited with signal 9.
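
For anyone checking whether they are hitting the same thing: signal 9 is SIGKILL, which is the signal the kernel OOM killer sends. Below is a minimal sketch that scans exported driver log lines for the "exited with signal" message shown above. The function name and the regular expression are illustrative assumptions based only on the log line in this post, not on anything the driver itself provides.

```python
import re

# Pattern modeled on the single log line quoted in this thread (an assumption,
# other fsck versions may word the message differently).
FSCK_KILLED = re.compile(
    r"fsck\.(?P<fstype>\w+) for device (?P<device>\S+) "
    r"exited with signal (?P<signo>\d+)"
)

SIGKILL = 9  # numeric value of SIGKILL on Linux


def find_killed_fsck(lines):
    """Return (device, fstype, signo) for every fsck that was SIGKILLed."""
    hits = []
    for line in lines:
        match = FSCK_KILLED.search(line)
        if match and int(match.group("signo")) == SIGKILL:
            hits.append(
                (match.group("device"), match.group("fstype"), SIGKILL)
            )
    return hits


sample_logs = [
    "/dev/sdb: recovering journal",
    "fsck: Warning… fsck.ext4 for device /dev/sdb exited with signal 9.",
]
killed = find_killed_fsck(sample_logs)
print(killed)
```

A SIGKILL on its own does not prove an OOM kill, so the result still needs to be cross-checked against the node's kernel log.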

The VM kernel logs say

Sometimes the OOM killer decides to kill not only the child fsck process but also the gce-pd-csi-driver process, which crashes the whole pdcsi-node pod:

Could we raise the 50MB memory limit for the gce-pd-driver container? It looks like it’s really not enough to fsck a very large filesystem. What do you think?

koenfaro · August 5, 2024, 8:32pm

Same issue here, really annoying. Any workarounds? I am half-way migrating a cluster, and the two biggest disks now just fail to be mounted at all, great. Guess I will try a fsck from a compute node.

ajutrowski · August 20, 2024, 10:17am

Same issue here. I haven’t found any reliable solution yet.

JordanP · August 20, 2024, 7:33pm

I opened a "bug report / help request" at https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/issues/1782 and found a workaround.

ajutrowski · August 27, 2024, 2:37pm

@JordanP I found this thread as well, and it was somewhat helpful. In our case, we concluded that we could stop the pod from which we’re snapshotting the volume. So, our process is to stop the pod, take the snapshot, start a new pod from the snapshot, and then restart the original pod.

blaisem · February 1, 2025, 12:49pm

We ended up creating a modified pdcsi driver. You can copy the YAML of the pdcsi DaemonSet, modify the memory request there, and change its name, e.g., to pdcsi-driver-modified. Then use kubectl patch on the existing pdcsi-driver DaemonSet to update its nodeSelector to some other value (some prefix followed by dev-only; I don’t have the exact value handy as I am on my phone). After that, you can kubectl apply the new DaemonSet, et voilà.
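
A rough sketch of the two manifest changes described above, written as plain Python over dictionaries so it can be adapted to whatever tooling you use. Everything specific here is an assumption for illustration: the stand-in manifest, the `pdcsi-node` and `kube-system` names, the `512Mi` value, and especially the `example.com/hypothetical-key` label, since the real nodeSelector key was not given in the post. The sketch raises the memory limit (the setting the original question is about) rather than the request.

```python
import copy
import json


def make_modified_daemonset(original, new_name, container_name, memory_limit):
    """Copy a DaemonSet manifest, rename it, and raise one container's memory limit."""
    ds = copy.deepcopy(original)
    ds["metadata"]["name"] = new_name
    # Drop server-populated fields so the copy can be applied as a new object.
    for field in ("uid", "resourceVersion", "creationTimestamp", "generation"):
        ds["metadata"].pop(field, None)
    ds.pop("status", None)
    for container in ds["spec"]["template"]["spec"]["containers"]:
        if container["name"] == container_name:
            resources = container.setdefault("resources", {})
            resources.setdefault("limits", {})["memory"] = memory_limit
    return ds


def node_selector_patch(label_key, label_value):
    """Build a patch body that sets a nodeSelector on the pod template."""
    return {
        "spec": {
            "template": {"spec": {"nodeSelector": {label_key: label_value}}}
        }
    }


# Stand-in for the exported DaemonSet (hypothetical, heavily trimmed).
original = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {
        "name": "pdcsi-node",
        "namespace": "kube-system",
        "resourceVersion": "123",
    },
    "spec": {"template": {"spec": {"containers": [
        {"name": "gce-pd-driver",
         "resources": {"limits": {"memory": "50Mi"}}},
    ]}}},
}

modified = make_modified_daemonset(
    original, "pdcsi-driver-modified", "gce-pd-driver", "512Mi"
)
# Hypothetical label key: substitute the value that works in your cluster.
patch = node_selector_patch("example.com/hypothetical-key", "dev-only")

print(json.dumps(modified, indent=2))
print(json.dumps(patch))
```

The first JSON document is what you would save and `kubectl apply`; the second is the body you would pass to `kubectl patch` on the existing DaemonSet, presumably so that its pods no longer match any node.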
