Hi,
I have some large (>5TB) volumes formatted with an ext4 filesystem. When a pod tries to attach such a volume, an fsck process is spawned at some point, and that fsck process seems to be killed because the gce-pd-driver container in the pdcsi-node pod has a memory limit of 50MB. Here are the driver logs:
INFO 2024-07-10T14:11:01.219800489Z [resource.labels.containerName: gce-pd-driver] For disk restore-us-central1-d8ff-pg-data-pg-main-0-2938 the /dev/* path is /dev/sdb for disk/by-id path /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:01.221497320Z [resource.labels.containerName: gce-pd-driver] For disk restore-us-central1-d8ff-pg-data-pg-main-0-2938, device path /dev/sdb, found serial number restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:01.221534929Z [resource.labels.containerName: gce-pd-driver] Successfully found attached GCE PD “restore-us-central1-d8ff-pg-data-pg-main-0-2938” at device path /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938.
INFO 2024-07-10T14:11:01.221567009Z [resource.labels.containerName: gce-pd-driver] NodePublishVolume check volume path /var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/a001f3bad7990d3afb447355ea6314f7da31e3609310dfee75555b3d2e9f0687/globalmount is mounted false: error
INFO 2024-07-10T14:11:01.221573429Z [resource.labels.containerName: gce-pd-driver] Attempting to determine if disk “/dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938” is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938])
INFO 2024-07-10T14:11:01.287565794Z [resource.labels.containerName: gce-pd-driver] Output: “DEVNAME=/dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938\nTYPE=ext4\n”
INFO 2024-07-10T14:11:01.287604043Z [resource.labels.containerName: gce-pd-driver] Checking for issues with fsck on disk: /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:08.508376947Z [resource.labels.containerName: gce-pd-driver] fsck error fsck from util-linux 2.36.1
ERROR 2024-07-10T14:11:08.508442197Z [resource.labels.containerName: gce-pd-driver] /dev/sdb: recovering journal
ERROR 2024-07-10T14:11:08.508450037Z [resource.labels.containerName: gce-pd-driver] fsck: Warning… fsck.ext4 for device /dev/sdb exited with signal 9.
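For anyone reading the last log line: signal 9 is SIGKILL, which is the signal the kernel OOM killer sends. The snippet below is only a small illustration of how a child killed that way shows up to its parent. It assumes Linux and Python 3; `describe_exit` is a hypothetical helper name, and the child here kills itself as a stand-in for fsck being OOM-killed.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Describe a subprocess return code, naming the signal if one killed it."""
    # On POSIX, subprocess reports "killed by signal N" as a return code of -N.
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Stand-in child that sends SIGKILL to itself (not a real fsck run).
    child = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    )
    print(describe_exit(child.returncode))
```

On Linux this should print `killed by signal 9 (SIGKILL)`, which matches the "exited with signal 9" wording in the fsck warning.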
The VM kernel logs say
Sometimes the OOM killer decides to kill not only the child fsck process but also the gce-pd-csi-driver process, which crashes the whole pdcsi-node pod:
Could we raise the 50MB memory limit for the gce-pd-driver container? It looks like it’s really not enough to fsck a very large filesystem. What do you think?
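In case it helps with reproducing this, here is a rough sketch of how the effective memory limit could be read from inside a container. It assumes the standard cgroup file locations are visible in the container (`memory.max` for cgroup v2, `memory.limit_in_bytes` for cgroup v1) and that Python is available there, which may not be the case for the driver image itself. `parse_limit` and `read_memory_limit` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional

# Assumed standard locations of the memory limit file inside a container.
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup memory limit; cgroup v2 writes 'max' for no limit."""
    text = raw.strip()
    if text == "max":
        return None
    return int(text)


def read_memory_limit() -> Optional[int]:
    """Return the limit in bytes from the first readable file, else None."""
    for path in CGROUP_LIMIT_FILES:
        try:
            return parse_limit(path.read_text())
        except OSError:
            continue  # file missing or unreadable, try the next layout
    return None


if __name__ == "__main__":
    limit = read_memory_limit()
    print("no limit found" if limit is None else f"{limit} bytes")
```

Note that cgroup v1 reports "no limit" as a very large number rather than `max`, so a huge value there should be read as unlimited.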