UPDATE: This appears to be an issue with GKE 1.32.x - 1.32.1729000 globally.
Hello! Since yesterday my PVCs no longer will resize, across clusters, different GKE versions, and with different versions of the pdcsi driver.
The normal behaviour is to change the PVC resource request for storage to a larger number, and then when the resize is pending a Pod start, kill the existing pod attached to the PVC.
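Concretely, that normal flow looks like the following (the PVC and pod names here are examples; substitute your own):

```shell
# request a larger size on the PVC's storage request
kubectl patch pvc data-pvc --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# the filesystem resize stays pending until the pod restarts,
# so delete the pod attached to the PVC to trigger it
kubectl delete pod app-0
```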
Now, I see the following event, and nothing further:
From v1.32.1-gke.1489001 with pdcsi v1.15.4:
CSI migration enabled for kubernetes.io/gce-pd; waiting for external resizer to expand the pvc
From v1.32.1-gke.1729000 with pdcsi v1.16.1:
waiting for an external controller to expand this PVC
Check if the External Resizer is Running
Since the error states that it is "waiting for an external controller to expand this PVC," verify that the external resizer is running correctly in your cluster:
kubectl get pods -n kube-system | grep csi
Verify the Resizer Logs
Check the logs of the external resizer:
kubectl logs -n kube-system <csi-controller-pod> -c external-resizer
Check for Pending PVC Events
Describe the PVC to see more details on its status:
kubectl describe pvc <pvc-name>
Ensure Proper CSI Migration Configuration
Since you see the message "CSI migration enabled for kubernetes.io/gce-pd", verify that migration is properly configured and there are no conflicts. Run:
kubectl get csidrivers
Manually Restart the CSI Controller
Try restarting the csi-provisioner and csi-resizer:
kubectl delete pod -n kube-system -l app=pd-csi-controller
I tried to reproduce your concern and I'm getting the same output. Per the GKE release channels documentation, the Rapid channel provides the newest GKE versions; these versions are excluded from the GKE SLA and may contain issues without known workarounds.
To ensure the features and APIs in your configuration work as expected, I suggest using the Regular channel instead of the Rapid channel. The Rapid channel is designed for early access and experimentation, which means some features can be unstable or even temporarily disabled. By switching to the Regular channel, you'll be using a more stable environment that supports the components in your configuration.
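The channel switch can be done with gcloud (the cluster name and location below are placeholders; note that moving channels may upgrade the cluster, and moving to an older version is generally not possible):

```shell
# move the cluster to the Regular release channel
gcloud container clusters update CLUSTER_NAME \
  --location=us-central1-a \
  --release-channel=regular
```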
If you need further assistance, please don't hesitate to submit a ticket to our support team.
For further reference, please see the documentation below:
Was this helpful? If so, please accept this answer as "Solution". If you need additional assistance, reply here within 2 business days and I'll be happy to help.
You've hit upon a known and frustrating issue with GKE 1.32.x, specifically related to PVC resizing. The symptoms you're describing, where the PVC resize hangs with "waiting for external resizer" messages, are indicative of this problem.
Understanding the Issue
GKE 1.32.x Bug:
The root cause is a bug in the GKE 1.32.x series, particularly versions up to 1.32.1729000. This bug disrupts the communication between the Kubernetes control plane and the CSI (Container Storage Interface) driver responsible for resizing Persistent Volumes.
The issue stops the external resizer from properly receiving the resize request.
CSI Driver Interaction:
Kubernetes relies on CSI drivers to manage storage operations, including volume resizing. The pd.csi.storage.gke.io driver handles Persistent Disks (PD) in Google Cloud.
The bug in GKE 1.32.x interferes with the ability of the CSI driver to receive and process resize requests.
Impact:
This issue prevents you from dynamically resizing your Persistent Volumes, which can be critical for applications that require flexible storage capacity.
Troubleshooting and Workarounds
GKE Version Downgrade (If Possible):
If possible, the most reliable workaround is to run your GKE clusters on a stable version prior to 1.32.x; the 1.31.x releases are generally considered stable. This is not always feasible, but it is the most dependable fix.
Wait for GKE Patch:
Google Cloud is aware of this issue and is working on a patch. Keep an eye on the GKE release notes and the Google Cloud Status Dashboard for updates.
The fact that the issue appears globally suggests that Google is aware of it and working on a fix.
Manual Volume Resizing (Complex):
As a temporary workaround, you might be able to manually resize the underlying Persistent Disk using the gcloud command-line tool or the Google Cloud Console.
However, this is a complex and risky process that requires careful coordination with your application and Kubernetes.
You would have to:
Detach the volume from the node.
Resize the PD.
Resize the filesystem on the volume.
Reattach the volume to the node.
Then, resize the PVC object in Kubernetes.
This is highly discouraged unless you are very comfortable with storage management.
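The steps above can be sketched roughly as follows. All names (workload, disk, zone, device path) are placeholders, and the filesystem step must run on the node the disk attaches to; treat this as an outline, not a copy-paste procedure:

```shell
# 1. scale down the workload so the volume can be detached safely
kubectl scale statefulset my-app --replicas=0

# 2. resize the underlying Persistent Disk directly in GCP
gcloud compute disks resize my-disk --size=100GB --zone=us-central1-a

# 3. once the disk is reattached, grow the filesystem from the node
#    (ext4 shown; for xfs use xfs_growfs instead)
# resize2fs /dev/disk/by-id/google-my-disk

# 4. finally, update the PVC object so Kubernetes reflects the new size
kubectl patch pvc data-pvc --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```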
Create New PVCs and Migrate Data (Inconvenient):
Another workaround is to create new, larger PVCs and migrate your data to them.
This is inconvenient and can cause downtime, but it might be necessary if you urgently need to increase storage capacity.
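A minimal sketch of that migration, assuming the old workload is scaled down first so a ReadWriteOnce claim can be remounted (all names are examples):

```shell
# create a new, larger PVC from the same StorageClass
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc-new
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
EOF

# copy the data with a one-off pod that mounts both claims
kubectl run pvc-copy --rm -it --restart=Never --image=busybox \
  --overrides='{"spec":{"containers":[{"name":"pvc-copy","image":"busybox",
    "command":["sh","-c","cp -a /old/. /new/"],
    "volumeMounts":[{"name":"old","mountPath":"/old"},
                    {"name":"new","mountPath":"/new"}]}],
    "volumes":[{"name":"old","persistentVolumeClaim":{"claimName":"data-pvc"}},
               {"name":"new","persistentVolumeClaim":{"claimName":"data-pvc-new"}}]}'
```

After verifying the copy, point the workload at the new claim and delete the old one.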
Check for CSI Driver Issues (Less Likely):
Although you mentioned you have seen this across different pdcsi driver versions, it is still worth double-checking for any reported issues with the pd.csi.storage.gke.io driver.
However, because of the global nature of this issue, and the GKE version correlation, the GKE version is the most likely culprit.
Recommendations
Monitor the GKE release notes for updates and patches.
If you need immediate PVC resizing, consider downgrading to a stable GKE version if possible.
Avoid manual volume resizing unless absolutely necessary and you have a strong understanding of storage management.
If possible, run only non-critical workloads on affected clusters until a patch is released.