Hello,
We have a single-node-pool, 7-node GKE Standard multi-tenant cluster (one namespace per tenant) that deploys microservices for each customer. Autoscaling just doesn’t work: the autoscaler reports scale-up and scale-down errors (more details in the screenshots below). I went through this link and I don’t see any issue with the PVs or PVCs. We are not doing anything fancy, just the standard GCP storage class (so dynamic provisioning), and I don’t see quota-exceeded errors for storage.
Hi @dheerajpanyam ,
Basically, you’re seeing two errors. The cluster can’t scale up because of **no.scale.up.mig.failing.predicate** - “pod has unbound immediate PersistentVolumeClaims”, and the cluster can’t scale down because of the error “Pod is blocking scale down because its controller can’t be found.”
Troubleshooting the Scaling Up Issue
The error message “pod has unbound immediate PersistentVolumeClaims” can still appear even if the pods have bound PVCs due to some other issue. You can try the following resolutions:
- Enable the Compute Engine persistent disk CSI driver
  - As stated in this Stack Overflow thread [1], you can check whether the driver is installed by running `kubectl get csidriver`. You may follow the steps in this document to enable the disk CSI driver.
- Check for PV objects in the Released state and clean them up (see the command sketch after this list)
  - The kube-scheduler log reports the error “pod has unbound immediate PersistentVolumeClaims” when a pod is unschedulable. The reason is a limitation in the current PV controller when it has to process a large number of PV objects. You can use a cron job to clean up the stale PVs.
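For reference, here is a minimal sketch of that cleanup, assuming your kubectl context points at the affected cluster; it is the kind of thing a periodic cron job could run, not an exact prescription:

```bash
# List PersistentVolumes stuck in the Released phase.
kubectl get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'

# Delete the Released PV objects once you have confirmed they are no longer needed.
kubectl get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}' \
  | xargs -r kubectl delete pv
```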
Troubleshooting the Scaling Down Issue
To troubleshoot the scaling down issue, please refer to this document and review your YAML file. Make sure that the pods are created by a controller object (such as a Deployment, ReplicaSet, Job, or StatefulSet).
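As a quick check (the pod and namespace names below are placeholders, not from your cluster), you can confirm that a pod actually has a controller behind it by looking at its owner references:

```bash
# Print the kind and name of the controller that owns the pod;
# an empty result means the pod is "naked" and will block scale-down.
kubectl get pod postgres-0 -n tenant-a \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{" "}{.metadata.ownerReferences[*].name}{"\n"}'
```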
If the above options don’t work, you can contact Google Cloud Support to look into your case further, since small details can cause errors like these. Thank you.
Thanks @lawrencenelson, I will troubleshoot as per your suggestions.
I remember going through the SO link you mentioned. I am using dynamic provisioning with the default out-of-the-box GKE storage class, which provisions standard GCE disks. I do not have custom StorageClasses, so nothing fancy, and I am not using any external storage. Also, there are only about 10-15 PVCs. I did not understand the limitation of the current PV controller in processing a large number of PV objects; can you please elaborate?
How are you deploying your workload? For example, are you using a StatefulSet for the Postgres deployment?
**no.scale.down.node.pod.controller.not.found** means that somehow whatever controller you were using to deploy the workload is no longer available. Perhaps you were using an operator?
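One way to narrow this down (the namespace name below is a placeholder) is to check whether the controller the autoscaler is looking for still exists in that namespace:

```bash
# List the controller objects in the affected namespace; if the StatefulSet
# that created the Postgres pods is missing here, that would match the error.
kubectl get statefulsets,deployments,replicasets,jobs -n tenant-a
```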
Hey @garisingh, as always, thank you so much for your reply. Just to give you some context: the GKE Standard cluster is designed to be a multi-tenant cluster (namespace scoped) with a Dependency Tracker API server (a third-party solution) that is used for component analysis. It needs a Postgres backend, so each namespace has a DT API server and an associated Postgres backend. I can see Postgres was deployed as a StatefulSet, but I see this error happening in one specific namespace only; considering that we used a StatefulSet everywhere to keep things uniform, I would have expected this error to show up for all Postgres deployments. The more important question, though, is: does this intermittent scaling issue affect cluster autoscaling? Is this a critical error that needs to be resolved, or can it be ignored? The goal is to make scale-down work so that we can save costs; it is a decent-sized 7-node cluster.
Hi @dheerajpanyam ,
Regarding the current limitation of the PV controller: it is a bug where new pods created with PVC resources may be stuck in an unschedulable status for over 10 minutes before reaching Running. In some instances, this causes the “Pod has unbound immediate PersistentVolumeClaims” error.
The affected user had lots of stale local PV objects in their cluster, which caused the PV controller to hit scaling bottlenecks.
But since you’re only working with 10-15 PVCs, I don’t think that particular issue applies to your case.
Regarding the persistent disk CSI Driver, were you able to enable it on your cluster?
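If it turns out not to be enabled, it can usually be checked and turned on along these lines (CLUSTER_NAME and REGION are placeholders for your own values):

```bash
# Check whether the PD CSI driver is registered on the cluster.
kubectl get csidriver pd.csi.storage.gke.io

# Enable the managed add-on if it is missing.
gcloud container clusters update CLUSTER_NAME \
  --region REGION \
  --update-addons=GcePersistentDiskCsiDriver=ENABLED
```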
Image from Stack Overflow user Jonas_Hess [1].
An alternative solution, as described in this Stack Overflow thread, is to change the `storageClassName` definition from `standard` to `standard-rwo` and then redeploy your workload; for example:
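For illustration only, a PVC using that class might look like the sketch below; the name, namespace, and size are placeholders, not taken from your setup:

```bash
# Apply an example PVC that requests a disk through the standard-rwo class
# (provisioned by the Compute Engine persistent disk CSI driver).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: tenant-a
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 10Gi
EOF
```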
Thank you.
[1]. https://stackoverflow.com/questions/52668938/pod-has-unbound-persistentvolumeclaims
Hello @lawrencenelson, the CSI driver setting is Enabled, probably as the default. Let me try the alternative solution. The bigger concern I have, though, is the impact on cost due to the scale-down issue. Would this intermittent error (occurring in a specific namespace) prevent scaling down the cluster?