GKE autopilot - fixing misbehaving autoscaler

Is there a way to fix the misbehaving autoscaling issues?

I have several nodes running basically nothing, and the number of unrequested cores is unbelievably high: I'm requesting fewer than 10 cores, but there are more than 36 unrequested cores in the cluster. Yesterday I requested a higher quota, and the cluster could scale for a while, but I hit the limit again. It just keeps scheduling nodes it isn't using, and it even keeps old nodes around running a different GKE version.


Does anyone else see this behavior on Autopilot? It leaves unsuitable nodes (running an old version) in place, spins up very large nodes for tiny workloads, and, the main issue, scales so big that even after increasing quotas several times I still run into quota limits.


Hi paul-aldora,

Welcome to Google Cloud Community!

Have you enabled the Compute Engine (GCE) autoscaler? When both autoscalers are enabled simultaneously, it can lead to unintended behavior. Additionally, your nodes could be underutilized: frequent adding and removing of nodes can happen when the GKE cluster autoscaler repeatedly scales the cluster down because nodes are underutilized.

Here are some recommendations to help mitigate high resource utilization and quota issues:

  1. If GCE autoscaling is enabled, consider disabling it to prevent conflicts with the GKE cluster autoscaler.
  2. Optimize resource requests and limits for your workloads to ensure efficient use of cluster resources.
  3. Use node selectors or taints/tolerations to influence node type selection. For example, specify smaller machine types for lightweight workloads.
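As a sketch of points 2 and 3, a Deployment's pod template can right-size its requests and steer scheduling with a node selector. The names below (`web`, the image, and the request values) are placeholders for illustration, not taken from the thread:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                               # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:
        # Prefer small, general-purpose machines over large ones.
        cloud.google.com/machine-family: e2
      containers:
        - name: web
          image: nginx:1.27               # example image
          resources:
            requests:                     # what the autoscaler sizes nodes for
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

On Autopilot, the requests are what provisioning is sized to (and what you are billed for), so keeping them close to actual usage is the main lever against over-provisioning.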

If the issue persists and you need further assistance, please feel free to reach out to the Google support team.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Can you post the output of the following command? I’m pretty sure I know what is happening.
I assume you just recently created this cluster?

kubectl get nodes \
  -o=custom-columns=NAME:.metadata.name,\
CPU:.status.capacity.cpu,\
ARCH:".metadata.labels.beta\.kubernetes\.io/arch",\
MACHINE:".metadata.labels.cloud\.google\.com/machine-family",\
SPOT:".metadata.labels.cloud\.google\.com/gke-spot",\
ZONE:".metadata.labels.topology\.kubernetes\.io/zone"
NAME                                   CPU  ARCH   MACHINE  SPOT    ZONE        VERSION
gk3-aldora-nap-1caxq9gj-dd6abc29-r4xn  32   amd64  ek       <none>  us-west2-c  v1.32.4-gke.1106006
gk3-aldora-nap-1ln2dx42-076f2a3f-f7n9  2    amd64  e2       <none>  us-west2-c  v1.32.4-gke.1106006
gk3-aldora-nap-1s8ew6zd-d090e53f-rxzh  32   amd64  ek       <none>  us-west2-b  v1.32.4-gke.1106006
gk3-aldora-pool-2-406222cb-ca60        4    amd64  e2       <none>  us-west2-a  v1.32.4-gke.1106006
gk3-aldora-pool-2-e7754495-y32i        4    amd64  e2       <none>  us-west2-b  v1.32.4-gke.1106006

I only saw your response just now; see the output above (I also added the version). Finally, after 3-4 weeks, I don't have any nodes running an old version anymore. But there are still far too many big nodes.

This cluster has been running for a long time. I'm pretty sure it has been running for a year now, maybe two. But we started using it a lot more since January this year.

I have done nothing with the GCE autoscaler; it is off. This behaviour is new since a new version was deployed on 15 May, and it still behaves weirdly. I suspect the bug is related to the “safe-to-evict” annotation.

So you are setting “safe-to-evict: false”?

To be exact: cluster-autoscaler.kubernetes.io/safe-to-evict: “false”
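For readers following along, that annotation goes on the pod's metadata, not the controller's. A minimal sketch on a hypothetical Deployment (the name, image, and command are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: long-task                 # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: long-task
  template:
    metadata:
      labels:
        app: long-task
      annotations:
        # Tells the cluster autoscaler it may not evict this pod,
        # which blocks scale-down of whichever node it lands on.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: worker
          image: busybox:1.36     # example image
          command: ["sleep", "86400"]
```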

But not on every workload, just on some, although I might remove it from more soon (we used to run a few workloads which lasted days and shouldn't be interrupted; most of that has been resolved, though).

What do you suspect is the issue? It seems to have been happening since an auto-update that ran on 15 May (1.31.7-gke.1212000 to 1.32.3-gke.1785003).

More context:

  • the 2 big nodes (32 CPU): one is running nothing but system resources (GKE/Kubernetes default pods), the other is running one pod with 600m CPU and 2GB memory requested (limits slightly higher)

Totals, with most of it being this balloon pod created by GKE itself:

NODE                                   NAMESPACE    POD                           CPU REQUESTS  CPU LIMITS      CPU UTIL    MEMORY REQUESTS  MEMORY LIMITS   MEMORY UTIL
*                                      *            *                             40572m (55%)  103418m (140%)  1160m (1%)  133243Mi (49%)   155942Mi (57%)  12691Mi (4%)
gk3-aldora-nap-1caxq9gj-dd6abc29-r4xn  kube-system  gke-system-balloon-pod-kgl72  29920m (93%)  29920m (93%)    0m (0%)     115077Mi (96%)   115077Mi (96%)  1Mi (0%)

With this much room on the cluster, it still spins up a new node almost every time a new pod is created.

Also worth noting: unrequested cores went way up since the update, so much that I run into scaling issues. No other changes. On 22 May I requested a higher quota, which was approved. That solved my issue temporarily, but the unrequested cores went way up again, possibly causing another blocking issue.

@garisingh Do you have any idea what is happening (posted details in my other messages)?

We introduced a new default computing model for Autopilot clusters recently. It allows for much faster scaling as we are able to “resize” nodes in place rather than provision new ones.

The one side effect of this is that provisioned nodes consume quota for the max size of the node even though it isn't technically being used (in your case we were provisioning nodes which could scale to 32 cores). If you mix in “safe-to-evict: false” (which basically results in a pod-per-node deployment model under the covers), you end up with multiple of these larger nodes.

We did recently change the default for resizable nodes to 16 cores, but even if you only deploy a single-core pod, the node will still consume 16 cores of quota.


Thanks for this. Just a little feedback: “safe-to-evict: false” already had some issues with Autopilot, but with this change it massively over-provisions. If I start a pod with the annotation, it boots up that 32/16-core machine; if I then add another such pod, it boots up another 32/16-core machine, so it doesn't actually result in faster scaling. I'm guessing this makes Autopilot unusable for these kinds of workloads, and I should use Standard instead?

With Autopilot, we allow up to 7 days for run-to-completion workloads, so that’s why “safe-to-evict:false” ends up being a pod-per-node model.

If you’d like to keep using Autopilot, one option would be to pick one of our other built-in compute classes instead of the default one. For example:

spec:
  nodeSelector:
    cloud.google.com/compute-class: "Balanced"

uses the Balanced compute class (which uses N2 or N2D machines). I think the default quota for N2 is 48 as well (at least that's the quota for my account which uses my Gmail address, as my Google work account has much higher quotas).
