Autopilot cluster keeps restarting nodes

I’m running a GKE Autopilot cluster and recently noticed my pods restarting frequently without any apparent resource pressure (CPU and memory usage remain low and stable).

After investigation, it appears that nodes are being continuously created and then marked for deletion. The pattern looks like this: a new node is created, then after a few minutes the node is marked for deletion; shortly after, one or two new nodes are created, and later another node is marked for deletion. This cycle repeats throughout the day, resulting in most nodes being recreated multiple times per day.

This behavior is very similar to the issue described in the following thread, which was closed without an official resolution:
https://discuss.google.dev/t/gke-autopilot-scaling-loop-scale-up-followed-by-scale-down/274178

Unfortunately, I can’t apply the suggested workaround because I have critical workloads that must run with a single replica, and I cannot use a PDB with minAvailable: 1 as it would block GKE maintenance and upgrades in Autopilot.
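For context, the workaround in that thread amounts to a PodDisruptionBudget like the sketch below (the app name is a placeholder). With a single replica and minAvailable: 1, the allowed disruptions come out to zero, which is exactly why it would stall Autopilot maintenance and upgrades:

```yaml
# Hypothetical PDB for a single-replica workload ("critical-app" is a placeholder).
# With one replica and minAvailable: 1, allowed disruptions = 0, so node drains
# (including GKE maintenance/upgrades) cannot evict the pod.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: critical-app
```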

I’d appreciate guidance on whether this is a known Autopilot behavior, how to diagnose the root cause, or what configuration changes are recommended to prevent this continuous node churn.

Is this problem still persisting on your cluster? If not, what steps did you take to resolve it? If yes, which events or logs did you check to identify the cause?

This issue is still occurring. From several days of observation, I’ve noticed that every time I trigger a deployment (a new pod version), it causes a large-scale node recreation, and the cluster then takes at least 1–2 days to gradually stabilize. So far, I’ve only seen it fully settle once, during a rare two-day period without deployments. Because I’ve been deploying frequently lately, nodes keep getting recreated and tainted repeatedly without ever stabilizing.
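One mechanism that would fit this pattern (an assumption on my part, not confirmed): the default rolling-update strategy briefly runs an extra surge pod per Deployment during a rollout, which can force a scale-up; once the old pod terminates, the extra capacity sits idle and the autoscaler reclaims it. For a 1-replica Deployment the defaults effectively resolve to:

```yaml
# Default rolling-update behaviour for a 1-replica Deployment (sketch):
# maxSurge 25% rounds up to 1, maxUnavailable 25% rounds down to 0,
# so every rollout temporarily runs 2 pods and may force a new node.
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod during rollout
      maxUnavailable: 0  # never drop below 1 running pod
```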

Did you check the cluster activity such as events, kube-system pod logs, and kubelet logs? If so, did you notice any errors or warnings?

No useful info in the logs, and no errors found. I do see some weird things happening in the autoscaler logs, though: within a very short window (about 3 minutes), it repeatedly decides to scale down → up → down, without giving any reason in the logs.

From the autoscaler logs:

  • 00:30:39
    autoscaledNodesCount: 10 → autoscaledNodesTarget: 9

  • 00:31:34
    autoscaledNodesCount: 9, autoscaledNodesTarget: 9

  • 00:31:55
    Scale-down blocked with reason:
    no.scale.down.node.pod.not.backed.by.controller
    Pod involved: lookahead-virtual-pod-default-0

  • 00:32:01
    Scale-down blocked again due to:
    no.scale.down.in.backoff

  • 00:32:01
    autoscaledNodesCount: 9 → autoscaledNodesTarget: 10

  • 00:32:50
    autoscaledNodesCount: 10, autoscaledNodesTarget: 10

  • 00:33:29
    autoscaledNodesCount: 10 → autoscaledNodesTarget: 9

And this has kept going on for several hours. All the pods we own are requesting far more resources than they actually need.
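To back up the over-requesting point, declared requests can be compared against actual usage (a command sketch; the namespace is a placeholder, and `kubectl top` relies on the metrics API, which Autopilot provides):

```shell
# Declared CPU/memory requests per pod
kubectl get pods -n default -o custom-columns=\
'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Actual current usage from the metrics API
kubectl top pods -n default
```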

Interesting. Have you created any support ticket on GCP?

I’m not able to create a support ticket because it keeps showing me the error “You don’t have permission to file tech-related support cases”, even though I already have the required permissions; that’s a separate issue. That’s why I resorted to this forum.