GKE Autopilot and preempted pods

I have a GKE Autopilot cluster running, and every now and then it randomly starts preempting pods, which doesn't make sense in these cases.

For the last few days I have had issues with pods being terminated early by GKE (Autopilot):

  • The pods have the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  • Checked resource requests and usage: not even close to the limits or requests (CPU, memory, ephemeral storage)
  • Pods sometimes get killed within a few seconds
  • Checked logs; not much found on why
  • The only real log info I see is a message like this:
    • Preempted by pod 4252ce0e-970e-4f88-a824-660256e76221 on node
    • efficiency-daemon Adding pod (uid=4252ce0e-970e-4f88-a824-660256e76221 namespace=kube-system name=konnectivity-agent-67d96cfc85-np92w)
    • which sounds like the pod was killed because a system pod couldn't fit (which means Autopilot schedules nodes that are too small?)
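
For context, the safe-to-evict annotation is set on the pod template, roughly like this (the workload name and image are placeholders, not our actual workload):

```yaml
# Sketch of a Deployment carrying the safe-to-evict annotation.
# Note: this annotation only discourages cluster-autoscaler
# scale-down eviction; it does not protect against scheduler
# preemption by higher-priority (system) pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: app
          image: registry.example.com/my-app:latest   # placeholder image
          resources:
            requests:
              cpu: 500m         # well within limits, as described above
              memory: 1Gi
```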
3 Likes

Hi paul-aldora,

Welcome to Google Cloud Community!

There might be a possibility that you’re running low on resources. When using the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false", it can extend the runtime of your pods. However, there are some considerations when using this functionality. One consideration is that a pod can still be evicted to make space for Kubernetes system components. Here are some recommendations that might help to mitigate the issue:

For further reference, please refer to this documentation:

If the issue still persists, please feel free to reach out to our Google Cloud Support.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

1 Like

Hi,

The main issue is that Autopilot spins up a machine specifically for that pod, then kills the pod soon after it started, just because it didn't spin up the correct machine, or something else is going wrong. Resource requests and limits are both fine for these pods. This seems to be a big issue with GKE Autopilot that pops up every now and then. I will review the eviction guide, but I have already reviewed most or all possible causes. If I follow the log, it's literally like this:

  • Pod is submitted to kubernetes
  • No available Nodes
  • Autopilot starts node specifically for that pod
  • Starts pod on the machine
  • Then kills it because it needs to schedule a system pod, which means either the node was not big enough for the pod, or something else is going wrong. The pod stays within its resource requests.
1 Like

Hi @paul-aldora,

we are experiencing pretty much the same problem:

  • We have a GKE Autopilot cluster that starts our problem scenario with a few nodes (for example 3)
  • We schedule a few GPU workload pods (let’s say 2) which causes 2 new GPU nodes to be provisioned.
  • The konnectivity-agent-autoscaler scales up the konnectivity-agent deployment from 3 to 5 replicas according to the konnectivity-agent-autoscaler-config ConfigMap:
Data
====
ladder:
----
{
  "coresToReplicas": [],
  "nodesToReplicas": [
    [1, 1],
    [2, 2],
    [3, 3],
    [4, 4],
    [5, 5],
    [6, 6],
    [10, 8],
    [100, 12],
    [250, 18],
    [500, 25],
    [2000, 50],
    [5000, 100]
  ]
}

Those new konnectivity pods cannot be scheduled on the new GPU nodes because "2 node(s) had untolerated taint {nvidia.com/gpu: present}". Our preexisting nodes cannot accommodate the konnectivity pods either: "3 Insufficient memory".
Now the nodes would have enough memory if it weren't for the gke-system-balloon-pod DaemonSet pods, which hog the otherwise available memory.

Does somebody know more about those gke-system-balloon-pods? I'm guessing they are related to how Google manages non-exclusive nodes for GKE Autopilot, where we are, after all, only billed per requested CPU/memory. Maybe they represent resources consumed by other customers and therefore unavailable to us?

Anyway, sometimes the balloon pods are resized, but maybe not in time. Other times they may not be resizable at all?

  • The consequence: We have system-cluster-critical konnectivity-agent pods waiting to be scheduled, so something ends up being evicted. This can also hit our cluster-autoscaler.kubernetes.io/safe-to-evict: "false" pods - just as @francislouie pointed out.

So to me it looks like a clean solution would be up to Google.
Since we unfortunately have a base load of pods that cause problems when they get preempted, the next thing I'll try as a workaround is scheduling a small balloon/sacrifice pod myself with lower priority, so that pod hopefully gets evicted instead of our critical workloads. The konnectivity-agent requirements are a minuscule sum of

cpu: 35m
memory: 60Mi
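
The workaround I have in mind looks roughly like this; the PriorityClass name, its value, and the placeholder Deployment are my own sketch, not anything GKE provides (and Autopilot may round the tiny requests up to its platform minimums):

```yaml
# Hypothetical low-priority "sacrifice" placeholder: the scheduler
# should preempt this pod before anything with a higher priority when
# a system-cluster-critical pod (e.g. konnectivity-agent) needs room.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: sacrifice-priority        # placeholder name
value: -10                        # lower than the default 0 of regular workloads
preemptionPolicy: Never           # this pod should never preempt others itself
globalDefault: false
description: "Placeholder capacity that may be evicted for system pods."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sacrifice-placeholder     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sacrifice-placeholder
  template:
    metadata:
      labels:
        app: sacrifice-placeholder
    spec:
      priorityClassName: sacrifice-priority
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:             # sized to cover one konnectivity-agent replica
              cpu: 35m
              memory: 60Mi
```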

Please let me know if you have a better workaround or know more about what’s happening!

@paul-aldora: Are there also GPU nodes involved in your scenario? In your last post, it sounded like the exact pod you had just scheduled was preempted? Or was it also a pod on a different node?

Hi @paul-aldora. I'm looking into this.

Quick question: is the pod deployed as a standalone pod, or is it managed by a controller like a Deployment, StatefulSet, or DaemonSet?

Hi,

We’re experiencing the same issue @paul-aldora is describing.

We’re running a CronJob in a GKE Autopilot cluster and (almost) every week pods randomly get preempted, which is unexpected given that these pods (part of a cronjob) constitute the only workload in the cluster.

Context: The cronjob runs every 5 minutes and initiates a job that can vary significantly in duration, from a couple of minutes to several hours. As this workload is not fault-tolerant, these random shutdowns are highly disruptive and must be avoided.
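
For reference, the shape of our CronJob is roughly the following; the name and image are placeholders, while the requests match our real ones:

```yaml
# Sketch of the CronJob described above: runs every 5 minutes, jobs
# vary in duration, and the workload is not fault-tolerant, so a
# preempted run is not retried.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: periodic-job              # placeholder name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 0             # a killed, non-fault-tolerant run is not retried
      template:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: registry.example.com/worker:latest   # placeholder image
              resources:
                requests:
                  cpu: 1500m
                  memory: 5Gi
```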

Troubleshooting Steps Taken (without success):

Observed Preemption Process:

The sequence of events when preemption occurs is consistently as follows:

  • A new job is submitted to Kubernetes.
  • Initially, no available nodes are found for scheduling.
  • GKE Autopilot provisions a new node specifically for that pod.
  • The pod successfully starts on the newly provisioned machine.
  • Shortly after, the pod is unexpectedly terminated/killed because Autopilot needs to schedule a system pod on the same node.

This behavior suggests that either the newly provisioned node is not large enough to accommodate both our application pod (which consistently stays within its resource requests, e.g., X CPU, Y GB RAM) and the required system pods, or there is another underlying scheduling conflict.

We are currently running GKE Autopilot version 1.32.6-gke.1013000. The resource requests for our cronjob pods are 1500m CPU & 5Gi RAM.

@paul-aldora can you please upgrade to a newer version of GKE? I think this is a known issue:

https://cloud.google.com/kubernetes-engine/docs/release-notes#June_16_2025 

To which version do you recommend upgrading? I don’t see how this feature is related to the issues.

Feature: For clusters running GKE version 1.32.4-gke.1236000 or later, the cluster autoscaler can scale down nodes by evicting Pods in the kube-system namespace that have no Pod Disruption Budget (PDB) set and have been running for at least one hour.

1 Like

Upgrade to the latest version!

I’ve tried scheduling a sacrifice pod with lower priority. That saves our critical pods most of the time but not always.

I first updated our Autopilot cluster from 1.32.6-gke.1096000 to 1.33.2-gke.1240000, which is the newest version in the Regular release channel. No help. Then I tried switching to the Rapid release channel and updated to the newest version currently available there: 1.33.3-gke.1266000. Again, no help.

This is not fixed. At least our scenario isn't. I still don't know if we have the same root cause as @paul-aldora does. We currently see several daily preemptions of our pods in favor of konnectivity-agent, some of them still hitting our critical pods :grimacing:

Any other ideas @abdelfettah? Do we need to create a support issue or something?

1 Like

@Stefan_van_Kessel sounds like a support question. Go ahead and ask support; I cannot help you debug this way :slight_smile:

We have the same problem. Our pods are rather standard, using the default general-purpose compute class, requesting 0.5-1 vCPUs and 1-2 GB of RAM per pod. Until July, everything was fine: pod eviction was very rare - happening 1-2 times a month, and only with a few services, which is acceptable.

However, in July, evictions started happening more or less daily with almost all pods - and some days it’s much worse than others.
Just in the past hour, Keycloak, our single sign-on solution for our services, got evicted 5 times - it reached the point now where it’s degrading the end-user experience.

Based on the logs, the cluster autoscaler has become very aggressive: it allocates pods to nodes that can just barely fit them - then system pods request more resources, causing a reallocation to a bigger node - afterwards, when system pods again request fewer resources, we go back to a smaller node, creating a vicious cycle. The triggering culprit seems to be konnectivity-agent most of the time, as with the others in this thread.

We don’t set safe-to-evict to false (or true for that matter - we use the defaults): we never needed it before, and from the documentation, it sounds like it’s more for special cases, e.g. game servers. We don’t need our pods to be online 100% of the time - we just want them to function normally, it’s fine if they get evicted every now and then, but not 30-40 times on a “bad day” - which can easily mean 1-2 hours daily downtime, as typically these evictions take 2-3 minutes to recover from, especially with services that need some time to start up.

We regularly update our cluster to the latest stable version, currently 1.33.3-gke.1136000.

1 Like

I didn’t see I had responses here. An update on my side:

  • We had this issue very often until 22 August; after that I could only find it once.
  • As of yesterday we switched to GKE Standard, also because of other issues with GKE Autopilot and our workloads. Haven't seen the issue there (but it's only been 1 day), except for balloon pods created by ourselves (which are meant to be preempted for better scaling).

I also created a bug report: Google Issue Tracker

But it looks like Google Cloud only takes your bug reports seriously if you pay them for that privilege (paid "support"; unfortunately this is common with all big cloud providers).

Hi @abdelfettah, sorry I missed your responses in July. But as explained, this is a bug caused by the Autopilot autoscaler. It keeps putting pods on nodes barely big enough to fit them (according to the resource requests), but then kills them a little later when some system pod is scheduled. On top of that, the balloon pods Autopilot has been scheduling for the past few months are very unhelpful if you have workloads with safe-to-evict=false (it basically spins up a node specifically for that pod, but then also makes it large with balloon pods which will never get evicted).

The eviction issue we are talking about here has existed since an update in May/June, and it sometimes evicts the same pod several times; whether the pod is managed by a Deployment or unmanaged (managed by Argo Workflows in some of my cases) doesn't matter. It just schedules pods on small nodes that can't fit the required system pods, then evicts the pod it was supposed to run.

We switched away from GKE Autopilot to GKE Standard because of this issue, and because Autopilot has worked especially badly with safe-to-evict=false pods since May/June this year. Before that it was working a lot better. I also don't like that I and others keep getting asked to pay for support for a bug in a Google-managed black box.

2 Likes

Hi.
I’m experiencing this issue as well.

We schedule a cache in our GKE Autopilot cluster that takes quite a bit of time to start up due to loading data from disk, and we only run a single pod for it in our staging environment. Our pod getting preempted results in multiple minutes of downtime.

Is the autopilot preemption behavior really normal/expected?

Shouldn’t the gke-system-balloon-pod account for scheduling kube-system workloads?

Looks like there’s a known issue the GKE product team is working on.
Changing my compute class to Balanced “fixed” the issue
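
For anyone wanting to try the same workaround: on Autopilot the Balanced compute class is selected per workload via a nodeSelector, roughly like this (the workload name and image are placeholders):

```yaml
# Selecting the Balanced compute class on GKE Autopilot is done per
# workload via a nodeSelector on the pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                    # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        cloud.google.com/compute-class: "Balanced"
      containers:
        - name: app
          image: registry.example.com/my-app:latest   # placeholder image
```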

1 Like

We’ve been facing the same problem since our GKE Autopilot cluster was upgraded from 1.32 to 1.33.

Are you in contact with Google support? Have they acknowledged the issue?

1 Like

We've been facing the same issue. We run Dagster on GKE Autopilot with cluster-autoscaler.kubernetes.io/safe-to-evict: "false" on our job runners. We tried using capacity placeholders to reduce the number of job evictions due to GKE system pods, but that didn't help. I would like to see this issue resolved. We're on v1.33.5.

We're seeing the same issue. safe-to-evict: "false" workloads get evicted out of nowhere. That's especially bad when you've got a StatefulSet without failover. What's the recommended way of handling this?

In the end I moved to GKE Standard. There is autoscaling there as well; it's just a little more work to set up, but it seems to resolve this issue (and Google itself doesn't seem to see this as a big issue, as it's still not resolved a few months later).