Hi @paul-aldora,
we are experiencing pretty much the same problem:
- We have a GKE Autopilot cluster; our problem scenario starts with a few nodes (for example 3).
- We schedule a few GPU workload pods (let's say 2), which causes 2 new GPU nodes to be provisioned.
- The konnectivity-agent-autoscaler scales the konnectivity-agent deployment up from 3 to 5 replicas, according to the konnectivity-agent-autoscaler-config ConfigMap:
```
Data
====
ladder:
----
{
  "coresToReplicas": [],
  "nodesToReplicas": [
    [1, 1],
    [2, 2],
    [3, 3],
    [4, 4],
    [5, 5],
    [6, 6],
    [10, 8],
    [100, 12],
    [250, 18],
    [500, 25],
    [2000, 50],
    [5000, 100]
  ]
}
```
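As I understand the cluster-proportional-autoscaler's ladder mode, it picks the replica count from the largest nodesToReplicas step that does not exceed the current node count. A minimal Python sketch of that lookup (the function name is mine, not from the autoscaler):

```python
import bisect

# Ladder from the konnectivity-agent-autoscaler-config ConfigMap above.
NODES_TO_REPLICAS = [
    (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6),
    (10, 8), (100, 12), (250, 18), (500, 25), (2000, 50), (5000, 100),
]

def replicas_for(node_count: int) -> int:
    """Replica count for the largest ladder step <= node_count."""
    steps = [nodes for nodes, _ in NODES_TO_REPLICAS]
    i = bisect.bisect_right(steps, node_count) - 1
    return NODES_TO_REPLICAS[max(i, 0)][1]

print(replicas_for(3))  # 3 replicas with our initial 3 nodes
print(replicas_for(5))  # 5 replicas once the 2 GPU nodes join
```

That matches what we observe: the 2 new GPU nodes push the deployment from 3 to 5 replicas.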
Those new konnectivity pods cannot be scheduled on the new GPU nodes because “2 node(s) had untolerated taint {nvidia.com/gpu: present}”. Our preexisting nodes cannot accommodate the konnectivity pods either: “3 Insufficient memory”.
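For context, that scheduler event corresponds to a taint like the following on the GPU nodes, and a pod can only land there with a matching toleration (the NoSchedule effect is my assumption based on the event message; field names are per the Kubernetes API):

```yaml
# Taint as it would appear on the GPU node spec:
# spec:
#   taints:
#   - key: nvidia.com/gpu
#     value: present
#     effect: NoSchedule

# Toleration a pod would need in order to schedule onto such a node:
tolerations:
- key: nvidia.com/gpu
  operator: Equal
  value: present
  effect: NoSchedule
```

The konnectivity-agent pods don't carry this toleration, and since the deployment is GKE-managed, we can't simply add it ourselves.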
Now the nodes would have enough memory if it weren't for the gke-system-balloon-pod DaemonSet pods, which hog the otherwise available memory...
Does somebody know more about those gke-system-balloon-pods? My guess is that they are related to how Google manages non-exclusive nodes for GKE Autopilot, where we are after all only billed per requested CPU/memory - maybe they represent resources consumed by other customers and therefore unavailable to us?
Anyway, sometimes the balloon pods are resized, but maybe not in time. Other times they may not be resizable at all?
- The consequence: we have system-cluster-critical konnectivity-agent pods waiting to be scheduled, so something else ends up being evicted. This can also hit our cluster-autoscaler.kubernetes.io/safe-to-evict: "false" pods - just as @francislouie pointed out.
So it looks to me like a clean solution would have to come from Google.
Since we unfortunately have a base load of pods that cause problems when they get preempted, the next workaround I'll try is scheduling a small balloon/sacrifice pod myself with lower priority, so that this pod hopefully gets evicted instead of our critical workloads. The konnectivity-agent requirements are a minuscule
cpu: 35m
memory: 60Mi
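A sketch of that sacrifice-pod workaround (all names are mine; the pause image and the negative priority value are my choices, not anything GKE-specific). The idea is a pod whose priority is below the default of 0, so the scheduler preempts it first when a system-cluster-critical pod needs room:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: sacrifice-priority
value: -10                 # below the default 0, far below system-cluster-critical
preemptionPolicy: Never    # this pod should never preempt anything itself
globalDefault: false
description: "Sacrifice-pod priority: first in line for preemption."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sacrifice-balloon
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sacrifice-balloon
  template:
    metadata:
      labels:
        app: sacrifice-balloon
    spec:
      priorityClassName: sacrifice-priority
      terminationGracePeriodSeconds: 0
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 35m       # mirror the konnectivity-agent requests
            memory: 60Mi
          limits:
            cpu: 35m
            memory: 60Mi
```

Whether Autopilot's scheduler actually picks this pod for preemption ahead of our workloads is exactly what I want to verify.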
Please let me know if you have a better workaround or know more about what’s happening!
@paul-aldora: Are there also GPU nodes involved in your scenario? In your last post it sounded like the exact pod you had just scheduled was preempted - or was it a pod on a different node?