We have seen issues where calico can't start up because there aren't enough resources available on the node. Kubernetes then tries to evict pods to make room for calico, but since calico (the CNI) is down, those pods get stuck in Terminating and can't be removed until you manually force-delete them; only then does calico have enough resources to start. The stuck pods keep logging this event:
Warning FailedKillPod 2m2s (x138 over 31m) kubelet error killing pod: failed to "KillPodSandbox" for "bc24ce80-ad39-440d-9261-3b19542ef29c" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"multiple-round-service-74ddb6d49d-lc7fj_default\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
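The only workaround we have found is to force-delete the stuck pods by hand, e.g. (pod name taken from the event above):

```bash
# Force-delete a pod stuck in Terminating: removes the pod object from the API
# without waiting for the kubelet/CNI teardown that calico can't currently serve
kubectl delete pod multiple-round-service-74ddb6d49d-lc7fj -n default \
  --grace-period=0 --force
```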
It's really hard to reproduce the problem and put calico into this state. Any ideas on how to fix this properly? I was thinking of writing a script to clean this up, but it's happening more often now and that doesn't feel like the right solution.
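For completeness, a minimal sketch of the kind of cleanup script I mean (assumes `jq` is available; a pod in Terminating is one whose `metadata.deletionTimestamp` is set). Use with care, since force-deleting skips graceful teardown:

```bash
#!/usr/bin/env bash
# Sketch: find pods stuck in Terminating and force-delete them.
# A production version should probably also check how long each pod
# has been terminating before acting on it.
set -euo pipefail

kubectl get pods --all-namespaces -o json |
  jq -r '.items[]
         | select(.metadata.deletionTimestamp != null)
         | "\(.metadata.namespace) \(.metadata.name)"' |
while read -r ns pod; do
  echo "Force-deleting ${ns}/${pod}"
  kubectl delete pod "${pod}" -n "${ns}" --grace-period=0 --force
done
```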
It looks like all the calico-node pods are restarted/re-deployed at the same time when the issue occurs, but we don't hit the issue every time or on every node.
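A quick way to check whether the calico-node pods really came back at the same time (assuming a standard manifest install, where calico-node runs in kube-system with the label `k8s-app=calico-node`; an operator install uses the calico-system namespace instead):

```bash
# List calico-node pods with restart counts, sorted by start time
# so simultaneous restarts are easy to spot
kubectl get pods -n kube-system -l k8s-app=calico-node \
  -o wide --sort-by=.status.startTime
```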