We have seen issues where calico can't start up because there aren't enough resources available on the node. Kubernetes then tries to evict pods to make room for calico, but since calico (the CNI) is down, those pods get stuck in Terminating and can't be removed until you manually force-delete them; only then does calico have enough resources to start. The stuck pods keep logging this event:
Warning FailedKillPod 2m2s (x138 over 31m) kubelet error killing pod: failed to "KillPodSandbox" for "bc24ce80-ad39-440d-9261-3b19542ef29c" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"multiple-round-service-74ddb6d49d-lc7fj_default\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
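The only workaround we have found is to force-delete the stuck pods by hand, e.g. (pod name taken from the event above):

```bash
# Force-delete a pod stuck in Terminating: removes the pod object from the API
# without waiting for the kubelet/CNI teardown that calico can't currently serve
kubectl delete pod multiple-round-service-74ddb6d49d-lc7fj -n default \
  --grace-period=0 --force
```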
It's really hard to reproduce the problem and put calico into this state. Any ideas on how to fix this properly? I was thinking of writing a script to clean this up, but it's happening more often now and that doesn't feel like the right solution.
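For completeness, a minimal sketch of the kind of cleanup script I mean (assumes `jq` is available; a pod in Terminating is one whose `metadata.deletionTimestamp` is set). Use with care, since force-deleting skips graceful teardown:

```bash
#!/usr/bin/env bash
# Sketch: find pods stuck in Terminating and force-delete them.
# A production version should probably also check how long each pod
# has been terminating before acting on it.
set -euo pipefail

kubectl get pods --all-namespaces -o json |
  jq -r '.items[]
         | select(.metadata.deletionTimestamp != null)
         | "\(.metadata.namespace) \(.metadata.name)"' |
while read -r ns pod; do
  echo "Force-deleting ${ns}/${pod}"
  kubectl delete pod "${pod}" -n "${ns}" --grace-period=0 --force
done
```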
It looks like all the calico-node pods are restarted/re-deployed at the same time when the issue occurs, but we don't hit the issue every time or on every node.
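A quick way to check whether the calico-node pods really came back at the same time (assuming a standard manifest install, where calico-node runs in kube-system with the label `k8s-app=calico-node`; an operator install uses the calico-system namespace instead):

```bash
# List calico-node pods with restart counts, sorted by start time
# so simultaneous restarts are easy to spot
kubectl get pods -n kube-system -l k8s-app=calico-node \
  -o wide --sort-by=.status.startTime
```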