Hi, I have a Kubernetes cluster on GCP running in Autopilot mode. It has had multiple services running for some time. I am now trying to add 3 new services, but I am unable to deploy them. I always get this:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
The thing is, it's random: if I delete one service and then rescale another, the error might go away, but it just moves to another container that is trying to come up. Can someone please help me with this? What can I do here?
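For reference, this is roughly how I narrow down which pods and nodes are affected (the namespace and pod name below are placeholders, not anything specific to my cluster):

```bash
# List the recent sandbox-creation failures
kubectl get events -A \
  --field-selector reason=FailedCreatePodSandBox \
  --sort-by=.lastTimestamp

# Show which node each stuck pod was scheduled onto
kubectl get pods -A -o wide | grep -E 'Init:|ContainerCreating'

# Inspect one of the stuck pods directly (placeholder names)
kubectl -n my-namespace describe pod my-stuck-pod | tail -n 20
```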
Regarding your concern, I found this article by PjoterS, in which they point out that Flannel might be causing the issue.
Flannel runs a small, single binary agent called flanneld on each host, and is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space.
As a proposed solution, they wrote: “You have to make the flannel pods work correctly on each node.”
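If you want to verify that on your side, a quick check could look like the sketch below; note that the flannel namespace and DaemonSet name depend on how it was installed (kube-flannel in recent manifests, kube-system in older ones), so treat the names as assumptions:

```bash
# Check that flannel has one healthy pod per node
kubectl -n kube-flannel get daemonset kube-flannel-ds
kubectl -n kube-flannel get pods -o wide

# Look at the logs of the flannel pod running on the node where sandboxes fail
kubectl -n kube-flannel logs <flannel-pod-on-that-node> --tail=50
```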
Hi
I am experiencing the exact same thing. I’m not using Autopilot, and I am using Istio with the CNI plugin.
Workloads run fine at first, but after nodes are restarted and a couple of days pass, this starts happening.
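In case it helps, this is roughly how I check the Istio CNI node agent when this happens; the DaemonSet is normally called istio-cni-node, but the namespace depends on how it was installed, so the commands below just grep for it:

```bash
# The Istio CNI plugin runs as a node agent DaemonSet (istio-cni-node);
# confirm it is Ready on every node, especially the one hitting errno 524
kubectl get daemonset -A | grep istio-cni
kubectl get pods -A -o wide | grep istio-cni
```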
I’m also experiencing the same thing. When I try to create new Deployments, the pods get stuck in the Init:0/1 status. It happens intermittently with no obvious root cause. The pod events show this error:
FailedCreatePodSandBox pod/xrdm-portal-6bfb74964d-kpqpc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
I’m running on GKE version 1.27.8-gke.1067004 on the regular release channel, with autoscaler on.
It’s as if the autoscaler doesn’t scale up as it should. I’ve noticed that I tend to receive this error more often when the cluster has only 3 nodes; when it decides to scale to 4, I don’t see the issue as much. I’ve also noticed that the pods resume normal operation and get “unstuck” when I delete some other Deployments.
I rebuilt my node pool, and the problem went away. I noticed that the pods that were “stuck” with this error were all landing on the same node. Once I rebuilt the node pool, I no longer had the issue.
I’ll keep an eye on my node pool to see if it happens again. For me, though, the problem seemed to be specific to one node. Is there a difference between restarting and rebuilding? By rebuilding I mean that I destroyed my node pools and created new ones.
I am having the same problem in a GKE Autopilot cluster, version 1.31.1-gke.2105000. I noticed that all the pods with this issue had been scheduled onto the same node, and draining that node allowed them to start with no problem. I can’t understand what caused it, especially since some other pods scheduled onto this “faulty” node started up smoothly.
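For anyone who wants to try the same workaround, draining the suspect node looks roughly like this (the node name is a placeholder):

```bash
# Stop new pods from being scheduled onto the suspect node
kubectl cordon gk3-my-cluster-pool-1-abcdef

# Evict its pods so they get rescheduled onto other nodes
kubectl drain gk3-my-cluster-pool-1-abcdef \
  --ignore-daemonsets \
  --delete-emptydir-data
```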
I’m managing it by keeping my nodes not too big and limiting the maximum allowed pods per node (most of the time the error appears once a node reaches 60-100 pods).
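A rough way to see how close each node is to that range, using plain kubectl output (just a sketch; the awk column assumes the default `-o wide` layout, where NODE is the 8th column):

```bash
# Count pods per node
kubectl get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn

# Check the configured per-node pod limit
kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods
```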
I also increased bpf_jit_limit as a temporary fix, but the memory leak will reach that limit again eventually anyway.
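For reference, this is the kind of node-level change I mean; it assumes you can actually reach the node (so not Autopilot), and the value is just an example, not a recommendation:

```bash
# On the affected node: check the current BPF JIT allocation limit.
# Seccomp filters are JIT-compiled BPF, so exhausting this limit can
# surface as "error loading seccomp filter: errno 524".
sysctl net.core.bpf_jit_limit

# Raise it as a stopgap (example value); this does not fix the leak,
# and the setting is lost when the node is recreated
sudo sysctl -w net.core.bpf_jit_limit=528482304
```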