The backend service specified in the configuration seems to go down every now and then, causing pod creation to fail. Since we have no visibility into, or control over, GKE-internal workloads, we are forced to downgrade and then upgrade the control plane to force the hidden workloads to restart.
Example Error:
Message: Internal error occurred: failed calling webhook "warden-mutating.common-webhooks.networking.gke.io": failed to call webhook:
Post "https://localhost:5443/webhook/warden-mutating?timeout=10s": dial tcp [::1]:5443: connect: connection refused.
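When this happens, it can help to confirm which webhook is failing and what its failure policy and timeout are. A minimal diagnostic sketch (the webhook configuration name is taken from the error above; these commands assume `kubectl` access to the affected cluster):

```shell
# List the GKE-managed mutating webhook configurations and their webhooks.
kubectl get mutatingwebhookconfigurations

# Inspect the one named in the error to see its failurePolicy,
# timeoutSeconds, and clientConfig (the localhost:5443 endpoint lives
# on the control plane, so we cannot reach or restart it ourselves).
kubectl get mutatingwebhookconfiguration \
  warden-mutating.common-webhooks.networking.gke.io -o yaml
```

Note that since the webhook backend runs on the GKE control plane, these commands only confirm the symptom; they cannot fix it.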
I was just on a call with GCP Support about this. My understanding is that they opened Google Cloud Support case 45084323 to track it.
We are seeing this sporadically and repeatedly in multiple workloads in our cluster.
I had a case created (43454827) but never got to the bottom of it. Google neither confirmed the issue is on their side nor spent enough time to conclude where it actually lies.
I found a workaround for this, so I closed the ticket. I still want to know the root cause of why only some pod creations fail on this webhook call. In our case we only get the error when Strimzi Kafka is installed, and 2 out of 5 mutating webhook connections time out.
Hi @trentza , yes I did.
The latest information I have from GCP Support is that a fix for this is supposed to be released to GKE Production today (June 13th), at least in the region where my cluster is running (europe-west1).
I am waiting to see if they will make good on that, after they have already missed two previously promised deadlines for this fix…
The root cause of the problem is apparently a race condition in a GKE component that monitors whether a Standard cluster might be suitable for upgrade to Autopilot. This race condition apparently manifests when workloads are being created at a high rate, which matches our use case.