GKE Pod Creation failing

:waving_hand: We are having issues with an internal GKE admission webhook (warden-mutating.common-webhooks.networking.gke.io) that appeared when we upgraded GKE to v1.26.3-gke.1000.

The backend service specified in the configuration seems to go down every now and then, causing pod creation to fail. Since we have no visibility into or control over GKE-internal workloads, we are forced to downgrade/upgrade the control plane to force the hidden workloads to “restart”.
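For reference, the control-plane bounce described above can be triggered with a master upgrade; a sketch, where CLUSTER_NAME, ZONE and VERSION are placeholders you would fill in for your own cluster:

```shell
# Re-running an upgrade of the control plane (even to the same patch
# version is not allowed, so pick the next available patch) restarts
# the hidden control-plane workloads as a side effect.
gcloud container clusters upgrade CLUSTER_NAME \
  --master \
  --cluster-version=VERSION \
  --zone=ZONE
```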

Example Error:

Message: Internal error occurred: failed calling webhook "warden-mutating.common-webhooks.networking.gke.io": failed to call webhook:
Post "https://localhost:5443/webhook/warden-mutating?timeout=10s": dial tcp [::1]:5443: connect: connection refused.
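If you want to inspect the webhook registration yourself, something like the following should work (assuming kubectl access to the cluster; the resource name and label are taken from the configuration below — the backing service itself runs on the control plane, so only the registration is visible):

```shell
# Dump the GKE-managed webhook configuration (read-only).
kubectl get mutatingwebhookconfiguration \
  warden-mutating.config.common-webhooks.networking.gke.io -o yaml

# List all GKE-internal "common" webhooks for comparison.
kubectl get mutatingwebhookconfigurations \
  -l networking.gke.io/common-webhooks=true
```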

The Mutating Webhook Configuration:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  creationTimestamp: "2023-05-04T12:08:27Z"
  generation: 2
  labels:
    networking.gke.io/common-webhooks: "true"
  name: warden-mutating.config.common-webhooks.networking.gke.io
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: <TheCABundle>
    url: https://localhost:5443/webhook/warden-mutating
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: warden-mutating.common-webhooks.networking.gke.io
  namespaceSelector: {}
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 25

Has anyone seen this before? Any ideas on how to deal with it?

Thanks!

2 Likes

We are just seeing this exact error after recently upgrading our cluster to 1.26.4-gke.500.

2 Likes

I am having this issue on some of my clusters, with pod creation failing on v1.24 and v1.25 private clusters.

2 Likes

Will look into this and get back to you. Have you filed a bug on this?

3 Likes

I was just on a call with GCP Support about this. My understanding is that they opened Google Cloud Support case 45084323 to track this.
We are seeing this sporadically and repeatedly in multiple workloads in our cluster.

4 Likes

I haven’t reported the bug yet; do you maybe know the best place to do this?

Also, should I still open one given the information Yoni shared?

Thanks!

1 Like

I had a case created (43454827) but never got to the bottom of it. Google didn’t say whether it’s an issue on their side, or didn’t spend enough time to conclude where the issue exists.

I found a workaround for this case, hence I closed the ticket. I still want to know the root cause of why only some pod creations fail on this webhook call. In our case we only get this error when Strimzi Kafka is installed, and 2 out of 5 mutating webhook connections time out.

1 Like

Great. Engineering is looking at it. I believe we found the issue. Will link back to the case you posted above.

1 Like

Hi Yoni,

Did you receive any feedback on this from Google?

1 Like

Hi @trentza , yes I did.
The latest information I have from GCP Support is that a fix for this is supposed to be released to GKE Production today (June 13th), at least in the region where my cluster is running (europe-west1).
I am waiting to see if they will make good on that, after they have already missed two previously promised deadlines for fixing this…

The root cause of the problem is apparently a race condition in a GKE component that monitors whether a Standard cluster might be suitable for upgrade to Autopilot. This race condition apparently manifests when there is a high rate of workload creation (which is consistent with our use case).

4 Likes

Thanks for the feedback, really appreciated :raising_hands:

1 Like

Simple: just scale your pod count down to zero, redeploy the same service, and then scale the pods back up. It worked for me…
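A sketch of that workaround, assuming the workload is a Deployment; the names "my-deployment" / "my-namespace" and the replica count are placeholders for your own workload:

```shell
# Scale the affected Deployment to zero, wait for the rollout to settle,
# then scale it back up so the pods are recreated.
kubectl scale deployment my-deployment -n my-namespace --replicas=0
kubectl rollout status deployment/my-deployment -n my-namespace
kubectl scale deployment my-deployment -n my-namespace --replicas=3
```

This only retries pod creation; since the webhook failure is intermittent, the recreated pods may pass admission on the next attempt.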

1 Like

Is there any status update on this issue?

4 Likes

Having similar issue here:

Internal error occurred: failed calling webhook "warden-mutating.common-webhooks.networking.gke.io": failed to call webhook: Post "https://localhost:5443/webhook/warden-mutating?timeout=10s": stream error: stream ID 7214191; INTERNAL_ERROR; received from peer

Did you guys receive any updates?

1 Like