Hi, right around the time of the GCP outage, at 2025-06-12 13:13:40 EDT, I started having issues with multiple load balancer rules. When accessing a URL that resolves to a load balancer created from Gateway API resources and pointing to specific backends, all I get back is an "unconditional drop overload" error.
The backends are created by Gateway API HTTPRoutes and point to ClusterIP Kubernetes Services backing a healthy Kubernetes Deployment. I know the Deployment is healthy because it has enough resources to operate and its pod can be reached and responds to in-cluster requests. I also know that the requests made to the load balancer IPs are actually arriving there, because I get these messages in the load balancer logs when querying the URLs that hit the faulty backends:
jsonPayload: {
@type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
backendTargetProjectNumber: "projects/xxxxxxxxxxxx"
cacheDecision: [2]
remoteIp: "x.x.x.x"
statusDetails: "failed_to_connect_to_backend"
}
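In case anyone wants to pull the same entries, a Cloud Logging query along these lines should surface them; I'm assuming a global external Application Load Balancer here, and the resource.type is different for regional ones:

gcloud logging read \
  'resource.type="http_load_balancer" AND jsonPayload.statusDetails="failed_to_connect_to_backend"' \
  --freshness=1d --limit=20 --format=json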
After looking at the availability of the load balancer backend services, I noticed none of the backends were actually available (0 of 0 in every backend). I'm confused, though, because there is no documentation on Google's side for "unconditional drop overload"; after further inspection it looks like an error coming from Envoy Proxy, probably on GCP's side of the load balancer.
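For reference, the per-backend health that showed 0 of 0 can also be checked from the CLI, roughly like this; the gkegw name filter is just how the Gateway-created backend services are usually prefixed (adjust it to your naming), and regional load balancers need --region instead of --global:

gcloud compute backend-services list --filter="name~gkegw"
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global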
Backends have been slowly coming back a few hours into the incident, but I'm still hitting the same problem even when attaching new HTTPRoutes and Services to our Gateways for this kind of workload.
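For completeness, the routes involved are nothing exotic; they look roughly like this (the Gateway reference, hostnames, and names below are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: nginx-redirect-route
  namespace: xxxx
spec:
  parentRefs:
    - name: external-gateway          # placeholder: our shared Gateway
      namespace: gateway-infra        # placeholder namespace
  hostnames:
    - "redirect.example.com"
  rules:
    - backendRefs:
        - name: nginx-redirect-service-xxxx   # the ClusterIP Service
          port: 80
EOF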
I’d be really grateful to know if this is an issue Google Cloud acknowledges, if there’s a workaround or even if someone else is experiencing this issue.
Thanks in advance.
Hi, after many hours of debugging I found a fix for the issue. I'm not entirely sure which combination of actions actually resolved it, but here is what I did:
I recreated all of the Gateway HTTPRoutes associated with every Service that was failing to connect to the load balancer; that did nothing. Then I deleted the Service and the pod, which didn't help either, but I noticed the ServiceNetworkEndpointGroup would not get deleted even after the Service was gone.
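If you want to check for the same symptom, listing the NEG custom resources that survive their Service looks like this (namespace is a placeholder):

kubectl get servicenetworkendpointgroups.networking.gke.io -n xxxx
kubectl get services -n xxxx   # the owning Service may already be gone at this point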
I went ahead and inspected the resource; this is what I got:
Name:         k8s1-xxxxxx-nginx-redirect-service-xxxx
Namespace:    xxxx
Labels:       networking.gke.io/managed-by=neg-controller
              networking.gke.io/service-name=nginx-redirect-service-xxx
              networking.gke.io/service-port=80
Annotations:  <none>
API Version:  networking.gke.io/v1beta1
Kind:         ServiceNetworkEndpointGroup
Metadata:
  Creation Timestamp:             2024-11-13T12:13:06Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-06-13T01:28:00Z
  Finalizers:
    networking.gke.io/neg-finalizer
  Generation:  165
  Owner References:
    API Version:           v1
    Block Owner Deletion:  false
    Controller:            true
    Kind:                  Service
    Name:                  nginx-redirect-service-xxxx
    UID:                   xxxx-xxxx-xxxx-xxxx-xxxx
  Resource Version:  9701xxxxx
  UID:               xxxx-xxxx-xxxx-xxxx-xxxx
Spec:
Status:
  Conditions:
    Last Transition Time:  2025-06-12T18:46:20Z
    Message:               googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegInitializationFailed
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2025-06-12T18:14:03Z
    Message:               failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegSyncFailed
    Status:                False
    Type:                  Synced
  Last Sync Time:  2025-06-12T19:12:30Z
  Network Endpoint Groups:
    Id:                     xxxx
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-b/networkEndpointGroups/k8s1-xxxxxx-nginx-redirectwservice-xxx
    Id:                     yyyy
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-c/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
    Id:                     zzzz
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-f/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
The error "failed to get NEG for service: googleapi: Error 503: Policy checks " was weird since this resource is managed automatically by the service (NEG controller)
I proceeded to delete the ServiceNetworkEndpointGroup myself:
kubectl delete servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx -n xxx
Then I noticed it had finalizers and the delete was hanging forever, so I patched the finalizers off:
kubectl patch servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx -n xxxx --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
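A caveat I'm not 100% sure about: removing the finalizer only deletes the Kubernetes object, so if the underlying Compute Engine NEGs end up orphaned they probably have to be cleaned up by hand, something like:

gcloud compute network-endpoint-groups list --filter="name~k8s1-"
gcloud compute network-endpoint-groups delete k8s1-xxxxxx-nginx-redirect-service-xxx --zone=us-central1-b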
That alone didn't fix the issue, so I then deleted and recreated the Service and did the same for the Deployment. After a few seconds to minutes, I could access the URLs routing to those backends without any unconditional drop overload errors.
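"Recreated" here just means deleting the objects and re-applying the original manifests; afterwards you can confirm that the freshly created ServiceNetworkEndpointGroup actually reaches Synced (the deployment name and manifest path below are placeholders):

kubectl delete service nginx-redirect-service-xxxx -n xxxx
kubectl delete deployment nginx-redirect-deployment -n xxxx     # placeholder name
kubectl apply -f ./manifests/nginx-redirect/                    # placeholder path
kubectl wait --for=condition=Synced servicenetworkendpointgroup --all -n xxxx --timeout=5m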
I hope this is helpful to someone running into the same issue.