Hi, around the time of the GCP outage, at 2025-06-12 13:13:40 EDT, I started having issues with multiple load balancer rules. When accessing a URL that resolves to a load balancer created from Gateway API resources and pointing at certain backends, the only response I get is an unconditional drop overload error.
The backends are created by Gateway API HttpRoutes and point to ClusterIP Kubernetes Services backed by a healthy Deployment. I know the Deployment is healthy because it has enough resources to operate and its pods can be reached and respond to in-cluster requests. I also know that requests made to the load balancer IPs actually arrive there, because the load balancer logs show the following whenever I query the URLs that hit the faulty backends:
jsonPayload: {
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  backendTargetProjectNumber: "projects/xxxxxxxxxxxx"
  cacheDecision: [2]
  remoteIp: "x.x.x.x"
  statusDetails: "failed_to_connect_to_backend"
}
After looking at the availability of the load balancer's backend services, I noticed none of the backends were actually available (0 of 0 in every backend). I'm confused, though, because there is no documentation on Google's side for unconditional drop overload; after further inspection it appears to be an Envoy Proxy error, probably on GCP's side.
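For reference, the availability check above can also be done from the CLI; a minimal sketch, assuming the default project is set (the backend service name is a placeholder):

```
# List the backend services behind the load balancer.
gcloud compute backend-services list

# Show per-endpoint health for one of them (placeholder name; use --region
# instead of --global for regional load balancers).
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global
```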
Backends have been slowly regenerating in the hours since the issue started, but I still hit the same problem even when associating new HttpRoutes and Services for these kinds of workloads to our Gateways.
I’d be really grateful to know whether this is an issue Google Cloud acknowledges, whether there’s a workaround, or even whether someone else is experiencing it.
Thanks in advance.
Hi, after many hours of debugging I found a solution for the issue. I'm not really sure which combination of actions ultimately fixed it, but here is what I did:
I recreated all of the Gateway HttpRoutes associated with every Service that was failing to connect to the load balancer; it did nothing. Then I deleted the Service and the pod, which didn't help either, but I noticed the ServiceNetworkEndpointGroup would not get deleted, even after the Service was gone.
I went ahead and inspected the resource; this is what kubectl describe showed:
Name:         k8s1-xxxxxx-nginx-redirect-service-xxxx
Namespace:    xxxx
Labels:       networking.gke.io/managed-by=neg-controller
              networking.gke.io/service-name=nginx-redirect-service-xxx
              networking.gke.io/service-port=80
Annotations:  <none>
API Version:  networking.gke.io/v1beta1
Kind:         ServiceNetworkEndpointGroup
Metadata:
  Creation Timestamp:             2024-11-13T12:13:06Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-06-13T01:28:00Z
  Finalizers:
    networking.gke.io/neg-finalizer
  Generation:  165
  Owner References:
    API Version:           v1
    Block Owner Deletion:  false
    Controller:            true
    Kind:                  Service
    Name:                  nginx-redirect-service-xxxx
    UID:                   xxxx-xxxx-xxxx-xxxx-xxxx
  Resource Version:        9701xxxxx
  UID:                     xxxx-xxxx-xxxx-xxxx-xxxx
Spec:
Status:
  Conditions:
    Last Transition Time:  2025-06-12T18:46:20Z
    Message:               googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegInitializationFailed
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2025-06-12T18:14:03Z
    Message:               failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegSyncFailed
    Status:                False
    Type:                  Synced
  Last Sync Time:          2025-06-12T19:12:30Z
  Network Endpoint Groups:
    Id:                     xxxx
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-b/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
    Id:                     yyyy
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-c/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
    Id:                     zzzz
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-f/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
The error "failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable." was weird, since this resource is managed automatically by the NEG controller on behalf of the Service.
I proceeded to delete the ServiceNetworkEndpointGroup by myself:
kubectl delete servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx -n xxx
Then I noticed it had finalizers and the deletion hung forever, so I patched the finalizers out:
kubectl patch servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]' -n xxxx
That alone didn't fix the issue, so I deleted and recreated the Service, and did the same for the Deployment. After a few seconds to minutes, I could access the URLs routing to those backends with no unconditional drop overload errors anymore.
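Putting the steps together, the full recovery sequence looked roughly like this; a sketch only, since the namespace, SvcNeg name, and manifest filenames are placeholders for your own:

```
NS=xxxx                                        # placeholder namespace
SVCNEG=k8s1-xxxxxx-nginx-redirect-service-xxx  # placeholder SvcNeg name

# 1. Delete the stuck ServiceNetworkEndpointGroup (may hang on the finalizer).
kubectl delete servicenetworkendpointgroup "$SVCNEG" -n "$NS" --wait=false

# 2. If it stays in Terminating, strip the NEG finalizer so deletion completes.
kubectl patch servicenetworkendpointgroup "$SVCNEG" -n "$NS" \
  --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# 3. Recreate the Service and Deployment so the NEG controller rebuilds state.
kubectl delete -f service.yaml -f deployment.yaml -n "$NS"
kubectl apply -f service.yaml -f deployment.yaml -n "$NS"
```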
I hope this is helpful to someone who's having the same issue.
This was veeery useful, saved lots of time!
I ran into this same unconditional drop overload error, and the steps of manually destroying/recreating the resources as you describe did solve it for a while. When it came back, I dug in further and noticed this error on the Service:
Warning ProcessServiceFailed 3m42s (x132 over 12h) neg-controller error processing service "default/<service-name>": port 443 specified in "cloud.google.com/neg" doesn't exist in the service
That would be related to this service annotation:
cloud.google.com/neg: '{"exposed_ports":{"443": {},"50051":{},"80":{}}}'
At some point during development I had added and then removed port 443 from the Service, but hadn't removed it from the annotation. Oddly enough, this configuration worked for quite literally months before randomly breaking. Just updating the annotation to remove the non-existent port was enough to fix the error, without doing anything else.
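This kind of drift can be caught with a quick script. A minimal sketch, assuming python3 is available; the Service JSON below is a stand-in for `kubectl get service <name> -o json`, with the same port mismatch as in my annotation above:

```shell
# Sample Service manifest; in a real cluster you would instead run:
#   kubectl get service <service-name> -o json > svc.json
cat > svc.json <<'EOF'
{
  "metadata": {
    "annotations": {
      "cloud.google.com/neg": "{\"exposed_ports\":{\"443\":{},\"50051\":{},\"80\":{}}}"
    }
  },
  "spec": {
    "ports": [{"port": 80}, {"port": 50051}]
  }
}
EOF

# List ports named in the NEG annotation that no longer exist in spec.ports.
STALE=$(python3 - <<'EOF'
import json
svc = json.load(open("svc.json"))
neg = json.loads(svc["metadata"]["annotations"]["cloud.google.com/neg"])
declared = {str(p["port"]) for p in svc["spec"]["ports"]}
print(" ".join(sorted(set(neg["exposed_ports"]) - declared)))
EOF
)
echo "stale NEG ports: $STALE"
```

Any port this prints is one the neg-controller will fail on with ProcessServiceFailed; here it reports 443.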
This error is a symptom of broken NEG state, not an actual load-balancer overload.
From the cases above, the root cause consistently comes down to NEG reconciliation failures under the Gateway API. Two concrete issues can trigger it:

- Stuck / orphaned ServiceNetworkEndpointGroups: if a Service is deleted or recreated but its ServiceNetworkEndpointGroup remains stuck (often held by the networking.gke.io/neg-finalizer), the NEG controller never fully reconciles. Backends remain unusable and the load balancer starts unconditionally dropping traffic. Deleting the NEG (removing the finalizer if needed), then recreating the Service/Deployment, forces a clean rebuild and restores traffic.

- Mismatch between Service ports and the cloud.google.com/neg annotation: if the annotation references a port that no longer exists in the Service spec, the neg-controller repeatedly fails with ProcessServiceFailed. This can work for months and then suddenly break once the controller re-syncs. Simply fixing the annotation to match the Service ports immediately resolves the issue.
The “unconditional drop overload” message is downstream behavior when the load balancer has no valid backends, not a capacity problem.
Recommendation: when you see this error, always inspect:

- ServiceNetworkEndpointGroup status and finalizers
- neg-controller events on the Service
- exact alignment between Service ports and the NEG annotation

In most cases, correcting NEG state or annotation drift is sufficient without touching Gateway or HttpRoute resources.
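The checklist above maps to a short inspection sequence; a sketch with placeholder names (svcneg is the short name GKE registers for ServiceNetworkEndpointGroup):

```
# 1. SvcNeg status and finalizers: check the Initialized/Synced conditions.
kubectl get svcneg -n <namespace>
kubectl describe svcneg <svcneg-name> -n <namespace>

# 2. neg-controller events on the Service (e.g. ProcessServiceFailed).
kubectl describe service <service-name> -n <namespace>

# 3. NEG annotation vs. the ports actually declared on the Service.
kubectl get service <service-name> -n <namespace> \
  -o jsonpath="{.metadata.annotations['cloud\.google\.com/neg']}"
kubectl get service <service-name> -n <namespace> \
  -o jsonpath='{.spec.ports[*].port}'
```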