Good day!
We have a standard setup: GKE 1.27.2-gke.1200 with an Application Load Balancer (ALB) whose backend is a Network Endpoint Group (NEG). The NEG is created from a regular k8s Service using this annotation:
cloud.google.com/neg: '{"exposed_ports":{"80":{"name":"neg-name"}}}'
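For context, a Service carrying this annotation would look roughly like the sketch below. The Service name, namespace, port, and NEG name are taken from the error message quoted in this post; the Service type, selector label, and targetPort are placeholders assumed for illustration only.

```yaml
# Sketch only: type, selector, and targetPort are assumptions, not the real values.
apiVersion: v1
kind: Service
metadata:
  name: pi-main-service        # from the error message
  namespace: pi-main           # from the error message
  annotations:
    # Requests a standalone NEG with a custom name for Service port 80.
    cloud.google.com/neg: '{"exposed_ports":{"80":{"name":"personal-info-neg"}}}'
spec:
  type: ClusterIP              # assumed
  selector:
    app: pi-main               # hypothetical pod label
  ports:
    - port: 80                 # must match the key under exposed_ports
      targetPort: 8080         # hypothetical container port
      protocol: TCP
```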
Problem:
This setup works at first, but whenever GKE is upgraded, the NEG gets this message and freezes:
Message: [neg name personal-info-neg is already in use, found conflicting description: expected description of NEG object "us-central1-a"/"personal-info-neg" to be {"cluster-uid":"16ce1ed1-44d0-465c-bedd-8a59e3387ef2","namespace":"pi-main","service-name":"pi-main-service","port":"80"}, but got {"cluster-uid":"89e370fb-16ed-45c5-946e-5e9ab9e56c8b","namespace":"pi-main","service-name":"pi-main-service","port":"80"}, neg name personal-info-neg is already in use, found conflicting description: expected description of NEG object "us-central1-b"/"personal-info-neg" to be {"cluster-uid":"16ce1ed1-44d0-465c-bedd-8a59e3387ef2","namespace":"pi-main","service-name":"pi-main-service","port":"80"}, but got {"cluster-uid":"89e370fb-16ed-45c5-946e-5e9ab9e56c8b","namespace":"pi-main","service-name":"pi-main-service","port":"80"}]
Reason: NegInitializationFailed
Status: False
When the problem started:
We first hit this issue around spring 2023. From November 2022 until then, everything worked fine and the k8s svcneg never failed.
Possible workarounds:
- I know that I can delete the ALB and the NEGs in the GCP Console and then delete the svcneg object in k8s, but that is not convenient, because a few weeks later GKE upgrades again and the NEG freezes again.
- I can manually add endpoints to the NEG in the GCP Console, but that is just as inconvenient.
  When I did this, READINESS GATES took approximately 5-10 minutes to move from 0 to 1. We deploy the infra and k8s services with Terraform, and every new deployment failed because I had to go to the NEG, manually add endpoints, and wait until READINESS GATES moved from 0 to 1. It's really inconvenient.
What do I want?
Do you have any ideas on how I can patch the NEG to solve this issue, or is there another approach?
We need the ALB for better SSL certificate handling, etc., which is why we use a NEG as the backend.