Hi everybody, my Autopilot cluster first, and then my nodes, were auto-upgraded on 06/01/2023. After a while my alert policies fired an email informing me that the services were down.
I’ve tried restarting everything, reviewing the firewall policies and so on, but my NodePort services cannot be reached via EXTERNAL_IP:NODE_PORT, and my Ingress now fails with health-check errors that seem related to a loss of connectivity between the nodes.
Here are the command outputs:
kubectl get nodes
NAME                                                 STATUS   ROLES    AGE    VERSION
gk3-autopilot-cluster-1-nap-z76sqx2u-37293196-q30g   Ready    <none>   6d6h   v1.24.5-gke.600
gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-9qag   Ready    <none>   6d6h   v1.24.5-gke.600
gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-v6rr   Ready    <none>   6d6h   v1.24.5-gke.600
gcloud container operations list
NAME TYPE LOCATION TARGET STATUS_MESSAGE STATUS START_TIME END_TIME
operation-1672970624599-0763d427 UPGRADE_MASTER europe-west1 autopilot-cluster-1 DONE 2023-01-06T02:03:44.599057045Z 2023-01-06T02:34:25.192667243Z
operation-1672981183724-e661ddd0 UPGRADE_NODES europe-west1 nap-z76sqx2u DONE 2023-01-06T04:59:43.724322639Z 2023-01-06T05:18:18.826081143Z
operation-1673261783620-a79a6273 AUTO_REPAIR_NODES europe-west1 gk3-autopilot-cluster-1-nap-z76sqx2u-d8fb9724-tnbh DONE 2023-01-09T10:56:23.620227191Z 2023-01-09T10:59:31.391167809Z
It is a public cluster.
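In case it helps, the health state seen by the Ingress controller can be read straight from the Ingress object (the name below is a placeholder):
kubectl describe ingress INGRESS_NAME
The ingress.kubernetes.io/backends annotation in that output lists each backend as HEALTHY or UNHEALTHY.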
I’ve tried port-forwarding and it worked (container port 8761):
gcloud container clusters get-credentials autopilot-cluster-1 --region europe-west1 --project concise-ivy-name \
&& kubectl port-forward $(kubectl get pod --selector="io.kompose.service=service-name" --output jsonpath='{.items[0].metadata.name}') 8080:8761
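For anyone reproducing this, the forwarded port can be checked locally with something like (the path is just an example, any endpoint the service exposes will do):
curl -v http://localhost:8080/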
The NodePort services can be accessed by other NodePort services on the container port.
About your first question: I’ve tried to SSH into a node, but I’ve learned that this isn’t possible in Autopilot mode.
Hmm … interesting. I’m not aware of any known issues at this point. Seems like this is likely a firewall issue.
There should be a firewall rule named something like "k8s-fw-l7-[random-hash]" which permits health checks for NodePort services as well as NEGs. This should be created automatically by the Ingress controller, but you might want to double-check that it still exists / is accurate.
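If it helps, you can double-check it with something along these lines (the filter assumes the default naming; RULE_NAME is whatever the first command returns):
gcloud compute firewall-rules list --filter="name~k8s-fw-l7"
gcloud compute firewall-rules describe RULE_NAME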
The Ingress controller does create the firewall rule that you mention, and this rule allows ingress from 130.211.0.0/22 and 35.191.0.0/16 on tcp:30000-32767 and tcp:8761.
However, the health check still fails (it was fine before the auto-upgrade), and it is still not possible to reach the NodePort service via EXTERNAL_IP:NODE_PORT.
UPDATE: I’ve also created two insecure ingress and egress firewall rules open to 0.0.0.0/0, and that does not work either.
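For completeness, those temporary rules were along these lines (rule names and network are placeholders, and they are deliberately wide open, so I’ll delete them again afterwards):
gcloud compute firewall-rules create temp-allow-all-ingress --direction=INGRESS --network=default --allow=tcp,udp,icmp --source-ranges=0.0.0.0/0
gcloud compute firewall-rules create temp-allow-all-egress --direction=EGRESS --network=default --allow=tcp,udp,icmp --destination-ranges=0.0.0.0/0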
I have a similar issue. My GKE Autopilot cluster was updated to the latest version from the Regular channel. After the upgrade I can’t reach (ping, telnet) the pod network over the service network; I always get a timeout. At the same time, I can connect to the pods directly.
BTW, the upgrade was done yesterday, but the issue only appeared today.
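To show what I mean, this is the kind of test I run from a throwaway pod (the image and the IP/port values are placeholders):
kubectl run net-debug --rm -it --restart=Never --image=busybox:1.36 -- sh
# then, inside the pod:
wget -qO- -T 5 http://SERVICE_CLUSTER_IP:SERVICE_PORT   # times out
wget -qO- -T 5 http://POD_IP:8761                       # answers fine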
Try this one!! I’ve just fixed my issue.
Consistently unreliable workload performance on a specific node
In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by using the following command:
kubectl drain NODE_NAME --ignore-daemonsets
Replace NODE_NAME with the name of the problematic node. You can find the node name by running kubectl get nodes.
GKE does the following:
Evicts existing workloads from the node and stops scheduling workloads on that node.
Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
Terminates any workloads that remain on the node and repairs or recreates the node over time.
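For example, with one of the node names from earlier in this thread (pick whichever node hosts the failing workloads):
kubectl drain gk3-autopilot-cluster-1-nap-z76sqx2u-37293196-q30g --ignore-daemonsets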
Thanks for the help. Unfortunately it did not work. I’ve also tried to upgrade the cluster to version 1.24.8-gke.2000 in order to trigger an upgrade/recreation of the nodes, but the nodes were recreated at an unexpected version and I still cannot access my NodePort services due to a timeout:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
gk3-autopilot-cluster-1-nap-wnkht46x-61886929-p3uy Ready <none> 17h v1.24.7-gke.900 10.132.0.2 34.78.222.0 Container-Optimized OS from Google 5.10.147+ containerd://1.6.6
gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-mjvh Ready <none> 74m v1.24.5-gke.600 10.132.0.3 34.77.206.85 Container-Optimized OS from Google 5.10.133+ containerd://1.6.6
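For reference, the upgrade above can be triggered with something along these lines (cluster name and region are the ones from earlier in the thread; --master targets the control plane):
gcloud container clusters upgrade autopilot-cluster-1 --region europe-west1 --master --cluster-version 1.24.8-gke.2000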