GKE Autopilot - connectivity issues after node auto-upgrade

Hi everybody, my Autopilot cluster first, and then my nodes, were auto-upgraded on 2023-01-06. After a while my alert policies fired an email informing me that the services were down.
I’ve tried restarting everything, reviewing the firewall policies, and so on, but my NodePort services cannot be reached via EXTERNAL_IP:NODE_PORT, and my Ingress now fails with health check errors that seem related to a loss of connectivity between the nodes.

Here are the command outputs:

kubectl get nodes
NAME                                                 STATUS   ROLES    AGE    VERSION
gk3-autopilot-cluster-1-nap-z76sqx2u-37293196-q30g   Ready    <none>   6d6h   v1.24.5-gke.600
gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-9qag   Ready    <none>   6d6h   v1.24.5-gke.600
gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-v6rr   Ready    <none>   6d6h   v1.24.5-gke.600
gcloud container operations list

NAME                              TYPE               LOCATION      TARGET                                              STATUS_MESSAGE  STATUS  START_TIME                      END_TIME
operation-1672970624599-0763d427  UPGRADE_MASTER     europe-west1  autopilot-cluster-1                                                 DONE    2023-01-06T02:03:44.599057045Z  2023-01-06T02:34:25.192667243Z
operation-1672981183724-e661ddd0  UPGRADE_NODES      europe-west1  nap-z76sqx2u                                                        DONE    2023-01-06T04:59:43.724322639Z  2023-01-06T05:18:18.826081143Z
operation-1673261783620-a79a6273  AUTO_REPAIR_NODES  europe-west1  gk3-autopilot-cluster-1-nap-z76sqx2u-d8fb9724-tnbh                  DONE    2023-01-09T10:56:23.620227191Z  2023-01-09T10:59:31.391167809Z
  1. Are you able to access the NodePort services from within the same VPC?
  2. Is this a private or public cluster?
  3. Can you post the event log for your Ingress? (One way to pull it is sketched below.)
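
For #3, something like the following usually surfaces the relevant events (INGRESS_NAME is a placeholder; add -n NAMESPACE if the Ingress is not in the default namespace):

kubectl describe ingress INGRESS_NAME
kubectl get events --field-selector involvedObject.kind=Ingress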

It is a public cluster.

I’ve tried port-forwarding and it worked (container port 8761):

gcloud container clusters get-credentials autopilot-cluster-1 --region europe-west1 --project concise-ivy-name \
 && kubectl port-forward $(kubectl get pod --selector="io.kompose.service=service-name" --output jsonpath='{.items[0].metadata.name}') 8080:8761
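
With the forward in place, the service can be checked locally, e.g. (assuming it answers HTTP on 8761):

curl -v http://localhost:8080/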

The NodePort services can be accessed by the other NodePort services on the container port.
Regarding your first question: I tried to SSH into a node, but I’ve learned that this is not possible in Autopilot mode.

Hmm … interesting. I’m not aware of any known issues at this point. It seems likely that this is a firewall issue.

There should be a firewall rule named something like "k8s-fw-l7-[random-hash]" which permits health checks for NodePort services as well as NEGs. This should be created automatically by the Ingress controller, but you might want to double-check that it still exists and is accurate.
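
One way to double-check (the filter is just a guess at the auto-generated name; RANDOM_HASH is a placeholder):

gcloud compute firewall-rules list --filter="name~k8s-fw-l7"
gcloud compute firewall-rules describe k8s-fw-l7-RANDOM_HASH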

The Ingress controller does create the firewall rule you mention, and this rule allows ingress from 130.211.0.0/22 and 35.191.0.0/16 on tcp:30000-32767 and tcp:8761.
However, the health check seems to fail (it was fine before the auto-upgrade), and it is still not possible to reach the NodePort service using EXTERNAL_IP:NODE_PORT.
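
For reference, the backend health as seen by the load balancer can be queried like this (BACKEND_SERVICE_NAME is a placeholder for whichever backend service the Ingress controller generated; the list command shows the candidates):

gcloud compute backend-services list
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global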

UPDATE: I’ve also created two wide-open ingress and egress firewall rules on 0.0.0.0/0, and that does not work either.

I have a similar issue. My GKE Autopilot cluster was updated to the latest version from the Regular channel. After the upgrade I can’t reach (ping, telnet) the pod network over the service network; I always get a timeout, yet at the same time I can connect to the pods directly.

BTW, the upgrade was done yesterday, but the issue only appeared today.
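
In case it helps others reproduce this, the test was roughly the following (SERVICE_NAME, POD_IP, and PORT are placeholders for my actual service and pod):

kubectl run -it --rm nettest --image=busybox --restart=Never -- sh
# inside the test pod:
wget -qO- -T 5 http://SERVICE_NAME:PORT   # times out via the Service
wget -qO- -T 5 http://POD_IP:PORT         # works when hitting the pod directly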

Try this one!! I’ve just fixed my issue.

Consistently unreliable workload performance on a specific node

In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by using the following command:

kubectl drain NODE_NAME --ignore-daemonsets

Replace NODE_NAME with the name of the problematic node. You can find the node name by running kubectl get nodes.

GKE does the following:

  1. Evicts existing workloads from the node and stops scheduling workloads on that node.
  2. Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
  3. Terminates any workloads that remain on the node and repairs or recreates the node over time.
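
For example, plugging in one of the node names from the outputs earlier in this thread:

kubectl get nodes
kubectl drain gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-9qag --ignore-daemonsets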


Thanks for the help. Unfortunately it did not work. I also tried upgrading the cluster to version 1.24.8-gke.2000 in order to trigger an upgrade/recreation of the nodes, but the nodes were recreated at a mismatched version, and I still cannot access my NodePort services due to a timeout:

kubectl get nodes -o wide
NAME                                                 STATUS                     ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gk3-autopilot-cluster-1-nap-wnkht46x-61886929-p3uy   Ready                      <none>   17h   v1.24.7-gke.900   10.132.0.2    34.78.222.0      Container-Optimized OS from Google   5.10.147+        containerd://1.6.6
gk3-autopilot-cluster-1-nap-z76sqx2u-63ad9d0b-mjvh   Ready                      <none>   74m   v1.24.5-gke.600   10.132.0.3    34.77.206.85     Container-Optimized OS from Google   5.10.133+        containerd://1.6.6
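
(For completeness, the timeout can be reproduced with a plain request against one of the node external IPs above; NODE_PORT is a placeholder for the service’s node port:)

curl -m 5 http://34.78.222.0:NODE_PORT/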