Update to GKE 1.31.6-gke.1221000 broke metrics-server

Today I’ve updated two different clusters to 1.31.6-gke.1221000, one from 1.30.9-gke.1127000 and the other from 1.31.6-gke.1099000.

After the upgrade, both clusters have the metrics-server-v1.31.0 deployment in a crash-loop state; the logs the pod emits are the following on both clusters:

metrics-server I0324 15:44:16.834338 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
metrics-server I0324 15:44:23.435808 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
metrics-server I0324 15:44:24.733858 1 secure_serving.go:213] Serving securely on [::]:10250
metrics-server I0324 15:44:24.734200 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
metrics-server I0324 15:44:24.734222 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
metrics-server I0324 15:44:24.734256 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
metrics-server I0324 15:44:24.734377 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
metrics-server I0324 15:44:24.734759 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
metrics-server I0324 15:44:24.734779 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
metrics-server I0324 15:44:24.734804 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
metrics-server I0324 15:44:24.734809 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
metrics-server I0324 15:44:24.934399 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
metrics-server I0324 15:44:25.131735 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
metrics-server I0324 15:44:25.231752 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
metrics-server E0324 15:54:34.386399 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.1.20.30:10250/metrics/resource\": context deadline exceeded" node="gke-test-cluster-application-a1cd6ffd-tgrx" timeout="10s"

and the metrics-server-nanny pod simply doesn’t start and doesn’t write any logs to stdout. I’ve checked my network firewall rules: port 10250 is correctly added to the allowed ports and all the nodes of the cluster are targeted by the rule.
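For reference, a quick way to inspect the crash-looping deployment looks roughly like this (the k8s-app=metrics-server label selector and the metrics-server container name are assumptions based on how the GKE add-on is usually labelled):

kubectl -n kube-system get pods -l k8s-app=metrics-server # check pod status and restart counts

kubectl -n kube-system logs deploy/metrics-server-v1.31.0 -c metrics-server --previous # logs from the last crashed container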

Both clusters were working fine before the update.

Does anyone have an idea where I should look to get it up and running again?


Check if Kubelet Secure Port (10250) is Open:
You mentioned that firewall rules allow 10250, but let’s double-check:

Run the following command from a node inside the cluster:

nc -zv 10.1.20.30 10250

If the connection fails, there might be a network policy blocking access.
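If you want to rule out NetworkPolicies, listing them across all namespaces is a quick sanity check (nothing cluster-specific assumed here):

kubectl get networkpolicy --all-namespaces # look for policies that could block traffic to the kubelet port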

Ensure that the GKE master can reach the nodes on port 10250:

kubectl get nodes -o wide # Check node IPs

gcloud compute firewall-rules list --filter="name~'kubernetes'" # Verify rules
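If you want to dig into a specific rule, something like this shows its allowed ports and targets (RULE_NAME is a placeholder; GKE-created rules are usually prefixed with gke-<cluster-name>):

gcloud compute firewall-rules describe RULE_NAME --format="yaml(allowed, sourceRanges, targetTags)" # confirm tcp:10250 is allowed and the node tags are targeted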

Thanks,
Darwin Vinoth.

Hello, I’ve tried the three commands and I don’t see anything wrong (the IPs are different because this particular cluster uses spot instances).

nc command:

[screenshot of the nc -zv output]

get nodes:

[screenshot of the kubectl get nodes -o wide output]

firewall list:

[screenshot of the gcloud firewall-rules list output]

Both clusters have the insecure kubelet port turned off following the official GKE guides, hence the duplicated exkubelet and inkubelet rules. I’ll reiterate that on both clusters with the older version I never had any trouble.

Today the errors are somewhat different; it seems there is some kind of TLS error on the kubelet?
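To see what certificate the kubelet is actually presenting on 10250, a quick check from another node looks like this (the IP is just the example node address from the earlier logs; this only inspects the serving cert, it doesn’t authenticate):

openssl s_client -connect 10.1.20.30:10250 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates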

I’ve managed to change the image of the pod-nanny to the previous tag 1.8.23-gke.1, and the pod is up and running without problems. But then GKE sees the change and reverts it to 1.8.19-gke.6, and it breaks again. The 1.8.19 tag was pushed 4 days ago and the 1.8.23 tag 19 days ago…
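For anyone wanting to try the same temporary workaround, the override is roughly this (the metrics-server-nanny container name and the gke.gcr.io/addon-resizer image path are assumptions about how the add-on is wired up; as noted above, the GKE addon manager reconciles it back after a short while):

kubectl -n kube-system set image deployment/metrics-server-v1.31.0 metrics-server-nanny=gke.gcr.io/addon-resizer:1.8.23-gke.1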

This was fixed with a patch version ending in 001. For me it was 1.30.10-gke.1227001 when 1.30.10-gke.1227000 was broken.
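For anyone applying the fix manually instead of waiting for auto-upgrade, the control-plane upgrade is along these lines (cluster name and region are placeholders; use --zone for a zonal cluster):

gcloud container clusters upgrade CLUSTER_NAME --master --cluster-version=1.30.10-gke.1227001 --region=REGION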


Yeah, I’m seeing right now that they released 1.31.6-gke.1221001 and 1.31.7-gke.1013001 yesterday. I will have to wait for the rollout in my regions because I can’t see them in my upgrade list yet…
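To check whether the patched versions have reached a given location without waiting for the console dropdown, something like this should work (the region is just an example):

gcloud container get-server-config --region=europe-west1 --format="yaml(validMasterVersions)" # list control-plane versions currently available in that region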


Also seeing this. I’m not seeing the *001 versions in my dropdown options yet either. I’ve only rolled this out to my lower envs; I’m going to give it until the end of today to see if the patch gets rolled out, and if not I guess it’s rollback time :disappointed_face:

Google says it could take until Saturday, March 29 for it to be rolled out everywhere. I had the patch available everywhere except one environment, so I rolled that one back.

In case someone needs it, I can confirm that I can see these new versions in the europe-west1 region. The rollout is proceeding slowly but surely :slightly_smiling_face:

And I can confirm that the crash loop cause has been fixed!