Update to GKE 1.31.6-gke.1221000 broke metrics-server

Today I’ve updated two different clusters to 1.31.6-gke.1221000, one from 1.30.9-gke.1127000 and the other from 1.31.6-gke.1099000.

After the upgrade, both clusters have the metrics-server-v1.31.0 deployment in a crash-loop state; the logs the pod emits are the following on both clusters:

metrics-server I0324 15:44:16.834338 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
metrics-server I0324 15:44:23.435808 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
metrics-server I0324 15:44:24.733858 1 secure_serving.go:213] Serving securely on [::]:10250
metrics-server I0324 15:44:24.734200 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
metrics-server I0324 15:44:24.734222 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
metrics-server I0324 15:44:24.734256 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
metrics-server I0324 15:44:24.734377 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
metrics-server I0324 15:44:24.734759 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
metrics-server I0324 15:44:24.734779 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
metrics-server I0324 15:44:24.734804 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
metrics-server I0324 15:44:24.734809 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
metrics-server I0324 15:44:24.934399 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
metrics-server I0324 15:44:25.131735 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
metrics-server I0324 15:44:25.231752 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
metrics-server E0324 15:54:34.386399 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.1.20.30:10250/metrics/resource\": context deadline exceeded" node="gke-test-cluster-application-a1cd6ffd-tgrx" timeout="10s"

and the metrics-server-nanny pod simply doesn’t start and doesn’t write any logs to stdout. I’ve checked my network firewall rules: port 10250 is correctly added to the allowed ports and all the nodes of the cluster are targeted by the rule.
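For reference, a quick way to inspect the crash-looping deployment looks roughly like this (the k8s-app=metrics-server label selector and the metrics-server container name are assumptions based on how the GKE add-on is usually labelled):

kubectl -n kube-system get pods -l k8s-app=metrics-server # check pod status and restart counts

kubectl -n kube-system logs deploy/metrics-server-v1.31.0 -c metrics-server --previous # logs from the last crashed container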

Both clusters were working fine before the update.

Does anyone have an idea where I should look to get it up and running again?


Check if Kubelet Secure Port (10250) is Open:
You mentioned that firewall rules allow 10250, but let’s double-check:

Run the following command from a node inside the cluster:

nc -zv 10.1.20.30 10250

If the connection fails, there might be a network policy blocking access.
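If you want to rule out NetworkPolicies, listing them across all namespaces is a quick sanity check (nothing cluster-specific assumed here):

kubectl get networkpolicy --all-namespaces # look for policies that could block traffic to the kubelet port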

Ensure that the GKE master can reach the nodes on port 10250:

kubectl get nodes -o wide # Check node IPs

gcloud compute firewall-rules list --filter="name~'kubernetes'" # Verify rules
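If you want to dig into a specific rule, something like this shows its allowed ports and targets (RULE_NAME is a placeholder; GKE-created rules are usually prefixed with gke-<cluster-name>):

gcloud compute firewall-rules describe RULE_NAME --format="yaml(allowed, sourceRanges, targetTags)" # confirm tcp:10250 is allowed and the node tags are targeted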

Thanks,
Darwin Vinoth.

Hello, I’ve tried the three commands and I don’t see anything wrong (the IPs are different because this particular cluster uses spot instances).

nc command:

[screenshot of the nc -zv output]

get nodes:

[screenshot of the kubectl get nodes -o wide output]

firewall list:

[screenshot of the gcloud firewall-rules list output]

Both clusters have the insecure kubelet port turned off following the official GKE guides, hence the duplicated exkubelet and inkubelet rules. I’ll reiterate that on both clusters with the older version I never had any trouble.

Today the errors are somewhat different; it seems there is some kind of TLS error on the kubelet?
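To see what certificate the kubelet is actually presenting on 10250, a quick check from another node looks like this (the IP is just the example node address from the earlier logs; this only inspects the serving cert, it doesn’t authenticate):

openssl s_client -connect 10.1.20.30:10250 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates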

I’ve managed to change the image of the pod-nanny to the previous tag 1.8.23-gke.1, and the pod is up and running without problems. But then GKE sees the change and reverts it to 1.8.19-gke.6, and it breaks again. The 1.8.19 tag was pushed 4 days ago and the 1.8.23 tag 19 days ago…
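For anyone wanting to try the same temporary workaround, the override is roughly this (the metrics-server-nanny container name and the gke.gcr.io/addon-resizer image path are assumptions about how the add-on is wired up; as noted above, the GKE addon manager reconciles it back after a short while):

kubectl -n kube-system set image deployment/metrics-server-v1.31.0 metrics-server-nanny=gke.gcr.io/addon-resizer:1.8.23-gke.1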

This was fixed with a patch version ending in 001. For me it was 1.30.10-gke.1227001 when 1.30.10-gke.1227000 was broken.
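For anyone applying the fix manually instead of waiting for auto-upgrade, the control-plane upgrade is along these lines (cluster name and region are placeholders; use --zone for a zonal cluster):

gcloud container clusters upgrade CLUSTER_NAME --master --cluster-version=1.30.10-gke.1227001 --region=REGION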


Yeah, I’m seeing right now that they released 1.31.6-gke.1221001 and 1.31.7-gke.1013001 yesterday. I will have to wait for the rollout in my regions because I can’t see them in my upgrade list yet…
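To check whether the patched versions have reached a given location without waiting for the console dropdown, something like this should work (the region is just an example):

gcloud container get-server-config --region=europe-west1 --format="yaml(validMasterVersions)" # list control-plane versions currently available in that region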


Also seeing this. I’m not seeing the *001 versions in my dropdown options yet either. I’ve only rolled this out to my lower envs; I’m going to give it until the end of today to see if the patch gets rolled out, and if not I guess it’s rollback time :disappointed_face:

Google says it could take until Saturday, March 29 for it to be rolled out everywhere. I had the patch available everywhere except one environment, so I rolled that one back.

In case someone needs it, I can confirm that I can see these new versions in the europe-west1 region. The rollout is proceeding slowly but surely :slightly_smiling_face:

And I can confirm that the crash loop cause has been fixed!