Today I upgraded two different clusters to 1.31.6-gke.1221000, one from 1.30.9-gke.1127000 and the other from 1.31.6-gke.1099000.
After the upgrade, both clusters have the metrics-server-v1.31.0 deployment in a crashloop state; on both clusters the pod emits the following logs:
metrics-server I0324 15:44:16.834338 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
metrics-server I0324 15:44:23.435808 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
metrics-server I0324 15:44:24.733858 1 secure_serving.go:213] Serving securely on [::]:10250
metrics-server I0324 15:44:24.734200 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
metrics-server I0324 15:44:24.734222 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
metrics-server I0324 15:44:24.734256 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
metrics-server I0324 15:44:24.734377 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
metrics-server I0324 15:44:24.734759 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
metrics-server I0324 15:44:24.734779 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
metrics-server I0324 15:44:24.734804 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
metrics-server I0324 15:44:24.734809 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
metrics-server I0324 15:44:24.934399 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
metrics-server I0324 15:44:25.131735 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
metrics-server I0324 15:44:25.231752 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
metrics-server E0324 15:54:34.386399 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.1.20.30:10250/metrics/resource\": context deadline exceeded" node="gke-test-cluster-application-a1cd6ffd-tgrx" timeout="10s"
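The only actual error is the scrape timeout against the kubelet on port 10250 in the last line. As a rough way to check whether this is purely a network-path problem, the kubelet endpoint can be probed from a throwaway pod inside the cluster; this is only a sketch (the pod name and the curlimages/curl image are arbitrary choices, while the node IP, port and path are the ones from the error above). An immediate 401 Unauthorized is actually a good sign here, because it means the port is reachable and only authentication is missing:

# Probe the kubelet resource-metrics endpoint from a throwaway pod
kubectl run kubelet-probe --rm -it --restart=Never --image=curlimages/curl \
  --command -- curl -vk --connect-timeout 10 https://10.1.20.30:10250/metrics/resource
# "401 Unauthorized"      -> port reachable, request rejected only for auth (expected without a token)
# "Connection timed out"  -> same network problem that metrics-server reports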
The metrics-server-nanny, meanwhile, simply doesn't start and doesn't write anything to stdout. I've checked my network firewall rules: port 10250 is in the allowed list and the rule targets all the nodes of the cluster.
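For reference, those checks boil down to something like this (a sketch: FIREWALL_RULE_NAME is a placeholder for the actual rule, and the -c metrics-server-nanny flag assumes the nanny runs as a sidecar container of the metrics-server-v1.31.0 deployment):

# Confirm the rule really allows TCP 10250 and targets the node tags
gcloud compute firewall-rules describe FIREWALL_RULE_NAME \
  --format="yaml(allowed,sourceRanges,targetTags,direction,disabled)"
# Check why the nanny isn't starting and whether a previous instance logged anything
kubectl -n kube-system describe deployment metrics-server-v1.31.0
kubectl -n kube-system logs deploy/metrics-server-v1.31.0 -c metrics-server-nanny --previous

Since metrics-server scrapes the kubelets from a pod IP, it might also be worth checking that the rule's source ranges cover the pod CIDR and not only the node range.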
Both clusters were working fine before the upgrade.
Does anyone have an idea where I should look to get it up and running again?