Hello,
I recently performed a manual upgrade of both the cluster and its node pool on GKE. Since the upgrade, the gke-metrics-agent has been consistently logging errors when it attempts to scrape the gpu-maintenance-handler. This is particularly strange since none of my workloads use GPUs.
Previous Version: 1.25.8-gke.1000
Current Version: 1.26.5-gke.1200
These errors did not appear in the log history prior to the upgrade, and I have not identified any performance issues or disruptions associated with them.
Here are the errors in the order they appear:
{
"insertId": "insertId-value",
"labels": {
"compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
"k8s-pod/component": "gke-metrics-agent",
"k8s-pod/controller-revision-hash": "controller-revision-hash-value",
"k8s-pod/k8s-app": "gke-metrics-agent",
"k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:27.987750145Z",
"resource": {
"labels": {
"cluster_name": "generic-cluster",
"container_name": "gke-metrics-agent",
"location": "europe-west3",
"namespace_name": "kube-system",
"pod_name": "gke-metrics-agent-generic",
"project_id": "generic-project"
},
"type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:26.486Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1687358126485, \"target_labels\": \"map[__name__:up gke_component_name:nodes/gpu_maintenance_handler instance:127.0.0.1:8526 job:gpu-maintenance-handler]\"}",
"timestamp": "2023-06-21T14:35:26.486549967Z"
}
{
"insertId": "insertId-value",
"labels": {
"compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
"k8s-pod/component": "gke-metrics-agent",
"k8s-pod/controller-revision-hash": "controller-revision-hash-value",
"k8s-pod/k8s-app": "gke-metrics-agent",
"k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:27.987750145Z",
"resource": {
"labels": {
"cluster_name": "generic-cluster",
"container_name": "gke-metrics-agent",
"location": "europe-west3",
"namespace_name": "kube-system",
"pod_name": "gke-metrics-agent-generic",
"project_id": "generic-project"
},
"type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:26.486Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"gpu-maintenance-handler\", \"target\": \"http://127.0.0.1:8526/metrics\", \"err\": \"process_start_time_seconds metric is missing\"}",
"timestamp": "2023-06-21T14:35:26.486608287Z"
}
{
"insertId": "insertId-value",
"labels": {
"compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
"k8s-pod/component": "gke-metrics-agent",
"k8s-pod/controller-revision-hash": "controller-revision-hash-value",
"k8s-pod/k8s-app": "gke-metrics-agent",
"k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:34.993277578Z",
"resource": {
"labels": {
"cluster_name": "generic-cluster",
"container_name": "gke-metrics-agent",
"location": "europe-west3",
"namespace_name": "kube-system",
"pod_name": "gke-metrics-agent-generic",
"project_id": "generic-project"
},
"type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:30.523Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus/nostarttime\", \"scrape_timestamp\": 1687358130522, \"target_labels\": \"map[__name__:up instance:127.0.0.1:10231 job:netd]\"}",
"timestamp": "2023-06-21T14:35:30.523522336Z"
}
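To make entries like the ones above easier to compare, here is a minimal Python sketch that splits a `textPayload` into its tab-separated parts and decodes the trailing JSON object. The function name `parse_text_payload` is hypothetical and not part of any GKE or Cloud Logging tooling, and the five-field layout (timestamp, level, caller, message, JSON fields) is an assumption based only on the three payloads shown here.

```python
import json


def parse_text_payload(payload: str) -> dict:
    """Split a gke-metrics-agent textPayload into its parts.

    Assumes the layout seen in the entries above:
    timestamp <TAB> level <TAB> caller <TAB> message <TAB> JSON object.
    """
    timestamp, level, caller, message, fields = payload.split("\t", 4)
    return {
        "timestamp": timestamp,
        "level": level,
        "caller": caller,
        "message": message,
        "fields": json.loads(fields),
    }


# The second payload from the log entries above, after JSON unescaping.
sample = (
    "2023-06-21T14:35:26.486Z\terror\tscrape/scrape.go:1202\t"
    "Scrape commit failed\t"
    '{"kind": "receiver", "name": "prometheus", '
    '"scrape_pool": "gpu-maintenance-handler", '
    '"target": "http://127.0.0.1:8526/metrics", '
    '"err": "process_start_time_seconds metric is missing"}'
)

parsed = parse_text_payload(sample)
print(parsed["level"], parsed["fields"]["scrape_pool"], parsed["fields"]["err"])
```

Parsed this way, two of the three payloads carry the level `warn` even though the entries are recorded with severity ERROR, and only the second one includes an explicit `err` field.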
So far, I've considered adding an exclusion filter for these errors, but I wanted to understand them better before doing so. Has anyone else experienced similar issues after an upgrade? Does anyone have insights about what might be causing these errors, or any potential impacts on my cluster that I should be aware of?
Any help or advice would be greatly appreciated.
Thanks!

