Persistent GKE Metrics Agent Errors Following Manual Upgrade to 1.26.5-gke.1200

ethansq · June 22, 2023, 4:57pm

Hello,

I recently upgraded both the cluster and the node pool manually on my GKE instance. Since the upgrade, the gke-metrics-agent has been consistently logging errors when attempting to scrape the gpu-maintenance-handler. This is particularly strange since none of the workloads use GPUs.

Previous Version: 1.25.8-gke.1000
Current Version: 1.26.5-gke.1200

These errors did not appear in the log history prior to the upgrade, and I have not identified any performance issues or disruptions associated with them.

Here are the errors in the order they appear:

{
"insertId": "insertId-value",
"labels": {
  "compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
  "k8s-pod/component": "gke-metrics-agent",
  "k8s-pod/controller-revision-hash": "controller-revision-hash-value",
  "k8s-pod/k8s-app": "gke-metrics-agent",
  "k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:27.987750145Z",
"resource": {
  "labels": {
    "cluster_name": "generic-cluster",
    "container_name": "gke-metrics-agent",
    "location": "europe-west3",
    "namespace_name": "kube-system",
    "pod_name": "gke-metrics-agent-generic",
    "project_id": "generic-project"
  },
  "type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:26.486Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1687358126485, \"target_labels\": \"map[__name__:up gke_component_name:nodes/gpu_maintenance_handler instance:127.0.0.1:8526 job:gpu-maintenance-handler]\"}",
"timestamp": "2023-06-21T14:35:26.486549967Z"
}

{
"insertId": "insertId-value",
"labels": {
  "compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
  "k8s-pod/component": "gke-metrics-agent",
  "k8s-pod/controller-revision-hash": "controller-revision-hash-value",
  "k8s-pod/k8s-app": "gke-metrics-agent",
  "k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:27.987750145Z",
"resource": {
  "labels": {
    "cluster_name": "generic-cluster",
    "container_name": "gke-metrics-agent",
    "location": "europe-west3",
    "namespace_name": "kube-system",
    "pod_name": "gke-metrics-agent-generic",
    "project_id": "generic-project"
  },
  "type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:26.486Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"gpu-maintenance-handler\", \"target\": \"http://127.0.0.1:8526/metrics\", \"err\": \"process_start_time_seconds metric is missing\"}",
"timestamp": "2023-06-21T14:35:26.486608287Z"
}

{
"insertId": "insertId-value",
"labels": {
  "compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
  "k8s-pod/component": "gke-metrics-agent",
  "k8s-pod/controller-revision-hash": "controller-revision-hash-value",
  "k8s-pod/k8s-app": "gke-metrics-agent",
  "k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:34.993277578Z",
"resource": {
  "labels": {
    "cluster_name": "generic-cluster",
    "container_name": "gke-metrics-agent",
    "location": "europe-west3",
    "namespace_name": "kube-system",
    "pod_name": "gke-metrics-agent-generic",
    "project_id": "generic-project"
  },
  "type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:30.523Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus/nostarttime\", \"scrape_timestamp\": 1687358130522, \"target_labels\": \"map[__name__:up instance:127.0.0.1:10231 job:netd]\"}",
"timestamp": "2023-06-21T14:35:30.523522336Z"
}

So far, I’ve considered adding an exclusion filter for these errors, but I wanted to understand them better before doing so. Has anyone else experienced similar issues after an upgrade? Does anyone have insights about what might be causing these errors, or any potential impacts on my cluster that I should be aware of?

Any help or advice would be greatly appreciated.

Thanks!

garisingh · June 23, 2023, 10:12am

I’d go ahead and add the exclusion filters for now as they won’t cause any issues. This issue has been reported internally and the team is currently figuring out the appropriate fix.

Enceladus · October 23, 2023, 9:21am

Hi,

Do you have a reference to track that issue? Otherwise, I guess I will have to open another internal case.

BR Thomas

yellowhat · January 10, 2024, 8:28am

We are using 1.28.3-gke.1203001 and get this error.

Any update?

garisingh · January 11, 2024, 9:16am

Is this a new cluster or an existing cluster which was upgraded?
Also, can you post the log message(s) you are seeing?

yellowhat · January 11, 2024, 9:29am

The cluster was updated from 1.27 to 1.28 last month, but I only noticed the errors this week, so I am not sure whether they were already there:

{
  "textPayload": "2024-01-11T09:20:01.616Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1704964801615, \"target_labels\": \"map[__name__:up gke_component_name:addons/gke_metadata_server instance:127.0.0.1:989 job:addons]\"}",
  "timestamp": "2024-01-11T09:20:01.617190050Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/k8s-app": "gke-metrics-agent",
    "compute.googleapis.com/resource_name": "gke-default-pool-40924eba-02kp",
    "k8s-pod/pod-template-generation": "3",
    "k8s-pod/controller-revision-hash": "76c6ff9889"
  },
  "receiveTimestamp": "2024-01-11T09:20:01.693056134Z"
}

{
  "textPayload": "2024-01-11T09:21:01.617Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"addons\", \"target\": \"http://127.0.0.1:989/metricz\", \"err\": \"process_start_time_seconds metric is missing\"}",
  "insertId": "a43lb9nc1ty7lxac",
  "timestamp": "2024-01-11T09:21:01.617966077Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-default-pool-40924eba-02kp",
    "k8s-pod/pod-template-generation": "3",
    "k8s-pod/controller-revision-hash": "76c6ff9889",
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/k8s-app": "gke-metrics-agent"
  },
  "receiveTimestamp": "2024-01-11T09:21:01.690544703Z"
}

Do you suggest recreating the cluster from scratch?

garisingh · January 11, 2024, 9:49am

You should not need to (re)create it from scratch. Just trying to see where the issue might be.

hollandj44 · January 15, 2024, 3:53pm

Seeing the same on 1.26.6-gke.1700. Is there a version with a fix for this?

yellowhat · January 18, 2024, 7:42am

We just updated to 1.28.3-gke.1286000 and still see this error.

broody · February 5, 2024, 5:53pm

Hello, I’m getting the same errors. Is there an active issue tracking this? I would like to follow along.

To give more context, the errors started when I self-deployed a Prometheus operator. But now, even after removing all Prometheus-related deployments, the errors persist. Every few seconds, the gke-metrics-agent logs two error messages:

{
  "textPayload": "2024-02-05T22:38:10.056Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1707172690055, \"target_labels\": \"map[__name__:up gke_component_name:addons/gke_metadata_server instance:127.0.0.1:989 job:addons]\"}",
  "insertId": "9punzz5v2b34nf25",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "c7e-prod",
      "container_name": "gke-metrics-agent",
      "location": "us-east4-a",
      "pod_name": "gke-metrics-agent-bmss7",
      "namespace_name": "kube-system",
      "cluster_name": "us-east4-a"
    }
  },
  "timestamp": "2024-02-05T22:38:10.057060435Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-us-east4-a-nap-e2-highmem-4-4nbvk-ea1fc62c-plqs",
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/pod-template-generation": "21",
    "k8s-pod/k8s-app": "gke-metrics-agent",
    "k8s-pod/controller-revision-hash": "6bccd476f"
  },
  "logName": "projects/c7e-prod/logs/stderr",
  "receiveTimestamp": "2024-02-05T22:38:12.868037001Z"
}

{
  "textPayload": "2024-02-05T22:38:10.056Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"addons\", \"target\": \"http://127.0.0.1:989/metricz\", \"err\": \"process_start_time_seconds metric is missing\"}",
  "insertId": "ce9jqt5ufoogep5p",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-east4-a",
      "namespace_name": "kube-system",
      "project_id": "c7e-prod",
      "pod_name": "gke-metrics-agent-bmss7",
      "cluster_name": "us-east4-a",
      "container_name": "gke-metrics-agent"
    }
  },
  "timestamp": "2024-02-05T22:38:10.057149815Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/controller-revision-hash": "6bccd476f",
    "k8s-pod/pod-template-generation": "21",
    "compute.googleapis.com/resource_name": "gke-us-east4-a-nap-e2-highmem-4-4nbvk-ea1fc62c-plqs",
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/k8s-app": "gke-metrics-agent"
  },
  "logName": "projects/c7e-prod/logs/stderr",
  "receiveTimestamp": "2024-02-05T22:38:12.868037001Z"
}

IvanUkhov · March 20, 2024, 2:24pm

I am seeing the same errors flooding the log with 1.27.7-gke.1121002 on Autopilot. Has anybody managed to figure out what is going on and how to address it?

usman_58 · March 22, 2024, 12:05pm

Did you manage to get rid of them?

sathiyananthan · April 4, 2024, 11:44am

Hi all,
[Need latest info/update regarding solution for GKE Metrics error - related to prometheus]
We are using GKE version 1.26.10-gke.1101000, with Release Channel as Stable channel.
We are getting the below errors frequently in gke-metrics-agent, but we couldn’t find the cause.
Is there any way to suppress these error logs? Or is there any fix or workaround available?

error scrape/scrape.go:1202 Scrape commit failed {"kind": "receiver", "name": "prometheus", "scrape_pool": "gpu-maintenance-handler", "target": "http://127.0.0.1:8526/metrics", "err": "process_start_time_seconds metric is missing"}

warn internal/metricsbuilder.go:124 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1712227367385, "target_labels": "map[__name__:up gke_component_name:nodes/gpu_maintenance_handler instance:127.0.0.1:8526 job:gpu-maintenance-handler]"}

Thanks in Advance.

CatherineF-dev · April 12, 2024, 1:34pm

I’d go ahead and add the exclusion filters.

An example to exclude gke-metadata-server INFO log is:

gcloud logging sinks update _Default --add-exclusion=name=exclude-unimportant-gke-metadata-server-logs,filter=' resource.type = "k8s_container" resource.labels.namespace_name = "kube-system" resource.labels.pod_name =~ "gke-metadata-server-.*" resource.labels.container_name = "gke-metadata-server" severity <= "INFO" '

You can modify the above filter to exclude the above spammy log where payload matches “gpu-maintenance-handler” and container name is gke-metrics-agent.

https://cloud.google.com/logging/docs/export/configure_export_v2#filter-examples
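As a rough sketch of the modification described above (not an official fix), the filter could be adapted along these lines. The exclusion name and the `apply_exclusion` helper are made up for illustration; the resource labels and the `textPayload` field are taken from the log entries posted in this thread. It assumes the noisy entries always mention "gpu-maintenance-handler" in their text payload, so it is worth previewing the filter in Logs Explorer before excluding anything.

```shell
# Hypothetical exclusion name; pick whatever fits your naming scheme.
EXCLUSION_NAME="exclude-gke-metrics-agent-gpu-handler-errors"

# Filter built from fields visible in the log entries in this thread.
# It matches only gke-metrics-agent entries whose text payload mentions
# the gpu-maintenance-handler scrape target.
FILTER='resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND resource.labels.container_name = "gke-metrics-agent" AND textPayload =~ "gpu-maintenance-handler"'

# Defined but not called here: run it only after checking the filter.
apply_exclusion() {
  gcloud logging sinks update _Default \
    --add-exclusion="name=${EXCLUSION_NAME},filter=${FILTER}"
}

# Print the filter so it can be pasted into Logs Explorer for a preview.
printf '%s\n' "$FILTER"
```

Once the preview shows only the entries you want to drop, calling `apply_exclusion` would add the exclusion to the `_Default` sink. For the other targets reported in this thread (for example the `addons` scrape pool at 127.0.0.1:989), the `textPayload` pattern would need to be changed accordingly.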

CatherineF-dev · April 12, 2024, 1:37pm

Could you paste one log entry showing which endpoint fails to be scraped?

IvanUkhov · April 12, 2024, 1:52pm

Sure! Would this be sufficient?

CatherineF-dev · April 12, 2024, 4:05pm

Add exclusion filters for now as they won’t cause any issues. This issue (spam logs around gke_metadata_server) has been reported internally and the team is currently figuring out & rolling out the appropriate fix.

https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/Persistent-GKE-Metrics-Agent-Errors-Following-Manual-Upgrade-to/m-p/737177/highlight/true#M1769

pml · April 19, 2024, 11:57am

We are seeing the same errors plus additional ones that seem related. This is on a new Autopilot cluster.

{
"insertId": "61ztn0mr4hn580go",
"jsonPayload": {
"stacktrace": "google3/cloud/kubernetes/metrics/components/collector/collector.runScrapeLoop\n\tcloud/kubernetes/metrics/components/collector/collector.go:86\ngoogle3/cloud/kubernetes/metrics/components/collector/collector.Run\n\tcloud/kubernetes/metrics/components/collector/collector.go:62\nmain.main\n\tcloud/kubernetes/metrics/components/collector/main.go:40\nruntime.main\n\tthird_party/go/gc/src/runtime/proc.go:267",
"caller": "collector/collector.go:86",
"error": "failed to process 70 (out of 1313) input lines",
"msg": "Failed to process metrics",
"scrape_target": "http://localhost:9990/metrics",
"level": "error",
"ts": 1713527688.3943982
},
"resource": {
"type": "k8s_container",
"labels": {
"project_id": "*****",
"cluster_name": "*****",
"location": "us-east1",
"namespace_name": "kube-system",
"pod_name": "anetd-g7f9p",
"container_name": "cilium-agent-metrics-collector"
}
},
"timestamp": "2024-04-19T11:54:48.394748813Z",
"severity": "ERROR",
"labels": {
"k8s-pod/controller-revision-hash": "56b47ff86",
"k8s-pod/k8s-app": "cilium",
"k8s-pod/pod-template-generation": "1",
"compute.googleapis.com/resource_name": "gk3-*****-clust-pool-2-3e896fe4-v7gq"
},
"logName": "projects/*****/logs/stderr",
"receiveTimestamp": "2024-04-19T11:54:50.753109177Z"
}

CatherineF-dev · April 19, 2024, 5:03pm

Thanks for reporting this! What’s your GKE cluster version?

pml · April 19, 2024, 5:29pm

I believe I was on 1.26, but after upgrading to 1.29 this morning most of the errors have gone away. After the upgrade, I went from a few hundred thousand of these errors to a few thousand. The most frequent errors now are:

{
"textPayload": "2024-04-19T13:59:53.651Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"addons\", \"target\": \"http://10.142.0.52:9965/metrics\", \"err\": \"process_start_time_seconds metric is missing\"}",
"insertId": "b2i0irp6br8bsy47",
"resource": {
"type": "k8s_container",
"labels": {
"container_name": "gke-metrics-agent",
"cluster_name": "*****",
"pod_name": "gke-metrics-agent-zw9mn",
"project_id": "*****",
"namespace_name": "kube-system",
"location": "us-east1"
}
},
"timestamp": "2024-04-19T13:59:53.653623276Z",
"severity": "ERROR",
"labels": {
"k8s-pod/pod-template-generation": "2",
"k8s-pod/k8s-app": "gke-metrics-agent",
"k8s-pod/controller-revision-hash": "77f87b67bb",
"compute.googleapis.com/resource_name": "gk3-*****-clust-pool-2-3e896fe4-v7gq",
"k8s-pod/component": "gke-metrics-agent"
},
"logName": "projects/*****/logs/stderr",
"receiveTimestamp": "2024-04-19T13:59:57.734292060Z"
}

And

{
"insertId": "9krsksykukai5amv",
"jsonPayload": {
"error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized metric labels: [scanning_mode]",
"level": "error",
"msg": "Failed to export metrics to Cloud Monitoring",
"stacktrace": "google3/cloud/kubernetes/metrics/common/gcm/gcm.(*exporter).exportBuffer\n\tcloud/kubernetes/metrics/common/gcm/export.go:434\ngoogle3/cloud/kubernetes/metrics/common/gcm/gcm.(*exporter).flush\n\tcloud/kubernetes/metrics/common/gcm/export.go:383\ngoogle3/cloud/kubernetes/metrics/common/gcm/gcm.(*exporter).Flush\n\tcloud/kubernetes/metrics/common/gcm/export.go:369\ngoogle3/cloud/kubernetes/distro/containers/image_package_extractor/pkg/metrics/metrics.ExportPushMetrics\n\tcloud/kubernetes/distro/containers/image_package_extractor/pkg/metrics/metrics.go:193\nmain.main\n\tcloud/kubernetes/distro/containers/image_package_extractor/img_pkg_extractor/main.go:112\nruntime.main\n\tthird_party/go/gc/src/runtime/proc.go:271",
"caller": "gcm/export.go:434",
"ts": 1713535108.3320804
},
"resource": {
"type": "k8s_container",
"labels": {
"cluster_name": "******",
"pod_name": "image-package-extractor-nzrmb",
"location": "us-east1",
"container_name": "image-package-extractor",
"project_id": "******",
"namespace_name": "kube-system"
}
},
"timestamp": "2024-04-19T13:58:28.332394487Z",
"severity": "ERROR",
"labels": {
"k8s-pod/controller-revision-hash": "76b5dd6d95",
"k8s-pod/k8s-app": "image-package-extractor",
"compute.googleapis.com/resource_name": "gk3-******-clust-pool-2-d073bcf5-sk6j",
"k8s-pod/pod-template-generation": "2"
},
"logName": "projects/******/logs/stderr",
"receiveTimestamp": "2024-04-19T13:58:32.985203921Z"
}
