Persistent GKE Metrics Agent Errors Following Manual Upgrade to 1.26.5-gke.1200

Hello,

I recently upgraded both the cluster and the node pool manually on my GKE instance. Since the upgrade, the gke-metrics-agent has been consistently logging errors when attempting to scrape the gpu-maintenance-handler. This is particularly strange since none of my workloads use GPUs.

Previous Version: 1.25.8-gke.1000
Current Version: 1.26.5-gke.1200

These errors did not appear in the log history prior to the upgrade, and I have not identified any performance issues or disruptions associated with them.

Here are the errors in the order they appear:
{
"insertId": "insertId-value",
"labels": {
  "compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
  "k8s-pod/component": "gke-metrics-agent",
  "k8s-pod/controller-revision-hash": "controller-revision-hash-value",
  "k8s-pod/k8s-app": "gke-metrics-agent",
  "k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:27.987750145Z",
"resource": {
  "labels": {
    "cluster_name": "generic-cluster",
    "container_name": "gke-metrics-agent",
    "location": "europe-west3",
    "namespace_name": "kube-system",
    "pod_name": "gke-metrics-agent-generic",
    "project_id": "generic-project"
  },
  "type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:26.486Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1687358126485, \"target_labels\": \"map[__name__:up gke_component_name:nodes/gpu_maintenance_handler instance:127.0.0.1:8526 job:gpu-maintenance-handler]\"}",
"timestamp": "2023-06-21T14:35:26.486549967Z"
}
{
"insertId": "insertId-value",
"labels": {
  "compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
  "k8s-pod/component": "gke-metrics-agent",
  "k8s-pod/controller-revision-hash": "controller-revision-hash-value",
  "k8s-pod/k8s-app": "gke-metrics-agent",
  "k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:27.987750145Z",
"resource": {
  "labels": {
    "cluster_name": "generic-cluster",
    "container_name": "gke-metrics-agent",
    "location": "europe-west3",
    "namespace_name": "kube-system",
    "pod_name": "gke-metrics-agent-generic",
    "project_id": "generic-project"
  },
  "type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:26.486Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"gpu-maintenance-handler\", \"target\": \"http://127.0.0.1:8526/metrics\", \"err\": \"process_start_time_seconds metric is missing\"}",
"timestamp": "2023-06-21T14:35:26.486608287Z"
}
{
"insertId": "insertId-value",
"labels": {
  "compute.googleapis.com/resource_name": "gke-generic-cluster-generic-node-pool-hash",
  "k8s-pod/component": "gke-metrics-agent",
  "k8s-pod/controller-revision-hash": "controller-revision-hash-value",
  "k8s-pod/k8s-app": "gke-metrics-agent",
  "k8s-pod/pod-template-generation": "pod-template-generation-value"
},
"logName": "projects/generic-project/logs/stderr",
"receiveTimestamp": "2023-06-21T14:35:34.993277578Z",
"resource": {
  "labels": {
    "cluster_name": "generic-cluster",
    "container_name": "gke-metrics-agent",
    "location": "europe-west3",
    "namespace_name": "kube-system",
    "pod_name": "gke-metrics-agent-generic",
    "project_id": "generic-project"
  },
  "type": "k8s_container"
},
"severity": "ERROR",
"textPayload": "2023-06-21T14:35:30.523Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus/nostarttime\", \"scrape_timestamp\": 1687358130522, \"target_labels\": \"map[__name__:up instance:127.0.0.1:10231 job:netd]\"}",
"timestamp": "2023-06-21T14:35:30.523522336Z"
}

So far, I’ve considered adding an exclusion filter for these errors, but I wanted to understand them better before doing so. Has anyone else experienced similar issues after an upgrade? Does anyone have insights about what might be causing these errors, or any potential impacts on my cluster that I should be aware of?

Any help or advice would be greatly appreciated.

Thanks!

I’d go ahead and add the exclusion filters for now, since these errors won’t cause any issues. This issue has been reported internally and the team is currently figuring out the appropriate fix.

Hi,

Do you have a reference for tracking that issue? Otherwise I guess I will have to open another internal case.

BR Thomas

We are using 1.28.3-gke.1203001 and get this error.

Any update?

Is this a new cluster or an existing cluster which was upgraded?
Also, can you post the log message(s) you are seeing?

The cluster was updated from 1.27 to 1.28 last month, but I only noticed the errors this week, so I am not sure whether they were already there before:

{
  "textPayload": "2024-01-11T09:20:01.616Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1704964801615, \"target_labels\": \"map[__name__:up gke_component_name:addons/gke_metadata_server instance:127.0.0.1:989 job:addons]\"}",
  "timestamp": "2024-01-11T09:20:01.617190050Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/k8s-app": "gke-metrics-agent",
    "compute.googleapis.com/resource_name": "gke-default-pool-40924eba-02kp",
    "k8s-pod/pod-template-generation": "3",
    "k8s-pod/controller-revision-hash": "76c6ff9889"
  },
  "receiveTimestamp": "2024-01-11T09:20:01.693056134Z"
}
{
  "textPayload": "2024-01-11T09:21:01.617Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"addons\", \"target\": \"http://127.0.0.1:989/metricz\", \"err\": \"process_start_time_seconds metric is missing\"}",
  "insertId": "a43lb9nc1ty7lxac",
  "timestamp": "2024-01-11T09:21:01.617966077Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-default-pool-40924eba-02kp",
    "k8s-pod/pod-template-generation": "3",
    "k8s-pod/controller-revision-hash": "76c6ff9889",
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/k8s-app": "gke-metrics-agent"
  },
  "receiveTimestamp": "2024-01-11T09:21:01.690544703Z"
}

Do you suggest recreating the cluster from scratch?

You should not need to (re)create it from scratch. Just trying to see where the issue might be.

Seeing the same on 1.26.6-gke.1700. Is there a version with a fix for this?

We just updated to 1.28.3-gke.1286000 and still see this error.

Hello, I’m getting the same errors. Is there an active issue tracking this? I would like to follow along.

To give more context, the errors started when I self-deployed a Prometheus operator. But now, even after I removed all Prometheus-related deployments, the errors persist. Every few seconds, the gke-metrics-agent logs two error messages:

{
  "textPayload": "2024-02-05T22:38:10.056Z\twarn\tinternal/metricsbuilder.go:124\tFailed to scrape Prometheus endpoint\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_timestamp\": 1707172690055, \"target_labels\": \"map[__name__:up gke_component_name:addons/gke_metadata_server instance:127.0.0.1:989 job:addons]\"}",
  "insertId": "9punzz5v2b34nf25",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "c7e-prod",
      "container_name": "gke-metrics-agent",
      "location": "us-east4-a",
      "pod_name": "gke-metrics-agent-bmss7",
      "namespace_name": "kube-system",
      "cluster_name": "us-east4-a"
    }
  },
  "timestamp": "2024-02-05T22:38:10.057060435Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-us-east4-a-nap-e2-highmem-4-4nbvk-ea1fc62c-plqs",
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/pod-template-generation": "21",
    "k8s-pod/k8s-app": "gke-metrics-agent",
    "k8s-pod/controller-revision-hash": "6bccd476f"
  },
  "logName": "projects/c7e-prod/logs/stderr",
  "receiveTimestamp": "2024-02-05T22:38:12.868037001Z"
}
{
  "textPayload": "2024-02-05T22:38:10.056Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"addons\", \"target\": \"http://127.0.0.1:989/metricz\", \"err\": \"process_start_time_seconds metric is missing\"}",
  "insertId": "ce9jqt5ufoogep5p",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-east4-a",
      "namespace_name": "kube-system",
      "project_id": "c7e-prod",
      "pod_name": "gke-metrics-agent-bmss7",
      "cluster_name": "us-east4-a",
      "container_name": "gke-metrics-agent"
    }
  },
  "timestamp": "2024-02-05T22:38:10.057149815Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/controller-revision-hash": "6bccd476f",
    "k8s-pod/pod-template-generation": "21",
    "compute.googleapis.com/resource_name": "gke-us-east4-a-nap-e2-highmem-4-4nbvk-ea1fc62c-plqs",
    "k8s-pod/component": "gke-metrics-agent",
    "k8s-pod/k8s-app": "gke-metrics-agent"
  },
  "logName": "projects/c7e-prod/logs/stderr",
  "receiveTimestamp": "2024-02-05T22:38:12.868037001Z"
}
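
The endpoint that fails to be scraped is the `target` field inside `textPayload`. A minimal sketch for pulling it out of a payload, using the second entry above as sample input; the helper name `extract_target` is made up for illustration, and the `sed` pattern assumes the field is formatted exactly as in these entries:

```shell
# Sample textPayload rebuilt from the "Scrape commit failed" entry above
# (printf expands \t into the tab separators used by the agent's log line).
payload=$(printf '2024-02-05T22:38:10.056Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t%s' \
  '{"kind": "receiver", "name": "prometheus", "scrape_pool": "addons", "target": "http://127.0.0.1:989/metricz", "err": "process_start_time_seconds metric is missing"}')

# Hypothetical helper: print the value of the "target" field, or nothing
# if the payload has no such field.
extract_target() {
  printf '%s\n' "$1" | sed -n 's/.*"target": "\([^"]*\)".*/\1/p'
}

extract_target "$payload"
```

Run against payloads from several nodes, this should make it easier to see whether all the errors point at the same endpoint or at several different ones.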

I am seeing the same errors flooding the log with 1.27.7-gke.1121002 on Autopilot. Has anybody managed to figure out what is going on and how to address it?

Did you manage to get rid of them?

Hi all,
[Requesting the latest update on a fix for the Prometheus-related gke-metrics-agent errors]
We are using GKE version 1.26.10-gke.1101000 on the Stable release channel.
We frequently get the errors below from gke-metrics-agent, but we couldn’t find the cause.
Is there any way to suppress these error logs, or is there a fix or workaround available?

error scrape/scrape.go:1202 Scrape commit failed {"kind": "receiver", "name": "prometheus", "scrape_pool": "gpu-maintenance-handler", "target": "http://127.0.0.1:8526/metrics", "err": "process_start_time_seconds metric is missing"}

warn internal/metricsbuilder.go:124 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1712227367385, "target_labels": "map[__name__:up gke_component_name:nodes/gpu_maintenance_handler instance:127.0.0.1:8526 job:gpu-maintenance-handler]"}

Thanks in Advance.

I’d go ahead and add the exclusion filters

Here is an example that excludes gke-metadata-server INFO logs:

gcloud logging sinks update _Default --add-exclusion=name=exclude-unimportant-gke-metadata-server-logs,filter=' resource.type = "k8s_container" resource.labels.namespace_name = "kube-system" resource.labels.pod_name =~ "gke-metadata-server-.*" resource.labels.container_name = "gke-metadata-server" severity <= "INFO" '

You can modify the filter above to exclude the spammy log entries whose payload matches “gpu-maintenance-handler” and whose container name is gke-metrics-agent.
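
A rough sketch of that modification, assuming the error text sits in `textPayload` as it does in the log entries posted in this thread. The helper `build_scrape_filter` and the exclusion name `exclude-gke-metrics-agent-scrape-errors` are made up for illustration. The `gcloud` call is left commented out so the filter can be checked first:

```shell
# Hypothetical helper: build a Cloud Logging filter matching gke-metrics-agent
# entries in kube-system whose textPayload contains the given string.
build_scrape_filter() {
  printf '%s' \
    'resource.type="k8s_container"' \
    ' AND resource.labels.namespace_name="kube-system"' \
    ' AND resource.labels.container_name="gke-metrics-agent"' \
    " AND textPayload:\"$1\""
}

FILTER="$(build_scrape_filter "gpu-maintenance-handler")"
echo "$FILTER"

# Once the filter looks right, add it to the _Default sink as an exclusion
# (same flag as in the gke-metadata-server example above):
# gcloud logging sinks update _Default \
#   --add-exclusion="name=exclude-gke-metrics-agent-scrape-errors,filter=${FILTER}"
```

Passing a different string, for example `gke_metadata_server`, should cover the other scrape target reported in this thread. Keep in mind that excluded entries are no longer stored by that sink, so it is worth keeping the match as narrow as possible.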

Could you paste one log entry showing which endpoint failed to be scraped?

Sure! Would this be sufficient?

Add exclusion filters for now, as these errors won’t cause any issues. This issue (spam logs around gke_metadata_server) has been reported internally, and the team is currently working out and rolling out the appropriate fix.

https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/Persistent-GKE-Metrics-Agent-Errors-Following-Manual-Upgrade-to/m-p/737177/highlight/true#M1769

We are seeing the same errors plus additional ones that seem related. This is on a new Autopilot cluster:

{
"insertId": "61ztn0mr4hn580go",
"jsonPayload": {
"stacktrace": "google3/cloud/kubernetes/metrics/components/collector/collector.runScrapeLoop\n\tcloud/kubernetes/metrics/components/collector/collector.go:86\ngoogle3/cloud/kubernetes/metrics/components/collector/collector.Run\n\tcloud/kubernetes/metrics/components/collector/collector.go:62\nmain.main\n\tcloud/kubernetes/metrics/components/collector/main.go:40\nruntime.main\n\tthird_party/go/gc/src/runtime/proc.go:267",
"caller": "collector/collector.go:86",
"error": "failed to process 70 (out of 1313) input lines",
"msg": "Failed to process metrics",
"scrape_target": "http://localhost:9990/metrics",
"level": "error",
"ts": 1713527688.3943982
},
"resource": {
"type": "k8s_container",
"labels": {
"project_id": "*****",
"cluster_name": "*****",
"location": "us-east1",
"namespace_name": "kube-system",
"pod_name": "anetd-g7f9p",
"container_name": "cilium-agent-metrics-collector"
}
},
"timestamp": "2024-04-19T11:54:48.394748813Z",
"severity": "ERROR",
"labels": {
"k8s-pod/controller-revision-hash": "56b47ff86",
"k8s-pod/k8s-app": "cilium",
"k8s-pod/pod-template-generation": "1",
"compute.googleapis.com/resource_name": "gk3-*****-clust-pool-2-3e896fe4-v7gq"
},
"logName": "projects/*****/logs/stderr",
"receiveTimestamp": "2024-04-19T11:54:50.753109177Z"
}

Thanks for reporting this! What’s your GKE cluster version?

I believe I was on 1.26, but after upgrading to 1.29 this morning most of the errors have gone away. After the upgrade, I went from a few hundred thousand of these errors to a few thousand. The most frequent errors now are:

{
"textPayload": "2024-04-19T13:59:53.651Z\terror\tscrape/scrape.go:1202\tScrape commit failed\t{\"kind\": \"receiver\", \"name\": \"prometheus\", \"scrape_pool\": \"addons\", \"target\": \"http://10.142.0.52:9965/metrics\", \"err\": \"process_start_time_seconds metric is missing\"}",
"insertId": "b2i0irp6br8bsy47",
"resource": {
"type": "k8s_container",
"labels": {
"container_name": "gke-metrics-agent",
"cluster_name": "*****",
"pod_name": "gke-metrics-agent-zw9mn",
"project_id": "*****",
"namespace_name": "kube-system",
"location": "us-east1"
}
},
"timestamp": "2024-04-19T13:59:53.653623276Z",
"severity": "ERROR",
"labels": {
"k8s-pod/pod-template-generation": "2",
"k8s-pod/k8s-app": "gke-metrics-agent",
"k8s-pod/controller-revision-hash": "77f87b67bb",
"compute.googleapis.com/resource_name": "gk3-*****-clust-pool-2-3e896fe4-v7gq",
"k8s-pod/component": "gke-metrics-agent"
},
"logName": "projects/*****/logs/stderr",
"receiveTimestamp": "2024-04-19T13:59:57.734292060Z"
}

And

{
"insertId": "9krsksykukai5amv",
"jsonPayload": {
"error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized metric labels: [scanning_mode]",
"level": "error",
"msg": "Failed to export metrics to Cloud Monitoring",
"stacktrace": "google3/cloud/kubernetes/metrics/common/gcm/gcm.(*exporter).exportBuffer\n\tcloud/kubernetes/metrics/common/gcm/export.go:434\ngoogle3/cloud/kubernetes/metrics/common/gcm/gcm.(*exporter).flush\n\tcloud/kubernetes/metrics/common/gcm/export.go:383\ngoogle3/cloud/kubernetes/metrics/common/gcm/gcm.(*exporter).Flush\n\tcloud/kubernetes/metrics/common/gcm/export.go:369\ngoogle3/cloud/kubernetes/distro/containers/image_package_extractor/pkg/metrics/metrics.ExportPushMetrics\n\tcloud/kubernetes/distro/containers/image_package_extractor/pkg/metrics/metrics.go:193\nmain.main\n\tcloud/kubernetes/distro/containers/image_package_extractor/img_pkg_extractor/main.go:112\nruntime.main\n\tthird_party/go/gc/src/runtime/proc.go:271",
"caller": "gcm/export.go:434",
"ts": 1713535108.3320804
},
"resource": {
"type": "k8s_container",
"labels": {
"cluster_name": "******",
"pod_name": "image-package-extractor-nzrmb",
"location": "us-east1",
"container_name": "image-package-extractor",
"project_id": "******",
"namespace_name": "kube-system"
}
},
"timestamp": "2024-04-19T13:58:28.332394487Z",
"severity": "ERROR",
"labels": {
"k8s-pod/controller-revision-hash": "76b5dd6d95",
"k8s-pod/k8s-app": "image-package-extractor",
"compute.googleapis.com/resource_name": "gk3-******-clust-pool-2-d073bcf5-sk6j",
"k8s-pod/pod-template-generation": "2"
},
"logName": "projects/******/logs/stderr",
"receiveTimestamp": "2024-04-19T13:58:32.985203921Z"
}