Score:1

GKE metrics agent logging many errors

cn flag

We have created a GKE cluster and we are getting errors from gke-metrics-agent. The errors show up roughly every 30 minutes, and it is always the same 62 errors.

All the errors have the label k8s-pod/k8s-app: "gke-metrics-agent".

The first error is:

error   exporterhelper/queued_retry.go:245  Exporting failed. Try enabling retry_on_failure config option.  {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."  

This error is followed by these errors, in order:

  • "go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send"
  • "/go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:245"
  • go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
  • /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/metrics.go:120

There are roughly 40 errors like this. Two errors that stand out are:

- error exporterhelper/queued_retry.go:175  Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.  {"kind": "exporter", "name": "googlecloud", "dropped_items": 19}

- warn  batchprocessor/batch_processor.go:184   Sender failed   {"kind": "processor", "name": "batch", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}

I tried searching for those errors on Google but could not find anything. I can't even find any documentation for gke-metrics-agent.

Things I tried:

  • check quotas
  • update GKE to newer version (current version is 1.21.3-gke.2001)
  • update nodes
  • disable all firewall rules
  • give all permissions to k8s nodes

I can provide more information about our Kubernetes cluster, but I don't know what information might be relevant to solving this issue.

Srividya avatar
cn flag
**“Deadline exceeded”** is a [known issue](https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/releases/tag/v0.13.6). Starting from Kubernetes 1.16, metrics are sent to Cloud Monitoring via the GKE metrics agent, which is built on top of [OpenTelemetry](https://opentelemetry.io/). Can you share which OpenCensus exporter version you are using, try updating to the exporter version that increases the timeout, and let me know whether that helps?
Melchy avatar
cn flag
Thanks for the response. It seems that I don't know how to update the OpenCensus exporter. I found the gke-metrics-agent pod in Kubernetes and tried to change the annotation components.gke.io/component-version: 0.6.0 to 0.13.6. This restarted the pods but the error is still present. I also tried to switch monitoring to OpenTelemetry but I don't know how. Is it possible to set this using Terraform? I found only the monitoring_service setting, which is set to monitoring.googleapis.com/kubernetes by default.
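For context, the monitoring-related part of our Terraform looks roughly like this; the cluster name, location and node count are placeholders, and as far as I can tell this setting only selects the GKE monitoring integration rather than exposing the agent's exporter version:

```hcl
# Sketch of our google_container_cluster resource (placeholder names).
resource "google_container_cluster" "primary" {
  name               = "my-cluster"   # placeholder
  location           = "europe-west1" # placeholder
  initial_node_count = 1

  # Default value; selects Cloud Monitoring for GKE (which deploys the
  # managed gke-metrics-agent). There is no knob here for the agent's
  # OpenCensus/OpenTelemetry exporter version.
  monitoring_service = "monitoring.googleapis.com/kubernetes"
}
```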
Srividya avatar
cn flag
Can you check this link for the [OpenCensus](https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/releases/tag/v0.13.6) exporter update, and this one for [OpenTelemetry](https://github.com/GoogleCloudPlatform/opentelemetry-operations-java) operations on Google Cloud?
Maciek Leks avatar
kw flag
How did it end? I observe the same behaviour with 1.20.10-gke.301.
Melchy avatar
cn flag
I still have no idea what to do. I checked the OpenCensus link and I can see that there is a new version, but I still don't know how to update it. Maybe I should delete the default exporter and create a custom exporter with the new version?
Score:1
cn flag

“Deadline exceeded” is a known issue. Metrics are sent to Cloud Monitoring via the GKE metrics agent, which is built on top of OpenTelemetry. There are currently two workarounds to resolve the issue:

1. Update the timeout. The new release includes a change that increases the default timeout from 5 to 12 seconds, so you may need to rebuild and redeploy the workload with the new version to fix this rpc error (a sketch follows this list).

2. Use a newer GKE version. This issue is fixed in the gke-metrics-agent shipped with GKE 1.18.6-gke.6400+, 1.19.3-gke.600+ and 1.20.0-gke.600+.
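The first workaround only applies to workloads that bundle the OpenCensus Stackdriver exporter themselves; the managed gke-metrics-agent cannot be rebuilt by you, which is what workaround 2 covers. A minimal sketch of setting the timeout explicitly, with a placeholder project ID:

```go
package main

import (
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

func main() {
	// Create the Stackdriver (Cloud Monitoring) exporter with an explicit
	// per-call timeout; v0.13.6 raised the default from 5s to 12s.
	exporter, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID: "my-project",     // placeholder: your GCP project ID
		Timeout:   30 * time.Second, // explicit timeout instead of relying on the default
	})
	if err != nil {
		log.Fatalf("failed to create Stackdriver exporter: %v", err)
	}
	defer exporter.Flush()

	// Register the exporter so OpenCensus views are pushed to Cloud Monitoring.
	view.RegisterExporter(exporter)
}
```

If simply updating the library already gives you the larger default timeout, setting Timeout explicitly is optional.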

Chandra Kiran Pasumarti avatar
fr flag
@Melchy, if you think the above answer helped you, please consider accepting it (✔️).
Score:0
cn flag

If you are still seeing those errors, please have a look at your metrics, mainly the kubernetes.io/container/... metrics for containers running on the same node as the gke-metrics-agent instance logging the errors. Do you see gaps in the metrics that shouldn't be there?

The deadline exceeded errors can happen once in a while, but should not occur in huge amounts. It may be networking issues or just occasional blips. Do you have any network policies or firewall rules that may prevent gke-metrics-agent from talking to Cloud Monitoring?

Sadly, you can't update the OpenTelemetry components inside gke-metrics-agent yourself. A newer cluster version can help because it updates the agent, so try upgrading your cluster if possible. If the issue impacts your metrics, reach out to support.

Melchy avatar
cn flag
Hi, thanks for the response. I don't see the errors anymore. After updating the Kubernetes cluster and waiting for about a week, the errors suddenly disappeared. I have no idea why.
kwiesmueller avatar
cn flag
Then you might have received a new version of gke-metrics-agent with a fix.