I'm at the end of my patience with a Prometheus setup that uses kube-prometheus-stack 44.3.0 (the latest being 45).
I have two environments, staging and prod. In staging, Prometheus runs smoothly. In prod, it has started crashing with OOMKilled errors roughly every 4 minutes.
Things I already tried:
- Increased the scrape interval from 30s to 300s
- Identified heavy metrics and dropped them before ingestion [More on that later]
- Enabled the admin API (web.enable-admin-api) to query the TSDB and clean the tombstones
- Deleted the PrometheusRules, having noticed that they tended to shorten the pod's lifetime before the next crash
- Upped the resources (limits and requests) to the maximum available considering the nodes I'm using (memory limit currently at 6Gi; staging works with under 1Gi memory)
- Reduced the number of targets to scrape (taking down e.g. etcd metrics)
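As a rough sketch, several of the steps above correspond to kube-prometheus-stack Helm values like the following. This assumes the changes are made through the chart's values file rather than by editing resources directly. The memory request is left out because only the limit is stated here, and `defaultRules.create: false` is only an approximation of deleting the PrometheusRules by hand.

```yaml
# Hypothetical values.yaml excerpt for kube-prometheus-stack (a sketch, not the actual file)
prometheus:
  prometheusSpec:
    scrapeInterval: "300s"   # raised from 30s
    enableAdminAPI: true     # exposes the TSDB admin endpoints (web.enable-admin-api)
    resources:
      limits:
        memory: 6Gi          # current limit in prod; requests omitted here
kubeEtcd:
  enabled: false             # stop scraping etcd metrics
defaultRules:
  create: false              # assumption: stands in for deleting the PrometheusRules
```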
Comparing TSDB status across staging and prod
While prod is up, its TSDB status doesn't show higher numbers than staging, until it crashes.
By looking at the TSDB statistics, I noticed that kube_replicaset_* metrics used to flood Prometheus. Another component in the cluster had created a large number of ReplicaSets due to a bug, which inflated the number of series. I now drop those metrics completely at ingestion:
...
metricRelabelings:
  - regex: '(kube_replicaset_status_observed_generation|kube_replicaset_status_replicas|kube_replicaset_labels|kube_replicaset_created|kube_replicaset_annotations|kube_replicaset_status_ready_replicas|kube_replicaset_spec_replicas|kube_replicaset_owner|kube_replicaset_status_fully_labeled_replicas|kube_replicaset_metadata_generation)'
    action: drop
    sourceLabels: [__name__]
I verified that those ReplicaSet metrics are no longer present in the prod Prometheus.
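A shorter variant of the same drop rule, as a sketch: relabeling regexes in Prometheus are fully anchored, so a single prefix pattern matches every metric name starting with `kube_replicaset_`. Note that this is broader than the explicit list, since it would also drop any `kube_replicaset_*` metric not named there.

```yaml
metricRelabelings:
  # Drop every series whose metric name starts with kube_replicaset_
  - sourceLabels: [__name__]
    regex: 'kube_replicaset_.*'
    action: drop
```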
TL;DR:
Prometheus in my Kubernetes environment is continuously OOMKilled, making the tool nigh impossible to use. I need insight into how to find and isolate the cause of the issue.
Right now the only plausible culprit still seems to be kube-state-metrics (to do: I still need to disable it to verify this).
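A minimal sketch of that test, assuming kube-state-metrics is deployed as the subchart bundled with kube-prometheus-stack and toggled through the values file:

```yaml
# Hypothetical values.yaml excerpt: temporarily remove kube-state-metrics
kubeStateMetrics:
  enabled: false   # disables the bundled kube-state-metrics subchart and its scrape target
```

If memory usage stays flat with this in place, that would point at kube-state-metrics series as the source of the growth; if Prometheus still gets OOMKilled, the cause lies elsewhere.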
Related questions I've already looked at: