metrics-server fails to start - metric-storage not ready?

bz flag

I'm having a problem where metrics-server fails to start. The initial deployment was in 2021 (version v0.6.1), and it worked for a couple of years. After a crash recovery, metrics-server fails to start, and trying to read its logs produces a TLS error. I've tried redeploying both the current version (v0.6.3) and the older one (v0.6.1), and I get the same issue either way.

Deployment status

kube-state-metrics-898575cdb-rwrsq         1/1     Running   0          23d     10.233.92.183   node3   <none>           <none>
metrics-server-68c5fc6c44-676zj            0/1     Running   0          7m37s   10.233.96.18    node2   <none>           <none>

I think the problem is with the metric storage. After going through everything below, I found this when I probed the readyz endpoint:

[-]metric-storage-ready failed: reason withheld

Trying to look at the logs gives a TLS error, but I think that's a symptom, not the cause:

$ kubectl logs metrics-server-68c5fc6c44-676zj -nkube-system
Error from server: Get "https://10.0.92.31:10250/containerLogs/kube-system/metrics-server-68c5fc6c44-676zj/metrics-server": remote error: tls: internal error

Searching, I found the question "Kubernetes metrics-server having SSL trouble" from several years ago. I verified that we were, and still are, using the --kubelet-insecure-tls flag.

Deployment description

$ kubectl describe deployment metrics-server -nkube-system
Name:                   metrics-server
Namespace:              kube-system
CreationTimestamp:      Tue, 28 Mar 2023 11:37:53 -0400
Labels:                 k8s-app=metrics-server
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               k8s-app=metrics-server
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 25% max surge
Pod Template:
  Labels:           k8s-app=metrics-server
  Service Account:  metrics-server
  Containers:
   metrics-server:
    Image:      k8s.gcr.io/metrics-server/metrics-server:v0.6.1
    Port:       4443/TCP
    Host Port:  0/TCP
    Args:
      --cert-dir=/tmp
      --secure-port=4443
      --kubelet-preferred-address-types=InternalIP
      --kubelet-use-node-status-port
      --metric-resolution=15s
      --kubelet-insecure-tls
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /tmp from tmp-dir (rw)
  Volumes:
   tmp-dir:
    Type:               EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:             
    SizeLimit:          <unset>
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   metrics-server-68c5fc6c44 (1/1 replicas created)
Events:
  Type    Reason             Age                From                   Message
  ----    ------             ----               ----                   -------
  Normal  ScalingReplicaSet  34m                deployment-controller  Scaled up replica set metrics-server-6594d67d48 to 1
  Normal  ScalingReplicaSet  13m                deployment-controller  Scaled down replica set metrics-server-6594d67d48 to 0
  Normal  ScalingReplicaSet  13m                deployment-controller  Scaled down replica set metrics-server-68c5fc6c44 to 0
  Normal  ScalingReplicaSet  13m (x2 over 15m)  deployment-controller  Scaled up replica set metrics-server-68c5fc6c44 to 1

Further searching led me to "Metrics-server is in CrashLoopBackOff with NEW install by rke".

That question suggests checking the responses of the livez and readyz endpoints.

Here is what I get:

$ time curl -k https://10.233.96.18:4443/livez
ok
real    0m0.019s
user    0m0.000s
sys     0m0.010s

$ time curl -k https://10.233.96.18:4443/readyz
[+]ping ok
[+]log ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]informer-sync ok
[+]poststarthook/max-in-flight-filter ok
[-]metric-storage-ready failed: reason withheld
[+]metadata-informer-sync ok
[+]shutdown ok
readyz check failed

real    0m0.013s
user    0m0.009s
sys     0m0.000s
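
To pick out only the failing checks from a readyz response like the one above, a small filter along these lines could help. This is just a sketch: `failed_checks` is a name made up here, not part of metrics-server or kubectl, and it assumes failing checks are always printed as `[-]<name> failed: ...`, as in the output above.

```shell
# Hypothetical helper: read readyz output on stdin and print the name of
# every check reported as failing (lines of the form "[-]<name> failed: ...").
failed_checks() {
  sed -n 's/^\[-\]\([^ ]*\) failed.*/\1/p'
}
```

It would be used as `curl -sk https://10.233.96.18:4443/readyz | failed_checks`, which for the response above should print only `metric-storage-ready`.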

Now the question: what does "[-]metric-storage-ready failed: reason withheld" mean, and is that why it's failing to deploy?
