
metrics-server fails to start - metric-storage not ready?

I'm having a problem where metrics-server is failing to start. The initial deployment was in 2021 (version v0.6.1), and it worked for a couple of years. After a crash recovery, metrics-server fails to start, with a TLS error in the logs. I've tried redeploying both the current version (v0.6.3) and an older one (v0.6.1), and I get the same issue.

Pod status

kube-state-metrics-898575cdb-rwrsq         1/1     Running   0          23d   node3   <none>           <none>
metrics-server-68c5fc6c44-676zj            0/1     Running   0          7m37s    node2   <none>           <none>

I think the problem is with the metric storage. After going through everything below, I found this when I probed the readyz endpoint:

[-]metric-storage-ready failed: reason withheld

Looking at the logs gives a TLS error, but I think that's a symptom, not the cause:

$ kubectl logs metrics-server-68c5fc6c44-676zj -nkube-system
Error from server: Get "": remote error: tls: internal error

Searching, I found "Kubernetes metrics-server having SSL trouble" from several years ago. I verified that we were, and still are, using the --kubelet-insecure-tls flag.
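For reference, one way to script that flag check is sketched below. It assumes the flag is passed in the args of the first container in the pod template, as in the upstream manifest; has_insecure_tls is a hypothetical helper name, not part of kubectl or metrics-server.

```shell
# Hypothetical helper: succeed if the container args read on stdin
# contain the --kubelet-insecure-tls flag.
has_insecure_tls() {
  # -F treats the pattern as a fixed string; -e lets it start with a dash.
  grep -qF -e '--kubelet-insecure-tls'
}

# Usage against a live cluster (assumes the flag lives in containers[0].args):
#   kubectl get deployment metrics-server -n kube-system \
#     -o jsonpath='{.spec.template.spec.containers[0].args}' | has_insecure_tls \
#     && echo "flag present" || echo "flag missing"
```

If the manifest passes flags through command instead of args, the jsonpath expression would need to change accordingly.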

Deployment description

$ kubectl describe deployment metrics-server -nkube-system
Name:                   metrics-server
Namespace:              kube-system
CreationTimestamp:      Tue, 28 Mar 2023 11:37:53 -0400
Labels:                 k8s-app=metrics-server
Annotations:   2
Selector:               k8s-app=metrics-server
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 25% max surge
Pod Template:
  Labels:           k8s-app=metrics-server
  Service Account:  metrics-server
    Port:       4443/TCP
    Host Port:  0/TCP
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
      /tmp from tmp-dir (rw)
    Type:               EmptyDir (a temporary directory that shares a pod's lifetime)
    SizeLimit:          <unset>
  Priority Class Name:  system-cluster-critical
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   metrics-server-68c5fc6c44 (1/1 replicas created)
  Type    Reason             Age                From                   Message
  ----    ------             ----               ----                   -------
  Normal  ScalingReplicaSet  34m                deployment-controller  Scaled up replica set metrics-server-6594d67d48 to 1
  Normal  ScalingReplicaSet  13m                deployment-controller  Scaled down replica set metrics-server-6594d67d48 to 0
  Normal  ScalingReplicaSet  13m                deployment-controller  Scaled down replica set metrics-server-68c5fc6c44 to 0
  Normal  ScalingReplicaSet  13m (x2 over 15m)  deployment-controller  Scaled up replica set metrics-server-68c5fc6c44 to 1

Further searching led me to "Metrics-server is in CrashLoopBackOff with NEW install by rke".

That post checks the responses of the livez and readyz endpoints.

Here is what I get:

$ time curl -k
real    0m0.019s
user    0m0.000s
sys     0m0.010s

$ time curl -k
[+]ping ok
[+]log ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]informer-sync ok
[+]poststarthook/max-in-flight-filter ok
[-]metric-storage-ready failed: reason withheld
[+]metadata-informer-sync ok
[+]shutdown ok
readyz check failed

real    0m0.013s
user    0m0.009s
sys     0m0.000s
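To pull just the failing checks out of that output, a small filter like the following can help. This is a minimal sketch: failed_checks is a hypothetical helper, and READYZ_URL is a placeholder for whatever readyz URL was probed above.

```shell
# Hypothetical helper: print the names of failing checks from the
# /readyz (or /livez) output read on stdin.
failed_checks() {
  # Failing checks are the lines that start with "[-]"; the check name
  # runs from there up to the first space.
  sed -n 's/^\[-\]\([^ ]*\).*/\1/p'
}

# Usage (READYZ_URL is a placeholder, not a value from this post):
#   curl -ks "$READYZ_URL" | failed_checks
```

Run against the output above, the only name it prints is metric-storage-ready; every other check is marked [+].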

Now the question is about this line:

[-]metric-storage-ready failed: reason withheld

What is that, and is that why it's failing to deploy?

