I have a problem where metrics-server is failing to start. The initial deployment was in 2021 (v0.6.1) and worked for a couple of years. After a crash recovery, metrics-server fails to start with a TLS error in the logs. I've tried redeploying the current version (v0.6.3) and the older one (v0.6.1), and get the same issue.
Pod status:
kube-state-metrics-898575cdb-rwrsq 1/1 Running 0 23d 10.233.92.183 node3 <none> <none>
metrics-server-68c5fc6c44-676zj 0/1 Running 0 7m37s 10.233.96.18 node2 <none> <none>
I think the problem is with metric storage: after going through everything below, I found this when I probed the readyz endpoint:
[-]metric-storage-ready failed: reason withheld
The logs show a TLS error, but I think that's a symptom, not the cause:
$ kubectl logs metrics-server-68c5fc6c44-676zj -nkube-system
Error from server: Get "https://10.0.92.31:10250/containerLogs/kube-system/metrics-server-68c5fc6c44-676zj/metrics-server": remote error: tls: internal error
Searching, I found "Kubernetes metrics-server having SSL trouble" from several years ago. I verified that we were, and still are, using the --kubelet-insecure-tls flag.
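One thing worth separating out (a sketch of my reasoning, not a fix): --kubelet-insecure-tls only affects metrics-server's own scrapes of the kubelets. The TLS error above comes from the API server dialing the kubelet on node2 to fetch container logs; port 10250 is the kubelet's serving port. Pulling the host:port out of the error message (illustrative parsing only) makes the actual target visible:

```shell
#!/bin/sh
# Error string copied verbatim from `kubectl logs` above.
err='Get "https://10.0.92.31:10250/containerLogs/kube-system/metrics-server-68c5fc6c44-676zj/metrics-server": remote error: tls: internal error'

# Extract the host:port the request was made to. 10250 is the kubelet's
# serving port, i.e. the API server -> kubelet hop failed TLS, not
# metrics-server's own endpoint on 4443.
printf '%s\n' "$err" | sed -n 's|.*https://\([^/]*\)/.*|\1|p'
```

Which is why I read the TLS error as a symptom of kubelet-side cert state rather than a metrics-server misconfiguration.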
Deployment description:
$ kubectl describe deployment metrics-server -nkube-system
Name: metrics-server
Namespace: kube-system
CreationTimestamp: Tue, 28 Mar 2023 11:37:53 -0400
Labels: k8s-app=metrics-server
Annotations: deployment.kubernetes.io/revision: 2
Selector: k8s-app=metrics-server
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 0 max unavailable, 25% max surge
Pod Template:
Labels: k8s-app=metrics-server
Service Account: metrics-server
Containers:
metrics-server:
Image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
Port: 4443/TCP
Host Port: 0/TCP
Args:
--cert-dir=/tmp
--secure-port=4443
--kubelet-preferred-address-types=InternalIP
--kubelet-use-node-status-port
--metric-resolution=15s
--kubelet-insecure-tls
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/tmp from tmp-dir (rw)
Volumes:
tmp-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available False MinimumReplicasUnavailable
OldReplicaSets: <none>
NewReplicaSet: metrics-server-68c5fc6c44 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 34m deployment-controller Scaled up replica set metrics-server-6594d67d48 to 1
Normal ScalingReplicaSet 13m deployment-controller Scaled down replica set metrics-server-6594d67d48 to 0
Normal ScalingReplicaSet 13m deployment-controller Scaled down replica set metrics-server-68c5fc6c44 to 0
Normal ScalingReplicaSet 13m (x2 over 15m) deployment-controller Scaled up replica set metrics-server-68c5fc6c44 to 1
Further searching led to "Metrics-server is in CrashLoopBackOff with NEW install by rke".
That post checks the responses of the livez and readyz endpoints.
Here is what I get:
$ time curl -k https://10.233.96.18:4443/livez
ok
real 0m0.019s
user 0m0.000s
sys 0m0.010s
$ time curl -k https://10.233.96.18:4443/readyz
[+]ping ok
[+]log ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]informer-sync ok
[+]poststarthook/max-in-flight-filter ok
[-]metric-storage-ready failed: reason withheld
[+]metadata-informer-sync ok
[+]shutdown ok
readyz check failed
real 0m0.013s
user 0m0.009s
sys 0m0.000s
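For anyone reproducing this, the only failing check can be isolated by filtering the readyz output for [-] lines. A minimal sketch, with the probe output above inlined (in practice you would pipe `curl -ks https://10.233.96.18:4443/readyz` into the grep):

```shell
#!/bin/sh
# readyz output copied verbatim from the curl above.
readyz='[+]ping ok
[+]log ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]informer-sync ok
[+]poststarthook/max-in-flight-filter ok
[-]metric-storage-ready failed: reason withheld
[+]metadata-informer-sync ok
[+]shutdown ok
readyz check failed'

# Failing checks are prefixed with [-]; everything else passed.
printf '%s\n' "$readyz" | grep '^\[-\]'
```

So every check passes except metric-storage-ready, which matches the readiness probe failing while livez stays healthy.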
Now the question: [-]metric-storage-ready failed: reason withheld
What is that check, and is it why the pod is failing to become ready?