Score:0

No Pods reachable or schedulable on Kubernetes cluster

I have two Kubernetes clusters in the IBM Cloud; one has two nodes, the other four.

The one with four nodes is working properly, but on the other one I had to temporarily remove the worker nodes for cost reasons (they shouldn't be paid for while sitting idle).

When I reactivated the two nodes, everything seemed to start up fine, and as long as I don't try to interact with Pods, it still looks fine on the surface: no messages about unavailability or a critical health status. Admittedly, I deleted two obsolete Namespaces which got stuck in the Terminating state, but I could resolve that issue by restarting a cluster node (I don't remember exactly which one it was).

When everything looked OK, I tried to access the Kubernetes dashboard (everything I had done before was at the IBM management level or on the command line), but surprisingly I found it unreachable, with an error page in the browser stating:

503: Service Unavailable

There was a small JSON message at the bottom of that page, which said:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": { },
  "status": "Failure",
  "message": "error trying to reach service: read tcp 172.18.190.60:39946-\u003e172.19.151.38:8090: read: connection reset by peer",
  "reason": "ServiceUnavailable",
  "code": 503
}

I ran kubectl logs kubernetes-dashboard-54674bdd65-nf6w7 --namespace=kube-system for the Pod, which was shown as running, but the result was not logs; it was this message instead:

Error from server: Get "https://10.215.17.75:10250/containerLogs/kube-system/kubernetes-dashboard-54674bdd65-nf6w7/kubernetes-dashboard":
read tcp 172.18.135.195:56882->172.19.151.38:8090:
read: connection reset by peer

Then I found out that I can neither get the logs of any Pod running in that cluster, nor deploy any new Kubernetes object that requires scheduling (I could actually apply Services or ConfigMaps, but no Pod, ReplicaSet, Deployment or similar).
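
To illustrate the symptom, the pattern looks roughly like this (the file names are just placeholders, not files from my cluster):

# fails with "connection reset by peer" for every Pod in the cluster
kubectl logs <any-pod-name> --namespace=<its-namespace>

# objects that don't need scheduling can still be applied
kubectl apply -f some-configmap.yaml

# anything that needs scheduling (Pod, ReplicaSet, Deployment, ...) does not result in a running Pod
kubectl apply -f some-deployment.yaml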

I already tried to

  • reload the worker nodes in the worker pool
  • restart the worker nodes in the worker pool
  • restart the kubernetes-dashboard Deployment

Unfortunately, none of the above actions changed the accessibility of the Pods.

There's another thing that might be related (though I'm not quite sure it actually is):

In the other cluster, which runs fine, there are three calico Pods and all three are up, while in the problematic cluster only two of the three calico Pods are up and running; the third one stays in the Pending state, and a kubectl describe pod calico-blablabla-blabla reveals the reason, an Event:

Warning  FailedScheduling  13s   default-scheduler
0/2 nodes are available: 2 node(s) didn't have free ports for the requested pod ports.
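
For anyone wondering which Pods already claim host ports on the two nodes (that is what the event complains about), a generic kubectl query along these lines should list them (just a sketch, nothing IBM-specific):

# print namespace, Pod name and any requested hostPorts, then keep only Pods that actually request one
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" hostPorts: "}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}' | grep 'hostPorts: .'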

Does anyone have a clue about what's going on in that cluster and can point me to possible solutions? I don't really want to delete the cluster and spawn a new one.

Edit

The result of kubectl describe pod kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system:

Name:                 kubernetes-dashboard-54674bdd65-4m2ch
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 10.215.17.82/10.215.17.82
Start Time:           Mon, 15 Nov 2021 09:01:30 +0100
Labels:               k8s-app=kubernetes-dashboard
                      pod-template-hash=54674bdd65
Annotations:          cni.projectcalico.org/containerID: ca52cefaae58d8e5ce6d54883cb6a6135318c8db53d231dc645a5cf2e67d821e
                      cni.projectcalico.org/podIP: 172.30.184.2/32
                      cni.projectcalico.org/podIPs: 172.30.184.2/32
                      container.seccomp.security.alpha.kubernetes.io/kubernetes-dashboard: runtime/default
                      kubectl.kubernetes.io/restartedAt: 2021-11-10T15:47:14+01:00
                      kubernetes.io/psp: ibm-privileged-psp
Status:               Running
IP:                   172.30.184.2
IPs:
  IP:           172.30.184.2
Controlled By:  ReplicaSet/kubernetes-dashboard-54674bdd65
Containers:
  kubernetes-dashboard:
    Container ID:  containerd://bac57850055cd6bb944c4d893a5d315c659fd7d4935fe49083d9ef8ae03e5c31
    Image:         registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard:v2.3.1
    Image ID:      registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard@sha256:f14f581d36b83fc9c1cfa3b0609e7788017ecada1f3106fab1c9db35295fe523
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --auto-generate-certificates
      --namespace=kube-system
    State:          Running
      Started:      Mon, 15 Nov 2021 09:01:37 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/ delay=10s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc9kw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kubernetes-dashboard-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubernetes-dashboard-certs
    Optional:    false
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-sc9kw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 600s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 600s
Events:                      <none>
Mikołaj Głodziak: Hello, it is possible that the problem is connected to the SSL certificate. Please look at [this question](https://stackoverflow.com/questions/46411598/kubernetes-dashboard-serviceunavailable-503-error) and let me know about the result. Which Kubernetes version did you use?

deHaar: @MikołajGłodziak, thanks for your suggestions. The cluster version is 1.22.2_1526 and the worker nodes have version 1.22.2_1528. The next thing I will do (again, right now) is update the cluster. I'll check the question you linked, thanks again!

Mikołaj Głodziak: And how exactly did you set up your cluster? Is it bare metal or some cloud provider? It is important to reproduce your problem. Please check my suggestion and let me know ;)

deHaar: It's a classic cluster in the IBM Cloud which I set up using the web console (and a CLI for some interactions).

deHaar: @MikołajGłodziak, could the reason be an old (maybe restored) TLS certificate that was on the first nodes (and should have been deleted weeks ago)? I can see a suspicious `Secret`...

Mikołaj Głodziak: Yes, sure, that is possible. Assuming you have a current certificate and have restored the old one (which should be removed), it is possible that the old one now looks like the newest one. However, it is out of date, so you get an error.

deHaar: Hmm, I don't have a new or current certificate, but possibly one was generated when the new nodes (or the new worker pool) came up. I have to dig into that a little deeper...

Mikołaj Głodziak: Could you also run `kubectl describe pod <your dashboard pod>` and paste the results into the question?

deHaar: It's now included in the question...

Mikołaj Głodziak: Did you check the SSL certificate issue?

deHaar: Not so far, I couldn't find out how to... The answer in the question you linked was not applicable in the IBM Cloud.

Mikołaj Głodziak: You said "could the reason be an old (maybe restored) TLS certificate that was on the first nodes (that should have been deleted weeks ago)? I can see a suspicious Secret..." Are you sure that you have only one valid certificate?

deHaar: I'm not sure about that, but the cloud provider has found out that this issue was caused by updating the cluster version past 1.21 with public and private service endpoints enabled and VRF disabled. This constellation led to my problem, which is still unresolved and will most likely stay that way. The provider says this isn't related to certificates.

deHaar: @MikołajGłodziak, thanks for your interest in this matter; please see my own answer, which I found out in a three-day fight with IBM support. Someone there finally pointed me to the solution.
Score:2

Problem resolved…

The cause of the problem was an update of the cluster to Kubernetes version 1.21 while my cluster met the following conditions:

  • private and public service endpoint enabled
  • VRF disabled

Root cause:

In Kubernetes version 1.21, Konnectivity replaces OpenVPN as the network proxy that secures the communication from the Kubernetes API server (master) to the worker nodes in the cluster.
When using Konnectivity, a problem exists with master-to-worker-node communication when all of the above-mentioned conditions are met.
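
A quick way to see whether the Konnectivity agents are present and healthy in such a cluster (a generic check; the exact Pod names may differ by provider):

kubectl get pods -n kube-system | grep -i konnectivity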

Solution steps:

  • disabled the private service endpoint (the public one does not seem to be a problem) using the command
    ibmcloud ks cluster master private-service-endpoint disable --cluster <CLUSTER_NAME> (this command is provider-specific; if you are experiencing the same problem with a different provider or on a local installation, find out how to disable that private service endpoint there)
  • refreshed the cluster master using ibmcloud ks cluster master refresh --cluster <CLUSTER_NAME>, and finally
  • reloaded all the worker nodes (I did this in the web console; it should be possible via a command as well; see the consolidated CLI sketch below this list)
  • waited for about 30 minutes:
    • Dashboard available / reachable again
    • Pods accessible and schedulable again
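
For reference, the whole sequence on the CLI looks roughly like this (IBM Cloud CLI with the Kubernetes Service plugin; <CLUSTER_NAME> and <WORKER_ID> are placeholders, and since I triggered the worker reloads in the web console, treat the last two commands as an untested sketch):

# disable the private service endpoint
ibmcloud ks cluster master private-service-endpoint disable --cluster <CLUSTER_NAME>

# refresh the cluster master so the change takes effect
ibmcloud ks cluster master refresh --cluster <CLUSTER_NAME>

# list the worker nodes, then reload each one
ibmcloud ks worker ls --cluster <CLUSTER_NAME>
ibmcloud ks worker reload --cluster <CLUSTER_NAME> --worker <WORKER_ID>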

General recommendation:

BEFORE you update any cluster to Kubernetes 1.21, check whether the private service endpoint is enabled. If it is, either disable it, delay the update until you can, or enable VRF (virtual routing and forwarding), which I couldn't do but was told would likely resolve the issue.
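
On IBM Cloud, the endpoint configuration of a cluster can be checked on the CLI, for example with something like the following (the exact field names may vary between CLI versions):

ibmcloud ks cluster get --cluster <CLUSTER_NAME> | grep -i 'service endpoint'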
