I have a GKE k8s cluster (k8s 1.22) that consists of preemptible nodes only, which means that critical services like kube-dns also run on preemptible nodes. It's a dev cluster that can tolerate a few broken minutes a day. Every time a node hosting a kube-dns pod gets shut down, I run into DNS resolution problems that persist until I delete the failed pod (since 1.21, pods on shut-down nodes stay in "Status: Failed" / "Reason: Shutdown" until manually deleted).
While I do expect some problems on preemptible nodes while they are being recycled, I would expect this to self-heal within a few minutes. The underlying reason for the persistent problem seems to be that the failed pod does not get removed from the k8s Service / Endpoints. This is what I can see in the system:
Status of the pods via kubectl -n kube-system get po -l k8s-app=kube-dns:
NAME                        READY   STATUS       RESTARTS   AGE
kube-dns-697dc8fc8b-47rxd   4/4     Terminated   0          43h
kube-dns-697dc8fc8b-mkfrp   4/4     Running      0          78m
kube-dns-697dc8fc8b-zfvn8   4/4     Running      0          19h
The IP of the failed pod is 192.168.144.2, and it is still listed as one of the endpoints of the service:
kubectl -n kube-system describe ep kube-dns shows this:
Name:         kube-dns
Namespace:    kube-system
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              k8s-app=kube-dns
              kubernetes.io/cluster-service=true
              kubernetes.io/name=KubeDNS
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2022-02-21T10:15:54Z
Subsets:
  Addresses:          192.168.144.2,192.168.144.7,192.168.146.29
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    dns-tcp  53    TCP
    dns      53    UDP
Events:  <none>
I know others have worked around these issues by scheduling kube-dns onto non-preemptible nodes, but I would rather make this self-healing instead, as node failures can still happen on non-preemptible nodes; they are just less likely.
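For reference, that workaround typically keeps kube-dns off preemptible nodes via node affinity. A minimal sketch, assuming GKE's cloud.google.com/gke-preemptible node label and at least one non-preemptible node pool; note that the kube-dns Deployment is managed by the addon manager (addonmanager.kubernetes.io/mode=Reconcile above), so a direct edit may get reverted:

# Hypothetical affinity snippet for the kube-dns pod template (untested):
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-preemptible   # label GKE puts on preemptible nodes
                operator: DoesNotExist                   # only schedule onto non-preemptible nodes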
My questions:
- Why is the failed pod still listed as one of the endpoints of the service, even hours after the initial node failure?
- What can I do to mitigate the problem (besides adding some non-preemptible nodes)? One stopgap I'm considering is sketched after this list.
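The stopgap I have in mind (untested, and it treats the symptom rather than the cause) is to periodically delete pods that were left behind in the Failed phase, so that their stale addresses drop out of the Endpoints object, e.g. run from a cron on a management host:

# List pods left behind after node shutdowns (phase Failed, e.g. reason Shutdown):
kubectl -n kube-system get pods --field-selector=status.phase=Failed

# Delete them so their addresses are removed from the Service's Endpoints:
kubectl -n kube-system delete pods --field-selector=status.phase=Failed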
It seems that the default kube-dns deployment in GKE does not have a readiness probe attached to dnsmasq (port 53), which is the port targeted by the kube-dns service, and that adding one could solve the issue - but I suspect it is missing for a reason that I don't yet understand.
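To make this concrete, what I have in mind is something along these lines on the dnsmasq container (a hypothetical, untested snippet that is not part of the stock GKE manifest; again, the addon manager may revert it):

# Hypothetical readiness probe for the dnsmasq container:
readinessProbe:
  tcpSocket:
    port: 53          # the DNS port the kube-dns Service targets
  initialDelaySeconds: 5
  periodSeconds: 10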
EDIT: Apparently this does not happen on 1.21.6-gke.1500 (regular channel), but it does on 1.22.6-gke.1500 (rapid channel). I do not have a good explanation, but despite a few pods having failed today, the kube-dns service there only contains the working ones.