Bad address translation for a Kubernetes UDP service by kube-proxy in iptables mode

For educational purposes, I built myself a Kubernetes cluster following the Kubernetes the Hard Way guide, except that I did it on my own set of local VMs rather than on Google Cloud.

Everything seemed to work nicely until I noticed some network communication issues. Specifically, DNS resolution was working only intermittently.

I have CoreDNS installed from its Helm chart:

helm repo add coredns https://coredns.github.io/helm
helm install -n kube-system coredns coredns/coredns \
  --set service.clusterIP=10.32.0.10,replicaCount=2

Here's a view of my cluster:

$ kubectl get nodes -o wide
NAME      STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker0   Ready    <none>   5d9h   v1.27.3   192.168.64.20   <none>        Ubuntu 22.04.2 LTS   5.15.0-76-generic   containerd://1.7.2
worker1   Ready    <none>   5d9h   v1.27.3   192.168.64.21   <none>        Ubuntu 22.04.2 LTS   5.15.0-76-generic   containerd://1.7.2
worker2   Ready    <none>   5d9h   v1.27.3   192.168.64.22   <none>        Ubuntu 22.04.2 LTS   5.15.0-76-generic   containerd://1.7.2

$ kubectl get pod -A -o wide
NAMESPACE         NAME                                                              READY   STATUS    RESTARTS      AGE     IP            NODE      NOMINATED NODE   READINESS GATES
default           debu                                                              1/1     Running   0             13h     10.200.2.17   worker2   <none>           <none>
kube-system       coredns-coredns-7bbdc98b98-v6qtk                                  1/1     Running   0             42s     10.200.2.18   worker2   <none>           <none>
kube-system       coredns-coredns-7bbdc98b98-wj2f6                                  1/1     Running   0             5d7h    10.200.0.3    worker0   <none>           <none>

The DNS service:

$ kubectl get svc -n kube-system
NAME              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
coredns-coredns   ClusterIP   10.32.0.10   <none>        53/UDP,53/TCP   5d7h

Now, when I run DNS lookups from the debu pod, resolution sometimes works:

$ kubectl exec -it debu -- nslookup -type=a kubernetes.default.svc.cluster.local.
Server:     10.32.0.10
Address:    10.32.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.32.0.1

but sometimes it does not:

$ kubectl exec -it debu -- nslookup -type=a kubernetes.default.svc.cluster.local.
;; communications error to 10.32.0.10#53: timed out
;; communications error to 10.32.0.10#53: timed out

I dug further and found that the issue depends on which coredns pod kube-proxy picks as the backend:

  • when kube-proxy forwards the DNS request to 10.200.0.3 (a pod on a different node than my debu pod), the resolution works
  • when kube-proxy forwards the DNS request to 10.200.2.18 (a pod on the same node as my debu pod), the resolution does not work (see the direct-query check below)
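
As a sanity check that the pods themselves are fine, the service can be bypassed by querying each coredns pod IP directly (pod IPs taken from the listing above; nslookup takes the server as its last argument). Both queries should succeed, because no address translation is involved; only the service-IP path to the local pod fails:

# remote coredns pod, queried directly
$ kubectl exec -it debu -- nslookup -type=a kubernetes.default.svc.cluster.local. 10.200.0.3

# local coredns pod, queried directly - the reply's source matches the address
# the client sent to, so this works even while the service path is broken
$ kubectl exec -it debu -- nslookup -type=a kubernetes.default.svc.cluster.local. 10.200.2.18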

So I went deeper and captured some traffic:

$ kubectl exec -it debu -- tcpdump -vn udp port 53
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

# here kube-proxy chose the "remote" coredns pod
08:39:34.299059 IP (tos 0x0, ttl 64, id 5764, offset 0, flags [none], proto UDP (17), length 82)
    10.200.2.17.35002 > 10.32.0.10.53: 25915+ A? kubernetes.default.svc.cluster.local. (54)
08:39:34.299782 IP (tos 0x0, ttl 62, id 48854, offset 0, flags [DF], proto UDP (17), length 134)
    10.32.0.10.53 > 10.200.2.17.35002: 25915*- 1/0/0 kubernetes.default.svc.cluster.local. A 10.32.0.1 (106)

# here kube-proxy chose the "local" coredns pod
08:39:36.588485 IP (tos 0x0, ttl 64, id 31594, offset 0, flags [none], proto UDP (17), length 82)
    10.200.2.17.45242 > 10.32.0.10.53: 33921+ A? kubernetes.default.svc.cluster.local. (54)
08:39:36.588670 IP (tos 0x0, ttl 64, id 17121, offset 0, flags [DF], proto UDP (17), length 134)
    10.200.2.18.53 > 10.200.2.17.45242: 33921*- 1/0/0 kubernetes.default.svc.cluster.local. A 10.32.0.1 (106)

Take note of the source address of the DNS response. When talking to the remote coredns pod, the response comes from 10.32.0.10 (the service IP), but when talking to the local coredns pod, it comes from 10.200.2.18 (the pod IP), which does not match the destination address of the original request (the service IP). A UDP client typically connect()s its socket to the server address, so a reply arriving from a different source is silently discarded, which is why the lookup times out.
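
One way to see what the reverse translation should look like is to inspect conntrack's NAT entries on the node (this assumes the conntrack CLI from conntrack-tools is installed there):

# on worker2: UDP conntrack entries for queries to the service IP; the reply
# tuple shows the source rewrite conntrack intends to apply, and an entry
# stuck in [UNREPLIED] means the reply never traversed netfilter
$ sudo conntrack -L -p udp -d 10.32.0.10 --dport 53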

As far as I understand, the component responsible for this translation is kube-proxy, via the iptables rules it sets up. Why is the reverse translation not being applied?
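
One way to narrow this down is to watch the nat-table packet counters while reproducing the failure (the chain name is taken from the dump below). If the request-side DNAT counter increments, the forward translation is firing and the problem lies on the reply path:

# on worker2: per-rule packet counters; repeat a failing lookup and re-run
$ sudo iptables -t nat -L KUBE-SEP-IGILC3MHHXCPPD2V -v -n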

Here's a dump of iptables rules from worker2, set up by kube-proxy:

$ sudo iptables-save
# Generated by iptables-save v1.8.7 on Thu Jul 20 08:50:55 2023
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:KUBE-IPTABLES-HINT - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-PROXY-CANARY - [0:0]
COMMIT
# Completed on Thu Jul 20 08:50:55 2023
# Generated by iptables-save v1.8.7 on Thu Jul 20 08:50:55 2023
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:KUBE-EXTERNAL-SERVICES - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-NODEPORTS - [0:0]
:KUBE-PROXY-CANARY - [0:0]
:KUBE-PROXY-FIREWALL - [0:0]
:KUBE-SERVICES - [0:0]
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes load balancer firewall" -j KUBE-PROXY-FIREWALL
-A INPUT -m comment --comment "kubernetes health check service ports" -j KUBE-NODEPORTS
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes load balancer firewall" -j KUBE-PROXY-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes load balancer firewall" -j KUBE-PROXY-FIREWALL
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
# Completed on Thu Jul 20 08:50:55 2023
# Generated by iptables-save v1.8.7 on Thu Jul 20 08:50:55 2023
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:CNI-6c816826e5cedfcfc87d8961 - [0:0]
:CNI-a465ef0ed6a0180a9e27d1cb - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODEPORTS - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-PROXY-CANARY - [0:0]
:KUBE-SEP-2PLY6KCKFADRAI56 - [0:0]
:KUBE-SEP-EA76CHRYFR6YRKN6 - [0:0]
:KUBE-SEP-EVM2BZXZKR6FG27U - [0:0]
:KUBE-SEP-IGILC3MHHXCPPD2V - [0:0]
:KUBE-SEP-TVM3X65DZPREBP7U - [0:0]
:KUBE-SEP-VVBZLDDCGYIIOLML - [0:0]
:KUBE-SEP-ZY5Q3ULAQ5ZYZJLS - [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-3MN7Q5WEBLVAXORV - [0:0]
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
:KUBE-SVC-S3NG4EFDCNWS3YQS - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 10.200.2.17/32 -m comment --comment "name: \"bridge\" id: \"49fabffb4b0f772ce5a80a41b7062980872561d679a2f32dcae016058125e1eb\"" -j CNI-6c816826e5cedfcfc87d8961
-A POSTROUTING -s 10.200.2.18/32 -m comment --comment "name: \"bridge\" id: \"d7ffc93685956ccff585923788c11383bb3d10f7d8633249a966f51fede7c1ef\"" -j CNI-a465ef0ed6a0180a9e27d1cb
-A CNI-6c816826e5cedfcfc87d8961 -d 10.200.2.0/24 -m comment --comment "name: \"bridge\" id: \"49fabffb4b0f772ce5a80a41b7062980872561d679a2f32dcae016058125e1eb\"" -j ACCEPT
-A CNI-6c816826e5cedfcfc87d8961 ! -d 224.0.0.0/4 -m comment --comment "name: \"bridge\" id: \"49fabffb4b0f772ce5a80a41b7062980872561d679a2f32dcae016058125e1eb\"" -j MASQUERADE
-A CNI-a465ef0ed6a0180a9e27d1cb -d 10.200.2.0/24 -m comment --comment "name: \"bridge\" id: \"d7ffc93685956ccff585923788c11383bb3d10f7d8633249a966f51fede7c1ef\"" -j ACCEPT
-A CNI-a465ef0ed6a0180a9e27d1cb ! -d 224.0.0.0/4 -m comment --comment "name: \"bridge\" id: \"d7ffc93685956ccff585923788c11383bb3d10f7d8633249a966f51fede7c1ef\"" -j MASQUERADE
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
-A KUBE-SEP-2PLY6KCKFADRAI56 -s 192.168.64.11/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-2PLY6KCKFADRAI56 -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 192.168.64.11:6443
-A KUBE-SEP-EA76CHRYFR6YRKN6 -s 10.200.0.3/32 -m comment --comment "kube-system/coredns-coredns:tcp-53" -j KUBE-MARK-MASQ
-A KUBE-SEP-EA76CHRYFR6YRKN6 -p tcp -m comment --comment "kube-system/coredns-coredns:tcp-53" -m tcp -j DNAT --to-destination 10.200.0.3:53
-A KUBE-SEP-EVM2BZXZKR6FG27U -s 10.200.2.18/32 -m comment --comment "kube-system/coredns-coredns:tcp-53" -j KUBE-MARK-MASQ
-A KUBE-SEP-EVM2BZXZKR6FG27U -p tcp -m comment --comment "kube-system/coredns-coredns:tcp-53" -m tcp -j DNAT --to-destination 10.200.2.18:53
-A KUBE-SEP-IGILC3MHHXCPPD2V -s 10.200.2.18/32 -m comment --comment "kube-system/coredns-coredns:udp-53" -j KUBE-MARK-MASQ
-A KUBE-SEP-IGILC3MHHXCPPD2V -p udp -m comment --comment "kube-system/coredns-coredns:udp-53" -m udp -j DNAT --to-destination 10.200.2.18:53
-A KUBE-SEP-TVM3X65DZPREBP7U -s 192.168.64.12/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-TVM3X65DZPREBP7U -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 192.168.64.12:6443
-A KUBE-SEP-VVBZLDDCGYIIOLML -s 192.168.64.10/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-VVBZLDDCGYIIOLML -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 192.168.64.10:6443
-A KUBE-SEP-ZY5Q3ULAQ5ZYZJLS -s 10.200.0.3/32 -m comment --comment "kube-system/coredns-coredns:udp-53" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZY5Q3ULAQ5ZYZJLS -p udp -m comment --comment "kube-system/coredns-coredns:udp-53" -m udp -j DNAT --to-destination 10.200.0.3:53
-A KUBE-SERVICES -d 10.32.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 10.32.0.10/32 -p udp -m comment --comment "kube-system/coredns-coredns:udp-53 cluster IP" -m udp --dport 53 -j KUBE-SVC-3MN7Q5WEBLVAXORV
-A KUBE-SERVICES -d 10.32.0.10/32 -p tcp -m comment --comment "kube-system/coredns-coredns:tcp-53 cluster IP" -m tcp --dport 53 -j KUBE-SVC-S3NG4EFDCNWS3YQS
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-3MN7Q5WEBLVAXORV ! -s 10.200.0.0/16 -d 10.32.0.10/32 -p udp -m comment --comment "kube-system/coredns-coredns:udp-53 cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-3MN7Q5WEBLVAXORV -m comment --comment "kube-system/coredns-coredns:udp-53 -> 10.200.0.3:53" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ZY5Q3ULAQ5ZYZJLS
-A KUBE-SVC-3MN7Q5WEBLVAXORV -m comment --comment "kube-system/coredns-coredns:udp-53 -> 10.200.2.18:53" -j KUBE-SEP-IGILC3MHHXCPPD2V
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 10.200.0.0/16 -d 10.32.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https -> 192.168.64.10:6443" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-VVBZLDDCGYIIOLML
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https -> 192.168.64.11:6443" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-2PLY6KCKFADRAI56
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https -> 192.168.64.12:6443" -j KUBE-SEP-TVM3X65DZPREBP7U
-A KUBE-SVC-S3NG4EFDCNWS3YQS ! -s 10.200.0.0/16 -d 10.32.0.10/32 -p tcp -m comment --comment "kube-system/coredns-coredns:tcp-53 cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-S3NG4EFDCNWS3YQS -m comment --comment "kube-system/coredns-coredns:tcp-53 -> 10.200.0.3:53" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-EA76CHRYFR6YRKN6
-A KUBE-SVC-S3NG4EFDCNWS3YQS -m comment --comment "kube-system/coredns-coredns:tcp-53 -> 10.200.2.18:53" -j KUBE-SEP-EVM2BZXZKR6FG27U
COMMIT
# Completed on Thu Jul 20 08:50:55 2023
Answer:

The solution was to run:

modprobe br_netfilter

on all worker nodes.
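
Note that modprobe only loads the module until the next reboot. To make it persistent (a sketch for the Ubuntu 22.04 nodes used here, via systemd's modules-load mechanism; the Kubernetes docs use this same file name):

# load br_netfilter automatically on every boot
$ echo br_netfilter | sudo tee /etc/modules-load.d/k8s.conf

# verify the bridge hook is active (1 = bridged traffic traverses iptables)
$ cat /proc/sys/net/bridge/bridge-nf-call-iptables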

Why? Pods on the same node are attached to the same bridge interface, which means they are connected at layer 2, so traffic between two pods on the same node is switched by the bridge and does not pass through iptables at all. This is exactly what breaks the reply here: the request was DNAT-ed from the service IP to the pod IP on its way in, so the reply has to pass back through conntrack to have its source rewritten to the service IP; because it travels pod-to-pod over the bridge, it skips iptables and the raw pod address leaks through to the client.
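
This topology is easy to verify on the node with generic iproute2 commands (the bridge name depends on the CNI bridge plugin configuration):

# on worker2: list Linux bridges and the interfaces enslaved to them -
# both pod veth endpoints hang off the same bridge
$ ip link show type bridge
$ bridge link show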

This behaviour makes sense overall: if pods are connected at layer 2, there is no inherent reason to push their traffic up through layer 3. However, kube-proxy's iptables rules assume that all pod-to-pod traffic does pass through netfilter. Loading the br_netfilter module restores that assumption: it makes bridged traffic traverse the iptables hooks, gated by the net.bridge.bridge-nf-call-iptables sysctl.
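
On most kernels net.bridge.bridge-nf-call-iptables already defaults to 1 once br_netfilter is loaded, but the Kubernetes container-runtime docs recommend pinning it explicitly (a sketch):

# enable now and persist across reboots
$ sudo sysctl net.bridge.bridge-nf-call-iptables=1
$ echo 'net.bridge.bridge-nf-call-iptables = 1' | sudo tee /etc/sysctl.d/k8s.conf
$ sudo sysctl --system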

