So I've got a weird issue that I'm not sure how to solve:
Right now pods are reporting that they don't have internet access. I've narrowed the issue down to a DNS problem (the cluster is an on-prem setup deployed with Kubespray, which uses CoreDNS). When I stand up a debug pod, I get the following behavior:
- When I send a DNS request for www.google.com (
dig www.google.com
) the tcpdump logs for port 53 look like how I'd expect:
21:10:33.025899 IP debug.59031 > 169.254.25.10.domain: 18350+ [lau] A? www.google.com. (43)
21:10:33.026542 IP debug.52810 > 169.254.25.10.domain: 33725+ PTR? 10.25.254.169.in-addr.arpa. (44)
21:10:33.036522 IP 169.254.25.10.domain > debug.52810: 33725 NXDomain 0/0/0 (44)
21:10:33.036665 IP 169.254.25.10.domain > debug.59031: 18350 1/0/1 A 142.250.80.36 (73)
- When I send an HTTP request to www.google.com (
curl https://www.google.com
) the tcpdump logs for port 53 show that it's appending the search domains to the DNS requests, which explains why the pods are reporting no internet.
21:10:40.068763 IP debug.43031 > 169.254.25.10.domain: 24294+ A? www.google.com.kube-system.svc.<kubernetes domain>. (63)
21:10:40.068826 IP debug.43031 > 169.254.25.10.domain: 7902+ AAAA? www.google.com.kube-system.svc.<kubernetes domain>. (63)
21:10:40.069778 IP 169.254.25.10.domain > debug.43031: 7902 NXDomain*- 0/1/0 (159)
21:10:40.069891 IP 169.254.25.10.domain > debug.43031: 24294 NXDomain*- 0/1/0 (159)
21:10:40.070007 IP debug.38363 > 169.254.25.10.domain: 26807+ A? www.google.com.svc.<kubernetes domain>. (51)
21:10:40.070049 IP debug.38363 > 169.254.25.10.domain: 39068+ AAAA? www.google.com.svc.<kubernetes domain>. (51)
21:10:40.070643 IP 169.254.25.10.domain > debug.38363: 26807 NXDomain*- 0/1/0 (147)
21:10:40.070807 IP 169.254.25.10.domain > debug.38363: 39068 NXDomain*- 0/1/0 (147)
21:10:40.070891 IP debug.38087 > 169.254.25.10.domain: 40210+ A? www.google.com.<kubernetes domain>. (47)
21:10:40.070935 IP debug.38087 > 169.254.25.10.domain: 41616+ AAAA? www.google.com.<kubernetes domain>. (47)
21:10:40.071461 IP 169.254.25.10.domain > debug.38087: 41616 NXDomain*- 0/1/0 (143)
21:10:40.071632 IP 169.254.25.10.domain > debug.38087: 40210 NXDomain*- 0/1/0 (143)
21:10:40.071706 IP debug.46700 > 169.254.25.10.domain: 3263+ A? www.google.com.<search domain of machine pod is running on>. (53)
21:10:40.071748 IP debug.46700 > 169.254.25.10.domain: 19702+ AAAA? www.google.com.<search domain of machine pod is running on>. (53)
21:10:40.089999 IP 169.254.25.10.domain > debug.46700: 3263 1/0/0 A <our public ip> (104)
21:10:40.093058 IP 169.254.25.10.domain > debug.46700: 19702 0/1/0 (147)
So I'm not sure whether the expected behavior is to append the search domains first and only then try the name as-is. If it shouldn't be doing that, I'd like to know why it is and how to fix it. If that is the expected behavior, then I need to figure out why CoreDNS is resolving the name to my public IP rather than the correct IP, and how to fix that.
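To make the question concrete, here is a minimal sketch of the search-list ordering I'd expect from a glibc-style stub resolver. It assumes the pod's /etc/resolv.conf carries the Kubernetes default of ndots:5 for the ClusterFirst DNS policy; I haven't confirmed that is what this pod actually has. The function name candidate_names and the domains cluster.local and example.internal are placeholders I made up, standing in for the redacted <kubernetes domain> and the host's search domain.

```python
def candidate_names(name, search, ndots=5):
    """Return the query names a glibc-style stub resolver would try, in order.

    This is a simplified model: it ignores timeouts, retries, and resolver
    implementations (e.g. musl) that may behave differently.
    """
    # A trailing dot marks the name as fully qualified: no search list is applied.
    if name.endswith("."):
        return [name]

    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."

    # With fewer than `ndots` dots, the search domains are tried before the bare name.
    if name.count(".") < ndots:
        return expanded + [absolute]
    # Otherwise the bare name is tried first and the search list is the fallback.
    return [absolute] + expanded


# Hypothetical search list mirroring the shape of the queries in the capture.
search = [
    "kube-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "example.internal",  # placeholder for the host machine's search domain
]

for query in candidate_names("www.google.com", search):
    print(query)
```

Under those assumptions, www.google.com has only two dots, so all four search-domain variants are tried before the bare name, which matches the order in the curl capture. The resolver stops at the first successful answer, so if the host's search domain answers for www.google.com.<search domain> (for example through a wildcard record, which is only a guess on my part), the bare name would never be queried. As far as I know, dig does not apply the search list unless +search is given, which would explain why the dig capture looks different.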