Context:
I recently encountered an issue where a Kubernetes pod (blackbox-exporter) receives an empty response whenever it calls the Ingress URL of a pod that resides on the same node as itself. This shows up as an intermittently failing probe on the dashboard.
The ingress controller is ingress-nginx and it sits behind an AWS NLB.
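For reference, the controller Service is exposed roughly like this (a minimal sketch from memory; the real manifest carries more annotations and ports):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    # provisions an NLB instead of the default Classic ELB
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https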
Example:
node1: 192.168.20.2
node2: 192.168.20.3
node3: 192.168.20.4
blackbox-exporter (deployed in node1, with clusterIP 10.244.2.21)
foo-pod (deployed in node1, with clusterIP 10.244.2.22)
foo-pod (deployed in node2, with clusterIP 10.244.2.23)
foo-pod (deployed in node3, with clusterIP 10.244.2.24)
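Pod placement was confirmed with the usual listing (pod names are illustrative and the output columns are trimmed):

kubectl get pods -o wide
NAME                READY   STATUS    IP            NODE
blackbox-exporter   1/1     Running   10.244.2.21   node1
foo-pod-1           1/1     Running   10.244.2.22   node1
foo-pod-2           1/1     Running   10.244.2.23   node2
foo-pod-3           1/1     Running   10.244.2.24   node3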
Ingress-controller logs:
192.168.20.3 - - [21/Jun/2021:15:15:07 +0000] "GET /metrics HTTP/1.1" 200 29973 "-" "curl/7.47.0" 90 0.005 [foo-pod] [] 10.32.0.2:3000 30015 0.004 200 e39022b47e857cc48eb6a127a7b8ce24
192.168.20.4 - - [21/Jun/2021:15:16:00 +0000] "GET /metrics HTTP/1.1" 200 29973 "-" "curl/7.47.0" 90 0.005 [foo-pod] [] 10.32.0.2:3000 30015 0.004 200 e39022b47e857cc48eb6a127a7b8ce24
192.168.20.3 - - [21/Jun/2021:15:16:30 +0000] "GET /metrics HTTP/1.1" 200 29973 "-" "curl/7.47.0" 90 0.005 [foo-pod] [] 10.32.0.2:3000 30015 0.004 200 e39022b47e857cc48eb6a127a7b8ce24
Tracing the ingress controller logs showed that the "empty response" (a timeout after 5s) only occurs when the pod making the Ingress URL call is deployed on the same node as the target pod that is supposed to respond to that call.
I drew this conclusion from the fact that whenever the "empty response" was received, there was never a log entry whose origin IP matched the IP of the node the blackbox-exporter runs on, which in this case should be node1 (192.168.20.2).
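The failure can also be reproduced by hand from inside the blackbox-exporter pod (the hostname below is a placeholder for the real Ingress host, and the pod name is illustrative):

kubectl exec -it blackbox-exporter-xxxxx -- curl -v --max-time 5 http://foo.example.com/metrics
# intermittently fails with an empty reply / 5s timeout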
Suspecting this was related to the "incorrect" source IP, with the result that the target pod does not know how to return a response, I switched to an AWS Classic L7 load balancer and the issue was resolved.
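The switch was just a change to the controller Service annotations, roughly as sketched below (not the exact diff; dropping the nlb type annotation falls back to a Classic ELB, and the backend-protocol annotation is my understanding of how to put it into L7/HTTP mode):

metadata:
  annotations:
    # service.beta.kubernetes.io/aws-load-balancer-type: "nlb"            # removed
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http" # Classic ELB in L7 mode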
The logs now show the source IP replaced by the pod's actual cluster IP, and all probing calls from the blackbox-exporter succeed:
10.244.2.21 - - [21/Jun/2021:15:15:07 +0000] "GET /metrics HTTP/1.1" 200 29973 "-" "curl/7.47.0" 90 0.005 [foo-pod] [] 10.32.0.2:3000 30015 0.004 200 e39022b47e857cc48eb6a127a7b8ce24
10.244.2.21 - - [21/Jun/2021:15:16:00 +0000] "GET /metrics HTTP/1.1" 200 29973 "-" "curl/7.47.0" 90 0.005 [foo-pod] [] 10.32.0.2:3000 30015 0.004 200 e39022b47e857cc48eb6a127a7b8ce24
10.244.2.21 - - [21/Jun/2021:15:16:30 +0000] "GET /metrics HTTP/1.1" 200 29973 "-" "curl/7.47.0" 90 0.005 [foo-pod] [] 10.32.0.2:3000 30015 0.004 200 e39022b47e857cc48eb6a127a7b8ce24
More information:
Cluster version: AWS EKS v1.19
Question:
Linux/Kubernetes networking isn't my strength, so what I would like to ask is: what exactly is going on here?
Why does switching to an AWS Classic L7 load balancer solve the issue?
Could any other components (Kubernetes or Linux) also be affecting this?