Score:3

Tcpdump showing different redirection port after adding REDIRECT rule in iptables


I am attempting to direct client traffic to a Kubernetes cluster NodePort listening on 192.168.1.100:30000.

Clients need to make a request to 192.168.1.100:8000, so I added the following REDIRECT rule in iptables:

iptables -t nat -I PREROUTING -p tcp --dst 192.168.1.100 --dport 8000 -j REDIRECT --to-port 30000
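
For reference (a generic iptables check, not specific to this setup), the rule and its hit counters can be listed with:

# list nat/PREROUTING rules with packet counters; a nat rule's pkts
# column only counts the first packet of each new connection
iptables -t nat -L PREROUTING -v -n --line-numbers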

I then issue a curl to 192.168.1.100:8000; however, in tcpdump I see a different port:

# tcpdump -i lo -nnvvv host 192.168.1.100 and port 8000
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
20:39:22.685968 IP (tos 0x0, ttl 64, id 20590, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.100.8000 > 192.168.1.100.49816: Flags [R.], cksum 0xacda (correct), seq 0, ack 3840205844, win 0, length 0
20:39:37.519256 IP (tos 0x0, ttl 64, id 34221, offset 0, flags [DF], proto TCP (6), length 40)

I would expect tcpdump to show something like:

192.168.1.100.8000 > 192.168.1.100.30000

However, it shows the following, causing a connection refused error, since no process is listening on 192.168.1.100.49816:

192.168.1.100.8000 > 192.168.1.100.49816

I am using a test environment, so I don't have access to remote devices; that is why I am using curl to test the iptables REDIRECT path.

Is there a reason why adding a REDIRECT rule causes tcpdump to show the traffic going to a different port than the one specified?

Edit:

Following @A.B.'s suggestion, I added the following OUTPUT rule:

iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000

and curl does proceed further; the packet count for the OUTPUT chain increases (the PREROUTING REDIRECT rule's packet count didn't increase, though):

2       10   600 REDIRECT   tcp  --  *      *       0.0.0.0/0            192.168.1.100         tcp dpt:8000 redir ports 30000

However, I get the following error:

# curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
*   Trying 192.168.1.100...
* Connected to 192.168.1.100 (192.168.1.100) port 8000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* NSS error -12263 (SSL_ERROR_RX_RECORD_TOO_LONG)
* SSL received a record that exceeded the maximum permissible length.
* Closing connection 0
curl: (35) SSL received a record that exceeded the maximum permissible length.

Also, I tried adding a remotesystem network namespace; this time the PREROUTING REDIRECT rule's packet count increases after executing curl from the remotesystem namespace (but the OUTPUT chain's count doesn't):

2       34  2040 REDIRECT   tcp  --  *      *       0.0.0.0/0            172.16.128.1         tcp dpt:8000 redir ports 30000

Error:

# ip netns exec remotesystem curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
*   Trying 192.168.1.100...
* Connection timed out
* Failed connect to 192.168.1.100:8000; Connection timed out
* Closing connection 0
curl: (7) Failed connect to 192.168.1.100:8000; Connection timed out
A.B:
Your rule won't work with a test from the host. Test again from a remote system, not from the system to itself.
tiger_groove:
Why wouldn't it work, could you explain? Is there a way to make it work from the host with a loopback interface?
A.B:
Please first add context to the question: you tell us what you are doing, but I would like to know why you are doing it (what practical problem made you use this?).
tiger_groove:
Added in the post.
A.B:
I can't tell that the reason for using the local system (rather than a remote one) is explained, but the answer won't need it in the end.
tiger_groove:
I am using a test environment, so I don't have access to remote devices; that is why I am using `curl` to test the iptables REDIRECT path. What do you mean the answer won't need it anyway?
jcaron:
Note that `192.168.1.100.8000 > 192.168.1.100.49816` doesn't mean "redirecting from port 8000 to port 49816", it means "a packet was sent from port 8000 to port 49816", which is simply the port used by your (local) client, and the packet is the TCP RST ("connection refused"). You should have a prior packet from 49816 to 8000 before that (the connection request, TCP SYN). And the connection refused is not because there isn't anything listening on 49816, but rather nothing listening on 8000.
Score:4
A.B

To be clear: OP's test is done from the system 192.168.1.100 to itself, not from a remote system, and that's the cause of the problem. The port wasn't changed in this case because no NAT rule matched, while it would have matched if done from a remote system.

The schematic below shows the order in which operations are performed on a packet:

[Schematic: Packet flow in Netfilter and General Networking]

The reason is how NAT works on Linux: iptables sees a packet in the nat table only for the first packet of a new conntrack flow (which is thus in the NEW state).

This rule works fine when the test is done from a remote system. In that case the first packet seen will be an incoming packet:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack --> nat/PREROUTING (iptables REDIRECT): to port 30000
--> routing decision --> ... --> local process receiving on port 30000

All following packets in the same flow will have conntrack handle the port change directly (or the port reversion for replies) and will skip any iptables rules in the nat table (as written in the schematic: the nat table is only consulted for NEW connections). So, skipping the reply packets, the next incoming packet will undergo this instead:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack: to port 30000
--> routing decision --> ... --> local process receiving on port 30000
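
This can be observed directly with conntrack-tools (an illustrative check, assuming the conntrack utility is installed; it is not part of the original answer): after the first packet, the flow entry already stores the translated port, and conntrack rewrites later packets from that entry alone:

# list tracked TCP flows whose original destination port is 8000; the
# reply side of the entry shows the REDIRECTed source port 30000
conntrack -L -p tcp --orig-port-dst 8000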

For a test from the system to itself, the first packet isn't an incoming packet but an outgoing one. This happens instead, using the outgoing lo interface:

local process client curl --> routing decision --> conntrack --> nat/OUTPUT (no rule here)
--> reroute check --> AF_PACKET (tcpdump) --> to port 8000

Now this packet is looped back on the lo interface. It reappears as a packet which is no longer the first packet of a connection, so it follows the second case above: conntrack alone takes care of the NAT and doesn't call nat/PREROUTING. Except that in the step before, conntrack wasn't instructed to do any NAT:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack
--> routing decision --> ... --> no local process receiving on port 8000

As there's nothing listening on port 8000, the OS sends back a TCP RST.
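
This can be verified with ss (a generic check, not from the original answer): only the NodePort has a listener.

# show TCP listening sockets on the two ports: 30000 appears, 8000 doesn't
ss -tln '( sport = :8000 or sport = :30000 )'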

For this to work on the local system, a REDIRECT rule must also be put in the nat/OUTPUT chain:

iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000
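
To confirm this rule is the one matching (a sketch, assuming the setup described in the question), compare its counters around a local test; since the nat table is only consulted for NEW connections, the counter grows by exactly one per connection attempt:

iptables -t nat -L OUTPUT -v -n                    # note the rule's pkts counter
curl -sk https://192.168.1.100:8000/ >/dev/null    # request may still fail; the SYN is what counts
iptables -t nat -L OUTPUT -v -n                    # pkts increased by 1 (the SYN)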

Additional notes

  • if the case is intended for remote use, don't test from the local system: the rules traversed by the test aren't the same, so the test doesn't reflect reality.

    Just use a network namespace to create a pocket remote system in case no other system is available. Here is an example that should work on a system having only OP's nat/PREROUTING rule, using curl http://192.168.1.100:8000/ (which doesn't require DNS):

    # create the namespace that will act as the remote system
    ip netns add remotesystem
    # veth pair: vethremote stays on the host, eth0 goes into the namespace
    ip link add name vethremote up type veth peer netns remotesystem name eth0
    ip address add 192.0.2.1/24 dev vethremote
    ip -n remotesystem address add 192.0.2.2/24 dev eth0
    ip -n remotesystem link set eth0 up
    # reach the host's LAN address through the veth link
    ip -n remotesystem route add 192.168.1.100 via 192.0.2.1
    ip netns exec remotesystem curl http://192.168.1.100:8000/
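
    When finished, deleting the namespace is enough to clean everything up (an added note: removing the namespace destroys eth0, and destroying one end of a veth pair removes its peer, vethremote, as well):

    ip netns del remotesystem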
    
  • tcpdump and NAT

    tcpdump happens at the AF_PACKET steps in the schematic above: very early for ingress and very late for egress. That means that in the remote-system case it will never capture port 30000, even when everything is working. In the local-system case, once the nat/OUTPUT rule is added, it will capture port 30000.

    Just don't blindly trust the address/port displayed by tcpdump when NAT is involved: what you see depends on the case and on where the capture happens.
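
    As an illustration (a sketch reusing the interface names from the namespace example above; not output from the original thread):

    # remote case: capture on the host end of the veth; AF_PACKET taps
    # ingress before nat/PREROUTING, so only port 8000 ever appears
    tcpdump -i vethremote -nn 'tcp port 8000 or tcp port 30000'

    # local case with the nat/OUTPUT rule added: AF_PACKET taps egress
    # after nat/OUTPUT, so the rewritten port 30000 appears on lo
    tcpdump -i lo -nn 'tcp port 8000 or tcp port 30000'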

tiger_groove:
Thank you so much for the detailed explanation. I am still having some issues and have put more information in my post. It seems to be getting further than before, but something is still blocking.
A.B:
I suspect that 1/ the local process isn't really a local process but a container/pod (since this is about Kubernetes), so it is also routed or further filtered (=> no connectivity with the new net namespace). I based my answer on the single iptables rule present in the question and on nothing that wasn't available; moreover, I successfully tried what I presented before writing the answer. 2/ As explained, a counter in the nat table increases only for the first packet (in state NEW): it will increase either in nat/OUTPUT or in nat/PREROUTING, not both. The filter table will see all packets. 3/ The initial question wasn't about HTTPS.
A.B:
I won't change this answer further. You'd have to create a new question, with ALL context given in advance, and preferably reproduce a problem that won't depend on this current Q/A (keep the setup with the OUTPUT rule, or with a remote system that is known to connect).
tiger_groove:
That is fine, I will create a new question. Really appreciate your help!
tiger_groove:
I have created a new question, please let me know if this makes sense to you https://serverfault.com/questions/1097511/iptables-redirect-to-kubernetes-nodeport-causes-request-to-hang
A.B:
Your new problem is about making a curl request to your API, not about a timeout caused by the method I suggested to replace a remote system, which didn't happen to work in your specific setup. iptables shouldn't be involved at all in the new question. I'm sorry I didn't explain correctly how the new question should have been framed.
tiger_groove:
It's weird, because when I perform the curl like this `ip netns exec remotesystem curl -vk https://192.168.1.100:30000/v1/flight` it works fine and I get a response back; only when I change it to `192.168.1.100:8000` does it hang. I'm not entirely sure why, but it seems like something doesn't like the REDIRECT iptables rule.
A.B:
I was talking about the case with SSL_ERROR_RX_RECORD_TOO_LONG which didn't hang.