Score:2

Conntrack fails to NAT the router's own TCP packets coming from another VRF

I came across a tricky problem with source NAT when using multiple VRFs on a Debian-based router. It's a bit complex to explain, so I will try to be clear, but it will not be short, sorry for that. The problem should be easy to reproduce, though.

To isolate the "management" side of the router (ssh and other services) from its routing job (routing and NATing packets), I tried to keep the management side in the default VRF (which makes service sockets easier to deal with) and put the routing side in a VRF called "firewall".

The diagram can be summarized like this:

[Figure: network diagram]

The "management" network is 192.168.100.0/24. It is routed by an L3 switch, which has an L3 link to the "firewall" VRF of the router over the network 10.254.5.0/24. The third router interface is its "internet" one, and packets that go out through it are source NATed. This setup works quite nicely for everything in the mgmt subnet except the router's own packets, because of conntrack.

About iptables rules:

# Table filter

# chain INPUT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
(some INPUT rules, for ssh, snmp, etc)
-A INPUT -j DROP

# chain FORWARD
-A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -m conntrack --ctstate INVALID -j DROP
-A FORWARD -o eth2 -j ACCEPT
-A FORWARD -j DROP

# Table nat

# chain POSTROUTING
-A POSTROUTING -o eth2 -j SNAT --to-source 192.168.122.100

About the routing table:

# default VRF
default via 192.168.100.1 dev eth0 proto static metric 20 
192.168.100.0/24 dev eth0 proto kernel scope link src 192.168.100.90

# firewall VRF
default via 192.168.122.1 dev eth2 proto static metric 20
10.254.5.0/24 dev eth1 proto kernel scope link src 10.254.5.2
192.168.100.0/24 proto bgp metric 20 nexthop via 10.254.5.10 dev eth1 weight 1 
192.168.122.0/24 dev eth2 proto kernel scope link src 192.168.122.100 

So, when a packet from the default VRF tries to reach the internet, it goes out of eth0, is routed by the L3 switch, enters the firewall VRF through eth1, and is routed and NATed out of eth2. Since I track both the INPUT and FORWARD connections, conntrack gets confused when the reply comes back and does not know what to do with the packet.

I was able to fix this for ICMP and UDP by using conntrack zones in the raw table:

# Table raw
# chain PREROUTING
-A PREROUTING -i eth0 -j CT --zone 5
# chain OUTPUT
-A OUTPUT -o eth0 -j CT --zone 5

With these rules, packets that originate from the router and go out through eth0 are tagged with zone 5, and packets that enter through eth0 are also tagged with zone 5.

With a ping to 8.8.8.8, it looks like this (with the command conntrack -E):

    [NEW] icmp     1 30 src=192.168.100.90 dst=8.8.8.8 type=8 code=0 id=1999 [UNREPLIED] src=8.8.8.8 dst=192.168.100.90 type=0 code=0 id=1999 zone=5
    [NEW] icmp     1 30 src=192.168.100.90 dst=8.8.8.8 type=8 code=0 id=1999 [UNREPLIED] src=8.8.8.8 dst=192.168.122.100 type=0 code=0 id=1999
 [UPDATE] icmp     1 30 src=192.168.100.90 dst=8.8.8.8 type=8 code=0 id=1999 src=8.8.8.8 dst=192.168.122.100 type=0 code=0 id=1999
 [UPDATE] icmp     1 30 src=192.168.100.90 dst=8.8.8.8 type=8 code=0 id=1999 src=8.8.8.8 dst=192.168.100.90 type=0 code=0 id=1999 zone=5

We can see here that the first NEW connection is created when the packet goes out through eth0 with the zone=5 tag, then a second NEW one when it enters the firewall VRF through eth1 without the tag. When the answer comes back, the second connection is updated first (since it's the one facing the internet), and then the first.

This also works with UDP, for example with a DNS query to 8.8.8.8:
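To make the role of the zone tag more concrete, here is a toy model in Python. It is only a sketch and not how the kernel stores its entries: the `ToyConntrack` class and its `see` method are made up for illustration, and the assumption is simply that a tracked flow is looked up by its zone together with its original-direction tuple. Under that assumption, the same packet seen twice (once leaving eth0, once entering eth1) gets two separate entries only when the zones differ.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical lookup key: zone plus original-direction tuple."""
    zone: int
    proto: str
    src: str
    dst: str
    ident: int  # ICMP id, or source port for UDP/TCP


class ToyConntrack:
    """Toy stand-in for a connection table (not the kernel's structure)."""

    def __init__(self):
        self.entries = {}

    def see(self, zone, proto, src, dst, ident):
        """Record a packet and report whether it created a new entry."""
        key = FlowKey(zone, proto, src, dst, ident)
        if key in self.entries:
            return "EXISTING"
        self.entries[key] = "UNREPLIED"
        return "NEW"


# With zones: the ping is tagged zone 5 when it leaves through eth0,
# and is untagged (zone 0 here) when it comes back in through eth1.
zoned = ToyConntrack()
print(zoned.see(5, "icmp", "192.168.100.90", "8.8.8.8", 1999))  # NEW
print(zoned.see(0, "icmp", "192.168.100.90", "8.8.8.8", 1999))  # NEW
print(len(zoned.entries))  # 2

# Without zones: the second traversal hits the entry created by the first.
flat = ToyConntrack()
print(flat.see(0, "icmp", "192.168.100.90", "8.8.8.8", 1999))  # NEW
print(flat.see(0, "icmp", "192.168.100.90", "8.8.8.8", 1999))  # EXISTING
print(len(flat.entries))  # 1
```

This matches the two [NEW] lines in the conntrack -E output above, one with zone=5 and one without.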

    [NEW] udp      17 30 src=192.168.100.90 dst=8.8.8.8 sport=53369 dport=53 [UNREPLIED] src=8.8.8.8 dst=192.168.100.90 sport=53 dport=53369 zone=5
    [NEW] udp      17 30 src=192.168.100.90 dst=8.8.8.8 sport=53369 dport=53 [UNREPLIED] src=8.8.8.8 dst=192.168.122.100 sport=53 dport=53369
 [UPDATE] udp      17 30 src=192.168.100.90 dst=8.8.8.8 sport=53369 dport=53 src=8.8.8.8 dst=192.168.122.100 sport=53 dport=53369
 [UPDATE] udp      17 30 src=192.168.100.90 dst=8.8.8.8 sport=53369 dport=53 src=8.8.8.8 dst=192.168.100.90 sport=53 dport=53369 zone=5

But with TCP it doesn't work. A telnet connection to 172.16.10.10 port 80 looks like this:

    [NEW] tcp      6 120 SYN_SENT src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 [UNREPLIED] src=172.16.10.10 dst=192.168.100.90 sport=80 dport=60234 zone=5
    [NEW] tcp      6 120 SYN_SENT src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 [UNREPLIED] src=172.16.10.10 dst=192.168.122.100 sport=80 dport=60234
 [UPDATE] tcp      6 58 SYN_RECV src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 src=172.16.10.10 dst=192.168.122.100 sport=80 dport=60234
 [UPDATE] tcp      6 57 SYN_RECV src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 src=172.16.10.10 dst=192.168.122.100 sport=80 dport=60234
(The last line repeats multiple times.)

If I tcpdump eth2, the answer is there:
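For anyone trying to reproduce this, a small Python helper can make the event log easier to read. This is a sketch that assumes the line layout shown in the conntrack -E output above; the function names `parse_conntrack_event` and `count_events_in_state` are my own, and an entry without a zone= field is reported as zone 0. It counts how many events each flow produced in a given TCP state, which shows that only the untagged entry ever reaches SYN_RECV.

```python
import re
from collections import Counter

# Matches e.g. "[UPDATE] tcp      6 58 SYN_RECV src=... dst=..."
EVENT_RE = re.compile(
    r"^\s*\[(?P<event>[A-Z]+)\]\s+(?P<proto>\w+)\s+\d+"
    r"(?:\s+(?P<timeout>\d+))?\s+(?P<rest>.*)$"
)


def parse_conntrack_event(line):
    """Parse one event line into a dict, or return None if it does not match."""
    match = EVENT_RE.match(line)
    if match is None:
        return None
    tokens = match.group("rest").split()
    # TCP lines carry a state word (SYN_SENT, SYN_RECV, ...) before the tuples.
    state = tokens[0] if tokens and re.fullmatch(r"[A-Z_]+", tokens[0]) else None
    original, reply, zone = {}, {}, 0
    current = original
    for token in tokens:
        if "=" not in token:
            continue  # skips the state word and flags such as [UNREPLIED]
        key, value = token.split("=", 1)
        if key == "zone":
            zone = int(value)
            continue
        if key == "src" and "src" in original:
            current = reply  # a second src= starts the reply-direction tuple
        current[key] = value
    return {"event": match.group("event"), "proto": match.group("proto"),
            "state": state, "zone": zone, "original": original, "reply": reply}


def count_events_in_state(lines, state):
    """Count events per (zone, src, dst, sport, dport) seen in the given state."""
    counts = Counter()
    for line in lines:
        event = parse_conntrack_event(line)
        if event is None or event["state"] != state:
            continue
        orig = event["original"]
        counts[(event["zone"], orig.get("src"), orig.get("dst"),
                orig.get("sport"), orig.get("dport"))] += 1
    return counts


SAMPLE = [
    "    [NEW] tcp      6 120 SYN_SENT src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 [UNREPLIED] src=172.16.10.10 dst=192.168.100.90 sport=80 dport=60234 zone=5",
    "    [NEW] tcp      6 120 SYN_SENT src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 [UNREPLIED] src=172.16.10.10 dst=192.168.122.100 sport=80 dport=60234",
    " [UPDATE] tcp      6 58 SYN_RECV src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 src=172.16.10.10 dst=192.168.122.100 sport=80 dport=60234",
    " [UPDATE] tcp      6 57 SYN_RECV src=192.168.100.90 dst=172.16.10.10 sport=60234 dport=80 src=172.16.10.10 dst=192.168.122.100 sport=80 dport=60234",
]

print(count_events_in_state(SAMPLE, "SYN_RECV"))
```

On the four lines above, the only key in the result is the zone 0 flow, with a count of 2. The zone=5 entry never leaves SYN_SENT, which is consistent with the SYN-ACK not making it back to the default VRF.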

IP 192.168.122.100.60236 > 172.16.10.10.80: Flags [S], seq 4203590660, win 62720, options [mss 1460,sackOK,TS val 1511828881 ecr 0,nop,wscale 7], length 0
IP 172.16.10.10.80 > 192.168.122.100.60236: Flags [S.], seq 3672808466, ack 4203590661, win 65535, options [mss 1430,sackOK,TS val 2474659117 ecr 1511828881,nop,wscale 8], length 0
IP 192.168.122.100.60236 > 172.16.10.10.80: Flags [S], seq 4203590660, win 62720, options [mss 1460,sackOK,TS val 1511829887 ecr 0,nop,wscale 7], length 0
IP 172.16.10.10.80 > 192.168.122.100.60236: Flags [S.], seq 3672808466, ack 4203590661, win 65535, options [mss 1430,sackOK,TS val 2474660123 ecr 1511828881,nop,wscale 8], length 0

But since the SYN-ACK is never acknowledged, the router keeps retransmitting the SYN.

Now, if I tcpdump eth1:

IP 192.168.100.90.60238 > 172.16.10.10.80: Flags [S], seq 3124513394, win 62720, options [mss 1460,sackOK,TS val 1511928806 ecr 0,nop,wscale 7], length 0
IP 192.168.100.90.60238 > 172.16.10.10.80: Flags [S], seq 3124513394, win 62720, options [mss 1460,sackOK,TS val 1511929823 ecr 0,nop,wscale 7], length 0
IP 192.168.100.90.60238 > 172.16.10.10.80: Flags [S], seq 3124513394, win 62720, options [mss 1460,sackOK,TS val 1511931839 ecr 0,nop,wscale 7], length 0

We can see that the answer is never routed back to 192.168.100.90.
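The comparison between the two captures can also be done mechanically. The following Python sketch assumes the tcpdump line format shown above; `count_handshake_packets` is a hypothetical helper name, and the sample lines are shortened copies of the captures (options trimmed). It counts SYN and SYN-ACK packets per capture, so the missing replies on eth1 stand out.

```python
import re

# Matches "IP a.b.c.d.port > e.f.g.h.port: Flags [S.], ..."
PACKET_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]+)\]"
)


def count_handshake_packets(lines):
    """Count bare SYN ("S") and SYN-ACK ("S.") packets in tcpdump text output."""
    counts = {"syn": 0, "syn_ack": 0}
    for line in lines:
        match = PACKET_RE.search(line)
        if match is None:
            continue
        flags = match.group("flags")
        if flags == "S":
            counts["syn"] += 1
        elif flags == "S.":
            counts["syn_ack"] += 1
    return counts


# Shortened copies of the eth2 capture (after SNAT).
ETH2 = [
    "IP 192.168.122.100.60236 > 172.16.10.10.80: Flags [S], seq 4203590660, win 62720, length 0",
    "IP 172.16.10.10.80 > 192.168.122.100.60236: Flags [S.], seq 3672808466, ack 4203590661, win 65535, length 0",
    "IP 192.168.122.100.60236 > 172.16.10.10.80: Flags [S], seq 4203590660, win 62720, length 0",
    "IP 172.16.10.10.80 > 192.168.122.100.60236: Flags [S.], seq 3672808466, ack 4203590661, win 65535, length 0",
]

# Shortened copies of the eth1 capture (before SNAT).
ETH1 = [
    "IP 192.168.100.90.60238 > 172.16.10.10.80: Flags [S], seq 3124513394, win 62720, length 0",
    "IP 192.168.100.90.60238 > 172.16.10.10.80: Flags [S], seq 3124513394, win 62720, length 0",
    "IP 192.168.100.90.60238 > 172.16.10.10.80: Flags [S], seq 3124513394, win 62720, length 0",
]

print("eth2:", count_handshake_packets(ETH2))
print("eth1:", count_handshake_packets(ETH1))
```

eth2 shows as many SYN-ACKs as SYNs, while eth1 shows SYNs only, so the reply is lost somewhere between arriving on eth2 and being forwarded out of eth1.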

If I disable connection tracking and allow everything in iptables, it works. So I think conntrack has trouble managing TCP connections from the router itself to another zone when they are NATed? If something isn't clear, I will gladly answer any questions about this.

Score:1

The issue was present on Debian 10 with kernel 4.19.0-12-amd64, but after an upgrade to Debian 11 with kernel 5.10.0-11-amd64, it works as expected, even for TCP flows.
