TL;DR
ip route add 192.168.117.64/26 dev onboard-10Gb-1 table 1000
ip route add default via 192.168.117.65 dev onboard-10Gb-1 table 1000
ip rule add from 192.168.117.70 lookup 1000
For more details, and for proper handling of UDP services, read below.
Note: on Linux, route and ifconfig are obsolete commands. They are not suitable for advanced routing such as policy routing. One should systematically use their iproute2 replacements instead: ip route, ip link and ip address (and all other related commands from the iproute2 suite).
Policy routing
Being multi-homed, the server will by default reply directly over its attached LAN 192.168.200.0/24 (the examples below use a hypothetical system at 192.168.200.101) when queried from there, rather than sending the reply back along the path the query came from. This is an asymmetric flow, which can fail for various reasons, among them:
- the server itself, when configured with rp_filter=1, will drop such asymmetric traffic following Strict Reverse Path Forwarding rules.
- a firewall in the path that doesn't see the replies might be configured to drop the traffic (e.g. when tracking the TCP window).
- if NAT happens somewhere, the direct reply isn't un-NAT-ed and is dropped by the client.
- the server, when hosting a UDP service that is not multi-homed aware, will choose the wrong reply address, making the client drop the reply.
Even where the first three cases don't apply and TCP would probably work, UDP remains harder because of the last case: policy routing alone is often not enough for UDP (see the Caveat below), but it is still required.
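To check whether the first failure mode in the list above applies on a given server, one can read the rp_filter settings from /proc. The following is a minimal Python sketch, not part of the setup itself. The function names are made up for illustration, and it relies on the kernel's documented behavior of applying the higher of the conf/all and per-interface values (0 = off, 1 = strict, 2 = loose).

```python
from pathlib import Path

# Meaning of the rp_filter values, per the kernel's ip-sysctl documentation.
RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}
CONF_DIR = Path("/proc/sys/net/ipv4/conf")


def effective_rp_filter(all_value, interface_value):
    """The kernel uses the higher of conf/all and conf/<interface>."""
    return max(all_value, interface_value)


def read_rp_filter(name, conf_dir=CONF_DIR):
    """Read the raw rp_filter value for 'all' or for one interface."""
    return int((conf_dir / name / "rp_filter").read_text())


def report(conf_dir=CONF_DIR):
    """Map each interface name to its effective rp_filter mode."""
    all_value = read_rp_filter("all", conf_dir)
    return {
        entry.name: RP_FILTER_MODES[
            effective_rp_filter(all_value, read_rp_filter(entry.name, conf_dir))
        ]
        for entry in sorted(conf_dir.iterdir())
        if entry.name not in ("all", "default")
    }
```

Calling report() on the server gives one mode per interface. An interface reported as strict is one where asymmetric queries like the ones described above would be dropped.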
Avoiding the asymmetry requires policy routing, so that each of the server's addresses is considered separately when making the routing decision for a reply.
On Linux this is implemented with routing rules: selectors that direct a lookup to an alternate routing table, which knows only the paths needed for the selected goal (a partial copy of all the possible routes). The selector is usually a criterion based on something other than the destination (which standard route entries already handle). Most of the time it's the source address, but it depends on the goal.
Here the goal is to have a routing table that knows nothing specific about 192.168.200.0/24, so that a reply made from 192.168.117.70 is routed using the default route over onboard-10Gb-1 instead of the LAN route on bond0.
Duplicate only the needed routes in routing table 1000 (value 1000 chosen arbitrarily):
ip route add 192.168.117.64/26 dev onboard-10Gb-1 table 1000
ip route add default via 192.168.117.65 dev onboard-10Gb-1 table 1000
When the source address is 192.168.117.70, i.e. the server's own address on that network, look up the alternate routing table 1000 (before looking up the main routing table: if the lookup succeeds in giving a route, the main table won't be used):
ip rule add from 192.168.117.70 lookup 1000
An equivalent table and rule for the other LAN could be added, but it's already handled by the main routing table. Incoming traffic is already handled first by the local routing table, so there's nothing more to do with this setup:
# ip rule
0: from all lookup local
32765: from 192.168.117.70 lookup 1000
32766: from all lookup main
32767: from all lookup default
Then, on the server, depending on the way services are used:
TCP service binding or not binding to a specific address always works
Any accept(2)-ed socket has its source (local) address set to the destination address the query used, so emitted packets will match the routing rule selector when needed: nothing more to do for this case.
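This behavior is easy to observe. The sketch below (Python, with hypothetical helper names) listens on 0.0.0.0 and shows that the accepted socket nevertheless reports the specific local address the client connected to. That is the address the routing rule will see as the source of the replies.

```python
import socket


def open_wildcard_listener(port=0):
    """Listen for TCP connections on every local address (INADDR_ANY)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", port))
    listener.listen()
    return listener


def accepted_local_address(listener):
    """Accept one connection and return the local address the kernel gave it.

    Even though the listener is bound to 0.0.0.0, the accepted socket is
    tied to the destination address the client actually used.
    """
    connection, _peer = listener.accept()
    with connection:
        return connection.getsockname()[0]
```

On the server of this answer, a client connecting to 192.168.117.70 would make accepted_local_address() return 192.168.117.70, so the rule "from 192.168.117.70 lookup 1000" applies to the replies.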
TCP client or UDP client can bind the source address when doing a query/connection, to change the path:
TCP and UDP examples:
ssh -b 192.168.117.70 192.168.200.101
traceroute -n -s 192.168.117.70 192.168.200.101
The intended alternate routing table 1000 will be selected according to the source address chosen:
# ip route get from 192.168.117.70 to 192.168.200.101
192.168.200.101 from 192.168.117.70 via 192.168.117.65 dev onboard-10Gb-1 table 1000 uid 0
cache
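Applications can do the same thing as ssh -b and traceroute -s through the socket API: bind the source address before connecting or sending. A minimal Python sketch, with helper names chosen only for illustration:

```python
import socket


def connect_from(source_address, destination, timeout=5.0):
    """TCP: bind the local address (any free port) before connecting."""
    return socket.create_connection(
        destination, timeout=timeout, source_address=(source_address, 0)
    )


def udp_socket_from(source_address):
    """UDP: bind the local address (port 0 means any free port) before sending."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((source_address, 0))
    return sock
```

For instance, connect_from("192.168.117.70", ("192.168.200.101", 22)) would take the path through table 1000, assuming the rule above is installed.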
Caveat: UDP service
Because of the way UDP and the BSD socket API work, a UDP socket that isn't bound to a specific address (i.e. that uses 0.0.0.0, aka INADDR_ANY) doesn't by default have the full context of a query when it's used to reply to a received UDP message. In particular, contrary to TCP, it doesn't know the server's local address the query was sent to: it will just use the socket's 0.0.0.0 address.
So when replying to a query from 192.168.200.101 to 192.168.117.70, the socket presents a source of 0.0.0.0 to the routing stack, deferring the selection of the actual source address to it. This doesn't match the routing rule selector for 192.168.117.70, so the reply uses the main routing table, which chooses the wrong source address: 192.168.200.45. When the client receives such a reply (directly, over the shared LAN), it won't recognize it as a reply to its query, since it comes from another address, and will ignore it.
There are two ways to have the UDP server application handle this correctly:
bind(2) to a specific address.
Any reply using this socket will use the address it was bound to, thus selecting the routing rule and the intended routing behavior. If the server has to provide service on all of its addresses, it should bind the same way multiple times: once per address.
There are settings in most daemons to do just that. For example, ISC's DNS server BIND 9 by default binds separately to each address belonging to the server and follows dynamic changes. When a query arrives on such a socket, the address of the bound socket is the address of the query's destination. It will be reused as the source address, thus selecting the correct routing rule.
else use the socket option IP_PKTINFO
This enables the reception of ancillary data by the application, allowing it to know on what address and on what interface the packet was received, which gives all the information needed for a correct reply. This requires specific application support, including the use of additional functions such as cmsg(3).
For example, that's the mode of operation of NLnet Labs' DNS server Unbound when using the server option interface-automatic: yes.
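Both approaches can be sketched in a few lines of Python. This is only an illustration with hypothetical function names, not how BIND 9 or Unbound are implemented. The fallback value 8 for IP_PKTINFO and the in_pktinfo layout are the Linux ones, and a real server would multiplex its sockets (e.g. with select or epoll) instead of handling one datagram at a time.

```python
import socket
import struct

# --- Option 1: one socket per local address -------------------------------


def bind_each_address(addresses, port):
    """Return one UDP socket per address, each bound to (address, port)."""
    sockets = []
    for address in addresses:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind((address, port))
        sockets.append(sock)
    return sockets


def echo_once(sock):
    """Echo one datagram; the reply's source is the address sock is bound to."""
    payload, peer = sock.recvfrom(2048)
    sock.sendto(payload, peer)
    return sock.getsockname()[0]


# --- Option 2: a single wildcard socket with IP_PKTINFO -------------------

# Linux value from <linux/in.h>; older Python versions lack socket.IP_PKTINFO.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)
# struct in_pktinfo: int ipi_ifindex, in_addr ipi_spec_dst, in_addr ipi_addr
PKTINFO = struct.Struct("@i4s4s")


def open_pktinfo_socket(port):
    """Bind to 0.0.0.0 and ask the kernel for per-packet destination info."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_PKTINFO, 1)
    sock.bind(("0.0.0.0", port))
    return sock


def echo_once_pktinfo(sock):
    """Echo one datagram from the address it was sent to; return that address."""
    payload, ancillary, _flags, peer = sock.recvmsg(
        2048, socket.CMSG_SPACE(PKTINFO.size)
    )
    for level, kind, data in ancillary:
        if level == socket.IPPROTO_IP and kind == IP_PKTINFO:
            # ipi_addr is the destination found in the query's IP header.
            _ifindex, _spec_dst, destination = PKTINFO.unpack(data[:PKTINFO.size])
            # On send, ipi_spec_dst selects the source address; ifindex 0
            # leaves the choice of interface to the routing lookup.
            reply_info = PKTINFO.pack(0, destination, bytes(4))
            sock.sendmsg(
                [payload], [(socket.IPPROTO_IP, IP_PKTINFO, reply_info)], 0, peer
            )
            return socket.inet_ntoa(destination)
    raise RuntimeError("no IP_PKTINFO ancillary data received")
```

With the addresses of this answer, the first option would be bind_each_address(["192.168.117.70", "192.168.200.45"], 5555). In both cases the reply carries an explicit source address, so the rule "from 192.168.117.70 lookup 1000" can match.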
If the UDP server application can't be changed at all, there are only bad choices left.
Using Netfilter's conntrack and iptables won't work: one could change the destination in the output hook, but it's the source that has to be changed. One could change the source in the postrouting hook but, as the name implies, that happens after routing: too late for the alternate route to be chosen. Even if it were allowed earlier, this would mean NAT-ing a reply within an already existing flow: Netfilter wouldn't cope correctly with it and would change the source port used for the reply to avoid a supposed clash, instead of reusing the same flow.
In such a case one can use additional selectors that choose an alternate source depending on the service and the destination (here, the destination of the reply), to force the other choice: a source of 192.168.117.70 instead of 192.168.200.45 (and now a direct query from 192.168.200.101 to 192.168.200.45 would fail instead, for similar reasons).
For example, if the server were to host a simple UDP service on port 5555 that can't be configured to bind to 192.168.117.70 or to use IP_PKTINFO, and that should never be used directly LAN-to-LAN on 192.168.200.0/24, one can nudge the correct route selection with (this requires kernel >= 4.17):
ip rule add from 0.0.0.0/32 ipproto udp sport 5555 to 192.168.200.0/24 lookup 1000
Here 0.0.0.0/32 is used in its INADDR_ANY role. The routing stack will still replace it with an adequate source address in the end, but this time chosen using routing table 1000.
Before:
# ip route get ipproto udp sport 5555 to 192.168.200.101
192.168.200.101 dev bond0 src 192.168.200.45 uid 0
cache
After:
# ip route get ipproto udp sport 5555 to 192.168.200.101
192.168.200.101 via 192.168.117.65 dev onboard-10Gb-1 table 1000 src 192.168.117.70 uid 0
cache
Other cases are not affected (e.g. source port 5556):
# ip route get ipproto udp sport 5556 to 192.168.200.101
192.168.200.101 dev bond0 src 192.168.200.45 uid 0
cache
nftables
Actually, an inferior solution also exists as an alternative to the second ip rule added just above: using nftables to do static, stateless NAT, which doesn't depend on Netfilter's conntrack. Here's the ruleset to load with nft -f ...:
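# Create the table if needed, then delete it, so that reloading is idempotent
table t_statelessnat
delete table t_statelessnat

table ip t_statelessnat {
    chain c_snat {
        # A route-type chain triggers a new route lookup when the packet is altered
        type route hook output priority raw; policy accept;
        # Rewrite the source of replies sent from UDP port 5555 to the LAN
        ip daddr 192.168.200.0/24 ip saddr != 192.168.117.70 udp sport 5555 ip saddr set 192.168.117.70
    }
}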
The type route hook will take care of rerouting the packet which will now traverse routing table 1000.
Prefer the routing rule unless a complex filter that can't be expressed with ip rule is required.