Score:0

How to improve TCP tolerance to out-of-order delivery in Linux balance-rr bonds and/or FreeBSD roundrobin laggs?

cn flag

I have three servers, network-wise configured as follows:

  • A is a Dell R710 running Linux 5.13.19-1-pve (Proxmox VE 7.1) with 4 NICs teamed in a balance-rr bond.
  • B is a Dell R610 running Linux 5.13.19-1-pve (Proxmox VE 7.1) with 4 NICs teamed in a balance-rr bond.
  • C is a Dell R710 running FreeBSD 12.2-RELEASE-p1 with a lagg over 8 NICs in roundrobin mode (this is a TrueNAS distro).

All NICs are 1 Gbps.

When I run iperf3 between the Linux blades, I max out at about 3 Gbps, and the TCP window grows to an average of ~300 KiB. However, between the TrueNAS (FreeBSD) blade and the Linux blades, a single TCP stream maxes out at 1.20 Gbps and the window caps at ~60 KiB on average. If I run parallel streams (i.e., iperf3 ... -P 8; see the example invocations after the questions below) I can saturate the bond. On the other hand, as expected, the retransmit count is pretty high in both cases. So, my questions are:

  1. Why is FreeBSD not reaching the same throughput, if supposedly both systems approach the problem in the same way? (Maybe that is where I am wrong.)
  2. Is there a tuning option, or combination of options, to make the TCP stack more tolerant to out-of-order delivery without triggering immediate retransmits? (I am vaguely familiar with the three-duplicate-ACK fast retransmit, the basics of TCP congestion control, and so on.)
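
For reference, the tests were along these lines (a sketch; 10.0.0.3 stands in for the peer's address and is hypothetical):

iperf3 -s                        # on the receiving blade
iperf3 -c 10.0.0.3 -t 30         # single stream: ~1.2 Gbps against FreeBSD
iperf3 -c 10.0.0.3 -t 30 -P 8    # eight parallel streams: saturates the bond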

I will include here some tunables and options I have used during my testing.

  • All ifaces are set to use jumbo frames (MTU 9000).
  • The Linux boxes are tuned as follows:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_mem = 1638400 1638400 1638400
net.ipv4.tcp_rmem = 10240 87380 16777216
net.ipv4.tcp_wmem = 10240 87380 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_reordering = 127
net.ipv4.tcp_max_reordering = 1000
net.core.netdev_max_backlog = 10000
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_congestion_control = reno
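
For completeness, these can be applied at runtime, and the reordering the stack actually sees can be checked with stock tools; a sketch (the sysctl.d filename is hypothetical):

sysctl -p /etc/sysctl.d/90-bond-tcp.conf    # apply the persisted values
netstat -s | grep -iE 'reorder|retrans'     # cumulative reordering/reTX events
nstat -az | grep -iE 'Reorder|Retrans'      # same counters, machine-readable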
  • The FreeBSD (TrueNAS Core ~= FreeNAS) box is tuned as follows:
kern.ipc.maxsockbuf=614400000
kern.ipc.somaxconn=1024
net.route.netisr_maxqlen=8192
net.inet.ip.intr_queue_maxlen=8192
net.inet.tcp.mssdflt=8948
net.inet.tcp.reass.maxqueuelen=1000
net.inet.tcp.recvbuf_inc=65536
net.inet.tcp.sendbuf_inc=65536
net.inet.tcp.sendbuf_max=307200000
net.inet.tcp.recvbuf_max=307200000
net.inet.tcp.recvspace=65228
net.inet.tcp.sendspace=65228
net.inet.tcp.minmss=536
net.inet.tcp.abc_l_var=52
net.inet.tcp.initcwnd_segments=52 # start fast
net.inet.udp.recvspace=1048576
net.inet.udp.sendspace=1048576
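
On FreeBSD these persist in /etc/sysctl.conf (or as "sysctl"-type tunables in the TrueNAS UI), and the out-of-order pressure shows up in the TCP statistics; a sketch:

sysctl net.inet.tcp.reass.maxqueuelen                        # verify a value
netstat -s -p tcp | grep -iE 'out-of-order|retransmitted'    # OOO vs. reTX counts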
Effie avatar
ne flag
regarding retransmits: there is [RFC 4015](https://datatracker.ietf.org/doc/html/rfc4015), which basically raises the 3-ACK threshold to something larger when reordering is detected. The last time I worked with the Linux kernel (around v4.0.2) it was implemented. Also, `net.ipv4.tcp_reordering = 127` means the 3-ACK is actually a 127-ACK.
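
(As an aside, Linux exposes the per-connection reordering estimate, so this adaptive threshold can be watched live while iperf3 runs; a sketch, with 10.0.0.3 as a hypothetical peer address:)

ss -ti dst 10.0.0.3    # the 'reordering:' field appears once the estimate exceeds the default of 3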
Effie avatar
ne flag
my usual test strategy would be: check TCP segmentation offload (dunno how it interacts with jumbo frames), then check that flow control is not the limiting factor (i.e., that the buffers are large enough). On Linux I do that with tcp_probe and verify that the congestion window looks like Reno's sawtooth and not squares; dunno if there is a similar tool for FreeBSD. Then congestion control (your Linux is set to Reno, so it should at least be no worse than FreeBSD). Probably does not help, but just in case.
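
(Note that the tcp_probe module was removed around kernel 4.16; on a 5.13 kernel the equivalent is the tcp:tcp_probe tracepoint, or just sampling the socket state. A sketch, with 10.0.0.3 as a hypothetical peer:)

perf record -e tcp:tcp_probe -a -- sleep 10    # trace cwnd via the tracepoint
perf script | head
while sleep 0.2; do ss -ti dst 10.0.0.3 | grep -o 'cwnd:[0-9]*'; done    # or poll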
Score:1
us flag

You could try using jumbo frames if your network supports them. They don't remove the underlying problem of TCP out-of-order retransmissions, but since each Ethernet frame is six times bigger, the number of packets drops, which reduces the number of out-of-order events.
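
To make the arithmetic concrete: a 9000-byte MTU moves the same data in one frame where a 1500-byte MTU needs six, and end-to-end jumbo support can be confirmed with a don't-fragment ping; a sketch, assuming the bond is bond0 and a hypothetical peer address:

ip link set dev bond0 mtu 9000    # Linux propagates the MTU to the slaves
ping -M do -s 8972 10.0.0.3       # 8972 + 28 header bytes = 9000, DF bit set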

Otherwise, check your use case: do you really need a single TCP connection to get the whole throughput? If there are multiple active TCP connections between the devices, then you should use TCP-aware load balancing instead (see the sketch below).
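
On the Proxmox side, TCP-aware balancing means a hash-based bond mode rather than balance-rr; a sketch of /etc/network/interfaces with hypothetical NIC names:

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2 eno3 eno4
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
# layer3+4 hashes on IP and port, so each connection sticks to one NIC:
# no reordering, but also no more than ~1 Gbps for any single stream.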

dacabdi avatar
cn flag
Jumbo frames do help a lot; by far they are the most impactful tunable, and I see why. On the other hand, I am trying to maximize throughput for one party, so, in a way, yes, I do need a single TCP stream to max out. I will be moving a few huge files with very little concurrent work.
us flag
There is also Multipath TCP, which would be suited to your use case. However, it is still early days for it, so you might not find a stable implementation for your environment. https://en.wikipedia.org/wiki/Multipath_TCP has more information.
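
As it happens, kernel 5.13 already ships the in-kernel MPTCP implementation, so the Linux-to-Linux path could be tried today (FreeBSD 12.2 has no mainline MPTCP, so the TrueNAS box is out for now). A sketch with hypothetical addresses and interface names:

sysctl -w net.mptcp.enabled=1                      # turn MPTCP on
ip mptcp endpoint add 10.0.1.2 dev eth1 subflow    # advertise a second path
mptcpize run iperf3 -c 10.0.0.3                    # force a legacy app onto MPTCP
                                                   # (mptcpize ships with mptcpd)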
dacabdi avatar
cn flag
That is pretty cool, I was not aware of MPTCP. I'll do some reading; even if there is no widespread adoption yet, it is something I can experiment with on a lab bench.