Score:2

How to ensure throughput from a 10GbE network device on Ubuntu 20.04 under heavy load


I'm having trouble ensuring a required network throughput on a server connected to a Signal Hound spectrum analyzer via a 10GbE network interface. Basically, I can get good throughput when only the radio capture process is running, but when I run other processes, the throughput starts to drop. I'm using an Aquantia PCIe ethernet adapter with a QNAP SFP+ 10GbE Thunderbolt 3 adapter.

When I'm running a simple Python program to poll data from the spectrum analyzer API in streaming mode, it all works great at the maximum bandwidth (~800MB/s). When I do

$ stress --cpu 8 --io 8 --vm 8 --hdd 8

side by side, the throughput drops to about 600MB/s and I start dropping a lot of data.

Things I've tried:

  1. Updating drivers
  2. Messing with the coalescing parameters and many other ethtool options (MTU, etc.); a rough sketch of these commands is shown after this list
  3. Turning off hyperthreading and isolating the process to a single core (8 of 8) via CPU affinity pinning
    • This also involved isolating the networking interrupts onto their own core (7 of 8)
    • I also changed the CPU frequency governor to "performance" so those cores always run at maximum frequency
    • I also tried steering most of the other interrupts away from cores 7 and 8 to keep anything else from slowing them down, verified via a netdata dashboard
    • I basically tried everything in here
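For reference, the tuning in items 2 and 3 boils down to commands along these lines. This is just a minimal sketch: the interface name (enp6s0), the IRQ number (123), the coalescing values, and capture.py are placeholders for my actual setup.

# interrupt coalescing and MTU on the 10GbE link (values are examples only)
$ sudo ethtool -C enp6s0 adaptive-rx off rx-usecs 64
$ sudo ip link set enp6s0 mtu 9000

# keep the cores at maximum frequency
$ sudo cpupower frequency-set -g performance

# find the NIC's IRQ numbers and steer them to core 7 of 8 (CPU 6, 0-indexed)
$ grep enp6s0 /proc/interrupts
$ echo 6 | sudo tee /proc/irq/123/smp_affinity_list

# pin the capture process to core 8 of 8 (CPU 7, 0-indexed)
$ taskset -c 7 python3 capture.py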

Essentially, I know that it can run in real time, because it works fine when it's confined to 2 cores by itself. But for some reason, even though the other cores don't interfere with its CPU cycles or the network IRQs, when cores 1-6 are under heavy load they slow the main process down greatly.

If it helps, I find that the --vm 4 option for stress causes the most slowdown, so I suspect that it has something to do with memory allocation and perhaps the DRAM interface to the network card.

I'm basically pulling my hair out trying to get every packet from the radio on what should be a very powerful Ubuntu 20.04 machine. Does anyone have any experience with applications like this?

EDIT: I copied some of the performance curves here:

Here is the effect I'm seeing: [performance plot]

So here's the utilization. Core 6 is at 100% with softirqs both during the high stress period and the "just capturing" period. I've tried splitting the network data onto two cores (5 and 6), but one of them always stays loaded while the other one seems clear, even if they have similar amounts of interrupts. [CPU load plot]
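(For clarity, by "splitting the network data onto two cores" I mean steering the receive work there, e.g. via RPS, roughly like the sketch below; the interface/queue names and the CPU bitmask are just examples.)

# steer receive packet steering (RPS) for the first RX queue onto CPUs 5 and 6
# (0x60 is the hex bitmask for CPUs 5 and 6)
$ echo 60 | sudo tee /sys/class/net/enp6s0/queues/rx-0/rps_cpus

# watch how NET_RX softirqs are spread across the cores
$ watch -n 1 'grep -E "CPU|NET_RX" /proc/softirqs'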

The actual number of softirqs unfortunately drops on CPU 6 during the period when the stress test is running. [soft IRQ count plot]

Here is the effect I'm seeing on the CPU 6 softnet stats. [CPU 6 softnet plot]

Also, the interrupts seem to stay relatively the same, though they get a little less consistent during the high stress period. [interrupts plot]

Here's the straight network speed, and it looks a little inconsistent as well in both periods. [network throughput plot]

I was looking pretty closely for anomalies (though there are a lot of plots in netstat), and it looks like there is no interprocess (IPC) memory activity during the high stress period. Could this lead to issues? [IPC memory plot]

If anyone needs more plots, let me know. I can't deduce the issue from these, but I hope it's enough information to come up with potential solutions.

Thanks again!

Brendan Gregg is waiting for you. Start with his web page https://www.brendangregg.com/, start collecting system performance metrics, look for bottlenecks.
Thanks Alex for the suggestions! I've edited the original post with more performance curves so that hopefully someone smarter than me can help me figure out what's going on.
Score:0

Ok all, I think I've figured out an answer to my problem. I think the key graph here was the "softirq" graph. Under normal operation, I don't think it should be that high.

I had a little d'oh moment while profiling: basically, since I'm running CUDA and a bunch of other fiddly-to-install libraries, I was running all of this in a Docker container (I know what you're all saying!). Since I didn't touch the network setup for the radio in Docker, I didn't really think about it. And yep, you guessed it: the Docker networking added enough processing overhead to push me over the edge into dropping packets. I ended up setting network_mode to host to use the host networking, and it solved my problem. Hopefully this can be helpful to someone else!
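For anyone else in the same boat, the change is basically just running the container on the host network stack. A rough sketch (the image name, script, and GPU flag are placeholders for my setup):

# run the capture container directly on the host network stack
$ docker run --rm --gpus all --network host radio-capture python3 capture.py

In docker-compose this is the same thing as putting network_mode: host on the service, which is what I ended up doing.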

But that's not all: I spent a good bit of time profiling to figure out exactly why I was seeing this effect (thanks to @AlexD for the resources). Here's a flame graph of the pinned CPU 7 that was running the API drivers: [flame graph]
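(If you want to generate this kind of graph yourself, it's the usual perf plus Brendan Gregg's FlameGraph scripts workflow; the PID and sample length below are placeholders.)

# sample the capture process's call stacks, then render a flame graph
# (stackcollapse-perf.pl and flamegraph.pl are from https://github.com/brendangregg/FlameGraph)
$ sudo perf record -F 99 -g -p <capture_pid> -- sleep 30
$ sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > capture.svg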

As you can see from the flame graph, it spends a lot of time in page-fault memory allocation (which should have been another clue, though I didn't post it here: minor page faults were through the roof during capture). That explains why running stress with --vm 4 gave the worst results; it was causing contention for memory, which slowed the driver down significantly. Also, after testing it a little, I think the driver needs more than one core anyway (it was dropping packets when pinned to core 7 exclusively, but worked when pinned to 6 and 7), which also explains why I was getting better, though still not perfect, results after overclocking.
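(The fault counts themselves are easy to check if you want to confirm something similar on your own box; the PID below is a placeholder.)

# cumulative minor/major page fault counts for the capture process
$ ps -o min_flt,maj_flt -p <capture_pid>

# fault rate over a 10-second window
$ sudo perf stat -e minor-faults,major-faults -p <capture_pid> -- sleep 10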

So there you have it: an explanation for why it was all happening the way it was, with graphs to back it up. I now have about 60% utilization on two cores for the radio API, and it's pretty stable in getting all the packets (another core handles the softirqs at about 10%, down from the 95% you see in the graph above). I feel a little dumb for not thinking of Docker slowing me down, but much better having figured this all out. Hopefully this post helps someone else!
