We have a Debian server with one link to the internal VLAN and one to the external VLAN; both connect directly to the same switch.
On both links, we're intermittently seeing an unusually high number of dropped receive packets (RX-DRP), along with high latency.
Kernel Interface table
Iface  MTU  Met  RX-OK       RX-ERR  RX-DRP    RX-OVR    TX-OK       TX-ERR  TX-DRP  TX-OVR  Flg
ethA   1500 0    884347583   0       49965509  49965509  1697514631  0       0       0       BMRU
ethB   1500 0    1611102819  0       77615811  77615811  819321274   0       0       0       BMRU
We're also seeing the ksoftirqd threads on the CPUs servicing both interfaces' interrupts pegged at 90+% most of the time, even when traffic should ostensibly be quiet.
44 root 20 0 0 0 0 R 98.8 0.0 2557:46 ksoftirqd/3
51 root 20 0 0 0 0 R 85.6 0.0 2722:33 ksoftirqd/5
As I understand it, this means the assigned CPU is saturated processing the packets arriving on that interface. But even at ~50 Mbps inbound (while identical servers handle >800 Mbps), these threads max out and RX-DRP skyrockets. irqbalance is running, and /proc/interrupts confirms that these CPUs aren't busy handling much of anything else.
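One thing worth checking before touching affinities: /proc/net/softnet_stat shows, per CPU, how often the NAPI poll loop ran out of budget (time_squeeze) or dropped packets because the backlog was full. A minimal sketch to decode the hex columns (column meanings assume a reasonably recent kernel):

```shell
#!/bin/sh
# Summarize per-CPU softirq RX stats from /proc/net/softnet_stat.
# Hex columns: 1 = packets processed, 2 = dropped (backlog full),
# 3 = time_squeeze (poll budget exhausted before the queue drained).
# A steadily climbing time_squeeze on the CPUs running ksoftirqd/3 and
# ksoftirqd/5 would point at the softirq budget, not raw CPU, as the limit.
cpu=0
while read -r processed dropped squeezed _rest; do
    printf 'cpu%-3s processed=%d dropped=%d squeezed=%d\n' \
        "$cpu" "0x$processed" "0x$dropped" "0x$squeezed"
    cpu=$((cpu + 1))
done < /proc/net/softnet_stat
```

If the squeeze counter is the one climbing, raising net.core.netdev_budget is a less invasive experiment than repinning IRQs.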
Are there any clear potential causes for this?
Would assigning multiple CPUs (via smp_affinity) to the interrupts for those interfaces potentially help? I'm unable to find a single example of someone assigning multiple CPUs to one network interface's IRQ, so it seems unconventional at best, system-breaking at worst.
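For what it's worth, smp_affinity is a CPU bitmask, so writing multiple CPUs for one IRQ is at least syntactically valid. The sketch below assumes IRQ 44 and a single-queue NIC, both of which would need to be verified against /proc/interrupts; with a single RX queue the hardware interrupt still fires on one CPU at a time, and Receive Packet Steering (RPS) is the usual way to fan the rest of the receive processing out in software:

```shell
# Hypothetical IRQ number and mask -- take the real IRQ(s) for ethA
# from /proc/interrupts. Mask 3c = binary 111100 = CPUs 2,3,4,5.
echo 3c > /proc/irq/44/smp_affinity

# With one RX queue the hardware interrupt still lands on a single CPU;
# RPS then spreads protocol processing across the masked CPUs in software:
echo 3c > /sys/class/net/ethA/queues/rx-0/rps_cpus

# Caveat: irqbalance periodically rewrites smp_affinity, so stop it (or
# ban these IRQs in its config) before pinning affinities by hand.
```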
This has been causing issues in production for a while, so I'd happily accept any potential workaround.