My setup description:
Storage server for VM storage: Supermicro server, single-socket Xeon E5-2620 v4, RA-Link 10GbE network adapter based on the Intel 82599ES, PCIe 2.0 x8 card plugged into a PCIe 3.0 x8 slot, Debian 11.
Compute node for VM hosting: Supermicro server, dual-socket Xeon Silver 4216, same adapter, PCIe configuration, and OS as above.
BIOS is set to performance mode, as are the CPU governors in the OS; all vulnerability mitigations are off.
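For reference, the governor and mitigation state can be double-checked straight from sysfs:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor    # every core should report "performance"
grep . /sys/devices/system/cpu/vulnerabilities/*             # confirm mitigations really are disabled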
I omit the real storage configuration, because the situation is the same when sharing RAM-based disks between the servers. So the final test configuration is a 30 GB RAM disk on both servers, and testing is done with fio.
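To be concrete about what I mean by a 30 GB RAM disk, it's something along these lines (illustrative only, not my exact commands):
mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk    # RAM-backed filesystem large enough to hold the 30 GB fio test file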
Problem: locally on both servers I get millions of IOPS, since it's a RAM disk, but over the 10G network at 4k block size I'm stuck at 60-80k write and 80-100k read IOPS, both random. NFS or iSCSI, sync or async mode, it makes no difference. I've googled probably every current kernel tweak for 10G and played with ethtool (examples below); nothing gives a significant change. Plain Linux network stack or Open vSwitch, same results. The RX/TX packet rate is ~40k/50k pps during the 4k fio run. The driver is the default in-kernel ixgbe.
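This is the kind of thing I mean by "playing with ethtool" (eth0 stands in for the real interface name; just examples, not a recommended config):
ethtool -c eth0                                   # show current interrupt coalescing settings
ethtool -C eth0 rx-usecs 0                        # e.g. turn interrupt throttling off entirely
ethtool -g eth0                                   # show RX/TX ring sizes
ethtool -G eth0 rx 4096 tx 4096                   # max out the rings
ethtool -S eth0 | grep -iE 'drop|miss|no_buffer'  # look for drops or missed packets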
If I run the tests with 16/32/64k block sizes or make the reads/writes sequential, they fully saturate the single 10G link, about 1050-1100 MB/s vs. 320-380 MB/s at 4k block size. CPU load during testing is ~40% or less.
Fio: fio --randrepeat=1 --size=30G --name=fiotest --filename=testfio2 --numjobs=32 --stonewall --ioengine=libaio --direct=1 --bs=4k --iodepth=128 --rw=randread
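If per-request latency numbers would help, a queue-depth-1 variant of the same test would look roughly like this (the runtime is arbitrary):
fio --name=latcheck --filename=testfio2 --size=30G --ioengine=libaio --direct=1 --bs=4k --iodepth=1 --numjobs=1 --rw=randread --time_based --runtime=30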
Results are the same on the host systems and inside the VMs.
Current sysctl tweaks:
- net.core.rmem_max = 67108864
- net.core.wmem_max = 67108864
- net.core.netdev_max_backlog = 30000
- net.ipv4.tcp_rmem = 4096 87380 33554432
- net.ipv4.tcp_wmem = 4096 65536 33554432
- net.ipv4.tcp_congestion_control = htcp
- net.ipv4.tcp_mtu_probing = 1
- net.core.default_qdisc = fq
- net.ipv4.tcp_slow_start_after_idle = 0
- net.ipv4.tcp_sack = 0
- net.ipv4.tcp_low_latency = 1
- net.ipv4.tcp_timestamps = 0
- net.ipv4.tcp_no_metrics_save = 1
I'm not mentioning jumbo frames and other obvious stuff (and yes, 1500 and 9000 MTU give the same results).
Tested both through a 10G switch and directly between the servers. The storage setup itself can deliver about 300k IOPS on 4k random blocks (RAID10 over 4 SAS HDDs plus an LVM cache on a RAID10 of 4 SAS SSDs, once warmed up).
Right now I'm out of ideas. If I understand correctly, a 10G network should be able to carry roughly 250-300k IOPS worth of 4k blocks, but I'm only getting 60-100k random reads/writes.
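The back-of-the-envelope math behind that estimate (bandwidth only, ignoring protocol overhead and latency):
10 Gbit/s ≈ 1.25 GB/s raw, roughly 1.1-1.2 GB/s usable on the wire
1.2 GB/s ÷ 4 KiB per request ≈ 290k IOPS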
What am I missing?