I am experiencing what seems like uncharacteristically low performance from an NVMe SSD stripe in a server. The hardware is as follows:
- Motherboard: X9DR3-F
- CPU: Dual E5-2650v2
- RAM: 128GB DDR3-1333 UDIMM (16x8GB)
- NVMe drives: 4x MZVLB256HBHQ-000L7 via PCIe expander with bifurcated lanes
lspci -nvv shows an 8GT/s x4 link for the device, i.e. it is operating at PCIe 3.0 as the drive wants: LnkSta: Speed 8GT/s, Width x4. Benchmarks for this drive show it capable of around 1.4GB/s sequential writes.
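For completeness, the negotiated link for every drive can also be read straight from sysfs (a minimal sketch; the current_link_* attributes assume a reasonably recent kernel):
# Print negotiated PCIe speed and width for each NVMe controller
for ctrl in /sys/class/nvme/nvme?; do
    echo "$ctrl: $(cat "$ctrl/device/current_link_speed") x$(cat "$ctrl/device/current_link_width")"
done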
When I try sequential writes to the drive, I get about a third of that performance. The following run showed 619 MB/s during the writes themselves, then paused for another 45 seconds or so, presumably while the data was flushed to disk.
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=16M count=1k status=progress
16726884352 bytes (17 GB, 16 GiB) copied, 27 s, 619 MB/s
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 71.8953 s, 239 MB/s
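For reference, the run above goes through the page cache, which is where the end-of-run stall comes from; direct-I/O or fdatasync variants (standard GNU dd flags, same sizes as above) take the cache out of the reported number:
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=16M count=1k oflag=direct status=progress
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=16M count=1k conv=fdatasync status=progress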
Assuming this was just some quirk of my synthetic benchmark vs someone else's synthetic benchmark, I put all 4 devices into an MD RAID-0 and tried again:
$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --force --run /dev/nvme?n1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
$ sudo dd if=/dev/zero of=/dev/md0 bs=16M count=2k status=progress
34191966208 bytes (34 GB, 32 GiB) copied, 57 s, 600 MB/s
2048+0 records in
2048+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 79.7502 s, 431 MB/s
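The stripe geometry md picked can be checked as below; if I recall correctly the chunk size defaults to 512K with current mdadm, which determines how a single sequential stream gets split across members:
$ cat /proc/mdstat
$ sudo mdadm --detail /dev/md0 | grep -i chunk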
Better, but it still leaves much to be desired. If my public-school math is to be believed, the array is moving somewhere between 430×10 and 600×10 megabits per second (using ×10 as a lazy bytes-to-bits conversion with overhead), so a best case of roughly 6Gbit. In ideal conditions I would expect four drives doing a simple all-zeros stripe write to approach 6GByte/s, based on others' synthetic benchmarks of about 1.4GB/s per drive. Assuming this was just some limitation of the system, I tested the unrelated 40Gbit Ethernet card against a different host:
$ iperf -c 10.x.x.x -P 4
---------------------------------------------
Client connecting to 10.x.x.x, TCP port 5001
TCP window size: 325 KByte (default)
---------------------------------------------
[ 2] local 10.x.x.x port 53750 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/196)
[ 1] local 10.x.x.x port 53754 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/132)
[ 3] local 10.x.x.x port 53738 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/212)
[ 4] local 10.x.x.x port 53756 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/107)
[ ID] Interval Transfer Bandwidth
[ 2] 0.0000-10.0027 sec 12.4 GBytes 10.6 Gbits/sec
[ 1] 0.0000-10.0180 sec 12.7 GBytes 10.9 Gbits/sec
[ 3] 0.0000-10.0179 sec 10.6 GBytes 9.05 Gbits/sec
[ 4] 0.0000-10.0180 sec 10.5 GBytes 8.97 Gbits/sec
[SUM] 0.0000-10.0011 sec 46.1 GBytes 39.6 Gbits/sec
While this network card has nothing to do with SSD performance, it does show me that the system is capable of saturating at least a 40Gbit link over PCIe, especially since that card is only an x8 link rather than four x4 links. One thing that may be of note is that the Ethernet card is in CPU1_SLOT1 while the SSDs are in CPU2_SLOT4. I'm not sure that would account for the enormous difference in performance, though, since SLOT4 hangs directly off CPU2 and SLOT1 directly off CPU1, and there is a dual 8GT/s QPI link between the CPUs with no additional switches in the path.
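To rule the topology in or out, the NUMA node each controller reports can be read from sysfs, and the test can be pinned to that node; a rough sketch (the node number is a guess, and numactl is assumed to be installed):
# Which NUMA node each NVMe controller is attached to
for ctrl in /sys/class/nvme/nvme?; do
    echo "$ctrl: node $(cat "$ctrl/device/numa_node")"
done
# Re-run the stripe write pinned to the node local to the drives (assuming node 1)
sudo numactl --cpunodebind=1 --membind=1 \
    dd if=/dev/zero of=/dev/md0 bs=16M count=2k oflag=direct status=progress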
It's also worth noting that read performance is correspondingly low. There is no filesystem overhead here; this is raw flash and PCIe performance in effect. This is about the read performance of four consumer SATA HDDs in RAID-5 on lesser hardware, so just absolutely unacceptably slow:
$ sudo dd if=/dev/md0 of=/dev/null bs=16M count=8k
8192+0 records in
8192+0 records out
137438953472 bytes (137 GB, 128 GiB) copied, 214.738 s, 640 MB/s
Checking top during this read operation showed dd consuming 100% CPU, 97% of it in system wait. The other 31 threads were more or less idle. Where can I start diagnosing the performance issues here?
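For what it's worth, this is the sort of per-device view I can gather during a run (sysstat tools assumed installed); it should show whether all four members are actually busy or the time is going somewhere else:
# Per-device utilisation and request sizes, refreshed every second
$ iostat -xm 1 md0 nvme0n1 nvme1n1 nvme2n1 nvme3n1
# Breakdown of dd's CPU time (user vs system), assuming a single dd is running
$ pidstat -u -p $(pgrep -x dd) 1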
Assuming this was just an issue with dd, I tried again with fio. I kept the MD device, formatted it with XFS allowing it to choose default settings, mounted it, and ran the tests outlined at https://cloud.google.com/compute/docs/disks/benchmarking-pd-performance:
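The sequential-write job from that page is roughly of this shape (reconstructed from the linked doc rather than copied from my shell history, so treat the exact parameters as approximate; /mnt/md0 stands in for wherever the XFS filesystem is mounted):
# approximates the linked doc's sequential write test; not my exact invocation
$ sudo fio --name=write_throughput --directory=/mnt/md0 --numjobs=16 \
    --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio \
    --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1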
Sequential write
Run status group 0 (all jobs):
WRITE: bw=1348MiB/s (1414MB/s), 1348MiB/s-1348MiB/s (1414MB/s-1414MB/s), io=80.8GiB (86.7GB), run=61368-61368msec
Disk stats (read/write):
md0: ios=0/710145, merge=0/0, ticks=0/397607236, in_queue=397607236, util=99.82%, aggrios=0/177558, aggrmerge=0/2, aggrticks=0/99452549, aggrin_queue=99465067, aggrutil=99.62%
nvme0n1: ios=0/177568, merge=0/5, ticks=0/56627328, in_queue=56635784, util=97.96%
nvme3n1: ios=0/177536, merge=0/1, ticks=0/145315089, in_queue=145331709, util=99.62%
nvme2n1: ios=0/177559, merge=0/3, ticks=0/151148103, in_queue=151165889, util=99.44%
nvme1n1: ios=0/177569, merge=0/0, ticks=0/44719677, in_queue=44726889, util=97.87%
Random write
Run status group 0 (all jobs):
WRITE: bw=101MiB/s (106MB/s), 101MiB/s-101MiB/s (106MB/s-106MB/s), io=6074MiB (6370MB), run=60003-60003msec
Disk stats (read/write):
md0: ios=0/1604751, merge=0/0, ticks=0/623304, in_queue=623304, util=100.00%, aggrios=0/401191, aggrmerge=0/2, aggrticks=0/153667, aggrin_queue=153687, aggrutil=99.99%
nvme0n1: ios=0/402231, merge=0/3, ticks=0/156754, in_queue=156775, util=99.98%
nvme3n1: ios=0/401144, merge=0/2, ticks=0/149648, in_queue=149667, util=99.98%
nvme2n1: ios=0/400158, merge=0/0, ticks=0/150380, in_queue=150400, util=99.98%
nvme1n1: ios=0/401233, merge=0/4, ticks=0/157887, in_queue=157908, util=99.99%
Sequential read
Run status group 0 (all jobs):
READ: bw=6244MiB/s (6547MB/s), 6244MiB/s-6244MiB/s (6547MB/s-6547MB/s), io=367GiB (394GB), run=60234-60234msec
Disk stats (read/write):
md0: ios=3089473/14, merge=0/0, ticks=272954324/220, in_queue=272954544, util=99.98%, aggrios=779529/3, aggrmerge=6/1, aggrticks=68744470/104, aggrin_queue=68744621, aggrutil=99.60%
nvme0n1: ios=779520/6, merge=12/2, ticks=24023533/1, in_queue=24023534, util=98.84%
nvme3n1: ios=779519/2, merge=14/0, ticks=145571896/378, in_queue=145572449, util=99.60%
nvme2n1: ios=779536/3, merge=0/1, ticks=77038488/3, in_queue=77038492, util=98.90%
nvme1n1: ios=779544/3, merge=0/1, ticks=28343963/34, in_queue=28344012, util=98.81%
Random read
Run status group 0 (all jobs):
READ: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s), io=21.8GiB (23.4GB), run=60002-60002msec
Disk stats (read/write):
md0: ios=5902401/10, merge=0/0, ticks=2684388/0, in_queue=2684388, util=100.00%, aggrios=1475009/3, aggrmerge=608/0, aggrticks=685706/0, aggrin_queue=685706, aggrutil=99.90%
nvme0n1: ios=1475288/4, merge=632/1, ticks=697246/0, in_queue=697246, util=99.89%
nvme3n1: ios=1475328/2, merge=611/0, ticks=678849/1, in_queue=678850, util=99.89%
nvme2n1: ios=1474625/3, merge=588/1, ticks=673908/0, in_queue=673909, util=99.90%
nvme1n1: ios=1474795/3, merge=602/0, ticks=692822/1, in_queue=692822, util=99.90%
These results are much faster, showing there is an advantage to multiple threads beating on the array. But again, other benchmarks online show these drives doing 1GB/s writes individually (whereas I'm peaking at 1.4GB/s for all four combined), and I've seen UserBenchmark results putting reads at 2.2GB/s per drive, so 6GB/s of reads across the array is doing pretty well in context.
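One way to separate "more processes" from "more outstanding I/O" would be a single-job fio run with a deep queue against the raw array; a sketch, with untuned guesses for the parameters:
$ sudo fio --name=single_job_read --filename=/dev/md0 --rw=read --bs=1M \
    --iodepth=64 --ioengine=libaio --direct=1 --numjobs=1 \
    --time_based --runtime=30s --group_reporting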
Is there anything to be done to improve single-process performance then?