Score:8

Unexpected poor performance with NVMe drives on an X9DR3-F


I am experiencing what seems like uncharacteristically low performance from an NVMe SSD stripe in a server. The hardware is as follows:

  • Motherboard: X9DR3-F
  • CPU: Dual E5-2650v2
  • RAM: 128GB DDR3-1333 UDIMM (16x8GB)
  • NVMe drives: 4x MZVLB256HBHQ-000L7 via PCIe expander with bifurcated lanes

lspci -nvv shows an 8GT/s x4 link for the device, meaning it is operating at PCIe 3.0 as the drive expects: LnkSta: Speed 8GT/s, Width x4. Benchmarks for this drive show it capable of roughly 1.4GB/s sequential writes.
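For completeness, here is a sketch of checking every drive's negotiated link rather than just the one above (the class filter assumes a pciutils new enough to filter by class code 0108, the NVMe class):

# Show negotiated link speed/width for every NVMe-class PCIe function.
$ for dev in $(lspci -d ::0108 | awk '{print $1}'); do echo "== $dev =="; sudo lspci -vv -s "$dev" | grep -E 'LnkCap|LnkSta'; done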

When I try sequential writes to the drive, I get about a third of that performance. The run below showed 619MB/s while writes were in flight, then stalled for the remaining ~45 seconds, presumably while the data was being fully flushed to disk.

$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=16M count=1k status=progress
16726884352 bytes (17 GB, 16 GiB) copied, 27 s, 619 MB/s
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 71.8953 s, 239 MB/s
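For comparison, a variant that bypasses the page cache entirely; this is only a sketch against the same device, but oflag=direct makes dd report the drive's own sustained rate rather than a cache-then-flush rate:

# Direct I/O: no page cache, so the progress figure reflects the device itself.
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=16M count=1k oflag=direct status=progress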

Assuming this was just some quirk of my synthetic benchmark vs someone else's synthetic benchmark, I put all 4 devices into an MD RAID-0 and tried again:

$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --force --run /dev/nvme?n1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
$ sudo dd if=/dev/zero of=/dev/md0 bs=16M count=2k status=progress
34191966208 bytes (34 GB, 32 GiB) copied, 57 s, 600 MB/s
2048+0 records in
2048+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 79.7502 s, 431 MB/s
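For reference, the array geometry can be double-checked like this (mdadm's default chunk size is 512K, so a 16M block should span full stripes):

$ cat /proc/mdstat
$ sudo mdadm --detail /dev/md0 | grep -i chunk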

Better, but it still leaves much to be desired. If my public-school math is to be believed, these drives are moving somewhere between 430x10 and 600x10 megabits per second, so a best case of about 6Gbit/s. Under ideal conditions I would expect 4 drives doing simple all-zero striped writes to hit around 6GByte/s, based on others' synthetic benchmarks. Assuming this was just some limitation of the system, I tested the unrelated 40Gbit/s Ethernet card against a different host:

$ iperf -c 10.x.x.x -P 4
---------------------------------------------
Client connecting to 10.x.x.x, TCP port 5001
TCP window size:  325 KByte (default)
---------------------------------------------
[  2] local 10.x.x.x port 53750 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/196)
[  1] local 10.x.x.x port 53754 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/132)
[  3] local 10.x.x.x port 53738 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/212)
[  4] local 10.x.x.x port 53756 connected with 10.x.x.x port 5001 (icwnd/mss/irtt=87/8948/107)
[ ID] Interval       Transfer     Bandwidth
[  2] 0.0000-10.0027 sec  12.4 GBytes  10.6 Gbits/sec
[  1] 0.0000-10.0180 sec  12.7 GBytes  10.9 Gbits/sec
[  3] 0.0000-10.0179 sec  10.6 GBytes  9.05 Gbits/sec
[  4] 0.0000-10.0180 sec  10.5 GBytes  8.97 Gbits/sec
[SUM] 0.0000-10.0011 sec  46.1 GBytes  39.6 Gbits/sec

While this network card has nothing to do with SSD performance, it does show me that the system is capable of saturating at least a 40Gbit link over PCIe, especially since that card is only an x8 link rather than 4x x4. One thing that may be of note is that the Ethernet card is in CPU1_SLOT1 while the SSDs are in CPU2_SLOT4. I'm not sure that would account for such an enormous difference in performance, though, since SLOT4 hangs directly off CPU2 and SLOT1 hangs directly off CPU1, with a dual 8GT/s QPI link between the CPUs and no additional switches (see the X9DR3-F block diagram).
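One way to test the NUMA angle would be to pin the benchmark to the socket that owns the SSD slot. A sketch, where the PCIe address is a placeholder and the node number is an assumption that should be read from sysfs first:

# Find which NUMA node a drive's PCIe function belongs to, then pin dd to it.
$ cat /sys/bus/pci/devices/0000:xx:00.0/numa_node    # xx:00.0 = drive address, placeholder
$ sudo numactl --cpunodebind=1 --membind=1 dd if=/dev/nvme1n1 of=/dev/null bs=16M count=1k iflag=direct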

It is also worth noting that read performance is correspondingly low. There is no filesystem overhead here; this is raw flash and PCIe performance. It is roughly the read performance of four consumer SATA HDDs in RAID-5 on lesser hardware, which is unacceptably slow:

$ sudo dd if=/dev/md0 of=/dev/null bs=16M count=8k
8192+0 records in
8192+0 records out
137438953472 bytes (137 GB, 128 GiB) copied, 214.738 s, 640 MB/s

Checking top during this read operation showed dd consuming 100% CPU, 97% of it in system wait. The other 31 threads were more or less idle. Where can I start diagnosing the performance issues experienced here?
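One check I can still fold into the diagnosis is drive health and temperature, since sustained writes can trigger thermal throttling (this assumes nvme-cli is installed):

$ sudo nvme smart-log /dev/nvme1n1 | grep -Ei 'temperature|warning'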


Assuming this was just an issue with dd, I tried again with fio. I kept the MD device, formatted it with XFS using the default settings, mounted it, and ran the tests outlined at https://cloud.google.com/compute/docs/disks/benchmarking-pd-performance :
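The sequential-write job was along these lines (parameters approximated from the linked guide rather than copied verbatim, and /mnt/md0 is an assumed mount point):

$ sudo fio --name=write_throughput --directory=/mnt/md0 --numjobs=16 \
    --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio \
    --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1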

Sequential write

Run status group 0 (all jobs):
  WRITE: bw=1348MiB/s (1414MB/s), 1348MiB/s-1348MiB/s (1414MB/s-1414MB/s), io=80.8GiB (86.7GB), run=61368-61368msec

Disk stats (read/write):
    md0: ios=0/710145, merge=0/0, ticks=0/397607236, in_queue=397607236, util=99.82%, aggrios=0/177558, aggrmerge=0/2, aggrticks=0/99452549, aggrin_queue=99465067, aggrutil=99.62%
  nvme0n1: ios=0/177568, merge=0/5, ticks=0/56627328, in_queue=56635784, util=97.96%
  nvme3n1: ios=0/177536, merge=0/1, ticks=0/145315089, in_queue=145331709, util=99.62%
  nvme2n1: ios=0/177559, merge=0/3, ticks=0/151148103, in_queue=151165889, util=99.44%
  nvme1n1: ios=0/177569, merge=0/0, ticks=0/44719677, in_queue=44726889, util=97.87%

Random write

Run status group 0 (all jobs):
  WRITE: bw=101MiB/s (106MB/s), 101MiB/s-101MiB/s (106MB/s-106MB/s), io=6074MiB (6370MB), run=60003-60003msec

Disk stats (read/write):
    md0: ios=0/1604751, merge=0/0, ticks=0/623304, in_queue=623304, util=100.00%, aggrios=0/401191, aggrmerge=0/2, aggrticks=0/153667, aggrin_queue=153687, aggrutil=99.99%
  nvme0n1: ios=0/402231, merge=0/3, ticks=0/156754, in_queue=156775, util=99.98%
  nvme3n1: ios=0/401144, merge=0/2, ticks=0/149648, in_queue=149667, util=99.98%
  nvme2n1: ios=0/400158, merge=0/0, ticks=0/150380, in_queue=150400, util=99.98%
  nvme1n1: ios=0/401233, merge=0/4, ticks=0/157887, in_queue=157908, util=99.99%

Sequential read

Run status group 0 (all jobs):
   READ: bw=6244MiB/s (6547MB/s), 6244MiB/s-6244MiB/s (6547MB/s-6547MB/s), io=367GiB (394GB), run=60234-60234msec

Disk stats (read/write):
    md0: ios=3089473/14, merge=0/0, ticks=272954324/220, in_queue=272954544, util=99.98%, aggrios=779529/3, aggrmerge=6/1, aggrticks=68744470/104, aggrin_queue=68744621, aggrutil=99.60%
  nvme0n1: ios=779520/6, merge=12/2, ticks=24023533/1, in_queue=24023534, util=98.84%
  nvme3n1: ios=779519/2, merge=14/0, ticks=145571896/378, in_queue=145572449, util=99.60%
  nvme2n1: ios=779536/3, merge=0/1, ticks=77038488/3, in_queue=77038492, util=98.90%
  nvme1n1: ios=779544/3, merge=0/1, ticks=28343963/34, in_queue=28344012, util=98.81%

Random read

Run status group 0 (all jobs):
   READ: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s), io=21.8GiB (23.4GB), run=60002-60002msec

Disk stats (read/write):
    md0: ios=5902401/10, merge=0/0, ticks=2684388/0, in_queue=2684388, util=100.00%, aggrios=1475009/3, aggrmerge=608/0, aggrticks=685706/0, aggrin_queue=685706, aggrutil=99.90%
  nvme0n1: ios=1475288/4, merge=632/1, ticks=697246/0, in_queue=697246, util=99.89%
  nvme3n1: ios=1475328/2, merge=611/0, ticks=678849/1, in_queue=678850, util=99.89%
  nvme2n1: ios=1474625/3, merge=588/1, ticks=673908/0, in_queue=673909, util=99.90%
  nvme1n1: ios=1474795/3, merge=602/0, ticks=692822/1, in_queue=692822, util=99.90%

These numbers are much faster, showing there is an advantage to multiple threads hammering the array. But again, other benchmarks online show these drives doing 1GB/s writes individually (whereas I'm peaking at 1.4GB/s for all 4 combined), and I've seen UserBenchmark results putting reads at 2.2GB/s per drive, so 6GB/s reads is doing pretty well in context.
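For what it's worth, a single-job but high-queue-depth fio run (hypothetical parameters) should show whether the limit is really "one process" or "a queue depth of one":

# One process, 32 async reads in flight via libaio.
$ sudo fio --name=single_job_read --filename=/dev/md0 --rw=read --bs=1M \
    --iodepth=32 --ioengine=libaio --direct=1 --numjobs=1 \
    --runtime=60s --time_based --group_reporting=1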

Is there anything to be done to improve single-process performance then?

Score:7

The Samsung MZVLB256HBHQ-000L7 are small SSDs (256 GB), so you are going to hit the internal NAND bandwidth bottleneck for any write spanning multiple GB. You can trim them (losing all data currently stored on the drives) to empty the internal pSLC cache, which gives you higher bandwidth for the first benchmark runs, but you are going to saturate it again quickly.
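For reference, a whole-device trim can be issued with blkdiscard from util-linux; this is only a sketch, and it is destructive:

# DESTRUCTIVE: discards every block on the drive, losing all data.
# Stop the array first, then discard each member.
$ sudo mdadm --stop /dev/md0
$ sudo blkdiscard /dev/nvme1n1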

Bryan Boettcher:
So the solution is: "new drives"?
shodanshok:
If you really need to max out performance, yes, new drives are required. If you can live with the current drives' performance, continue using them. And remember that trimming them means *losing all data* already stored on the drives themselves.
Olivier Dulac:
What does "trimming" exactly means in this context? It sounds like taking something off, cutting?
@OlivierDulac: Modern SSDs support a `TRIM` command. It tells the SSD which blocks are in use. In this specific case, the intent is to tell the SSD that _no_ block is in use. This logically restores the SSD to factory state (but not physically: SSDs have a finite number of write cycles, and `TRIM` does not reset that).
Olivier Dulac:
Oh, OK. Could it be "hiding" defective blocks (marking them as valid again)? Or does it only mark as valid and empty all blocks that are not known to be defective?
Bryan Boettcher:
@MarkSowul there is no RAID controller, just a card that routes lanes from a PCIe slot to the individual NVMe drives. Using this card requires BIOS support to split a single x16 slot into four x4 electrical links sharing the one x16 physical slot. The RAID is done entirely in the host OS.
Mark Sowul:
Ah, yes - I see that now, I'll just delete that comment
Score:1

My experience with Samsung MZVL* drives is abysmal. See https://superuser.com/questions/1721288/ssd-performance-falls-off-a-cliff

I'm trying to find reputable specs on your drive but my main guess is that the drives are missing DRAM.

Bryan Boettcher:
Alright, time to try enterprise drives in my fakey-enterprise situation I guess.
MonkeyZeus:
@BryanBoettcher That's 100% up to you. You are running an Ivy Bridge Xeon, after all... If write endurance and data loss aren't too big of a concern, then go with a good consumer-grade option like the Samsung 980 Evo Pro. Per the spec sheet, four 980 Evo Pros will exceed what a PCIe 3.0 x16 slot is capable of.