I'm trying to benchmark an all-NVMe ZFS disk array. I'm used to the opposite situation: extremely fast benchmark results with very little disk activity, thanks to efficient ZFS caching. This time I ran into the reverse: a massive amount of ZFS disk activity, while fio reports only a little bandwidth. How do I explain this behaviour?
zpool iostat -v 1
This shows only a single one-second interval, but the output is consistent from second to second: 16 to 20 GiB/s of write bandwidth.
                capacity     operations     bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
my_pool-10    1.92T  19.0T      2   155K  95.7K  16.9G
  mirror-0     329G  3.16T      0  26.2K  3.99K  2.84G
    nvme2n1       -      -      0  13.2K      0  1.42G
    nvme3n1       -      -      0  13.1K  3.99K  1.42G
  mirror-1     328G  3.16T      0  26.0K      0  2.83G
    nvme4n1       -      -      0  13.0K      0  1.41G
    nvme5n1       -      -      0  13.0K      0  1.41G
  mirror-2     329G  3.16T      0  25.8K      0  2.83G
    nvme6n1       -      -      0  12.9K      0  1.42G
    nvme7n1       -      -      0  12.8K      0  1.42G
  mirror-3     328G  3.16T      0  25.8K  3.99K  2.81G
    nvme8n1       -      -      0  12.9K  3.99K  1.40G
    nvme9n1       -      -      0  12.9K      0  1.40G
  mirror-4     328G  3.16T      0  26.0K  87.7K  2.82G
    nvme10n1      -      -      0  13.0K  87.7K  1.41G
    nvme11n1      -      -      0  13.0K      0  1.41G
  mirror-5     327G  3.16T      0  25.6K      0  2.82G
    nvme12n1      -      -      0  12.9K      0  1.41G
    nvme13n1      -      -      0  12.8K      0  1.41G
------------  -----  -----  -----  -----  -----  -----
As far as I understand it, the write stats in this table are doubled: the 16.9G at the top counts writes to all devices, and since these are two-way mirrored vdevs, only half of that is data while the other half is redundancy. On top of that there is still some overhead. So I would expect usable bandwidth to drop from 16.9 GiB/s to 8.45 GiB/s (mirroring), and then to roughly 6.75 GiB/s if I assume about 20% overhead.
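A back-of-the-envelope sketch of that expectation, in Python (the 20% overhead figure is my own rough guess, not a measured number):

# Expected usable write bandwidth, from the zpool iostat numbers above.
pool_write_gib_s = 16.9    # total write bandwidth summed over all disks
mirror_copies = 2          # two-way mirrors: every data block is written twice
overhead = 0.20            # assumed fraction lost to metadata/padding (a guess)

data_gib_s = pool_write_gib_s / mirror_copies    # 8.45 GiB/s of unique data
usable_gib_s = data_gib_s * (1 - overhead)       # ~6.76 GiB/s at the application
print(f"expected application bandwidth: ~{usable_gib_s:.2f} GiB/s")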
My fio benchmark, on the other hand, shows only about 240 MiB/s. Either I don't understand the output, or something is gobbling up my bandwidth.
I run 25 parallel jobs of 4k random writes, with an fsync after every write (simulating an annoying network workload). The command is:
sync; fio --name=asyncio --ioengine=posixaio --rw=randwrite --bs=4k --fsync=1 --size=100g --numjobs=25 --iodepth=64 --runtime=60 --time_based --filename=/my_pool-10/unbuffwrite2 --group_reporting
asyncio: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
...
fio-3.28
Starting 25 processes
asyncio: Laying out IO file (1 file / 102400MiB)
Jobs: 25 (f=25): [w(25)][100.0%][w=240MiB/s][w=61.4k IOPS][eta 00m:00s]
asyncio: (groupid=0, jobs=25): err= 0: pid=33672: Sat Aug 26 20:20:18 2023
write: IOPS=61.2k, BW=239MiB/s (251MB/s)(14.0GiB/60015msec); 0 zone resets
slat (nsec): min=90, max=5052.1k, avg=810.57, stdev=5386.98
clat (usec): min=52, max=47390, avg=9496.96, stdev=6324.85
lat (usec): min=53, max=47392, avg=9497.77, stdev=6324.94
clat percentiles (usec):
| 1.00th=[ 2442], 5.00th=[ 2835], 10.00th=[ 3195], 20.00th=[ 3785],
| 30.00th=[ 4555], 40.00th=[ 5866], 50.00th=[ 7635], 60.00th=[ 9765],
| 70.00th=[12256], 80.00th=[15139], 90.00th=[18744], 95.00th=[21890],
| 99.00th=[27395], 99.50th=[29754], 99.90th=[34341], 99.95th=[36439],
| 99.99th=[40633]
bw ( KiB/s): min=211040, max=272704, per=100.00%, avg=244851.53, stdev=462.61, samples=2975
iops : min=52760, max=68176, avg=61212.84, stdev=115.66, samples=2975
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.06%, 4=23.36%, 10=37.74%, 20=31.19%, 50=7.64%
fsync/fdatasync/sync_file_range:
sync (usec): min=267, max=53554, avg=16564.62, stdev=7006.70
sync percentiles (usec):
| 1.00th=[ 3589], 5.00th=[ 5145], 10.00th=[ 6849], 20.00th=[ 9896],
| 30.00th=[12518], 40.00th=[14877], 50.00th=[16909], 60.00th=[18482],
| 70.00th=[20317], 80.00th=[22414], 90.00th=[25297], 95.00th=[27919],
| 99.00th=[33817], 99.50th=[35914], 99.90th=[40109], 99.95th=[42206],
| 99.99th=[44827]
cpu : usr=0.99%, sys=0.26%, ctx=916290, majf=0, minf=702924
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.9%, 16=11.8%, 32=161.4%, >=64=25.1%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=95.3%, 8=1.4%, 16=1.6%, 32=1.1%, 64=0.6%, >=64=0.0%
issued rwts: total=0,3672021,0,3683331 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=239MiB/s (251MB/s), 239MiB/s-239MiB/s (251MB/s-251MB/s), io=14.0GiB (15.0GB), run=60015-60015msec
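Putting the two measurements side by side makes the gap concrete (a quick sanity-check sketch; all numbers are copied from the outputs above):

# Write amplification implied by zpool iostat vs. fio.
disk_mib_s = 16.9 * 1024    # zpool iostat: ~16.9 GiB/s hitting the disks
app_mib_s = 239             # fio: 239 MiB/s of application writes
print(f"disk bytes per application byte: ~{disk_mib_s / app_mib_s:.0f}x")  # ~72x
# Even after halving for the mirror copies, that is still ~36x amplification.

# The fio report is at least self-consistent:
writes = 3_672_021                 # 'issued rwts' write count
total_gib = writes * 4096 / 2**30  # 4 KiB per write
print(f"total written by fio: {total_gib:.1f} GiB")  # ~14.0 GiB, matching io=14.0GiB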
Any insights on where my bandwidth went?