Score:1

Tuning ZFS for bursty sequential writes


This is a follow-up to: High speed network writes with large capacity storage. The setup has changed notably.

I have a pool with a single raid-z2 vdev of 6 drives, all Exos X18 CMR drives. Using fio and manual tests I know that the array can sustain around 800 MB/s of sequential writes on average, which is fine and in line with the expected performance of this array. The machine is a Ryzen 5 PRO 2400GE (4C/8T, 3.8 GHz boost) with 32G of ECC RAM, an NVMe boot/system drive and 2x 10 Gbps Ethernet ports (Intel X550-T2). I'm running an up-to-date Arch system with zfs 2.1.2-1.
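For reference, a sequential-write fio test of the kind used for that figure might look like this (a sketch only; the target path, file size and flags are illustrative, not the exact command used):

# fio --name=seqwrite --directory=/tank/videos --rw=write --bs=1M --size=30G --numjobs=1 --ioengine=psync --end_fsync=1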

My use case is a video archive of mostly large (~30G), write-once, read-once, compressed video. I've disabled atime, set recordsize=1M, and set compression=off and dedup=off, since the data is effectively incompressible (testing showed worse performance with compression=lz4 than with it off, despite what the internet said) and there is no duplicate data by design. This pool is shared over the network via Samba. I've tuned my network and Samba to the point where transferring from NVMe NTFS on a Windows machine to NVMe ext4 reaches 1 GB/s, i.e. reasonably close to saturating the 10 Gbps link with 9K jumbo frames.
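For completeness, those dataset settings amount to something like this (tank/videos is a placeholder dataset name):

# zfs set atime=off tank/videos
# zfs set recordsize=1M tank/videos
# zfs set compression=off tank/videos
# zfs set dedup=off tank/videos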

Here's where I run into problems. I want to be able to transfer one whole 30G video archive at 1 GB/s to the raid-z2 array, which can only sustain 800 MB/s of sequential writes. My plan is to use the RAM-based dirty data to absorb the spillover and let it flush to disk after the transfer is "completed" on the client side. I figured that all I would need is (1024-800) MB/s * 30 s ≈ 7G of dirty data in RAM, which can get flushed out to disk over ~10 seconds after the transfer completes. I understand the data-integrity implications of this and the risk is acceptable, as I can always transfer the file again later for up to a month in case a power loss causes the file to be lost or incomplete.
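Spelled out, the back-of-the-envelope numbers look like this (assuming the client pushes at ~1 GB/s for the whole file):

        transfer time:         30 GB / 1.0 GB/s       ≈ 30 s
        spillover rate:        1.0 GB/s - 0.8 GB/s    = 0.2 GB/s
        peak dirty data:       0.2 GB/s * 30 s        ≈ 6-7 GB
        drain after transfer:  ~7 GB / 0.8 GB/s       ≈ 9-10 s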

However, I cannot get ZFS to behave the way I expect... I've edited my /etc/modprobe.d/zfs.conf file like so:

options zfs zfs_dirty_data_max_max=25769803776
options zfs zfs_dirty_data_max_max_percent=50
options zfs zfs_dirty_data_max=25769803776
options zfs zfs_dirty_data_max_percent=50
options zfs zfs_delay_min_dirty_percent=80

I have run the appropriate mkinitcpio -P command to refresh my initramfs and confirmed that the settings were applied after a reboot:

# arc_summary | grep dirty_data
        zfs_dirty_data_max                                   25769803776
        zfs_dirty_data_max_max                               25769803776
        zfs_dirty_data_max_max_percent                                50
        zfs_dirty_data_max_percent                                    50
        zfs_dirty_data_sync_percent                                   20

I.e. I set the max dirty data to 24G, which is way more than the 7G that I need, and hold off on delaying writes until 80% of that is used. As far as I understand, the pool should be able to absorb ~19G into RAM before it starts to push back on writes from the client (Samba) with added latency.
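For anyone reproducing this: the same values can also be read, and most of them changed, at runtime through the module parameters under sysfs, without touching the initramfs (as far as I understand, zfs_dirty_data_max_max is only consulted at module load, so that one does need a reboot):

# cat /sys/module/zfs/parameters/zfs_dirty_data_max
25769803776
# echo 25769803776 > /sys/module/zfs/parameters/zfs_dirty_data_max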

However, what I observe when writing from the Windows client is that after around 16 seconds at ~1 GB/s the write speed falls off a cliff (iostat still shows the disks working hard to flush data), which I can only assume is the pushback mechanism of the ZFS write throttle. But this makes no sense: even if nothing at all had been flushed during those 16 seconds, the throttle should only have kicked in about 3 seconds later (19G at 1 GB/s). In addition it falls off once again at the end, see picture: [enter image description here](https://i.stack.imgur.com/Yd9WH.png)
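For what it's worth, the amount of dirty data carried by each transaction group, and how long each one spent syncing, can be watched live through the per-pool txgs kstat while the transfer runs (pool name is a placeholder):

# watch -n1 'tail -n 5 /proc/spl/kstat/zfs/tank/txgs'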

I've tried adjusting zfs_dirty_data_sync_percent so that syncing starts earlier (since the dirty data buffer is so much larger than the default), and I've also tried adjusting the active I/O scaling with zfs_vdev_async_write_active_{min,max}_dirty_percent so that it kicks in earlier as well, to get the writes up to speed faster with the large dirty buffer. Both of these just moved the position of the cliff slightly, but nowhere near as far as I expected.
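For reference, those experiments were of this shape in the same /etc/modprobe.d/zfs.conf (the percentages shown are illustrative values, not the exact ones tried):

options zfs zfs_dirty_data_sync_percent=10
options zfs zfs_vdev_async_write_active_min_dirty_percent=5
options zfs zfs_vdev_async_write_active_max_dirty_percent=30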

Questions:

  1. Have I misunderstood how the write throttling delay works?
  2. Is what I'm trying to do possible?
  3. If so, what am I doing wrong?

Yes, I know, I'm literally chasing a couple of seconds and will never recoup the effort spent in achieving this. That's ok, it's personal between me and ZFS at this point, and a matter of principle ;)

ewwhite: What is your zfs_txg_timeout?

OP: I haven't changed it, so whatever the default is, I think? Can't check right now.
Score:1

You also need to increase the zfs_txg_timeout parameter from its current default of 5 seconds to something along the lines of 7G / 0.2G/s = 35s, so setting it to 40s should be sufficient.

In your /etc/modprobe.d/zfs.conf:

options zfs zfs_txg_timeout=40

Note that the ARC is exactly that, a "read" cache with no involvement in write caching, so make sure your ARC is not set up to consume the extra 7G+ of RAM that your block write cache must absorb per 30GB write stream. The ZFS write cache behaves like any other simple block write cache (compare the commit mount option for ext4 filesystems), so be sure to test in non-production to make sure RAM is not starved under any of your transfer scenarios.
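If the ARC does need to be capped to guarantee that headroom, it is also a module parameter; for example (the 16G value is only an illustration for a 32G machine, not a recommendation):

options zfs zfs_arc_max=17179869184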

Monstieur: This is the answer, in addition to the values you already changed. I just tuned my NAS for 450 GiB RAM + SLOG burst writes over a 100 GbE network.
Score:0

Every write will also update the ARC when primarycache=all (the default). If read latency is unimportant for the data you are currently writing, I suggest setting primarycache=metadata on that dataset.
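For example, assuming the videos live in a dataset named tank/videos (a placeholder name):

# zfs set primarycache=metadata tank/videos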

Score:-1

You don't currently have enough RAM or storage resources for what you're seeking.

Design around your desired I/O throughput levels and their worst-case performance.

If you need 1GB/s throughput under all conditions for the working set of data being described, then ensure the disk spindle count or interface throughput is capable of supporting this.
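As rough spindle math, assuming ~200 MB/s sustained per CMR drive (an assumed figure, not a measurement):

        current:  6-wide raid-z2 -> 4 data drives x ~200 MB/s ≈ 800 MB/s (matches the measured array speed)
        target:   1 GB/s sustained -> ~5-6 data drives, i.e. a wider raid-z2 or an additional vdev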

djdomi: Are we talking about a 1 Gbit or a 10 Gbit link?

OP: I don't need it under "all conditions"; I need it in one very specific condition: a single 30GB burst.

OP: How is 32G of RAM not enough to buffer 7G? The system RAM pressure is very low, less than 6G used most of the time, so there's around 26G free. My NIC and Samba can do 1 GB/s as stated in the question. Can you explain why the dirty data buffer cannot be used this way with this amount of memory? Because to me, it seems like it should be...