Score:2

Does the ZFS scrub support parallelization for increased performance, e.g., with a 64-core AMD Threadripper Pro?

km flag

I have a 24 drive zpool comprised of 3 RAIDZ1 vdevs running 8 Seagate Exos X18 16TB drives per vdev. This is on a Supermicro MB with a 64-Core (128 thread) AMD Threadripper Pro and 256GB ECC RAM.

System utilization during scrubs shows at most 2 CPUs utilized at a time, and total scrub time looks like it could take five to seven days.

Is there a way to have all CPU cores working in parallel on the scrub to speed it up?
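For context, a rough back-of-the-envelope comparison (a sketch in Python; the 270MB/s figure is the Exos X18 data-sheet maximum sustained rate, an assumption rather than a measurement of this system):

```python
# Rough comparison only: how long a purely sequential read of one drive
# would take versus the scrub estimate reported by the pool.

DRIVE_BYTES = 16e12          # 16 TB per drive
SUSTAINED_BPS = 270e6        # assumed data-sheet maximum sustained rate, bytes/s

full_read_h = DRIVE_BYTES / SUSTAINED_BPS / 3600
print(f"Best-case sequential read of one drive: {full_read_h:.1f} h")   # ~16.5 h

for scrub_days in (5, 7):
    print(f"A {scrub_days}-day scrub is ~{scrub_days * 24 / full_read_h:.0f}x slower than that")
```

If the scrub were purely bandwidth-bound it would finish in well under a day, so the gap is usually explained by seek-bound I/O rather than by the two busy CPU cores.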

Andrew Henle avatar
ph flag
*3 RAIDZ1 vdevs running 8 Seagate Exos X18 16TB drives per vdev* and *scrub time looks like it could take five to seven days* I sure hope you have actual backups and you're not expecting this ZFS pool to save your data. You have 8-drive RAIDZ1 vdevs with relatively slow 16TB drives that are limited to about 70 IOPS each. With the very long rebuild times that result (see those looooong scrub times...), when you lose a drive there's a huge window for a second drive failure to effectively wipe out all the data on the vdev.
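
To put rough numbers on that window (an illustrative sketch only; the 1% AFR and the 1-per-10^15-bit unrecoverable-read-error rating are assumed, data-sheet-class figures, not measurements of this pool):

```python
# Illustrative only: risk exposure while rebuilding one degraded 8-drive
# RAIDZ1 vdev. Both the AFR and the URE rating below are assumptions.

import math

REMAINING = 7                # surviving drives that must be read in full
DRIVE_BYTES = 16e12          # 16 TB per drive
AFR = 0.01                   # assumed 1% annualized failure rate per drive
URE_PER_BIT = 1e-15          # assumed 1 unrecoverable read error per 1e15 bits

for rebuild_days in (1, 3, 7):
    p = 1 - math.exp(-AFR * REMAINING * rebuild_days / 365)
    print(f"{rebuild_days}-day rebuild: ~{100 * p:.2f}% chance of losing a second drive")

expected_ures = REMAINING * DRIVE_BYTES * 8 * URE_PER_BIT
print(f"Expected unrecoverable read errors during the rebuild reads: ~{expected_ures:.2f}")
```
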
ewwhite avatar
ng flag
Why are you concerned with scrub performance? I think you're starting with the misconception that scrubs need to run this often. In my experience, this is a 3-month or 6-month or even yearly thing.
freezed avatar
ca flag
The time estimate announced at the start of a scrub can be overestimated; have you already run a full scrub to completion?
Score:0
km flag

It appears that work is proceeding on parallelization of disk read/write operations for ZFS, but the work is not ready for testing.

Parameters and a bit of math to guide the responses:

Capacity per drive: 16,000,000,000,000 bytes (16 TB, not 16 TiB).

Sustained read/write: 270MB/s (258 MiB/s).

Mean time between failure: 285 years.

Nonrecoverable sector read errors per bit read: 1 bit error per 116,415 TB of data read.

Random Read 4K QD16 WCD: 170 IOPS.

Random Write 4K QD16 WCD: 550 IOPS.

Each 8-drive RAIDZ1 vdev is connected to its own 8-port PCIe 3.0 HBA that supports 512MB/s sustained throughput per attached drive.

The HBA is attached to a PCIe 4.0 x16 slot on a 128-lane motherboard.

With all 24 drives reading in parallel, the system supports a complete read of every 16TB drive in about 22 hours.
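
A quick sanity check of that I/O path (a sketch; the PCIe per-lane payload rates and the HBA's assumed x8 link width are approximations, and the ~200MB/s whole-platter average is an assumption rather than a data-sheet number):

```python
# Checks whether the HBA/PCIe path or the drives themselves bound a full
# parallel read. Per-lane rates are approximate payload bandwidths.

DRIVE_MAX_BPS = 270e6            # assumed data-sheet maximum sustained rate
PCIE3_LANE_BPS = 0.985e9         # ~PCIe 3.0 payload per lane
PCIE4_LANE_BPS = 1.969e9         # ~PCIe 4.0 payload per lane

hba_demand = 8 * DRIVE_MAX_BPS
print(f"Per-HBA demand: {hba_demand/1e9:.2f} GB/s vs PCIe 3.0 x8 ~{8*PCIE3_LANE_BPS/1e9:.1f} GB/s")

pool_demand = 24 * DRIVE_MAX_BPS
print(f"Whole-pool demand: {pool_demand/1e9:.2f} GB/s vs PCIe 4.0 x16 ~{16*PCIE4_LANE_BPS/1e9:.1f} GB/s")

# A full-platter read averages well below the maximum sustained rate;
# assuming ~200MB/s across the whole surface gives roughly the 22 hours above.
print(f"Full read of 16 TB at ~200 MB/s: {16e12 / 200e6 / 3600:.1f} h")
```
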

My expectation is that the scrub should complete in less than 24 hours; therefore, the bottleneck is the CPU utilization for checksum verification. Given the availability of 5 computational threads/drive (this is a 128 thread/24 drive system), parallelization of checksums should solve the bottleneck problem.
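
A quick worked version of that arithmetic, using only the figures already listed above:

```python
# Thread budget if checksum work were spread evenly across the machine.

THREADS = 128
DRIVES = 24
DRIVE_MAX_BPS = 270e6

pool_bps = DRIVES * DRIVE_MAX_BPS
print(f"Aggregate read rate with every drive streaming flat out: {pool_bps/1e9:.2f} GB/s")
print(f"Thread budget: ~{THREADS / DRIVES:.1f} threads per drive")
print(f"Checksum rate each thread would need to sustain: ~{pool_bps / THREADS / 1e6:.0f} MB/s")
```

On Linux, OpenZFS reports its measured per-core fletcher4 throughput in /proc/spl/kstat/zfs/fletcher_4_bench, which gives a concrete number to compare against that per-thread requirement.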

Regarding reliability:

Stochastic theory predicts that drive failure is unlikely, given the manufacturer's MTBF of 285 years and assuming a confidence interval of six standard deviations. Nonetheless, I have 4 drives committed to error correction and disaster recovery.

Bit rot (nonrecoverable sector read errors per bit read) is a separate concern, which is why I am concerned about scrub operations. The expected error rate is 1 bit error per 116,415 TB of data read. That works out to roughly one bit read error every 14 years, and only if continuous reads at the full 270MB/s throughput are maintained 24x7.
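
For reference, the 14-year figure follows directly from those two numbers as stated:

```python
# Time to read 116,415 TB of data at a continuous 270 MB/s.

ERROR_INTERVAL_BYTES = 116_415e12    # 1 bit error per 116,415 TB read (as stated above)
READ_BPS = 270e6                     # 270 MB/s, 24x7

years = ERROR_INTERVAL_BYTES / READ_BPS / (365.25 * 24 * 3600)
print(f"~{years:.1f} years between expected bit errors at full continuous throughput")
```
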

This machine is part of a hot-failover 1024-node, 1 petabyte cluster.

ewwhite avatar
ng flag
Did you consider dRAID?
Score:0
cn flag

Very likely the CPU is not the limiting factor for performance. 7200 RPM spindles deliver about 60 to 70 random IOPS each, and even 24 disks do not leave a lot of spare performance for a lower-priority integrity check.
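
To put rough numbers on that (an illustrative sketch; the per-disk IOPS figure and the assumed average I/O size are guesses, and RAIDZ geometry muddies the per-disk accounting further):

```python
# Rough aggregate random-read budget of 24 spindles, and the throughput
# that budget yields at an assumed average I/O size. Illustrative only.

DRIVES = 24
IOPS_PER_DISK = 65              # ~60-70 random IOPS for a 7200 RPM drive
AVG_IO_BYTES = 128 * 1024       # assumed average read size per I/O

pool_iops = DRIVES * IOPS_PER_DISK
print(f"Pool random-read budget: ~{pool_iops} IOPS")
print(f"At ~128 KiB per I/O that is only ~{pool_iops * AVG_IO_BYTES / 1e6:.0f} MB/s pool-wide")
```
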

Plan around the current performance of maybe one scrub per week. If your recovery point objective is a nightly backup, the restore source will not have been fully scrubbed; at best some snapshot of it will have been. That may be acceptable to you.

Consider aligning backups to scrubs. If you took a full backup every week and started a scrub at that point, the scrub might finish before the next week's full, giving extra assurance of the array's (and by proxy the backup's) integrity. However, that does not leave much time during which a backup with a completed integrity check exists, so consider keeping multiple full backups on hand. How useful several-day-old archives are to your restore objectives is up to you, but at least the associated scrub should have completed.
