Slow read speed in ZFS mirror (slower than write speed, and very slow for small chunk sizes)

I have a server running Debian on top of a ZFS 3-way mirror of Exos X18 18 TB drives (ST18000NM001J).

I'm benchmarking it and finding some surprises in the read rate under certain conditions.

But first, for the benchmarking I created a benchmarking dataset (rpool/benchmarking) with primarycache and secondarycache set to none, to avoid benchmarking the cache when reading, and with compression set to off, to avoid inflated rates when writing arrays of zeros. Then I created three sub-datasets, named "8k", "128k" and "1M", each one with its corresponding recordsize.
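For reference, that layout corresponds to zfs create calls roughly like these (a sketch of the equivalent commands, matching the properties shown in the zfs get all output further down):

zfs create -o primarycache=none -o secondarycache=none -o compression=off \
    -o mountpoint=/benchmarking rpool/benchmarking
zfs create -o recordsize=8K   rpool/benchmarking/8k
zfs create -o recordsize=128K rpool/benchmarking/128k
zfs create -o recordsize=1M   rpool/benchmarking/1M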

Then with the following dd script:

echo -e "bs=4M recordsize=1M\n"
dd if=/dev/zero of=/benchmarking/1M/ddfile bs=4M count=2000 conv=fdatasync
dd if=/benchmarking/1M/ddfile of=/dev/null bs=4M count=2000 #conv=fdatasync
rm /benchmarking/1M/ddfile
echo -e "------------------\n\n"

echo -e "bs=4k recordsize=1M\n"
dd if=/dev/zero of=/benchmarking/1M/ddfile bs=4k count=2000000 conv=fdatasync
dd if=/benchmarking/1M/ddfile of=/dev/null bs=4k count=2000000 #conv=fdatasync
rm /benchmarking/1M/ddfile
echo -e "------------------\n\n"

echo -e "bs=4M recordsize=128k\n"
dd if=/dev/zero of=/benchmarking/128k/ddfile bs=4M count=2000 conv=fdatasync
dd if=/benchmarking/128k/ddfile of=/dev/null bs=4M count=2000 #conv=fdatasync
rm /benchmarking/128k/ddfile
echo -e "------------------\n\n"

echo -e "bs=4k recordsize=128k\n"
dd if=/dev/zero of=/benchmarking/128k/ddfile bs=4k count=2000000 conv=fdatasync
dd if=/benchmarking/128k/ddfile of=/dev/null bs=4k count=2000000 #conv=fdatasync
rm /benchmarking/128k/ddfile
echo -e "------------------\n\n"

echo -e "bs=4M recordsize=8k\n"
dd if=/dev/zero of=/benchmarking/8k/ddfile bs=4M count=2000 conv=fdatasync
dd if=/benchmarking/8k/ddfile of=/dev/null bs=4M count=2000 #conv=fdatasync
rm /benchmarking/8k/ddfile
echo -e "------------------\n\n"

echo -e "bs=4k recordsize=8k\n"
dd if=/dev/zero of=/benchmarking/8k/ddfile bs=4k count=2000000 conv=fdatasync
dd if=/benchmarking/8k/ddfile of=/dev/null bs=4k count=2000000 #conv=fdatasync
rm /benchmarking/8k/ddfile
echo -e "------------------\n\n"

I got the following:

root@pbs:/benchmarking# ./dd_bench.sh

bs=4M recordsize=1M

2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 43.3219 s, 194 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 43.7647 s, 192 MB/s
------------------

bs=4k recordsize=1M

2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 38.7432 s, 211 MB/s
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 5100.27 s, 1.6 MB/s
------------------

bs=4M recordsize=128k

2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 60.1265 s, 140 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 56.4249 s, 149 MB/s
------------------

bs=4k recordsize=128k

2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 52.044 s, 157 MB/s
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 1242.29 s, 6.6 MB/s
------------------

bs=4M recordsize=8k

2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 111.594 s, 75.2 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 60.547 s, 139 MB/s
------------------

bs=4k recordsize=8k

2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 96.3637 s, 85.0 MB/s
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 771.967 s, 10.6 MB/s

When the block size is small (4 KiB) the read speed is very limited (between 1 and 10 MB/s). The same does not happen for the write speed.
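To narrow down where the time goes during the slow small-block reads, it can help to watch per-vdev and per-disk activity from a second terminal while the reading dd runs, for example (pool name as used here; iostat comes from the sysstat package):

zpool iostat -v rpool 1    # per-vdev read/write operations and bandwidth, once per second
iostat -x 1                # per-disk IOPS, request size, await and %util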

Then I ran bonnie++ on all three datasets:

root@pbs:~# bonnie++ -d /benchmarking/1M/ -u root -n 160
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
pbs          63624M  527k  93  136m   6 56.1m   4    0k   3 2902k   3 168.4  21
Latency             12952us   27977us    3500ms   21656ms     599ms     990ms
Version  2.00       ------Sequential Create------ --------Random Create--------
pbs                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                160 163840  13 +++++ +++ 163840   6 163840  15 +++++ +++ 163840   6
Latency               277ms    2447us     353ms     287ms      27us     377ms
1.98,2.00,pbs,1,1665552058,63624M,,8192,5,527,93,139633,6,57398,4,0,3,2902,3,168.4,21,160,,,,,9606,13,+++++,+++,1264,6,9808,15,+++++,+++,1147,6,12952us,27977us,3500ms,21656ms,599ms,990ms,277ms,2447us,353ms,287ms,27us,377ms




root@pbs:~# bonnie++ -d /benchmarking/128k/ -u root -n 160
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
pbs          63624M  525k  93  126m   6 44.1m   6    1k   7 10.3m   7 311.3  41
Latency             13067us   17678us    2688ms    6693ms     206ms     390ms
Version  2.00       ------Sequential Create------ --------Random Create--------
pbs                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                160 163840  13 +++++ +++ 163840   6 163840  14 +++++ +++ 163840   6
Latency               284ms    2643us     328ms     266ms      21us     356ms
1.98,2.00,pbs,1,1665335428,63624M,,8192,5,525,93,128601,6,45110,6,1,7,10548,7,311.3,41,160,,,,,8118,13,+++++,+++,1248,6,9634,14,+++++,+++,1173,6,13067us,17678us,2688ms,6693ms,206ms,390ms,284ms,2643us,328ms,266ms,21us,356ms




root@pbs:~# bonnie++ -d /benchmarking/8k/ -u root -n 160
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
pbs          63624M  528k  97 80.2m   6 54.4m   8    1k   4 15.1m   5 264.7  37
Latency             14231us     982us    1535ms    5087ms     342ms     284ms
Version  2.00       ------Sequential Create------ --------Random Create--------
pbs                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                160 163840  13 +++++ +++ 163840   6 163840  14 +++++ +++ 163840   6
Latency               334ms     100us     325ms     311ms      27us     353ms
1.98,2.00,pbs,1,1668749456,63624M,,8192,5,528,97,82088,6,55756,8,1,4,15510,5,264.7,37,160,,,,,9254,13,+++++,+++,1276,6,9582,14,+++++,+++,1066,6,14231us,982us,1535ms,5087ms,342ms,284ms,334ms,100us,325ms,311ms,27us,353ms

Bonnie++ also returns very low sequential read rates, just like dd (roughly 3, 10 and 15 MB/s).

As a final step I ran another dd benchmark, this time aligning the dd block size with the ZFS recordsize:
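dd_bench_2.sh follows the same pattern as the first script, just with bs matched to each recordsize (reconstructed here from the output below):

echo -e "bs=1M recordsize=1M\n"
dd if=/dev/zero of=/benchmarking/1M/ddfile bs=1M count=8000 conv=fdatasync
dd if=/benchmarking/1M/ddfile of=/dev/null bs=1M count=8000
rm /benchmarking/1M/ddfile
echo -e "------------------\n\n"

echo -e "bs=128k recordsize=128k\n"
dd if=/dev/zero of=/benchmarking/128k/ddfile bs=128k count=64000 conv=fdatasync
dd if=/benchmarking/128k/ddfile of=/dev/null bs=128k count=64000
rm /benchmarking/128k/ddfile
echo -e "------------------\n\n"

echo -e "bs=8k recordsize=8k\n"
dd if=/dev/zero of=/benchmarking/8k/ddfile bs=8k count=1000000 conv=fdatasync
dd if=/benchmarking/8k/ddfile of=/dev/null bs=8k count=1000000
rm /benchmarking/8k/ddfile
echo -e "------------------\n\n"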

root@pbs:~# /benchmarking/dd_bench_2.sh

bs=1M recordsize=1M

8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 62.6119 s, 134 MB/s
8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 65.2772 s, 129 MB/s
------------------

bs=128k recordsize=128k

64000+0 records in
64000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 64.6437 s, 130 MB/s
64000+0 records in
64000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 49.128 s, 171 MB/s
------------------

bs=8k recordsize=8k

1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 108.331 s, 75.6 MB/s
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 344.981 s, 23.7 MB/s
------------------

Now there is a significant improvement, but I still expected a higher read speed for 8k.

Then I set atime to off and repeated this last test, but nothing changed much (the 1M dataset already had atime=off the whole time, sorry about that).
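The atime change amounts to something like:

zfs set atime=off rpool/benchmarking/8k
zfs set atime=off rpool/benchmarking/128k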

root@pbs:~# /benchmarking/dd_bench_2.sh

bs=1M recordsize=1M

8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 44.505 s, 188 MB/s
8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 40.3689 s, 208 MB/s
------------------

bs=128k recordsize=128k

64000+0 records in
64000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 67.7169 s, 124 MB/s
64000+0 records in
64000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 56.0657 s, 150 MB/s
------------------

bs=8k recordsize=8k

1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 103.724 s, 79.0 MB/s
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 343.753 s, 23.8 MB/s

So, trying to summarize:

  • Why do I get such slow read rates with bonnie++ and with small-bs dd?
  • Read speed is almost always equal to or lower than write speed. How can that be in a 3-way mirror, where the system can read from three devices at once but has to write 3x the data?


As extra info, the server is running on enterprise-grade disks but a consumer-grade (not low-end, but consumer-grade) motherboard, and the disks are connected to the motherboard SATA controller. I know these are low-end SATA controllers, but still, it's strange to sometimes see low read rates while write speeds are always fine.
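In case the controller matters, the negotiated SATA link speed can be checked with something like this (the device name is just an example):

dmesg | grep -i 'sata link up'                  # shows the negotiated speed per port
smartctl -i /dev/sda | grep -i 'sata version'   # shows the drive's supported vs. current link speed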

Also, I have checked that the drives are not SMR, and I have repeated these tests on a similar server with similar hardware/setup, obtaining similar results.

Finally, I attach the zfs get all output for one of the benchmarking datasets:

root@pbs:~# zfs get all rpool/benchmarking/128k
NAME                     PROPERTY              VALUE                  SOURCE
rpool/benchmarking/128k  type                  filesystem             -
rpool/benchmarking/128k  creation              Wed Oct 26  8:45 2022  -
rpool/benchmarking/128k  used                  96K                    -
rpool/benchmarking/128k  available             12.5T                  -
rpool/benchmarking/128k  referenced            96K                    -
rpool/benchmarking/128k  compressratio         1.00x                  -
rpool/benchmarking/128k  mounted               yes                    -
rpool/benchmarking/128k  quota                 none                   default
rpool/benchmarking/128k  reservation           none                   default
rpool/benchmarking/128k  recordsize            128K                   default
rpool/benchmarking/128k  mountpoint            /benchmarking/128k     inherited from rpool/benchmarking
rpool/benchmarking/128k  sharenfs              off                    default
rpool/benchmarking/128k  checksum              on                     default
rpool/benchmarking/128k  compression           off                    inherited from rpool/benchmarking
rpool/benchmarking/128k  atime                 off                    local
rpool/benchmarking/128k  devices               on                     default
rpool/benchmarking/128k  exec                  on                     default
rpool/benchmarking/128k  setuid                on                     default
rpool/benchmarking/128k  readonly              off                    default
rpool/benchmarking/128k  zoned                 off                    default
rpool/benchmarking/128k  snapdir               hidden                 default
rpool/benchmarking/128k  aclmode               discard                default
rpool/benchmarking/128k  aclinherit            restricted             default
rpool/benchmarking/128k  createtxg             255400                 -
rpool/benchmarking/128k  canmount              on                     default
rpool/benchmarking/128k  xattr                 on                     default
rpool/benchmarking/128k  copies                1                      default
rpool/benchmarking/128k  version               5                      -
rpool/benchmarking/128k  utf8only              off                    -
rpool/benchmarking/128k  normalization         none                   -
rpool/benchmarking/128k  casesensitivity       sensitive              -
rpool/benchmarking/128k  vscan                 off                    default
rpool/benchmarking/128k  nbmand                off                    default
rpool/benchmarking/128k  sharesmb              off                    default
rpool/benchmarking/128k  refquota              none                   default
rpool/benchmarking/128k  refreservation        none                   default
rpool/benchmarking/128k  guid                  13557460337392366562   -
rpool/benchmarking/128k  primarycache          none                   inherited from rpool/benchmarking
rpool/benchmarking/128k  secondarycache        none                   inherited from rpool/benchmarking
rpool/benchmarking/128k  usedbysnapshots       0B                     -
rpool/benchmarking/128k  usedbydataset         96K                    -
rpool/benchmarking/128k  usedbychildren        0B                     -
rpool/benchmarking/128k  usedbyrefreservation  0B                     -
rpool/benchmarking/128k  logbias               latency                default
rpool/benchmarking/128k  objsetid              60174                  -
rpool/benchmarking/128k  dedup                 off                    default
rpool/benchmarking/128k  mlslabel              none                   default
rpool/benchmarking/128k  sync                  standard               inherited from rpool
rpool/benchmarking/128k  dnodesize             legacy                 default
rpool/benchmarking/128k  refcompressratio      1.00x                  -
rpool/benchmarking/128k  written               96K                    -
rpool/benchmarking/128k  logicalused           42K                    -
rpool/benchmarking/128k  logicalreferenced     42K                    -
rpool/benchmarking/128k  volmode               default                default
rpool/benchmarking/128k  filesystem_limit      none                   default
rpool/benchmarking/128k  snapshot_limit        none                   default
rpool/benchmarking/128k  filesystem_count      none                   default
rpool/benchmarking/128k  snapshot_count        none                   default
rpool/benchmarking/128k  snapdev               hidden                 default
rpool/benchmarking/128k  acltype               off                    default
rpool/benchmarking/128k  context               none                   default
rpool/benchmarking/128k  fscontext             none                   default
rpool/benchmarking/128k  defcontext            none                   default
rpool/benchmarking/128k  rootcontext           none                   default
rpool/benchmarking/128k  relatime              on                     inherited from rpool
rpool/benchmarking/128k  redundant_metadata    all                    default
rpool/benchmarking/128k  overlay               on                     default
rpool/benchmarking/128k  encryption            off                    default
rpool/benchmarking/128k  keylocation           none                   default
rpool/benchmarking/128k  keyformat             none                   default
rpool/benchmarking/128k  pbkdf2iters           0                      default
rpool/benchmarking/128k  special_small_blocks  0                      default

Thanks for your time!

EDIT: ashift is properly set to 12, dedup is off and fragmentation is 0%.
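For completeness, ashift can be double-checked with something like:

zdb -C rpool | grep ashift    # per-vdev ashift recorded in the pool configuration
zpool get ashift rpool        # pool-level ashift property on recent OpenZFS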

user189695: Could you also post the SMART data? I think the 4k is misaligned, just a hunch. These drives ship as 512e by default; you need to do something with FastFormat, but I'm not an expert.

Héctor: Can't right now, will try Monday. What do you mean by misaligned? 4k disks need ashift 12.

user189695: The drive most likely uses 512e for addressing; it does by default. Misalignment is very easy to cause: if you don't put your partitions on a 4k boundary, you get this behaviour, for example. The easiest way to rule out any misalignment is to use 4k native instead of 512 emulated. Use FastFormat to change the drive. More info about aligning: https://flashdba.com/4k-sector-size/ and more info on how to convert your drive (data on the disk will be LOST): https://www.reddit.com/r/SynologyForum/comments/j57nwf/how_to_guide_for_format_a_sas_hdd_to_4kn/

Héctor: Interesting. I'm doing some extra benchmarking and then I will think about reformatting the drives to 4Kn. If I do, I will come back with new benchmarks. Do you have any information on whether it's worth reformatting 512e to 4Kn? I have found contradictory information.
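For anyone following along: whether a drive is running as 512e or 4Kn, and whether the partitions are 4k-aligned, can be inspected with something like this (the device name is just an example):

lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC /dev/sda    # 512e drives report LOG-SEC 512 with PHY-SEC 4096
parted /dev/sda unit s print                    # on 512e, partition start sectors should be multiples of 8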