Score:2

How to increase speed of RAID 5 with mdadm + luks + lvm

cn flag

I think I am somewhat lost with my current server setup. It is an HP ProLiant DL160 Gen 6, and I put in 4 spinning disks with a setup of mdadm + LUKS + LVM, with btrfs on top of it (maybe I went too far?). It is really suffering on IO speed: it reads at around 50 MB/s and writes at around 2 MB/s, and I have a feeling that I messed something up.

One of the things I noted is that I set up mdadm on the whole block devices (e.g. sdb) and not on partitions (e.g. sdb1); would that affect anything?

Here you can see the output of fio --name=randwrite --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting when there is almost no other load on the machine.

randwrite: (groupid=0, jobs=128): err= 0: pid=54290: Tue Oct 26 16:21:50 2021
  write: IOPS=137, BW=2193KiB/s (2246kB/s)(131MiB/61080msec); 0 zone resets
    clat (msec): min=180, max=2784, avg=924.48, stdev=318.02
     lat (msec): min=180, max=2784, avg=924.48, stdev=318.02
    clat percentiles (msec):
     |  1.00th=[  405],  5.00th=[  542], 10.00th=[  600], 20.00th=[  693],
     | 30.00th=[  760], 40.00th=[  818], 50.00th=[  860], 60.00th=[  927],
     | 70.00th=[ 1011], 80.00th=[ 1133], 90.00th=[ 1267], 95.00th=[ 1452],
     | 99.00th=[ 2165], 99.50th=[ 2232], 99.90th=[ 2635], 99.95th=[ 2769],
     | 99.99th=[ 2769]
   bw (  KiB/s): min= 3972, max= 4735, per=100.00%, avg=4097.79, stdev= 1.58, samples=8224
   iops        : min=  132, max=  295, avg=248.40, stdev= 0.26, samples=8224
  lat (msec)   : 250=0.04%, 500=2.82%, 750=25.96%, 1000=40.58%, 2000=28.67%
  lat (msec)   : >=2000=1.95%
  cpu          : usr=0.00%, sys=0.01%, ctx=18166, majf=0, minf=1412
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8372,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2193KiB/s (2246kB/s), 2193KiB/s-2193KiB/s (2246kB/s-2246kB/s), io=131MiB (137MB), run=61080-61080msec

Update 1: sequential writes with dd

root@hp-proliant-dl160-g6-1:~# dd if=/dev/zero of=disk-test oflag=direct bs=512k count=100
100+0 records in
100+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 5.81511 s, 9.0 MB/s

Kernel: 5.4.0-89-generic

OS: Ubuntu 20.04.3

mdadm: 4.1-5ubuntu1.2

lvm2: 2.03.07-1ubuntu1

blkid output

/dev/mapper/dm_crypt-0: UUID="r7TBdk-1GZ4-zbUh-007u-BfuP-dtis-bTllYi" TYPE="LVM2_member"
/dev/sda2: UUID="64528d97-f05c-4f34-a238-f7b844b3bb58" UUID_SUB="263ae70e-d2b8-4dfe-bc6b-bbc2251a9f32" TYPE="btrfs" PARTUUID="494be592-3dad-4600-b954-e2912e410b8b"
/dev/sdb: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="4aeb4804-6380-5421-6aea-d090e6aea8a0" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sdc: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="9d5a4ddd-bb9e-bb40-9b21-90f4151a5875" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sdd: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="f08b5e6d-f971-c622-cd37-50af8ff4b308" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sde: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="362025d4-a4d2-8727-6853-e503c540c4f7" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/md0: UUID="a5b5bf95-1ff1-47f9-b3f6-059356e3af41" TYPE="crypto_LUKS"
/dev/mapper/vg0-lv--0: UUID="6db4e233-5d97-46d2-ac11-1ce6c72f5352" TYPE="swap"
/dev/mapper/vg0-lv--1: UUID="4e1a5131-cb91-48c4-8266-5b165d9f5071" UUID_SUB="e5fc407e-57c2-43eb-9b66-b00207ea6d91" TYPE="btrfs"
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/loop6: TYPE="squashfs"
/dev/loop7: TYPE="squashfs"
/dev/loop8: TYPE="squashfs"
/dev/loop9: TYPE="squashfs"
/dev/loop10: TYPE="squashfs"
/dev/sda1: PARTUUID="fa30c3f5-6952-45f0-b844-9bfb46fa0224"

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb[0] sdc[1] sdd[2] sde[4]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

lshw -c disk

  *-disk
       description: SCSI Disk
       product: DT 101 G2
       vendor: Kingston
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: 1.00
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 7643MiB (8015MB)
       capabilities: removable
       configuration: ansiversion=4 logicalsectorsize=512 sectorsize=512
     *-medium
          physical id: 0
          logical name: /dev/sda
          size: 7643MiB (8015MB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: guid=6c166e3e-27c9-4edf-9b0d-e21892cbce41
  *-disk
       description: ATA Disk
       product: ST2000DM008-2FR1
       physical id: 0.0.0
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: 0001
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sdb
          size: 1863GiB (2TB)
  *-disk
       description: ATA Disk
       product: ST2000DM008-2FR1
       physical id: 0.0.0
       bus info: scsi@2:0.0.0
       logical name: /dev/sdc
       version: 0001
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sdc
          size: 1863GiB (2TB)
  *-disk
       description: ATA Disk
       product: WDC WD20EZBX-00A
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@3:0.0.0
       logical name: /dev/sdd
       version: 1A01
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sdd
          size: 1863GiB (2TB)
  *-disk
       description: ATA Disk
       product: WDC WD20EZBX-00A
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@4:0.0.0
       logical name: /dev/sde
       version: 1A01
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sde
          size: 1863GiB (2TB)

Do you see anything that could be wrong in the setup? Do you think that adding an NVMe drive on a PCIe card and using it for caching would be helpful?

us flag
You should not use RAID5 if you want reliability. When one drive breaks, the chance of a second drive breaking during resilvering is quite high, and when a second drive breaks, all your data is lost.
br flag
+1 to Tero - R5 has been essentially dead for well over a decade now - friends don't let friends use R5 :)
cn flag
Hey folks, what would you suggest for faster access, bigger size, and still being redundant? I was willing to give up one hard drive in order to get 6TB and still survive a single failure.
Score:3
ki flag
Wad

This is an old question, BUT I ran into the same problem and found the correct answer here. Hopefully this will help somebody else.

To summarize, you need to increase the stripe_cache_size. This can be done via:

echo 16384 > /sys/block/md0/md/stripe_cache_size

Be sure to point at the correct mdadm volume. You can try various values, as discussed in the answer linked above; I had the best results with this one.
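
If it helps, here is a rough sketch of checking the value and, optionally, making it persist across reboots. The udev rule is just one possible approach (the file name is arbitrary) and is my own addition, not part of the linked answer:

# check the current value (the default is usually 256)
cat /sys/block/md0/md/stripe_cache_size

# raise it for the running system
echo 16384 > /sys/block/md0/md/stripe_cache_size

# optional: persist it with a udev rule, e.g. in /etc/udev/rules.d/60-md-stripe-cache.rules
SUBSYSTEM=="block", KERNEL=="md0", ACTION=="add|change", ATTR{md/stripe_cache_size}="16384"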

Score:1
ca flag

The poor recorded performance stems from several factors:

  • mechanical disks are simply very bad at random read/write IO. To discover how bad they can be, simply append --sync=1 to your fio command (see the example command after this list); short story: they are incredibly bad, at least when compared to proper BBU RAID controllers or powerloss-protected SSDs;

  • RAID5 has an inherent write penalty due to stripe read/modify/write. Moreover, it is strongly suggested to avoid it on multi-TB mechanical disks for safety reasons. Having 4 disks, please seriously consider using RAID10 instead;

  • LUKS, providing software-based full-disk encryption, inevitably takes its (significant) toll on both reads and writes;

  • when using BTRFS, LVM is totally unnecessary. While a fat LVM volume will not impair performance in any meaningful way by itself, you are nonetheless inserting another IO layer and exposing yourself to (more) alignment issues;

  • finally, BTRFS itself is not particularly fast. In particular, your slow sequential reads can be traced to BTRFS's horrible fragmentation (due to it being CoW and enforcing 4K granularity; as a comparison, to obtain good performance from ZFS one should generally select 64K-128K records when using mechanical disks).
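
For example, this is simply your original fio command with --sync=1 appended, to show how much worse synchronous random writes get on these disks:

fio --name=randwrite --rw=randwrite --direct=1 --sync=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting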

To have a baseline performance comparison, I strongly suggest rebuilding your IO stack, measuring random & sequential read/write speed at each step. In other words:

  • create a RAID10 array and run dd and fio on the raw array (without a filesystem); a rough sketch of this and the next step follows this list;

  • if full-disk encryption is really needed, use LUKS to create an encrypted device and re-run dd + fio on the raw encrypted device (again, with no filesystem). Compare to the previous results to get an idea of what it means performance-wise;

  • try both XFS and BTRFS (running the usual dd + fio quick bench) to understand how the two different filesystems behave. If BTRFS is too slow, try replacing it with lvmthin and XFS (but remember that in this case you will lose user-data checksums, for which you need yet another layer, dm-integrity, itself commanding a significant performance hit).
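
A rough, destructive sketch of the first two steps, reusing your own dd/fio parameters; it assumes the four data disks (sdb-sde) can be wiped, and the device/mapper names are arbitrary:

# create a clean RAID10 array from the four disks (this destroys any existing array/data on them)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# benchmark the raw array, with no filesystem on it
fio --name=rawtest --filename=/dev/md0 --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting
dd if=/dev/zero of=/dev/md0 oflag=direct bs=512k count=1000

# if encryption is needed, layer LUKS on top and repeat the same tests on the mapped device
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 bench_crypt
fio --name=crypttest --filename=/dev/mapper/bench_crypt --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting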

If all this seems a mess, well, it really is. By doing all the above you are only scratching the surface of storage performance: one has to consider real application behavior (rather than totally sequential dd or purely random fio results), cache hit ratio, IO pattern alignment, etc. But hey, a little is much better than nothing, so let's start with something basic.

cn flag
So, what would you say about having LUKS, then LVM with plain old ext4, and removing all the fanciness; would that be better on sequential writes and reads? I have a couple of applications that use SQLite databases and also write local metadata files (radarr, sonarr, readarr, and plex, for example), but all these reads make the whole system suffer a lot. I also had a lot of trouble running commands like kubectl get pod, as it creates many small files for caching, so for that I just mounted a tmpfs on that folder; but doing so, I will eventually run out of memory, even though this data is ephemeral.
cn flag
I mean, I have no idea why I used LVM; before, I would have used LUKS with mdadm and ext4.
shodanshok
ca flag
Without LVM you lose block-level snapshots; if your filesystem does not natively support them (i.e. ext3/4, XFS, etc.) you will not be able to take any snapshots. Only you can evaluate if/how much losing snapshots matters (or not). btrfs, on the other hand, has built-in snapshots, so it does not need LVM, but I found its performance to be quite low for anything other than a simple fileserver.
cn flag
Thank you very much! I've changed it to RAID 10 and used LUKS + LVM + ext4, and it reached 150MB/s on writes.
Score:1
ng flag

The short version: I think it's likely that your problem is that your benchmark is using random writes that are much smaller than your RAID chunk size.

Is the performance problem something you noticed while using the system? Or, is it just that the benchmark results look bad? That 16K random write benchmark is approaching the worst case for that RAID 5 with a big 512K chunk.

RAID 5 has a parity chunk that has to be updated alongside the data. If you had a sequential workload that the kernel could chop up into 512K writes, you'd simply be computing the new parity information, then writing the data and parity chunks out. One write in translates to two writes out.

But with 16K writes that are much smaller than the chunk size, you've got to read the old data and the old parity first, then compute the new parity information, and then write out the new data and parity. That's read-read-write-write. One write in translates to four I/O's. With random writes, there's no way for even the best RAID controller on the planet to predict which chunks to cache.

If you're using the array to store large files, then you're in luck: you're just using the wrong benchmark to assess its performance. If you change randwrite to simply write in your benchmark so that the writes are sequential, it should get a lot better!
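
For instance, keeping all your other parameters and changing only the IO pattern (you could also lower numjobs and raise bs for a purer sequential stream, but this is the minimal change):

fio --name=seqwrite --rw=write --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting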

But if your workload is truly made of more random, small writes, then you're going to have to change something about the array. You'd be better served by a 4 disk RAID 10. But still, that's spinning media. It's not going to rock your world. I'd imagine that the performance of RAID 10 should be 2x to 3x what you've got now, something like 275 to 400 IOPS, maybe 4MiB/s to 6MiB/s on that benchmark?

As for using an SSD to cache, perhaps with something like bcache, you'd be eliminating your redundancy. Consider using a RAID 1 of two SSDs for caching? You definitely don't need NVMe here, given the speed of these drives. SATA would be fine.
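
If you go that route, here is a very rough sketch of what it could look like with bcache; the SSD device names (/dev/sdf, /dev/sdg) are hypothetical, and formatting the backing device means starting from a fresh array:

# mirror the two SSDs so the cache itself stays redundant
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdf /dev/sdg

# make the HDD array the bcache backing device and the SSD mirror the cache device
make-bcache -B /dev/md0
make-bcache -C /dev/md1

# attach the cache set to the backing device; the UUID comes from bcache-super-show /dev/md1
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# the rest of the stack (LUKS/LVM/filesystem) would then sit on /dev/bcache0 instead of /dev/md0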

(BTW, don't sweat partitions vs. raw devices. It doesn't make a difference. Personally, I don't use partitions.)

cn flag
Hey there Mike, thank you very much for your answer! The performance problem I noticed was while using the system. First I did a simple dd test writing sequential zeros, and still it was nowhere near the speed of the SATA disks. So, just to show you, I've redone it against the block device side of the RAID so you can see how it goes: ``` root@hp-proliant-dl160-g6-1:~# dd if=/dev/zero of=disk-test oflag=direct bs=512k count=100 100+0 records in 100+0 records out 52428800 bytes (52 MB, 50 MiB) copied, 5.81511 s, 9.0 MB/s ``` I wanted RAID 5 so I could use more space and still have safety in case of a failure.
Nikita Kipriyanov
za flag
... the redundancy of SSD cache could be easily restored by using *two* SSDs, building a RAID1 array out of them and using *that array* as a caching device.
Mike Andrews
ng flag
@JaysonReis, wow... that is slow. A couple ideas: first, just to sanity check, rule out the possibility that the drives have spun down. When you do that test, try the `dd`, then repeat it right afterwards. Take the timing from the 2nd run. Also, what @shodanshok described below is good advice: Profile the RAID directly, then add layers. See if you can figure out which layer is causing the problem.