
QEMU virtual machines slow during RAID5 check

  • Operating system: CentOS 7.9 (on both host and guests)
  • Host specs:
    • CPU - AMD EPYC 7502P 32-Core Processor
    • RAM - 250 GB (only 22 GB used by all VMs combined)
    • Disk - Four Samsung SSD 870 QVO 4TB drives, grouped in an mdadm RAID5

This system runs 12 virtual machines, which in aggregate are allocated 19 virtual CPUs out of the 64 hardware threads available on the CPU. One of those VMs is our site's mail server, which is of critical importance to our users.

For the past two years, this system has worked just fine. Over this time I gradually added VMs to the set-up, the most recent one roughly a month ago.

About five days ago, the VMs all started running slowly. The most visibly affected was the mail-server VM (four cores, 5GB RAM allocated, 2.6GB in use, 20GB storage). I traced the slowdown to CentOS's standard weekly RAID check on the host, which I hadn't even known existed until now.
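
For anyone else hunting for it: the weekly check is driven by mdadm's cron job. On a stock CentOS 7 install the relevant files should be these (paths are what I'd expect, not verified on every setup):

cat /etc/cron.d/raid-check       # the weekly cron entry that runs /usr/sbin/raid-check
cat /etc/sysconfig/raid-check    # ENABLED, CHECK=check, niceness, arrays to skip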

When I cancelled the RAID check, the VMs' speed went back to normal:

echo frozen > /sys/devices/virtual/block/md127/md/sync_action
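
Reading the same file back should confirm the state took effect:

cat /sys/devices/virtual/block/md127/md/sync_action
# frozen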

I tried slowing down the RAID5 check, following hints from here and the contents of /usr/sbin/raid-check:

# Throttle the check to roughly 1 MB/s
echo 1000 > /proc/sys/dev/raid/speed_limit_min
echo 1000 > /proc/sys/dev/raid/speed_limit_max
# Un-freeze the array; the check resumed from its checkpoint
echo idle > /sys/devices/virtual/block/md127/md/sync_action
# Find the md resync thread and deprioritize it
ps -elf | grep resync
# Noting that the PID was 59065
renice -n 15 -p 59065    # lower CPU priority
ionice -c3 -p 59065      # I/O scheduling class "idle"
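
(Had the throttle helped, the limits could have been made persistent with a sysctl drop-in; a minimal sketch, using the standard dev.raid keys:)

# /etc/sysctl.d/90-raid-throttle.conf -- persist the throttle across reboots
dev.raid.speed_limit_min = 1000
dev.raid.speed_limit_max = 1000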

It made no difference, even though the check was now running quite slowly:

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md127 : active raid5 sdc1[2] sdd1[4] sda1[0] sdb1[1]
      11720570880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=============>.......]  check = 69.1% (2700123556/3906856960) finish=5411.6min speed=3716K/sec
      bitmap: 6/30 pages [24KB], 65536KB chunk
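
The reported ETA at least confirms the throttle was being honored; the remaining 1K blocks divided by the speed match the quoted finish time:

# (3906856960 - 2700123556) K remaining at 3716 K/sec ~= 5412 minutes
echo $(( (3906856960 - 2700123556) / 3716 / 60 ))
# 5412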

The symptom remained: whenever the SSD-based RAID5 was checking, at any speed, the load would build after a while and the VMs would slow to the point of being unusable. Running top simultaneously on both host and guest made it clear that the host's load rose first, then the mail server's.
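
What I haven't isolated is whether the contention is CPU or disk. Watching per-device utilization on the host while the check runs should show whether the member disks saturate (iostat is in the sysstat package; the device names are my array's):

# extended stats for the md members and the array, every 5 seconds
iostat -x sda sdb sdc sdd md127 5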

A typical QEMU storage specification (libvirt domain XML) for my VMs is:

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/xen/images/imagefile.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
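
With cache='none', the guests do O_DIRECT I/O to the qcow2 files on the md array, so they feel its latency directly. One knob I have not tried is the asynchronous I/O mode on the driver line; this variant is a guess, not a tested fix:

    <!-- untested: native AIO instead of the default thread pool -->
    <driver name='qemu' type='qcow2' cache='none' io='native'/>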

A typical guest network interface is:

    <interface type='bridge'>
      <mac address='52:54:00:59:83:0a'/>
      <source bridge='bridge0'/>
      <target dev='vnet17'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

I'll acknowledge:

  • RAID5 + SSDs may be overkill. But I've had SSDs fail on me before, and I wanted the freedom to hot-swap one if it failed.

  • Given that they are SSDs, regularly running raid-check is probably unnecessary. I sort of like the idea of a semi-annual check, though I'm willing to give that up. But if I don't know what caused the problem, I won't know whether the set-up will remain usable if I ever have to swap in and resync a new SSD (sketched below).
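
For reference, the swap-and-resync I'm worried about would look roughly like this; a sketch with illustrative device names, not something I've rehearsed on this array:

# mark the failed member and pull it from the array
mdadm /dev/md127 --fail /dev/sdb1 --remove /dev/sdb1
# physically swap the SSD, recreate the partition, then re-add:
mdadm /dev/md127 --add /dev/sdb1
cat /proc/mdstat    # watch the rebuild progress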

It's the classic sysadmin question: it worked before, so why doesn't it work now?

Answer (from the original poster, William Seligman):
I never solved this problem. In the end, I copied all the files off the drives, and re-built the RAID as a RAID10 instead of a RAID5. It cost me about 4TB of drive storage, but the system has been speeding along ever since.
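
For reference, the rebuild amounted to something like the following after copying the images off; a sketch assuming the same four partitions (check mdadm's man page before running anything this destructive):

# DESTRUCTIVE: stop the old RAID5 and create a RAID10 across the same members
mdadm --stop /dev/md127
mdadm --create /dev/md127 --level=10 --raid-devices=4 /dev/sd[a-d]1

RAID10 across four 4TB drives yields about 8TB usable versus RAID5's 12TB, which matches the roughly 4TB cost mentioned above.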