- Operating system: CentOS 7.9 (on both host and guests)
- Host specs:
- CPU - AMD EPYC 7502P 32-Core Processor
- RAM - 250 GB (only 22 GB used by all VMs)
- Disk - Four Samsung SSD 870 QVO 4TB, grouped in an mdadm RAID5
This system runs 12 virtual machines, which between them are allocated 19 of the 64 hardware threads the CPU provides. One of those VMs is our site's mail server, which is critically important to our users.
For the past two years, this system has worked just fine. Over this time I gradually added VMs to the set-up, the most recent one roughly a month ago.
About five days ago, all the VMs started running slowly. The most visibly affected was the mail-server VM (four cores, 5 GB RAM allocated with 2.6 GB in use, 20 GB storage). I traced the correlation to CentOS's standard weekly RAID check on the host, which I hadn't even known existed until now.
When I cancelled the RAID check, the VMs' speed went back to normal:
echo frozen > /sys/devices/virtual/block/md127/md/sync_action
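As I understand the md sysfs interface, the check can be kicked off again later by writing check back to the same file:
echo check > /sys/devices/virtual/block/md127/md/sync_action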
I tried slowing down the RAID5 check, following hints from here and the contents of /usr/sbin/raid-check:
echo 1000 > /proc/sys/dev/raid/speed_limit_min   # limits are in KB/s per device
echo 1000 > /proc/sys/dev/raid/speed_limit_max
echo idle > /sys/devices/virtual/block/md127/md/sync_action
ps -elf | grep resync
# Noting that the PID was 59065
renice -n 15 -p 59065    # lower the sync thread's CPU priority
ionice -c3 -p 59065      # put it in the idle I/O scheduling class
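In case it matters, the same throttle can presumably be made persistent across reboots via sysctl; the dev.raid.* keys map onto the /proc/sys/dev/raid/ files above, and the file name here is just my own choice:
# keep the md check/resync rate throttled after a reboot
cat > /etc/sysctl.d/90-raid-throttle.conf <<'EOF'
dev.raid.speed_limit_min = 1000
dev.raid.speed_limit_max = 1000
EOF
sysctl -p /etc/sysctl.d/90-raid-throttle.conf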
It didn't help, even though the RAID5 check itself was now crawling:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdc1[2] sdd1[4] sda1[0] sdb1[1]
      11720570880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=============>.......]  check = 69.1% (2700123556/3906856960) finish=5411.6min speed=3716K/sec
      bitmap: 6/30 pages [24KB], 65536KB chunk
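(The numbers are at least self-consistent: the remaining 3906856960 − 2700123556 ≈ 1.2 billion 1K blocks, at 3716K/sec, works out to roughly 325,000 seconds, i.e. the reported ~5400 minutes.)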
The symptom remained: while the SSD-based RAID5 check was running at any speed, the load on both the host and the mail-server VM would eventually climb, and both would slow to the point of being unusable. Running top simultaneously on host and guest made the sequence clear: the host's load would rise first, then the mail server's.
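For anyone who wants numbers rather than load averages, the same pattern should show up as per-device latency on the host (iostat comes from the sysstat package; the device list matches my array members):
iostat -x sda sdb sdc sdd md127 5
The idea being that await and %util climbing on the four member disks while the guests stall would confirm the array itself is the bottleneck.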
A typical libvirt disk definition for my VMs is:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/xen/images/imagefile.qcow2'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
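(Worth noting, I think: with cache='none' the guest I/O goes to the array via O_DIRECT, bypassing the host page cache, so every guest read and write competes directly with the check's streaming reads on the same four SSDs.)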
A typical guest network interface is:
<interface type='bridge'>
  <mac address='52:54:00:59:83:0a'/>
  <source bridge='bridge0'/>
  <target dev='vnet17'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
I'll acknowledge:
- RAID5 + SSDs may be overkill. But I've had SSDs fail on me before, and I wanted the freedom to hot-swap one if it failed.
- Given they are SSDs, regularly running raid-check is probably unnecessary. I sort of like the idea of a semi-annual check, though I'm willing to give that up (a sketch of how I'd disable the weekly job follows this list). But if I don't know what caused the problem, I won't know whether the set-up will remain usable if I ever have to swap in and resync a new SSD.
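For reference, on CentOS 7 the weekly run is scheduled from /etc/cron.d/raid-check, and the /usr/sbin/raid-check script honours /etc/sysconfig/raid-check, so I'd expect something like the following to disable it, leaving me to trigger a scrub by hand twice a year (untested as yet):
# stop the scheduled weekly scrub
sed -i 's/^ENABLED=yes/ENABLED=no/' /etc/sysconfig/raid-check
# start a scrub manually when wanted
echo check > /sys/devices/virtual/block/md127/md/sync_action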
It's the classic sysadmin question: it worked before, so why doesn't it work now?