I have been trying to figure out what's going on with some of our servers. These are KVM hosts running 5-8 VMs each, with at least 64 GB of RAM and 10-20 cores. They run Ubuntu 18.04 LTS with the 4.15.0-142-generic kernel and a LUKS-encrypted ext4 root partition.
Randomly, some of these servers become very slow. All indications point to disk I/O, yet pidstat, iostat, and vmstat show very little workload actually consuming I/O. In short, the system enters a weird lock-up state where everything becomes slow and unresponsive.
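One extra check that can help separate "no I/O is being issued" from "I/O was issued but never completes" is the per-device in-flight counter in /proc/diskstats. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the standard layout where the ninth statistic after the device name is the number of I/Os currently in progress.

```python
def inflight_by_device(diskstats_text):
    """Map device name -> number of I/Os currently in progress.

    Each /proc/diskstats line is: major, minor, name, then the stats.
    The in-progress counter is the 9th stat, i.e. index 11 after split().
    """
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        result[fields[2]] = int(fields[11])
    return result


def read_inflight(path="/proc/diskstats"):
    """Read the live counters from the running kernel."""
    with open(path) as f:
        return inflight_by_device(f.read())
```

Calling read_inflight() a few times, some seconds apart, shows whether a device (including the dm-crypt device backing the LUKS volume) reports a non-zero in-flight count that never drains while iostat shows almost no throughput.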
One thing seems to be common to the unhealthy servers: Writeback in /proc/meminfo climbs to about 2.5 GB and stays stuck at that value without changing. This might be a symptom or the cause; I really don't know. I'm experimenting with reducing dirty_ratio, but can't say yet whether it helps.
Dirty: 1504 kB
Writeback: 2537628 kB
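To tell a Writeback value that is merely high from one that is genuinely frozen, the counter can be sampled several times and compared. This is a rough sketch rather than a tested monitoring tool: the helper names are hypothetical, and the 1 GiB floor and the sampling interval are arbitrary assumptions.

```python
import time


def parse_meminfo(text):
    """Return the Dirty and Writeback fields of /proc/meminfo in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if name in ("Dirty", "Writeback") and parts:
            values[name] = int(parts[0])
    return values


def writeback_is_stuck(samples_kb, min_kb=1024 * 1024):
    """True if all samples are identical and at least min_kb.

    min_kb defaults to 1 GiB, an arbitrary assumption.
    """
    return (
        len(samples_kb) >= 2
        and min(samples_kb) >= min_kb
        and len(set(samples_kb)) == 1
    )


def sample_writeback(count=5, interval_s=2.0, path="/proc/meminfo"):
    """Collect `count` Writeback readings, `interval_s` seconds apart."""
    samples = []
    for _ in range(count):
        with open(path) as f:
            samples.append(parse_meminfo(f.read())["Writeback"])
        time.sleep(interval_s)
    return samples
```

On an affected host, writeback_is_stuck(sample_writeback()) would be expected to return True if the counter really does not move at all between samples.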
Here is a call trace of the stuck tasks, collected using SysRq-w:
Call trace for Stuck tasks
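SysRq-w dumps tasks in uninterruptible (D) state to the kernel log. A lighter way to see which processes are in that state at a given moment is to read /proc/<pid>/stat directly. The sketch below uses hypothetical helper names and only looks at each process's main thread, not at the individual threads under /proc/<pid>/task.

```python
import os


def parse_stat(stat_text):
    """Split one /proc/<pid>/stat record into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so the record is split on the last ')'.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if state == "D":
            found.append((pid, comm))
    return found
```

Running blocked_tasks() repeatedly during a slowdown shows whether the same processes stay blocked for long periods, which can then be matched against the names in the SysRq-w trace.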
I have also tried to tie the issue to specific hardware, but it affects hosts with different disk hardware.
A restart seems to fix the issue temporarily, but it sometimes comes back after a few days.
Any ideas would be helpful. Let me know if you need more information.
Thanks in advance