
Disk IO issue with high Page writeback


I have been trying to figure out what's going on with some of our servers. These are KVM hosts that each run 5-8 VMs, with at least 64GB of RAM and 10-20 cores. They run Ubuntu 18.04 LTS with the 4.15.0-142-generic kernel and a LUKS-encrypted ext4 root partition.

Randomly, some of these servers become very slow. Everything points to disk IO, but pidstat, iostat, and vmstat show very little workload actually consuming IO. In short, the system enters a weird lock-up state where everything becomes slow and unresponsive.

One thing seems to be common to the unhealthy servers: the Writeback value in /proc/meminfo climbs to about 2.5GB and then stays stuck at that value. This might be a symptom or the cause, I really don't know. I'm experimenting with reducing dirty_ratio, but I can't say yet whether it helps.

Dirty:              1504 kB
Writeback:       2537628 kB 
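
To tell a stuck Writeback value apart from one that is merely high, it may help to sample it over time. The following is a minimal sketch, not a definitive diagnostic: the function names, the 1 GiB threshold, and the sampling interval are my own assumptions and should be tuned to the host.

```python
import re
import time


def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def writeback_is_stuck(samples_kb, min_kb=1024 * 1024, tolerance_kb=0):
    """Heuristic (assumed thresholds): every sample is at least min_kb and
    the samples differ from each other by no more than tolerance_kb."""
    if len(samples_kb) < 2:
        return False
    spread = max(samples_kb) - min(samples_kb)
    return min(samples_kb) >= min_kb and spread <= tolerance_kb


def collect_writeback_samples(path="/proc/meminfo", count=6, interval_s=10.0):
    """Read the Writeback field `count` times, sleeping interval_s between reads."""
    samples = []
    for i in range(count):
        with open(path) as handle:
            samples.append(parse_meminfo(handle.read()).get("Writeback", 0))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

On an affected host, calling `writeback_is_stuck(collect_writeback_samples())` would return True if Writeback stayed above 1 GiB and did not change at all for about a minute. On a healthy host under heavy write load, the value would normally fluctuate between samples.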

Here is a call trace of the stuck tasks, collected using SysRq-w: Call trace for Stuck tasks

I also tried to tie the issue to specific hardware, but it affects servers with different disk hardware.

A restart seems to fix the issue temporarily, but it sometimes comes back after a few days. Any ideas would be helpful. Let me know if you need more information. Thanks in advance.

