Every once in a while, one of our Linux LAMP servers (PHP-FPM, XFS on thin-provisioned LVM on hardware RAID, CentOS 8) becomes inaccessible and stops responding to HTTP(S) requests.
Via centralized logging we found that in those cases the load average quickly shoots up into the hundreds while more and more processes (systemd-journald, PHP processes, kernel xfs/dm threads, ...) enter the D (uninterruptible sleep) state. According to iostat and pidstat, neither the CPU nor the disks are loaded much while the load average hovers around 170, which is quite strange. The htop/ps output shows no single rogue process or group of processes that would explain this behaviour; it is just ordinary processes that seem to hit some kind of roadblock.
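For reference, this is roughly how we dump the blocked tasks during such an event: a minimal sketch (assuming it is run as root so that /proc/<pid>/stack is readable, and that the kernel exposes that file at all) that lists D-state processes together with the kernel function they are sleeping in:

```python
#!/usr/bin/env python3
# Minimal sketch: list tasks currently in uninterruptible sleep (D state)
# together with the kernel function they are blocked in, read from procfs.
# Run as root so /proc/<pid>/stack is readable.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
            # spaces, so split after the closing parenthesis.
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state != "D":
            continue
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip() or "-"
        try:
            with open(f"/proc/{pid}/stack") as f:
                stack = f.read().strip()
        except PermissionError:
            stack = "(kernel stack unreadable, run as root)"
        print(f"PID {pid} ({comm}) blocked in {wchan}")
        print(stack, end="\n\n")
    except (FileNotFoundError, ProcessLookupError):
        continue  # the process exited while we were looking at it
```

During the incidents the stacks mostly point into the filesystem/block layer, which is what made us suspect that area in the first place.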
The only other oddity in the disk monitoring is that during those overload events iostat intermittently reports quite high w_await for the partition mounted at /var (2500-5000 ms, while other partitions such as /var/log and /var/lib/mysql mostly stay under 10 ms). That partition should be mostly idle, so it is not clear why iostat reports such large write latencies there.
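To rule out an iostat artifact, the same number can be derived straight from the kernel counters. A minimal sketch (sampling /proc/diskstats over a 5 s interval; w_await is approximated the same way iostat computes it, as delta of write ticks divided by delta of writes completed):

```python
#!/usr/bin/env python3
# Rough per-device/per-partition w_await computed from /proc/diskstats
# deltas over a short interval, for cross-checking what iostat reports.
import time

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            writes = int(fields[7])        # writes completed
            write_ticks = int(fields[10])  # milliseconds spent writing
            stats[name] = (writes, write_ticks)
    return stats

before = snapshot()
time.sleep(5)
after = snapshot()

for name, (writes, ticks) in sorted(after.items()):
    dw = writes - before.get(name, (0, 0))[0]
    dt = ticks - before.get(name, (0, 0))[1]
    if dw > 0:
        print(f"{name:12s} w_await ~ {dt / dw:7.1f} ms over {dw} writes")
```

The raw counters show the same picture, so the latency spike on the /var partition appears to be real and not a reporting glitch.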
The only way to recover is to power-cycle the server.
This happens only on two servers of the same kind, never on the others. It looks like some kind of filesystem/block-layer/controller/disk malfunction: many processes suddenly start waiting on disk I/O or on something else in the kernel, yet according to iotop/iostat the disks are not doing much.
Is there a way to ask the Linux kernel's filesystem/block layer/controller driver what exactly they are doing with storage, and on behalf of which process? Standard tools such as iotop/iostat only tell me the names of I/O-active processes and the per-partition activity, but not which process is accessing which partition and what exactly it is doing there.
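To illustrate the gap: the closest I can get from procfs alone is per-process byte counters, which show who is doing I/O but not which partition it lands on. A minimal sketch (run as root so other users' /proc/<pid>/io is readable; the 5 s interval and top-15 cutoff are arbitrary choices for illustration):

```python
#!/usr/bin/env python3
# Sample per-process I/O byte counters from /proc/<pid>/io over an interval.
# This shows *which processes* issue I/O, but not *which partition* the I/O
# hits - the missing piece this question is about.
import os, time

def sample():
    io = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                vals = dict(line.split(":") for line in f)
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            io[pid] = (comm, int(vals["read_bytes"]), int(vals["write_bytes"]))
        except (FileNotFoundError, PermissionError, ProcessLookupError):
            continue
    return io

a = sample()
time.sleep(5)
b = sample()

deltas = []
for pid, (comm, r, w) in b.items():
    if pid in a:
        _, r0, w0 = a[pid]
        deltas.append((w - w0, r - r0, pid, comm))

for dw, dr, pid, comm in sorted(deltas, reverse=True)[:15]:
    print(f"{pid:>7} {comm:20s} read {dr:>12d} B  write {dw:>12d} B")
```

What I am looking for is something that would let me correlate these processes with the specific partition (and ideally the kernel code path) they are blocked on.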