Score:0

How to query linux kernel which storage-related operations are currently being run on the level of FS / block layer / SATA controller?

cn flag

Every once in a while, our Linux LAMP server (using PHP-FPM, XFS on thin LVM on HW RAID, Centos8) becomes inaccessible and stops responding to HTTP(S) requests.

Via centralized logging we found out that in those cases, load average quickly shoots up to hundreds, while more and more processes (systemd-journald, php processes, kernel xfs/dm threads...) get into a D state. According to iostat and pidstat, CPU and disk are not loaded much at all while load average hovers around 170, which is quite strange. From htop/ps output, there is no single or group of rogue processes that would explain this behaviour. It's just standard processes that seem to encounter some kind of "road block".

The only other strange thing with disk monitoring is that during those overload events, iostat intermitently reports quite high w_await for partition /var (2500-5000ms, while other partitions like /var/log, /var/lib/mysql mostly do not get over 10ms). This partition should be quiet most of the time, so it is not clear why iostat reports such large w_await times there.

The only solution then is to power cycle the server.

This happens on two servers of the same kind, never on others. It seems to be some kind of FS/block layer/controller/disk malfunction; lots of processes suddenly start waiting for disk or something else in the kernel, but according to iotop/iostat, disk is not doing much.

Is there a way to query Linux kernel FS/block layer/controller driver what exactly they are doing with storage and on behalf of which process? Standard tools like iotop/iostat tell me only names of I/O active processes and disk partition activity, but not which processes access which disk partition and what exactly they are doing there.

Wilson Hauck avatar
jp flag
Posting your first page of htop would be quite informative.
cn flag
It's hard to get that, since the overload happens very fast and then it's not possible to log in. However from our logs of ps and iostat output it's clear that neither CPU nor disk is getting hammered. The greatest CPU and disk consumer is MariaDB but it doesn't seem to be the cause, as logs of SQL command "show full processlist;' show running queries are either very easy, or not present at all (empty query processlist).
Wilson Hauck avatar
jp flag
Please post htop and SHOW FULL PROCESSLIST; whenever you can get it, busy or not. Thanks
Score:2
ua flag

In situations like this, I find that it helps to throttle the number of connections higher up the stack.

When more than, say, 100 active processes are running, they stumble over each other. They are vying for resources (CPU, etc). The net effect is that all processes run slower, sometimes to the point where you feel like the only solution is to reboot the server.

In the case of MariaDB, I recommend turning on the slowlog so that you can identify the query that is having the most impact on the system. Then speed it up. If you want help, provide the query, its Explain and Create Table. More: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog

Speeding up a few queries is likely to decrease the 170 Load Average and I/O, thereby alleviating the bottleneck.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.