Every once in a while, one of our Linux LAMP servers (PHP-FPM, XFS on thin-provisioned LVM on hardware RAID, CentOS 8) becomes inaccessible and stops responding to HTTP(S) requests.
Via centralized logging we found that in those cases the load average quickly shoots up into the hundreds while more and more processes (systemd-journald, PHP processes, kernel xfs/dm threads, ...) enter the D (uninterruptible sleep) state. According to iostat and pidstat, neither the CPU nor the disks are loaded much while the load average hovers around 170, which is quite strange. The htop/ps output shows no single rogue process or group of processes that would explain this behaviour; it is just ordinary processes that seem to hit some kind of roadblock.
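For reference, this is roughly how we dump the blocked tasks during such an event: a minimal sketch (assuming it is run as root so that /proc/<pid>/stack is readable, and that the kernel exposes that file at all) that lists D-state processes together with the kernel function they are sleeping in:

```python
#!/usr/bin/env python3
# Minimal sketch: list tasks currently in uninterruptible sleep (D state)
# together with the kernel function they are blocked in, read from procfs.
# Run as root so /proc/<pid>/stack is readable.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
            # spaces, so split after the closing parenthesis.
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state != "D":
            continue
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip() or "-"
        try:
            with open(f"/proc/{pid}/stack") as f:
                stack = f.read().strip()
        except PermissionError:
            stack = "(kernel stack unreadable, run as root)"
        print(f"PID {pid} ({comm}) blocked in {wchan}")
        print(stack, end="\n\n")
    except (FileNotFoundError, ProcessLookupError):
        continue  # the process exited while we were looking at it
```

During the incidents the stacks mostly point into the filesystem/block layer, which is what made us suspect that area in the first place.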
The only other oddity in the disk monitoring is that during those overload events iostat intermittently reports quite high w_await for the partition mounted at /var (2500-5000 ms, while other partitions such as /var/log and /var/lib/mysql mostly stay under 10 ms). That partition should be mostly idle, so it is not clear why iostat reports such large write latencies there.
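To rule out an iostat artifact, the same number can be derived straight from the kernel counters. A minimal sketch (sampling /proc/diskstats over a 5 s interval; w_await is approximated the same way iostat computes it, as delta of write ticks divided by delta of writes completed):

```python
#!/usr/bin/env python3
# Rough per-device/per-partition w_await computed from /proc/diskstats
# deltas over a short interval, for cross-checking what iostat reports.
import time

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            writes = int(fields[7])        # writes completed
            write_ticks = int(fields[10])  # milliseconds spent writing
            stats[name] = (writes, write_ticks)
    return stats

before = snapshot()
time.sleep(5)
after = snapshot()

for name, (writes, ticks) in sorted(after.items()):
    dw = writes - before.get(name, (0, 0))[0]
    dt = ticks - before.get(name, (0, 0))[1]
    if dw > 0:
        print(f"{name:12s} w_await ~ {dt / dw:7.1f} ms over {dw} writes")
```

The raw counters show the same picture, so the latency spike on the /var partition appears to be real and not a reporting glitch.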
The only way to recover is to power-cycle the server.
This happens only on two servers of the same kind, never on the others. It looks like some kind of filesystem/block-layer/controller/disk malfunction: many processes suddenly start waiting on disk I/O or on something else in the kernel, yet according to iotop/iostat the disks are not doing much.
Is there a way to ask the Linux kernel's filesystem/block layer/controller driver what exactly they are doing with storage, and on behalf of which process? Standard tools such as iotop/iostat only tell me the names of I/O-active processes and the per-partition activity, but not which process is accessing which partition and what exactly it is doing there.
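To illustrate the gap: the closest I can get from procfs alone is per-process byte counters, which show who is doing I/O but not which partition it lands on. A minimal sketch (run as root so other users' /proc/<pid>/io is readable; the 5 s interval and top-15 cutoff are arbitrary choices for illustration):

```python
#!/usr/bin/env python3
# Sample per-process I/O byte counters from /proc/<pid>/io over an interval.
# This shows *which processes* issue I/O, but not *which partition* the I/O
# hits - the missing piece this question is about.
import os, time

def sample():
    io = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                vals = dict(line.split(":") for line in f)
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            io[pid] = (comm, int(vals["read_bytes"]), int(vals["write_bytes"]))
        except (FileNotFoundError, PermissionError, ProcessLookupError):
            continue
    return io

a = sample()
time.sleep(5)
b = sample()

deltas = []
for pid, (comm, r, w) in b.items():
    if pid in a:
        _, r0, w0 = a[pid]
        deltas.append((w - w0, r - r0, pid, comm))

for dw, dr, pid, comm in sorted(deltas, reverse=True)[:15]:
    print(f"{pid:>7} {comm:20s} read {dr:>12d} B  write {dw:>12d} B")
```

What I am looking for is something that would let me correlate these processes with the specific partition (and ideally the kernel code path) they are blocked on.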