
SLURM / NFS-based computing cluster with uninterruptible sleep issues (state: D)


Context:

We have a computing cluster based on 7 servers, running Debian 11:

  • a storage server (HDD NAS, ~500 TB, RAID5, LVM)
  • a front-end server ("frontal"), running SLURM and nfs-common
  • 5 nodes on which the storage is mounted through NFS.

When business users run SLURM jobs on frontal, Python threads are distributed to the nodes, which read and write data on their shared NFS mount.

Everything was working fine until last week, when we lost control of "frontal": we could not interact with it through SSH or the local console. We decided to reboot it, and took the opportunity to upgrade its kernel from 5.10.140 to 5.10.162.

Since then, SLURM jobs spend most of their time in the "uninterruptible sleep" state ("D"), and most of them fail.

We have rolled the kernel back to version 5.10.140, but the problem remains.

Do you have any ideas?

shodanshok
Can you share the output of `iostat -x -k 1`, `nfsiostat 1` and `nfsstat -s` taken on the NFS server when you have jobs in `D` state?
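One way to tell when jobs are actually in `D` state, so that these commands are run at the right moment, is to filter the output of `ps` on a compute node. The following is only a minimal sketch: the function names `parse_d_state` and `list_d_state` are hypothetical, and it assumes the procps version of `ps` that ships with Debian.

```python
import subprocess


def parse_d_state(ps_output: str) -> list[dict]:
    """Return the processes whose state starts with 'D' in `ps` output.

    Expects the four columns produced by `ps -eo pid,stat,wchan:32,comm`,
    including the header row.
    """
    blocked = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore malformed or truncated rows
        pid, stat, wchan, comm = parts
        if stat.startswith("D"):
            blocked.append(
                {"pid": int(pid), "stat": stat, "wchan": wchan, "comm": comm}
            )
    return blocked


def list_d_state() -> list[dict]:
    """Run `ps` on the local machine and return its D-state processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan:32,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_d_state(result.stdout)
```

Calling `list_d_state()` on a node returns one dictionary per blocked process. The `wchan` column shows the kernel function the process is sleeping in, which may hint at whether the wait is on the NFS client side or elsewhere.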
Grégory Hare
Thank you for your answer! We are currently running a RAID check on the storage to make sure the disks are not to blame. I plan to run these commands as soon as the RAID is available again.