I can reproduce the problem consistently, within minutes, but I can't find any helpful messages in the logs. The problem first occurred with a RocketRaid 3740C HBA and the proprietary nvidia driver, and it still occurs now with an LSI/Broadcom 9305-16i HBA and the nouveau driver. I have flashed the Broadcom card to the latest firmware and BIOS. The HBA is connected to 9 drives (of 10; the RAID 6 array is degraded until the replacement disk arrives). The network card is a Mellanox ConnectX-3 running 10G Ethernet over fibre. Before I exchanged the RocketRaid card, I remember seeing the proprietary driver write messages to the kernel log about getting 20-something when expecting 18, just before the crash. I can't seem to find those messages anymore, though (pointers on how to find them appreciated!).
Steps to Reproduce:
Write a lot of data to the array (write speeds are > 700 MB/s). For example, open 3 scp sessions from another computer and write 3 files in parallel at ~250 MB/s each. In less than five minutes the Ubuntu screen is frozen and ssh is unresponsive. A hard reset appears to be the only option, after which mdadm thinks the array is dirty (even though the event count is the same on all drives). mdadm --assemble --force works, but then the array spends a day re-syncing.
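The parallel-copy step can be sketched as a small script run from the other computer. This is only a rough sketch under stated assumptions: `parallel_copy` is a hypothetical helper, and the `TARGET` host, `DEST_DIR` path, and file names are placeholders rather than values from the actual setup.

```shell
#!/bin/sh
# Hypothetical reproduction helper, run from the *other* computer.
# TARGET and DEST_DIR are placeholders; point them at the machine
# and the directory on the RAID array being tested.
TARGET="${TARGET:-user@nas.example}"
DEST_DIR="${DEST_DIR:-/path/on/array}"
COPY_CMD="${COPY_CMD:-scp}"

# Start one copy per file in the background, then wait for all of them.
# Returns non-zero if any single copy failed.
parallel_copy() {
  pids=""
  for f in "$@"; do
    "$COPY_CMD" "$f" "${TARGET}:${DEST_DIR}/" &
    pids="$pids $!"
  done
  rc=0
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}

# Only copy when file names are given on the command line.
if [ "$#" -gt 0 ]; then
  parallel_copy "$@"
fi
```

Saved under a name of your choosing, it would be invoked with three large files as arguments to approximate the three parallel scp sessions described above.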
I'm about at my wits' end with this. I'm considering seeing what happens with TrueNAS or AlmaLinux. I'm also wondering about the motherboard (ASRock X570 Taichi). The system seems to be fine under any load that does not involve extensive writes to the array, including CPU (5700X) load and intense network traffic (I can repeatedly send/receive tens of gigabytes of network traffic and get ~70 Gbit/s bandwidth).
Edit per comment from @heynnema:
$ sudo free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        12Gi       442Mi       372Mi        50Gi        49Gi
Swap:          975Mi        44Mi       931Mi
$ sudo sysctl vm.swappiness
vm.swappiness = 60
phil@omni:~$ sudo dmidecode -s bios-version
P4.30
Tasks: 428 total, 2 running, 426 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34.8 us, 2.0 sy, 0.0 ni, 61.1 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st
MiB Mem : 64242.9 total, 1192.4 free, 14388.3 used, 48662.3 buff/cache
MiB Swap: 976.0 total, 915.5 free, 60.5 used. 48780.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15919 fooo 20 0 4083880 3.6g 12520 S 312.5 5.7 77:36.68 chia
15560 fooo 20 0 4083904 3.6g 12544 S 93.8 5.7 77:43.99 chia
4764 root 20 0 0 0 0 S 18.8 0.0 93:17.25 md0_raid6
1375 unifi 20 0 4028748 180588 21888 S 6.2 0.3 0:04.47 launcher
2154 unifi 20 0 1078716 132904 39776 S 6.2 0.2 0:25.11 mongod
4776 root 20 0 0 0 0 R 6.2 0.0 18:39.73 md0_resync
15419 root 20 0 0 0 0 I 6.2 0.0 0:01.07 kworker/0:1-events
1 root 20 0 168296 11728 7896 S 0.0 0.0 0:01.02 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-kblockd
9 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
10 root 20 0 0 0 0 S 0.0 0.0 0:06.43 ksoftirqd/0
11 root 20 0 0 0 0 I 0.0 0.0 0:04.24 rcu_sched
12 root rt 0 0 0 0 S 0.0 0.0 0:00.02 migration/0
13 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/0
$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/mapper/vgubuntu-root / ext4 errors=remount-ro 0 1
# /boot/efi was on /dev/nvme0n1p1 during installation
UUID=3C3E-4180 /boot/efi vfat umask=0077 0 1
/dev/mapper/vgubuntu-swap_1 none swap sw 0 0
#192.168.1.192:/storage /storage nfs defaults 0 0
UUID=ddc550d2-7f93-4ecf-ac2e-d754c5eee6c9 /storage xfs defaults 0 0
UUID=BCB65C49B65C05F4 /var/ExChia1 ntfs defaults 0 0
UUID=3A10-3FE7 /var/ExChia4 exfat defaults 0 0
UUID=0EF0-7586 /var/ExChia5 exfat defaults 0 0
UUID=3837-E26A /var/ExChia6 exfat defaults 0 0
UUID=73338b75-d356-4e7f-9757-948f1078f04e /var/ExChia13 xfs defaults 0 0