We have a really strange bug where a Yocto operating system running on a Raspberry Pi will 'lock up' because of disk IO wait.
Scenario:
- the operating system runs read-only and has no swap
 
- there is a tmpfs filesystem for anything the OS needs to write to (/var, /log, etc.)
 
- the tmpfs defaults to half of the available 2GB of RAM (see the fstab sketch after this list)
 
- there is a USB hard drive connected for storing large MP4 files
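For context, a tmpfs mount is capped with a size= mount option; a sketch of what explicitly limiting it would look like in fstab (the mount point and 256m value are only illustrative, not our actual configuration):

# hypothetical fstab entry capping the volatile tmpfs at 256 MB instead of the 50% default
tmpfs  /var/volatile  tmpfs  defaults,size=256m  0  0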
 
After the system has been running a Python program that interacts with a Google Coral USB accelerator for a while, the output of top is:

So the load average is very high but actual CPU usage is low. We believe this is because processes are waiting on IO to the USB hard drive.
Other times we will see even higher cache usage:
Mem: 1622744K used, 289184K free, 93712K shrd, 32848K buff, 1158916K cached
CPU:   0% usr   0% sys   0% nic  24% idle  74% io   0% irq   0% sirq
Load average: 5.00 4.98 4.27 1/251 2645
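We assume the load average is being driven by tasks stuck in uninterruptible sleep (state D) while they wait for the disk; roughly how we would list them (exact ps flags depend on whether the image uses BusyBox or procps):

# list tasks in uninterruptible sleep (D state), i.e. blocked on IO
ps -eo state,pid,args | awk '$1 == "D"'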
The filesystem looks fairly normal:
root@ifu-14:~# df -h
Filesystem                Size      Used Available Use% Mounted on
/dev/root                 3.1G    528.1M      2.4G  18% /
devtmpfs                804.6M      4.0K    804.6M   0% /dev
tmpfs                   933.6M     80.0K    933.5M   0% /dev/shm
tmpfs                   933.6M     48.6M    884.9M   5% /run
tmpfs                   933.6M         0    933.6M   0% /sys/fs/cgroup
tmpfs                   933.6M     48.6M    884.9M   5% /etc/machine-id
tmpfs                   933.6M      1.5M    932.0M   0% /tmp
tmpfs                   933.6M     41.3M    892.3M   4% /var/volatile
tmpfs                   933.6M     41.3M    892.3M   4% /var/spool
tmpfs                   933.6M     41.3M    892.3M   4% /var/lib
tmpfs                   933.6M     41.3M    892.3M   4% /var/cache
/dev/mmcblk0p1           39.9M     28.0M     11.9M  70% /uboot
/dev/mmcblk0p4          968.3M      3.3M    899.0M   0% /data
/dev/mmcblk0p4          968.3M      3.3M    899.0M   0% /etc/hostname
/dev/mmcblk0p4          968.3M      3.3M    899.0M   0% /etc/NetworkManager
/dev/sda1               915.9G     30.9G    838.4G   4% /mnt/sda1
When it all 'locks up' we notice that the USB hard drive becomes completely unresponsive (running ls on it does nothing and just freezes).
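We assume the kernel can tell us exactly where those tasks are stuck; if magic SysRq is enabled in the kernel, something like this should dump the stack traces of all blocked (D state) tasks into the kernel log:

# enable magic SysRq, then dump blocked task stack traces to dmesg
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 80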
In the dmesg logs we have noticed the following lines (pasted as an image to preserve colours):

Here is a full output of dmesg after we start getting these errors:
https://pastebin.com/W7k4cp35
Our theory is that when the software running on the system does something with a big file (50 MB+), such as moving it around on the USB hard drive, the system somehow runs out of memory.
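If that theory is right, it should be reproducible by writing a large file to the USB drive while watching dirty and writeback pages grow in /proc/meminfo; a rough sketch (the file name and ~200 MB size are arbitrary):

# write ~200 MB to the USB drive and watch dirty pages accumulate
dd if=/dev/zero of=/mnt/sda1/iotest.bin bs=1M count=200 &
while kill -0 $! 2>/dev/null; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done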
We are really unsure how to proceed. We found this blog post: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/, which seems to describe the same problem and suggests lowering vm.dirty_ratio and vm.dirty_background_ratio so that dirty pages are flushed to disk more often.
Is that the right approach?
The current settings are vm.dirty_ratio = 20 and vm.dirty_background_ratio = 10.
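For reference, the kind of change the blog suggests would look something like this; the values below are guesses to illustrate the idea, not settings we have tested:

# flush dirty pages to disk earlier and cap how much dirty data can accumulate
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# or use absolute byte limits instead of ratios (setting these zeroes the ratio variants)
sysctl -w vm.dirty_background_bytes=16777216   # 16 MB
sysctl -w vm.dirty_bytes=50331648              # 48 MB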
Could a relatively slow USB hard drive require changing this? Can someone explain what is going on?