We've set up an AMD Ryzen computer running Ubuntu 21.10 and attached 6 Akitio Duo eGPU enclosures, each holding 2x NVIDIA 4GB cards, through 2x Thunderbolt hubs. A 13th card, a 16GB NVIDIA RTX A4000 that can run 4 jobs in parallel, sits directly in the PCIe slot.
We have this rig running 12+4 threads of Alphafold2 (https://github.com/deepmind/alphafold#running-alphafold), and for the most part it runs without issues for a while.
But every once in a while, roughly once every 24 hours on average, the computer locks up completely. If we only run the 4x Alphafold2 jobs on the 16GB card, the computer is stable for weeks, so the issue seems to be with the jobs on the Akitio eGPU cards.
Is there a log or tool that can tell us why it's crashing? The computer stays powered on but is completely unresponsive, and only a reboot via the physical power button recovers it.
Looking at /var/log/kern.log doesn't seem to show anything indicative of the issue.
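To make that check less manual, here is a minimal sketch of how the log could be scanned for messages that often accompany GPU or PCIe trouble. The pattern list is our own assumption, not an exhaustive catalogue, and `find_suspicious` is just a name we made up. A hard lockup may also prevent the last messages from ever reaching disk, so an empty result would not rule these causes out. Reading the log may require root or membership in the `adm` group.

```python
import re
from pathlib import Path

# Patterns that commonly accompany GPU or PCIe trouble in kernel logs.
# This list is an assumption, not an exhaustive catalogue.
PATTERNS = [
    r"NVRM: Xid",                        # NVIDIA driver error reports
    r"fallen off the bus",               # GPU no longer reachable over PCIe
    r"AER:",                             # PCIe Advanced Error Reporting
    r"thunderbolt",                      # Thunderbolt controller messages
    r"soft lockup",                      # CPU stuck in kernel code
    r"hung_task|blocked for more than",  # tasks stuck in uninterruptible sleep
]
SUSPICIOUS = re.compile("|".join(PATTERNS), re.IGNORECASE)


def find_suspicious(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    return [
        (n, line.rstrip("\n"))
        for n, line in enumerate(lines, start=1)
        if SUSPICIOUS.search(line)
    ]


if __name__ == "__main__":
    log = Path("/var/log/kern.log")
    try:
        with log.open(errors="replace") as fh:
            for n, text in find_suspicious(fh):
                print(f"{n}: {text}")
    except OSError as exc:
        # Missing file or insufficient permissions.
        print(f"could not read {log}: {exc}")
```

The rotated logs (for example /var/log/kern.log.1) can be checked the same way by changing the path.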
One possibility we've read about is that the PCIe lanes are overburdened, and that the 16 threads trip each other up with so many PCIe devices connected. Since this machine is not used for anything else, would disabling the 'Sound' or 'USB 3.1' PCIe lanes solve the issue? If so, how?
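To see what is actually sharing the bus, here is a rough sketch that lists every PCI device with its class and negotiated link state, read from the standard sysfs attributes `class`, `current_link_speed`, `current_link_width` and `max_link_width`. The function names and the small class-code table are our own, and the script only reports; it does not disable anything. A GPU link that negotiated fewer lanes than its maximum would be one thing to look for, though we are not certain that alone would explain a lockup.

```python
from pathlib import Path

# PCI class-code prefixes (base class + subclass) for the devices of interest.
CLASS_NAMES = {
    "0x0300": "VGA controller",
    "0x0302": "3D controller",
    "0x0403": "audio device",
    "0x0c03": "USB controller",
}


def read_attr(dev, name):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return None


def list_pci_devices(root="/sys/bus/pci/devices"):
    """Collect address, class and PCIe link state for every PCI device."""
    devices = []
    root = Path(root)
    if not root.is_dir():
        return devices
    for dev in sorted(root.iterdir()):
        cls = read_attr(dev, "class") or ""
        devices.append({
            "address": dev.name,
            # Fall back to the raw class code when it is not in the table.
            "kind": CLASS_NAMES.get(cls[:6], cls),
            "speed": read_attr(dev, "current_link_speed"),
            "width": read_attr(dev, "current_link_width"),
            "max_width": read_attr(dev, "max_link_width"),
        })
    return devices


if __name__ == "__main__":
    for d in list_pci_devices():
        print(d["address"], d["kind"], d["speed"],
              f"x{d['width']} of x{d['max_width']}")
```

Devices that do not expose link attributes simply show `None` for speed and width.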