We've set up an AMD Ryzen computer running Ubuntu 21.10 and connected 6 Akitio Duo eGPU enclosures, each holding 2x NVIDIA 4GB cards, plus a 13th card, a 16GB NVIDIA RTX A4000, plugged directly into a PCIe slot.
We have this rig running 16x threads of Alphafold2 (https://github.com/deepmind/alphafold#running-alphafold), and for the most part it runs without issues for a while.
However, roughly once every 24 hours on average, the computer locks up completely. If we run only 4x Alphafold2 on the 16GB card, the computer is stable for weeks, so the issue seems to be with the jobs on the Akitio eGPU cards.
Is there a log anywhere that can tell us why it's crashing (the computer stays powered on but is completely unresponsive, and only a reboot with the physical power button brings it back)?
Looking at /var/log/kern.log doesn't seem to show anything indicative of the issue.
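Since a hard lockup may leave nothing in the last few seconds of the log, it can help to filter the whole file for messages that tend to accompany GPU or PCIe trouble rather than reading only the tail. A minimal sketch of that idea follows. The function name scan_gpu_errors is hypothetical, and the pattern list (the NVIDIA driver's "NVRM: Xid" lines, "fallen off the bus", and the kernel's PCIe AER messages) is an assumption about what might be relevant here, not an exhaustive set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print log lines that often accompany GPU or PCIe faults.
# The pattern list is an assumption, not a complete catalogue of error messages.
scan_gpu_errors() {
    # Default to the kernel log unless another file is given as the first argument.
    local logfile="${1:-/var/log/kern.log}"
    # -E: extended regular expressions, -i: case-insensitive matching.
    grep -E -i 'NVRM: Xid|fallen off the bus|AER:|PCIe Bus Error' "$logfile"
}
```

It could be run as scan_gpu_errors /var/log/kern.log, or pointed at /var/log/syslog or a rotated copy such as kern.log.1. An empty result only means none of these particular patterns appeared; it does not rule out a hardware fault.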
EDIT:
Running dmidecode when only the 16GB card plus 2 Akitios are plugged in gives the following:
# dmidecode --type 9 | egrep "Usage|Type|Designation"
Designation: PCIEX16_1
Type: x16 PCI Express
Current Usage: Available
Designation: PCIEX16_2
Type: x8 PCI Express
Current Usage: In Use
Designation: PCIEX1_1
Type: x1 PCI Express
Current Usage: Available
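The three-line records above are easier to compare between configurations (for example, with all 6 Akitios attached) when each slot is collapsed onto one line. A rough sketch, assuming the filtered output always lists Designation, Type and Current Usage in that order for each slot; the function name summarize_slots is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read filtered dmidecode lines on stdin and print
# one "designation | type | usage" line per slot.
summarize_slots() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }            # drop leading indentation from the key
        $1 == "Designation"   { name = $2 }   # remember the slot name
        $1 == "Type"          { type = $2 }   # remember the slot type
        $1 == "Current Usage" { print name " | " type " | " $2 }
    '
}
```

Usage would look like: dmidecode --type 9 | egrep "Usage|Type|Designation" | summarize_slots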
Thanks @matigo for the suggestion to look at syslog. For the latest crash, it shows the bit above the '@^' bit; the hard reboot was at 10:02.