Score:0

Computer with 12+1 GPUs (12 connected via Thunderbolt 3) crashes when using the eGPUs

We've set up an AMD Ryzen computer running Ubuntu 21.10 and connected 6 Akitio Duo eGPU enclosures, each holding 2x NVIDIA 4GB cards, through 2x Thunderbolt hubs. A 13th card, a 16GB NVIDIA RTX A4000 that can run 4 jobs in parallel, sits directly in the PCIe slot.

This rig runs 12+4 threads of Alphafold2 (https://github.com/deepmind/alphafold#running-alphafold), and for the most part it runs without issues for a while.

But every so often, roughly once every 24 hours on average, the computer locks up completely. If we only run 4x Alphafold2 on the 16GB card, the computer is stable for weeks, so the issue seems to be with the jobs on the Akitio eGPU cards.

Is there a log or tool that can tell us why it's crashing? The computer stays powered on but is completely unresponsive, and only a reboot via the physical power button recovers it.

Looking at /var/log/kern.log doesn't seem to show anything indicative of the issue.
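
For reference, here is a minimal sketch of the kind of scan we mean when we say the log shows nothing. The pattern list is our own assumption about which kernel messages usually accompany GPU or PCIe trouble (NVIDIA Xid reports, PCIe AER errors, Thunderbolt driver messages); it is not exhaustive, and `find_suspect_lines` and `scan_log` are just illustrative helper names. A hard lockup may also prevent the last messages from ever being written to disk, so an empty result does not rule these errors out.

```python
import re

# Substrings that often appear in kernel messages around GPU or PCIe faults.
# This list is an assumption for illustration, not a complete catalogue.
PATTERNS = [
    r"NVRM: Xid",           # NVIDIA driver error reports
    r"fallen off the bus",  # NVIDIA driver lost contact with a GPU
    r"PCIe Bus Error",      # PCIe Advanced Error Reporting details
    r"AER:",                # PCIe Advanced Error Reporting summary lines
    r"thunderbolt",         # Thunderbolt driver messages
]

MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if MATCHER.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/kern.log"):
    """Scan a kernel log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Calling `scan_log()` (possibly with `sudo`, depending on log permissions) returns the matching lines with their line numbers, and `scan_log("/var/log/kern.log.1")` checks the rotated log that may cover the time of the lockup.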

One theory we've read about is that the PCIe lanes are overburdened, and that the 16 threads trip each other up with so many PCIe devices connected. Since this machine is not used for anything else, would disabling the 'Sound' or 'USB 3.1' PCIe lanes solve the issue? If so, how?
