
Why can't the GPUs communicate in a multi-GPU server?


This is a Dell PowerEdge R750xa server with 4 NVIDIA A40 GPUs, intended for AI applications. Each GPU works fine on its own, but any workload in which at least two GPUs have to exchange data fails, including multi-GPU training jobs and the simpleIPC and conjugateGradientMultiDeviceCG CUDA samples (the former reports mismatching results, the latter simply hangs).

I have seen online discussions (1, 2, 3) claiming that something called the IOMMU must be turned off. I tried setting the iommu=off and intel_iommu=off Linux kernel flags, but they didn't help. I checked the BIOS settings, but there is no option there to disable the IOMMU.
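For reference, here is a minimal diagnostic sketch (plain CUDA runtime API, built with nvcc; not part of the samples above) that queries cudaDeviceCanAccessPeer for every ordered GPU pair. Note that this is a capability query only: on a machine with a broken address mapping it may still report "supported" while actual transfers corrupt data, so it can rule problems in but not out.

// Minimal sketch: ask the CUDA runtime whether it believes each ordered
// pair of GPUs can access one another's memory.
// Build with: nvcc p2p_query.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            // Capability only, not a data-integrity test.
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d: peer access %s\n", i, j,
                   can ? "supported" : "NOT supported");
        }
    }
    return 0;
}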

Answer:

While there is no explicit "IOMMU off" setting in this BIOS flavour, the problem is still with the BIOS configuration.

In the BIOS, go to "Integrated Devices" and change the "Memory Mapped I/O Base" setting from the default "56TB" to "12TB". This will solve the issue. There is no need to add any extra kernel parameters.
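After changing the setting, re-running simpleIPC and conjugateGradientMultiDeviceCG is the most direct check. As a smaller test, the following sketch (assuming at least two GPUs; plain CUDA runtime API, not taken from the samples) copies a known pattern from GPU 0 to GPU 1 with cudaMemcpyPeer and reads a slice back for comparison. On a correctly configured machine it should print "peer copy OK"; on a broken one the copy would mismatch or hang, matching the symptoms above.

// Minimal verification sketch: write a pattern on GPU 0, copy it to
// GPU 1 with cudaMemcpyPeer, and read a slice back to the host.
// Build with: nvcc p2p_copy_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;           // 1 MiB test buffer
    unsigned char *src = NULL, *dst = NULL, host[64];

    cudaSetDevice(0);
    cudaMalloc(&src, N);
    cudaMemset(src, 0xAB, N);           // known pattern on GPU 0

    cudaSetDevice(1);
    cudaMalloc(&dst, N);
    cudaMemset(dst, 0x00, N);           // zeroed target on GPU 1

    cudaMemcpyPeer(dst, 1, src, 0, N);  // GPU 0 -> GPU 1
    cudaMemcpy(host, dst, sizeof(host), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (size_t i = 0; i < sizeof(host); ++i)
        if (host[i] != 0xAB) ok = 0;
    printf("peer copy %s\n", ok ? "OK" : "MISMATCH");
    return 0;
}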
