I've been having issues with my machine not detecting my second GPU (both are RTX 3090s). This is not a new machine; the issue popped up a few weeks ago, and at the time I resolved it by rolling back to an older kernel (I no longer know which version). After a recent update removed that kernel, I'm stuck with the issue again.
Here's what I've tried so far:
- Swapping the GPUs between PCIe slots to rule out a hardware issue
- Updating to the latest motherboard BIOS
- Performing a fresh Ubuntu 22.04 install before each driver install below
- Installing every NVIDIA CUDA release (>= 11.7) from the NVIDIA downloads page (deb local, deb network and runfile)
- Installing every Ubuntu nvidia-driver-* version, going as far back as I can while keeping a minimum CUDA version of 11.7
- Rolling back to an arbitrarily old kernel (5.15) using mainline
- Rolling forward to kernel 6.4
- Booting with an HDMI monitor attached to GPU 2
*Note that all older Ubuntu nvidia-driver-5xx packages are transitional packages pointing to either 525 or 535 (per apt search nvidia-driver). The last driver with which both GPUs worked was 515.
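Since most of the above comes down to swapping drivers and kernels, this is the kind of check I can run after each attempt to confirm the module actually built and loaded for the running kernel (assuming the DKMS-based Ubuntu packaging; I can post the exact output if useful):

❯ dkms status                          # was the nvidia module built for this kernel?
❯ cat /proc/driver/nvidia/version      # version of the module actually loaded
❯ lsmod | grep nvidia                  # nvidia / nvidia_drm / nvidia_uvm loaded?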
The single GPU that is listed (which is also my display GPU) does run CUDA workloads, but the system becomes unstable/laggy for a few minutes when a job (PyTorch) starts.
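For reference, a minimal way to confirm what PyTorch itself sees and to watch utilisation while the job spins up (assuming a working PyTorch install; these are standard calls, nothing specific to my setup):

❯ python3 -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
❯ nvidia-smi dmon -s um    # live utilisation / memory while the job starts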
❯ uname -r
5.19.0-46-generic
❯ lspci | grep VGA
09:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
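Both cards do show up at the PCI level, so the question seems to be whether the nvidia kernel driver binds to both. If it helps, I can also post the per-slot view, e.g.:

❯ lspci -nnk -s 09:00.0    # "Kernel driver in use" for the card that goes missing
❯ lspci -nnk -s 43:00.0    # same for the card that works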
❯ nvidia-smi
Sat Jul 1 12:11:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:43:00.0  On |                  N/A |
|  0%   41C    P8    24W / 350W |    562MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1879      G   /usr/lib/xorg/Xorg                140MiB |
|    0   N/A  N/A      2338    C+G   ...ome-remote-desktop-daemon      258MiB |
|    0   N/A  N/A      2375      G   /usr/bin/gnome-shell               87MiB |
|    0   N/A  N/A      3338      G   ...566776601308618822,262144       73MiB |
+-----------------------------------------------------------------------------+
dmesg output:
Link to GitHub Gist
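The full dmesg is in that gist; if a narrower view is more useful, I can filter it down to the missing card and the NVIDIA module along these lines:

❯ sudo dmesg | grep -i '09:00'           # anything the kernel says about the absent 3090
❯ sudo dmesg | grep -iE 'nvrm|nvidia'    # driver init messages and errors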
The weird thing is that, very occasionally, after a fresh CUDA install (not isolated to any single driver version) and a restart, the second GPU does show up in nvidia-smi. But after the next reboot it disappears again. Uninstalling and reinstalling CUDA can sometimes reproduce this, but it appears to be random (and it isn't something I want to do on every reboot).
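Next time it does appear, I can capture a quick snapshot of the working state to compare against the broken one; something like the following (all standard driver interfaces, nothing specific to my setup):

❯ nvidia-smi -L                     # GPUs and their UUIDs
❯ ls /proc/driver/nvidia/gpus/      # one directory per GPU the driver initialised
❯ ls -l /dev/nvidia*                # device nodes present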
Any ideas how I can get my machine working properly again?
Link to nvidia-bug-report