Score:1

2nd GPU not showing in nvidia-smi in Ubuntu 22.04

br flag

I've been having issues with my machine not detecting my second GPU (both are RTX 3090s). This is not a new machine; the issue popped up a few weeks ago, and I resolved it at the time by rolling back to an older kernel (version unknown). But after a recent update I lost that kernel, and I'm now stuck with this issue.

Here's what I've tried so far:

  • Swapping GPUs in their PCI slots to rule out a hardware issue
  • Updating to the latest mobo BIOS
  • Fresh 22.04 install for each driver install below
  • Every NVIDIA CUDA install (>= 11.7) from the NVIDIA downloads page (deb local, deb network and run file)
  • Every Ubuntu nvidia-driver* as far back as I can go to maintain a minimum CUDA version of 11.7
  • Rolling back to an arbitrarily old kernel version (5.15) using mainline
  • Rolling forward to kernel 6.4
  • Booting with a HDMI monitor attached to GPU 2

*Note that all older Ubuntu nvidia-driver-5XX packages are transitional packages pointing to either 525 or 535 (per apt search nvidia-driver). The last driver with which I had both GPUs working was 515.
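For reference, commands along these lines should show which nvidia-driver packages are only transitional redirects and which driver Ubuntu actually recommends for this hardware (the 515/525 package names below are just examples):

# what does the "transitional" 515 package actually pull in?
apt-cache depends nvidia-driver-515
apt-cache policy nvidia-driver-515 nvidia-driver-525
# which drivers does Ubuntu recommend for the detected GPUs?
ubuntu-drivers devices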

The single GPU that is listed (also my display GPU) does run CUDA workloads, but the system becomes unstable/laggy for a few minutes when a job (PyTorch) starts.

❯ uname -r
5.19.0-46-generic
❯ lspci | grep VGA
09:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
❯ nvidia-smi
Sat Jul  1 12:11:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:43:00.0  On |                  N/A |
|  0%   41C    P8    24W / 350W |    562MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1879      G   /usr/lib/xorg/Xorg                140MiB |
|    0   N/A  N/A      2338    C+G   ...ome-remote-desktop-daemon      258MiB |
|    0   N/A  N/A      2375      G   /usr/bin/gnome-shell               87MiB |
|    0   N/A  N/A      3338      G   ...566776601308618822,262144       73MiB |
+-----------------------------------------------------------------------------+

dmesg output: Link to GitHub Gist

The weird thing is that, very occasionally, after a fresh CUDA install (not isolated to a single driver version) and a restart, the second GPU does show up in nvidia-smi. But after a reboot it disappears again. Uninstalling and reinstalling CUDA can reproduce this, but it appears to be random (and it's not something I want to do on every reboot).
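When the second card is missing, something like the following (using the 09:00.0 bus ID from the lspci output above) should at least show whether the nvidia kernel module is bound to that card at all, and whether the kernel logs any NVRM errors against it:

# is any kernel driver bound to the GPU that nvidia-smi can't see?
sudo lspci -k -s 09:00.0
# any NVRM / nvidia messages in the kernel log?
sudo dmesg | grep -iE 'nvrm|nvidia'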

Any ideas how I can get my machine working properly again?

Link to nvidia-bug-report

us flag
My Ubuntu 22.04.2 LTS was just upgraded from kernel 5.15.0-76 to 6.1.0-1015, and with that change the NVIDIA drivers stopped working for me entirely. Maybe you'll have better luck with the 5.15 kernel.
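If rolling back to 5.15 does work for you, holding the kernel packages should stop a later update from removing them again (the version string below is just the build from my machine; substitute whichever 5.15 build you have installed):

sudo apt-mark hold linux-image-5.15.0-76-generic linux-headers-5.15.0-76-generic linux-modules-5.15.0-76-generic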
Score:0
us flag
sudo dkms autoinstall

might help by rebuilding the Nvidia kernel modules.
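Roughly something like this (untested on your exact setup):

# which module versions DKMS knows about and which kernels they are built for
dkms status
# rebuild whatever is registered against the currently running kernel
sudo dkms autoinstall -k "$(uname -r)"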

Score:0
br flag

As of today, the only solution I have found is to use mainline to install kernel version 5.15. This restored my second GPU in nvidia-smi.

I have no idea why the current 22.04.2 LTS image uses 5.19, as it states here that 22.04 LTS should ship 5.15. It is also bizarre that a routine update essentially created this issue; I'm sure the main reason people use LTS releases is to avoid exactly this kind of problem.

Edit: Based on the release notes

Ubuntu Desktop will automatically opt-into v5.17 kernel on the latest generations of certified devices (linux-oem-22.04)

Ubuntu Server defaults to a non-rolling LTS kernel v5.15 (linux-generic)

So it looks like 5.15 may only be for Ubuntu Server, and Ubuntu Desktop uses a rolling kernel. Shame the current kernel seems to have broken something...
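If anyone would rather stay on stock Ubuntu packages instead of mainline, I believe (though I haven't verified it on this machine) that switching the kernel metapackages from HWE back to GA should land you on the same 5.15 series:

# install the GA (5.15) kernel metapackages
sudo apt install linux-generic linux-headers-generic linux-image-generic
# then remove the rolling HWE metapackages so updates stop pulling in newer kernels
sudo apt remove linux-generic-hwe-22.04 linux-image-generic-hwe-22.04 linux-headers-generic-hwe-22.04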

Score:0
ir flag

@Anjum Sayed, could you give some detail about what you did to restore it? I'm dual-booting Windows 10 and Ubuntu 20.04 Desktop and am experiencing the same issue, where I can no longer see my RTX 3090 GPU:

Loading new nvidia-465.19.01 DKMS files...
Building for 5.15.0-76-generic
Building for architecture x86_64
Building initial module for 5.15.0-76-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-465.0.crash'
Error! Bad return status for module build on kernel: 5.15.0-76-generic (x86_64)
Consult /var/lib/dkms/nvidia/465.19.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-465 (--configure):
installed nvidia-dkms-465 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of cuda-drivers-465:
cuda-drivers-465 depends on nvidia-dkms-465 (>= 465.19.01); however:
Package nvidia-dkms-465 is not configured yet.

dpkg: error processing package cuda-drivers-465 (--configure):
dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because the error message indicates its a followup error from a previous failure.
dpkg: dependency problems prevent configuration of cuda-drivers:
cuda-drivers depends on cuda-drivers-465 (= 465.19.01-1); however:
Package cuda-drivers-465 is not configured yet.

dpkg: error processing package cuda-drivers (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-driver-465:
nvidia-driver-465 depends on nvidia-dkms-465 (= 465.19.01-0ubuntu1); however:
Package nvidia-dkms-465 is not configured yet.

dpkg: error processing package nvidia-driver-465 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-runtime-11-3:
cuda-runtime-11-3 depends on cuda-drivers (>= 465.19.01); however:
Package cuda-drivers is not configured yet.

dpkg: error processing package cuda-runtime-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-demo-suite-11-3:
cuda-demo-suite-11-3 depends on cuda-runtime-11-3; however:
Package cuda-runtime-11-3 is not configured yet.
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already

dpkg: error processing package cuda-demo-suite-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-11-3:
cuda-11-3 depends on cuda-runtime-11-3 (>= 11.3.1); however:
Package cuda-runtime-11-3 is not configured yet.
cuda-11-3 depends on cuda-demo-suite-11-3 (>= 11.3.58); however:
Package cuda-demo-suite-11-3 is not configured yet.

dpkg: error processing package cuda-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda:
cuda depends on cuda-11-3 (>= 11.3.1); however:
Package cuda-11-3 is not configured yet.

No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
dpkg: error processing package cuda (--configure):
dependency problems - leaving unconfigured
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-76-generic
Errors were encountered while processing:
nvidia-dkms-465
cuda-drivers-465
cuda-drivers
nvidia-driver-465
cuda-runtime-11-3
cuda-demo-suite-11-3
cuda-11-3
cuda
E: Sub-process /usr/bin/dpkg returned an error code (1)
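Before reinstalling anything I plan to clear the stale crash file apport complains about, read the build log the error points at, and then retry the configure step (paths taken from the output above):

sudo rm /var/crash/nvidia-dkms-465.0.crash
less /var/lib/dkms/nvidia/465.19.01/build/make.log
sudo dpkg --configure -a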