The issue at hand
I have a laptop with an AMD CPU and an Nvidia GPU, running Ubuntu.
This configuration has given me a lot of trouble, because AMD support on Linux apparently does not work very well. A new issue has popped up recently. These are the steps that lead to it:
- PC is working fine, Nvidia driver is installed and running, and I can use the GPU for development.
- I reboot the PC.
- The Nvidia GPU no longer works.
This is where I am now. When I run nvidia-smi, I get this message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
But when I open the "Additional Drivers" panel inside "Software & Updates", it shows this as the active driver:
o Using NVIDIA driver metapackage from nvidia-driver-535 (proprietary, tested)
I can also run lspci | grep VGA, which gives this output:
01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [Geforce RTX 3070 Ti Laptop GPU] (rev a1)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] (rev c7)
Finally, I can run sudo apt search nvidia-driver-535, which gives this output:
...
nvidia-driver-535/lunar-updates,lunar-security,lunar,now 535.54.03-0ubuntu0.23.04.2 amd64 [installed]
NVIDIA driver metapackage
...
xserver-xorg-video-nvidia-535/lunar-updates,lunar-security,lunar,now 535.54.03-0ubuntu0.23.04.2 amd64 [installed,automatic]
NVIDIA binary Xorg driver
...
And yes, I have already purged the driver from the PC (i.e. sudo apt purge nvidia*) and reinstalled it. The problem persists. I have also forced gdm3 to use X11 by editing the file /etc/gdm3/custom.conf, because it doesn't work with Wayland.
Attempt at debugging
My skills at debugging the internals of Linux are quite limited, but I did get a few messages from running sudo journalctl -S -1h:
modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-25-generic
[...]
systemd[1860]: Started app-gnome-nvidia\x2dsettings-7946.scope - Application launched by gnome-shell.
nvidia-settings.desktop[7946]: ERROR: NVIDIA driver is not loaded
nvidia-settings[7946]: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
nvidia-settings[7946]: ctk_powermode_new: assertion '(ctrl_target != NULL) && (ctrl_target->h != NULL)' failed
nvidia-settings.desktop[7946]: ERROR: nvidia-settings could not find the registry key file or the X server is not accessible. This file should have been installed along with this driver at /usr/share/nvidia/nvidia-application-profiles-key-documentation. The application profiles will continue to work, but values cannot be prepopulated or validated, and will not be listed in the help text. Please see the README for possible values and descriptions.
nvidia-settings[7946]: PRIME: No offloading required. Abort
nvidia-settings[7946]: PRIME: is it supported? no
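The modprobe line at the top of that log suggests that no nvidia kernel module exists for the running kernel, even though the packages are installed. A minimal sketch for checking this directly is below. The function name has_nvidia_module is made up for illustration, and the sketch assumes the module file is called nvidia.ko, possibly with a compression suffix such as .zst; I have not confirmed where the driver packages actually place it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a file named nvidia.ko (optionally
# with a suffix, e.g. nvidia.ko.zst) exists anywhere under the given
# module directory. The file name pattern is an assumption.
has_nvidia_module() {
    moddir="$1"
    # A missing directory counts as "no module".
    [ -d "$moddir" ] || return 1
    # Keep only the first match; an empty result means nothing was found.
    found=$(find "$moddir" -type f -name 'nvidia.ko*' 2>/dev/null | head -n 1)
    [ -n "$found" ]
}

# Check the module directory of the currently running kernel, which is
# the directory modprobe complained about in the log.
moddir="/lib/modules/$(uname -r)"
if has_nvidia_module "$moddir"; then
    echo "nvidia module present in $moddir"
else
    echo "no nvidia module in $moddir"
fi
```

If this reports no module for 6.2.0-25-generic, that would be consistent with the modprobe error and would point at the kernel module rather than at the user-space driver packages.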
Conclusion
The PC can see the GPU (lspci lists it), and the NVIDIA driver packages are reported as installed. Yet after a reboot the driver is no longer loaded, and modprobe reports that it cannot find the nvidia module for kernel 6.2.0-25-generic.
What is happening? I have absolutely no idea how to fix this.