Score:1

Continuing NVIDIA problems. Driver broken after shutting down overnight

Is there a 100% guaranteed way to set up an NVIDIA 4090 purely for AI work, without using it for graphics or the desktop? The setup should survive minor driver upgrades, CUDA upgrades and minor OS upgrades, as well as simply shutting down for the night and booting again in the morning.

Yesterday I upgraded to CUDA 12.0, which also upgraded the NVIDIA driver to 525.60.13. The command was: sudo sh cuda_12.0.0_525.60.13_linux.run

The upgrade failed at the 525.60.13 driver step, so I ran the run script from emergency single-user mode, without a desktop. That worked, but then I had no audio. Audio is supposed to be driven through my monitor via the Intel integrated GPU, and it was working just before I upgraded the NVIDIA stuff. I did some inference work for a while without music. Just before shutting down I rebooted and the audio worked. I did more inferencing, shut down, went to sleep, woke up, started my system and got:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
5.17.0-1019-oem #20-Ubuntu SMP PREEMPT

Obviously I have the latest driver having upgraded a few hours prior. Yes, I just rebooted again. Yes, I have spent hours Googling. Please try to help without finding fault with the perfection of my question. lshw sees the device. I've tried so many things.

sudo modprobe -a nvidia
modprobe: ERROR: ../libkmod/libkmod-module.c:838 kmod_module_insert_module() could not find module by name='off'
modprobe: ERROR: could not insert 'off': Unknown symbol in module, or unknown parameter (see dmesg)

Last night this wasn't a problem. :-(

Dan Wood
I'll leave the question up. I commented out the "blacklist nvidia" and "alias nvidia off" lines in /lib/modprobe.d/blacklist-nvidia.conf, and modprobe can now load the driver. Some solutions say to comment out all the lines, but I think (though I don't know for sure) that the drm driver isn't needed if you aren't using the GPU for rendering, and I don't want my screen to go black again.
Score:1

My setup uses the Intel CPU's integrated GPU to run my monitor, leaving my NVIDIA 4090 100% for AI/DNN/Stable Diffusion. It seems that at some point while upgrading the NVIDIA or CUDA drivers, the NVIDIA card takes over, as if I were a typical gamer wanting it to run my video and sound.

To fix that I would run prime-select intel, which seems to fix my audio.

The problem is that it ALSO disables all 3 NVIDIA kernel modules by blacklisting them in /lib/modprobe.d/blacklist-nvidia.conf.

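If the nvidia driver isn't loaded, the 4090 doesn't work. The error you get when trying to load the kernel module manually is confusing because of how the blacklist works: the "alias nvidia off" line makes modprobe look for a module literally named 'off', so the message complains about 'off' instead of saying that nvidia is blacklisted.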
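
If you want to see which entries are blocking the driver, a small sketch like this can help. It assumes the entries live in a modprobe.d directory as plain "blacklist ..." and "alias ... off" lines; the function name find_nvidia_blocks is made up for illustration and is not part of any NVIDIA or Ubuntu tool.

```shell
#!/bin/sh
# Hypothetical helper: list every active (uncommented) modprobe entry that
# blacklists an nvidia module or aliases one to "off".
# Arguments: one or more directories or files to search.
find_nvidia_blocks() {
    grep -rnE '^[[:space:]]*(blacklist[[:space:]]+nvidia|alias[[:space:]]+nvidia[^[:space:]]*[[:space:]]+off)' "$@" 2>/dev/null
}

# Example (reading these files does not require root):
#   find_nvidia_blocks /etc/modprobe.d /lib/modprobe.d
```

Each hit is printed as file:line:text, so you can tell whether the block comes from blacklist-nvidia.conf or from some other file.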

The solution was to comment out the "blacklist nvidia" and "alias nvidia off" lines in the conf file above. After that, modprobe can load nvidia and the card works.
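
A minimal sketch of that edit, assuming the file contains the two lines exactly as "blacklist nvidia" and "alias nvidia off" (I have not checked every nvidia-prime version). The function name unblacklist_nvidia is hypothetical, and the sed -i option as used here assumes GNU sed, which is what Ubuntu ships.

```shell
#!/bin/sh
# Hypothetical helper: comment out only the two lines that block the main
# nvidia module, leaving the nvidia-drm and nvidia-modeset entries untouched.
# Argument: path to the blacklist file. A copy is kept as <file>.bak.
unblacklist_nvidia() {
    conf="$1"
    sed -i.bak -E 's/^(blacklist nvidia|alias nvidia off)[[:space:]]*$/# \1/' "$conf"
}

# Example (needs root because the file is system-owned):
#   unblacklist_nvidia /lib/modprobe.d/blacklist-nvidia.conf
#   modprobe nvidia
#   nvidia-smi
```

Note that running prime-select intel again may regenerate the file, in which case the edit would have to be repeated.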

Since I'm not using the 4090 as a display device I left the other two modules blacklisted in the file.
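
To confirm which NVIDIA modules actually ended up loaded, something like the following sketch can be used. It reads /proc/modules, where each loaded module is listed with its name in the first field; the function name loaded_nvidia_modules is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the names of loaded kernel modules that start
# with "nvidia". Optional argument: a file in /proc/modules format
# (defaults to /proc/modules itself).
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }' "${1:-/proc/modules}"
}

# Example:
#   loaded_nvidia_modules
```

With the setup described above I would expect to see nvidia but not nvidia_drm or nvidia_modeset. An nvidia_uvm entry may also appear once CUDA has been used.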
