Ubuntu 20.04 breaks Nvidia driver regularly

jp flag

I am using Ubuntu 20.04.3 LTS on two machines (my personal computer and a small server from work), both equipped with Nvidia cards. The personal machine has an RTX2080 Super while the server runs with two RTX3090s.

We are doing deep learning research at work, so I use the machines mostly for running TensorFlow or related tools that make use of the GPU.

I set up both machines from scratch: a fresh Ubuntu 20.04.3 LTS install, update and upgrade, installing basic tools, then installing the Nvidia driver + CUDA. On both machines, I used the runfile installer for CUDA from the official Nvidia page here, which contains the Nvidia driver. Before running this installer I always blacklist the Nouveau driver, as shown here for example. I would not consider myself a very experienced admin for such systems, as I come from a research background; I learned to use and understand Linux over the past months, and so far everything we needed for our small team has worked like a charm. Except for one little problem that I encounter both on my personal machine and on the research server: my driver installations seem to break regularly, without my being able to understand why or exactly when.
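For context, the Nouveau blacklisting step referred to above usually amounts to a small modprobe config file plus an initramfs rebuild. A sketch of the common approach (the filename is a convention, not a requirement):

```
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

followed by `sudo update-initramfs -u` and a reboot so the open-source driver is no longer loaded when the Nvidia runfile installer runs.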

Why mention both machines? Because I think it is the same problem manifesting in two different ways: (1) My personal machine is the one I also use for work and coding. It has a display attached, and at regular intervals (every 3-5 weeks, I would say) it doesn't boot into the login screen but instead shows a single line saying:

/dev/nvme0n1p1: clean

I don't remember the exact line, but it definitely contains the location of my SSD and the word "clean". Nothing happens from this point on. I usually solve the problem by switching to a TTY with Ctrl+Alt+F2, logging in, and simply re-running the CUDA/driver installer with:

sudo sh

and then rebooting. After the reboot, my login screen is back and everything works again. I have been doing this for about a year now on my personal machine, and it never bothered me enough to track down the root cause, because after reinstalling, CUDA works, TF-GPU works, my UI works, and to be honest, that's all I need.

(2) Now comes the display-less server. It runs non-stop without rebooting. But at regular intervals (the same 3-5 weeks), everything that has to do with the GPU just stops working. Python scripts using TensorFlow-GPU no longer find the GPU, and nvidia-smi shows the message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

One day everything is there and works, and then, without my manually changing anything in the system (updates etc.), it stops working and shows this message. As on my personal machine, simply reinstalling the driver fixes the problem. But since this is a server that I am responsible for and that many people use, I want a proper solution and to understand the problem in detail so I can avoid it in the future.

I took a look at /var/log/dpkg.log to see whether I could find any trace of an automatically updated driver. I also looked through the Xorg, boot, and system logs, but I lack the knowledge to find hints of what goes wrong in them. One thing I did find out is that running dpkg --list | grep nvidia returns nothing at all on the server, and nvidia-smi prints the message mentioned above. Surprisingly, nvcc --version still works and gives:
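In a "broken" state like this, a few read-only checks can reveal whether a kernel update is the culprit. This is a hedged diagnostic sketch (the log path is the Ubuntu default and may already have been rotated to a .gz file):

```shell
# Show the currently running kernel version
uname -r

# Check whether an nvidia kernel module exists for the *running* kernel.
# A runfile install builds the module only against the kernel that was
# active at install time, so this often comes back empty after an update.
find /lib/modules/"$(uname -r)" -name 'nvidia*.ko*' 2>/dev/null || true

# Look for recent kernel image upgrades in the apt history
grep 'linux-image' /var/log/apt/history.log 2>/dev/null || true

# DKMS-registered modules are rebuilt automatically on kernel updates;
# a plain runfile install usually does not appear in this list
dkms status 2>/dev/null || true
```

If the running kernel has no matching nvidia module while an older kernel directory under /lib/modules does, that would point to a kernel update as the trigger.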

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0

so it seems that CUDA is still there but the Nvidia driver isn't.

On both the personal machine and the server, I assume it is the same problem. When I run nvidia-smi in the terminal while my personal machine is broken, it shows the same error message, and I am sure that if I attached a display to the server, it wouldn't show me an Ubuntu login screen either.

For now, I haven't re-run the installation on the server, as I wanted to leave it in the "broken" state in case you have advice on where to look for the problem. In any case, thanks in advance for your help!

ChanganAuto avatar
us flag
Whenever you install the driver using the Nvidia binaries, i.e., not from the repositories as you should, that's exactly what's supposed to happen: you need to reinstall each time there's a kernel update.
Hendrik avatar
jp flag
Sounds reasonable! This means `sudo apt install nvidia-driver-470` will most likely do the job? Why does the default CUDA installer come with the driver then? Do I still have to do the blacklisting of nouveau in this case?
ChanganAuto avatar
us flag
Yes, it should do the job. And you should install CUDA from the repos as well. And no, there's no need to blacklist anything; the installation takes care of that.
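For reference, the repository-based route suggested here would look roughly like this; the driver version below is just an example, and `ubuntu-drivers` shows what is actually recommended for your card:

```shell
# Ask Ubuntu which driver packages it recommends for the detected GPU
ubuntu-drivers devices

# Install the recommended driver from the Ubuntu repositories
# (470 is an example; use the version recommended above)
sudo apt install nvidia-driver-470

# Optionally install the CUDA toolkit from the repositories as well
sudo apt install nvidia-cuda-toolkit

# After a reboot, verify that the driver and the toolkit are both visible
nvidia-smi
nvcc --version
```

Because the repo packages register the kernel module with DKMS, it is rebuilt automatically whenever the kernel is updated, which is exactly what the runfile install does not do.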
Hendrik avatar
jp flag
Thanks so much for your easy and quick answer!

