I've been trying all day to get this GPU (a V100) working on a new Ubuntu VM. I tried installing the drivers and rebooting, and I also tried purging and uninstalling everything NVIDIA-related, but none of this seems to work.
Specifically, I ran:
apt update;
apt install build-essential;
sudo add-apt-repository ppa:graphics-drivers
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
sudo apt-get install nvidia-driver-460
sudo reboot now
After this, nvidia-smi sometimes seems to work (it wasn't working as I wrote this question, so I couldn't copy and paste its output from a working run), but when it doesn't work it says:
(synthesis) miranda9@miranda9:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:00:06.0: Unknown Error
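Since the failure is intermittent, it may help to record what nvidia-smi prints on each attempt. The following is a minimal sketch, not something I ran on the VM: the function names `run_nvidia_smi` and `classify` are hypothetical, and the only string it looks for is the error message quoted above.

```python
import shutil
import subprocess


def run_nvidia_smi(timeout=30):
    """Run nvidia-smi once and return (return_code, combined_output).

    Returns (None, message) if the binary is missing or the call times out.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None, "nvidia-smi not found on PATH"
    try:
        proc = subprocess.run(
            [exe],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so the error text is kept
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None, f"nvidia-smi timed out after {timeout}s"
    return proc.returncode, proc.stdout


def classify(output):
    """Label an nvidia-smi output as the failure quoted above or anything else."""
    if "Unable to determine the device handle" in output:
        return "device-handle-error"
    return "other"


if __name__ == "__main__":
    code, out = run_nvidia_smi()
    print(f"return code: {code} ({classify(out)})")
    print(out)
```

Running this after each reboot would give a record of which attempts hit the device handle error and which did not.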
Any help is appreciated.
Note that I do not have access to the VM's .vmx file, so this thread and its answers do not help me: https://forums.developer.nvidia.com/t/nvidia-smi-reports-unable-to-determine-the-device-handle-for-gpu/46835
In addition, I have tried uninstalling everything from NVIDIA and re-installing it with:
sudo apt-get --purge remove "*nvidia*"
sudo /usr/bin/nvidia-uninstall
then
apt update;
apt install build-essential;
sudo add-apt-repository ppa:graphics-drivers
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
sudo apt-get install nvidia-driver-460
sudo reboot now
but that doesn't seem to work either.
More info in case it helps:
(synthesis) miranda9@miranda9:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
Also, PyTorch does not see the GPU:
(synthesis) miranda9@miranda9:~$ python
Python 3.9.5 (default, Jun 4 2021, 12:28:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/miranda9/miniconda3/envs/synthesis/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at /opt/conda/conda-bld/pytorch_1623448238472/work/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False
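To collect PyTorch's view of the GPU in one place, including the warning text, something like the sketch below could be used. It is an illustration under assumptions: `cuda_report` is a hypothetical helper name, and it only uses `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import warnings


def cuda_report():
    """Collect PyTorch's view of CUDA as a dict of plain values."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return {"torch_installed": False}

    # Capture warnings such as the "CUDA initialization" UserWarning
    # instead of letting them scroll past in the terminal.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = torch.cuda.is_available()
        count = torch.cuda.device_count()

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
        "device_count": count,
        "warnings": [str(w.message) for w in caught],
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` is set but `cuda_available` is False, the problem is more likely on the driver or VM side than in the PyTorch install.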
As requested in a comment:
# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 System peripheral: XenSource, Inc. Citrix XenServer PCI Device for Windows Update (rev 01)
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
The same command on another VM:
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 System peripheral: XenSource, Inc. Citrix XenServer PCI Device for Windows Update (rev 01)
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
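Both listings show the V100 at slot 00:06.0, which matches the address 0000:00:06.0 in the nvidia-smi error. A small sketch for pulling the NVIDIA entries out of `lspci` text is below; `find_nvidia_devices` is a hypothetical helper, and the sample text is copied from the output above.

```python
def find_nvidia_devices(lspci_text):
    """Return a list of dicts for lspci lines whose description mentions NVIDIA."""
    devices = []
    for line in lspci_text.splitlines():
        # Each line looks like "<slot> <class>: <description>"
        slot, _, rest = line.strip().partition(" ")
        dev_class, sep, description = rest.partition(": ")
        if sep and "nvidia" in description.lower():
            devices.append(
                {"slot": slot, "class": dev_class, "description": description}
            )
    return devices


SAMPLE = """\
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
"""

if __name__ == "__main__":
    for dev in find_nvidia_devices(SAMPLE):
        print(dev["slot"], "|", dev["class"], "|", dev["description"])
```

On the sample, this finds one device at slot 00:06.0 with class "3D controller", so the card is visible on the PCI bus even when nvidia-smi fails.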
Resources I've searched for help: