Unable to run Tensorflow model with CUDA on Ubuntu 20.04

I have been trying to install CUDA for the past few days to fit my TensorFlow CNNs. This is what is currently installed on my machine (Ubuntu 20.04 LTS, RTX 3060):

tensorflow-gpu 2.4

python 3.8.10

cuDNN 8.0

CUDA 11.0

nvidia-driver-495

The driver was installed alongside CUDA 11.0.
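To compare the versions listed above with what the installed TensorFlow wheel was actually built against, something like the following sketch may help. The function name `describe_tensorflow_gpu_setup` is made up for illustration, and the sketch assumes a TensorFlow release that provides `tf.sysconfig.get_build_info()` (I believe 2.3 or later); on a CPU-only build the CUDA and cuDNN entries are simply `None`.

```python
def describe_tensorflow_gpu_setup():
    """Collect the CUDA-related facts TensorFlow reports about itself.

    Hypothetical helper: returns {"tensorflow": None} when TensorFlow
    cannot be imported at all.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow": None}

    # Build info describes what the wheel was compiled against,
    # not what is installed on this machine.
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        # An empty list means TensorFlow could not see any GPU at runtime.
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in describe_tensorflow_gpu_setup().items():
        print(f"{key}: {value}")
```

If the reported CUDA or cuDNN version differs from the toolkit installed on the system, the loader will look for library file names that do not exist there.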

When I fit a model, I can see that my GPU allocates all of its memory, but the verbose output stays at Epoch 1/50 and never goes any further.

I tried downgrading my driver to nvidia-driver-470, since 495 is not officially released yet. This action broke everything: my GPU no longer allocates memory when fitting, `nvidia-smi` no longer works, and importing TensorFlow now returns:

Could not load dynamic library 'libcudart.so.11.0'; dlerror: ,

which was not the case previously.

Does anyone know where this issue may come from?

Thanks

edit 1:

After reboot, importing Tensorflow returns:

tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2021-11-02 06:24:40.852786: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Directories /usr/lib/cuda/include and /usr/lib/cuda/lib64 actually exist.
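The directories existing does not mean the library file is in them. A minimal sketch for checking each `LD_LIBRARY_PATH` entry for a given file name follows; `find_library` is a name made up here, and the check only covers `LD_LIBRARY_PATH`, not the `ldconfig` cache that the dynamic loader also consults.

```python
import os
from pathlib import Path


def find_library(name, search_path=None):
    """Return files whose names start with `name` in LD_LIBRARY_PATH-style dirs.

    `search_path` is a colon-separated string; it defaults to the current
    LD_LIBRARY_PATH environment variable.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")

    hits = []
    for entry in search_path.split(":"):
        if not entry:
            continue  # skip empty entries such as a trailing ':'
        directory = Path(entry)
        if not directory.is_dir():
            continue  # listed in the variable but missing on disk
        hits.extend(sorted(directory.glob(name + "*")))
    return hits


if __name__ == "__main__":
    matches = find_library("libcudart.so")
    if matches:
        for path in matches:
            print(path)
    else:
        print("libcudart.so not found in any LD_LIBRARY_PATH directory")
```

An empty result for `libcudart.so` would be consistent with the "cannot open shared object file" message above, even though both directories exist.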

edit 2:

After reinstalling CUDA following https://askubuntu.com/a/1288405/231142, importing TensorFlow works and does not report any issues. However, defining my callbacks:

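# Import path assumed; the original post did not show its imports.
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard

EarlyStop = EarlyStopping(patience=10, restore_best_weights=True)
Reduce_LR = ReduceLROnPlateau(monitor='val_accuracy', verbose=2, factor=0.5, min_lr=0.00001)
model_check = ModelCheckpoint('model.hdf5', monitor='val_loss', verbose=1, save_best_only=True)
tensorbord = TensorBoard(log_dir='logs')
callback = [EarlyStop, Reduce_LR, model_check, tensorbord]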

returns :

2021-11-02 20:09:55.607299: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:09:55.607335: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:09:55.608325: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-11-02 20:09:55.609026: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.2'; dlerror: libcupti.so.11.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609320: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609372: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-11-02 20:09:55.609476: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-11-02 20:09:55.609527: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.

Model fitting starts and uses all of my GPU and CPU, but it still runs slowly and returns:

2021-11-02 20:09:55.832301: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.269844: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:56.669900: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.821919: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:57.065544: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/20
2021-11-02 20:09:59.868007: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
  1/137 [..............................] - ETA: 1:15:21 - loss: 0.7485 - accuracy: 0.3871
2021-11-02 20:10:30.404084: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:10:30.404114: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:10:30.404277: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.

There may be an issue with the libcupti.so.11.2 library, but I have not found it so far.
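One way to look for the file is to search the CUDA install roots recursively rather than only the `lib64` directories listed in `LD_LIBRARY_PATH`. The sketch below does that; `find_cupti` and its default roots are my own assumptions. Toolkit installs commonly keep CUPTI under an `extras/CUPTI/lib64` subdirectory, but that layout is not confirmed for this machine.

```python
from pathlib import Path


def find_cupti(toolkit_roots=("/usr/local", "/usr/lib/cuda")):
    """Recursively collect libcupti.so* files under the given roots.

    The default roots are guesses based on the paths in the log output;
    pass the actual CUDA install directories if they differ.
    """
    found = set()
    for root in toolkit_roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue  # root does not exist on this machine
        found.update(root_path.rglob("libcupti.so*"))
    return sorted(found)


if __name__ == "__main__":
    for path in find_cupti():
        # The parent directory is what would need to be on LD_LIBRARY_PATH.
        print(path.parent, "->", path.name)
```

If a match turns up with a different version suffix than `libcupti.so.11.2`, that would suggest the installed toolkit and the TensorFlow build expect different CUDA versions.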

Terrance:
I hate to ask this, but when you "deprecated" your NVIDIA driver, did you reboot your system so that the older driver takes effect?
Louis:
I did, for good measure. Importing TensorFlow now returns: `2021-11-02 06:01:48.281681: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64: 2021-11-02 06:01:48.281751: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.`
Terrance:
I am not sure how you set up your system for CUDA, but you might want to look at my answer [here](https://askubuntu.com/a/1288405/231142) and see whether you missed a step in the CUDA installation, such as the additional lines that need to be added to the `~/.profile` file. I wish I had a better card on my home system: I cannot run some of the TensorFlow tests because my card is older, though other CUDA tests pass. Sometimes running `sudo ldconfig` can fix library file issues as well.
Louis:
I followed the instructions in your link and updated the post with the new state.