I have been trying to install CUDA for the past few days to train my TensorFlow CNNs.
Here is what is currently installed on my machine (Ubuntu 20.04 LTS, RTX 3060):
tensorflow-gpu 2.4
python 3.8.10
cuDNN 8.0
CUDA 11.0
nvidia-driver-495
The driver was installed alongside CUDA 11.0.
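To cross-check the versions listed above against what the installed TensorFlow wheel was actually built for, a minimal sketch along these lines may help. It assumes TensorFlow 2.3 or later, where `tf.sysconfig.get_build_info()` is available; the helper name `tf_build_versions` is hypothetical, and it returns `None` when TensorFlow cannot be imported.

```python
def tf_build_versions():
    """Return the CUDA/cuDNN versions TensorFlow was built against, or None."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable in this environment.
        return None
    info = tf.sysconfig.get_build_info()
    # CPU-only builds may not carry these keys, hence .get().
    return {
        "tensorflow": tf.__version__,
        "cuda": info.get("cuda_version"),
        "cudnn": info.get("cudnn_version"),
    }


if __name__ == "__main__":
    print(tf_build_versions())
```

If the reported CUDA version differs from the toolkit installed on the machine, the loader will look for library files with the build-time version suffix.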
When I fit a model, I can see that my GPU allocates all of its memory, but the training output stays at: Epoch 1/50
and never goes any further.
I tried to downgrade my driver to nvidia-driver-470, as 495 is not officially released yet.
This action made everything stop working: my GPU no longer allocates memory when fitting, nvidia-smi
no longer works, and importing tensorflow now returns:
Could not load dynamic library 'libcudart.so.11.0'; dlerror:
which was not the case previously.
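The two symptoms above (nvidia-smi failing and libcudart.so.11.0 not loading) can be checked separately. A rough sketch using only the standard library follows; `driver_status` and `can_load` are hypothetical helper names, while `--query-gpu` and `--format` are regular nvidia-smi options.

```python
import ctypes
import shutil
import subprocess


def driver_status():
    """Return the GPU name and driver version reported by nvidia-smi, or None."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not on PATH
    try:
        out = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if out.returncode != 0:
        return None  # nvidia-smi ran but could not talk to the driver
    return out.stdout.strip()


def can_load(library_name):
    """Try to dlopen a shared library by name; return (ok, error_message)."""
    try:
        ctypes.CDLL(library_name)
        return True, ""
    except OSError as exc:
        return False, str(exc)


if __name__ == "__main__":
    print("driver:", driver_status())
    print("libcudart:", can_load("libcudart.so.11.0"))
```

A `None` from `driver_status()` points at the driver installation, whereas a failure from `can_load` points at the CUDA toolkit or the library search path.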
Does anyone know where this issue may come from?
Thanks
edit 1:
After a reboot, importing TensorFlow returns:
tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2021-11-02 06:24:40.852786: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
The directories /usr/lib/cuda/include and /usr/lib/cuda/lib64 do exist.
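Since the directories exist but the loader still reports "No such file or directory", it may be worth checking whether the file itself is present in any of them. A minimal sketch (the helper name `find_in_library_path` is hypothetical; it only inspects an LD_LIBRARY_PATH-style string and does not consult the ldconfig cache):

```python
import os


def find_in_library_path(filename, search_path=None):
    """Return the full paths where filename exists in a ':'-separated search path."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    print(find_in_library_path("libcudart.so.11.0"))
```

An empty list means none of the listed directories contains a file with that exact name, even though the directories themselves exist.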
edit 2:
After reinstalling CUDA by following this answer: https://askubuntu.com/a/1288405/231142
importing TensorFlow works and no longer returns any errors. However, creating my callbacks:
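# Imports assumed to come from tensorflow.keras (not shown in the original snippet).
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
early_stop = EarlyStopping(patience=10, restore_best_weights=True)  # stop after 10 epochs without improvement
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', verbose=2, factor=0.5, min_lr=0.00001)  # halve the LR on plateau
model_check = ModelCheckpoint('model.hdf5', monitor='val_loss', verbose=1, save_best_only=True)  # keep the best model only
tensorboard = TensorBoard(log_dir='logs')  # write TensorBoard logs to ./logs
callback = [early_stop, reduce_lr, model_check, tensorboard]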
returns:
2021-11-02 20:09:55.607299: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:09:55.607335: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:09:55.608325: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-11-02 20:09:55.609026: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.2'; dlerror: libcupti.so.11.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609320: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609372: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-11-02 20:09:55.609476: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-11-02 20:09:55.609527: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
Model fitting then starts and uses all of my GPU and CPU, but it is still very slow and returns:
2021-11-02 20:09:55.832301: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.269844: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:56.669900: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.821919: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:57.065544: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/20
2021-11-02 20:09:59.868007: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
1/137 [..............................] - ETA: 1:15:21 - loss: 0.7485 - accuracy: 0.3871
2021-11-02 20:10:30.404084: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:10:30.404114: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:10:30.404277: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
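For scale, the allocation sizes in the cpu_allocator_impl warnings above can be converted to MiB. This is plain arithmetic on the numbers from the log; the inferred bound on free memory simply takes the 10% threshold in the message at face value.

```python
MIB = 1024 * 1024


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


if __name__ == "__main__":
    for size in (428802048, 571736064):
        # The warning fires when size > 10% of free memory,
        # so free memory was below 10 * size at that moment.
        print(f"{size} bytes = {to_mib(size):.2f} MiB, "
              f"free memory below {to_mib(10 * size):.1f} MiB")
```

Taken at face value, single host-side allocations of roughly 409 MiB and 545 MiB suggest that free system RAM was under about 4 GiB when the first warning was emitted.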
There may be an issue with the libcupti.so.11.2 library, but I have not found it yet.
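To look for the missing library, a small search like the following may help. It is a sketch under assumptions: it supposes the toolkit lives under /usr/local/cuda* or /usr/lib/cuda (the locations that appear in the LD_LIBRARY_PATH above) and that CUPTI sits either in lib64 or in the toolkit's extras/CUPTI/lib64 subdirectory, which is where NVIDIA's installers usually place it. `find_cupti` is a hypothetical helper name.

```python
import glob
import os


def find_cupti(roots=("/usr/local/cuda*", "/usr/lib/cuda")):
    """List libcupti shared objects found under the given CUDA root patterns."""
    found = []
    for root in roots:
        for subdir in ("lib64", os.path.join("extras", "CUPTI", "lib64")):
            pattern = os.path.join(root, subdir, "libcupti.so*")
            found.extend(sorted(glob.glob(pattern)))
    return found


if __name__ == "__main__":
    for path in find_cupti():
        print(path)
```

If the only matches are under extras/CUPTI/lib64, note that this directory is not part of the LD_LIBRARY_PATH shown in the log, so the loader would not search it.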