Score:-1

Extremely slow GPU memory allocation


I'm having a problem with extremely slow memory allocation on an NVIDIA GPU from Python.

When running a GPU calculation in a fresh Python session, TensorFlow/PyTorch allocates memory in tiny increments for around four minutes until it suddenly allocates a large chunk of memory and performs the actual calculation. All subsequent calculations are performed instantly.

Does anyone know what could be wrong? Or how to get a log of what is actually going on during memory allocation?

I've tried re-installing the CUDA libraries and NVIDIA drivers. Re-installing the drivers fixes the issue for a little while, then memory allocation hangs again.
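
The most verbose trace I know how to get is raising TensorFlow's C++ log level before the import and watching the GPU from a second terminal. I'm not certain these environment variables surface allocator activity, so treat this as a rough sketch:

TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_MIN_VLOG_LEVEL=1 python3 -c "import tensorflow as tf; print(tf.random.uniform([10]))"

# in another terminal: per-second power/utilisation/clock/memory stats while it hangs
nvidia-smi dmon -s pucm

If anyone knows a better way to log what is happening during those four minutes, that would be welcome.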

Python output:

Python 3.11.3 (main, Apr  5 2023, 14:15:06) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('import tensorflow as tf;tf.random.uniform([10])', number=1)
2023-04-17 09:08:24.062130: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-17 09:08:24.641429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-04-17 09:12:12.879503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21368 MB memory:  -> device: 0, name: GRID RTX6000-24Q, pci bus id: 0000:02:02.0, compute capability: 7.5
229.68861908599501

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID RTX6000-24Q    On   | 00000000:02:02.0 Off |                  N/A |
| N/A   N/A    P8    N/A /  N/A |  23527MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    122079      C   ...Model-js4zUkog/bin/python    21743MiB |
+-----------------------------------------------------------------------------+

nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
guiverc:
Ubuntu 23.04 doesn't yet exist; it's currently the *development* release Ubuntu *lunar* and remains that until it reaches RC state, which isn't expected until after 13 April 2023, and it isn't on-topic here until release on 20 April 2023. https://discourse.ubuntu.com/t/lunar-lobster-release-schedule/27284 Please refer to https://askubuntu.com/help/on-topic. For support issues with Ubuntu *lunar* you'll need to use an #ubuntu-next or #ubuntu+1 site (IRC, UF etc). *Details in your provided pastes only match unsupported Ubuntu.*
petrovski:
@guiverc Good to know, my question is not about Ubuntu Lunar though.
guiverc:
python3 or `python3 | 3.11.2-1` is only the default for Ubuntu 23.04 (the current development release, Ubuntu *lunar*), and those are the only release details I noted in your question, i.e. details from the pasted output.
Score:1

I found out that the slow memory allocation was caused by NVIDIA throttling my GPU because the GRID license could not be verified.

I checked:

sudo cat /var/log/syslog | grep nvidia

And found:

Apr 18 11:35:43 srv-apu102 nvidia-gridd: Valid GRID license not found. GPU features and performance are restricted. To enable full functionality please configure licensing details.
Apr 18 11:42:32 srv-apu102 nvidia-gridd: Acquiring license. (Info: http://10.1.2.56:7070/request; NVIDIA RTX Virtual Workstation)
Apr 18 11:42:32 srv-apu102 nvidia-gridd: Calling load_byte_array(tra)
Apr 18 11:42:35 srv-apu102 nvidia-gridd: Error: Failed server communication. Server URL : http://10.1.2.56:7070/request - #012[1,7e2,2,0[74000008,7,110001f3]] Generic communications error.#012[1,7e2,2,0[75000001,7,30010255]] General data transfer failure. Couldn't connect to server
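
For reference, the license state can also be checked from the driver itself, and nvidia-gridd reads its license server settings from /etc/nvidia/gridd.conf. The exact output and field names depend on your driver version, so take this as a rough sketch rather than an exact recipe:

nvidia-smi -q | grep -i -A 2 "license"

# license client settings (ServerAddress, ServerPort, FeatureType) live here
sudo nano /etc/nvidia/gridd.conf
sudo systemctl restart nvidia-gridd

Restarting nvidia-gridd after fixing the server details should make it retry license acquisition without a reboot.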

I hope this helps others, regardless of the downvotes from people who seem to think that my question is about Ubuntu Lunar...
