Score:1

Does a defunct process still allocate resources in the system?


I have a production machine (Ubuntu 18.04) that runs processes on Nvidia GPUs. A certain process has allocated memory and is now defunct, leaving the GPUs basically unusable.

ps -o ppid= -p

returns 1, which means that PID 1 is the parent of my defunct process, so I can't kill it.

nvidia-smi reveals that this process has lots of memory allocated on the GPUs. So I figure I can use

nvidia-smi --gpu-reset

to free the resources. Is the child process going to generate any trouble? Can it "see" that the resources it has allocated are not available anymore?

In essence: is this dangerous in any way?

Score:0

Using nvidia-smi --gpu-reset will reset the GPU and free any allocated resources, including memory, held by the defunct process. However, this command can only be used when the GPU is idle, meaning no other active processes are using the GPU.

If your GPU is being used by other active processes, the --gpu-reset command might fail or cause unintended side effects, such as terminating those processes or causing them to malfunction due to the sudden loss of GPU resources.

Since the defunct process's parent is PID=1, it is unlikely that it will generate any further trouble. When you reset the GPU, the resources it has allocated will be released, and the defunct process won't be able to see or use them.
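
To check what state the process is really in and who its parent is, something like the following minimal sketch could be used. It assumes a Linux /proc filesystem; the helper names proc_field and is_defunct are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# proc_field PID FIELD: print the first word of FIELD from /proc/PID/status.
# (Hypothetical helper; FIELD is e.g. "State" or "PPid".)
proc_field() {
    pid=$1
    field=$2
    while IFS=: read -r key value; do
        if [ "$key" = "$field" ]; then
            # Word-split the value to drop the leading tab, keep the first word.
            set -- $value
            printf '%s\n' "$1"
            return 0
        fi
    done < "/proc/$pid/status"
    return 1
}

# is_defunct PID: succeed only if the kernel reports state Z (zombie).
is_defunct() {
    [ "$(proc_field "$1" State)" = "Z" ]
}
```

For example, `is_defunct 12345 && echo zombie` (12345 being a placeholder PID) prints "zombie" only for a true zombie, and `proc_field 12345 PPid` prints its parent. A state of D (uninterruptible sleep) instead of Z would suggest the process is blocked inside the kernel rather than simply waiting to be reaped.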

It is generally safe to use nvidia-smi --gpu-reset as long as there are no other processes actively using the GPU. If there are other processes using the GPU, you should try to gracefully stop those processes before resetting the GPU. Additionally, it is a good idea to monitor your system after resetting the GPU to ensure that no unexpected issues arise.
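
The "check first, then reset" idea can be sketched as below. This is only an illustration under some assumptions: it targets a single GPU by index, it needs root for the reset itself, and it only looks at compute processes reported by nvidia-smi, so other users of the device (an X server or a monitoring tool, for instance) would not be caught. The function names gpu_pids and safe_reset and the NVIDIA_SMI variable are hypothetical.

```shell
#!/bin/sh
# Allow the nvidia-smi binary to be overridden (an assumption of this sketch).
NVIDIA_SMI=${NVIDIA_SMI:-nvidia-smi}

# gpu_pids GPU_INDEX: print the PIDs of compute processes on that GPU.
gpu_pids() {
    "$NVIDIA_SMI" -i "$1" --query-compute-apps=pid --format=csv,noheader
}

# safe_reset GPU_INDEX: reset the GPU only if no compute process is listed.
# Returns 1 if the GPU is busy, 2 if the query itself failed.
safe_reset() {
    gpu=$1
    pids=$(gpu_pids "$gpu") || return 2
    if [ -n "$pids" ]; then
        echo "GPU $gpu still in use by PID(s): $pids" >&2
        return 1
    fi
    "$NVIDIA_SMI" --gpu-reset -i "$gpu"
}
```

Calling `safe_reset 0` as root would then attempt the reset of GPU 0 only when nothing is listed. If the driver still lists a defunct process as holding memory, the check reports the GPU as busy and no reset is attempted.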

It's quite common to have to restart the machine if the GPU gets frozen, as the kernel module has probably become stuck and doesn't respond even to killing the processes that are using the GPU.

Marco Montevechi Filho
Interesting, you described basically everything that happened with the system after I asked the question. We tried nvidia-smi --gpu-reset, but the defunct process still made the command fail. After I tried to rmmod nvidia_rmu, rmmod froze, and everything related to Nvidia or kernel module management froze as well. We have to reboot. Is there another way to force a reset of only the GPU without rebooting the whole system?
Marco Montevechi Filho
Or am I looking at the wrong source of trouble here, and Nvidia thought that the defunct process still had memory allocated because the kernel module was already frozen in the first place, not because the process died in a weird way?
DenisZ
@MarcoMontevechiFilho To my knowledge, it's not possible to recover the Nvidia kernel module once it becomes defunct. Even a reboot via the shell does not always work, and a hard reset is needed.
Marco Montevechi Filho
Okay, thanks! Indeed, in my case I had to reboot the machine.