Score:1

GPU server freezes during GPU idling

We have a new Supermicro AS-4124GS-TNR server equipped with eight NVIDIA RTX A6000 GPUs. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (the Nouveau driver is not used), and the CUDA version is 11.2.

We ran a few long-running tests on the GPUs and the system was stable. However, after the GPUs had been idle for a while, the system crashed repeatedly.

We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling (an assumption backed by other user reports found on the internet).

The only way to do this that we know of is to start X (e.g. by starting gdm) and then set the value accordingly via nvidia-settings (running nvidia-settings without X/gdm leads to "Unable to init server: Could not connect: Connection refused."). But when stopping X/gdm, the GpuPowerMizerMode value is automatically reset to 2. Unfortunately, keeping X/gdm running is not an option because this also leads to system instability.

So, our problem seems to be as follows:

  1. GPU idling with GpuPowerMizerMode != 1 can result in a system freeze. GpuPowerMizerMode can only be set via nvidia-settings connected to a running X/dm(?). To keep the value at 1, X/dm(?) has to keep running.
  2. A running X/gdm can cause a system crash.

Are our assumptions correct? / Are others also experiencing these specific problems?

How can we solve the problem of freezing during GPU idling?

Score:1

It should not be necessary to start a GUI session (or even have one installed!) to change settings such as this; nvidia-settings should work fine from the framebuffer console or even in a script you write that runs at startup.

Check to be sure:

# nvidia-settings -q GpuPowerMizerMode

  Attribute 'GPUPowerMizerMode' (blacktemple:1[gpu:0]): 1.
    Valid values for 'GPUPowerMizerMode' are: 0, 1 and 2.
    'GPUPowerMizerMode' can use the following target types: GPU.

For eight GPUs just write a simple script, something like:

for n in $(seq 0 7); do
    nvidia-settings -a "[gpu:$n]/GpuPowerMizerMode=1"
done

and run it at startup in whatever manner you find convenient.
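
The loop above hard-codes eight GPUs. As a minimal sketch, the count can instead be taken from nvidia-smi -L, which prints one line per GPU. The function names gpu_count and set_powermizer_all and the variables NVIDIA_SMI and NVIDIA_SETTINGS are hypothetical helpers introduced here for illustration, not part of any NVIDIA tool. The sketch assumes both tools are on the PATH and that nvidia-settings is able to apply the attribute on your system.

```shell
#!/bin/sh
# Sketch only: apply a GpuPowerMizerMode value to every GPU the driver reports.
# NVIDIA_SMI / NVIDIA_SETTINGS are hypothetical override variables so the
# commands can be swapped out (for example when testing without GPUs).
NVIDIA_SMI=${NVIDIA_SMI:-nvidia-smi}
NVIDIA_SETTINGS=${NVIDIA_SETTINGS:-nvidia-settings}

# Print the number of GPUs; "nvidia-smi -L" emits one "GPU <n>: ..." line each.
gpu_count() {
    "$NVIDIA_SMI" -L | grep -c '^GPU '
}

# Set GpuPowerMizerMode to the given value (default 1) on GPUs 0..count-1.
# Stops and returns non-zero as soon as one assignment fails.
set_powermizer_all() {
    mode=${1:-1}
    count=$(gpu_count)
    n=0
    while [ "$n" -lt "$count" ]; do
        "$NVIDIA_SETTINGS" -a "[gpu:$n]/GpuPowerMizerMode=$mode" || return 1
        n=$((n + 1))
    done
}
```

To use it, add a final line calling set_powermizer_all 1 and run the script from whatever startup mechanism you prefer. Checking the exit status is worthwhile, since a failed assignment on one GPU would otherwise go unnoticed.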


I can't say whether your crashes are due to running with GpuPowerMizerMode!=1. If that is the case, then you probably have some sort of defective hardware that you should track down and replace.

user776206: Running nvidia-settings without a running X server/gdm leads to 'Unable to init server: Could not connect: Connection refused.'
Michael Hampton: @user776206 Hm, that's unexpected. I'll go play with it a bit later.