We have a new Supermicro AS-4124GS-TNR server equipped with eight NVIDIA RTX A6000 GPUs. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (the Nouveau driver is not in use), and the CUDA version is 11.2.
We ran a few long-lasting tests on the GPUs and the system was stable. However, after some GPU idling the system crashed repeatedly.
We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling (an assumption backed by other user reports found on the internet).
The only way to do this that we know of is to start X (e.g. by starting gdm) and then set the value accordingly via nvidia-settings (running nvidia-settings without X/gdm leads to "Unable to init server: Could not connect: Connection refused."). But when X/gdm is stopped, the GpuPowerMizerMode value is automatically reset to 2. Unfortunately, keeping X/gdm running is not an option because this also leads to system instability.
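For reference, here is a minimal sketch of how the value could be set on all eight GPUs while X is running. It only wraps the nvidia-settings attribute assignment described above. The function names and the display value ":0" are our own assumptions for illustration, and running the file as-is only prints the commands (a dry run) instead of executing them.

```python
import os
import subprocess

GPU_COUNT = 8        # eight RTX A6000 cards, as described above
POWERMIZER_MODE = 1  # the value we assume prevents the idle crashes


def build_powermizer_commands(gpu_count=GPU_COUNT, mode=POWERMIZER_MODE):
    """Return one nvidia-settings command (as an argv list) per GPU."""
    return [
        ["nvidia-settings", "-a", f"[gpu:{i}]/GpuPowerMizerMode={mode}"]
        for i in range(gpu_count)
    ]


def apply_powermizer_mode(display=":0"):
    """Run the commands against a running X server.

    The display name ":0" is an assumption; without a reachable X server
    nvidia-settings fails with the "Unable to init server" error quoted above.
    """
    env = dict(os.environ, DISPLAY=display)
    for cmd in build_powermizer_commands():
        subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    # Dry run: only show what would be executed.
    for cmd in build_powermizer_commands():
        print(" ".join(cmd))
```

Calling apply_powermizer_mode() would have to be repeated after every X start, since the value does not survive stopping X/gdm.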
So, our problem seems to be as follows:
- GPU idling + GpuPowerMizerMode != 1 can result in a system freeze. GpuPowerMizerMode can only be set via nvidia-settings connected to a running X/dm(?). In order to persistently set the value to 1, X/dm(?) has to keep running.
- A running X/gdm can cause a system crash.
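To check the first assumption, one option would be to log the performance state of each GPU while the machine idles, so that the last entries before a freeze show which state the GPUs were in. The following is only a sketch under assumptions: it relies on the nvidia-smi query fields index, pstate, clocks.sm and power.draw, and the function names, log path and polling interval are hypothetical choices of ours.

```python
import os
import subprocess
import time

QUERY_FIELDS = ["index", "pstate", "clocks.sm", "power.draw"]


def parse_gpu_states(csv_text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpu_states():
    """Ask nvidia-smi for the current state of every GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_states(result.stdout)


def log_states(path="gpu_states.log", interval_s=10):
    """Append one line per GPU every interval_s seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for row in query_gpu_states():
                log.write(f"{stamp} {row}\n")
            # Force the data to disk so it survives a hard freeze.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Running log_states() in the background during an idle period and reading the tail of the log after a reboot would show whether the freezes coincide with a particular performance state.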
Are our assumptions correct? / Are others also experiencing these specific problems?
How can we solve the problem of freezing during GPU idling?