
Slurm srun cannot allocate resources for GPUs - Invalid generic resource specification


I am able to launch a job on a GPU server the traditional way (using CPU and MEM as consumables):

~ srun -c 1 --mem 1M -w serverGpu1 hostname
serverGpu1

but trying to use the GPUs will give an error:

~ srun -c 1 --mem 1M --gres=gpu:1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

I checked this question but it doesn't help in my case.

slurm.conf

On all nodes

SlurmctldHost=vinz
SlurmctldHost=shiny
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/media/Slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/media/Slurm
SwitchType=switch/none
TaskPlugin=task/cgroup

InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/media/Slurm/job_completion.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/media/Slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
MaxArraySize=10001
NodeName=docker1 CPUs=144 Boards=1 RealMemory=300000 Sockets=4 CoresPerSocket=18 ThreadsPerCore=2 Weight=100 State=UNKNOWN
NodeName=serverGpu1 CPUs=96 RealMemory=550000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 Gres=gpu:nvidia_tesla_t4:4 ThreadsPerCore=2 Weight=500 State=UNKNOWN

PartitionName=Cluster Nodes=docker1,serverGpu1 Default=YES MaxTime=INFINITE State=UP

cgroup.conf

On all nodes

CgroupAutomount=yes 
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup" 

ConstrainCores=yes 
ConstrainDevices=yes
ConstrainRAMSpace=yes

gres.conf

Only on GPU servers

AutoDetect=nvml

As for the log of the GPU server:

[2021-12-06T12:22:52.800] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-12-06T12:22:52.801] CPU frequency setting not configured for this node
[2021-12-06T12:22:52.803] slurmd version 20.11.2 started
[2021-12-06T12:22:52.803] killing old slurmd[42176]
[2021-12-06T12:22:52.805] slurmd started on Mon, 06 Dec 2021 12:22:52 +0100
[2021-12-06T12:22:52.805] Slurmd shutdown completing
[2021-12-06T12:22:52.805] CPUs=96 Boards=1 Sockets=2 Cores=24 Threads=2 Memory=772654 TmpDisk=1798171 Uptime=8097222 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

I would like some guidance on how to resolve this issue, please.

Edit: As requested by @GeraldSchneider

~ sinfo -N -o "%N %G"
NODELIST GRES
docker1 (null)
serverGpu1 (null)
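
For reference, what the driver and the controller each see can be cross-checked directly (output will vary with the setup):

~ nvidia-smi -L
~ scontrol show node serverGpu1 | grep -i gres

The first lists the GPUs the NVIDIA driver exposes; the second shows the Gres string slurmctld has registered for the node.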
Comments

Gerald Schneider: Can you please add the output of `sinfo -N -o "%N %G"`?
user324810: @GeraldSchneider done!
Gerald Schneider: Try adding the GPUs to gres.conf on the node directly, instead of setting it to AutoDetect. I get the correct GPU definitions in the %G column with sinfo on my nodes.
user324810: I removed `AutoDetect=nvml` and set the following line in gres.conf: `Name=gpu File=/dev/nvidia[0-3]`, and in slurm.conf I changed the NodeName entry for the GPU node to `Gres=gpu`. In the log I got `[2021-12-06T16:05:47.604] WARNING: A line in gres.conf for GRES gpu has 3 more configured than expected in slurm.conf. Ignoring extra GRES.`
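That warning usually points to a count mismatch: `Gres=gpu` in slurm.conf defaults to a count of 1, while the gres.conf line above exposes four device files. A minimal sketch of matching entries, reusing the names from this question (adjust paths and counts to the actual hardware):

slurm.conf (all nodes):
NodeName=serverGpu1 CPUs=96 RealMemory=550000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 Gres=gpu:4 Weight=500 State=UNKNOWN

gres.conf (on serverGpu1):
Name=gpu File=/dev/nvidia[0-3]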
Gerald Schneider: My config looks very similar to yours. The only difference I see is that I have accounting storage enabled and have set `AccountingStorageTRES=gres/gpu,gres/gpu:tesla`, but I don't think that should be necessary. I also have a `Type=` set in gres.conf; you could try setting it to `nvidia_tesla_t4` so it matches your definition in slurm.conf.
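A sketch of gres.conf with an explicit type, assuming the slurm.conf node definition keeps the typed form `Gres=gpu:nvidia_tesla_t4:4` from the question:

gres.conf (on serverGpu1):
Name=gpu Type=nvidia_tesla_t4 File=/dev/nvidia[0-3]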
Gerald Schneider: Are the slurm.conf files identical on your nodes? Try setting `DebugFlags=gres` and see if something helpful shows up in the logs.
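A sketch of how that could be enabled, assuming the same slurm.conf is then copied to every node and the daemons run under the usual systemd unit names:

slurm.conf (all nodes):
DebugFlags=gres

~ sudo systemctl restart slurmctld
~ sudo systemctl restart slurmd
~ tail -f /var/log/slurm-llnl/slurmd.log

GRES-related messages should then appear in slurmd.log (path from the config above) and in slurmctld.log under /media/Slurm.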