
Slurm srun cannot allocate resources for GPUs - Invalid generic resource specification


I am able to launch a job on a GPU server the traditional way (using CPU and MEM as consumables):

~ srun -c 1 --mem 1M -w serverGpu1 hostname
serverGpu1

but trying to use the GPUs will give an error:

~ srun -c 1 --mem 1M --gres=gpu:1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

I checked this question but it doesn't help in my case.

slurm.conf

On all nodes

SlurmctldHost=vinz
SlurmctldHost=shiny
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/media/Slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/media/Slurm
SwitchType=switch/none
TaskPlugin=task/cgroup

InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/media/Slurm/job_completion.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/media/Slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
MaxArraySize=10001
NodeName=docker1 CPUs=144 Boards=1 RealMemory=300000 Sockets=4 CoresPerSocket=18 ThreadsPerCore=2 Weight=100 State=UNKNOWN
NodeName=serverGpu1 CPUs=96 RealMemory=550000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 Gres=gpu:nvidia_tesla_t4:4 ThreadsPerCore=2 Weight=500 State=UNKNOWN

PartitionName=Cluster Nodes=docker1,serverGpu1 Default=YES MaxTime=INFINITE State=UP

cgroup.conf

On all nodes

CgroupAutomount=yes 
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup" 

ConstrainCores=yes 
ConstrainDevices=yes
ConstrainRAMSpace=yes

gres.conf

Only on GPU servers

AutoDetect=nvml

As for the log of the GPU server:

[2021-12-06T12:22:52.800] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-12-06T12:22:52.801] CPU frequency setting not configured for this node
[2021-12-06T12:22:52.803] slurmd version 20.11.2 started
[2021-12-06T12:22:52.803] killing old slurmd[42176]
[2021-12-06T12:22:52.805] slurmd started on Mon, 06 Dec 2021 12:22:52 +0100
[2021-12-06T12:22:52.805] Slurmd shutdown completing
[2021-12-06T12:22:52.805] CPUs=96 Boards=1 Sockets=2 Cores=24 Threads=2 Memory=772654 TmpDisk=1798171 Uptime=8097222 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

I would like some guidance on how to resolve this issue, please.

Edit: As requested by @GeraldSchneider

~ sinfo -N -o "%N %G"
NODELIST GRES
docker1 (null)
serverGpu1 (null)
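
For reference, what the driver and the controller each see can be cross-checked directly (output will vary with the setup):

~ nvidia-smi -L
~ scontrol show node serverGpu1 | grep -i gres

The first lists the GPUs the NVIDIA driver exposes; the second shows the Gres string slurmctld has registered for the node.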
Comments

Gerald Schneider: Can you please add the output of `sinfo -N -o "%N %G"`?
user324810: @GeraldSchneider done!
Gerald Schneider: Try adding the GPUs to gres.conf on the node directly, instead of setting it to AutoDetect. I get the correct GPU definitions in the %G column with sinfo on my nodes.
user324810: I removed `AutoDetect=nvml` and set the following line in gres.conf: `Name=gpu File=/dev/nvidia[0-3]`, and in slurm.conf I changed the NodeName entry for the GPU node to `Gres=gpu`. In the log I got `[2021-12-06T16:05:47.604] WARNING: A line in gres.conf for GRES gpu has 3 more configured than expected in slurm.conf. Ignoring extra GRES.`
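That warning usually points to a count mismatch: `Gres=gpu` in slurm.conf defaults to a count of 1, while the gres.conf line above exposes four device files. A minimal sketch of matching entries, reusing the names from this question (adjust paths and counts to the actual hardware):

slurm.conf (all nodes):
NodeName=serverGpu1 CPUs=96 RealMemory=550000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 Gres=gpu:4 Weight=500 State=UNKNOWN

gres.conf (on serverGpu1):
Name=gpu File=/dev/nvidia[0-3]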
Gerald Schneider: My config looks very similar to yours. The only difference I see is that I have accounting storage enabled and have set `AccountingStorageTRES=gres/gpu,gres/gpu:tesla`, but I don't think that should be necessary. I also have a `Type=` set in gres.conf; you could try setting it to `nvidia_tesla_t4` so it matches your definition in slurm.conf.
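A sketch of gres.conf with an explicit type, assuming the slurm.conf node definition keeps the typed form `Gres=gpu:nvidia_tesla_t4:4` from the question:

gres.conf (on serverGpu1):
Name=gpu Type=nvidia_tesla_t4 File=/dev/nvidia[0-3]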
Gerald Schneider: Are the slurm.conf files identical on your nodes? Try setting `DebugFlags=gres` and see if something helpful shows up in the logs.
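A sketch of how that could be enabled, assuming the same slurm.conf is then copied to every node and the daemons run under the usual systemd unit names:

slurm.conf (all nodes):
DebugFlags=gres

~ sudo systemctl restart slurmctld
~ sudo systemctl restart slurmd
~ tail -f /var/log/slurm-llnl/slurmd.log

GRES-related messages should then appear in slurmd.log (path from the config above) and in slurmctld.log under /media/Slurm.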