
Managing SLURM memory on a single-node installation


I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been submitting jobs successfully with both srun and sbatch, but with the caveat that I don't allocate memory: I can allocate CPUs, but not memory.

When I try to allocate memory, I get:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

So this will run:

#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --time=6-59:00

But this will not run:

#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem=2000M
#SBATCH --time=6-59:00

Similarly, this won't run:

#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=6-59:00

Both give the above error message.

This is a pain because, now that I am starting to max out the CPU usage, jobs clash and fail. I believe it is because memory isn't being properly allocated, so programs crash with bad_alloc errors or simply stop running. I have used SLURM quite a bit on Compute Canada clusters, and assigning memory was never an issue there. Is the problem that I am running SLURM on a single node that is also the login node, or that I am essentially using default settings and need to do some admin work?

I have tried different units for memory, such as 2G rather than 2000M, and I have tried 1024M as well, but to no avail.
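
As a sanity check (a standard diagnostic, not something mentioned elsewhere in this thread), one can compare the hardware that slurmd detects with the node definition the controller is actually using; `slurmd -C` prints a ready-made `NodeName=` line that includes the detected `RealMemory` value. The node name below is a placeholder:

# Print the node configuration slurmd detects, including RealMemory
slurmd -C

# Show the node definition the controller currently holds
scontrol show node dummyname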

The slurm.conf file is:

ClusterName=linux
ControlMachine=dummyname

ControlAddr=dummyaddress
#BackupController=
#BackupAddr=
#
#SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=dummyport
SlurmdPort=dummyport+1
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
#FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
#DebugFlags=gres
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
GresTypes=gpu
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE Gres=gpu:2
#NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE
PartitionName=all Nodes=dummyname Default=YES Shared=Yes MaxTime=INFINITE State=UP
Gerald Schneider: Can you please share your slurm.conf and cgroup.conf?
Wesley: @GeraldSchneider I added the `slurm.conf` file. There does not appear to be a `cgroup.conf`. `/slurm/` has a `cgroup.conf.example` file, but that is all.
Gerald Schneider: You haven't defined any memory configuration for your node. Try adding the `RealMemory=` parameter to your `NodeName=` line.
Wesley: @GeraldSchneider I still get the error. I added `RealMemory=200000` to the line.
Wesley: I also get the same error when I try adding `SelectTypeParameters=CR_CPU_Memory` and other variants involving memory.
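
One detail worth noting here (an assumption about the setup, not stated in the thread): slurmctld and slurmd only pick up slurm.conf edits after being restarted or reconfigured, so a changed `RealMemory` value will not register until then. Assuming the usual systemd units on CentOS 7:

# Restart both daemons so edits to slurm.conf take effect
systemctl restart slurmctld
systemctl restart slurmd

# Alternatively, many parameters can be re-read in place
scontrol reconfigure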
Wesley: Any more ideas, @GeraldSchneider? 50 points isn't much, but they are honest points :)
Nathan Musoke: @wesley, I was able to fix what might be the same error by setting `MaxMemPerNode`. See https://slurm.schedmd.com/cons_res_share.html
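
Putting the comment suggestions together, a minimal sketch of the relevant slurm.conf lines might look like the following. The node name and memory figures (in MB) are placeholders carried over from the thread, and `CR_Core_Memory` is used here as the memory-tracking counterpart of the existing `CR_CORE` setting (the `CR_CPU_Memory` variant mentioned above is analogous):

# Schedule memory as a consumable resource alongside cores
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Tell SLURM how much usable memory (MB) the node has
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 RealMemory=200000 State=IDLE Gres=gpu:2

# Optionally cap per-job memory on the partition, as suggested above
PartitionName=all Nodes=dummyname Default=YES Shared=Yes MaxMemPerNode=200000 MaxTime=INFINITE State=UP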