I have one Ryzen 9 5950X CPU (16 cores/32 threads) and one Xeon Phi 7120P card. The corresponding nodes and partitions are defined in slurm.conf as:
NodeName=mic0 RealMemory=15000 Sockets=1 CoresPerSocket=61 ThreadsPerCore=4 State=UNKNOWN
PartitionName=compute Nodes=mic0 Default=YES MaxTime=INFINITE State=UP TRESBillingWeights="CPU=1.0,Mem=4.0G"
NodeName=amd RealMemory=10000 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=fast Nodes=amd Default=No MaxTime=INFINITE State=UP TRESBillingWeights="CPU=4.0,Mem=4.0G"
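As a sanity check on the node definitions, here is a minimal Python sketch that computes the number of logical CPUs each NodeName line implies (Sockets x CoresPerSocket x ThreadsPerCore). The helper name logical_cpus is my own and not part of Slurm, and the sketch assumes simple space-separated key=value lines like the ones above.

```python
def logical_cpus(node_line):
    """Return (node name, Sockets * CoresPerSocket * ThreadsPerCore).

    Hypothetical helper: only handles plain 'Key=Value' tokens separated
    by whitespace, as in the NodeName lines quoted above.
    """
    fields = dict(tok.split("=", 1) for tok in node_line.split() if "=" in tok)
    total = (
        int(fields.get("Sockets", 1))
        * int(fields.get("CoresPerSocket", 1))
        * int(fields.get("ThreadsPerCore", 1))
    )
    return fields["NodeName"], total


NODE_LINES = [
    "NodeName=mic0 RealMemory=15000 Sockets=1 CoresPerSocket=61 ThreadsPerCore=4 State=UNKNOWN",
    "NodeName=amd RealMemory=10000 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN",
]

for line in NODE_LINES:
    name, cpus = logical_cpus(line)
    print(f"{name}: {cpus} logical CPUs")
```

For the amd node this gives 32 logical CPUs, which matches the 16 cores/32 threads of the Ryzen.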
I want to run one task per core or per thread of the Ryzen CPU, but every task in my jobs gets access to all CPU threads. For example, after allocating a job with
salloc -p fast -n 8 --threads-per-core=1 --mem=256mb
the command
srun -l --cpu_bind=threads cat /proc/self/status | grep Cpus_allowed_list | sort -n
displays:
0: Cpus_allowed_list: 0-31
1: Cpus_allowed_list: 0-31
2: Cpus_allowed_list: 0-31
3: Cpus_allowed_list: 0-31
4: Cpus_allowed_list: 0-31
5: Cpus_allowed_list: 0-31
6: Cpus_allowed_list: 0-31
7: Cpus_allowed_list: 0-31
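To make the symptom easy to check after each configuration change, here is a small Python sketch that reads output in the format shown above and reports every task allowed on more than one CPU. The function names parse_cpu_list and unbound_tasks are my own, and the sketch assumes each line has the form "<task>: Cpus_allowed_list: <list>" as printed by srun -l.

```python
def parse_cpu_list(spec):
    """Expand a Linux CPU list such as '0-3,8' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def unbound_tasks(srun_output, max_cpus=1):
    """Return {task_id: allowed_cpu_count} for tasks allowed on more than max_cpus CPUs."""
    bad = {}
    for line in srun_output.splitlines():
        # Assumed line shape: '<task>: Cpus_allowed_list: <cpu list>'
        task, _, rest = line.partition(":")
        label, _, spec = rest.partition(":")
        if label.strip() != "Cpus_allowed_list":
            continue
        count = len(parse_cpu_list(spec))
        if count > max_cpus:
            bad[int(task)] = count
    return bad


# The output reported above: all 8 tasks are allowed on CPUs 0-31.
sample = "\n".join(f"{i}: Cpus_allowed_list: 0-31" for i in range(8))
print(unbound_tasks(sample))
```

With the output above, all eight tasks are reported with 32 allowed CPUs. With working binding the returned dictionary would be empty (or, when binding to whole cores, each task would show at most 2 CPUs with max_cpus=2).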
I want each task to be confined to a single thread, or failing that, to a single core. The same problem occurs with
salloc -p fast -n 8 --ntasks-per-core=1 --mem=256mb
In contrast to the Ryzen, binding works as expected on the Xeon Phi.
How can I fix this? Is there a mistake in slurm.conf or in the job allocation commands?
The Slurm version is 21.08.8-2.
The OS is CentOS 7.
The complete slurm.conf (it is a very small "cluster", just a workstation):
ClusterName=cluster
SlurmctldHost=amd
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000 # don't use the qos factor
PriorityWeightTRES=CPU=1000,Mem=4000
PriorityFavorSmall=YES
AccountingStorageEnforce=associations,limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=mic0 RealMemory=15000 Sockets=1 CoresPerSocket=61 ThreadsPerCore=4 State=UNKNOWN
PartitionName=compute Nodes=mic0 Default=YES MaxTime=INFINITE State=UP TRESBillingWeights="CPU=1.0,Mem=4.0G"
#
NodeName=amd RealMemory=10000 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=fast Nodes=amd Default=No MaxTime=INFINITE State=UP TRESBillingWeights="CPU=4.0,Mem=4.0G"