Presentation by Ismael Fernández and Cristian Gomollón (Applications technicians at CSUC) given at the "2a Jornada de formació sobre l'ús del servei de càlcul" (2nd training day on the use of the computing service), held on 19 February 2020 at CSUC.
3. What is SLURM?
Cluster manager and job scheduler system for large and small Linux clusters.
• Allocates access to resources for some duration of time.
• Provides a framework for starting, executing, and monitoring work (normally a parallel job).
• Arbitrates contention for resources by managing a queue of pending work.
9. SLURM: Resource Management
Partitions: logical groups of nodes with common specs.
• Associated with a specific set of nodes.
• Nodes can be in more than one partition.
• Job size and time limits.
• Access control list.
• State information.
15. SLURM: Job Scheduling
Scheduling: the process of determining the next job to run and on which resources.
FIFO Scheduling
Backfill Scheduling
• Job priority
• Time limit (Important!)
[Figure: jobs plotted on resource vs. time axes.]
16. SLURM: Job Scheduling
Backfill Scheduling:
• Based on the job request, resources available, and
policy limits imposed.
• Starts with job priority.
• Higher priority jobs cannot be delayed by lower priority
jobs.
• The expected start time of pending jobs depends on the expected completion time of running jobs, so reasonably accurate time limits matter.
• Results in a resource allocation over a period of time.
17-24. SLURM: Job Scheduling
Backfill Scheduling, worked example: a new lower-priority job is submitted while higher-priority jobs are running and queued.
[Figures: jobs plotted on resource vs. time axes, marking each job's elapsed time and time limit, the submission point of the new job, and the resulting wait time.]
The two scenarios shown end with wait times of 7 and 1 time units, illustrating how backfilling the new job into gaps left by the running jobs' time limits shortens its wait.
30. SLURM: Job Scheduling
Backfill Scheduling, priority factors: QoS, Partition, Fairshare, Age.
• Age: priority grows the longer the job waits in the queue (up to a maximum of 7 days).
• Not valid for dependent jobs!
31. SLURM: Job Scheduling
Backfill Scheduling, priority factors: QoS, Partition, Fairshare, Age, Job size.
• Job size: bigger jobs have higher priority.
• Based ONLY on requested resources, NOT on requested time.
33. • sbatch – Submit a batch script.
• salloc – Request resources for an interactive job.
• srun – Start a new task (job step).
• scancel – Cancel a job.
SLURM: Commands
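Typical invocations (the script name and job ID below are placeholders):
sbatch job_script.slm            (submit the batch script)
salloc -n 4 -t 01:00:00          (interactive allocation: 4 tasks for 1 hour)
srun -n 4 ./my_app               (launch a job step on the allocated resources)
scancel 1234567                  (cancel job 1234567)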
34. • sinfo – Report system status (nodes, queues, etc.).
PARTITION AVAIL TIME NODES STATE NODELIST
std* up inf+ 2 mix pirineus[15,21]
std* up inf+ 30 alloc pirineus[13-14,16-20,22-44]
std-fat up inf+ 3 idle~ pirineus[45,49-50]
std-fat up inf+ 3 alloc pirineus[46-48]
gpu up inf+ 2 idle~ pirineusgpu[3-4]
gpu up inf+ 1 mix pirineusgpu2
knl up inf+ 3 idle~ pirineusknl[2-4]
mem up inf+ 1 mix canigo1
class_a up inf+ 1 idle~ pirineus12
class_a up inf+ 2 mix canigo1,pirineus11
class_a up inf+ 8 alloc pirineus[1-6,8-9]
class_a up inf+ 2 resv pirineus[7,10]
class_c up inf+ 1 idle~ pirineus12
class_c up inf+ 2 mix canigo1,pirineus11
class_c up inf+ 8 alloc pirineus[1-6,8-9]
class_c up inf+ 2 resv pirineus[7,10]
SLURM: Commands
35. • sinfo – Report system status.
-N Node-oriented format information, with one line per
node and partition.
-p Print information only about the specified partition(s).
--Format Specify the information to be displayed.
"Nodelist,Partition,StateCompact,CpusState,Memory,Freemem"
NODELIST PARTITION STATE CPUS(A/I/O/T) MEMORY FREE_MEM
canigo1 class_a mix 112/80/0/192 4643070 2458001
pirineus1 class_a idle~ 0/48/0/48 191904 188950
pirineus2 class_a alloc 48/0/0/48 191904 44123
pirineus3 class_a alloc 48/0/0/48 191904 41831
pirineus4 class_a mix 32/16/0/48 191904 66623
pirineus5 class_a mix 16/32/0/48 191904 162277
pirineus6 class_a alloc 48/0/0/48 191904 82747
pirineus7 class_a idle~ 0/48/0/48 191904 189289
SLURM: Commands
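A possible invocation combining the options above (the partition name is just an example):
sinfo -N -p class_a --Format="Nodelist,Partition,StateCompact,CpusState,Memory,Freemem"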
36. • sinfo – Report system status.
-s List only a partition state summary with no node state details.
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
std* up infinite 32/0/0/32 pirineus[13-44]
std-fat up infinite 3/3/0/6 pirineus[45-50]
gpu up infinite 1/2/0/3 pirineusgpu[2-4]
knl up infinite 0/3/0/3 pirineusknl[2-4]
mem up infinite 1/0/0/1 canigo1
class_a up infinite 10/3/0/13 canigo1,pirineus[1-12]
class_b up infinite 10/3/0/13 canigo1,pirineus[1-12]
class_c up infinite 10/3/0/13 canigo1,pirineus[1-12]
SLURM: Commands
37. • sinfo – Report system status.
-s List only a partition state summary with no node state details.
TIP: Use system-status.
SLURM: Commands
+-----------+-------------+-----------------+--------------+------------+
| MACHINE | TOTAL SLOTS | ALLOCATED SLOTS | QUEUED SLOTS | OCCUPATION |
+-----------+-------------+-----------------+--------------+------------+
| std nodes | 1536 | 1468 | 2212 | 95 % |
| fat nodes | 288 | 144 | 0 | 50 % |
| mem nodes | 96 | 96 | 289 | 100 % |
| gpu nodes | 144 | 96 | 252 | 66 % |
| knl nodes | 816 | 0 | 0 | 0 % |
| res nodes | 672 | 648 | 1200 | 96 % |
+-----------+-------------+-----------------+--------------+------------+
38. • squeue – Report job and job step status.
JOBID PARTIT NAME USER ST TIME NODES NODELIST
1222376 mem dada2 mvelasco PD 0:00 1 (Resources)
1221504 std Freq_TS_ uabqut16 PD 0:00 1 (Resources)
1222346 std Cu2T-tra agusti PD 0:00 1 (Priority)
1222347 std AuIPr_Ph sciortin PD 0:00 1 (Priority)
1220930 std nickeloc ubaqis07 PD 0:00 1 (Priority)
1222351 std g09d1 upceqt04 R 2:18:20 1 pirineus21
1221621 mem C3 vpenya R 23:56:04 1 canigo1
1221569 std preTS_VI porellan R 19:39:13 1 pirineus17
1221543 std Au2-Cl-d agusti R 1-13:40:32 1 pirineus22
1221616 std-fat CuII_mod mariona R 1-10:35:33 1 pirineus47
1221617 std-fat CuIII_mo mariona R 1-10:35:33 1 pirineus48
1221461 std opt-1xe2 pbesalu R 2-11:22:43 1 pirineus37
1221413 std s24ls_de jcirera R 4:08:01 1 pirineus22
1220720 std nickeloc ubaqis07 R 4-03:00:44 2 pirineus[34-35]
1220719 std nickeloc ubaqis07 R 4-03:00:48 1 pirineus14
1221546 mem C60-Zn-T pbesalu R 22:31:12 1 canigo1
SLURM: Commands
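Some useful squeue filters (username, partition and job ID are placeholders):
squeue -u username               (only that user's jobs)
squeue -p std -t PD              (pending jobs in the std partition)
squeue -j 1222351                (a single job)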
39. • scontrol – Administrator tool to view and/or update
system, job, step, partition or reservation status.
scontrol hold <jobid>
scontrol release <jobid>
scontrol show job <jobid>
SLURM: Commands
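scontrol update can also modify a pending job's request, e.g. its time limit (a sketch; regular users can normally only reduce it):
scontrol update JobId=<jobid> TimeLimit=1-00:00:00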
45. SLURM: Job Life
[Job state diagram: SUBMISSION leads to PENDING (CONFIGURING), then RUNNING, COMPLETING and COMPLETED; HOLD/RELEASE move jobs to and from HELD; other transitions include REQUEUE and RESIZE, plus the end states CANCELED, TIMEOUT, FAIL, OUT OF MEMORY, NODE FAIL and SPECIAL EXIT.]
Pending Reasons:
• Priority: One or more higher priority jobs exist for this partition or advanced
reservation.
• Resources: The job is waiting for resources to become available.
• Reservation: The job is waiting for its advanced reservation to become available.
• ReqNodeNotAvail: Some node specifically required by the job is not currently
available.
• JobHeldAdmin / JobHeldUser: The job is held by a system administrator / the
user.
• Dependency: This job is waiting for a dependent job to complete.
• BadConstraints: The job's constraints cannot be satisfied.
• InvalidQOS: The job's QOS is invalid. Account’s assigned time exhausted?
• AssociationTimeLimit: The job's association has reached its time limit.
Account’s assigned time exhausted?
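The pending reason can also be queried explicitly with squeue's format option (%r prints the reason):
squeue -j <jobid> -o "%.10i %.9P %.8T %r"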
48. SLURM: News
• SLURM upgrade to 19.05.
• New job state: OUT_OF_MEMORY (job killed by OOM).
• Fixed ratio between MEMORY and CPU:

Partition   Max. mem per CPU (MB)   Max. mem per CPU (GB)
std         3900 MB                 3.8 GB
std-fat     7900 MB                 7.7 GB
mem         24180 MB                23.6 GB
52. Login on CSUC infrastructure
• Login
ssh -p 2122 username@hpc.csuc.cat
• Transfer files
scp -P 2122 local_file username@hpc.csuc.cat:[path to your folder]
sftp -oPort=2122 username@hpc.csuc.cat
• Useful paths

Name                       Variable                   Availability         Quota/project    Time limit   Backup
/home/$user                $HOME                      global               >64 GB           unlimited    Yes
/scratch/$user             $SCRATCH                   global               unlimited        30 days      No
/scratch/$user/tmp/jobid   $TMPDIR / $SHAREDSCRATCH   global               job file limit   1 week       No
/tmp/$user/jobid           $TMPDIR / $LOCALSCRATCH    local to each node   job file limit   1 week       No

• Get HC consumption
consum -a <year>                   (group consumption)
consum -a <year> -u <username>     (user consumption)
53. Batch job submission: Default settings
• 4 GB/core on the std partition and 8 GB/core on std-fat.
• 24 GB/core on the mem partition.
• 1 core on the std, std-fat and mem partitions.
• 24 cores and 1 GPU on the gpu partition.
• The whole node on the KNL partition.
• Non-exclusive, multinode job.
• Working and output directories default to the submission directory.
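Any of these defaults can be overridden in the job script; for example (illustrative values), requesting 4 cores and 8 GB instead of the 1-core default on std:
#SBATCH -p std
#SBATCH -n 1
#SBATCH -c 4
#SBATCH --mem=8000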
54. Batch job submission
• Basic Linux commands:
Description                     Command   Example
List files                      ls        ls /home/user
Make a folder                   mkdir     mkdir /home/prova
Change folder                   cd        cd /home/prova
Copy files                      cp        cp file1 file2
Move a file                     mv        mv /home/prova.txt /cescascratch/prova.txt
Delete a file                   rm        rm filename
Print file contents             cat       cat filename
Find a string in files          grep      grep 'word' filename
List the last lines of a file   tail      tail filename
• Text editors: vim, nano, emacs, etc.
• More detailed info and options about the commands:
'command' --help
man 'command'
55. Batch job submission: The slurm submit script
#!/bin/bash
#SBATCH -J JOB_NAME
#SBATCH -o OUTPUT_FILE.log
#SBATCH -e ERROR_FILE.err
#SBATCH -p PARTITION
#SBATCH --mem=TOTMEM
#SBATCH -n NTASKS
#SBATCH -c NCORES_PER_TASK

module load mpi/intel/openmpi/3.1.0
cp -r $input $SCRATCH
cd $SCRATCH
srun $APPLICATION
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR

Script structure:
• Scheduler directives
• Setting up the environment variables and paths
• Move the input files to the working directory
• Launch the application (similar to mpirun)
• Create the output folder and move the outputs
56. Scheduler directives/Options : #SBATCH
• -c, --cpus-per-task=ncpus number of cpus required per task
• --gres=list required generic resources
• -J, --job-name=jobname name of job
• -n, --ntasks=ntasks number of tasks to run
• --ntasks-per-node=n number of tasks to invoke on each node
• -N, --nodes=N number of nodes on which to run (N = min[-max])
• -o, --output=out file for batch script's standard output
• -p, --partition=partition partition requested
• -t, --time=minutes time limit (format: dd-hh:mm)
57. • -C, --constraint=list specify a list of constraints (mem, vnc, ...)
• --mem=MB minimum amount of total real memory
• --reservation=name allocate resources from named reservation
• -w, --nodelist=hosts... request a specific list of hosts
• --mem-per-cpu=MB amount of real memory per allocated core
• -t, --time=minutes Job max duration (Mandatory!!)
More commands/info: type 'sbatch -h'
Scheduler directives/Options : #SBATCH
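Putting several of these directives together for a two-node MPI run (a sketch with illustrative values and a placeholder application name):
#SBATCH -J mpi_test
#SBATCH -p std
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
#SBATCH -t 0-12:00
srun ./my_mpi_app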
58. How to generate slurm script files: 1st Identify app parallelism
Thread parallelism:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=NCORES
Process parallelism:
#SBATCH --ntasks=NCORES
#SBATCH --cpus-per-task=1
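For a threaded (e.g. OpenMP) application, the allocated core count can be forwarded to the runtime through SLURM's environment (a sketch; the application name is a placeholder):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_app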
59. How to generate slurm script files: 2nd Determine the memory requirements
The partition choice strongly depends on the job's memory requirements!
Example requests:
#SBATCH --mem=63900
#SBATCH --cpus-per-task=16
#SBATCH --partition=std

#SBATCH --mem=63900
#SBATCH --cpus-per-task=8
#SBATCH --partition=std-fat

#SBATCH --mem=63900
#SBATCH --cpus-per-task=4
#SBATCH --partition=mem

#SBATCH --mem-per-cpu=3900
#SBATCH --ntasks=16
#SBATCH --partition=std

Partition     Memory/core
std/gpu       4 GB
std-fat/KNL   8 GB
mem           24 GB
60. How to generate slurm script files: 3rd RunTime requirements
#SBATCH --time=Thpc
Performance comparison:
WORKSTATION: 4 cores (Nws), 8-16 GB RAM, 1 TB disk at 600 MB/s, Ethernet 1-10 Gb/s
HPC NODE: 48 cores (Nhpc), 192 GB RAM, 200 TB disk at 4 GB/s, Infiniband 100-200 Gb/s
At a first approximation:
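Assuming near-ideal scaling with core count (a rough estimate only), the runtime to request can be derived from a workstation run as Thpc ≈ Tws × Nws / Nhpc. For example, a job that takes 24 hours on the 4-core workstation would need roughly 24 × 4 / 48 = 2 hours on a 48-core node; add a safety margin on top.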
61. How to generate slurm script files: 4th Disk/IO requirements
Two kinds of applications:
Threaded/serial (only one node):
cd $SHAREDSCRATCH
or
cd $LOCALSCRATCH
Multitask (multinode):
cd $SHAREDSCRATCH
Or let SLURM decide for you:
cd $SCRATCH
62. How to generate slurm script files: Summary
1. Identify your application parallelism.
2. Estimate the amount of resources needed by your solving algorithm.
3. Estimate the runtime as accurately as possible.
4. Determine your job's I/O and input requirements.
5. Determine which output files are necessary and save only those in your own disk space.
63. Gaussian 16 (Threaded Example)
#!/bin/bash
#SBATCH -J gau16_test
#SBATCH -o gau_test_%j.log
#SBATCH -e gau_test_%j.err
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p std
#SBATCH --mem=30000
#SBATCH --time=10-00

module load gaussian/g16b1

INPUT_DIR=$HOME/gaussian_test/inputs
OUTPUT_DIR=$HOME/gaussian_test/outputs

cd $SCRATCH
cp -r $INPUT_DIR/* .
g16 < input.gau > output.out
mkdir -p $OUTPUT_DIR
cp output.out $OUTPUT_DIR

• Threaded application
• Less than 4 GB/core: std partition
• 10 days runtime
• Set up the environment to run the application
64. Vasp 5.4.4 (Multitask Example)
#!/bin/bash
#SBATCH -J vasp_test
#SBATCH -o vasp_test_%j.log
#SBATCH -e vasp_test_%j.err
#SBATCH -n 24
#SBATCH -c 1
#SBATCH --mem-per-cpu=7500
#SBATCH -p std-fat
#SBATCH --time=20:00

module load vasp/5.4.4

INPUT_DIR=$HOME/vasp_test/inputs
OUTPUT_DIR=$HOME/vasp_test/outputs

cd $SCRATCH
cp -r $INPUT_DIR/* .
srun `which vasp_std`
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR

• Multitask application
• More than 4 GB/core but less than 8 GB/core: std-fat partition
• 20 min runtime
• Set up the environment to run the application
• Multitask applications require the 'srun' command
67. Best Practices
• Use $SCRATCH as the working directory.
• Move only the necessary files (not all the files in the folder each time).
• Try to keep important files only in $HOME.
• Try to choose the partition and resources that best fit your job.