Introduction to Slurm
Ismael Fernández Pavón
16 / 12 / 2021
Outline:
• What is Slurm?
• Resource Manager
• Job Scheduler
• Basic interaction
What is Slurm?
A cluster manager and job scheduler for large and small Linux clusters.
• Allocates access to resources for some duration of time.
• Provides a framework for starting, executing, and monitoring work (normally a parallel job).
• Arbitrates contention for resources by managing a queue of pending work.
What is Slurm?
• The name “Slurm” is preferred, not SLURM or any other variation.
• Historically an acronym: Simple Linux Utility for Resource Management.
• The all-capitals form dates from the earlier days of the software, when Slurm was just a resource manager.
What is Slurm?
[Diagram: resource managers vs. schedulers. ALPS (Cray) and Torque are resource managers; Maui and Moab are schedulers; LoadLeveler (IBM), LSF, PBS Pro and Slurm act as both.]
Slurm is:
✓ Open source
✓ Fault-tolerant
✓ Highly scalable
Slurm: Resource Management
Cluster:
A collection of many separate servers (nodes), connected via a fast interconnect (Ethernet, InfiniBand…).
Slurm: Resource Management
Node:
An individual computer component of an HPC system, providing CPUs (cores / threads) and, on some nodes, GPGPUs (GRES).
Nodes:
• pirineus[1-44]
• pirineus[45-50]
• canigo[1,2]
• pirineusgpu[1-4]
• pirineusknl[1-4]
Slurm: Resource Management
Partition:
A logical group of nodes with common specs.
Partitions:
• std
• std-fat
• mem
• gpu
• knl
• covid19
Slurm: Resource Management
Job:
An allocation of resources (cores, memory) assigned to a user for a specified amount of time. A job has:
• ID (a number)
• Name
• Time limit
• Size specification
• Dependencies on other jobs
• State
Slurm: Resource Management
Job step:
A set of (possibly parallel) tasks within a job, using cores and memory from the job's allocation. A step has:
• ID (a number)
• Name
• Time limit
• Size specification
Slurm: Resource Management
[Figure: the cluster completely full.]
Job scheduling time!
Slurm: Job Scheduling
Scheduling: the process of determining the next job to run and on which resources.
[Figure: jobs laid out on a resources-vs-time chart.]
• FIFO Scheduling
• Backfill Scheduling, which takes into account:
  − Job priority
  − Time limit (Important!)
Backfill Scheduling:
• E.g.: a new lower-priority job is submitted.
[Figure sequence: a resources-vs-time chart in which each job is drawn with its elapsed time and its time limit. Scheduled strictly in priority order, the new job would wait 7 time units; with backfilling it is slotted into an idle gap and starts after a wait of only 1, without delaying any higher-priority job.]
Slurm: Job Scheduling
Backfill Scheduling:
• Starts with job priority:

Job_priority =
      site_factor
    + (PriorityWeightQOS)       * (QOS_factor)
    + (PriorityWeightPartition) * (partition_factor)
    + (PriorityWeightFairshare) * (fair-share_factor)
    + (PriorityWeightAge)       * (age_factor)
    + (PriorityWeightJobSize)   * (job_size_factor)
    + (PriorityWeightAssoc)     * (assoc_factor)
    + SUM(TRES_weight_<type> * TRES_factor_<type> …)
    − nice_factor

The PriorityWeight* terms are fixed values set in the cluster configuration, the *_factor terms are dynamic values computed per job, and nice_factor is a user-defined value.
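To see how these weights and factors combine for jobs that are actually queued, Slurm ships the sprio command (not shown in these slides); a minimal usage sketch, reusing the job ID from the later examples:

$ sprio                 # priority breakdown for all pending jobs
$ sprio -j 1720189      # breakdown for a single job
$ sprio -u <username>   # breakdown for one user's pending jobs
$ sprio -w              # show the configured PriorityWeight* values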
Slurm: Job Scheduling
Backfill Scheduling:
• Priority factors:
QoS:
• Account's priority: Normal or Low.
• RES users: class_a, class_b, class_c.
Fairshare:
• Depends on past consumption and on the resources requested.
Age:
• Priority increases the longer the job waits in the queue.
• Capped at 7 days.
• Not applied to dependent jobs!
Job size:
• Bigger jobs have higher priority.
• Based ONLY on resources, NOT on time.
Basic: General information
• Login:
$ ssh -p 2122 <user>@hpc.csuc.cat
• Transfer files:
$ scp -P 2122 <local_file> <user>@hpc.csuc.cat:<remote_path>
• Storage:
Name                         Variable                    Availability  Time limit  Backup
/home/<user>                 $HOME                       Global        -
/scratch/<user>              -                           Global        30 d
/scratch/<user>/tmp/<jobid>  $SCRATCH / $SHAREDSCRATCH   Global        7 d
/tmp/<user>/<jobid>          $SCRATCH / $LOCALSCRATCH    Node          Job
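For larger or repeated transfers, rsync (a standard tool, not covered in these slides) can reuse the same SSH port, assuming it is installed on both ends; the paths below are placeholders:

$ rsync -av -e "ssh -p 2122" <local_dir>/ <user>@hpc.csuc.cat:<remote_path>/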
Basic: General information
• Linux commands:
Command           Description
pwd               Show current path
ls                List current folder's files
cd <path>         Change directory
mkdir <dir>       Create directory
cp <file> <new>   Copy
mv <file> <new>   Move
rm <file>         Remove file
man <command>     Show manual
CTRL-c            Stop current command
CTRL-r            Search history
!!                Repeat last command
grep <p> <f>      Search for patterns in files
tail <file>       Show last 10 lines
head <file>       Show first 10 lines
cat <file>        Print file content
touch <file>      Create an empty file
Basic: Jobs
Batch job:
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
[Diagram: submit jobA → execute jobA → obtain the results.]
Basic: Jobs
Batch job:
Slurm directives
-J <name>    Name of the job
-o <file>    Job's standard output file
-e <file>    Job's standard error file
-p <part>    Partition requested
-n <#tasks>  Number of tasks
-c <#cpus>   Number of CPUs per task
-t <time>    Time limit (dd-hh:mm, hh:mm, mm)
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
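As a concrete illustration of the template above, a hypothetical run of a threaded application called my_app on the std partition might look as follows; the module, program and file names are placeholders, not actual site modules:

#!/bin/bash
#SBATCH -J my_app_run
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p std
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -t 60

module load my_app                 # hypothetical module name
cp -r input ${SCRATCH}             # stage the inputs
cd ${SCRATCH}
my_app input/case.dat > case.log   # hypothetical application
cp case.log ${SLURM_SUBMIT_DIR}    # retrieve the results

It would be submitted with:

$ sbatch my_app.sbatch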
Basic: Jobs
Batch job:
Defaults
• std:     1 core, 3900 MB / core
• std-fat: 1 core, 7900 MB / core
• mem:     1 core, 23900 MB / core
• gpu:     24 cores, 3900 MB / core, 1 GPGPU
• knl:     the whole node
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
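When a job needs more than the default memory per core, it can be requested explicitly in the same header with the standard sbatch options; the value below is purely illustrative, and the partition limits still apply:

#SBATCH --mem-per-cpu=7800   # MB per allocated core (alternative: --mem=<size> for total memory per node)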
Basic: Jobs
Batch job:
• First: load the module.
• Second: copy the inputs to $SCRATCH and change the working directory.
• Third: run the application.
• Fourth: copy the outputs back.
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
Batch job: Steps
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
srun <application>
srun <application> &
srun <application> &
wait
cp -r <output> ${SLURM_SUBMIT_DIR}
[Diagram: submit jobA → execute steps jobA.1, jobA.2 … jobA.N → obtain the results.]
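Each srun line in the script launches one job step inside the job's allocation; backgrounded steps run concurrently and wait keeps the batch script alive until all of them finish. A sketch, assuming the job requested at least 8 tasks and a hypothetical program called solver:

srun -n 4 ./solver case1.inp &   # step 0: 4 tasks
srun -n 4 ./solver case2.inp &   # step 1: 4 more tasks, runs alongside step 0
wait                             # block until both steps have ended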
Basic: Jobs
Batch job: Arrays
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file_%A_%a.out
#SBATCH -e error_file_%A_%a.err
#SBATCH --array=0-4
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
[Diagram: submit jobA → array tasks jobA_1, jobA_2, jobA_3 … jobA_N execute independently → obtain one result per task.]
Basic: Jobs
Batch job: Arrays
• Example: Job 50
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file_%A_%a.out
#SBATCH -e error_file_%A_%a.err
#SBATCH --array=0-4
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Environment seen by the first array task (job 50):
SLURM_JOB_ID            50
SLURM_ARRAY_JOB_ID      50
SLURM_ARRAY_TASK_ID     0
SLURM_ARRAY_TASK_COUNT  5
SLURM_ARRAY_TASK_MAX    4
SLURM_ARRAY_TASK_MIN    0

Environment seen by the second array task (job 51):
SLURM_JOB_ID            51
SLURM_ARRAY_JOB_ID      50
SLURM_ARRAY_TASK_ID     1
SLURM_ARRAY_TASK_COUNT  5
SLURM_ARRAY_TASK_MAX    4
SLURM_ARRAY_TASK_MIN    0
…
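Inside the script, SLURM_ARRAY_TASK_ID is typically used to pick a different input per task; an illustrative line (the file names are placeholders):

<application> input_${SLURM_ARRAY_TASK_ID}.dat > result_${SLURM_ARRAY_TASK_ID}.out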
Batch job: Dependency
Basic: Jobs
[Diagram: a pipeline of dependent jobs: pre-processing jobs (jobA, jobB … jobX), an "ok?" check, analysis jobs (jobL, jobM … jobN), another "ok?" check, then verification.]
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
#SBATCH --dependency=afterok:<jid>
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
Batch job: Dependency
• after:<jobid>[:<jobid>...]
The specified jobs have started.
• afterok:<jobid>[:<jobid>...]
The specified jobs have completed successfully (used in the example).
• afterany:<jobid>[:<jobid>...]
The specified jobs have terminated.
• afternotok:<jobid>[:<jobid>...]
The specified jobs have failed.
• singleton
All earlier jobs with the same name and user have ended.
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
#SBATCH --dependency=afterok:<jid>
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
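The job ID can be captured at submission time instead of typing it by hand: sbatch --parsable prints only the job ID, so a chain can be wired in two lines (the script names are hypothetical):

$ jid=$(sbatch --parsable preprocess.sbatch)
$ sbatch --dependency=afterok:${jid} analysis.sbatch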
[Diagram: job life cycle. SUBMISSION → PENDING (CONFIGURING) → RUNNING → COMPLETING → a final state: COMPLETED, TIMEOUT, FAIL, OUT OF MEMORY, NODE FAIL or SPECIAL EXIT. A job can also be CANCELLED, HELD (hold / release), REQUEUEd or RESIZEd along the way.]
Basic: Jobs
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
std*      up     infinite   19     mix    pirineus[7,34-35,37-40,45]
std*      up     infinite   25     alloc  pirineus[8,11-17,19-33,36,41-44,46-50]
std-fat   up     infinite   1      mix    pirineus45
std-fat   up     infinite   5      alloc  pirineus[46-50]
gpu       up     infinite   4      idle~  pirineusgpu[1-4]
knl       up     infinite   4      idle~  pirineusknl[1-4]
mem       up     infinite   2      mix    canigo[1-2]
Basic: Jobs
$ system-status
+-----------+-------------+-----------------+--------------+------------+
| MACHINE   | TOTAL SLOTS | ALLOCATED SLOTS | QUEUED SLOTS | OCCUPATION |
+-----------+-------------+-----------------+--------------+------------+
| std nodes | 1536        | 1468            | 2212         | 95 %       |
| fat nodes | 288         | 144             | 0            | 50 %       |
| mem nodes | 96          | 96              | 289          | 100 %      |
| gpu nodes | 144         | 96              | 252          | 66 %       |
| knl nodes | 816         | 0               | 0            | 0 %        |
| res nodes | 672         | 648             | 1200         | 96 %       |
+-----------+-------------+-----------------+--------------+------------+
Basic: Jobs
$ sbatch <file>
Submitted batch job 1720189
Basic: Jobs
$ sbatch <file>
Submitted batch job 1720189

$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1720189 std test user PD 0:00 1 (Resources)

Pending reasons include:
• Resources: the job is waiting for resources to become available.
• Priority: one or more higher priority jobs exist for this partition or advanced reservation.
• Dependency: the job is waiting for a dependent job to complete.
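While a job is still pending, squeue can also report the scheduler's estimated start time; the estimate only appears once the backfill scheduler has evaluated the job:

$ squeue -u <username> --start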
Basic: Jobs
$ squeue -j 1720189
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1720189 std test user R 1-03:44 1 pirineus27

$ sstat -aj 1720189 --format=jobid,nodelist,mincpu,maxrss,pids
JobID        Nodelist         MinCPU        MaxRSS     Pids
------------ ---------------- ------------- ---------- ----------------------
1720189.ext+ pirineus27                                226474
1720189.bat+ pirineus27       00:00.000     7348K      226491,226526,226528
1720189.0    pirineus27       1-03:44:05    19171808K  226557,226577
Basic: Jobs
$ squeue -j 1720189
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1720189 std test user CG 2-15:56 1 pirineus27
• Move files from $LOCALSCRATCH to $SHAREDSCRATCH.
Basic: Jobs
Slurm: Job Life
$ sacct
JobID        JobName    Partition  Account    AllocCPUS  State        ExitCode
------------ ---------- ---------- ---------- ---------- ------------ --------
1720189      test       std        account    16         COMPLETED    0:0
1720189.bat+ batch                 account    16         COMPLETED    0:0
1720189.ext+ extern                account    16         COMPLETED    0:0
1720189.0    pre                   account    16         COMPLETED    0:0
1720189.1    process               account    16         COMPLETED    0:0
1720189.2    post                  account    16         COMPLETED    0:0
• Final states include COMPLETED (CD), TIMEOUT (TO), OUT_OF_MEMORY (OOM), FAILED (F), NODE_FAIL (NF)…
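For finished jobs, sacct can also be asked for specific accounting fields, for example elapsed time and peak memory per step:

$ sacct -j 1720189 --format=JobID,JobName,Elapsed,MaxRSS,State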
$ scancel 1720189

$ sacct
JobID        JobName    Partition  Account    AllocCPUS  State        ExitCode
------------ ---------- ---------- ---------- ---------- ------------ --------
1720189      test       std        account    16         CANCELLED+   0:0
1720189.bat+ batch                 account    16         CANCELLED+   0:0
1720189.ext+ extern                account    16         COMPLETED    0:0
1720189.0    pre                   account    16         CANCELLED+   0:0
Basic: Jobs
Basic: Best practices
• Choose the most suitable partition.
• Use $SCRATCH as the working directory.
• Move only the necessary files.
• Keep important files in $HOME.
Questions?
Thank you for your attention!
feedback – ismael.fernandez@csuc.cat
support – http://hpc.csuc.cat
