Introduction to Slurm Resource Manager and Job Scheduler

Introduction to Slurm
Ismael Fernández Pavón
HPC Support
13 / 12 / 2022

What is Slurm?
Resource Manager
Job Scheduler
Basic interaction

What is Slurm?
• Slurm, historically the Simple Linux Utility for Resource Management, is a cluster manager and job scheduler system for large and small Linux clusters.

• Allocates access to resources for some duration of time.
• Provides a framework for starting, executing, and monitoring work (normally a parallel job).
• Arbitrates contention for resources by managing a queue of pending work.

Resource managers and schedulers:
• Resource managers: ALPS (Cray), Torque
• Schedulers: Maui, Moab
• Both: LoadLeveler (IBM), LSF, Slurm, PBS Pro

✓ Open source
✓ Fault-tolerant
✓ Highly scalable
✓ Almost everywhere

Slurm: Resource Management
Cluster: a collection of many separate servers (nodes), connected via a fast interconnect (Ethernet, InfiniBand…).

Node: an individual computer component of an HPC system, with CPUs (cores, threads) and, on some nodes, GPGPUs (GRES).
Nodes:
• pirineus[1-6]
• pirineus[7-50]
• pirineus[51-69]
• canigo[1,2]
• pirineusgpu[1-4]
• pirineusknl[1-4]

Partition: a logical group of nodes with common specs.
Partitions:
• std
• std-fat
• mem
• gpu
• knl
• covid19
• exclusive

Job: an allocation of resources (cores and memory) assigned to a user for a specified amount of time.
Jobs:
• ID (a number)
• Name
• Time limit
• Size specification
• Dependency on other jobs
• State

Job step: a set of (possibly parallel) tasks within a job, which uses cores and memory from the job's allocation.
Job steps:
• ID (a number)
• Name
• Time limit
• Size specification

FULL CLUSTER
Job scheduling time!

Slurm: Job Scheduling
Scheduling: the process of determining which job runs next and on which resources.

Scheduling algorithms:
• FIFO Scheduling
• Backfill Scheduling, based on:
  − Job priority
  − Time limit (Important!)
[Figure: jobs placed on a chart of resources versus time.]

Backfill Scheduling:
• E.g.: a new lower-priority job is submitted.
[Figure sequence: resources versus time, with the elapsed time and time limit of each job; successive frames show "Wait time: 7" and then "Wait time: 1".]
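The backfill idea can be sketched as a simple check. This is an illustration only, not Slurm's actual algorithm: it assumes a single pool of cores and a known idle window (the "gap") before the cores are needed by a higher-priority job. The function name and all numbers are hypothetical.

```bash
#!/bin/bash
# can_backfill FREE_CORES GAP_MINUTES JOB_CORES JOB_TIME_LIMIT
# Succeeds (exit status 0) when a lower-priority job fits in the idle window:
#   - it needs no more cores than are currently free, and
#   - its time limit ends before the cores are needed by a
#     higher-priority job (the "gap").
can_backfill() {
    local free_cores=$1 gap_minutes=$2 job_cores=$3 job_limit=$4
    (( job_cores <= free_cores && job_limit <= gap_minutes ))
}

# Hypothetical numbers: 8 cores stay idle for the next 120 minutes.
if can_backfill 8 120 4 60; then
    echo "4 cores, 60 min limit: can start now"
fi
if ! can_backfill 8 120 4 180; then
    echo "4 cores, 180 min limit: must wait"
fi
```

The second case shows why the time limit matters: the same job with an overestimated limit no longer fits in the gap.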

• Starts with job priority.
Job_priority =
    site_factor
    + (PriorityWeightQOS) * (QOS_factor)
    + (PriorityWeightPartition) * (partition_factor)
    + (PriorityWeightFairshare) * (fair-share_factor)
    + (PriorityWeightAge) * (age_factor)
    + (PriorityWeightJobSize) * (job_size_factor)
    + (PriorityWeightAssoc) * (assoc_factor)
    + SUM(TRES_weight_<type> * TRES_factor_<type>…)
    - nice_factor

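As a worked example of the formula, the sketch below evaluates four of its terms with integer arithmetic. The weights are hypothetical (real values are set by the site administrator), the factors are passed in thousandths because bash has no floating point, and the site, partition, association and TRES terms are left out for brevity.

```bash
#!/bin/bash
# Hypothetical site weights; not taken from any real configuration.
PriorityWeightQOS=1000
PriorityWeightFairshare=5000
PriorityWeightAge=2000
PriorityWeightJobSize=500

# job_priority QOS FAIRSHARE AGE JOBSIZE [NICE]
# Each factor is given in thousandths (0..1000 stands for 0.0..1.0).
job_priority() {
    local qos=$1 fairshare=$2 age=$3 jobsize=$4 nice=${5:-0}
    local total=0
    total=$(( total + PriorityWeightQOS * qos ))
    total=$(( total + PriorityWeightFairshare * fairshare ))
    total=$(( total + PriorityWeightAge * age ))
    total=$(( total + PriorityWeightJobSize * jobsize ))
    # Scale back from thousandths, then subtract the nice factor.
    echo $(( total / 1000 - nice ))
}

job_priority 500 250 1000 100 0    # prints 3800
```

With these weights the fair-share and age terms dominate, which is why a job that has waited the maximum time can overtake a newer one.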

• Priority factor: QoS
  • Account’s priority:
    − Normal
    − Low
  • RES users:
    − class_a
    − class_b
    − class_c

• Priority factor: Fairshare
  • It depends on:
    − Consumption.
    − Resources requested.

• Priority factor: Age
  • Priority increases the longer the job waits in the queue.
  • Maximum: 7 days.
  • Not valid for dependent jobs!
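The age factor can be sketched as a value that grows with waiting time and stops at the 7-day maximum. The linear growth and the function name are assumptions made for illustration; the result is expressed in thousandths to stay within integer arithmetic.

```bash
#!/bin/bash
MAX_AGE_SECONDS=$(( 7 * 24 * 3600 ))   # the 7-day maximum stated above

# age_factor_permille WAIT_SECONDS
# Prints the age factor in thousandths: 0 for a new job, 1000 at the cap.
age_factor_permille() {
    local waited=$1
    if (( waited >= MAX_AGE_SECONDS )); then
        echo 1000
    else
        echo $(( waited * 1000 / MAX_AGE_SECONDS ))
    fi
}

age_factor_permille $(( 84 * 3600 ))    # 3.5 days of waiting: prints 500
age_factor_permille $(( 240 * 3600 ))   # 10 days of waiting: prints 1000
```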

• Priority factor: Job size
  • Bigger jobs have higher priority.
  • ONLY resources count, NOT time.

Basic: General information (Cheatsheet)
• Login:
ssh -p 2122 <user>@hpc.csuc.cat
• Transfer files:
scp -P 2122 <local_file> <user>@hpc.csuc.cat:<remote_path>
• Storage:
Name                         Variable                   Availability  Lifetime  Backup
/home/<user>                 $HOME                      Global        -
/scratch/<user>              -                          Global        30 d
/scratch/<user>/tmp/<jobid>  $SCRATCH / $SHAREDSCRATCH  Global        7 d
/tmp/<user>/<jobid>          $SCRATCH / $LOCALSCRATCH   Node          Job

• Linux commands:
Command          Description
pwd              Show current path
ls               List current folder’s files
cd <path>        Change directory
mkdir <dir>      Create directory
cp <file> <new>  Copy
mv <file> <new>  Move
rm <file>        Remove file
man <command>    Show manual
CTRL-c           Stop current command
CTRL-r           Search history
!!               Repeat last command
grep <p> <f>     Search for patterns in files
tail <file>      Show last 10 lines
head <file>      Show first 10 lines
cat <file>       Print file content
touch <file>     Create an empty file

Basic: Jobs
$ vim submit_file.slm
[Figure: job life cycle: PENDING → (CONFIGURING) → RUNNING → COMPLETING → COMPLETED]

Batch job:
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
[Figure: jobA is submitted, executed and obtained.]

Batch job: Slurm directives
-J <name>     Name of the job
-o <file>     Job’s standard output file
-e <file>     Job’s standard error file
-p <part>     Partition requested
-n <#tasks>   Number of tasks
-c <#cpus>    Number of CPUs per task
-t <time>     Time limit (dd-hh:mm, hh:mm:ss, mm)
#!/bin/bash
#SBATCH -n <#tasks>
#SBATCH -t 60
cd ${SCRATCH}
<application>
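The time limit given to -t is easy to misread, because sbatch interprets a two-field value as minutes:seconds. Its documented formats are minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds. The helper below is a sketch for checking what a value means before submitting; the function name is hypothetical and the input is not validated.

```bash
#!/bin/bash
# slurm_time_to_minutes SPEC
# Prints the time limit in whole minutes (leftover seconds round up).
slurm_time_to_minutes() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 rest
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}
        rest=${spec#*-}
        # After "days-" the fields are hours[:minutes[:seconds]].
        IFS=: read -r hours minutes seconds <<< "$rest"
    else
        local -a f
        IFS=: read -r -a f <<< "$spec"
        case ${#f[@]} in
            1) minutes=${f[0]} ;;
            2) minutes=${f[0]}; seconds=${f[1]} ;;
            3) hours=${f[0]}; minutes=${f[1]}; seconds=${f[2]} ;;
            *) return 1 ;;
        esac
    fi
    # The 10# prefix forces base 10 so that "08" is not read as octal.
    echo $(( 10#${days:-0} * 1440 + 10#${hours:-0} * 60 \
             + 10#${minutes:-0} + (10#${seconds:-0} + 59) / 60 ))
}

slurm_time_to_minutes 60         # prints 60
slurm_time_to_minutes 02:30:00   # prints 150
slurm_time_to_minutes 1-12:30    # prints 2190
```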

Batch job: Defaults
• std: 1 core, 3900 MB / core
• std-fat: 1 core, 7900 MB / core
• mem: 1 core, 23900 MB / core
• gpu: 24 cores, 3900 MB / core, 1 GPGPU
• knl: whole node
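Since the default memory is defined per core, the memory a job receives by default grows with the number of cores it requests. A small sketch of that arithmetic, using the per-core values listed above (the function name is hypothetical, and knl is left out because no per-core value is given for it):

```bash
#!/bin/bash
# default_mem_mb PARTITION CORES
# Prints the default memory (MB) for CORES cores on PARTITION.
default_mem_mb() {
    local partition=$1 cores=$2 per_core
    case $partition in
        std)     per_core=3900  ;;
        std-fat) per_core=7900  ;;
        mem)     per_core=23900 ;;
        gpu)     per_core=3900  ;;
        *)       echo "no per-core default listed for '$partition'" >&2
                 return 1 ;;
    esac
    echo $(( per_core * cores ))
}

default_mem_mb std 16   # prints 62400
default_mem_mb mem 4    # prints 95600
```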

Batch job:
• First: load the module.
• Second: copy inputs to SCRATCH and change the working path.
• Third: execution.
• Fourth: get outputs back.
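The copy, execute and copy-back steps can be wrapped in one helper. This is a sketch under stated assumptions: the function name is hypothetical, the module step is omitted because module names are site-specific, and the fallbacks for $SCRATCH and $SLURM_SUBMIT_DIR exist only so the sketch can be tried outside a Slurm job.

```bash
#!/bin/bash
# run_staged INPUT OUTPUT_NAME COMMAND [ARGS...]
# Copies INPUT into the scratch directory, runs COMMAND there, and copies
# OUTPUT_NAME back to the submit directory.
run_staged() {
    local input=$1 output=$2
    shift 2
    # Outside a Slurm job these variables may be unset, so fall back to
    # a temporary directory and the current directory (assumption).
    local scratch=${SCRATCH:-$(mktemp -d)}
    local submit_dir=${SLURM_SUBMIT_DIR:-$PWD}

    cp -r "$input" "$scratch"/ || return 1      # second: copy inputs
    ( cd "$scratch" && "$@" )  || return 1      # third: execute in scratch
    cp -r "$scratch/$output" "$submit_dir"/     # fourth: get outputs back
}

# Hypothetical usage inside a batch script:
#   run_staged input.dat result.dat <application>
```

Running the command in a subshell keeps the caller's working directory unchanged, so the final copy can use a path relative to the submit directory.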

Batch job: Steps
#!/bin/bash
#SBATCH -n <#tasks>
#SBATCH -t 60
cd ${SCRATCH}
srun <application>
srun <application> &
srun <application> &
wait
[Figure: jobA is submitted, executes as job steps jobA.1, jobA.2, jobA.3 … job.N, and is obtained.]
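A plain wait, as in the script above, does not report whether a background step failed. The sketch below waits for each step individually and counts failures. The function names are hypothetical, and the fallback that runs commands without srun is only there so the pattern can be tried outside a Slurm allocation.

```bash
#!/bin/bash
# Use srun inside a Slurm allocation; otherwise run the command directly
# (fallback added only so the sketch can be tried on any machine).
run_step() {
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
        srun "$@"
    else
        "$@"
    fi
}

# run_steps_concurrently 'CMD1' 'CMD2' ...
# Starts every command in the background, waits for each one, and returns
# the number of commands that failed.
run_steps_concurrently() {
    local pids=() failed=0 cmd pid
    for cmd in "$@"; do
        run_step bash -c "$cmd" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$(( failed + 1 ))
    done
    return "$failed"
}

run_steps_concurrently 'sleep 1' 'sleep 1' && echo "all steps finished ok"
```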

Batch job: Arrays
#!/bin/bash
#SBATCH -o output_file_%A_%a.out
#SBATCH -e error_file_%A_%a.err
#SBATCH --array=0-4
#SBATCH -n <#tasks>
#SBATCH -t 60
cd ${SCRATCH}
<application>
[Figure: jobA is submitted, executes as array tasks jobA_1, jobA_2, jobA_3 … job_N, and each task is obtained.]

Batch job: Arrays
• Example: Job 50
Variable                Task 0  Task 1  …
SLURM_JOB_ID            50      51
SLURM_ARRAY_JOB_ID      50      50
SLURM_ARRAY_TASK_ID     0       1
SLURM_ARRAY_TASK_COUNT  5       5
SLURM_ARRAY_TASK_MAX    4       4
SLURM_ARRAY_TASK_MIN    0       0
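A common use of SLURM_ARRAY_TASK_ID is to give each array task its own input. The sketch below maps the task ID to a position in a list of files; the function name and file names are hypothetical, and the default of 0 only applies when the variable is unset, for example when testing outside an array job.

```bash
#!/bin/bash
# input_for_task FILE...
# Prints the file assigned to this array task: task 0 gets the first
# file, task 1 the second, and so on.
input_for_task() {
    local task=${SLURM_ARRAY_TASK_ID:-0}   # 0 when run outside an array job
    local files=("$@")
    if (( task >= ${#files[@]} )); then
        echo "task $task has no input file" >&2
        return 1
    fi
    echo "${files[task]}"
}

# Hypothetical inputs for an array declared with --array=0-4.
SLURM_ARRAY_TASK_ID=1 input_for_task a.dat b.dat c.dat d.dat e.dat   # prints b.dat
```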

Batch job: Dependency
[Figure: a workflow with Pre-processing, Analysis and Verification stages, whose jobs (jobA, jobB, jobX, jobL, jobM, jobN) are linked by "ok?" checks.]
#!/bin/bash
#SBATCH -n <#tasks>
#SBATCH -t 60
#SBATCH --dependency=afterok:<jid>
cd ${SCRATCH}
<application>

Batch job: Dependency
• after:<jobid>[:<jobid>...]
  Begin after the specified jobs have started.
• afterany:<jobid>[:<jobid>...]
  Begin after the specified jobs have terminated.
• afternotok:<jobid>[:<jobid>...]
  Begin after the specified jobs have failed.
• singleton
  Begin after all earlier jobs with the same name and user have ended.
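When jobs are chained from a script, the dependency option has to be assembled from job IDs that are only known after submission. A minimal sketch of that assembly follows; the function name and the file names in the usage comment are hypothetical, while --parsable is the sbatch option that prints the job ID instead of the full "Submitted batch job" message.

```bash
#!/bin/bash
# dependency_option TYPE [JOBID...]
# Prints a --dependency option such as "--dependency=afterok:101:102".
# Types without job IDs (e.g. singleton) are printed on their own.
dependency_option() {
    local type=$1
    shift
    if (( $# == 0 )); then
        printf -- '--dependency=%s\n' "$type"
    else
        local IFS=:          # "$*" joins the job IDs with ':'
        printf -- '--dependency=%s:%s\n' "$type" "$*"
    fi
}

dependency_option afterok 101 102   # prints --dependency=afterok:101:102
dependency_option singleton         # prints --dependency=singleton

# On a cluster, the IDs would come from earlier submissions, e.g.:
#   pre=$(sbatch --parsable pre.slm)
#   sbatch "$(dependency_option afterok "$pre")" analysis.slm
```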

Basic: Jobs
[Figure: job state diagram with the states PENDING, (CONFIGURING), RUNNING, HELD, RESIZE, COMPLETING, CANCELLED, COMPLETED, TIMEOUT, FAIL, OUT OF MEMORY, SPECIAL EXIT and NODE FAIL, and the transitions SUBMISSION, HOLD, RELEASE and REQUEUE.]

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
std*       up     infinite   19     mix    pirineus[7,34-35,37-40,45]
std*       up     infinite   25     alloc  pirineus[8,11-17,19-33,36,41-44,46-50]
std-fat    up     infinite   1      mix    pirineus45
std-fat    up     infinite   5      alloc  pirineus[46-50]
gpu        up     infinite   4      idle~  pirineusgpu[1-4]
knl        up     infinite   4      idle~  pirineusknl[1-4]
mem        up     infinite   2      mix    canigo[1-2]
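The sinfo listing can be summarised with a short awk filter. The sketch below adds up the NODES column for each STATE; it assumes the default column layout shown above, and the function name is hypothetical. Nodes that belong to more than one partition appear on several lines and are therefore counted more than once.

```bash
#!/bin/bash
# nodes_by_state: reads sinfo's default listing on stdin and prints the
# total of the NODES column for each STATE, one "STATE COUNT" per line.
nodes_by_state() {
    awk 'NR > 1 { total[$5] += $4 }
         END    { for (state in total) print state, total[state] }' | sort
}

# Sample copied from the listing above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
std* up infinite 19 mix pirineus[7,34-35,37-40,45]
std* up infinite 25 alloc pirineus[8,11-17,19-33,36,41-44,46-50]
std-fat up infinite 1 mix pirineus45
std-fat up infinite 5 alloc pirineus[46-50]
gpu up infinite 4 idle~ pirineusgpu[1-4]
knl up infinite 4 idle~ pirineusknl[1-4]
mem up infinite 2 mix canigo[1-2]'

printf '%s\n' "$sample" | nodes_by_state
```

On a cluster the same filter would read the live listing: sinfo | nodes_by_state.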

$ system-status
+-----------+-------------+-----------------+--------------+------------+
| MACHINE   | TOTAL SLOTS | ALLOCATED SLOTS | QUEUED SLOTS | OCCUPATION |
+-----------+-------------+-----------------+--------------+------------+
| std nodes | 1536        | 1468            | 2212         | 95 %       |
| fat nodes | 288         | 144             | 0            | 50 %       |
| mem nodes | 96          | 96              | 289          | 100 %      |
| gpu nodes | 144         | 96              | 252          | 66 %       |
| knl nodes | 816         | 0               | 0            | 0 %        |
| res nodes | 672         | 648             | 1200         | 96 %       |
+-----------+-------------+-----------------+--------------+------------+
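The OCCUPATION column in the table is consistent with allocated slots divided by total slots, rounded down to a whole percentage. That rounding is inferred from the sample values, not from the tool's documentation, and the function name below is hypothetical.

```bash
#!/bin/bash
# occupation_percent ALLOCATED TOTAL
# Prints the share of allocated slots as a whole percentage, rounded down.
occupation_percent() {
    local allocated=$1 total=$2
    if (( total <= 0 )); then
        echo "total slots must be positive" >&2
        return 1
    fi
    echo $(( allocated * 100 / total ))
}

occupation_percent 1468 1536   # std nodes: prints 95
occupation_percent 96 144      # gpu nodes: prints 66
```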

$ sbatch <file>
Submitted batch job 1720189

$ squeue -u <username>
JOBID    PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
1720189  std        test  user  PD  0:00  1      (Resources)
• Priority: One or more higher priority jobs exist for this partition or advanced reservation.
• Dependency: The job is waiting for a dependent job to complete.

$ squeue -j 1720189
1720189  std  test  user  R  1-03:44  1  pirineus27
$ sstat -aj 1720189 --format=jobid,nodelist,mincpu,maxrss,pids
JobID         Nodelist    MinCPU      MaxRSS     Pids
------------  ----------  ----------  ---------  --------------------
1720189.ext+  pirineus27                         226474
1720189.bat+  pirineus27  00:00.000   7348K      226491,226526,226528
1720189.0     pirineus27  1-03:44:05  19171808K  226557,226577

$ squeue -j 1720189
1720189  std  test  user  CG  2-15:56  1  pirineus27
• Move files from $LOCALSCRATCH to $SHAREDSCRATCH.

$ sacct
JobID         JobName  Partition  Account  AllocCPUS  State      ExitCode
------------  -------  ---------  -------  ---------  ---------  --------
1720189       test     std        account  16         COMPLETED  0:0
1720189.bat+  batch               account  16         COMPLETED  0:0
1720189.ext+  extern              account  16         COMPLETED  0:0
1720189.0     pre                 account  16         COMPLETED  0:0
1720189.1     process             account  16         COMPLETED  0:0
1720189.2     post                account  16         COMPLETED  0:0
• State codes: completed (CD), timeout (TO), out of memory (OOM), failed (F), node_fail (NF)…

$ scancel 1720189
$ sacct
JobID         JobName  Partition  Account  AllocCPUS  State       ExitCode
------------  -------  ---------  -------  ---------  ----------  --------
1720189       test     std        account  16         CANCELLED+  0:0
1720189.bat+  batch               account  16         CANCELLED+  0:0
1720189.ext+  extern              account  16         COMPLETED   0:0
1720189.0     pre                 account  16         CANCELLED+  0:0
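The sacct listing above can be filtered to find the job steps that did not complete. The sketch assumes sacct's default columns as shown; the function name is hypothetical. State is read as the second-to-last column, because the Partition column is empty for job steps and counting from the left would pick the wrong field.

```bash
#!/bin/bash
# unfinished_steps: reads sacct's default listing on stdin and prints the
# JobID of every row whose State is not COMPLETED.
unfinished_steps() {
    # Skip the two header lines; State is the second-to-last column.
    awk 'NR > 2 && $(NF-1) != "COMPLETED" { print $1 }'
}

# Sample copied from the listing above.
sample='JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ------------ --------
1720189 test std account 16 CANCELLED+ 0:0
1720189.bat+ batch account 16 CANCELLED+ 0:0
1720189.ext+ extern account 16 COMPLETED 0:0
1720189.0 pre account 16 CANCELLED+ 0:0'

printf '%s\n' "$sample" | unfinished_steps
```

For this sample the filter prints 1720189, 1720189.bat+ and 1720189.0, leaving out the extern step that completed.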

Basic: Best practices
• Choose the most suitable partition.
• Use $SCRATCH as working directory.
• Move only the necessary files.
• Keep important files in $HOME.

Thank you for your attention!
Feedback – ismael.fernandez@csuc.cat
Support – https://hpc.csuc.cat
