Introduction to Slurm
Ismael Fernández Pavón
HPC Support
13 / 12 / 2022
What is Slurm?
Resource Manager
Job Scheduler
Basic interaction
• Simple Linux Utility for Resource Management (historic name), or simply Slurm.
Cluster manager and job scheduler
system for large and small Linux
clusters.
What is Slurm?
• Allocates access to resources for some duration of time.
• Provides a framework for starting, executing, and
monitoring work (normally a parallel job).
• Arbitrates contention for resources by managing a queue
of pending work.
Cluster manager and job scheduler
system for large and small Linux
clusters.
What is Slurm?
Resource managers only: ALPS (Cray), Torque
Schedulers only: Maui, Moab
Both resource manager and scheduler: LoadLeveler (IBM), LSF, Slurm, PBS Pro
What is Slurm?
✓ Open source
✓ Fault-tolerant
✓ Highly scalable
✓ Almost everywhere
Resource managers only: ALPS (Cray), Torque
Schedulers only: Maui, Moab
Both resource manager and scheduler: LoadLeveler (IBM), LSF, Slurm, PBS Pro
What is Slurm?
Cluster:
Collection of many separate
servers (nodes), connected
via a fast interconnect.
Node
Fast interconnect: Ethernet, InfiniBand…
Slurm: Resource Management
Node
CPU
(Core)
CPU
(Thread)
Nodes:
• pirineus[1-6]
• pirineus[7-50]
• pirineus[51-69]
• canigo[1,2]
• pirineusgpu[1-4]
• pirineusknl[1-4]
GPGPU
(GRES)
Individual computer
component of an HPC
system.
Slurm: Resource Management
Partitions:
• std
• std-fat
• mem
• gpu
• knl
• covid19
• exclusive
Partitions
Logical group of nodes with
common specs.
Slurm: Resource Management
Allocated
cores
Allocated
memory
Jobs:
• ID (a number)
• Name
• Time limit
• Size specification
• Dependencies on other jobs
• State
Allocations of resources
assigned to a user for a
specified amount of time.
Slurm: Resource Management
Core
used
Memory
used
Job steps:
• ID (a number)
• Name
• Time limit
• Size specification
Sets of (possibly parallel)
tasks within a job.
Slurm: Resource Management
FULL CLUSTER
Job scheduling time!
Slurm: Resource Management
Scheduling: The process of determining the next job to run and
on which resources.
Slurm: Job Scheduling
Scheduling: The process of determining the next job to run and
on which resources.
FIFO Scheduling
Resources
Slurm: Job Scheduling
Scheduling: The process of determining the next job to run and
on which resources.
FIFO Scheduling
Backfill Scheduling
• Job priority
• Time limit (Important!)
Time
Resources
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Elapsed time
Time limit
Time
Resources
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Submit
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Wait time: 7
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Submit
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• E.g.: a new lower-priority job
Time
Resources
Wait time: 1
Elapsed time
Time limit
Slurm: Job Scheduling
Backfill Scheduling:
• Starts with job priority.
Job_priority = site_factor
  + PriorityWeightQOS * QOS_factor
  + PriorityWeightPartition * partition_factor
  + PriorityWeightFairshare * fair-share_factor
  + PriorityWeightAge * age_factor
  + PriorityWeightJobSize * job_size_factor
  + PriorityWeightAssoc * assoc_factor
  + SUM(TRES_weight_<type> * TRES_factor_<type>, …)
  − nice_factor
Slurm: Job Scheduling
Backfill Scheduling:
• Starts with job priority.
Job_priority = site_factor
  + PriorityWeightQOS * QOS_factor
  + PriorityWeightPartition * partition_factor
  + PriorityWeightFairshare * fair-share_factor
  + PriorityWeightAge * age_factor
  + PriorityWeightJobSize * job_size_factor
  + PriorityWeightAssoc * assoc_factor
  + SUM(TRES_weight_<type> * TRES_factor_<type>, …)
  − nice_factor
Fixed value
Dynamic value
User defined value
Slurm: Job Scheduling
Backfill Scheduling:
• Priority factor: QoS:
• Account’s Priority:
− Normal
− Low
• RES users:
− class_a
− class_b
− class_c
QoS
Slurm: Job Scheduling
Backfill Scheduling:
• Priority factor: Fairshare:
• It depends on:
• Consumption.
• Resources requested.
QoS
Fairshare
Slurm: Job Scheduling
Backfill Scheduling:
• Priority factor: Age:
• Priority increases the longer the job waits in the queue.
• Max 7 days.
• Not valid for dependent
jobs!
QoS
Fairshare
Age
Slurm: Job Scheduling
Backfill Scheduling:
• Priority factor: Job size:
• Bigger jobs get higher priority.
• Size counts ONLY the resources requested, NOT the time limit.
QoS
Fairshare
Age
Job size
Slurm: Job Scheduling
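To see how these weighted factors add up for your own pending jobs, Slurm ships the sprio command; a quick sketch (the columns shown depend on the site's weight configuration):
$ sprio -l -u <username>   # per-factor priority breakdown of your pending jobs
$ sprio -w                 # show the configured priority weights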
Basic: General information
Cheatsheet
• Login:
ssh -p 2122 <user>@hpc.csuc.cat
• Transfer files:
scp -P 2122 <local_file> <user>@hpc.csuc.cat:<remote_path>
• Storage:
Name                         Variable                   Availability  Lifetime  Backup
/home/<user>                 $HOME                      Global        -
/scratch/<user>              -                          Global        30 d
/scratch/<user>/tmp/<jobid>  $SCRATCH / $SHAREDSCRATCH  Global        7 d
/tmp/<user>/<jobid>          $SCRATCH / $LOCALSCRATCH   Node          Job
Basic: General information
• Linux commands:
Command Description
pwd Show current path
ls List current folder’s files
cd <path> Change directory
mkdir <dir> Create directory
cp <file> <new> Copy
mv <file> <new> Move
rm <file> Remove file
man <command> Show manual
Command Description
CTRL-c Stop current command
CTRL-r Search history
!! Repeat last command
grep <p> <f> Search for patterns in files
tail <file> Show last 10 lines
head <file> Show first 10 lines
cat <file> Print file content
touch <file> Create an empty file
Cheatsheet
$ vim submit_file.slm
PENDING
(CONFIGURING)
RUNNING COMPLETED
COMPLETING
Basic: Jobs
Batch job:
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Submit
Execute
Obtain
jobA
jobA
jobA
Basic: Jobs
Batch job:
Slurm directives
-J <name> Name of job
-o <file> Job’s std output file
-e <file> Job’s std error file
-p <part> Partition requested
-n <#tasks> Number of tasks
-c <#cpus> Number of CPUs per task
-t <time> Time limit (dd-hh:mm, hh:mm, mm)
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
Batch job:
Defaults
• std: 1 core
3900 MB / core
• std-fat: 1 core
7900 MB / core
• mem: 1 core
23900 MB / core
• gpu: 24 cores
3900 MB / core
1 GPGPU
• knl: whole node
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
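These defaults can be overridden per job with standard Slurm options; a hedged sketch (the values are only examples, check the site limits before using them):
#SBATCH --mem-per-cpu=7900M   # request more memory per allocated core (example value)
#SBATCH --gres=gpu:1          # on the gpu partition: request one GPGPU (GRES)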
Batch job:
• First: load the module.
• Second: copy inputs to $SCRATCH and change the working path.
• Third: run the application.
• Fourth: copy the outputs back.
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
Batch job: Steps
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
srun <application>
srun <application> &
srun <application> &
wait
cp -r <output> ${SLURM_SUBMIT_DIR}
Submit
Execute
Obtain
jobA
jobA.3
jobA.2
jobA.1 … jobA.N
jobA
Basic: Jobs
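Each srun line creates a job step; steps backgrounded with & run concurrently and share the job's allocation. A minimal sketch splitting the allocation between two concurrent steps (task counts are illustrative and must fit inside the -n requested for the job):
srun -n 2 <application_a> &   # step 0: two tasks
srun -n 2 <application_b> &   # step 1: two tasks, runs at the same time
wait                          # block until both steps finish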
Batch job: Arrays
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file_%A_%a.out
#SBATCH -e error_file_%A_%a.err
#SBATCH --array=0-4
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Submit
Execute
Obtain
jobA
jobA_3
jobA_2
jobA_1 … jobA_N
jobA_3
jobA_2
jobA_1 … jobA_N
Basic: Jobs
Basic: Jobs
Batch job: Arrays
• Example: Job 50
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file_%A_%a.out
#SBATCH -e error_file_%A_%a.err
#SBATCH --array=0-4
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
SLURM_JOB_ID 50
SLURM_ARRAY_JOB_ID 50
SLURM_ARRAY_TASK_ID 0
SLURM_ARRAY_TASK_COUNT 5
SLURM_ARRAY_TASK_MAX 4
SLURM_ARRAY_TASK_MIN 0
SLURM_JOB_ID 51
SLURM_ARRAY_JOB_ID 50
SLURM_ARRAY_TASK_ID 1
SLURM_ARRAY_TASK_COUNT 5
SLURM_ARRAY_TASK_MAX 4
SLURM_ARRAY_TASK_MIN 0
…
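A common pattern is to let each array task pick its own input via SLURM_ARRAY_TASK_ID; a minimal sketch (the input file naming is hypothetical):
# inside the submit script, after cd ${SCRATCH}
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat   # input_0.dat … input_4.dat for --array=0-4
<application> ${INPUT}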
Batch job: Dependency
Pre-processing
Analysis
Verification
jobX
jobB
jobA
jobN
jobM
jobL
ok?
ok?
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
#SBATCH --dependency=afterok:<jid>
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
Batch job: Dependency
• after:<jobid>[:<jobid>...]
Begins after the specified jobs have started.
• afterany:<jobid>[:<jobid>...]
Begins after the specified jobs have terminated.
• afternotok:<jobid>[:<jobid>...]
Begins after the specified jobs have failed.
• singleton
Begins after all jobs with the same name and user have ended.
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -o output_file.out
#SBATCH -e error_file.err
#SBATCH -p <partition>
#SBATCH -n <#tasks>
#SBATCH -c <#cpus_per_task>
#SBATCH -t 60
#SBATCH --dependency=afterok:<jid>
module load <module>
cp -r <input> ${SCRATCH}
cd ${SCRATCH}
<application>
cp -r <output> ${SLURM_SUBMIT_DIR}
Basic: Jobs
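Dependencies are usually chained from the shell at submission time; a sketch using sbatch --parsable to capture the job ID (script names are hypothetical):
$ jid=$(sbatch --parsable pre.slm)                  # prints only the job ID
$ sbatch --dependency=afterok:${jid} analysis.slm   # starts only if pre.slm ends OK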
PENDING
(CONFIGURING)
RUNNING
HELD RESIZE
COMPLETING
CANCELLED COMPLETED TIMEOUT
FAILED
OUT_OF_MEMORY
SPECIAL_EXIT
NODE_FAIL
HOLD
RELEASE
REQUEUE
SUBMISSION
Basic: Jobs
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
std* up infinite 19 mix pirineus[7,34-35,37-40,45]
std* up infinite 25 alloc pirineus[8,11-17,19-33,36,41-44,46-50]
std-fat up infinite 1 mix pirineus45
std-fat up infinite 5 alloc pirineus[46-50]
gpu up infinite 4 idle~ pirineusgpu[1-4]
knl up infinite 4 idle~ pirineusknl[1-4]
mem up infinite 2 mix canigo[1-2]
$ sinfo
PENDING
(CONFIGURING)
RUNNING COMPLETED
COMPLETING
Basic: Jobs
sinfo
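sinfo output can be narrowed and reformatted with standard options; a quick sketch:
$ sinfo -p std                   # show only the std partition
$ sinfo -o "%P %a %l %D %t %N"   # partition, avail, timelimit, nodes, state, nodelist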
+-----------+-------------+-----------------+--------------+------------+
| MACHINE | TOTAL SLOTS | ALLOCATED SLOTS | QUEUED SLOTS | OCCUPATION |
+-----------+-------------+-----------------+--------------+------------+
| std nodes | 1536 | 1468 | 2212 | 95 % |
| fat nodes | 288 | 144 | 0 | 50 % |
| mem nodes | 96 | 96 | 289 | 100 % |
| gpu nodes | 144 | 96 | 252 | 66 % |
| knl nodes | 816 | 0 | 0 | 0 % |
| res nodes | 672 | 648 | 1200 | 96 % |
+-----------+-------------+-----------------+--------------+------------+
$ system-status
PENDING
(CONFIGURING)
RUNNING COMPLETED
COMPLETING
Basic: Jobs
sinfo
Submitted batch job 1720189
$ sbatch <file>
Basic: Jobs
PENDING
(CONFIGURING)
RUNNING COMPLETING COMPLETED
sinfo
sbatch
Submitted batch job 1720189
$ sbatch <file>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1720189 std test user PD 0:00 1 (Resources)
$ squeue -u <username>
• Priority: One or more higher priority jobs exist for this partition or advanced reservation.
• Dependency: The job is waiting for a dependent job to complete.
Basic: Jobs
PENDING
(CONFIGURING)
RUNNING COMPLETING COMPLETED
sinfo
squeue
sbatch
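While a job is pending, squeue can also report the scheduler's estimated start time, which is useful for seeing the effect of backfill and of the time limit you requested; a quick sketch:
$ squeue -u <username> --start   # estimated start time and pending reason per job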
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1720189 std test user R 1-03:44 1 pirineus27
$ squeue -j 1720189
$ sstat -aj 1720189 --format=jobid,nodelist,mincpu,maxrss,pids
JobID Nodelist MinCPU MaxRSS Pids
------------ ---------------- ------------- ---------- ----------------------
1720189.ext+ pirineus27 226474
1720189.bat+ pirineus27 00:00.000 7348K 226491,226526,226528
1720189.0 pirineus27 1-03:44:05 19171808K 226557,226577
Basic: Jobs
PENDING
(CONFIGURING)
RUNNING COMPLETING COMPLETED
sinfo
squeue
sstat
squeue
sbatch
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1720189 std test user CG 2-15:56 1 pirineus27
$ squeue -j 1720189
• Move files from $LOCALSCRATCH to $SHAREDSCRATCH.
Basic: Jobs
PENDING
(CONFIGURING)
RUNNING COMPLETING COMPLETED
sinfo
squeue
sstat
squeue
sbatch
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ------------ --------
1720189 test std account 16 COMPLETED 0:0
1720189.bat+ batch account 16 COMPLETED 0:0
1720189.ext+ extern account 16 COMPLETED 0:0
1720189.0 pre account 16 COMPLETED 0:0
1720189.1 process account 16 COMPLETED 0:0
1720189.2 post account 16 COMPLETED 0:0
$ sacct
• COMPLETED (CD), TIMEOUT (TO), OUT_OF_MEMORY (OOM), FAILED (F), NODE_FAIL (NF)…
Basic: Jobs
PENDING
(CONFIGURING)
RUNNING COMPLETING COMPLETED
sacct
sinfo
squeue
sstat
squeue
sbatch
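For a single finished job, sacct accepts a job ID and a custom field list; a sketch (field names are standard sacct format options):
$ sacct -j 1720189 --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode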
CANCELLED
$ scancel 1720189
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ------------ --------
1720189 test std account 16 CANCELLED+ 0:0
1720189.bat+ batch account 16 CANCELLED+ 0:0
1720189.ext+ extern account 16 COMPLETED 0:0
1720189.0 pre account 16 CANCELLED+ 0:0
$ sacct
Basic: Jobs
PENDING
(CONFIGURING)
RUNNING COMPLETING COMPLETED
sacct
sinfo
squeue
sstat
squeue
sbatch
scancel
Choose the most suitable partition.
Use $SCRATCH as working directory.
Move only the necessary files.
Keep important files in $HOME.
Basic: Best practices
Questions?
Thank you for your attention!
Feedback – ismael.fernandez@csuc.cat
Support – https://hpc.csuc.cat