Presentation by Ismael Fernández and Cristian Gomollón (Applications technicians at CSUC) given at the "2a Jornada de formació sobre l'ús del servei de càlcul" (2nd training day on the use of the computing service), held on 19 February 2020 at CSUC.
3. What is SLURM?
Cluster manager and job scheduler system for large and small Linux clusters.
• Allocates access to resources for some duration of time.
• Provides a framework for starting, executing, and monitoring work (normally a parallel job).
• Arbitrates contention for resources by managing a queue of pending work.
9. SLURM: Resource Management
Partitions: logical groups of nodes with common specs.
• Associated with a specific set of nodes.
• Nodes can be in more than one partition.
• Job size and time limits.
• Access control list.
• State information.
15. SLURM: Job Scheduling
Scheduling: the process of determining the next job to run and on which resources.
FIFO Scheduling
Backfill Scheduling
• Job priority
• Time limit (Important!)
[Figure: jobs plotted on resource vs. time axes.]
16. SLURM: Job Scheduling
Backfill Scheduling:
• Based on the job request, resources available, and
policy limits imposed.
• Starts with job priority.
• Higher priority jobs cannot be delayed by lower priority
jobs.
• The expected start time of pending jobs depends on the expected completion time of running jobs, so reasonably accurate time limits matter.
• Results in a resource allocation over a period of time.
17-24. SLURM: Job Scheduling
Backfill Scheduling, worked example: a new lower-priority job is submitted while higher-priority jobs are running and queued.
[Figures: jobs plotted on resource vs. time axes, marking each job's elapsed time and time limit, the submission point of the new job, and the resulting wait time.]
The two scenarios shown end with wait times of 7 and 1 time units, illustrating how backfilling the new job into gaps left by the running jobs' time limits shortens its wait.
30. SLURM: Job Scheduling
Backfill Scheduling, priority factors: QoS, Partition, Fairshare, Age.
• Age: priority grows the longer the job waits in the queue (up to a maximum of 7 days).
• Not valid for dependent jobs!
31. SLURM: Job Scheduling
Backfill Scheduling, priority factors: QoS, Partition, Fairshare, Age, Job size.
• Job size: bigger jobs have higher priority.
• Based ONLY on requested resources, NOT on requested time.
33. • sbatch – Submit a batch script.
• salloc – Request resources for an interactive job.
• srun – Start a new task (job step).
• scancel – Cancel a job.
SLURM: Commands
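Typical invocations (the script name and job ID below are placeholders):
sbatch job_script.slm            (submit the batch script)
salloc -n 4 -t 01:00:00          (interactive allocation: 4 tasks for 1 hour)
srun -n 4 ./my_app               (launch a job step on the allocated resources)
scancel 1234567                  (cancel job 1234567)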
34. • sinfo – Report system status (nodes, queues, etc.).
PARTITION AVAIL TIME NODES STATE NODELIST
std* up inf+ 2 mix pirineus[15,21]
std* up inf+ 30 alloc pirineus[13-14,16-20,22-44]
std-fat up inf+ 3 idle~ pirineus[45,49-50]
std-fat up inf+ 3 alloc pirineus[46-48]
gpu up inf+ 2 idle~ pirineusgpu[3-4]
gpu up inf+ 1 mix pirineusgpu2
knl up inf+ 3 idle~ pirineusknl[2-4]
mem up inf+ 1 mix canigo1
class_a up inf+ 1 idle~ pirineus12
class_a up inf+ 2 mix canigo1,pirineus11
class_a up inf+ 8 alloc pirineus[1-6,8-9]
class_a up inf+ 2 resv pirineus[7,10]
class_c up inf+ 1 idle~ pirineus12
class_c up inf+ 2 mix canigo1,pirineus11
class_c up inf+ 8 alloc pirineus[1-6,8-9]
class_c up inf+ 2 resv pirineus[7,10]
SLURM: Commands
35. • sinfo – Report system status.
-N Node-oriented format information, with one line per
node and partition.
-p Print information only about the specified partition(s).
--Format Specify the information to be displayed.
"Nodelist,Partition,StateCompact,CpusState,Memory,Freemem"
NODELIST PARTITION STATE CPUS(A/I/O/T) MEMORY FREE_MEM
canigo1 class_a mix 112/80/0/192 4643070 2458001
pirineus1 class_a idle~ 0/48/0/48 191904 188950
pirineus2 class_a alloc 48/0/0/48 191904 44123
pirineus3 class_a alloc 48/0/0/48 191904 41831
pirineus4 class_a mix 32/16/0/48 191904 66623
pirineus5 class_a mix 16/32/0/48 191904 162277
pirineus6 class_a alloc 48/0/0/48 191904 82747
pirineus7 class_a idle~ 0/48/0/48 191904 189289
SLURM: Commands
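A possible invocation combining the options above (the partition name is just an example):
sinfo -N -p class_a --Format="Nodelist,Partition,StateCompact,CpusState,Memory,Freemem"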
36. • sinfo – Report system status.
-s List only a partition state summary with no node state details.
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
std* up infinite 32/0/0/32 pirineus[13-44]
std-fat up infinite 3/3/0/6 pirineus[45-50]
gpu up infinite 1/2/0/3 pirineusgpu[2-4]
knl up infinite 0/3/0/3 pirineusknl[2-4]
mem up infinite 1/0/0/1 canigo1
class_a up infinite 10/3/0/13 canigo1,pirineus[1-12]
class_b up infinite 10/3/0/13 canigo1,pirineus[1-12]
class_c up infinite 10/3/0/13 canigo1,pirineus[1-12]
SLURM: Commands
37. • sinfo – Report system status.
-s List only a partition state summary with no node state details.
TIP: Use system-status.
SLURM: Commands
+-----------+-------------+-----------------+--------------+------------+
| MACHINE | TOTAL SLOTS | ALLOCATED SLOTS | QUEUED SLOTS | OCCUPATION |
+-----------+-------------+-----------------+--------------+------------+
| std nodes | 1536 | 1468 | 2212 | 95 % |
| fat nodes | 288 | 144 | 0 | 50 % |
| mem nodes | 96 | 96 | 289 | 100 % |
| gpu nodes | 144 | 96 | 252 | 66 % |
| knl nodes | 816 | 0 | 0 | 0 % |
| res nodes | 672 | 648 | 1200 | 96 % |
+-----------+-------------+-----------------+--------------+------------+
38. • squeue – Report job and job step status.
JOBID PARTIT NAME USER ST TIME NODES NODELIST
1222376 mem dada2 mvelasco PD 0:00 1 (Resources)
1221504 std Freq_TS_ uabqut16 PD 0:00 1 (Resources)
1222346 std Cu2T-tra agusti PD 0:00 1 (Priority)
1222347 std AuIPr_Ph sciortin PD 0:00 1 (Priority)
1220930 std nickeloc ubaqis07 PD 0:00 1 (Priority)
1222351 std g09d1 upceqt04 R 2:18:20 1 pirineus21
1221621 mem C3 vpenya R 23:56:04 1 canigo1
1221569 std preTS_VI porellan R 19:39:13 1 pirineus17
1221543 std Au2-Cl-d agusti R 1-13:40:32 1 pirineus22
1221616 std-fat CuII_mod mariona R 1-10:35:33 1 pirineus47
1221617 std-fat CuIII_mo mariona R 1-10:35:33 1 pirineus48
1221461 std opt-1xe2 pbesalu R 2-11:22:43 1 pirineus37
1221413 std s24ls_de jcirera R 4:08:01 1 pirineus22
1220720 std nickeloc ubaqis07 R 4-03:00:44 2 pirineus[34-35]
1220719 std nickeloc ubaqis07 R 4-03:00:48 1 pirineus14
1221546 mem C60-Zn-T pbesalu R 22:31:12 1 canigo1
SLURM: Commands
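Some useful squeue filters (username, partition and job ID are placeholders):
squeue -u username               (only that user's jobs)
squeue -p std -t PD              (pending jobs in the std partition)
squeue -j 1222351                (a single job)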
39. • scontrol – Administrator tool to view and/or update
system, job, step, partition or reservation status.
scontrol hold <jobid>
scontrol release <jobid>
scontrol show job <jobid>
SLURM: Commands
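scontrol update can also modify a pending job's request, e.g. its time limit (a sketch; regular users can normally only reduce it):
scontrol update JobId=<jobid> TimeLimit=1-00:00:00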
45. SLURM: Job Life
[Job state diagram: SUBMISSION leads to PENDING (CONFIGURING), then RUNNING, COMPLETING and COMPLETED; HOLD/RELEASE move jobs to and from HELD; other transitions include REQUEUE and RESIZE, plus the end states CANCELED, TIMEOUT, FAIL, OUT OF MEMORY, NODE FAIL and SPECIAL EXIT.]
Pending Reasons:
• Priority: One or more higher priority jobs exist for this partition or advanced
reservation.
• Resources: The job is waiting for resources to become available.
• Reservation: The job is waiting for its advanced reservation to become available.
• ReqNodeNotAvail: Some node specifically required by the job is not currently
available.
• JobHeldAdmin / JobHeldUser: The job is held by a system administrator / the
user.
• Dependency: This job is waiting for a dependent job to complete.
• BadConstraints: The job's constraints cannot be satisfied.
• InvalidQOS: The job's QOS is invalid. Account’s assigned time exhausted?
• AssociationTimeLimit: The job's association has reached its time limit.
Account’s assigned time exhausted?
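The pending reason can also be queried explicitly with squeue's format option (%r prints the reason):
squeue -j <jobid> -o "%.10i %.9P %.8T %r"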
48. SLURM: News
• SLURM upgrade to 19.05.
• New job state: OUT_OF_MEMORY (job killed by OOM).
• Fixed ratio between MEMORY and CPU:

Partition   Max. mem per CPU (MB)   Max. mem per CPU (GB)
std         3900 MB                 3.8 GB
std-fat     7900 MB                 7.7 GB
mem         24180 MB                23.6 GB
52. Login on CSUC infrastructure
• Login
ssh -p 2122 username@hpc.csuc.cat
• Transfer files
scp -P 2122 local_file username@hpc.csuc.cat:[path to your folder]
sftp -oPort=2122 username@hpc.csuc.cat
• Useful paths

Name                       Variable                   Availability         Quota/project    Time limit   Backup
/home/$user                $HOME                      global               >64 GB           unlimited    Yes
/scratch/$user             $SCRATCH                   global               unlimited        30 days      No
/scratch/$user/tmp/jobid   $TMPDIR / $SHAREDSCRATCH   global               job file limit   1 week       No
/tmp/$user/jobid           $TMPDIR / $LOCALSCRATCH    local to each node   job file limit   1 week       No

• Get HC consumption
consum -a <year>                   (group consumption)
consum -a <year> -u <username>     (user consumption)
53. Batch job submission: Default settings
• 4 GB/core on the std partition and 8 GB/core on std-fat.
• 24 GB/core on the mem partition.
• 1 core on the std, std-fat and mem partitions.
• 24 cores and 1 GPU on the gpu partition.
• The whole node on the KNL partition.
• Non-exclusive, multinode job.
• Working and output directories default to the submission directory.
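Any of these defaults can be overridden in the job script; for example (illustrative values), requesting 4 cores and 8 GB instead of the 1-core default on std:
#SBATCH -p std
#SBATCH -n 1
#SBATCH -c 4
#SBATCH --mem=8000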
54. Batch job submission
• Basic Linux commands:
Description                     Command   Example
List files                      ls        ls /home/user
Make a folder                   mkdir     mkdir /home/prova
Change folder                   cd        cd /home/prova
Copy files                      cp        cp file1 file2
Move a file                     mv        mv /home/prova.txt /cescascratch/prova.txt
Delete a file                   rm        rm filename
Print file contents             cat       cat filename
Find a string in files          grep      grep 'word' filename
List the last lines of a file   tail      tail filename
• Text editors: vim, nano, emacs, etc.
• More detailed info and options about the commands:
'command' --help
man 'command'
55. Batch job submission: The slurm submit script
#!/bin/bash
#SBATCH -J JOB_NAME
#SBATCH -o OUTPUT_FILE.log
#SBATCH -e ERROR_FILE.err
#SBATCH -p PARTITION
#SBATCH --mem=TOTMEM
#SBATCH -n NTASKS
#SBATCH -c NCORES_PER_TASK

module load mpi/intel/openmpi/3.1.0
cp -r $input $SCRATCH
cd $SCRATCH
srun $APPLICATION
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR

Script structure:
• Scheduler directives
• Setting up the environment variables and paths
• Move the input files to the working directory
• Launch the application (similar to mpirun)
• Create the output folder and move the outputs
56. Scheduler directives/Options : #SBATCH
• -c, --cpus-per-task=ncpus number of cpus required per task
• --gres=list required generic resources
• -J, --job-name=jobname name of job
• -n, --ntasks=ntasks number of tasks to run
• --ntasks-per-node=n number of tasks to invoke on each node
• -N, --nodes=N number of nodes on which to run (N = min[-max])
• -o, --output=out file for batch script's standard output
• -p, --partition=partition partition requested
• -t, --time=minutes time limit (format: dd-hh:mm)
57. • -C, --constraint=list specify a list of constraints (mem, vnc, ...)
• --mem=MB minimum amount of total real memory
• --reservation=name allocate resources from named reservation
• -w, --nodelist=hosts... request a specific list of hosts
• --mem-per-cpu=MB amount of real memory per allocated core
• -t, --time=minutes Job max duration (Mandatory!!)
More commands/info: type 'sbatch -h'
Scheduler directives/Options : #SBATCH
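Putting several of these directives together for a two-node MPI run (a sketch with illustrative values and a placeholder application name):
#SBATCH -J mpi_test
#SBATCH -p std
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
#SBATCH -t 0-12:00
srun ./my_mpi_app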
58. How to generate slurm script files: 1st Identify app parallelism
Thread parallelism:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=NCORES
Process parallelism:
#SBATCH --ntasks=NCORES
#SBATCH --cpus-per-task=1
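For a threaded (e.g. OpenMP) application, the allocated core count can be forwarded to the runtime through SLURM's environment (a sketch; the application name is a placeholder):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_app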
59. How to generate slurm script files: 2nd Determine the memory requirements
The partition choice strongly depends on the job's memory requirements!
Example requests:
#SBATCH --mem=63900
#SBATCH --cpus-per-task=16
#SBATCH --partition=std

#SBATCH --mem=63900
#SBATCH --cpus-per-task=8
#SBATCH --partition=std-fat

#SBATCH --mem=63900
#SBATCH --cpus-per-task=4
#SBATCH --partition=mem

#SBATCH --mem-per-cpu=3900
#SBATCH --ntasks=16
#SBATCH --partition=std

Partition     Memory/core
std/gpu       4 GB
std-fat/KNL   8 GB
mem           24 GB
60. How to generate slurm script files: 3rd RunTime requirements
#SBATCH --time=Thpc
Performance comparison:
WORKSTATION: 4 cores (Nws), 8-16 GB RAM, 1 TB disk at 600 MB/s, Ethernet 1-10 Gb/s
HPC NODE: 48 cores (Nhpc), 192 GB RAM, 200 TB disk at 4 GB/s, Infiniband 100-200 Gb/s
At a first approximation:
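Assuming near-ideal scaling with core count (a rough estimate only), the runtime to request can be derived from a workstation run as Thpc ≈ Tws × Nws / Nhpc. For example, a job that takes 24 hours on the 4-core workstation would need roughly 24 × 4 / 48 = 2 hours on a 48-core node; add a safety margin on top.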
61. How to generate slurm script files: 4th Disk/IO requirements
Two kinds of applications:
Threaded/serial (only one node):
cd $SHAREDSCRATCH
or
cd $LOCALSCRATCH
Multitask (multinode):
cd $SHAREDSCRATCH
Or let SLURM decide for you:
cd $SCRATCH
62. How to generate slurm script files: Summary
1. Identify your application parallelism.
2. Estimate the amount of resources needed by your solving algorithm.
3. Estimate the runtime as accurately as possible.
4. Determine your job's I/O and input requirements.
5. Determine which output files are necessary and save only those in your own disk space.
63. Gaussian 16 (Threaded Example)
#!/bin/bash
#SBATCH -J gau16_test
#SBATCH -o gau_test_%j.log
#SBATCH -e gau_test_%j.err
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p std
#SBATCH --mem=30000
#SBATCH --time=10-00

module load gaussian/g16b1

INPUT_DIR=$HOME/gaussian_test/inputs
OUTPUT_DIR=$HOME/gaussian_test/outputs

cd $SCRATCH
cp -r $INPUT_DIR/* .
g16 < input.gau > output.out
mkdir -p $OUTPUT_DIR
cp output.out $OUTPUT_DIR

• Threaded application
• Less than 4 GB/core: std partition
• 10 days runtime
• Set up the environment to run the application
64. Vasp 5.4.4 (Multitask Example)
#!/bin/bash
#SBATCH -J vasp_test
#SBATCH -o vasp_test_%j.log
#SBATCH -e vasp_test_%j.err
#SBATCH -n 24
#SBATCH -c 1
#SBATCH --mem-per-cpu=7500
#SBATCH -p std-fat
#SBATCH --time=20:00

module load vasp/5.4.4

INPUT_DIR=$HOME/vasp_test/inputs
OUTPUT_DIR=$HOME/vasp_test/outputs

cd $SCRATCH
cp -r $INPUT_DIR/* .
srun `which vasp_std`
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR

• Multitask application
• More than 4 GB/core but less than 8 GB/core: std-fat partition
• 20 min runtime
• Set up the environment to run the application
• Multitask applications require the 'srun' command
67. Best Practices
• Use $SCRATCH as the working directory.
• Move only the necessary files (not all the files in the folder each time).
• Try to keep important files only in $HOME.
• Try to choose the partition and resources that best fit your job.