Introduction to Parallelization
and performance optimization
Cristian Gomollon Escribano
17 / 03 / 2021
Why parallelize/optimize?
Sometimes you have a problem and (fortunately) you know how to solve it...
But... you don't have enough resources to run it...
Initial situation: the typical computer consists of a CPU (~4 cores), a limited amount of memory (~8 GB) and disk (~1 TB), with low efficiency...
Proposed solution: more and better CPUs, more memory, more and faster disk/network!
I have my own code and I want to
parallelize/optimize its execution, how?
1º Analyse your code using a profiler
Initialization
Main loop
Finalization
Identify the section where your code spends most of its time, using a profiler.
Possible bounds:
Compute
Memory
I/O
Profilers (and also tracers) can identify other types of bottlenecks/overheads as well, such as bad memory alignment, cache misses or poor compiler code paths.
1º Analyse your code using a profiler
Note: not all of the code is suitable for parallelization or optimization. Also, these "formulas" are idealizations: in the real world, overheads (parallel libraries/communication) have a relevant impact on performance...
Timing results — some interesting metrics
Profile breakdown: variable setting and I/O output ~1% of the time, nested loop ~98%, std output ~1%.
With t the total time, ts the serial code time, tp the parallelizable code time, N the number of cores and S the speedup:
Run time: t(N) = ts + tp/N
Amdahl's law: S(N) = (ts + tp) / (ts + tp/N)
2º Check if it is possible to parallelize/optimize that section
Typical, potentially parallel/efficient tasks: repetitive numeric work (loops, math operations).
Not so easy (but sometimes possible): sequential operations such as I/O, e.g.
f = open("demofile.txt", "r")
In general, the repetitive parts of the code (loops/math ops) are
the best suited for a parallelization/optimization strategy.
3º Parallelization/Optimization strategies
Parallel programming paradigms, parallel/accelerated libraries and accelerator libraries:
MPI (distributed memory):
•Uninode/multinode
•Distributed memory and I/O
•Slight recoding needed
•Network dependent
•mpirun -np 256 ./allrun
OpenMP (shared memory):
•Uninode
•Shared memory
•Only requires "directives"
•OMP_NUM_THREADS=64 ./allrun
Accelerator libraries (e.g. CUDA):
•Uninode/multinode
•Distributed memory and I/O
•Requires code rewriting
•Strongly dependent on the workflow!
3º Parallelization/Optimization strategies (parallel libraries)
Well-tested parallel/accelerated libraries cover the common kernels: linear algebra solvers, Fourier transforms (e.g. cuFFT) and parallel I/O.
In summary
•Identify the section where your code spends the most time, using a profiler.
•Determine whether your code is compute, memory and/or I/O bound.
•Decide whether you need a shared-memory or a distributed-memory/IO paradigm (OpenMP or MPI), or to call a well-tested optimized library.
Some programming advice
Tips & Tricks: Programming hints (memory optimization)
//compute the sum of two arrays in parallel (MPI version)
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 1000000
int main(void) {
    MPI_Init(NULL, NULL);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int Ni = N / world_size; /* Be careful with the memory:
                                each rank allocates only its own chunk */
    float *a = malloc(Ni * sizeof(float));
    float *b = malloc(Ni * sizeof(float));
    float *c = malloc(Ni * sizeof(float));
    /* Only the root rank needs a buffer for the full result */
    float *c_full = (world_rank == 0) ? malloc(N * sizeof(float)) : NULL;
    int i, offset = world_rank * Ni;
    /* Initialize the local chunks of arrays a and b */
    for (i = 0; i < Ni; i++) {
        a[i] = (offset + i) * 2.0f;
        b[i] = (offset + i) * 3.0f;
    }
    /* Compute the local chunk of array c = a+b in parallel */
    for (i = 0; i < Ni; i++)
        c[i] = a[i] + b[i];
    /* Collect the partial results on rank 0 (float data, so MPI_FLOAT) */
    MPI_Gather(c, Ni, MPI_FLOAT, c_full, Ni, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
//compute the sum of two arrays in parallel (OpenMP version)
#include <stdio.h>
#include <omp.h>
#define N 1000000
int main(void) {
    static float a[N], b[N], c[N]; /* static: too big for the stack */
    int i;
    /* Initialize arrays a and b */
    for (i = 0; i < N; i++) {
        a[i] = i * 2.0f;
        b[i] = i * 3.0f;
    }
    /* Compute values of array c = a+b in parallel */
    #pragma omp parallel shared(a, b, c) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }
    return 0;
}
Tips & Tricks: Programming hints (memory optimizations)
• Fortran and C have different memory layouts (column-major vs row-major): be sure that you traverse memory in the right order.
In C it is a good idea to transpose the 2nd matrix and multiply row by row (and the opposite in Fortran).
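A minimal C sketch of this hint (the function name `matmul_bt` and the square, row-major layout are our own illustrative choices): transposing B first lets both operands be read with stride 1, which is cache friendly in C.

```c
#include <stdlib.h>

/* C = A*B for n x n matrices stored row-major (C layout).
   Transposing B turns each inner product into two row scans,
   so both operands are traversed with stride 1. */
void matmul_bt(int n, const double *A, const double *B, double *C) {
    double *Bt = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];  /* Bt = transpose(B) */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)    /* row of A times row of Bt */
                s += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = s;
        }
    free(Bt);
}
```

In Fortran (column-major) the opposite holds: transpose the first matrix instead.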
Tips & Tricks: Programming hints (memory optimizations)
In MPI parallelization, be careful how you share the work between tasks to optimize memory usage: split the data so that each rank (rank 0, rank 1, rank 2, ...) holds only its own chunk, instead of replicating everything on all ranks.
Tips & Tricks: Programming hints (loop parallelization)
• Try to parallelize the "outer" loop: if the outer loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order...
Serial run time: 14 s. Parallel run time: 2 s with the right loop order, >1700 s with the wrong one.
Tips & Tricks: Programming hints (synchronization)
Minimize, as much as possible, the synchronization overhead in MPI codes.
• Fortran and C have different memory layouts: be sure that you are multiplying matrices in the right order. It is also a good idea to transpose the 2nd matrix before multiplying.
• Avoid loops that access data through "pointers": this can confuse the compiler and dramatically reduce performance.
• Try to parallelize the "outer" loop: if the outer loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order...
• Unroll loops to minimize jumps: it is more efficient to have one big loop doing 3-line operations (for example a 3D spatial operation) than 2 nested loops doing the same work.
• Avoid correlated loops: these loops are really difficult to parallelize because of their interdependences.
• If the code is I/O- or memory-bound: the best strategy is an MPI multinode parallelization.
• A tested parallel/optimized library/algorithm can reduce coding time and other issues.
Don't reinvent the wheel: "knowing the keywords" is not the same as "being a developer".
Tips & Tricks: Programming hints
Script generation and parallel job launch
How to generate SLURM script files: 1º Identify app parallelism
Thread parallelism:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=NCORES
Process parallelism:
#SBATCH --ntasks=NCORES
#SBATCH --cpus-per-task=1
How to generate SLURM script files: 2º Determine the memory requirements
The partition choice is strongly dependent on the job memory requirements!
#SBATCH --mem=63900
#SBATCH --cpus-per-task=16
#SBATCH --partition=std

#SBATCH --mem=63900
#SBATCH --cpus-per-task=8
#SBATCH --partition=std-fat

#SBATCH --mem=63900
#SBATCH --cpus-per-task=4
#SBATCH --partition=mem

#SBATCH --mem-per-cpu=3900
#SBATCH --ntasks=16
#SBATCH --partition=std

Partition      Memory/core*
std/gpu        ~4 GB
std-fat/KNL    ~8 GB
mem            ~24 GB
* Real memory values: std 3.9 GB/core, std-fat/KNL 7.9 GB/core, mem 23.9 GB/core.
How to generate SLURM script files: 3º RunTime requirements
#SBATCH --time=Thpc
Performance comparison:
WORKSTATION: 4 cores (Nws), 8-16 GB RAM, 1 TB disk at 600 MB/s, Ethernet 1-10 Gbps
HPC NODE: 48 cores (Nhpc), 192-384 GB RAM, 200 TB disk at 4 GB/s, Infiniband 100-200 Gbps
At first approximation: Thpc ≈ Tws · Nws / Nhpc
How to generate SLURM script files: 4º Disk/IO requirements
"Two" types of application:
Threaded/serial (only one node):
cd $SHAREDSCRATCH
or
cd $LOCALSCRATCH
Multitask/MPI (multinode):
cd $SHAREDSCRATCH
Or let the AI decide for you:
cd $SCRATCH
How to generate SLURM script files: Summary
1. Identify your application parallelism.
2. Estimate the resources needed by your solving algorithm.
3. Estimate the required runtime as accurately as possible.
4. Determine your job I/O and input(files) requirements.
5. Determine which are the necessary output files and save only these files
in your own disk space.
Gaussian 16 (Threaded Example)
#!/bin/bash
#SBATCH -J gau16_test
#SBATCH -o gau_test_%j.log
#SBATCH -e gau_test_%j.err
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p std
#SBATCH --mem=30000
#SBATCH --time=10-00
module load gaussian/g16b1
INPUT_DIR=$HOME/gaussian_test/inputs
OUTPUT_DIR=$HOME/gaussian_test/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
g16 < input.gau > output.out
mkdir -p $OUTPUT_DIR
cp output.out $OUTPUT_DIR
Notes: a threaded application; less than 4 GB/core, so std partition; 10 days of run time; the module load sets up the environment to run the app.
Vasp (Multitask Example)
#!/bin/bash
#SBATCH -J vasp_test
#SBATCH -o vasp_test_%j.log
#SBATCH -e vasp_test_%j.err
#SBATCH -n 24
#SBATCH -c 1
#SBATCH --mem-per-cpu=7500
#SBATCH -p std-fat
#SBATCH --time=20:00
module load vasp/5.4.4
INPUT_DIR=$HOME/vasp_test/inputs
OUTPUT_DIR=$HOME/vasp_test/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
srun `which vasp_std`
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
Notes: a multitask/MPI application; more than 4 GB/core but less than 8 GB/core, so std-fat partition; 20 min of run time; the module load sets up the environment; multitask apps require the 'srun' command (but there are exceptions like ORCA).
Gromacs (MultiTask and threaded/Accelerated Example)
#!/bin/bash
#SBATCH --job-name=gromacs
#SBATCH --output=gromacs_%j.out
#SBATCH --error=gromacs_%j.err
#SBATCH -n 24
#SBATCH -c 2
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --time=00:30:00
module load gromacs/2018.4_mpi
cd $SHAREDSCRATCH
cp -r $HOME/SLMs/gromacs/CASE/* .
srun `which gmx_mpi` mdrun -v -deffnm input_system -ntomp $SLURM_CPUS_PER_TASK -nb gpu -npme 12 -dlb yes -pin on -gpu_id 01
cp -r * /scratch/$USER/gromacs/CASE/output/
Notes: a 1-node hybrid (MPI + OpenMP) job using 2 GPUs/node on the GPU partition.
ANSYS Fluent (MultiTask Example)
#!/bin/bash
#SBATCH -J truck.cas
#SBATCH -o truck.log
#SBATCH -e truck.err
#SBATCH -p std
#SBATCH -n 16
#SBATCH --time=10-20:00
module load toolchains/gcc_mkl_ompi
INPUT_DIR=$HOME/FLUENT/inputs
OUTPUT_DIR=$HOME/FLUENT/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
/prod/ANSYS16/v162/fluent/bin/fluent 3ddp -t $SLURM_NTASKS -mpi=hp -g -i input1_50.txt
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
Best Practices
• "More cores" does not always mean "less runtime".
• Move only the necessary files (not all files each time).
• Use $SCRATCH as the working directory.
• Try to keep only important files in $HOME.
• Try to choose the partition and resources that best fit your job requirements.
Thank you for your attention!
Tips & Tricks: Architectures and programming paradigms: MPI, OpenMP, CUDA
• OpenMP: "easy", but good only for compute-bound codes.
• MPI: not so "easy"; good for compute- and memory-bound codes.
• CUDA: absolutely (too) flexible.
Identify a bound (bottleneck) "at a glance":
• If you get significantly more performance by increasing the number of cores on the same number of sockets, the code is compute bound.
• If you get significantly more performance by increasing the number of sockets for the same number of cores, it is memory/bandwidth bound.
• If you get significantly more performance by increasing the number of nodes for the same number of cores and sockets (or by using a faster disk), it is I/O bound.
In fact, all real applications have different kinds of bounds in different parts of the code...
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Introduction to Parallelization and performance optimization

  • 1. Introduction to Parallelization and performance optimization Cristian Gomollon Escribano 17 / 03 / 2021
  • 2. Why parallelize/optimize? Sometimes you have a problem and (fortunately) you know how to solve it...
  • 3. Why parallelize/optimize? Sometimes you have a problem and (fortunately) you know how to solve it... But... you don't have enough resources to do it...
  • 4. Initial situation: The typical computer consists of a CPU (~4 cores), a limited amount of memory (~8 GB) and disk (~1 TB), with low efficiency... Solution proposal: more and better CPUs, more memory, more and faster disk/network! Why parallelize/optimize?
  • 5. I have my own code and I want to parallelize/optimize its execution, how?
  • 6. 1º Analyse your code using a profiler Initialization Main loop Finalization
  • 7. Identify the section where your code spends most of its time using a profiler. 1º Analyse your code using a profiler
  • 8. Identify the section where your code spends most of its time using a profiler. Possible bounds: compute, memory, I/O. Profilers (and also tracers) can additionally identify other types of bottlenecks/overheads, such as bad memory alignment, cache misses or bad compiler "pathways". 1º Analyse your code using a profiler
  • 9. Note: not the whole code is suitable for parallelization or optimization, and those formulas are idealisations. In the real world, the overheads (parallel libraries/communication) have a relevant impact on performance... Timing results: variable setting and I/O output ~1% of the time, nested loop ~98%, std output ~1%. Some interesting metrics (Amdahl's law, RunTime), with t the total time, S the speedup, ts the serial code time, tp the parallelizable code time, and N the number of cores. 1º Analyse your code using a profiler
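The runtime and Amdahl's-law formulas shown on this slide were lost in the export; in their standard form, using the symbols defined above, they read:

```latex
t(N) = t_s + \frac{t_p}{N},
\qquad
S(N) = \frac{t_s + t_p}{\,t_s + t_p / N\,}
\;\xrightarrow[N \to \infty]{}\;
\frac{t_s + t_p}{t_s}
```

The limit makes the slide's point explicit: the serial fraction ts caps the achievable speedup no matter how many cores N are added.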
  • 10. 2º Check if it is possible to parallelize/optimize that section. Typical, potentially parallel/efficient tasks. Not so easy (but sometimes possible): f = open("demofile.txt", "r")
  • 11. Typical, potentially parallel/efficient tasks. Not so easy (but sometimes possible): f = open("demofile.txt", "r"). In general, the repetitive parts of the code (loops/math ops) are best suited for a parallelization/optimization strategy. 2º Check if it is possible to parallelize/optimize that section
  • 12. 3º Parallelization/Optimization strategies Parallel/Accelerated libraries Parallel programming paradigms Accelerator libraries
  • 13. 3º Parallelization/Optimization strategies (parallel programming)
  MPI: uninode/multinode; distributed memory and I/O; slight recoding needed; network dependent; mpirun -np 256 ./allrun
  OpenMP: uninode; shared memory; only requires "directives"; OMP_NUM_THREADS=64 ./allrun
  Accelerators: uninode/multinode; distributed memory and I/O; requires code rewriting; strongly dependent on the workflow!
  • 15. Linear algebra solvers; Fourier transform (cuFFT); parallel I/O. 3º Parallelization/Optimization strategies (parallel libraries)
  • 16. In summary: identify the section where your code spends the most time using a profiler. Determine if your code is compute, memory and/or I/O bound. Decide if you need a shared memory or a distributed memory/IO paradigm (OpenMP or MPI), or call a well-tested optimized library.
  • 18. Tips & Tricks: Programming hints (memory optimization)

  OpenMP (shared memory) version:

      // compute the sum of two arrays in parallel
      #include <stdio.h>
      #include <omp.h>
      #define N 1000000
      int main(void) {
        static float a[N], b[N], c[N];  /* static: too large for the default stack */
        int i;
        /* Initialize arrays a and b */
        for (i = 0; i < N; i++) {
          a[i] = i * 2.0;
          b[i] = i * 3.0;
        }
        /* Compute values of array c = a+b in parallel */
        #pragma omp parallel shared(a, b, c) private(i)
        {
          #pragma omp for
          for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
          }
        }
        return 0;
      }

  MPI (distributed memory) version:

      // compute the sum of two arrays in parallel
      #include <stdio.h>
      #include <stdlib.h>
      #include <mpi.h>
      #define N 1000000
      int main(void) {
        MPI_Init(NULL, NULL);
        int world_size, world_rank;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        int Ni = N / world_size;  /* assumes N is divisible by world_size */
        /* Be careful with the memory: each rank allocates only its own chunk */
        float *a = malloc(Ni * sizeof(float));
        float *b = malloc(Ni * sizeof(float));
        float *c = malloc(Ni * sizeof(float));
        float *c_full = (world_rank == 0) ? malloc(N * sizeof(float)) : NULL;
        int i, offset = world_rank * Ni;
        /* Initialize the local pieces of arrays a and b */
        for (i = 0; i < Ni; i++) {
          a[i] = (offset + i) * 2.0;
          b[i] = (offset + i) * 3.0;
        }
        /* Compute the local piece of c = a+b */
        for (i = 0; i < Ni; i++) {
          c[i] = a[i] + b[i];
        }
        /* Collect the full result on rank 0 (the data is float, so MPI_FLOAT) */
        MPI_Gather(c, Ni, MPI_FLOAT, c_full, Ni, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
      }
  • 19. Tips & Tricks: Programming hints (memory optimizations) • Fortran and C have different memory layouts (column-major vs row-major): be sure that you traverse memory in the right order.
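A minimal sketch of that rule in C (the array sizes and function names are illustrative): the only difference between the two functions below is the loop nesting order, yet the first walks memory contiguously while the second jumps a full row per access.

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

/* Fill the matrix with a constant value. */
void fill(double a[ROWS][COLS], double v) {
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] = v;
}

/* C is row-major: a[i][j] and a[i][j+1] are adjacent in memory,
   so keeping j in the inner loop is cache-friendly. */
double sum_row_major(double a[ROWS][COLS]) {
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)   /* inner loop: stride 1 */
            s += a[i][j];
    return s;
}

/* Same result, but the inner loop strides COLS elements per access:
   this is the natural order in Fortran, and the slow one in C. */
double sum_col_major(double a[ROWS][COLS]) {
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)   /* inner loop: stride COLS */
            s += a[i][j];
    return s;
}
```

Both functions return the same sum; on large matrices the row-major version is typically several times faster because each cache line it loads is fully used.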
  • 20. Tips & Tricks: Programming hints (memory optimizations) It is a good idea to transpose the 2nd matrix and multiply row by row in C (and the opposite in Fortran). In MPI parallelization, be careful how you share the work between tasks to optimize the memory usage. (Figure: work split across Rank 0, Rank 1, Rank 2 vs. all ranks)
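A sketch of the transpose trick in C (matrices stored row-major in flat arrays; sizes and names are illustrative): both functions compute the same product, but after transposing B every inner-loop access is contiguous.

```c
#include <stddef.h>

/* Naive C = A*B for n-by-n matrices: the inner loop walks B
   column-wise, i.e. with stride n, which is cache-hostile. */
void matmul_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];   /* B accessed with stride n */
            C[i*n + j] = s;
        }
}

/* Transpose B into Bt first, then multiply row by row:
   both inner-loop accesses are now stride-1. */
void matmul_transposed(size_t n, const double *A, const double *B,
                       double *C, double *Bt) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            Bt[j*n + i] = B[i*n + j];           /* Bt = B^T */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * Bt[j*n + k];  /* two row walks, stride 1 */
            C[i*n + j] = s;
        }
}
```

The O(n²) transpose cost is negligible next to the O(n³) multiply, which is why the trick pays off for any reasonably large n.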
  • 21. • Try to parallelize the "outer" loop: if the "outer" loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order... Serial RunTime: 14 s. Tips & Tricks: Programming hints (loop parallelization)
  • 22. Serial RunTime: 14 s; Parallel RunTime: 2 s vs >1700 s, depending on which loop is parallelized. Tips & Tricks: Programming hints (loop parallelization) • Try to parallelize the "outer" loop: if the "outer" loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order...
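A minimal sketch of the rule (the dimensions and function name are illustrative): when the logically "outer" dimension is tiny, reorder the nest so the large trip count is outermost and carries the OpenMP pragma.

```c
#include <stddef.h>

#define NX 4        /* few "outer" elements, e.g. spatial dimensions */
#define NY 100000   /* large inner dimension */

/* Parallelizing a 4-iteration loop leaves most threads idle; after
   swapping the nest, the pragma distributes NY iterations instead of NX.
   Compile with -fopenmp; without it the pragma is ignored and the
   function simply runs serially with the same result. */
void scale(double *v, double factor) {      /* v has NY*NX elements */
    #pragma omp parallel for
    for (ptrdiff_t j = 0; j < NY; j++)      /* large loop outermost */
        for (ptrdiff_t i = 0; i < NX; i++)
            v[j * NX + i] *= factor;
}
```

Each iteration of the parallel loop writes a distinct slice of v, so no synchronization is needed inside the nest.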
  • 23. Tips & Tricks: Programming hints(Synchronization) Minimize, as much as possible, the synchronization overhead in MPI codes
  • 24. Tips & Tricks: Programming hints • Fortran and C have different memory layouts: be sure that you multiply matrices in the right order. It is also a good idea to transpose the 2nd matrix before multiplying. • Avoid loops accessing "pointers": this can confuse the compiler and dramatically reduce performance. • Try to parallelize the "outer" loop: if the "outer" loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order... • Unroll loops to minimize jumps: it is more efficient to have one big loop doing 3 explicit operations (for example a 3D spatial operation) than 2 nested loops doing the same op. • Avoid correlated loops: these loops are really difficult to parallelize because of their interdependences. • If the code is I/O- or memory-bound, the best strategy is an MPI multinode parallelization. • A tested parallel/optimized library/algorithm can save coding time and avoid other issues. Don't re-invent the wheel: "knowing the keywords" is not the same as "being a developer".
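The unrolling tip above can be sketched as follows (sizes and names are illustrative): the second function does exactly what the first does, but the tiny 3-iteration inner loop is replaced by three explicit statements.

```c
#define NPOINTS 1000

/* Nested version: the 3-iteration inner loop adds branch and
   loop-counter overhead on every point. */
void shift_nested(double *r, const double d[3]) {   /* r has NPOINTS*3 elements */
    for (int p = 0; p < NPOINTS; p++)
        for (int k = 0; k < 3; k++)                 /* tiny inner loop */
            r[p*3 + k] += d[k];
}

/* Unrolled version: one big loop with three straight-line operations
   per point, which the compiler can vectorize more easily. */
void shift_unrolled(double *r, const double d[3]) {
    for (int p = 0; p < NPOINTS; p++) {
        r[p*3 + 0] += d[0];
        r[p*3 + 1] += d[1];
        r[p*3 + 2] += d[2];
    }
}
```

Modern compilers often unroll such fixed-count inner loops themselves at -O2/-O3, but writing the straight-line form makes the optimization independent of compiler heuristics.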
  • 25. Script generation and Parallel job launch
  • 26. How to generate SLURM script files: 1º Identify app parallelism.
  Thread parallelism:
      #SBATCH --ntasks=1
      #SBATCH --cpus-per-task=NCORES
  Process parallelism:
      #SBATCH --ntasks=NCORES
      #SBATCH --cpus-per-task=1
  • 27. How to generate SLURM script files: 2º Determine the memory requirements. The partition choice is strongly dependent on the job memory requirements!
      #SBATCH --mem=63900
      #SBATCH --cpus-per-task=8
      #SBATCH --partition=std-fat

      #SBATCH --mem=63900
      #SBATCH --cpus-per-task=16
      #SBATCH --partition=std

      #SBATCH --mem=63900
      #SBATCH --cpus-per-task=4
      #SBATCH --partition=mem

      #SBATCH --mem-per-cpu=3900
      #SBATCH --ntasks=16
      #SBATCH --partition=std
  Memory per core*: std/gpu ~4 GB; std-fat/KNL ~8 GB; mem ~24 GB. * Real memory values: std 3.9 GB/core; std-fat/KNL 7.9 GB/core; mem 23.9 GB/core.
  • 28. How to generate SLURM script files: 3º RunTime requirements. #SBATCH --time=Thpc. Performance comparison: WORKSTATION: 4 cores (Nws), 8-16 GB RAM, 1 TB disk at 600 MB/s, 1-10 Gbps Ethernet. HPC NODE: 48 cores (Nhpc), 192-384 GB RAM, 200 TB disk at 4 GB/s, 100-200 Gbps Infiniband. At first approximation:
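The slide's "first approximation" formula did not survive the export; given the symbols Nws and Nhpc defined above, a plausible reading (an assumption, not necessarily the slide's exact expression) is simple scaling by core count:

```latex
T_{\mathrm{hpc}} \;\approx\; T_{\mathrm{ws}} \cdot \frac{N_{\mathrm{ws}}}{N_{\mathrm{hpc}}}
```

That is, take the runtime measured on the workstation and scale it by the ratio of core counts to get a rough --time estimate for the HPC node.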
  • 29. How to generate SLURM script files: 4º Disk/IO requirements. "Two" types of application: Threaded/serial (only one node): cd $SHAREDSCRATCH or cd $LOCALSCRATCH. Multitask/MPI (multinode): cd $SHAREDSCRATCH. Or let the AI decide for you: cd $SCRATCH
  • 30. How to generate SLURM script files: Summary 1. Identify your application parallelism. 2. Estimate the resources needed by your solving algorithm. 3. Estimate the required runtime as accurately as possible. 4. Determine your job I/O and input (files) requirements. 5. Determine which output files are necessary and save only those files in your own disk space.
  • 31. Gaussian 16 (Threaded Example)
      #!/bin/bash
      #SBATCH -J gau16_test
      #SBATCH -o gau_test_%j.log
      #SBATCH -e gau_test_%j.err
      #SBATCH -n 1
      #SBATCH -c 16
      #SBATCH -p std
      #SBATCH --mem=30000
      #SBATCH --time=10-00
      module load gaussian/g16b1
      INPUT_DIR=$HOME/gaussian_test/inputs
      OUTPUT_DIR=$HOME/gaussian_test/outputs
      cd $SCRATCH
      cp -r $INPUT_DIR/* .
      g16 < input.gau > output.out
      mkdir -p $OUTPUT_DIR
      cp -r output.out $OUTPUT_DIR
  Threaded application. Less than 4 GB/core: std partition. 10 days RunTime. 'module load' sets up the environment to run the app.
  • 32. Vasp (Multitask Example)
      #!/bin/bash
      #SBATCH -J vasp_test
      #SBATCH -o vasp_test_%j.log
      #SBATCH -e vasp_test_%j.err
      #SBATCH -n 24
      #SBATCH -c 1
      #SBATCH --mem-per-cpu=7500
      #SBATCH -p std-fat
      #SBATCH --time=20:00
      module load vasp/5.4.4
      INPUT_DIR=$HOME/vasp_test/inputs
      OUTPUT_DIR=$HOME/vasp_test/outputs
      cd $SCRATCH
      cp -r $INPUT_DIR/* .
      srun `which vasp_std`
      mkdir -p $OUTPUT_DIR
      cp -r * $OUTPUT_DIR
  Multitask/MPI application. More than 4 GB/core but less than 8 GB/core -> std-fat partition. 20 min RunTime. 'module load' sets up the environment to run the app. Multitask apps require the 'srun' command (but there are exceptions, like ORCA).
  • 33. Gromacs (MultiTask and threaded/Accelerated Example)
      #!/bin/bash
      #SBATCH --job-name=gromacs
      #SBATCH --output=gromacs_%j.out
      #SBATCH --error=gromacs_%j.err
      #SBATCH -n 24
      #SBATCH -c 2
      #SBATCH -N 1
      #SBATCH -p gpu
      #SBATCH --gres=gpu:2
      #SBATCH --time=00:30:00
      module load gromacs/2018.4_mpi
      cd $SHAREDSCRATCH
      cp -r $HOME/SLMs/gromacs/CASE/* .
      srun `which gmx_mpi` mdrun -v -deffnm input_system -ntomp $SLURM_CPUS_PER_TASK -nb gpu -npme 12 -dlb yes -pin on -gpu_id 01
      cp -r * /scratch/$USER/gromacs/CASE/output/
  1-node hybrid job! 2 GPUs/node on the gpu partition.
  • 34. ANSYS Fluent (MultiTask Example)
      #!/bin/bash
      #SBATCH -J truck.cas
      #SBATCH -o truck.log
      #SBATCH -e truck.err
      #SBATCH -p std
      #SBATCH -n 16
      #SBATCH --time=10-20:00
      module load toolchains/gcc_mkl_ompi
      INPUT_DIR=$HOME/FLUENT/inputs
      OUTPUT_DIR=$HOME/FLUENT/outputs
      cd $SCRATCH
      cp -r $INPUT_DIR/* .
      /prod/ANSYS16/v162/fluent/bin/fluent 3ddp -t $SLURM_NTASKS -mpi=hp -g -i input1_50.txt
      mkdir -p $OUTPUT_DIR
      cp -r * $OUTPUT_DIR
  • 35. Best Practices • "More cores" is not always equal to "less runtime". • Move only the necessary files (not all files each time). • Use $SCRATCH as the working directory. • Try to keep only important files at $HOME. • Try to choose the partition and resources that best fit your job requirements.
  • 36. Thank you for your attention!
  • 38. Why parallelize/optimize? Initial situation: The typical computer consists of a CPU (~4 cores), a limited amount of memory (~8 GB) and disk (~1 TB), with low efficiency...
  • 39. Tips & Tricks: Architectures and programming paradigms: MPI, OpenMP, CUDA. "Easy" and good only for compute-bound codes. Not so "easy", good for compute/memory-bound codes. Absolutely (too much) flexible.
  • 42. Identify a bounding (bottleneck) "at a glance": If you get significantly more performance by increasing the number of cores on the same number of sockets, the code is compute-bound. If you get significantly more performance by increasing the number of sockets on the same number of cores, it is memory/bandwidth-bound. If you get significantly more performance by increasing the number of nodes on the same number of cores and sockets (or by using a faster HDD), it is I/O-bound. 1º Analyse your code using a profiler
  • 43. Identify a bounding (bottleneck) "at a glance": In fact, all real applications have different kinds of bounds in different parts of the code... If you get significantly more performance by increasing the number of cores on the same number of sockets, the code is compute-bound. If you get significantly more performance by increasing the number of sockets on the same number of cores, it is memory/bandwidth-bound. If you get significantly more performance by increasing the number of nodes on the same number of cores and sockets (or by using a faster HDD), it is I/O-bound. 1º Analyse your code using a profiler