MPI and OpenMP
(Lecture 25, cs262a)
Ion Stoica,
UC Berkeley
November 19, 2016
Message passing vs. Shared memory
Message passing: exchange data
explicitly via IPC
Application developers define the
protocol and exchange format, the
number of participants, and each
exchange
Shared memory: allows multiple processes
to share data via memory
Applications must locate and map
shared memory regions to exchange
data
[Diagram: message passing — clients exchange a MSG via IPC using send(msg)/recv(msg); shared memory — clients communicate by reading and writing a shared memory region]
Architectures
Uniform Shared Memory (UMA), e.g., Cray 2
Massively Parallel Distributed Memory, e.g., Blue Gene/L
Non-Uniform Shared Memory (NUMA), e.g., SGI Altix 3700
Orthogonal to programming model
MPI
MPI - Message Passing Interface
• Library standard defined by a committee of vendors, implementers, and
parallel programmers
• Used to create parallel programs based on message passing
Portable: one standard, many implementations
• Available on almost all parallel machines in C and Fortran
• De facto standard platform for the HPC community
Groups, Communicators, Contexts
Group: a fixed, ordered set of k
processes, i.e., ranks 0, 1, …, k-1
Communicator: specify scope of
communication
• Between processes in a group
• Between two disjoint groups
Context: partition communication space
• A message sent in one context cannot
be received in another context
Image from: "Writing Message Passing Parallel Programs with MPI", Course Notes, Edinburgh Parallel Computing Centre, The University of Edinburgh
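For concreteness, here is a minimal sketch, not from the original slides, of creating disjoint communicators with MPI_Comm_split; the parity-based split and variable names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes with the same "color" end up in the same new communicator;
     * here even and odd ranks form two disjoint groups. */
    int color = world_rank % 2;
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

    int sub_rank, sub_size;
    MPI_Comm_rank(subcomm, &sub_rank);
    MPI_Comm_size(subcomm, &sub_size);
    printf("World rank %d is rank %d of %d in its subgroup\n",
           world_rank, sub_rank, sub_size);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}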
Synchronous vs. Asynchronous Message Passing
A synchronous communication is not complete until the
message has been received
An asynchronous communication completes before the
message is received
Communication Modes
Synchronous: completes once an ack is received by the sender
Asynchronous: 3 modes
• Standard send: completes once the message has been sent, which may
or may not imply that the message has arrived at its destination
• Buffered send: completes immediately; if the receiver is not ready, MPI
buffers the message locally
• Ready send: completes immediately; if the receiver is ready for the
message it will get it, otherwise the message is dropped silently
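As a hedged illustration, not from the slides, these modes correspond to MPI_Ssend (synchronous), MPI_Send (standard), MPI_Bsend (buffered), and MPI_Rsend (ready); the sketch below shows only the sender side of a two-process run, with illustrative values.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: different send modes from rank 0 to rank 1 (receiver side omitted). */
void send_modes_demo(int rank, int data) {
    if (rank != 0) return;

    /* Synchronous send: returns only after the matching receive has started. */
    MPI_Ssend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

    /* Buffered send: returns immediately; the message is copied into a
     * user-attached buffer if the receiver is not ready. */
    int bufsize = MPI_BSEND_OVERHEAD + sizeof(int);
    void *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);

    /* MPI_Rsend(&data, 1, MPI_INT, 1, 2, MPI_COMM_WORLD) would be a ready
     * send: legal only if the matching receive is already posted. */
}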
Blocking vs. Non-Blocking
Blocking means the program will not continue until the
communication is completed
• Synchronous communication
• Barriers: wait for every process in the group to reach a point in
execution
Non-blocking means the program will continue without waiting
for the communication to be completed
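A minimal non-blocking sketch, not from the slides: each side posts an MPI_Isend/MPI_Irecv, overlaps other work, and blocks in MPI_Wait only when the data is actually needed; the value and tag are illustrative.

#include <mpi.h>

/* Sketch: non-blocking send/receive between ranks 0 and 1. */
void nonblocking_demo(int rank) {
    int value = 42;
    MPI_Request req = MPI_REQUEST_NULL;

    if (rank == 0)
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    /* ... do useful work that does not touch 'value' ... */

    /* Block only now, when completion actually matters. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}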
MPI library
Huge (125 functions)
Basic (6 functions)
MPI Basic
Many parallel programs can be written using just these six
functions, only two of which are non-trivial:
– MPI_INIT
– MPI_FINALIZE
– MPI_COMM_SIZE
– MPI_COMM_RANK
– MPI_SEND
– MPI_RECV
Skeleton MPI Program (C)
#include <mpi.h>
int main(int argc, char** argv)
{
MPI_Init(&argc, &argv);
/* main part of the program */
/* Use MPI function calls depending on your data
* partitioning and the parallelization architecture
*/
MPI_Finalize();
}
A minimal MPI program (C)
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
printf("Hello, world!\n");
MPI_Finalize();
return 0;
}
A minimal MPI program (C)
#include "mpi.h" provides basic MPI definitions and types.
MPI_Init starts MPI
MPI_Finalize exits MPI
Notes:
• Non-MPI routines are local; this printf runs on each process
• MPI functions return error codes or MPI_SUCCESS
Error handling
By default, an error causes all processes to abort
Users can install their own error-handling routines
Some custom error handlers are available for download on the net
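As a sketch of the idea, not taken from the slides, the predefined handler MPI_ERRORS_RETURN makes MPI calls return error codes instead of aborting, so the caller can inspect them; the deliberately invalid destination rank is illustrative.

#include <mpi.h>
#include <stdio.h>

/* Sketch: ask MPI to return error codes instead of aborting all processes. */
void enable_error_checking(void) {
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int bad_rank = size;   /* one past the last valid rank: guaranteed invalid */
    int value = 0;
    int err = MPI_Send(&value, 1, MPI_INT, bad_rank, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }
}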
Improved Hello (C)
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
/* rank of this process in the communicator */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* get the size of the group associated with the communicator */
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("I am %d of %d\n", rank, size);
MPI_Finalize();
return 0;
}
Improved Hello (C)
/* Find out rank, size */
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int number;
if (world_rank == 0) {
number = -1;
MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process 1 received number %d from process 0n", number);
}
Send/receive arguments: buffer, number of elements, datatype, rank of destination (send) or source (receive), tag to identify the message, communicator (here the default MPI_COMM_WORLD), and a status for MPI_Recv.
Many other functions…
MPI_Bcast: send same piece of
data to all processes in the group
MPI_Scatter: send different
pieces of an array to different
processes (i.e., partition an array
across processes)
From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
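A hedged sketch, not from the slides, combining the two calls above: rank 0 broadcasts a parameter and scatters an array; the sizes assume N is divisible by the number of processes, and all names are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: rank 0 broadcasts a scalar and scatters an array of N ints. */
void bcast_scatter_demo(int N) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int param = 0;
    int *full = NULL;
    if (rank == 0) {
        param = 7;                            /* value every process needs */
        full = malloc(N * sizeof(int));       /* data to partition */
        for (int i = 0; i < N; i++) full[i] = i;
    }

    /* Every process receives the same value of param. */
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each process receives its own contiguous chunk of the array. */
    int chunk = N / size;
    int *local = malloc(chunk * sizeof(int));
    MPI_Scatter(full, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... work on local[0..chunk-1] ... */
    free(local);
    if (rank == 0) free(full);
}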
Many other functions…
MPI_Gather: take elements from
many processes and gathers them
to one single process
• E.g., parallel sorting, searching
From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
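A matching sketch (illustrative, not from the slides): each process contributes one partial result and rank 0 gathers them into an array.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: gather one int per process onto rank 0. */
void gather_demo(int partial) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *results = NULL;
    if (rank == 0)
        results = malloc(size * sizeof(int));   /* receive buffer matters only at the root */

    MPI_Gather(&partial, 1, MPI_INT, results, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* results[0..size-1] now holds one value from each process */
        free(results);
    }
}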
Many other functions…
MPI_Reduce: takes an array of
input elements on each process
and returns an array of output
elements to the root process
given a specified operation
MPI_Allreduce: like
MPI_Reduce, but distributes the
result to all processes
From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
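A hedged sketch of the reduction pattern, not from the slides: each process contributes a local partial sum and MPI_Reduce combines them with MPI_SUM on rank 0; MPI_Allreduce would leave the total on every rank.

#include <mpi.h>
#include <stdio.h>

/* Sketch: combine per-process partial sums into a global sum on rank 0. */
void reduce_demo(double local_sum) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    /* MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
     *               MPI_COMM_WORLD) would give every rank the result. */
}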
MPI Discussion
Gives full control to programmer
• Exposes number of processes
• Communication is explicit, driven by the program
Assumes
• Long-running processes
• Homogeneous (same performance) processors
Little support for failures, no straggler mitigation
Summary: achieves high performance by hand-optimizing jobs, but this
requires experts and offers little support for fault tolerance
OpenMP
Based on the “Introduction to OpenMP” presentation:
(webcourse.cs.technion.ac.il/236370/Winter2009.../OpenMPLecture.ppt)
Motivation
Multicore CPUs are everywhere:
• Servers with over 100 cores today
• Even smartphone CPUs have 8 cores
Multithreading is the natural programming model
• All processors share the same memory
• Threads in a process see same address space
• Many shared-memory algorithms developed
But…
Multithreading is hard
• Lots of expertise necessary
• Deadlocks and race conditions
• Non-deterministic behavior makes it hard to debug
Example
Parallelize the following code using threads:
for (i=0; i<n; i++) {
sum = sum + sqrt(sin(data[i]));
}
Why hard?
• Need mutex to protect the accesses to sum
• Different code for serial and parallel version
• No built-in tuning (# of processors?)
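To make the pain concrete, here is a hedged sketch of what a manual pthreads version might look like (thread count, array size, and names are all illustrative): the code changes shape completely, needs a mutex around the shared sum, and must pick the number of threads itself.

#include <pthread.h>
#include <math.h>

#define NTHREADS 4                      /* illustrative, hard-coded */

static double data[1000000], sum = 0.0;
static int n = 1000000;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = (int)(long)arg;
    int start = id * n / NTHREADS, end = (id + 1) * n / NTHREADS;
    double local = 0.0;
    for (int i = start; i < end; i++)
        local += sqrt(sin(data[i]));
    pthread_mutex_lock(&lock);          /* protect the shared accumulator */
    sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

void parallel_sum(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}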
OpenMP
A language extension with constructs for parallel programming:
• Critical sections, atomic access, private variables, barriers
Parallelization is orthogonal to functionality
• If the compiler does not recognize OpenMP directives, the code
remains functional (albeit single-threaded)
Industry standard: supported by Intel, Microsoft, IBM, HP
OpenMP execution model
Fork and Join: Master thread spawns a team of threads as
needed
[Diagram: the master thread FORKs a team of worker threads at each parallel region and JOINs them at the end of the region]
OpenMP memory model
Shared memory model
• Threads communicate by accessing shared variables
The sharing is defined syntactically
• Any variable that is seen by two or more threads is shared
• Any variable that is seen by one thread only is private
Race conditions possible
• Use synchronization to protect from conflicts
• Change how data is stored to minimize the synchronization
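As an illustration, not from the slides, of the synchronization constructs mentioned above: an atomic directive protects a simple counter update, while a critical section protects a read-compare-write that atomic cannot express; all names are illustrative.

#include <omp.h>

/* Sketch: protecting conflicting updates to shared variables. */
void max_and_count(int n, const double *a, double *max_out, long *count_out) {
    double best = a[0];
    long count = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic          /* cheap: a single-memory-location update */
        count++;

        #pragma omp critical        /* mutual exclusion for read-compare-write */
        {
            if (a[i] > best) best = a[i];
        }
    }
    *max_out = best;
    *count_out = count;
}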
OpenMP: Work sharing example
answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { … }
How to parallelize?
OpenMP: Work sharing example
answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { … }
How to parallelize?
#pragma omp sections
{
#pragma omp section
answer1 = long_computation_1();
#pragma omp section
answer2 = long_computation_2();
}
if (answer1 != answer2) { … }
OpenMP: Work sharing example
Sequential code for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
OpenMP: Work sharing example
Sequential code
(Semi) manual
parallelization
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
#pragma omp parallel
{
int id = omp_get_thread_num();
int nt = omp_get_num_threads();
int i_start = id*N/nt, i_end = (id+1)*N/nt;
for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}
OpenMP: Work sharing example
Sequential code
(Semi) manual
parallelization
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
#pragma omp parallel
{
int id = omp_get_thread_num();
int nt = omp_get_num_threads();
int i_start = id*N/nt, i_end = (id+1)*N/nt;
for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}
• Launch nt threads
• Each thread uses id
and nt variables to
operate on a different
segment of the arrays
OpenMP: Work sharing example
Sequential code
(Semi) manual
parallelization
Automatic
parallelization of
the for loop
using
#parallel for
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
#pragma omp parallel
{
int id = omp_get_thread_num();
int nt = omp_get_num_threads();
int i_start = id*N/nt, i_end = (id+1)*N/nt;
for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}
#pragma omp parallel
{
#pragma omp for schedule(static)
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
}
The loop must be in canonical form: one signed loop variable;
initialization: var = init; comparison: var op last, where op is <, >, <=, or >=;
increment: var++, var--, var += incr, or var -= incr.
Challenges of #parallel for
Load balancing
• If all iterations execute at the same speed, the processors are used optimally
• If some iterations are faster, some processors may go idle, reducing the speedup
• We don’t always know the distribution of work; we may need to redistribute it dynamically
Granularity
• Thread creation and synchronization takes time
• Assigning work to threads on per-iteration resolution may take more time than the
execution itself
• Need to coalesce the work to coarse chunks to overcome the threading overhead
Trade-off between load balancing and granularity
Schedule: controlling work distribution
schedule(static [, chunksize])
• Default: chunks of approximately equivalent size, one to each thread
• If more chunks than threads: assigned in round-robin to the threads
• Why might we want to use chunks of different sizes?
schedule(dynamic [, chunksize])
• Threads receive chunk assignments dynamically
• Default chunk size = 1
schedule(guided [, chunksize])
• Start with large chunks
• Threads receive chunks dynamically. Chunk size reduces
exponentially, down to chunksize
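A hedged sketch, not from the slides, of the clause in context: dynamic scheduling with a modest chunk size balances a loop whose iterations have very different costs; the data layout and chunk size are illustrative.

#include <omp.h>

/* Sketch: dynamic scheduling for a loop with uneven iteration cost. */
void process_rows(int n, double **rows, int *row_len) {
    /* Iterations are handed out in chunks of 16 as threads become free,
     * which balances load when row lengths differ widely. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < row_len[i]; j++)
            rows[i][j] *= 2.0;
    }
}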
OpenMP: Data Environment
Shared Memory programming model
• Most variables (including locals) are shared by threads
{
int sum = 0;
#pragma omp parallel for
for (int i=0; i<N; i++) sum += i;
}
• Global variables are shared
Some variables can be private
• Variables declared inside the statement block
• Variables local to called functions
• Variables can be explicitly declared as private
Overriding storage attributes
private:
• A copy of the variable is created for
each thread
• There is no connection between
original variable and private copies
• Can achieve same using variables
inside { }
firstprivate:
• Same, but the initial value of the
variable is copied from the main
copy
lastprivate:
• Same, but the last value of the
variable is copied to the main copy
int i;
#pragma omp parallel for private(i)
for (i=0; i<n; i++) { … }
int idx=1;
int x = 10;
#pragma omp parallel for firstprivate(x) lastprivate(idx)
for (i=0; i<n; i++) {
if (data[i] == x)
idx = i;
}
Reduction
for (j=0; j<N; j++) {
sum = sum + a[j]*b[j];
}
How to parallelize this code?
• sum is not private, but accessing it atomically is too expensive
• Have a private copy of sum in each thread, then add them up
Use the reduction clause
#pragma omp parallel for reduction(+: sum)
• Any associative operator could be used: +, -, ||, |, *, etc
• The private value is initialized automatically (to 0, 1, ~0 …)
#pragma omp reduction
float dot_prod(float* a, float* b, int N)
{
float sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for(int i = 0; i < N; i++) {
sum += a[i] * b[i];
}
return sum;
}
Conclusions
OpenMP: A framework for code parallelization
• Available for C, C++, and Fortran
• Based on a standard
• Implementations from a wide selection of vendors
Relatively easy to use
• Write (and debug!) code first, parallelize later
• Parallelization can be incremental
• Parallelization can be turned off at runtime or compile time
• Code is still correct for a serial machine