© 2015 IBM Corporation
Tutorial: Hybrid parallel programming with MPI and OpenMP
a practical approach
Eduardo Rodrigues
edrodri@br.ibm.com
3rd High Performance Computing Workshop – Partnership: USP – Rice University
IBM Research
Brazil Lab research areas
Industrial Technology and Science
Systems of Engagement and Insight
Social Data Analytics
Natural Resource Solutions
https://jobs3.netmedia1.com/cp/faces/job_summary?job_id=RES-0689175
https://jobs3.netmedia1.com/cp/faces/job_search
IBM Research
Legal stuff
● This presentation represents the views of the
author and does not necessarily represent the
views of IBM.
● Company, product and service names may be
trademarks or service marks of others.
IBM Research
Agenda
● MPI and OpenMP
– Motivation
– Basic functions / directives
– Hybrid usage
– Performance examples
● AMPI – load balancing
IBM Research
Parallel Programming Models
fork-join
message passing
Power8
https://en.wikipedia.org/wiki/Computer_cluster#/media/File:Beowulf.jpg
IBM Research
Motivation
shared memory
fast network interconnect
hybrid model
Why MPI / OpenMP?
They are open standards, and they match current HPC architectures.
IBM Research
MPI 101
● Message Passing Interface – a share-nothing model;
● The most basic functions:
– MPI_Init, MPI_Finalize,
MPI_Comm_rank,
MPI_Comm_size,
MPI_Send, MPI_Recv
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    int rbuff, sbuff;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* send the rank to the next process in a ring,
       receive from the previous one */
    sbuff = rank;
    MPI_Send(&sbuff, 1, MPI_INT,
             (rank+1) % size,       /* destination */
             1,                     /* tag */
             MPI_COMM_WORLD);
    MPI_Recv(&rbuff, 1, MPI_INT,
             (rank+size-1) % size,  /* source */
             1,                     /* tag */
             MPI_COMM_WORLD, &status);
    /* note: a blocking MPI_Send posted before MPI_Recv can deadlock for
       large messages; it works here because a single int is sent eagerly */
    printf("rank %d - rbuff %d\n", rank, rbuff);

    MPI_Finalize();
    return 0;
}
Output:
$ mpirun -np 4 ./a.out
rank 0 - rbuff 3
rank 2 - rbuff 1
rank 1 - rbuff 0
rank 3 - rbuff 2
● Over 500 functions, but why?
IBM Research
Send/Recv flavors (1)
● MPI_Send, MPI_Recv
● MPI_Isend, MPI_Irecv
● MPI_Bsend
● MPI_Ssend
● MPI_Rsend
IBM Research
Send/Recv flavors (2)
● MPI_Send - Basic blocking send operation. Routine returns only after the application
buffer in the sending task is free for reuse.
● MPI_Recv - Receive a message and block until the requested data is available in the
application buffer in the receiving task.
● MPI_Ssend - Synchronous blocking send: Send a message and block until the
application buffer in the sending task is free for reuse and the destination process has
started to receive the message.
● MPI_Bsend - Buffered blocking send: permits the programmer to allocate the
required amount of buffer space into which data can be copied until it is delivered.
Insulates against the problems associated with insufficient system buffer space.
● MPI_Rsend - Blocking ready send. Should only be used if the programmer is certain
that the matching receive has already been posted.
● MPI_Isend, MPI_Irecv - nonblocking send / recv
● MPI_Wait
● MPI_Probe
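A minimal sketch of the nonblocking pair (not from the original slides): it redoes the ring exchange from the MPI 101 example with MPI_Isend / MPI_Irecv and completes both requests with MPI_Wait; buffer and variable names are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size, sbuff, rbuff;
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuff = rank;
    /* post both operations; neither call blocks */
    MPI_Irecv(&rbuff, 1, MPI_INT, (rank+size-1) % size, 1,
              MPI_COMM_WORLD, &rreq);
    MPI_Isend(&sbuff, 1, MPI_INT, (rank+1) % size, 1,
              MPI_COMM_WORLD, &sreq);

    /* useful computation could overlap the communication here */

    /* the buffers may only be reused after the waits return */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    printf("rank %d - rbuff %d\n", rank, rbuff);
    MPI_Finalize();
    return 0;
}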
IBM Research
Collective communication
IBM Research
Collective communication
how MPI_Bcast works
IBM Research
Collective communication
how MPI_Allreduce works
Peter Pacheco, Introduction to Parallel Programming
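A minimal sketch of the two collectives illustrated above (not on the original slides); the values are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size, n = 0, local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_Bcast: root 0 sends n to every process in the communicator */
    if (rank == 0) n = 42;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* MPI_Allreduce: every process receives the global sum */
    local = rank * n;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: n = %d, sum = %d\n", rank, n, sum);
    MPI_Finalize();
    return 0;
}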
IBM Research
(Some) New features
● Process creation (MPI_Comm_spawn);
● MPI I/O (HDF5);
● Non-blocking collectives;
● One-sided communication
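To illustrate the non-blocking collectives item, a minimal MPI-3 sketch (not on the original slides): the broadcast is started with MPI_Ibcast, independent work may overlap it, and MPI_Wait completes it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, n = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) n = 42;
    MPI_Ibcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* independent work can proceed while the broadcast is in flight */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* n is valid on all ranks from here */
    printf("rank %d: n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}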
IBM Research
One-sided communication
Active target
/* declarations added for completeness; assumes MPI_Init, MPI_Comm_rank
   and MPI_Comm_size have already filled rank and size */
int i, *a, *b;
MPI_Win win;

MPI_Alloc_mem(sizeof(int)*size, MPI_INFO_NULL, &a);
MPI_Alloc_mem(sizeof(int)*size, MPI_INFO_NULL, &b);

/* the window length is given in bytes, the displacement unit per element */
MPI_Win_create(a, sizeof(int)*size, sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);

for (i = 0; i < size; i++)
    a[i] = rank * 100 + i;

printf("Process %d has the following:", rank);
for (i = 0; i < size; i++)
    printf(" %d", a[i]);
printf("\n");

/* first fence opens the RMA access/exposure epoch */
MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
for (i = 0; i < size; i++)
    MPI_Get(&b[i], 1, MPI_INT, i, rank, 1, MPI_INT, win);
/* second fence closes the epoch; b is only valid after this call */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);

printf("Process %d obtained the following:", rank);
for (i = 0; i < size; i++)
    printf(" %d", b[i]);
printf("\n");

MPI_Win_free(&win);
IBM Research
Level of Thread Support
● MPI_THREAD_SINGLE - Level 0: Only one thread will execute.
● MPI_THREAD_FUNNELED - Level 1: The process may be multi-threaded, but only
the main thread will make MPI calls - all MPI calls are funneled to the main thread.
● MPI_THREAD_SERIALIZED - Level 2: The process may be multi-threaded, and
multiple threads may make MPI calls, but only one at a time. That is, calls are not
made concurrently from two distinct threads as all MPI calls are serialized.
● MPI_THREAD_MULTIPLE - Level 3: Multiple threads may call MPI with no
restrictions.
int MPI_Init_thread(int *argc, char ***argv,
                    int required, int *provided)
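A minimal hybrid sketch (not from the original slides) that ties the thread levels above to an OpenMP region: it requests MPI_THREAD_FUNNELED, so only the main thread makes MPI calls.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, provided;

    /* only the main thread will call MPI, so FUNNELED is enough */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* threads compute; no MPI calls inside the parallel region */
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}

Built with something like mpicc -fopenmp and launched with mpirun, each MPI rank runs its own OpenMP team; OMP_NUM_THREADS controls the team size.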
IBM Research
OpenMP
https://en.wikipedia.org/wiki/File:OpenMP_language_extensions.svg
Directives and function library
IBM Research
OpenMP 101
#include <omp.h>
#include <stdio.h>

int main() {
    printf("sequential A\n");
    #pragma omp parallel num_threads(3)
    {
        int id = omp_get_thread_num();
        printf("parallel %d\n", id);
    }
    printf("sequential B\n");
}
Points to keep in mind:
- OpenMP uses shared memory for communication (and synchronization);
- race conditions may occur – the user is responsible for synchronizing
accesses and avoiding data conflicts (see the sketch below);
- synchronization is expensive and should be avoided when possible;
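As referenced in the list above, a minimal sketch of a data race and one way to synchronize it (not on the original slides); the counts are arbitrary.

#include <omp.h>
#include <stdio.h>

int main() {
    int racy = 0, safe = 0;

    #pragma omp parallel num_threads(4)
    {
        for (int i = 0; i < 100000; i++)
            racy++;                  /* data race: unsynchronized shared update */

        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic       /* synchronized, but serializes the update */
            safe++;
        }
    }

    /* with 4 threads, safe is 400000; racy is usually smaller and varies */
    printf("racy = %d, safe = %d\n", racy, safe);
}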
IBM Research
OpenMP internals
#include <omp.h>
#include <stdio.h>

int main() {
    printf("sequential A\n");
    #pragma omp parallel num_threads(3)
    {
        int id = omp_get_thread_num();
        printf("parallel %d\n", id);
    }
    printf("sequential B\n");
}
.LC0:
.string "sequential A"
.align 3
.LC1:
.string "sequential B
(...)
addis 3,2,.LC0@toc@ha
addi 3,3,.LC0@toc@l
bl puts
nop
addis 3,2,main._omp_fn.0@toc@ha
addi 3,3,main._omp_fn.0@toc@l
li 4,0
li 5,5
bl GOMP_parallel_start
nop
li 3,0
bl main._omp_fn.0
bl GOMP_parallel_end
nop
addis 3,2,.LC1@toc@ha
addi 3,3,.LC1@toc@l
bl puts
(...)
main._omp_fn.0:
(…)
bl printf
(...)
libgomp
IBM Research
OpenMP Internals
Tim Mattson, Intel
IBM Research
OpenMP 101
● Parallel loops
● Data environment
● Synchronization
● Reductions
#include <omp.h>
#include <stdio.h>
#define SX 4
#define SY 4

int main() {
    int mat[SX][SY];
    omp_set_nested(1);
    printf(">>> %d\n", omp_get_nested());
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < SX; i++) {
        int outerId = omp_get_thread_num();
        #pragma omp parallel for num_threads(2)
        for (int j = 0; j < SY; j++) {
            int innerId = omp_get_thread_num();
            mat[i][j] = (outerId+1)*100 + innerId;
        }
    }
    for (int i = 0; i < SX; i++) {
        for (int j = 0; j < SY; j++) {  /* SY, not SX, bounds the columns */
            printf("%d ", mat[i][j]);
        }
        printf("\n");
    }
}
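The bullet list on this slide also mentions reductions; a minimal sketch of a parallel sum with the reduction clause (not on the original slides).

#include <omp.h>
#include <stdio.h>
#define N 1000

int main() {
    double a[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* each thread accumulates a private partial sum; OpenMP combines them */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
}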
IBM Research
Power8
IBM Journal of Research and Development, Issue 1, Jan.-Feb. 2015
IBM Research
Power8
IBM Research
Power8 performance evaluation
IBM Research
Performance examples: a word of
caution
● Hybrid programming is not always beneficial;
● Some examples:
– NAS-NPB;
– Ocean-Land-Atmosphere Model (OLAM);
– Weather Research and Forecasting Model (WRF);
IBM Research
NAS-NPB
● Scalar Pentadiagonal (SP) and Block
Tridiagonal (BT) benchmarks
● Intrepid (BlueGene/P) at Argonne National
Laboratory
Xingfu Wu, Valerie Taylor, Performance Characteristics of Hybrid MPI/OpenMP
Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore
Clusters, The Computer Journal, 2012.
IBM Research
SP - Hybrid vs. pure MPI
IBM Research
BT - Hybrid vs. pure MPI
IBM Research
OLAM
● Global grid that can be locally refined;
● This feature allows simultaneous representation (and forecasting) of both
global-scale and local-scale phenomena, as well as bi-directional
interactions between the scales.
Carla Osthoff et al, Improving Performance on
Atmospheric Models through a Hybrid OpenMP/MPI
Implementation, 9th IEEE International Symposium
on Parallel and Distributed Processing with
Applications, 2011.
IBM Research
OLAM 200Km
IBM Research
OLAM 40Km
IBM Research
OLAM 40Km with Physics
IBM Research
WRF
Don Morton, et al, Pushing WRF To Its Computational Limits, Presentation
at Alaska Weather Symposium, 2010.
IBM Research
WRF
IBM Research
WRF
IBM Research
Motivation for AMPI
● MPI is a de facto standard for parallel programming
● However, modern applications may have:
– a load distribution across processors that varies over the course
of the simulation;
– adaptive grid refinement;
– multiple modules, for different physical components, combined
in the same simulation;
– algorithmic requirements on the number of processors to be used.
● Several of these characteristics do not fit well with
conventional MPI implementations
IBM Research
Alternative: Adaptive MPI
● Adaptive MPI (AMPI) is an implementation of the MPI
standard based on Charm++
● With AMPI, existing MPI applications can be reused with
only a few modifications to the original code
● AMPI is available and portable across several
architectures.
IBM Research
Adaptive MPI: General Principles
● In AMPI, each MPI task is embedded in a Charm++ object
(an array element, or user-level thread)
● Like any Charm++ object, AMPI tasks (threads) can
migrate between processors
IBM Research
Adaptive MPI and Virtualization
● Benefits of virtualization:
– Automatic overlap of computation and communication
– Better cache usage
– Flexibility for load balancing
IBM Research
Example
IBM Research
Load Balancers Available in Charm++
IBM Research
Example of a real application:
BRAMS – 64 processors, 1024 threads
IBM Research
Final remarks
● MPI / OpenMP hybrid
– Probably the most popular hybrid programming
technologies/standards;
– Suitable for current architectures;
– May not produce the best performance, though;
● OpenPower
– Lots of cores and even more threads (lots of fun :-)
● Load balancing may be an issue
– AMPI is an adaptive alternative to vanilla MPI