© 2015 IBM Corporation
Tutorial: Programação
paralela híbrida com MPI e
OpenMP
uma abordagem prática
Eduardo Rodrigues
edrodri@br.ibm.com
3°. Workshop de High Performance Computing – Convênio: USP – Rice University
IBM Research Brazil Lab research areas
Industrial Technology and Science
Systems of Engagement and Insight
Social Data Analytics
Natural Resource Solutions
https://jobs3.netmedia1.com/cp/faces/job_summary?job_id=RES-0689175
https://jobs3.netmedia1.com/cp/faces/job_search
Legal stuff
● This presentation represents the views of the
author and does not necessarily represent the
views of IBM.
● Company, product and service names may be
trademarks or service marks of others.
Agenda
● MPI and OpenMP
– Motivation
– Basic functions / directives
– Hybrid usage
– Performance examples
● AMPI – load balancing
Parallel Programming Models
fork-join (e.g. within a shared-memory Power8 node) and message passing (e.g. across a Beowulf cluster)
https://en.wikipedia.org/wiki/Computer_cluster#/media/File:Beowulf.jpg
Motivation
Current HPC architectures combine shared memory within a node with a fast network interconnection between nodes, which motivates the hybrid model.
Why MPI / OpenMP? They are open standards.
MPI 101
● Message Passing Interface – share-nothing model;
● The most basic functions:
– MPI_Init, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    int rbuff, sbuff;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuff = rank;
    MPI_Send(&sbuff, 1, MPI_INT, (rank+1) % size, 1, MPI_COMM_WORLD);
    MPI_Recv(&rbuff, 1, MPI_INT, (rank+size-1) % size, 1, MPI_COMM_WORLD, &status);

    printf("rank %d - rbuff %d\n", rank, rbuff);

    MPI_Finalize();
    return 0;
}

Output:
$ mpirun -np 4 ./a.out
rank 0 - rbuff 3
rank 2 - rbuff 1
rank 1 - rbuff 0
rank 3 - rbuff 2
● Over 500 functions, but why?
Send/Recv flavors (1)
● MPI_Send, MPI_Recv
● MPI_Isend, MPI_Irecv
● MPI_Bsend
● MPI_Ssend
● MPI_Rsend
Send/Recv flavors (2)
● MPI_Send - Basic blocking send operation. Routine returns only after the application
buffer in the sending task is free for reuse.
● MPI_Recv - Receive a message and block until the requested data is available in the
application buffer in the receiving task.
● MPI_Ssend - Synchronous blocking send: Send a message and block until the
application buffer in the sending task is free for reuse and the destination process has
started to receive the message.
● MPI_Bsend - Buffered blocking send: permits the programmer to allocate the required amount of buffer space into which data can be copied until it is delivered. Insulates against the problems associated with insufficient system buffer space.
● MPI_Rsend - Blocking ready send. Should only be used if the programmer is certain
that the matching receive has already been posted.
● MPI_Isend, MPI_Irecv - nonblocking send / recv
● MPI_Wait - blocks until a nonblocking operation completes
● MPI_Probe - checks for an incoming message without receiving it
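A minimal nonblocking sketch (not from the slides), reusing the ring pattern of the earlier example: send and receive are posted together and completed with MPI_Waitall, which avoids deadlock risks and lets computation overlap with communication.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, sbuff, rbuff;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuff = rank;
    /* post both operations up front */
    MPI_Isend(&sbuff, 1, MPI_INT, (rank+1) % size,      1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&rbuff, 1, MPI_INT, (rank+size-1) % size, 1, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation could run here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* block until both complete */
    printf("rank %d - rbuff %d\n", rank, rbuff);

    MPI_Finalize();
    return 0;
}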
Collective communication
How MPI_Bcast works
How MPI_Allreduce works
Peter Pacheco, An Introduction to Parallel Programming
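The broadcast and all-reduce figures (from Pacheco) are not reproduced here; as a minimal sketch, not taken from the slides, the two collectives can be used like this:

#include <mpi.h>
#include <stdio.h>

/* Rank 0 broadcasts a value to everyone; then every rank receives
   the global sum of all ranks. */
int main(int argc, char **argv) {
    int rank, size, n, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    n = (rank == 0) ? 42 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);          /* now n == 42 on every rank */

    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: n=%d, sum of ranks=%d\n", rank, n, sum);

    MPI_Finalize();
    return 0;
}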
(Some) New features
● Process creation (MPI_Comm_spawn);
● MPI I/O (HDF5);
● Non-blocking collectives;
● One-sided communication
One-sided communication
Active target
/* Fragment: rank, size, i, int *a, *b and MPI_Win win are assumed to be
   declared and MPI initialized earlier. Each process exposes array a in a
   window and reads element `rank` from every other process into b. */
MPI_Alloc_mem(sizeof(int)*size, MPI_INFO_NULL, &a);
MPI_Alloc_mem(sizeof(int)*size, MPI_INFO_NULL, &b);

/* the window size is given in bytes */
MPI_Win_create(a, size*sizeof(int), sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);

for (i = 0; i < size; i++)
    a[i] = rank * 100 + i;

printf("Process %d has the following:", rank);
for (i = 0; i < size; i++)
    printf(" %d", a[i]);
printf("\n");

/* fences open and close the access/exposure epoch */
MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
for (i = 0; i < size; i++)
    MPI_Get(&b[i], 1, MPI_INT, i, rank, 1, MPI_INT, win);
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);   /* b is valid only after this fence */

printf("Process %d obtained the following:", rank);
for (i = 0; i < size; i++)
    printf(" %d", b[i]);
printf("\n");

MPI_Win_free(&win);
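The fragment above uses active-target synchronization: every process in the window's group calls MPI_Win_fence. A passive-target variant, in which only the origin process synchronizes, is sketched below; win, b and rank are assumed to be set up as above, and target is an assumed variable holding the rank whose window is accessed (hypothetical fragment, not from the slides).

/* Passive-target epoch: lock the target's window, read one int from
   displacement `rank`, then unlock. The target makes no matching call. */
MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
MPI_Get(&b[target], 1, MPI_INT, target, rank, 1, MPI_INT, win);
MPI_Win_unlock(target, win);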
Level of Thread Support
● MPI_THREAD_SINGLE - Level 0: Only one thread will execute.
● MPI_THREAD_FUNNELED - Level 1: The process may be multi-threaded, but only the main thread will make MPI calls - all MPI calls are funneled to the main thread.
● MPI_THREAD_SERIALIZED - Level 2: The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: calls are not made concurrently from two distinct threads, as all MPI calls are serialized.
● MPI_THREAD_MULTIPLE - Level 3: Multiple threads may call MPI with no restrictions.
int MPI_Init_thread(int *argc, char *((*argv)[]),
int required, int *provided)
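A minimal hybrid sketch (not from the slides): the process requests MPI_THREAD_FUNNELED, OpenMP threads handle the intra-node work, and only the main thread calls MPI. The thread count is whatever the OpenMP runtime (e.g. OMP_NUM_THREADS) selects.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size, nthreads = 1, total;

    /* FUNNELED: the process may be multi-threaded, but only the
       main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP parallelism inside each MPI process */
    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();
    }

    /* MPI call made by the main thread only, as FUNNELED requires */
    MPI_Reduce(&nthreads, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d MPI processes, %d OpenMP threads in total\n", size, total);

    MPI_Finalize();
    return 0;
}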
OpenMP
Directives and a runtime function library
https://en.wikipedia.org/wiki/File:OpenMP_language_extensions.svg
OpenMP 101
#include <omp.h>
#include <stdio.h>

int main() {
    printf("sequential A\n");
    #pragma omp parallel num_threads(3)
    {
        int id = omp_get_thread_num();
        printf("parallel %d\n", id);
    }
    printf("sequential B\n");
}
Points to keep in mind:
- OpenMP uses shared memory for communication (and synchronization);
- race conditions may occur – the user is responsible for synchronizing accesses and avoiding data conflicts (see the sketch below);
- synchronization is expensive and should be avoided.
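To make the race-condition point concrete, a minimal sketch (hypothetical counter example, not from the slides) contrasting an unsynchronized update with an atomic one:

#include <omp.h>
#include <stdio.h>

int main() {
    int unsafe = 0, safe = 0;

    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 100000; i++) {
        unsafe++;              /* data race: threads update the shared counter concurrently */
        #pragma omp atomic
        safe++;                /* synchronized update: correct, but pays a synchronization cost */
    }

    /* unsafe is usually less than 100000; safe is always 100000 */
    printf("unsafe=%d safe=%d\n", unsafe, safe);
}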
OpenMP internals
#include <omp.h>
#include <stdio.h>

int main() {
    printf("sequential A\n");
    #pragma omp parallel num_threads(3)
    {
        int id = omp_get_thread_num();
        printf("parallel %d\n", id);
    }
    printf("sequential B\n");
}
.LC0:
.string "sequential A"
.align 3
.LC1:
.string "sequential B
(...)
addis 3,2,.LC0@toc@ha
addi 3,3,.LC0@toc@l
bl puts
nop
addis 3,2,main._omp_fn.0@toc@ha
addi 3,3,main._omp_fn.0@toc@l
li 4,0
li 5,5
bl GOMP_parallel_start
nop
li 3,0
bl main._omp_fn.0
bl GOMP_parallel_end
nop
addis 3,2,.LC1@toc@ha
addi 3,3,.LC1@toc@l
bl puts
(...)
main._omp_fn.0:
(…)
bl printf
(...)
libgomp
OpenMP Internals
Tim Mattson, Intel
OpenMP 101
● Parallel loops
● Data environment
● Synchronization
● Reductions (see the reduction sketch after the nested-loop example below)
#include <omp.h>
#include <stdio.h>

#define SX 4
#define SY 4

int main() {
    int mat[SX][SY];

    omp_set_nested(1);
    printf(">>> %d\n", omp_get_nested());

    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < SX; i++) {
        int outerId = omp_get_thread_num();
        #pragma omp parallel for num_threads(2)
        for (int j = 0; j < SY; j++) {
            int innerId = omp_get_thread_num();
            mat[i][j] = (outerId+1)*100 + innerId;
        }
    }

    for (int i = 0; i < SX; i++) {
        for (int j = 0; j < SY; j++) {
            printf("%d ", mat[i][j]);
        }
        printf("\n");
    }
}
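The Reductions bullet above has no example on the slide; a minimal sketch (not from the deck) of the OpenMP reduction clause:

#include <omp.h>
#include <stdio.h>

int main() {
    const int n = 1000;
    long sum = 0;

    /* each thread accumulates a private partial sum;
       OpenMP combines them at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;

    printf("sum = %ld\n", sum);   /* 500500 */
}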
Power8
IBM Journal of Research and Development, Issue 1, Jan.-Feb. 2015
Power8 performance evaluation
Performance examples: a word of caution
● Hybrid programming is not always beneficial;
● Some examples:
– NAS-NPB;
– Ocean-Land-Atmosphere Model (OLAM);
– Weather Research and Forecasting Model (WRF);
NAS-NPB
● Scalar Pentadiagonal (SP) and Block
Tridiagonal (BT) benchmarks
● Intrepid (BlueGene/P) at Argonne National
Laboratory
Xingfu Wu, Valerie Taylor, Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Clusters, The Computer Journal, 2012.
SP - Hybrid vs. pure MPI
BT - Hybrid vs. pure MPI
OLAM
● Global grid that can be locally refined;
● This feature allows simultaneous representation (and forecasting) of both global-scale and local-scale phenomena, as well as bi-directional interactions between scales
Carla Osthoff et al., Improving Performance on Atmospheric Models through a Hybrid OpenMP/MPI Implementation, 9th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2011.
OLAM 200Km
OLAM 40Km
OLAM 40Km with Physics
WRF
Don Morton et al., Pushing WRF To Its Computational Limits, presentation at the Alaska Weather Symposium, 2010.
Motivation for AMPI
● MPI is a de facto standard for parallel programming
● However, modern applications may have:
– load distribution across processors that varies over the course of the simulation;
– adaptive grid refinement;
– multiple modules, corresponding to different physical components, combined in the same simulation;
– algorithmic requirements on the number of processors to be used.
● Several of these characteristics do not fit well with conventional MPI implementations
Alternative: Adaptive MPI
● Adaptive MPI (AMPI) is an implementation of the MPI standard based on Charm++
● With AMPI, existing MPI applications can be reused with only a few modifications to the original code
● AMPI is available and portable across many architectures.
Adaptive MPI: General Principles
● In AMPI, each MPI task is embedded in a Charm++ object (an array element, or user-level thread)
● Like any Charm++ object, AMPI tasks (threads) can migrate between processors
Adaptive MPI and Virtualization
● Benefits of virtualization:
– Automatic overlap of computation and communication
– Better cache usage
– Flexibility for load balancing
Example
Load balancers available in Charm++
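As a usage sketch, based on standard Charm++/AMPI conventions rather than on the slides (program name, processor counts and balancer choice are illustrative), an MPI code is built with AMPI's compiler wrapper and run with more virtual processors than physical ones:

# compile the (mostly unmodified) MPI code with AMPI's wrapper
ampicc -o myapp myapp.c

# 64 physical processors, 1024 virtual processors (MPI ranks as migratable
# user-level threads), with a greedy load balancer; the balancer must be
# linked in (e.g. -module CommonLBs), depending on the Charm++ build
./charmrun +p64 ./myapp +vp1024 +balancer GreedyLB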
Real-application example: BRAMS – 64 processors, 1024 threads
Final remarks
● MPI / OpenMP hybrid
– Probably the most popular hybrid programming technology/standard;
– Suitable for current architectures;
– May not produce the best performance, though;
● OpenPower
– Lots of cores and even more threads (lots of fun :-)
● Load balancing may be an issue
– AMPI is an adaptive alternative to vanilla MPI