Hybrid OpenMP and MPI Programming and Tuning
Yun (Helen) He and Chris Ding
Lawrence Berkeley National Laboratory

06/24/2004

NUG2004

Outline
Introduction

Why Hybrid
Compile, Link, and Run
Parallelization Strategies
Simple Example: Ax=b
MPI_init_thread Choices
Debug and Tune

Examples

Multi-dimensional Array Transpose
Community Atmosphere Model
MM5 Regional Climate Model
Some Other Benchmarks

Conclusions
MPI vs. OpenMP
Pure MPI Pro:
Portable to distributed and shared memory machines.
Scales beyond one node
No data placement problem
Pure MPI Con:
Difficult to develop and debug
High latency, low bandwidth
Explicit communication
Large granularity
Difficult load balancing

Pure OpenMP Pro:
Easy to implement parallelism
Low latency, high bandwidth
Implicit communication
Coarse and fine granularity
Dynamic load balancing
Pure OpenMP Con:
Only on shared memory machines
Scales only within one node
Possible data placement problem
No specific thread order

Why Hybrid
The hybrid MPI/OpenMP paradigm is the software trend for clusters of SMP architectures.
Elegant in concept and architecture: use MPI across nodes and OpenMP within a node.
Makes good use of shared-memory system resources (memory, latency, and bandwidth).
Avoids the extra overhead of MPI communication within a node.
OpenMP adds finer granularity (and allows larger MPI message sizes) and enables
increased and/or dynamic load balancing.
Some problems have two-level parallelism naturally.
Some problems can only use a restricted number of MPI tasks.
Could have better scalability than both pure MPI and pure OpenMP.
My code speeds up by a factor of 4.44.

Why Is Mixed OpenMP/MPI Code Sometimes Slower?
OpenMP has less scalability due to implicit parallelism, while MPI allows
multi-dimensional blocking.
All threads except one are idle during MPI communication.
Computation and communication need to be overlapped for better performance.
Critical sections are needed for shared variables.
Thread creation overhead.
Cache coherence, data placement.
Problems with natural one-level parallelism.
Pure OpenMP code performs worse than pure MPI within a node.
Lack of optimized OpenMP compilers/libraries.
Positive and negative experiences:
Positive: CAM, MM5, …
Negative: NAS, CG, PS, …

A Pseudo Hybrid Code
Program hybrid
   call MPI_INIT (ierr)
   call MPI_COMM_RANK (…)
   call MPI_COMM_SIZE (…)
   … some computation and MPI communication
   call OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL DO PRIVATE(i)
!$OMP& SHARED(n)
   do i=1,n
      … computation
   enddo
!$OMP END PARALLEL DO
   … some computation and MPI communication
   call MPI_FINALIZE (ierr)
end

Compile, Link, and Run
% mpxlf90_r -qsmp=omp -o hybrid -O3 hybrid.f90
% setenv XLSMPOPTS parthds=4
(or % setenv OMP_NUM_THREADS 4)
% poe hybrid -nodes 2 -tasks_per_node 4
Loadleveler Script: (% llsubmit job.hybrid)
#@ shell = /usr/bin/csh
#@ output = $(jobid).$(stepid).out
#@ error = $(jobid).$(stepid).err
#@ class = debug
#@ node = 2
#@ tasks_per_node = 4
#@ network.MPI = csss,not_shared,us
#@ wall_clock_limit = 00:02:00
#@ notification = complete
#@ job_type = parallel
#@ environment = COPY_ALL
#@ queue
hybrid
exit

Other Environment Variables
MP_WAIT_MODE: task wait mode; can be poll, yield, or sleep.
The default value is poll for US and sleep for IP.
MP_POLLING_INTERVAL: the polling interval.
By default, a thread in an OpenMP application goes to sleep after finishing its work.
Putting a thread into a busy wait instead of sleep can reduce the overhead of
thread reactivation.
SPINLOOPTIME: time spent in busy wait before yielding.
YIELDLOOPTIME: time spent in the spin-yield cycle before going to sleep.
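For illustration only (the values below are examples, not recommendations from the slides), a busy-wait setup could be configured in csh as:
% setenv MP_WAIT_MODE yield
% setenv SPINLOOPTIME 500
% setenv YIELDLOOPTIME 500
% setenv MP_POLLING_INTERVAL 20000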

Loop-based vs. SPMD
Loop-based:

!$OMP PARALLEL DO PRIVATE(i)
!$OMP& SHARED(a,b,n)
do i=1,n
   a(i) = a(i) + b(i)
enddo
!$OMP END PARALLEL DO

SPMD:

!$OMP PARALLEL PRIVATE(i, start, end, thrd_id, num_thrds)
!$OMP& SHARED(a,b,n)
num_thrds = omp_get_num_threads()
thrd_id = omp_get_thread_num()
start = n*thrd_id/num_thrds + 1
end = n*(thrd_id+1)/num_thrds
do i = start, end
   a(i) = a(i) + b(i)
enddo
!$OMP END PARALLEL

SPMD code normally gives better performance than loop-based code, but is more
difficult to implement:
• Less thread synchronization.
• Fewer cache misses.
• More compiler optimizations.

Hybrid Parallelization Strategies
From sequential code, decompose with MPI first, then
add OpenMP.
From OpenMP code, treat as serial code.
From MPI code, add OpenMP.
The simplest and least error-prone way is to use MPI outside the parallel region
and allow only the master thread to communicate between MPI tasks.
MPI could also be used inside the parallel region with a thread-safe MPI.

A Simple Example: Ax=b

[Figure: Ax=b decomposition; each MPI process holds a block of columns of A, and rows are distributed among OpenMP threads]

c = 0.0
do j = 1, n_loc
!$OMP DO PARALLEL
!$OMP SHARED(a,b), PRIVATE(i)
!$OMP REDUCTION(+:c)
   do i = 1, nrows
      c(i) = c(i) + a(i,j)*b(i)
   enddo
enddo
call MPI_REDUCE_SCATTER(c)

• OMP does not support vector reduction
• Wrong answer since c is shared!

Correct Implementations
OpenMP:

c = 0.0
!$OMP PARALLEL SHARED(c), PRIVATE(c_loc)
c_loc = 0.0
do j = 1, n_loc
!$OMP DO PRIVATE(i)
   do i = 1, nrows
      c_loc(i) = c_loc(i) + a(i,j)*b(i)
   enddo
!$OMP END DO NOWAIT
enddo
!$OMP CRITICAL
c = c + c_loc
!$OMP END CRITICAL
!$OMP END PARALLEL
call MPI_REDUCE_SCATTER(c)

IBM SMP:

c = 0.0
!$SMP PARALLEL REDUCTION(+:c)
c = 0.0
do j = 1, n_loc
!$SMP DO PRIVATE(i)
   do i = 1, nrows
      c(i) = c(i) + a(i,j)*b(i)
   enddo
!$SMP END DO NOWAIT
enddo
!$SMP END PARALLEL
call MPI_REDUCE_SCATTER(c)

MPI_INIT_Thread Choices
MPI_INIT_THREAD (required, provided, ierr)
IN: required, desired level of thread support (integer).
OUT: provided, provided level of thread support (integer).
The returned provided may be less than required.

Thread support levels:

MPI_THREAD_SINGLE: Only one thread will execute.
MPI_THREAD_FUNNELED: Process may be multi-threaded, but only the main thread
will make MPI calls (all MPI calls are "funneled" to the main thread).
Default value for the SP.
MPI_THREAD_SERIALIZED: Process may be multi-threaded and multiple threads may
make MPI calls, but only one at a time: MPI calls are not made concurrently
from two distinct threads (all MPI calls are "serialized").
MPI_THREAD_MULTIPLE: Multiple threads may call MPI, with no restrictions.
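As a minimal sketch (not from the original slides), a hybrid code might request the FUNNELED level and check what the library actually provides:

program init_example
! Request MPI_THREAD_FUNNELED and verify the returned support level.
implicit none
include 'mpif.h'
integer :: required, provided, ierr
required = MPI_THREAD_FUNNELED
call MPI_INIT_THREAD(required, provided, ierr)
! fall back or abort here if the library provides less than we need
if (provided < required) then
   print *, 'provided thread level ', provided, ' < required ', required
endif
! ... hybrid MPI/OpenMP computation ...
call MPI_FINALIZE(ierr)
end program init_example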

MPI Calls Inside OMP MASTER
MPI_THREAD_FUNNELED is required.
OMP_BARRIER is needed since there is no
synchronization with OMP_MASTER.
It implies all other threads are sleeping!
!$OMP BARRIER
!$OMP MASTER
call MPI_xxx(…)
!$OMP END MASTER
!$OMP BARRIER

MPI Calls Inside OMP SINGLE
MPI_THREAD_SERIALIZED is required.
OMP_BARRIER is needed since OMP_SINGLE only
guarantees synchronization at the end.
It also implies all other threads are sleeping!
!$OMP BARRIER
!$OMP SINGLE
call MPI_xxx(…)
!$OMP END SINGLE

THREAD FUNNELED/SERIALIZED
vs. Pure MPI
FUNNELED/SERIALIZED:
All other threads are sleeping while a single thread communicates.
A single communicating thread may not be able to saturate the inter-node
bandwidth.
Pure MPI:
Every CPU communicating may oversaturate the inter-node bandwidth.

Overlap communication with computation!
Overlap COMM and COMP
Need at least MPI_THREAD_FUNNELED.
While master or single thread is making MPI calls,
other threads are computing!
Must be able to separate codes that can run before or
after halo info is received. Very hard!
!$OMP PARALLEL
if (my_thread_rank < 1) then
call MPI_xxx(…)
else
do some computation
endif
!$OMP END PARALLEL

Scheduling for OpenMP
Static: Loops are divided into #thrds partitions, each containing
ceiling(#iters/#thrds) iterations.
Affinity: Loops are divided into #thrds partitions, each containing
ceiling(#iters/#thrds) iterations. Then each partition is subdivided into
chunks containing ceiling(#left_iters_in_partition/2) iterations.
Guided: Loops are divided into progressively smaller chunks until the chunk
size is 1. The first chunk contains ceiling(#iters/#thrds) iterations. Each
subsequent chunk contains ceiling(#left_iters/#thrds) iterations.
Dynamic, n: Loops are divided into chunks, each containing n iterations.
We choose different chunk sizes.
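A minimal sketch (not from the original slides) of selecting a schedule on a loop; work() and n are placeholder names, and AFFINITY is an IBM extension:

!$OMP PARALLEL DO PRIVATE(i) SHARED(n) SCHEDULE(DYNAMIC,1)
do i = 1, n
   call work(i)      ! placeholder for the loop body
enddo
!$OMP END PARALLEL DO
! Alternatives: SCHEDULE(STATIC), SCHEDULE(GUIDED), SCHEDULE(AFFINITY)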

Debug and Tune Hybrid Codes
Debug and Tune MPI code and OpenMP code separately.
Use Guideview or Assureview to tune OpenMP code.
Use Vampir to tune MPI code.

Decide which loop to parallelize. It is usually better to parallelize the outer loop.
Decide whether loop permutation, fusion, or exchange is needed.
Choose between loop-based and SPMD.
Try different OpenMP task scheduling options.
Experiment with different combinations of MPI tasks and number of threads per
MPI task (see the example below). Fewer MPI tasks may not saturate the
inter-node bandwidth.
Adjust environment variables.
Aggressively investigate different thread initialization options and the
possibility of overlapping communication with computation.
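For example (node size and settings assumed, not from the slides), with 8 CPUs per node one could keep the total CPU count fixed and vary the MPI/OpenMP split:

% setenv OMP_NUM_THREADS 2
% poe hybrid -nodes 2 -tasks_per_node 4
% setenv OMP_NUM_THREADS 4
% poe hybrid -nodes 2 -tasks_per_node 2
% setenv OMP_NUM_THREADS 8
% poe hybrid -nodes 2 -tasks_per_node 1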

KAP OpenMP Compiler - Guide
A high-performance OpenMP compiler for Fortran,
C and C++.
Also supports the full debugging and performance
analysis of OpenMP and hybrid MPI/OpenMP
programs via Guideview.
% guidef90 <driver options> -WG,<guide options>
<filename> <xlf compiler options>
% guideview <statfile>

KAP OpenMP Debugging Tools - Assure
A programming tool to validate the correctness of an OpenMP program.
% assuref90 -WApname=pg -o a.exe a.f -O3
% a.exe
% assureview pg

Could also be used to validate the OpenMP section in a hybrid MPI/OpenMP code:
% mpassuref90 <driver options> -WA,<assure options> <filename> <xlf compiler options>
% setenv KDD_OUTPUT project.%H.%I
% poe ./a.out -procs 2 -nodes 4
% assureview assure.prj project.{hostname}.{process-id}.kdd

Other Debugging, Performance
Monitoring and Tuning Tools
HPM Toolkit: IBM hardware performance monitor for
C/C++, Fortran77/90, HPF.
TAU: C/C++, Fortran, Java performance tool.
Totalview: Graphic parallel debugger for C/C++, F90.
Vampir: MPI performance tool.
Xprofiler: Graphic profiling tool.

Story 1: Distributed Multi-Dimensional Array
Transpose With Vacancy Tracking Method
A(3,2) → A(2,3)
Tracking cycle: 1 - 3 - 4 - 2 - 1

A(2,3,4) → A(3,4,2), tracking cycles:
1 - 4 - 16 - 18 - 3 - 12 - 2 - 8 - 9 - 13 - 6 - 1
5 - 20 - 11 - 21 - 15 - 14 - 10 - 17 - 22 - 19 - 7 - 5

Cycles are closed and non-overlapping.
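For illustration, a minimal sketch (assumed code, not the authors' original implementation) of an in-place 2-D transpose that follows such vacancy-tracking cycles; the routine and variable names (vt_transpose, moved) are hypothetical, whereas the original method precomputes cycle start points in an info_table:

subroutine vt_transpose(a, m, n)
! In-place transpose of an m x n array stored 1-D, column-major,
! by moving elements along vacancy-tracking cycles.
implicit none
integer, intent(in)    :: m, n
real,    intent(inout) :: a(0:m*n-1)
logical :: moved(0:m*n-1)
integer :: start, src, dst
real    :: hole
moved = .false.
moved(0) = .true.
moved(m*n-1) = .true.           ! first and last elements never move
do start = 1, m*n-2
   if (moved(start)) cycle      ! already handled by an earlier cycle
   hole = a(start)              ! vacating this position opens a new cycle
   dst  = start
   do
      src = mod(dst, n)*m + dst/n   ! position whose element belongs at dst
      if (src == start) exit
      a(dst) = a(src)
      moved(dst) = .true.
      dst = src
   enddo
   a(dst) = hole                ! the saved element closes the cycle
   moved(dst) = .true.
enddo
end subroutine vt_transpose

For A(3,2) this sketch visits exactly the cycle 1 - 3 - 4 - 2 - 1 quoted above; because the cycles are closed and non-overlapping, each one is an independent unit of work, which is what the multi-threaded version on the next slide exploits.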
Multi-Threaded Parallelism
Key: Independence of tracking cycles.
!$OMP PARALLEL DO DEFAULT (PRIVATE)
!$OMP& SHARED (N_cycles, info_table, Array)
!$OMP& SCHEDULE (AFFINITY)
do k = 1, N_cycles
   an inner loop of memory exchange for each cycle using info_table
enddo
!$OMP END PARALLEL DO

Scheduling for OpenMP
within one Node
64x512x128: N_cycles = 4114, cycle_lengths = 16
16x1024x256: N_cycles = 29140, cycle_lengths = 9, 3
Schedule "affinity" is the best for a large number of regular, short cycles.
8x1000x500: N_cycles = 132, cycle_lengths = 8890, 1778, 70, 14, 5
32x100x25: N_cycles = 42, cycle_lengths = 168, 24, 21, 8, 3
Schedule "dynamic,1" is the best for a small number of cycles with large,
irregular cycle lengths.

Pure MPI and Pure
OpenMP within One Node

OpenMP vs. MPI (16 CPUs)

64x512x128: 2.76 times faster
16x1024x256: 1.99 times faster

Pure MPI and Hybrid MPI/OpenMP
Across Nodes

With 128 CPUs, n_thrds=4 hybrid MPI/OpenMP performs faster than n_thrds=16
hybrid by a factor of 1.59, and faster than pure MPI by a factor of 4.44.

Story 2: Community Atmosphere Model
(CAM) Performance on SP
Pat Worley, ORNL

T42L26 grid size: 128 (lon) * 64 (lat) * 26 (vertical)

CAM Observation
CAM has two computational phases: dynamics and physics. Dynamics needs much
more interprocessor communication than physics.
The original parallelization with pure MPI is limited to a 1-D domain
decomposition; the maximum number of CPUs is limited to the number of latitude
grid points.

CAM New Concept: Chunks

[Figure: chunks (groups of grid columns) laid out on the latitude-longitude grid]
What Has Been Done to Improve CAM?
The incorporation of chunks (column-based data structures) allows dynamic load
balancing and the use of the hybrid MPI/OpenMP method:
Chunking in physics provides extra granularity. It allows an increase in the
number of processors used.
Multiple chunks are assigned to each MPI process, and OpenMP threads loop over
the local chunks (see the sketch below). Dynamic load balancing is adopted.
The optimal chunk size depends on the machine architecture; 16-32 for the SP.
Overall performance increases from 7 model years per simulation day with pure
MPI to 36 model years per day with hybrid MPI/OpenMP (which allows more CPUs),
load balancing, an updated dynamical core, and the Community Land Model (CLM).
(11 years with pure MPI vs. 14 years with MPI/OpenMP, both with 64 CPUs and
load-balanced.)
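A minimal sketch of this pattern (hypothetical names, not CAM's actual code): OpenMP threads loop over the chunks owned by one MPI task with dynamic scheduling:

! nchunks_local, chunks, and physics_chunk are illustrative names only.
!$OMP PARALLEL DO PRIVATE(c) SHARED(nchunks_local, chunks) SCHEDULE(DYNAMIC)
do c = 1, nchunks_local
   call physics_chunk(chunks(c))   ! physics for all columns in one chunk
enddo
!$OMP END PARALLEL DO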

Story 3: MM5 Regional Weather
Prediction Model
MM5 is approximately 50,000 lines of Fortran 77 with Cray extensions. It runs
in pure shared-memory, pure distributed-memory, and mixed shared/distributed-memory
modes.
The code is parallelized by FLIC, a translator for same-source parallel
implementation of regular grid applications.
The different parallelization methods are implemented easily by including the
appropriate compiler commands and options in the existing configure.user build
mechanism.

MM5 Performance on 332 MHz SMP

Method                             Communication (sec)   Total (sec)
64 MPI tasks                       494                   1755
16 MPI tasks with 4 threads/task   281                   1505

85% of the total reduction is in communication; threading also speeds up
computation.
Data from: http://www.chp.usherb.ca/doc/pdf/sp3/Atelier_IBM_CACPUS_oct2000/hybrid_programming_MPIOpenMP.PDF

Story 4: Some Benchmark Results
Performance depends on:
Benchmark features
   Communication/computation patterns
   Problem size
Hardware features
   Number of nodes
   Relative performance of the CPU, memory, and communication system
   (latency, bandwidth)
Data from: http://www.eecg.toronto.edu/~de/Pa-06.pdf

Conclusions
Pure OpenMP performing better than pure MPI within a node is a prerequisite
for a hybrid code to beat pure MPI across nodes.
Whether a hybrid code performs better than the MPI code depends on whether the
communication advantage outweighs the thread overhead and other costs.
There are more positive experiences of developing hybrid MPI/OpenMP parallel
paradigms now. Adopting the hybrid paradigm in your own application is
encouraged.

The End
Thank you very much!
