www.bsc.es
Leipzig, June 17th 2013
Jesus Labarta
Jesus.labarta@bsc.es
OmpSs – improving the
scalability of OpenMP
2
The parallel programming revolution
Parallel programming in the past
– Where to place data
– What to run where
– How to communicate
Parallel programming in the future
– What do I need to compute
– What data do I need to use
– Hints (not necessarily very precise) on potential concurrency, locality, …
Schedule @ programmer's mind: static
Schedule @ system: dynamic
Complexity, awareness, variability
3
Key concept
– Sequential, task-based program on a single address/name space + directionality annotations (a minimal sketch is given at the end of this slide)
– It happens to execute in parallel: the runtime automatically computes the dependences between tasks
Differentiation of StarSs
– Dependences: tasks are instantiated even when not yet ready; the order IS defined
• Lookahead
– Avoids stalling the main control flow when a computation depending on previous tasks is reached
– Possibility to “see” into the future, searching for further potential concurrency
– Locality aware
– Homogenizing heterogeneity
The StarSs family of programming models
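To make this concrete, below is a minimal sketch in OmpSs-style C (the kernels produce/reduce, the array v and the sizes are illustrative, not taken from the slides). The program is written and read as sequential code; the directives only declare what each task reads and writes, and the runtime derives the dependence between the two tasks and executes them under the dataflow model.

#include <stdio.h>
#include <stdlib.h>

#define N 1024

void produce(float *v, int n) { for (int i = 0; i < n; i++) v[i] = (float)i; }
float reduce(float *v, int n) { float s = 0.0f; for (int i = 0; i < n; i++) s += v[i]; return s; }

int main(void) {
   float *v = malloc(N * sizeof(float));
   float sum;

   #pragma omp task out([N]v)            // this task writes v
   produce(v, N);

   #pragma omp task in([N]v) out(sum)    // reads v: the runtime adds a dependence on the task above
   sum = reduce(v, N);

   #pragma omp taskwait                  // the main control flow only blocks here
   printf("sum = %f\n", sum);
   free(v);
   return 0;
}

Removing the pragmas leaves a valid sequential C program, which is precisely the point of the model.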
4
The StarSs “Granularities”
StarSs spans two instantiations: OmpSs (@ SMP, @ GPU, @ Cluster) and COMPSs
– Average task granularity: 100 microseconds – 10 milliseconds (OmpSs) vs. 1 second – 1 day (COMPSs)
– Language binding: C, C++, Fortran (OmpSs) vs. Java, Python (COMPSs)
– Address space used to compute dependences: memory (OmpSs) vs. files, objects (SCM) (COMPSs)
– Parallel ensemble (OmpSs) vs. workflow (COMPSs)
5
OpenMP
void Cholesky(int NT, float *A[NT][NT]) {
   #pragma omp parallel
   #pragma omp single
   for (int k = 0; k < NT; k++) {
      #pragma omp task
      spotrf(A[k][k], TS);
      #pragma omp taskwait

      for (int i = k+1; i < NT; i++) {
         #pragma omp task
         strsm(A[k][k], A[k][i], TS);
      }
      #pragma omp taskwait

      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++) {
            #pragma omp task
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         }
         #pragma omp task
         ssyrk(A[k][i], A[i][i], TS);
         #pragma omp taskwait
      }
   }
}
6
Extend OpenMP with a data-flow execution model to exploit unstructured parallelism (beyond fork-join)
– in/out pragmas to express dependences and enable locality exploitation
– Tasks can be inlined or outlined (an outlined sketch follows the code below)
void Cholesky(int NT, float *A[NT][NT]) {
   for (int k = 0; k < NT; k++) {
      #pragma omp task inout([TS][TS]A[k][k])
      spotrf(A[k][k], TS);

      for (int i = k+1; i < NT; i++) {
         #pragma omp task in([TS][TS]A[k][k]) inout([TS][TS]A[k][i])
         strsm(A[k][k], A[k][i], TS);
      }

      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++) {
            #pragma omp task in([TS][TS]A[k][i]) in([TS][TS]A[k][j]) inout([TS][TS]A[j][i])
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         }
         #pragma omp task in([TS][TS]A[k][i]) inout([TS][TS]A[i][i])
         ssyrk(A[k][i], A[i][i], TS);
      }
   }
}
OmpSs
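The Cholesky code above uses the inlined form (the directive annotates a statement at the call site). Below is a rough sketch of the outlined form, in which the directive annotates a function declaration so that every call to it becomes a task; the kernel bodies are elided as in the slides, and the panel() wrapper is purely illustrative.

#pragma omp task inout([TS][TS]A)
void spotrf(float *A, int TS);

#pragma omp task in([TS][TS]A) inout([TS][TS]B)
void strsm(float *A, float *B, int TS);

// Call sites need no pragma: each call becomes a task and the runtime
// computes the dependences from the declarations above.
void panel(int NT, float *A[NT][NT], int k, int TS) {
   spotrf(A[k][k], TS);
   for (int i = k + 1; i < NT; i++)
      strsm(A[k][k], A[k][i], TS);
}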
7
OmpSs
Other features and characteristics
– Contiguous / strided dependence detection and data management
– Nesting/recursion (a sketch is given at the end of this slide)
– Multiple implementations for a given task
– Integrated with powerful performance tools (Extrae+Paraver)
Continuous development and use
– Since 2004
– On large applications and systems
Pushing ideas into the OpenMP standard
– Developer positioning: efforts invested in OmpSs will not be lost
– Dependences → OpenMP 4.0
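As an illustration of the nesting/recursion bullet, here is a small hedged sketch (the block counts, sizes and the scale_block kernel are illustrative): the outer level creates one coarse-grain task per block, and the body of each coarse task creates its own finer tasks, whose dependences the runtime tracks as well.

#include <stdlib.h>

#define NB  16     // number of coarse blocks
#define BS  4096   // elements per coarse block
#define SBS 256    // elements per fine (inner) block

// Inner level: the body of a coarse task creates its own finer tasks.
void scale_block(float *b, float a) {
   for (int j = 0; j < BS; j += SBS) {
      float *sb = b + j;
      #pragma omp task inout([SBS]sb)
      for (int k = 0; k < SBS; k++) sb[k] *= a;
   }
   #pragma omp taskwait   // the coarse task finishes only after its children
}

int main(void) {
   float *A[NB];
   for (int i = 0; i < NB; i++) {
      A[i] = malloc(BS * sizeof(float));
      for (int j = 0; j < BS; j++) A[i][j] = 1.0f;
   }

   // Outer level: one coarse task per block.
   for (int i = 0; i < NB; i++) {
      #pragma omp task inout([BS]A[i])
      scale_block(A[i], 2.0f);
   }
   #pragma omp taskwait

   for (int i = 0; i < NB; i++) free(A[i]);
   return 0;
}

The inner taskwait only waits for the children of the enclosing coarse task, so different coarse tasks still run concurrently.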
18
Used in projects and applications …
Significant efforts have been undertaken to port real, large-scale applications:
– Scalapack, PLASMA, SPECFEM3D, LBC, CPMD PSC, PEPC, LS1 Mardyn, asynchronous algorithms, microbenchmarks
– YALES2, EUTERPE, SPECFEM3D, MP2C, BigDFT, QuantumESPRESSO, PEPC, SMMP, ProFASI, COSMO, BQCD
– DEEP: NEURON, iPIC3D, ECHAM/MESSy, AVBP, TurboRVB, Seismic
– G8_ECS: CGPOP, NICAM (planned), …
– Consolider project (Spanish ministry): MRGENESIS
– BSC initiatives and collaborations: GROMACS, GADGET, WRF, …
19
… but NOT only for «scientific computing» …
Plagiarism detection
– Histograms, sorting, … (FhI FIRST)
Trace browsing
– Paraver (BSC)
Clustering algorithms
– G-means (BSC)
Image processing
– Tracking (USAF)
Embedded and consumer
– H.264 (TUBerlin), …
8
The potential of asynchrony
Performance …
– Cholesky 8K x 8K @ SandyBridge (E5-2670, 2.6 GHz)
– Intel Math Kernel Library (MKL) parallel vs. OmpSs + MKL sequential
• Nested OmpSs (large block size, small block size)
– Reproducibility / less sensitivity to environment options
(Plots: strong scaling; impact of problem size for a fixed core count)
9
• Detected issues (ongoing work):
– scheduling policies (generic to nested OmpSs)
– contention on task queues (specific to high core counts), …
• Still fairly good performance!
Xeon Phi (I)
10
Hybrid MPI/SMPSs
Linpack example
Overlap communication/computation
Extend asynchronous data-flow execution to the outer level
Automatic lookahead
…
for (k = 0; k < N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send(A[k]);
   } else {
      receive(A[k]);
      if (necessary) resend(A[k]);
   }
   for (j = k+1; j < N; j++)
      update(A[k], A[j]);
…

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);

#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE])
void send(float *A);

#pragma css task output(A[SIZE])
void receive(float *A);

#pragma css task input(A[SIZE])
void resend(float *A);
V. Marjanovic et al., “Overlapping Communication and Computation by using a Hybrid MPI/SMPSs Approach”, ICS 2010
11
Load balancing achieved automatically by the runtime
– Load balance within the node
– Fine grain
– Complementary to application-level load balance
– Leverages OmpSs malleability
LeWI: Lend core When Idle
– User-level runtime library (DLB) coordinating the processes within a node
– Fighting the Linux kernel: explicit pinning of threads and handoff scheduling
Overcome Amdahl’s law in hybrid
programming
– Enables partial node-level parallelization
– Hope for lazy programmers … and for all of us
Dynamic Load Balancing: LeWI
(Trace legend: communication, running task, idle; 4 MPI processes on a 4-core node)
M. Garcia et al., “LeWI: A Runtime Balancing Algorithm for Nested Parallelism”, ICPP 2009
12
Offloading to Intel® Xeon Phi™ (SRMPI app @ DEEP)
(Diagram: two nodes, each with a host and four KNC coprocessors; MPI ranks run on both the hosts and the KNCs)
int main(int argc, char *argv[]) {
   // MPI stuff ….
   for (int i = 0; i < my_jobs; i++) {
      int idx = i % workers;
      #pragma omp task out(worker_image[idx])
      worker(worker_image[idx]);
      #pragma omp task inout(master_image) in(worker_image[idx])
      accum(master_image, worker_image[idx]);
   }
   #pragma omp taskwait
   global_accum(global_pool, master_image); // MPI
   return 0;
}
(Diagram: two nodes, each with a host and four KNC coprocessors; MPI between the hosts, OmpSs offloading to the KNCs)
int main(int argc, char *argv[]) {
   // MPI stuff ….
   for (int i = 0; i < my_jobs; i++) {
      int idx = i % workers;
      // MPI stuff to send the job to a worker
      // …
      // …. manually taking care of load balance
      // …
      // …. and collect results on worker_pool[idx]
      accum(master_image, worker_image[idx]);
   }
   global_accum(global_image, master_image); // MPI
   return 0;
}
int main(int argc, char *argv[]) {
   for (int i = 0; i < my_jobs; i++) {
      int idx = i % workers;
      #pragma omp task out(worker_image[idx])
      worker(worker_image[idx]);
      #pragma omp task inout(master_image) in(worker_image[idx])
      accum(global_image, worker_image[idx]);
   }
   #pragma omp taskwait
   return 0;
}
13
Offloading to Intel® Xeon Phi™ (SRMPI app @ DEEP)
Productivity … and performance … flexibility …
14
MPI + OmpSs @ MIC
– Hydro 8 MPI processes, 2 cores/process, 4 OmpSs threads/process
– Mapped to 64 hardware threads (16 cores)
(Trace labels: 4 OmpSs threads/core; 2 OmpSs threads/core; 3 OmpSs threads @ core 1; 2 OmpSs threads/core; 1 OmpSs thread @ core 2)
15
OmpSs: Enabler for exascale
– Can exploit very unstructured parallelism
• Not just loop/data parallelism
• Easy to change structure
– Supports large amounts of lookahead
• Not stalling for dependence satisfaction
– Allows for locality optimizations to tolerate latency
• Overlap of data transfers, prefetch
• Reuse
– Nicely hybridizes into MPI/StarSs
• Propagates the node-level dataflow characteristics to large scale
• Overlap of communication and computation
• A chance against Amdahl’s law
– Support for heterogeneity
• Any number and combination of CPUs and GPUs
• Including autotuning
– Malleability: decouples the program from the resources
• Allowing dynamic resource allocation and load balance
• Tolerating noise
Asynchrony: data-flow
Locality
Simple / incremental interface to programmers
Compatible with proprietary lower-level technologies
THANKS