www.bsc.es
Leipzig, June 17th 2013
Jesus Labarta
Jesus.labarta@bsc.es
OmpSs – improving the
scalability of OpenMP
2
The parallel programming revolution
Parallel programming in the past
– Where to place data
– What to run where
– How to communicate
Parallel programming in the future
– What do I need to compute
– What data do I need to use
– Hints (not necessarily very precise) on potential concurrency, locality, …
Schedule @ programmer's mind: static
Schedule @ system: dynamic
Complexity, awareness, variability
3
Key concept
– Sequential, task-based program on a single address/name space + directionality annotations (a minimal sketch is given at the end of this slide)
– It happens to execute in parallel: the runtime automatically computes the dependences between tasks
Differentiation of StarSs
– Dependences: tasks are instantiated even when not yet ready; the order IS defined
• Lookahead
– Avoids stalling the main control flow when a computation depending on previous tasks is reached
– Possibility to “see” into the future, searching for further potential concurrency
– Locality aware
– Homogenizing heterogeneity
The StarSs family of programming models
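To make this concrete, below is a minimal sketch in OmpSs-style C (the kernels produce/reduce, the array v and the sizes are illustrative, not taken from the slides). The program is written and read as sequential code; the directives only declare what each task reads and writes, and the runtime derives the dependence between the two tasks and executes them under the dataflow model.

#include <stdio.h>
#include <stdlib.h>

#define N 1024

void produce(float *v, int n) { for (int i = 0; i < n; i++) v[i] = (float)i; }
float reduce(float *v, int n) { float s = 0.0f; for (int i = 0; i < n; i++) s += v[i]; return s; }

int main(void) {
   float *v = malloc(N * sizeof(float));
   float sum;

   #pragma omp task out([N]v)            // this task writes v
   produce(v, N);

   #pragma omp task in([N]v) out(sum)    // reads v: the runtime adds a dependence on the task above
   sum = reduce(v, N);

   #pragma omp taskwait                  // the main control flow only blocks here
   printf("sum = %f\n", sum);
   free(v);
   return 0;
}

Removing the pragmas leaves a valid sequential C program, which is precisely the point of the model.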
4
The StarSs “Granularities”
StarSs spans two instantiations: OmpSs (@ SMP, @ GPU, @ Cluster) and COMPSs
– Average task granularity: 100 microseconds – 10 milliseconds (OmpSs) vs. 1 second – 1 day (COMPSs)
– Language binding: C, C++, Fortran (OmpSs) vs. Java, Python (COMPSs)
– Address space used to compute dependences: memory (OmpSs) vs. files, objects (SCM) (COMPSs)
– Parallel ensemble (OmpSs) vs. workflow (COMPSs)
5
OpenMP
void Cholesky(int NT, float *A[NT][NT]) {
   #pragma omp parallel
   #pragma omp single
   for (int k = 0; k < NT; k++) {
      #pragma omp task
      spotrf(A[k][k], TS);
      #pragma omp taskwait

      for (int i = k+1; i < NT; i++) {
         #pragma omp task
         strsm(A[k][k], A[k][i], TS);
      }
      #pragma omp taskwait

      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++) {
            #pragma omp task
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         }
         #pragma omp task
         ssyrk(A[k][i], A[i][i], TS);
         #pragma omp taskwait
      }
   }
}
6
Extend OpenMP with a data-flow execution model to exploit unstructured parallelism (beyond fork-join)
– in/out pragmas to express dependences and enable locality exploitation
– Tasks can be inlined or outlined (an outlined sketch follows the code below)
void Cholesky(int NT, float *A[NT][NT]) {
   for (int k = 0; k < NT; k++) {
      #pragma omp task inout([TS][TS]A[k][k])
      spotrf(A[k][k], TS);

      for (int i = k+1; i < NT; i++) {
         #pragma omp task in([TS][TS]A[k][k]) inout([TS][TS]A[k][i])
         strsm(A[k][k], A[k][i], TS);
      }

      for (int i = k+1; i < NT; i++) {
         for (int j = k+1; j < i; j++) {
            #pragma omp task in([TS][TS]A[k][i]) in([TS][TS]A[k][j]) inout([TS][TS]A[j][i])
            sgemm(A[k][i], A[k][j], A[j][i], TS);
         }
         #pragma omp task in([TS][TS]A[k][i]) inout([TS][TS]A[i][i])
         ssyrk(A[k][i], A[i][i], TS);
      }
   }
}
OmpSs
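The Cholesky code above uses the inlined form (the directive annotates a statement at the call site). Below is a rough sketch of the outlined form, in which the directive annotates a function declaration so that every call to it becomes a task; the kernel bodies are elided as in the slides, and the panel() wrapper is purely illustrative.

#pragma omp task inout([TS][TS]A)
void spotrf(float *A, int TS);

#pragma omp task in([TS][TS]A) inout([TS][TS]B)
void strsm(float *A, float *B, int TS);

// Call sites need no pragma: each call becomes a task and the runtime
// computes the dependences from the declarations above.
void panel(int NT, float *A[NT][NT], int k, int TS) {
   spotrf(A[k][k], TS);
   for (int i = k + 1; i < NT; i++)
      strsm(A[k][k], A[k][i], TS);
}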
7
OmpSs
Other features and characteristics
– Contiguous / strided dependence detection and data management
– Nesting/recursion (a sketch is given at the end of this slide)
– Multiple implementations for a given task
– Integrated with powerful performance tools (Extrae+Paraver)
Continuous development and use
– Since 2004
– On large applications and systems
Pushing ideas into the OpenMP standard
– Developer positioning: efforts invested in OmpSs will not be lost
– Dependences → OpenMP 4.0
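As an illustration of the nesting/recursion bullet, here is a small hedged sketch (the block counts, sizes and the scale_block kernel are illustrative): the outer level creates one coarse-grain task per block, and the body of each coarse task creates its own finer tasks, whose dependences the runtime tracks as well.

#include <stdlib.h>

#define NB  16     // number of coarse blocks
#define BS  4096   // elements per coarse block
#define SBS 256    // elements per fine (inner) block

// Inner level: the body of a coarse task creates its own finer tasks.
void scale_block(float *b, float a) {
   for (int j = 0; j < BS; j += SBS) {
      float *sb = b + j;
      #pragma omp task inout([SBS]sb)
      for (int k = 0; k < SBS; k++) sb[k] *= a;
   }
   #pragma omp taskwait   // the coarse task finishes only after its children
}

int main(void) {
   float *A[NB];
   for (int i = 0; i < NB; i++) {
      A[i] = malloc(BS * sizeof(float));
      for (int j = 0; j < BS; j++) A[i][j] = 1.0f;
   }

   // Outer level: one coarse task per block.
   for (int i = 0; i < NB; i++) {
      #pragma omp task inout([BS]A[i])
      scale_block(A[i], 2.0f);
   }
   #pragma omp taskwait

   for (int i = 0; i < NB; i++) free(A[i]);
   return 0;
}

The inner taskwait only waits for the children of the enclosing coarse task, so different coarse tasks still run concurrently.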
18
Used in projects and applications …
Significant efforts have been undertaken to port real, large-scale applications:
– Scalapack, PLASMA, SPECFEM3D, LBC, CPMD PSC, PEPC, LS1 Mardyn, asynchronous algorithms, microbenchmarks
– YALES2, EUTERPE, SPECFEM3D, MP2C, BigDFT, QuantumESPRESSO, PEPC, SMMP, ProFASI, COSMO, BQCD
– DEEP: NEURON, iPIC3D, ECHAM/MESSy, AVBP, TurboRVB, Seismic
– G8_ECS: CGPOP, NICAM (planned), …
– Consolider project (Spanish ministry): MRGENESIS
– BSC initiatives and collaborations: GROMACS, GADGET, WRF, …
19
… but NOT only for «scientific computing» …
Plagiarism detection
– Histograms, sorting, … (FhI FIRST)
Trace browsing
– Paraver (BSC)
Clustering algorithms
– G-means (BSC)
Image processing
– Tracking (USAF)
Embedded and consumer
– H.264 (TUBerlin), …
8
The potential of asynchrony
Performance …
– Cholesky 8K x 8K @ SandyBridge (E5-2670, 2.6 GHz)
– Intel Math Kernel Library (MKL) parallel vs. OmpSs + MKL sequential
• Nested OmpSs (large block size, small block size)
– Reproducibility / less sensitivity to environment options
(Plots: strong scaling; impact of problem size for a fixed core count)
9
• Detected issues (ongoing work):
– scheduling policies (generic to nested OmpSs)
– contention on task queues (specific to high core counts), …
• Still fairly good performance!
Xeon Phi (I)
10
Hybrid MPI/SMPSs
Linpack example
Overlap communication/computation
Extend asynchronous data-flow execution to the outer level
Automatic lookahead
…
for (k = 0; k < N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send(A[k]);
   } else {
      receive(A[k]);
      if (necessary) resend(A[k]);
   }
   for (j = k+1; j < N; j++)
      update(A[k], A[j]);
…

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);

#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE])
void send(float *A);

#pragma css task output(A[SIZE])
void receive(float *A);

#pragma css task input(A[SIZE])
void resend(float *A);
V. Marjanovic et al., “Overlapping Communication and Computation by using a Hybrid MPI/SMPSs Approach”, ICS 2010
11
Load balancing achieved automatically by the runtime
– Load balance within the node
– Fine grain
– Complementary to application-level load balance
– Leverages OmpSs malleability
LeWI: Lend core When Idle
– User-level runtime library (DLB) coordinating the processes within a node
– Fighting the Linux kernel: explicit pinning of threads and handoff scheduling
Overcome Amdahl’s law in hybrid
programming
– Enables partial node-level parallelization
– Hope for lazy programmers … and for all of us
Dynamic Load Balancing: LeWI
(Trace legend: communication, running task, idle; 4 MPI processes on a 4-core node)
M. Garcia et al., “LeWI: A Runtime Balancing Algorithm for Nested Parallelism”, ICPP 2009
12
Offloading to Intel® Xeon Phi™ (SRMPI app @ DEEP)
(Diagram: two nodes, each with a host and four KNC coprocessors; MPI ranks run on both the hosts and the KNCs)
int main(int argc, char *argv[]) {
   // MPI stuff ….
   for (int i = 0; i < my_jobs; i++) {
      int idx = i % workers;
      #pragma omp task out(worker_image[idx])
      worker(worker_image[idx]);
      #pragma omp task inout(master_image) in(worker_image[idx])
      accum(master_image, worker_image[idx]);
   }
   #pragma omp taskwait
   global_accum(global_pool, master_image); // MPI
   return 0;
}
(Diagram: two nodes, each with a host and four KNC coprocessors; MPI between the hosts, OmpSs offloading to the KNCs)
int main(int argc, char *argv[]) {
   // MPI stuff ….
   for (int i = 0; i < my_jobs; i++) {
      int idx = i % workers;
      // MPI stuff to send the job to a worker
      // …
      // …. manually taking care of load balance
      // …
      // …. and collect results on worker_pool[idx]
      accum(master_image, worker_image[idx]);
   }
   global_accum(global_image, master_image); // MPI
   return 0;
}
int main(int argc, char *argv[]) {
   for (int i = 0; i < my_jobs; i++) {
      int idx = i % workers;
      #pragma omp task out(worker_image[idx])
      worker(worker_image[idx]);
      #pragma omp task inout(master_image) in(worker_image[idx])
      accum(global_image, worker_image[idx]);
   }
   #pragma omp taskwait
   return 0;
}
13
Offloading to Intel® Xeon Phi™ (SRMPI app @ DEEP)
Productivity … and performance … flexibility …
14
MPI + OmpSs @ MIC
– Hydro 8 MPI processes, 2 cores/process, 4 OmpSs threads/process
– Mapped to 64 hardware threads (16 cores)
(Trace labels: 4 OmpSs threads/core; 2 OmpSs threads/core; 3 OmpSs threads @ core 1; 2 OmpSs threads/core; 1 OmpSs thread @ core 2)
15
OmpSs: Enabler for exascale
– Can exploit very unstructured parallelism
• Not just loop/data parallelism
• Easy to change structure
– Supports large amounts of lookahead
• Not stalling for dependence satisfaction
– Allows for locality optimizations to tolerate latency
• Overlap of data transfers, prefetch
• Reuse
– Nicely hybridizes into MPI/StarSs
• Propagates the node-level dataflow characteristics to large scale
• Overlap of communication and computation
• A chance against Amdahl’s law
– Support for heterogeneity
• Any number and combination of CPUs and GPUs
• Including autotuning
– Malleability: decouples the program from the resources
• Allowing dynamic resource allocation and load balance
• Tolerating noise
Asynchrony: data-flow
Locality
Simple / incremental interface to programmers
Compatible with proprietary lower-level technologies
THANKS