OmpSs – improving the scalability of OpenMP

"Improving the Scalability of OpenMP" with Barcelona Supercomputing Center at the Intel Theater

  1. OmpSs – improving the scalability of OpenMP
     Jesus Labarta (Jesus.labarta@bsc.es)
     Barcelona Supercomputing Center, www.bsc.es
     Leipzig, June 17th, 2013
  2. The parallel programming revolution
     Parallel programming in the past (schedule @ programmer's mind: static):
     – Where to place data
     – What to run where
     – How to communicate
     Parallel programming in the future (schedule @ system: dynamic):
     – What do I need to compute
     – What data do I need to use
     – Hints (not necessarily very precise) on potential concurrency, locality, …
     Complexity, variability → awareness
  3. The StarSs family of programming models
     Key concept:
     – Sequential, task-based program on a single address/name space + directionality annotations
     – Happens to execute in parallel: automatic run-time computation of dependences between tasks
       (a minimal sketch follows this slide)
     Differentiation of StarSs:
     – Dependences: tasks instantiated but not ready; order IS defined
       • Lookahead:
         – Avoid stalling the main control flow when a computation depending on previous tasks is reached
         – Possibility to "see" the future, searching for further potential concurrency
     – Locality aware
     – Homogenizing heterogeneity
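     A minimal sketch of this key concept, written against the OmpSs in/out task
     syntax used on the later slides (plain OpenMP would spell the clauses as
     depend(in:)/depend(out:)). The kernels produce() and consume() are
     hypothetical, for illustration only; the program text stays sequential and
     the runtime derives the task graph from the annotations:

       #include <stdio.h>

       /* hypothetical kernels */
       void produce(float *v, int n) { for (int i = 0; i < n; i++) v[i] = (float)i; }
       void consume(const char *tag, const float *v, int n) {
         float s = 0.0f;
         for (int i = 0; i < n; i++) s += v[i];
         printf("%s: %f\n", tag, s);
       }

       int main(void) {
         float a[128], b[128];

         #pragma omp task out(a)   /* producer of a                                   */
         produce(a, 128);
         #pragma omp task out(b)   /* producer of b: independent, can run alongside a */
         produce(b, 128);
         #pragma omp task in(a)    /* runtime adds the edge produce(a) -> this task   */
         consume("a", a, 128);
         #pragma omp task in(b)
         consume("b", b, 128);

         #pragma omp taskwait      /* wait for all tasks before returning             */
         return 0;
       }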
  4. The StarSs "granularities"
     StarSs covers OmpSs (@ SMP, @ GPU, @ Cluster) and COMPSs:
     – Average task granularity: 100 microseconds – 10 milliseconds (OmpSs) vs. 1 second – 1 day (COMPSs)
     – Language binding: C, C++, FORTRAN (OmpSs) vs. Java, Python (COMPSs)
     – Address space to compute dependences: memory (OmpSs) vs. files, objects (SCM) (COMPSs)
     – Model: parallel ensemble (OmpSs) vs. workflow (COMPSs)
  5. OpenMP
     void Cholesky(int NT, float *A[NT][NT]) {
       #pragma omp parallel
       #pragma omp single
       for (int k = 0; k < NT; k++) {
         #pragma omp task
         spotrf(A[k][k], TS);
         #pragma omp taskwait
         for (int i = k+1; i < NT; i++) {
           #pragma omp task
           strsm(A[k][k], A[k][i], TS);
         }
         #pragma omp taskwait
         for (int i = k+1; i < NT; i++) {
           for (j = k+1; j < i; j++) {
             #pragma omp task
             sgemm(A[k][i], A[k][j], A[j][i], TS);
           }
           #pragma omp task
           ssyrk(A[k][i], A[i][i], TS);
           #pragma omp taskwait
         }
       }
     }
  6. OmpSs
     Extend OpenMP with a data-flow execution model to exploit unstructured parallelism (compared to fork-join):
     – in/out pragmas to enable dependences and locality
     – Inlined/outlined task annotations (an outlined sketch follows this slide)
     void Cholesky(int NT, float *A[NT][NT]) {
       for (int k = 0; k < NT; k++) {
         #pragma omp task inout([TS][TS]A[k][k])
         spotrf(A[k][k], TS);
         for (int i = k+1; i < NT; i++) {
           #pragma omp task in([TS][TS]A[k][k]) inout([TS][TS]A[k][i])
           strsm(A[k][k], A[k][i], TS);
         }
         for (int i = k+1; i < NT; i++) {
           for (j = k+1; j < i; j++) {
             #pragma omp task in([TS][TS]A[k][i]) in([TS][TS]A[k][j]) inout([TS][TS]A[j][i])
             sgemm(A[k][i], A[k][j], A[j][i], TS);
           }
           #pragma omp task in([TS][TS]A[k][i]) inout([TS][TS]A[i][i])
           ssyrk(A[k][i], A[i][i], TS);
         }
       }
     }
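     A minimal sketch of the outlined form mentioned above, assuming the OmpSs
     convention of annotating the function declaration so that every call
     becomes a task (the same style as the #pragma css declarations on the
     Linpack slide); dependence expressions refer to the parameters:

       /* outlined: the pragma sits on the declaration of the task function */
       #pragma omp task inout([TS][TS]A)
       void spotrf(float *A, int TS);

       /* call sites then carry no pragma:
        *   spotrf(A[k][k], TS);   // this call is a task with inout on the TSxTS block
        */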
  7. OmpSs
     Other features and characteristics:
     – Contiguous / strided dependence detection and data management
     – Nesting / recursion
     – Multiple implementations for a given task (a sketch follows this slide)
     – Integrated with powerful performance tools (Extrae + Paraver)
     Continuous development and use:
     – Since 2004
     – On large applications and systems
     Pushing ideas into the OpenMP standard:
     – Developer positioning: efforts in OmpSs will not be lost
     – Dependences → v4.0
     Used in projects and applications; significant efforts undertaken to port real large-scale applications:
     – Scalapack, PLASMA, SPECFEM3D, LBC, CPMD, PSC, PEPC, LS1 Mardyn, asynchronous algorithms, microbenchmarks
     – YALES2, EUTERPE, SPECFEM3D, MP2C, BigDFT, QuantumESPRESSO, PEPC, SMMP, ProFASI, COSMO, BQCD
     – DEEP: NEURON, iPIC3D, ECHAM/MESSy, AVBP, TurboRVB, Seismic
     – G8_ECS: CGPOP, NICAM (planned), …
     – Consolider project (Spanish ministry): MRGENESIS
     – BSC initiatives and collaborations: GROMACS, GADGET, WRF, …
     … but NOT only for «scientific computing» …
     – Plagiarism detection: histograms, sorting, … (FhI FIRST)
     – Trace browsing: Paraver (BSC)
     – Clustering algorithms: G-means (BSC)
     – Image processing: tracking (USAF)
     – Embedded and consumer: H.264 (TU Berlin), …
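     A sketch of how "multiple implementations for a given task" can be
     expressed, assuming the OmpSs target/implements syntax; the functions
     scale_smp/scale_cuda and their signatures are illustrative, not taken
     from the slides:

       /* reference SMP implementation of the task */
       #pragma omp target device(smp)
       #pragma omp task inout([n]v)
       void scale_smp(float *v, int n, float f);

       /* alternative implementation of the same task for a CUDA device;
        * copy_deps asks the runtime to move the dependence data to the device */
       #pragma omp target device(cuda) copy_deps implements(scale_smp)
       #pragma omp task inout([n]v)
       void scale_cuda(float *v, int n, float f);

       /* callers simply invoke scale_smp(buf, n, 2.0f);
        * the runtime is free to schedule either implementation */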
  8. The potential of asynchrony
     Performance …
     – Cholesky 8K x 8K @ SandyBridge (E5-2670, 2.6 GHz)
     – Intel Math Kernel Library (MKL) parallel vs. OmpSs + sequential MKL
       • Nested OmpSs (large block size, small block size; a nesting sketch follows this slide)
     – Reproducibility / less sensitivity to environment options
     (Plots on the slide: strong scaling; impact of problem size for a fixed core count.)
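     A minimal sketch of nested OmpSs tasking with two block sizes, assuming the
     OmpSs array-section syntax v[start;length]; BS_OUT, BS_IN and scale_block()
     are illustrative names, not taken from the slides:

       #define BS_OUT 4096   /* coarse block size (illustrative) */
       #define BS_IN   256   /* fine block size (illustrative)   */

       void scale_block(float *v, int n, float f) {
         for (int i = 0; i < n; i++) v[i] *= f;
       }

       /* n is assumed to be a multiple of BS_OUT, and BS_OUT of BS_IN */
       void scale_nested(float *v, int n, float f) {
         for (int o = 0; o < n; o += BS_OUT) {
           #pragma omp task inout(v[o;BS_OUT])       /* coarse task over a large block */
           {
             for (int i = o; i < o + BS_OUT; i += BS_IN) {
               #pragma omp task inout(v[i;BS_IN])    /* fine task nested inside it */
               scale_block(&v[i], BS_IN, f);
             }
             #pragma omp taskwait                    /* wait for this block's nested tasks */
           }
         }
         #pragma omp taskwait
       }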
  9. Xeon Phi (I)
     • Detected issues (ongoing work):
       – Scheduling policies (generic to nested OmpSs)
       – Contention on task queues (specific to high core counts), …
     • Still fairly good performance!
  10. Hybrid MPI/SMPSs: Linpack example
      – Overlap communication and computation
      – Extend asynchronous data-flow execution to the outer level
      – Automatic lookahead
      #pragma css task inout(A[SIZE])
      void Factor_panel(float *A);
      #pragma css task input(A[SIZE]) inout(B[SIZE])
      void update(float *A, float *B);
      #pragma css task input(A[SIZE])
      void send(float *A);
      #pragma css task output(A[SIZE])
      void receive(float *A);
      #pragma css task input(A[SIZE])
      void resend(float *A);

      ...
      for (k = 0; k < N; k++) {
        if (mine) {
          Factor_panel(A[k]);
          send(A[k]);
        } else {
          receive(A[k]);
          if (necessary) resend(A[k]);
        }
        for (j = k+1; j < N; j++)
          update(A[k], A[j]);
      ...
      (Figure: execution timelines for processes P0, P1, P2.)
      V. Marjanovic et al., "Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach", ICS 2010.
  11. Dynamic load balancing: LeWI
      Automatically achieved by the runtime:
      – Load balance within the node
      – Fine grain
      – Complementary to application-level load balance
      – Leverages OmpSs malleability
      LeWI: Lend core When Idle
      – User-level runtime library (DLB) coordinating processes within the node
      – Fighting the Linux kernel: explicit pinning of threads and hand-off scheduling
      Overcome Amdahl's law in hybrid programming:
      – Enable partial node-level parallelization
      – Hope for lazy programmers … and us all
      (Figure: 4 MPI processes @ 4-core node; legend: communication, running task, idle.)
      "LeWI: A Runtime Balancing Algorithm for Nested Parallelism", M. Garcia et al., ICPP 2009.
  12. Offloading to Intel® Xeon Phi™ (SRMPI app @ DEEP)
      (Diagrams on the slide: hosts + KNC coprocessors running MPI everywhere vs. MPI between hosts with OmpSs offloading to the KNCs.)
      MPI + OmpSs task version:
      void main(int argc, char* argv[]) {
        // MPI stuff ...
        for (int i = 0; i < my_jobs; i++) {
          int idx = i % workers;
          #pragma omp task out(worker_image[idx])
          worker(worker_image[idx]);
          #pragma omp task inout(master_image) in(worker_image[idx])
          accum(master_image, worker_image[idx]);
        }
        #pragma omp taskwait
        global_accum(global_pool, master_image);  // MPI
      }
      Plain MPI version (manually taking care of job distribution and load balance):
      void main(int argc, char* argv[]) {
        // MPI stuff ...
        for (int i = 0; i < my_jobs; i++) {
          int idx = i % workers;
          // MPI stuff to send job to worker
          // ...
          // ... manually taking care of load balance
          // ...
          // ... and collect results on worker_pool[idx]
          accum(master_image, worker_image[idx]);
        }
        global_accum(global_image, master_image);  // MPI
      }
      OmpSs-only task version:
      void main(int argc, char* argv[]) {
        for (int i = 0; i < my_jobs; i++) {
          int idx = i % workers;
          #pragma omp task out(worker_image[idx])
          worker(worker_image[idx]);
          #pragma omp task inout(master_image) in(worker_image[idx])
          accum(global_image, worker_image[idx]);
        }
        #pragma omp taskwait
      }
  13. Offloading to Intel® Xeon Phi™ (SRMPI app @ DEEP)
      Productivity … flexibility … and performance.
      (Results figures on the slide.)
  14. MPI + OmpSs @ MIC
      – Hydro: 8 MPI processes, 2 cores/process, 4 OmpSs threads/process
      – Mapped to 64 hardware threads (16 cores)
      (Trace panels on the slide: 4 OmpSs threads/core; 2 OmpSs threads/core; 3 OmpSs threads @ core 1 with 1 OmpSs thread @ core 2.)
  15. OmpSs: enabler for exascale
      • Can exploit very unstructured parallelism
        – Not just loop/data parallelism
        – Easy to change structure
      • Supports large amounts of lookahead
        – Not stalling for dependence satisfaction
      • Allows locality optimizations to tolerate latency
        – Overlap data transfers, prefetch
        – Reuse
      • Nicely hybridizes into MPI/StarSs
        – Propagates the node-level data-flow characteristics to large scale
        – Overlap of communication and computation
        – A chance against Amdahl's law
      • Support for heterogeneity
        – Any # and combination of CPUs and GPUs
        – Including autotuning
      • Malleability: decouple the program from the resources
        – Allowing dynamic resource allocation and load balance
        – Tolerating noise
      In short: asynchrony (data-flow), locality, a simple / incremental interface to programmers, compatible with proprietary lower-level technologies.
  16. THANKS
