Costless Software Abstractions For Parallel Hardware System


Performing large, intensive or non-trivial computations on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is borne out by the large number of tools, languages and libraries devoted to such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to template meta-programming based libraries like Blitz++ or Eigen2. While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different kinds of users. Moreover, as the complexity of parallel systems grows, maintaining all those components quickly becomes unwieldy. This talk explores various software design techniques - like Generative Programming, Meta-Programming and Generic Programming - and their application to the implementation of various parallel computing libraries in such a way that:
- abstraction and expressiveness are maximized
- the efficiency cost of those abstractions is minimized
As a conclusion, we'll skim over various applications and see how they can benefit from such tools.

  1. Costless Software Abstractions For Parallel Hardware System
     Joel Falcou, LRI - INRIA - MetaScale SAS
     Maison de la Simulation, 04/03/2014
  2. Context
     In Scientific Computing there is Science:
     - Scientific Applications are domain driven
     - Users = Developers
     - Users are reluctant to change
     ...and there is Computing:
     - Computing requires performance...
     - ...which implies architecture-specific tuning...
     - ...which requires expertise...
     - ...which may or may not be available
     The Problem: People using computers to do science want to do science first.
  3. The Problem – and how we want to solve it
     The Facts
     - The "Library to bind them all" doesn't exist (or we would have it already)
     - All those users want to take advantage of new architectures
     - Few of them want to actually handle all the dirty work
     The Ends
     - Provide a "familiar" interface that lets users benefit from parallelism
     - Help compilers generate better parallel code
     - Increase sustainability by decreasing the amount of code to write
     The Means
     - Parallel Abstractions: Skeletons
     - Efficient Implementation: DSELs
     - The Holy Glue: Generative Programming
  4. Efficient or Expressive – Choose one
     [Chart: languages on an expressiveness vs. efficiency plane - Matlab and SciLab high on expressiveness, C and FORTRAN high on efficiency, C++ and JAVA in between.]
  5-6. Talk Layout: Introduction, Abstractions, Efficiency, Tools, Conclusion
  7-8. Spotting abstraction when you see one
     [Diagram: seven processes implementing F1 → Distribute → three parallel copies of F2 → Collect → F3. The F1-to-F3 chain is a Pipeline; the replicated F2 stage is a Farm.]
  9-10. Parallel Skeletons in a nutshell
     Basic Principles [COLE 89]
     - There are patterns in parallel applications
     - Those patterns can be generalized as Skeletons
     - Applications are assembled as combinations of such patterns
     Functional point of view
     - Skeletons are Higher-Order Functions
     - Skeletons support a compositional semantics
     - Applications become compositions of state-less functions
  11-12. Classic Parallel Skeletons
     Data Parallel Skeletons
     - map: apply an n-ary function in SIMD mode over subsets of data
     - fold: perform an n-ary reduction over subsets of data
     - scan: perform an n-ary prefix reduction over subsets of data
     Task Parallel Skeletons
     - par: independent task execution
     - pipe: task dependency over time
     - farm: load balancing
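To make the data-parallel skeletons concrete, here is a minimal sequential sketch in C++ (illustrative only - the names and signatures are ours, not from any of the libraries discussed): each skeleton is an ordinary higher-order function, which is what makes them composable; a parallel implementation would keep the same interface and swap out the loop bodies.

     #include <algorithm>
     #include <functional>
     #include <numeric>
     #include <vector>

     // map: apply a function element-wise (unary version shown)
     template <typename T, typename F>
     std::vector<T> map(std::vector<T> const& in, F f)
     {
       std::vector<T> out(in.size());
       std::transform(in.begin(), in.end(), out.begin(), f);
       return out;
     }

     // fold: reduction over the data
     template <typename T, typename F>
     T fold(std::vector<T> const& in, T init, F f)
     {
       return std::accumulate(in.begin(), in.end(), init, f);
     }

     // scan: prefix reduction, keeping every intermediate result
     template <typename T, typename F>
     std::vector<T> scan(std::vector<T> const& in, F f)
     {
       std::vector<T> out(in.size());
       std::partial_sum(in.begin(), in.end(), out.begin(), f);
       return out;
     }

     int main()
     {
       std::vector<double> v{1, 2, 3, 4};
       auto squares = map(v, [](double x) { return x * x; });  // {1, 4, 9, 16}
       auto total   = fold(squares, 0.0, std::plus<>{});       // 30
       auto prefix  = scan(v, std::plus<>{});                  // {1, 3, 6, 10}
       (void)total; (void)prefix;
       return 0;
     }

Because the semantics are fixed but the implementation is free, the same map call could dispatch to an OpenMP loop, a SIMD kernel, or a distributed farm without touching user code.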
  13-14. Why use Parallel Skeletons
     Software Abstraction
     - Write without bothering with parallel details
     - Code is scalable and easy to maintain
     - Debuggable, provable, certifiable
     Hardware Abstraction
     - The semantics are set, the implementation is free
     - Composability ⇒ hierarchical architectures
  15. Talk Layout: Introduction, Abstractions, Efficiency, Tools, Conclusion
  16. Generative Programming
     [Diagram: a Domain Specific Application Description enters a Generative Component, whose Translator assembles Parametric Sub-components into the Concrete Application.]
  17-19. Generative Programming as a Tool
     Available techniques
     - Dedicated compilers
     - External pre-processing tools
     - Languages supporting meta-programming
     Definition of Meta-programming
     Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.
  20-22. From Generative to Meta-programming
     Meta-programmable languages: Template Haskell, MetaOCaml, C++
     C++ meta-programming
     - Relies on the C++ template sub-language
     - Handles types and integral constants at compile time
     - Proven to be Turing-complete
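As a minimal illustration of that compile-time sub-language (the classic textbook example, not code from the talk), a recursive computation can be evaluated entirely during template instantiation:

     // Compile-time factorial: the recursion unfolds during template
     // instantiation, so factorial<5>::value is a constant in the binary.
     template <unsigned N>
     struct factorial
     {
       static const unsigned value = N * factorial<N - 1>::value;
     };

     template <>            // explicit specialization stops the recursion
     struct factorial<0>
     {
       static const unsigned value = 1;
     };

     static_assert(factorial<5>::value == 120, "evaluated by the compiler");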
  23. Domain Specific Embedded Languages
     What's a DSEL?
     - DSL = Domain Specific Language: a declarative, easy-to-use language fitting its domain
     - DSEL = a DSL embedded within a general-purpose language
     EDSLs in C++
     - Rely on operator overload abuse (Expression Templates)
     - Carry semantic information around code fragments
     - Generic implementations become self-aware of optimizations
     Exploiting the static AST
     - At the expression level: code generation
     - At the function level: inter-procedural optimization
  24. Expression Templates
     matrix x(h,w), a(h,w), b(h,w);
     x = cos(a) + (b*a);
     [Diagram: the statement captured as an AST - assign(x, plus(cos(a), multiplies(b, a)))]
     The right-hand side is encoded in a type:
     expr<assign
         ,expr<matrix&>
         ,expr<plus
              ,expr<cos,expr<matrix&>>
              ,expr<multiplies,expr<matrix&>,expr<matrix&>>
              >
         >(x,a,b);
     Arbitrary transforms are applied on this meta-AST, e.g. generating:
     #pragma omp parallel for
     for(int j=0;j<h;++j)
     {
       for(int i=0;i<w;++i)
       {
         x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
       }
     }
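To show the mechanism end to end, here is a self-contained expression-template sketch (our own minimal illustration - NT2's real machinery is far more general): operators build a lightweight AST as a type, and assignment evaluates the whole tree in one fused loop, with no temporaries.

     #include <cmath>
     #include <cstddef>
     #include <vector>

     // AST node for element-wise addition: stores references, computes lazily.
     template <typename L, typename R>
     struct plus_expr
     {
       L const& l; R const& r;
       double operator[](std::size_t i) const { return l[i] + r[i]; }
       std::size_t size() const { return l.size(); }
     };

     // AST node for element-wise cosine.
     template <typename E>
     struct cos_expr
     {
       E const& e;
       double operator[](std::size_t i) const { return std::cos(e[i]); }
       std::size_t size() const { return e.size(); }
     };

     struct vec
     {
       std::vector<double> data;
       explicit vec(std::size_t n) : data(n) {}
       double  operator[](std::size_t i) const { return data[i]; }
       double& operator[](std::size_t i)       { return data[i]; }
       std::size_t size() const { return data.size(); }

       // Assignment walks the AST once: a single loop, no temporaries.
       template <typename Expr>
       vec& operator=(Expr const& expr)
       {
         for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
         return *this;
       }
     };

     template <typename L, typename R>
     plus_expr<L, R> operator+(L const& l, R const& r) { return {l, r}; }

     template <typename E>
     cos_expr<E> cos(E const& e) { return {e}; }

     int main()
     {
       vec x(100), a(100), b(100);
       x = cos(a) + b;  // builds plus_expr<cos_expr<vec>, vec>, then one fused loop
       return 0;
     }

A naive operator+ returning a plain vec would allocate a temporary per operator; the type-level AST is what lets the library see, and transform, the whole statement - including emitting the OpenMP loop shown on the slide.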
  25. Embedded Domain Specific Languages
     EDSLs in C++
     - Rely on operator overload abuse
     - Carry semantic information around code fragments
     - Generic implementations become self-aware of optimizations
     Advantages
     - Allow the introduction of DSLs without disrupting the development chain
     - Semantics defined as type information means compile-time resolution
     - Access to a large selection of runtime bindings
  26. Embedded Domain Specific Languages [figure-only slide]
  27. Talk Layout: Introduction, Abstractions, Efficiency, Tools, Conclusion
  28. Different Strokes
     Objectives
     - Apply DSEL generation techniques to different kinds of hardware
     - Demonstrate the low cost of the abstractions
     - Demonstrate the applicability of skeletons
     Contributions
     - Extending DEMRAL into AA-DEMRAL
     - Boost.SIMD
     - NT2
  29. Architecture Aware Generative Programming [figure-only slide]
  30-31. Boost.SIMD
     Generic Programming for portable SIMDization
     - A C++ library for SIMD computation
     - Built with modern C++ style for modern C++ usage
     - Easy to extend, easy to adapt to new CPUs
     Goals
     - Integrate SIMD computation into standard C++
     - Make writing SIMD code easier
     - Promote Generic Programming
     - Provide performance out of the box
  32-33. Boost.SIMD Principles
     - pack<T,N> encapsulates the best hardware register for N elements of type T
     - pack<T,N> provides classical value semantics
     - Operations on pack<T,N> map to the proper intrinsics
     - Support for SIMD standard algorithms
     How does it work
     - Code is written as for scalar values
     - Code can be applied as a polymorphic functor over data
     - Expression Templates optimize operations
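A minimal usage sketch, assuming the pack<T,N> interface described above (exact header paths and constructor spellings vary across Boost.SIMD versions, so treat this as illustrative):

     #include <boost/simd/sdk/simd/pack.hpp>   // header layout varies by version
     #include <iostream>

     int main()
     {
       using boost::simd::pack;

       // pack<float> picks the widest native register for float on this CPU
       // (4 lanes with SSE, 8 with AVX, ...); it copies and assigns like a scalar.
       pack<float> a(1.5f);   // splat: every lane holds 1.5f
       pack<float> b(2.0f);

       // Each operator maps to the corresponding intrinsic on the whole register,
       // and Expression Templates can fuse the two operations (e.g. into an FMA).
       pack<float> c = a * b + a;

       std::cout << c << "\n";   // prints all lanes
       return 0;
     }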
  34. Boost.SIMD Current Support
     - OS: Linux, Windows, iOS, Android
     - Architectures: x86, ARM, PowerPC
     - Compilers: g++, clang, icc, MSVC
     - Extensions:
       x86: SSE2, SSE3, SSSE3, SSE4.x, AVX, XOP, FMA3/4, AVX2
       PPC: VMX
       ARM: NEON
     - In progress: Xeon Phi MIC, VSX, QPX, NEON2
  35-37. Boost.SIMD - Mandelbrot
     pack<int> julia( pack<float> const& a, pack<float> const& b )
     {
       pack<int>   iter(0);
       pack<float> x(0), y(0);
       std::size_t i = 0;
       auto mask = x*x + y*y < 4;          // all lanes start active
       do
       {
         pack<float> x2 = x * x;
         pack<float> y2 = y * y;
         pack<float> xy = 2 * x * y;
         x = x2 - y2 + a;
         y = xy + b;
         pack<float> m2 = x2 + y2;
         mask = m2 < 4;                    // lanes still below the threshold
         iter = selinc( mask, iter );      // increment only the active lanes
         i++;
       } while( any(mask) && i < 256 );
       return iter;
     }
  38. NT2
     A Scientific Computing Library
     - Provides a simple, Matlab-like interface for users
     - Provides high-performance computing entities and primitives
     - Easily extendable
     Components
     - Uses Boost.SIMD for in-core optimizations
     - Uses recursive parallel skeletons for threading
     - Code is made independent of architecture and runtime
  39. The Numerical Template Toolbox
     [Chart: the expressiveness vs. efficiency plane from slide 4, with NT2 added alongside Matlab, SciLab, C++, JAVA and C/FORTRAN.]
  40-43. The Numerical Template Toolbox
     Principles
     - table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionality
     - 500+ functions usable directly on tables or on any scalar values, as in Matlab
     How does it work
     - Take a .m file and copy it to a .cpp file
     - Add #include <nt2/nt2.hpp> and make cosmetic changes
     - Compile the file and link with libnt2.a
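To illustrate that workflow, here is a sketch of what the resulting .cpp might look like (a hypothetical snippet in the style the slides describe - the table type and function names follow NT2's Matlab-like conventions, but the exact spellings should be checked against the library):

     #include <nt2/nt2.hpp>

     int main()
     {
       using nt2::table;
       using nt2::of_size;

       // table<T> mimics a Matlab array; ones/zeros/cos mirror their Matlab namesakes.
       table<double> B = nt2::ones ( of_size(4, 4) );
       table<double> C = nt2::zeros( of_size(4, 4) );

       // The Matlab line `A = cos(B) + 2*C;` survives almost verbatim:
       table<double> A = nt2::cos(B) + 2.0 * C;

       return 0;
     }

Under the hood the whole right-hand side becomes one expression template, so NT2 is free to lower it to SIMD code via Boost.SIMD and to spread it across cores with its parallel skeletons.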
  44. Performance - Mandelbrot [figure-only slide]
  45. Performance - LU Decomposition
     [Plot: 8000 × 8000 LU decomposition, median GFLOPS (y-axis, up to 150) vs. number of cores (x-axis, up to 50), comparing NT2 and Plasma.]
  46. Performance - linsolve
     nt2::tie(x,r) = linsolve(A,b);
     Scale        | C LAPACK | C MAGMA | NT2 LAPACK | NT2 MAGMA
     1024 × 1024  |     85.2 |    85.2 |       83.1 |      85.1
     2048 × 2048  |    350.7 |   235.8 |      348.2 |     236.0
     12000 × 1200 |    735.7 |  1299.0 |      734.1 |    1300.1
  47. Talk Layout: Introduction, Abstractions, Efficiency, Tools, Conclusion
  48-49. Let's round this up!
     Parallel Computing for Scientists
     - Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while remaining easy to use
     - Like a regular language, an EDSL needs information about the hardware system
     - Integrating hardware descriptions as Generic components increases the tools' portability and re-targetability
     Recent activity
     - Follow us on http://www.github.com/MetaScale/nt2
     - Prototype for single-source GPU support
     - Toward a global generic approach to parallelism
  50. Thanks for your attention
