Designing Architecture-aware Library using Boost.Proto

HAMM and Meeting C++ talk on our model of architecture description for C++ template metaprogramming.



  1. Designing Architecture-aware Library using Boost.Proto
     Joel Falcou, Mathias Gaunard, Pierre Esterie, Eric Jourdanneau
     LRI - University Paris Sud XI - CNRS
     13/03/2012
  2-6. Context
     In Scientific Computing there is Scientific:
     - Applications are domain driven
     - Users = Developers
     - Users are reluctant to changes
     There is Computing:
     - Computing requires performance ...
     - ... which implies architecture-specific tuning
     - ... which requires expertise
     - ... which may or may not be available
     The Problem: people using computers to do science want to do science first.
  7. The Problem and how to solve it
     The Facts:
     - The "Library to bind them all" doesn't exist (or we would have it already)
     - New architectures are underused due to their inherent complexity
     - Few people are experts in both parallel programming and SomeOtherField
     The Ends:
     - Find a way to shield users from parallel architecture details
     - Find a way to shield developers from parallel architecture details
     - Isolate software from hardware evolution
     The Means:
     - Generic Programming
     - Template Meta-Programming
     - Embedded Domain Specific Languages
  8. Talk Layout
     - Introduction
     - NT2
     - Expression Templates v2.0
     - Supporting GPUs
     - Conclusion
  9. What's NT2?
     A Scientific Computing Library:
     - Provides a simple, Matlab-like interface for users
     - Provides high-performance computing entities and primitives
     - Easily extendable
     A Research Platform:
     - Simple framework to add new optimization schemes
     - Test bench for EDSL development methodologies
     - Test bench for Generic Programming in real-life projects
  10-13. The NT2 API
     Principles:
     - table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionality
     - 300+ functions usable directly on table or on any scalar values, as in Matlab
     How does it work:
     - Take a .m file, copy it to a .cpp file
     - Add #include <nt2/nt2.hpp> and do cosmetic changes
     - Compile the file and link with libnt2.a
  14. Matlab you said?
     R = I(:,:,1);
     G = I(:,:,2);
     B = I(:,:,3);

     Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
     U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
     V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);
  15. Now with NT2
     table<double> R = I(_,_,1);
     table<double> G = I(_,_,2);
     table<double> B = I(_,_,3);
     table<double> Y, U, V;

     Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
     U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
     V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
  16. Now with NT2
     table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
     table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
     table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
     table<float, of_size_<N,M>> Y, U, V;

     Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
     U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
     V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
  17. Some real-life applications
  18. NT2 before Boost.Proto
     EDSL Core:
     - Code base was 11 KLOC
     - 8 KLOC was dedicated to the various Expression Template glue
     - 3 KLOC of actual smart stuff
     Architectural support:
     - Altivec and SSE2 extension support
     - Some vague pthread support
     - Efforts were stagnant: adding a simple feature required multiple KLOC of code changes
  19. Talk Layout
     - Introduction
     - NT2
     - Expression Templates v2.0
     - Supporting GPUs
     - Conclusion
  20. Embedded Domain Specific Languages
     What's an EDSL?
     - DSL = Domain Specific Language: declarative, easy to use, fitting the domain
     - EDSL = DSL within a general-purpose language
     EDSL in C++:
     - Relies on operator overload abuse (Expression Templates)
     - Carries semantic information around code fragments
     - Generic implementations become self-aware of optimizations
  21. Expression Templates
     matrix x(h,w), a(h,w), b(h,w);
     x = cos(a) + (b*a);

     The expression builds a meta-AST type:
     expr<assign, expr<matrix&>
                , expr<plus, expr<cos, expr<matrix&>>
                           , expr<multiplies, expr<matrix&>, expr<matrix&>>>
         >(x, a, b);

     Arbitrary transforms applied on the meta-AST generate, e.g.:
     #pragma omp parallel for
     for(int j=0; j<h; ++j)
       for(int i=0; i<w; ++i)
         x(j,i) = cos(a(j,i)) + (b(j,i) * a(j,i));
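The mechanism on this slide can be condensed into a self-contained sketch. This is not NT2's or Boost.Proto's actual machinery, just a minimal hand-rolled expression-template scheme in plain C++: operator overloads build a typed AST (`Plus`, `Mul`, `Cos` are illustrative names), and `Vec::operator=` walks it in a single fused loop. `cos_` is used instead of `cos` to avoid colliding with `std::cos`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// AST nodes: each node just knows how to evaluate itself at index i.
template <class L, class R>
struct Plus {
    L const& l; R const& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
struct Mul {
    L const& l; R const& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

template <class E>
struct Cos {
    E const& e;
    double operator[](std::size_t i) const { return std::cos(e[i]); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Evaluation happens here: one pass over the whole AST, no temporaries.
    template <class Expr>
    Vec& operator=(Expr const& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Operator overloads build the AST instead of computing anything.
template <class L, class R> Plus<L, R> operator+(L const& l, R const& r) { return {l, r}; }
template <class L, class R> Mul<L, R>  operator*(L const& l, R const& r) { return {l, r}; }
template <class E>          Cos<E>     cos_(E const& e) { return {e}; }
```

Here `x = cos_(a) + b * a;` has the static type `Plus<Cos<Vec>, Mul<Vec, Vec>>`, which is exactly the kind of type a Proto transform would pattern-match on.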
  22. Boost.Proto
     What's Boost.Proto?
     - An EDSL for defining EDSLs in C++
     - Generalizes Expression Templates
     - Easy way to define and test EDSLs: EDSL = some Grammar Rules + some Semantics
     Boost.Proto benefits:
     - Fast development process
     - Compiler people are at ease: Grammar + Semantics + Code Generation process
     - Easily extendable through Transforms
     - Easy to handle: MPL and Fusion compatible
  23. Why Boost.Proto?
     Benefits:
     - Scalability: keeps EDSL glue code size low
     - Extensibility: simplifies the design of new optimizations
     - Debugging: allows for meaningful error reporting
     Remaining challenges:
     - Provide a generic "compilation" process
     - Select the best function definition with respect to hardware
     - Allow for non-intrusive new pass definition
  24. State of the art EDSL
     Usual approaches and results:
     - Eigen: supports SIMD arithmetic, thread-only algorithms
     - uBLAS: parallelism comes from a potential back-end
     What's the problem?
     - Hardware considerations come after the ET code
     - Retro-fitting parallelism is hard
     - Need for a hardware-aware EDSL
     - Back to Generative Programming
  25. Principles of Generative Programming
     Diagram: a Domain Specific Application Description goes through a Translator (the Generative Component), which assembles Parametric Sub-components into a Concrete Application.
  26. Generative Programming v2 (diagram)
  27. Our Approach
     Principles:
     - Segment EDSL evaluation in phases
     - Each phase uses Proto transforms to advance code generation
     - Hardware specifications are active elements in function overload
     - Use Generalized Tag Dispatching (GTD) to select the proper function implementation
     Advantages:
     - Hardware can influence part or all of the evaluation process
     - Proto transforms syntax is easy to use
     - GTD increases opportunities for optimizations
  28-31. Generalized Tag Dispatching
     Principles:
     - Tag dispatching uses type tags to overload functions; classical example: standard Iterators
     - We augment the system with:
       - a tag computed from the function identifier
       - a tag computed from the hardware configuration

     namespace tag { struct plus_ : elementwise_<plus_> {}; }

     functor<tag::plus_> callee;
     callee(some_a, some_b);

     auto functor<Tag, Site>::operator()(A&&... args)
     {
       dispatch< hierarchy_of<Site>::type
               , hierarchy_of<Tag>::type(hierarchy_of<A>::type...)
               > callee;
       return callee(args...);
     }
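A minimal stand-alone illustration of the hierarchy-based dispatch idea, using plain overload resolution rather than NT2's real `dispatch`/`hierarchy_of` machinery (all names below are hypothetical): because site tags form an inheritance hierarchy, a site without a dedicated overload falls back to the nearest base tag's implementation.

```cpp
#include <cassert>
#include <string>

namespace tag {
    struct formal_ {};            // root of the hierarchy
    struct scalar_ : formal_ {};  // generic site: plain loop
    struct sse_    : scalar_ {};  // SSE refines scalar
    struct avx_    : sse_ {};     // AVX refines SSE, no dedicated overload below
}

// Overloads on the tag type: overload resolution prefers the most derived
// base class a tag can convert to, which is exactly the fallback we want.
std::string add_impl(tag::scalar_) { return "scalar loop"; }
std::string add_impl(tag::sse_)    { return "_mm_add_ps kernel"; }

// Entry point: the target architecture is just a type parameter.
template <class Site>
std::string add() { return add_impl(Site{}); }
```

Calling `add<tag::avx_>()` selects the SSE kernel, since `avx_` derives from `sse_` and no AVX-specific overload exists; adding one later requires a single new overload, not changes to existing code.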
  32. Generalized Tag Dispatching
     template<class A0>
     struct dispatch< tag::sse_
                    , tag::plus_( simd_<real_<float_<A0>>>
                                , simd_<real_<float_<A0>>> )
                    >
     {
       A0 operator()(A0&& a, A0&& b)
       {
         return _mm_add_ps(a, b);
       }
     };
  33. Generalized Tag Dispatching
     template<class S, class R, class L, class N, class A0>
     struct dispatch< openmp_<S>, reduction_<R,L,N>(ast_<A0>) >
     {
       typename A0::value_type operator()(A0&& ast)
       {
         auto result = functor<N,S>()( as_<typename A0::value_type>() );
         #pragma omp parallel
         {
           auto local = functor<N,S>()( as_<typename A0::value_type>() );
           #pragma omp for nowait
           for(int i=0; i<size(ast); ++i)
             functor<R,S>()( local, ast(i) );

           #pragma omp critical
           functor<L,S>()( result, local );
         }
         return result;
       }
     };
  34-37. Multipass EDSL Code Generation
     Principles:
     - Don't walk the AST directly
     - Capture where hardware introduces extension points
     - Use GTD to select the proper phase implementation
     Structure:
     - Optimize: AST-to-AST transforms
     - Schedule: AST to Forest of Code Generators transforms
     - Run: Code Generator to Code transforms
  38-40. Multipass EDSL Code Generation
     Optimize:
     - AST pattern matching using Proto grammars
     - Replace large sequences of operations with equivalent, faster kernels
     - E.g. a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)
     Schedule:
     - Segment the AST into loop-compatible loop functors
     - Use the AST node hierarchy (and GTD) for splitting
     Run:
     - Generate code for each AST in the scheduled forest
     - Potentially generate runtime calls for runtime optimization
     - Classical EDSL code generation happens here, on the correct hardware
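The Optimize phase's a+b*c → fma rewrite can be sketched without Proto as a compile-time pattern match on node types. All types here are illustrative stand-ins, and a real pass would recurse over the whole tree rather than only inspect the root:

```cpp
#include <cassert>
#include <cmath>

// Toy AST: leaves hold a value, inner nodes evaluate their children.
struct Term { double v; double eval() const { return v; } };

template <class L, class R>
struct Plus { L l; R r; double eval() const { return l.eval() + r.eval(); } };

template <class L, class R>
struct Mul  { L l; R r; double eval() const { return l.eval() * r.eval(); } };

// Fused node the optimizer may introduce: computes b*c + a in one step.
template <class A, class B, class C>
struct Fma {
    A a; B b; C c;
    double eval() const { return std::fma(b.eval(), c.eval(), a.eval()); }
};

// Generic case: leave the node unchanged.
template <class E>
struct optimize { static E apply(E e) { return e; } };

// Pattern a + (b*c): rewrite into a single Fma node at compile time.
template <class A, class B, class C>
struct optimize<Plus<A, Mul<B, C>>> {
    static Fma<A, B, C> apply(Plus<A, Mul<B, C>> e) { return {e.l, e.r.l, e.r.r}; }
};

// "Run" the two phases: optimize the tree, then evaluate it.
template <class E>
double run(E e) { return optimize<E>::apply(e).eval(); }
```

The rewrite is purely type-driven: `Plus<Term, Mul<Term, Term>>` matches the partial specialization and is evaluated as one `std::fma`, while any other shape falls through to the identity transform.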
  41. The new NT2 Structure
     State of the code base:
     - Expression Template handling: 200-300 LOC
     - GTD handling: 1 KLOC
     - Actual features and function implementations: 10 KLOC
     Evolution process:
     - Adding a new site: 100-500 LOC
     - Adding a function: 10-100 LOC
     - Time for complete new architecture support: 1 to 3 weeks
  42. The new NT2 Structure (diagram)
  43. Some results
     Support for OpenMP:
     - 1 build system script
     - 4 new overloads for the run phase
     - 1 new allocator to prevent false sharing
     Support for next-gen SIMD:
     - AVX: 2 weeks of work, functional
     - AVX2: prototype ready, works in simulator
     - LarBn: prototype ready, works in simulator
  44. Talk Layout
     - Introduction
     - NT2
     - Expression Templates v2.0
     - Supporting GPUs
     - Conclusion
  45. OpenCL in action
     Principle:
     - Standard for manycore/multicore systems
     - Uses runtime kernels and JIT compilation
     Example:
     const char* OpenCLSource[] = {
       "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
       "{",
       "  // Index of the elements to add\n",
       "  unsigned int n = get_global_id(0);",
       "  // Sum the n'th element of vectors a and b and store in c\n",
       "  c[n] = 0;",
       "  for(int i = 0; i < nb; i++) c[n] += (a[n] + b[n]);",
       "}"
     };
  46-47. From Proto to OpenCL
     The Schedule pass:
     - Split the AST as usual on the function hierarchy
     - Replace each AST by a functor generating the equivalent OpenCL
     - Compute a static hash of the tree to remember already-compiled expressions
     The Run pass:
     - Stream data at the beginning of each kernel
     - Handle generic calls and retrieval of compiled kernels
     - Fire asynchronous kernel calls over the forest
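The "static hash of the tree" idea can be approximated in portable C++ by keying a cache on the expression's type: in an ET framework, the C++ type encodes the AST shape. The sketch below is hypothetical (NT2's actual mechanism is a compile-time hash, not `std::type_index`), but it shows why a second evaluation of the same expression skips compilation entirely:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>

// Stand-ins for two structurally different expression ASTs.
struct AddExpr {}; struct MulExpr {};

struct KernelCache {
    std::map<std::type_index, std::string> compiled;
    int compilations = 0;  // counts how many times we actually "compiled"

    template <class Expr>
    std::string const& get() {
        auto key = std::type_index(typeid(Expr));
        auto it = compiled.find(key);
        if (it == compiled.end()) {
            ++compilations;  // first sighting of this AST shape: compile it
            it = compiled.emplace(key, "kernel for " + std::string(key.name())).first;
        }
        return it->second;   // subsequent calls: cache hit, no recompilation
    }
};
```

This matches the benchmark pattern on slide 52: the first GPU run pays the JIT cost, every later run of the same expression only looks up the cached kernel.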
  48. Support for OpenCL
     NT2 example:
     int main()
     {
       int nbpoints = 512000;
       table<float, settings(_1D)> xx( ofSize(nbpoints) );
       table<float, settings(_1D)> yy( ofSize(nbpoints) );

       for(int i = 1; i <= nbpoints; i++)
       {
         xx(i) = -nbpoints/2 + i;
         yy(i) = 0.0;
       }
       yy = 100*sin(8*3.14*xx/512000);

       std::cout << xx(1) << " " << yy(1);
     }
  49. Support for OpenCL
     NT2 example, the only change being the nt2::gpu site in the table settings:
     int main()
     {
       int nbpoints = 512000;
       table<float, settings(_1D, nt2::gpu)> xx( ofSize(nbpoints) );
       table<float, settings(_1D, nt2::gpu)> yy( ofSize(nbpoints) );

       for(int i = 1; i <= nbpoints; i++)
       {
         xx(i) = -nbpoints/2 + i;
         yy(i) = 0.0;
       }
       yy = 100*sin(8*3.14*xx/512000);

       std::cout << xx(1) << " " << yy(1);
     }
  50. Preliminary results for OpenCL
     Generating some signals:
     for(int i = 1; i <= nbpoints; i++)
     {
       xx(i) = -nbpoints/2 + i;
       yy(i) = 0.0;
     }

     yy = 100*sin(8*3.14*xx/512000);
     yy = yy + 0.0001*xx;

     for(int ii = 1; ii < 512000; ii += 15000)
     {
       addGaussian(xx, yy, (float)ii, 1000);
       yy = 1 - yy;
     }

     yy = 1 - yy;
     yy = yy*yy;
     yy = yy*cos(xx);
  51. Preliminary results for OpenCL
     Generating some signals:
     template<typename A>
     void addGaussian(A& xx, A& yy, float centre, float largeur)
     {
       yy = yy + ( 100000 / (val(largeur) * sqrt(2*3.1415)) )
               * exp( -(xx - val(centre)) * (xx - val(centre))
                    / (2 * val(largeur) * val(largeur)) );
     }
  52. Preliminary results for OpenCL
     Generating some signals:
     CPU 1 core            520.2 ms
     CPU 4 cores           141.3 ms
     CPU 4 cores + SIMD     36.1 ms
     GPU first run         878.2 ms
     GPU other runs         13.1 ms
     GPU manual OpenCL      12.9 ms
     Compares favorably to ViennaCL and Accelereyes.
  53. Talk Layout
     - Introduction
     - NT2
     - Expression Templates v2.0
     - Supporting GPUs
     - Conclusion
  54-55. Let's round this up!
     Parallel Computing for Scientists:
     - Parallel programming tools are required
     - EDSLs are an efficient way to design such tools
     - But hardware has to stay in the loop
     Our claims:
     - Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while remaining easy to use
     - Like regular languages, EDSLs need information about the hardware system
     - Integrating hardware descriptions as Generic components increases tool portability and re-targetability
  56-57. Current and Future Works
     Recent activity:
     - A public release of NT2 is planned "soon"
     - A Matlab-to-NT2 compiler has been designed
     - Prototype for single-source CUDA support in NT2
     - PhD started on MPI support in NT2
     What we'll be cooking in the future:
     - std::future is the future
     - Experimental research on NT2 for embedded systems
     - Toward a global generic approach to parallelism
  58. Thanks for your attention
