SlideShare a Scribd company logo
1 of 71
Download to read offline
(Costless) Software Abstractions for Parallel Architectures 
numscale 
Unlocked software performance 
Joel Falcou 
NumScale SAS – LRI - INRIA 
CppCon – 01/07/2014
numscale 
Unlocked software performance 
Context 
Decades of hardware improvements 
 Scientic Computing now drives most hardware innovations 
 Current Solution: Parallel architectures 
 Machines become more and more complex 
2 of 40
numscale 
Unlocked software performance 
Context 
Decades of hardware improvements 
 Scientic Computing now drives most hardware innovations 
 Current Solution: Parallel architectures 
 Machines become more and more complex 
Example : A simple laptop 
 CPU: Intel Core i5-2410M (2.3 GHz) : 4 logical cores, AVX 
 4 logical cores 
 SIMD Extensions: SSE2-SSE4.2, AVX 
 GPU: NVIDIA GeForce GT 520M (48 CUDA cores) 
2 of 40
numscale 
Unlocked software performance 
The Real Challenge of HPC 
Single Core Era 
Performance 
Expressiveness 
C/Fort. 
C++ 
Java 
Multi-Core/SIMD Era 
Performance 
Sequential 
Expressiveness 
SIMD 
Threads 
Heterogenous Era 
Performance 
Sequential 
Expressiveness 
GPU 
Phi 
SIMD 
Threads 
Distributed 
3 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
Our Approach 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
Our Approach 
 Design tools as C++ libraries (1) 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
Our Approach 
 Design tools as C++ libraries (1) 
 Design these libraries as Domain Specic Embedded Language (DSEL) (2+3) 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
Our Approach 
 Design tools as C++ libraries (1) 
 Design these libraries as Domain Specic Embedded Language (DSEL) (2+3) 
 Use Parallel Skeletons as parallel components (4) 
4 of 40
numscale 
Unlocked software performance 
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
Our Approach 
 Design tools as C++ libraries (1) 
 Design these libraries as Domain Specic Embedded Language (DSEL) (2+3) 
 Use Parallel Skeletons as parallel components (4) 
 Use Generative Programming to deliver performance (5) 
4 of 40
numscale 
Unlocked software performance 
Parallel Programming Ain’t Easy 
5 of 40
numscale 
Unlocked software performance 
Spotting abstraction when you see one 
Why Parallel Programming Models ? 
 Unstructured parallelism is error-prone 
 Low level parallel tools are non-composable 
6 of 40
numscale 
Unlocked software performance 
Spotting abstraction when you see one 
Why Parallel Programming Models ? 
 Unstructured parallelism is error-prone 
 Low level parallel tools are non-composable 
Available Models 
 Performance centric: P-RAM, LOG-P, BSP 
 Data centric: HTA, PGAS 
 Pattern centric: Actors, Skeletons 
6 of 40
numscale 
Unlocked software performance 
Spotting abstraction when you see one 
Why Parallel Programming Models ? 
 Unstructured parallelism is error-prone 
 Low level parallel tools are non-composable 
Available Models 
 Performance centric: P-RAM, LOG-P, BSP 
 Data centric: HTA, PGAS 
 Pattern centric: Actors, Skeletons 
6 of 40
numscale 
Unlocked software performance 
Parallel Skeletons in a nutshell 
Basic Principles [COLE 89] 
 There are patterns in parallel applications 
 Those patterns can be generalized in Skeletons 
 Applications are assembled as combination of such patterns 
7 of 40
numscale 
Unlocked software performance 
Parallel Skeletons in a nutshell 
Basic Principles [COLE 89] 
 There are patterns in parallel applications 
 Those patterns can be generalized in Skeletons 
 Applications are assembled as combination of such patterns 
Functionnal point of view 
 Skeletons are Higher-Order Functions 
 Skeletons support a compositionnal semantic 
 Applications become composition of state-less functions 
7 of 40
numscale 
Unlocked software performance 
Classic Parallel Skeletons 
Data Parallel Skeletons 
 map: Apply a n-ary function in SIMD mode over subset of data 
 fold: Perform n-ary reduction over subset of data 
 scan: Perform n-ary prex reduction over subset of data 
8 of 40
numscale 
Unlocked software performance 
Classic Parallel Skeletons 
Data Parallel Skeletons 
 map: Apply a n-ary function in SIMD mode over subset of data 
 fold: Perform n-ary reduction over subset of data 
 scan: Perform n-ary prex reduction over subset of data 
Task Parallel Skeletons 
 par: Independant task execution 
 pipe: Task dependency over time 
 farm: Load-balancing 
8 of 40
numscale 
Unlocked software performance 
Why using Parallel Skeletons 
Software Abstraction 
 Write without bothering with parallel details 
 Code is scalable and easy to maintain 
 Debuggable, Provable, Certiable 
9 of 40
numscale 
Unlocked software performance 
Why using Parallel Skeletons 
Software Abstraction 
 Write without bothering with parallel details 
 Code is scalable and easy to maintain 
 Debuggable, Provable, Certiable 
Hardware Abstraction 
 Semantic is set, implementation is free 
 Composability )Hierarchical architecture 
9 of 40
numscale 
Unlocked software performance 
Generative Programming 
Domain Specific 
Application Description 
Generative Component Concrete Application 
Translator 
Parametric 
Sub-components 
10 of 40
numscale 
Unlocked software performance 
Generative Programming as a Tool 
Available techniques 
 Dedicated compilers 
 External pre-processing tools 
 Languages supporting meta-programming 
11 of 40
numscale 
Unlocked software performance 
Generative Programming as a Tool 
Available techniques 
 Dedicated compilers 
 External pre-processing tools 
 Languages supporting meta-programming 
11 of 40
numscale 
Unlocked software performance 
Generative Programming as a Tool 
Available techniques 
 Dedicated compilers 
 External pre-processing tools 
 Languages supporting meta-programming 
Denition of Meta-programming 
Meta-programming is the writing of computer programs that analyse, transform and 
generate other programs (or themselves) as their data. 
11 of 40
numscale 
Unlocked software performance 
Generative Programming as a Tool 
Available techniques 
 Dedicated compilers 
 External pre-processing tools 
 Languages supporting meta-programming 
Denition of Meta-programming 
Meta-programming is the writing of computer programs that analyse, transform and 
generate other programs (or themselves) as their data. 
C++ meta-programming 
 Relies on the C++  sub-language 
 Handles types and integral constants at compile-time 
 Proved to be Turing-complete 
11 of 40
numscale 
Unlocked software performance 
Domain Specic Embedded Languages 
What’s an DSEL ? 
 DSL = Domain Specic Language 
 Declarative language, easy-to-use, tting the domain 
 DSEL = DSL within a general purpose language 
EDSL in C++ 
 Relies on operator overload abuse (Expression Templates) 
 Carry semantic information around code fragment 
 Generic implementation become self-aware of optimizations 
Exploiting static AST 
 At the expression level: code generation 
 At the function level: inter-procedural optimization 
12 of 40
numscale 
Unlocked software performance 
Embedded Domain Specic Languages 
EDSL in C++ 
 Relies on operator overload abuse – see B.P 
 Carry semantic information around code fragment 
 Generic implementation become self-aware of optimizations 
Advantages 
 Allow introduction of DSLs without disrupting dev. chain 
 Semantic dened as type informations means compile-time resolution 
 Access to a large selection of runtime binding 
13 of 40
numscale 
Unlocked software performance 
Expression Templates in A Nutshell 
14 of 40
numscale 
Unlocked software performance 
Expression Templates 
matrix x(h,w),a(h,w),b(h,w); 
x = cos(a) + (b*a); 
exprassign 
,exprmatrix 
,exprplus 
, exprcos 
,exprmatrix 
 
, exprmultiplies 
,exprmatrix 
,exprmatrix 
 
(x,a,b); 
+ 
= 
cos * 
a b a 
x 
#pragma omp parallel for 
for(int j=0;jh;++j) 
{ 
for(int i=0;iw;++i) 
{ 
x(j,i) = cos(a(j,i)) 
+ ( b(j,i) 
* a(j,i) 
); 
} 
} 
Arbitrary Transforms applied 
on the meta-AST 
15 of 40
numscale 
Unlocked software performance 
Architecture Aware Generative Programming 
16 of 40
numscale 
Unlocked software performance 
Parallel DSEL in practice 
Objectives 
 Apply DSEL generation techniques for different kind of hardware 
 Demonstrate low cost of abstractions 
 Demonstrate applicability of skeletons 
17 of 40
numscale 
Unlocked software performance 
Parallel DSEL in practice 
Objectives 
 Apply DSEL generation techniques for different kind of hardware 
 Demonstrate low cost of abstractions 
 Demonstrate applicability of skeletons 
Our contribution 
 BSP++ : Generic C++ BSP for shared/distributed memory 
 Quaff: DSEL for skeleton programming 
 B.SIMD: DSEL for portable SIMD programming 
 NT2: M like DSEL for scientic computing 
17 of 40
numscale 
Unlocked software performance 
Parallel DSEL in practice 
Objectives 
 Apply DSEL generation techniques for different kind of hardware 
 Demonstrate low cost of abstractions 
 Demonstrate applicability of skeletons 
Our contribution 
 BSP++ : Generic C++ BSP for shared/distributed memory 
 Quaff: DSEL for skeleton programming 
 B.SIMD: DSEL for portable SIMD programming 
 NT2: M like DSEL for scientic computing 
17 of 40
numscale 
Unlocked software performance 
NT2 
A Scientic Computing Library 
 Provide a simple, M-like interface for users 
 Provide high-performance computing entities and primitives 
 Easily extendable 
Components 
 Use Boost.SIMD for in-core optimizations 
 Use recursive parallel skeletons 
 Code is made independant of architecture and runtime 
18 of 40
numscale 
Unlocked software performance 
The Numerical Template Toolbox 
Comparison to other libraries 
Feature Armadillo Blaze Eigen MTL uBlas NT2 
M-like API X     X 
BLAS/LAPACK binding X X X X X X 
MAGMA binding      X 
SSE2+ support X X X   X 
AVX support X X    X 
AVX2 support      X 
Xeon Phi support      X 
Altivec support   X   X 
ARM support   X   X 
Threading support      X 
CUDA support      X 
19 of 40
numscale 
Unlocked software performance 
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly mimics 
M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values as in M 
20 of 40
numscale 
Unlocked software performance 
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly mimics 
M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
20 of 40
numscale 
Unlocked software performance 
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly mimics 
M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
 Add #include nt2/nt2.hpp and do cosmetic changes 
20 of 40
numscale 
Unlocked software performance 
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly mimics 
M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
 Add #include nt2/nt2.hpp and do cosmetic changes 
 Compile the le and link with libnt2.a 
20 of 40
numscale 
Unlocked software performance 
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly mimics 
M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
 Add #include nt2/nt2.hpp and do cosmetic changes 
 Compile the le and link with libnt2.a 
 ?????? 
 PROFIT! 
20 of 40
numscale 
Unlocked software performance 
NT2 - From M ... 
A1 = 1 : 1 0 0 0 ; 
A2 = A1 + randn ( size ( A1 ) ) ; 
X = lu ( A1 * A1 ’) ; 
rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ; 
21 of 40
numscale 
Unlocked software performance 
NT2 - ... to C++ 
table  double  A1 = _ (1. ,1000.) ; 
table  double  A2 = A1 + randn ( size ( A1 ) ) ; 
table  double  X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ; 
d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ; 
22 of 40
numscale 
Unlocked software performance 
Parallel Skeletons extraction process 
A = B / sum(C+D); 
= 
A = 
B sum 
+ 
C D 
fold 
transform 
23 of 40
numscale 
Unlocked software performance 
Parallel Skeletons extraction process 
A = B / sum(C+D); 
; ; 
= 
A = 
B sum 
+ 
C D 
fold 
transform 
= 
tmp sum 
+ 
C D 
fold 
) 
= 
A = 
B tmp 
transform 
24 of 40
numscale 
Unlocked software performance 
From data to task parallelism 
Limits of the fork-join model 
 Synchronization cost due to implicit barriers 
 Under-exploitation of potential parallelism 
 Poor data locality and no inter-statement optimization 
25 of 40
numscale 
Unlocked software performance 
From data to task parallelism 
Limits of the fork-join model 
 Synchronization cost due to implicit barriers 
 Under-exploitation of potential parallelism 
 Poor data locality and no inter-statement optimization 
From Skeletons to Actors 
 Upgrade NT2 to enable task parallelism 
 Adapt current skeletons for taskication 
 Use Futures ( or HPX)to automatically create pipelines 
 Derive a dependency graph between statements 
25 of 40
numscale 
Unlocked software performance 
Parallel Skeletons extraction process - Take 2 
A = B / sum(C+D); 
; ; 
= 
tmp sum 
+ 
C D 
fold 
= 
A = 
B tmp 
transform 
26 of 40
numscale 
Unlocked software performance 
Parallel Skeletons extraction process - Take 2 
A = B / sum(C+D); 
fold 
= 
= 
tmp(3) sum(3) 
+ 
tmp(1) sum 
+ 
C(:; 3) D(:; 3) 
= 
spawnertransform,OpenMP 
= 
= 
A(:; 3) = 
A(:; 2) = 
A(:; 1) = 
B(:; 3) tmp(3) 
B(:; 2) tmp(2) 
transform 
C(:; 1) D(:; 1) 
fold 
= 
tmp(2) sum(2) 
+ 
C(:; 2) D(:; 2) 
B(:; 1) tmp(1) 
transform 
workerfold,simd 
workertransform,simd 
spawnertransform,OpenMP 
; 
27 of 40
numscale 
Unlocked software performance 
Sigma-Delta Motion Detection 
Context 
 Mono-modal algorithm based on background substraction 
 Use local gaussian model of lightness variation to detect motion 
 Target applications: robotic, video survey and analytics, defence 
 Challenge: Very low arithmetic density 
 Challenge: Integer-based implementation with small range 
28 of 40
numscale 
Unlocked software performance 
Motion Detection 
NT2 Code 
table  char  s i g m a _ d e l t a ( table  char  b a c k g r o u n d 
, table  char  const  frame 
, table  char  v a r i a n c e 
) 
{ 
// E s t i m a t e Raw M o v e m e n t 
b a c k g r o u n d = s e l i n c ( b a c k g r o u n d  frame 
, s e l d e c ( b a c k g r o u n d  frame , b a c k g r o u n d ) 
) ; 
table  char  diff = dist ( background , frame ) ; 
// C o m p u t e Local V a r i a n c e 
table  char  sig3 = muls ( diff ,3) ; 
var = i f _ e l s e ( diff != 0 
, s e l i n c ( v a r i a n c e  sig3 
, s e l d e c ( var  sig3 , v a r i a n c e ) 
) 
, v a r i a n c e 
) ; 
// G e n e r a t e M o v e m e n t Label 
r e t u r n i f _ z e r o _ e l s e _ o n e ( diff  v a r i a n c e ) ; 
} 
29 of 40
numscale 
Unlocked software performance 
Motion Detection 
Performance 
18 
16 
14 
12 
10 
8 
6 
4 
2 
0 
512x512 1024x1024 
cycles/element 
Image Size (N x N) 
x6.8 
x14.8 
x16.5 
x2.1 
x3.6 
x6.7 
x15.3 
x18 
x2.3 
x3.99 
x10.8 
x10.8 
SCALAR 
HALF CORE 
FULL CORE 
SIMD 
JRTIP2008 
SIMD + HALF CORE 
SIMD + FULL CORE 
30 of 40
numscale 
Unlocked software performance 
Black and Scholes Option Pricing 
NT2 Code 
table  float  b l a c k s c h o l e s ( table  float  const  Sa , table  float  const  Xa 
, table  float  const  Ta 
, table  float  const  ra , table  float  const  va 
) 
{ 
table  float  da = sqrt ( Ta ) ; 
table  float  d1 = log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) ; 
table  float  d2 = d1 - va * da ; 
r e t u r n Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ; 
} 
31 of 40
numscale 
Unlocked software performance 
Black and Scholes Option Pricing 
NT2 Code with loop fusion 
table  float  b l a c k s c h o l e s ( table  float  const  Sa , table  float  const  Xa 
, table  float  const  Ta 
, table  float  const  ra , table  float  const  va 
) 
{ 
// P r e a l l o c a t e t e m p o r a r y t a b l e s 
table  float  da ( e x t e n t ( Ta ) ) , d1 ( e x t e n t ( Ta ) ) , d2 ( e x t e n t ( Ta ) ) , R ( e x t e n t ( Ta ) ) ; 
// tie merge loop nest and i n c r e a s e cache l o c a l i t y 
tie ( da , d1 , d2 , R ) = tie ( sqrt ( Ta ) 
, log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) 
, d1 - va * da 
, Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) 
) ; 
r e t u r n R ; 
} 
31 of 40
numscale 
Unlocked software performance 
Black and Scholes Option Pricing 
Performance 
1000000 
150 
100 
50 
0 
x1.89 
x2.91 
x5.58 
x6.30 
Size 
cycle/value 
scalar 
SSE2 
AVX2 
SSE2, 4 cores 
AVX2, 4 cores 
32 of 40
numscale 
Unlocked software performance 
Black and Scholes Option Pricing 
Performance with loop fusion 
1000000 
150 
100 
50 
0 
x2.27 
x4.13 
x8.05 
x11.12 
Size 
cycle/value 
scalar 
SSE2 
AVX2 
SSE2, 4 cores 
AVX2, 4 cores 
33 of 40
numscale 
Unlocked software performance 
LU Decomposition 
Algorithm 
A00 
A10 A01 A02 
A20 A11 
A21 
A12 
A22 A11 
A21 A12 
A22 
A22 
step 1 
step 2 
step 3 
step 4 
step 5 
step 6 
step 7 
DGETRF 
DGESSM 
DTSTRF 
DSSSSM 
34 of 40
numscale 
Unlocked software performance 
LU Decomposition 
Performance 
0 10 20 30 40 50 
100 
50 
0 
Number of cores 
Median GFLOPS 
8000  8000 LU decomposition 
NT2 
Intel MKL 
35 of 40
numscale 
Unlocked software performance 
What we learn 
Parallel Computing for Scientist 
 Software Libraries built as Generic and Generative components can solve a large 
chunk of parallelism related problems while being easy to use. 
 Like regular language, EDSL needs informations about the hardware system 
 Integrating hardware descriptions as Generic components increases tools portability 
and re-targetability 
36 of 40
numscale 
Unlocked software performance 
What we learn 
Parallel Computing for Scientist 
 Software Libraries built as Generic and Generative components can solve a large 
chunk of parallelism related problems while being easy to use. 
 Like regular language, EDSL needs informations about the hardware system 
 Integrating hardware descriptions as Generic components increases tools portability 
and re-targetability 
New Directions 
 Toward a global generic approach to parallelism 
 Turning hacks into language features 
36 of 40
numscale 
Unlocked software performance 
Generic Parallelism 
Parallel C++ Concepts 
 Expand function hierarchization to Concepts 
 e.g : DataParallel, AssociativeOperations, etc. 
 Use C++1y Concept overloading to split skeletons 
37 of 40
numscale 
Unlocked software performance 
Generic Parallelism 
Parallel C++ Concepts 
 Expand function hierarchization to Concepts 
 e.g : DataParallel, AssociativeOperations, etc. 
 Use C++1y Concept overloading to split skeletons 
Impact 
 Less work for the Skeleton users 
 Extendable through renement 
 Static assertion of function properties 
37 of 40
numscale 
Unlocked software performance 
New C++ Language Features 
My C++ Christmas Land 
 Build lazy evaluation into the language 
 Interactions with generic function is cumbersome 
 SIMD as part of the standard at type level 
38 of 40
numscale 
Unlocked software performance 
New C++ Language Features 
My C++ Christmas Land 
 Build lazy evaluation into the language 
 Interactions with generic function is cumbersome 
 SIMD as part of the standard at type level 
Current Work 
 Can sizeof inspires an ast_of operator 
 Proposal N4035 for auto customization 
 Proposal N3571 for standard SIMD computation 
38 of 40
numscale 
Unlocked software performance 
Perspectives 
At tools level 
 Prototype of single source GPU support 
 Work on distributed systems 
 Applications to Big Data 
39 of 40
numscale 
Unlocked software performance 
Perspectives 
At tools level 
 Prototype of single source GPU support 
 Work on distributed systems 
 Applications to Big Data 
At language level 
 Formalize meta-programming 
 DSEL verication transferance over C++ 
 Interaction with polyhedral model 
39 of 40
Thanks for your attention

More Related Content

What's hot

01 c++ Intro.ppt
01 c++ Intro.ppt01 c++ Intro.ppt
01 c++ Intro.ppt
Tareq Hasan
 
C programming & data structure [arrays & pointers]
C programming & data structure   [arrays & pointers]C programming & data structure   [arrays & pointers]
C programming & data structure [arrays & pointers]
MomenMostafa
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
Kushaal Singla
 

What's hot (20)

C++ questions And Answer
C++ questions And AnswerC++ questions And Answer
C++ questions And Answer
 
Top C Language Interview Questions and Answer
Top C Language Interview Questions and AnswerTop C Language Interview Questions and Answer
Top C Language Interview Questions and Answer
 
A recommender system for generalizing and refining code templates
A recommender system for generalizing and refining code templatesA recommender system for generalizing and refining code templates
A recommender system for generalizing and refining code templates
 
Generating Assertion Code from OCL: A Transformational Approach Based on Simi...
Generating Assertion Code from OCL: A Transformational Approach Based on Simi...Generating Assertion Code from OCL: A Transformational Approach Based on Simi...
Generating Assertion Code from OCL: A Transformational Approach Based on Simi...
 
C programming & data structure [character strings & string functions]
C programming & data structure   [character strings & string functions]C programming & data structure   [character strings & string functions]
C programming & data structure [character strings & string functions]
 
C++ interview question
C++ interview questionC++ interview question
C++ interview question
 
C, C++ Interview Questions Part - 1
C, C++ Interview Questions Part - 1C, C++ Interview Questions Part - 1
C, C++ Interview Questions Part - 1
 
C++ Interview Questions
C++ Interview QuestionsC++ Interview Questions
C++ Interview Questions
 
Storage classes, linkage & memory management
Storage classes, linkage & memory managementStorage classes, linkage & memory management
Storage classes, linkage & memory management
 
01 c++ Intro.ppt
01 c++ Intro.ppt01 c++ Intro.ppt
01 c++ Intro.ppt
 
Intro. to prog. c++
Intro. to prog. c++Intro. to prog. c++
Intro. to prog. c++
 
Vb.net
Vb.netVb.net
Vb.net
 
C programming & data structure [arrays & pointers]
C programming & data structure   [arrays & pointers]C programming & data structure   [arrays & pointers]
C programming & data structure [arrays & pointers]
 
PHP
PHPPHP
PHP
 
Functional Programming in R
Functional Programming in RFunctional Programming in R
Functional Programming in R
 
C by balaguruswami - e.balagurusamy
C   by balaguruswami - e.balagurusamyC   by balaguruswami - e.balagurusamy
C by balaguruswami - e.balagurusamy
 
C++ questions and answers
C++ questions and answersC++ questions and answers
C++ questions and answers
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview questions
C interview questionsC interview questions
C interview questions
 
A Recommender System for Refining Ekeko/X Transformation
A Recommender System for Refining Ekeko/X TransformationA Recommender System for Refining Ekeko/X Transformation
A Recommender System for Refining Ekeko/X Transformation
 

Similar to (Costless) Software Abstractions for Parallel Architectures

Rhapsody Software
Rhapsody SoftwareRhapsody Software
Rhapsody Software
Bill Duncan
 
An Introduction To Linux Development Environment
An Introduction To Linux Development EnvironmentAn Introduction To Linux Development Environment
An Introduction To Linux Development Environment
S. M. Hossein Hamidi
 

Similar to (Costless) Software Abstractions for Parallel Architectures (20)

Rhapsody Software
Rhapsody SoftwareRhapsody Software
Rhapsody Software
 
An Introduction To Linux Development Environment
An Introduction To Linux Development EnvironmentAn Introduction To Linux Development Environment
An Introduction To Linux Development Environment
 
ContainerDayVietnam2016: Become a Cloud-native Developer
ContainerDayVietnam2016: Become a Cloud-native DeveloperContainerDayVietnam2016: Become a Cloud-native Developer
ContainerDayVietnam2016: Become a Cloud-native Developer
 
.NET Overview
.NET Overview.NET Overview
.NET Overview
 
Scale Up Performance with Intel® Development
Scale Up Performance with Intel® DevelopmentScale Up Performance with Intel® Development
Scale Up Performance with Intel® Development
 
Recover 30% of your day with IBM Development Tools (Smarter Mainframe Develop...
Recover 30% of your day with IBM Development Tools (Smarter Mainframe Develop...Recover 30% of your day with IBM Development Tools (Smarter Mainframe Develop...
Recover 30% of your day with IBM Development Tools (Smarter Mainframe Develop...
 
LIFT: A Legacy InFormation retrieval Tool
LIFT: A Legacy InFormation retrieval ToolLIFT: A Legacy InFormation retrieval Tool
LIFT: A Legacy InFormation retrieval Tool
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFV
 
oneAPI: Industry Initiative & Intel Product
oneAPI: Industry Initiative & Intel ProductoneAPI: Industry Initiative & Intel Product
oneAPI: Industry Initiative & Intel Product
 
Introduction to Programming Lesson 01
Introduction to Programming Lesson 01Introduction to Programming Lesson 01
Introduction to Programming Lesson 01
 
Introduction to c_sharp
Introduction to c_sharpIntroduction to c_sharp
Introduction to c_sharp
 
Introduction to c_sharp
Introduction to c_sharpIntroduction to c_sharp
Introduction to c_sharp
 
Build 2019 Recap
Build 2019 RecapBuild 2019 Recap
Build 2019 Recap
 
Oracle sun studio
Oracle sun studioOracle sun studio
Oracle sun studio
 
Source-to-source transformations: Supporting tools and infrastructure
Source-to-source transformations: Supporting tools and infrastructureSource-to-source transformations: Supporting tools and infrastructure
Source-to-source transformations: Supporting tools and infrastructure
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
 
"Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa..."Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa...
 
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
 
01 Introduction to programming
01 Introduction to programming01 Introduction to programming
01 Introduction to programming
 

Recently uploaded

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 

Recently uploaded (20)

WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public AdministrationWSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
WSO2CON 2024 - Architecting AI in the Enterprise: APIs and Applications
WSO2CON 2024 - Architecting AI in the Enterprise: APIs and ApplicationsWSO2CON 2024 - Architecting AI in the Enterprise: APIs and Applications
WSO2CON 2024 - Architecting AI in the Enterprise: APIs and Applications
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next IntegrationWSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
 
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
WSO2Con2024 - Unleashing the Financial Potential of 13 Million People
WSO2Con2024 - Unleashing the Financial Potential of 13 Million PeopleWSO2Con2024 - Unleashing the Financial Potential of 13 Million People
WSO2Con2024 - Unleashing the Financial Potential of 13 Million People
 
WSO2Con2024 - Low-Code Integration Tooling
WSO2Con2024 - Low-Code Integration ToolingWSO2Con2024 - Low-Code Integration Tooling
WSO2Con2024 - Low-Code Integration Tooling
 
WSO2Con2024 - Software Delivery in Hybrid Environments
WSO2Con2024 - Software Delivery in Hybrid EnvironmentsWSO2Con2024 - Software Delivery in Hybrid Environments
WSO2Con2024 - Software Delivery in Hybrid Environments
 
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
 

(Costless) Software Abstractions for Parallel Architectures

  • 1. (Costless) Software Abstractions for Parallel Architectures numscale Unlocked software performance Joel Falcou NumScale SAS – LRI - INRIA CppCon – 01/07/2014
  • 2. numscale Unlocked software performance Context Decades of hardware improvements Scientic Computing now drives most hardware innovations Current Solution: Parallel architectures Machines become more and more complex 2 of 40
  • 3. numscale Unlocked software performance Context Decades of hardware improvements Scientic Computing now drives most hardware innovations Current Solution: Parallel architectures Machines become more and more complex Example : A simple laptop CPU: Intel Core i5-2410M (2.3 GHz) : 4 logical cores, AVX 4 logical cores SIMD Extensions: SSE2-SSE4.2, AVX GPU: NVIDIA GeForce GT 520M (48 CUDA cores) 2 of 40
  • 4. numscale Unlocked software performance The Real Challenge of HPC Single Core Era Performance Expressiveness C/Fort. C++ Java Multi-Core/SIMD Era Performance Sequential Expressiveness SIMD Threads Heterogenous Era Performance Sequential Expressiveness GPU Phi SIMD Threads Distributed 3 of 40
  • 5. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 4 of 40
  • 6. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 4 of 40
  • 7. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 4 of 40
  • 8. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4 of 40
  • 9. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 4 of 40
  • 10. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient 4 of 40
  • 11. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient Our Approach 4 of 40
  • 12. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient Our Approach Design tools as C++ libraries (1) 4 of 40
  • 13. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient Our Approach Design tools as C++ libraries (1) Design these libraries as Domain Specic Embedded Language (DSEL) (2+3) 4 of 40
  • 14. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient Our Approach Design tools as C++ libraries (1) Design these libraries as Domain Specic Embedded Language (DSEL) (2+3) Use Parallel Skeletons as parallel components (4) 4 of 40
  • 15. numscale Unlocked software performance Designing tools for Scientic Computing Challenges 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient Our Approach Design tools as C++ libraries (1) Design these libraries as Domain Specic Embedded Language (DSEL) (2+3) Use Parallel Skeletons as parallel components (4) Use Generative Programming to deliver performance (5) 4 of 40
  • 16. numscale Unlocked software performance Parallel Programming Ain’t Easy 5 of 40
  • 17. numscale Unlocked software performance Spotting abstraction when you see one Why Parallel Programming Models ? Unstructured parallelism is error-prone Low level parallel tools are non-composable 6 of 40
  • 18. numscale Unlocked software performance Spotting abstraction when you see one Why Parallel Programming Models ? Unstructured parallelism is error-prone Low level parallel tools are non-composable Available Models Performance centric: P-RAM, LOG-P, BSP Data centric: HTA, PGAS Pattern centric: Actors, Skeletons 6 of 40
  • 19. numscale Unlocked software performance Spotting abstraction when you see one Why Parallel Programming Models ? Unstructured parallelism is error-prone Low level parallel tools are non-composable Available Models Performance centric: P-RAM, LOG-P, BSP Data centric: HTA, PGAS Pattern centric: Actors, Skeletons 6 of 40
  • 20. numscale Unlocked software performance Parallel Skeletons in a nutshell Basic Principles [COLE 89] There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as combination of such patterns 7 of 40
  • 21. numscale Unlocked software performance Parallel Skeletons in a nutshell Basic Principles [COLE 89] There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as combination of such patterns Functionnal point of view Skeletons are Higher-Order Functions Skeletons support a compositionnal semantic Applications become composition of state-less functions 7 of 40
  • 22. numscale Unlocked software performance Classic Parallel Skeletons Data Parallel Skeletons map: Apply a n-ary function in SIMD mode over subset of data fold: Perform n-ary reduction over subset of data scan: Perform n-ary prex reduction over subset of data 8 of 40
  • 23. numscale Unlocked software performance Classic Parallel Skeletons Data Parallel Skeletons map: Apply a n-ary function in SIMD mode over subset of data fold: Perform n-ary reduction over subset of data scan: Perform n-ary prex reduction over subset of data Task Parallel Skeletons par: Independant task execution pipe: Task dependency over time farm: Load-balancing 8 of 40
  • 24. numscale Unlocked software performance Why using Parallel Skeletons Software Abstraction Write without bothering with parallel details Code is scalable and easy to maintain Debuggable, Provable, Certiable 9 of 40
  • 25. numscale Unlocked software performance Why using Parallel Skeletons Software Abstraction Write without bothering with parallel details Code is scalable and easy to maintain Debuggable, Provable, Certiable Hardware Abstraction Semantic is set, implementation is free Composability )Hierarchical architecture 9 of 40
  • 26. numscale Unlocked software performance Generative Programming Domain Specific Application Description Generative Component Concrete Application Translator Parametric Sub-components 10 of 40
  • 27. numscale Unlocked software performance Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming 11 of 40
  • 28. numscale Unlocked software performance Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming 11 of 40
  • 29. numscale Unlocked software performance Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming Denition of Meta-programming Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data. 11 of 40
  • 30. numscale Unlocked software performance Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming Denition of Meta-programming Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data. C++ meta-programming Relies on the C++  sub-language Handles types and integral constants at compile-time Proved to be Turing-complete 11 of 40
  • 31. numscale Unlocked software performance Domain Specic Embedded Languages What’s an DSEL ? DSL = Domain Specic Language Declarative language, easy-to-use, tting the domain DSEL = DSL within a general purpose language EDSL in C++ Relies on operator overload abuse (Expression Templates) Carry semantic information around code fragment Generic implementation become self-aware of optimizations Exploiting static AST At the expression level: code generation At the function level: inter-procedural optimization 12 of 40
  • 32. numscale Unlocked software performance Embedded Domain Specic Languages EDSL in C++ Relies on operator overload abuse – see B.P Carry semantic information around code fragment Generic implementation become self-aware of optimizations Advantages Allow introduction of DSLs without disrupting dev. chain Semantic dened as type informations means compile-time resolution Access to a large selection of runtime binding 13 of 40
  • 33. numscale Unlocked software performance Expression Templates in A Nutshell 14 of 40
  • 34. numscale Unlocked software performance Expression Templates matrix x(h,w),a(h,w),b(h,w); x = cos(a) + (b*a); exprassign ,exprmatrix ,exprplus , exprcos ,exprmatrix , exprmultiplies ,exprmatrix ,exprmatrix (x,a,b); + = cos * a b a x #pragma omp parallel for for(int j=0;jh;++j) { for(int i=0;iw;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); } } Arbitrary Transforms applied on the meta-AST 15 of 40
  • 35. numscale Unlocked software performance Architecture Aware Generative Programming 16 of 40
  • 36. numscale Unlocked software performance Parallel DSEL in practice Objectives Apply DSEL generation techniques for different kind of hardware Demonstrate low cost of abstractions Demonstrate applicability of skeletons 17 of 40
  • 37. numscale Unlocked software performance Parallel DSEL in practice Objectives Apply DSEL generation techniques for different kind of hardware Demonstrate low cost of abstractions Demonstrate applicability of skeletons Our contribution BSP++ : Generic C++ BSP for shared/distributed memory Quaff: DSEL for skeleton programming B.SIMD: DSEL for portable SIMD programming NT2: M like DSEL for scientic computing 17 of 40
  • 38. numscale Unlocked software performance Parallel DSEL in practice Objectives Apply DSEL generation techniques for different kind of hardware Demonstrate low cost of abstractions Demonstrate applicability of skeletons Our contribution BSP++ : Generic C++ BSP for shared/distributed memory Quaff: DSEL for skeleton programming B.SIMD: DSEL for portable SIMD programming NT2: M like DSEL for scientic computing 17 of 40
  • 39. numscale Unlocked software performance NT2 A Scientic Computing Library Provide a simple, M-like interface for users Provide high-performance computing entities and primitives Easily extendable Components Use Boost.SIMD for in-core optimizations Use recursive parallel skeletons Code is made independant of architecture and runtime 18 of 40
  • 40. numscale Unlocked software performance The Numerical Template Toolbox Comparison to other libraries Feature Armadillo Blaze Eigen MTL uBlas NT2 M-like API X X BLAS/LAPACK binding X X X X X X MAGMA binding X SSE2+ support X X X X AVX support X X X AVX2 support X Xeon Phi support X Altivec support X X ARM support X X Threading support X CUDA support X 19 of 40
  • 41. numscale Unlocked software performance The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M 20 of 40
  • 42. numscale Unlocked software performance The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le 20 of 40
  • 43. numscale Unlocked software performance The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le Add #include nt2/nt2.hpp and do cosmetic changes 20 of 40
  • 44. numscale Unlocked software performance The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le Add #include nt2/nt2.hpp and do cosmetic changes Compile the le and link with libnt2.a 20 of 40
  • 45. numscale Unlocked software performance The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le Add #include nt2/nt2.hpp and do cosmetic changes Compile the le and link with libnt2.a ?????? PROFIT! 20 of 40
  • 46. numscale Unlocked software performance NT2 - From M ... A1 = 1 : 1 0 0 0 ; A2 = A1 + randn ( size ( A1 ) ) ; X = lu ( A1 * A1 ’) ; rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ; 21 of 40
  • 47. numscale Unlocked software performance NT2 - ... to C++ table double A1 = _ (1. ,1000.) ; table double A2 = A1 + randn ( size ( A1 ) ) ; table double X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ; d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ; 22 of 40
  • 48. numscale Unlocked software performance Parallel Skeletons extraction process A = B / sum(C+D); = A = B sum + C D fold transform 23 of 40
  • 49. numscale Unlocked software performance Parallel Skeletons extraction process A = B / sum(C+D); ; ; = A = B sum + C D fold transform = tmp sum + C D fold ) = A = B tmp transform 24 of 40
  • 50. numscale Unlocked software performance From data to task parallelism Limits of the fork-join model Synchronization cost due to implicit barriers Under-exploitation of potential parallelism Poor data locality and no inter-statement optimization 25 of 40
  • 51. numscale Unlocked software performance From data to task parallelism Limits of the fork-join model Synchronization cost due to implicit barriers Under-exploitation of potential parallelism Poor data locality and no inter-statement optimization From Skeletons to Actors Upgrade NT2 to enable task parallelism Adapt current skeletons for taskication Use Futures ( or HPX)to automatically create pipelines Derive a dependency graph between statements 25 of 40
  • 52. numscale Unlocked software performance Parallel Skeletons extraction process - Take 2 A = B / sum(C+D); ; ; = tmp sum + C D fold = A = B tmp transform 26 of 40
  • 53. numscale Unlocked software performance Parallel Skeletons extraction process - Take 2 A = B / sum(C+D); fold = = tmp(3) sum(3) + tmp(1) sum + C(:; 3) D(:; 3) = spawnertransform,OpenMP = = A(:; 3) = A(:; 2) = A(:; 1) = B(:; 3) tmp(3) B(:; 2) tmp(2) transform C(:; 1) D(:; 1) fold = tmp(2) sum(2) + C(:; 2) D(:; 2) B(:; 1) tmp(1) transform workerfold,simd workertransform,simd spawnertransform,OpenMP ; 27 of 40
  • 54. numscale Unlocked software performance Sigma-Delta Motion Detection Context Mono-modal algorithm based on background substraction Use local gaussian model of lightness variation to detect motion Target applications: robotic, video survey and analytics, defence Challenge: Very low arithmetic density Challenge: Integer-based implementation with small range 28 of 40
  • 55. numscale Unlocked software performance Motion Detection NT2 Code table char s i g m a _ d e l t a ( table char b a c k g r o u n d , table char const frame , table char v a r i a n c e ) { // E s t i m a t e Raw M o v e m e n t b a c k g r o u n d = s e l i n c ( b a c k g r o u n d frame , s e l d e c ( b a c k g r o u n d frame , b a c k g r o u n d ) ) ; table char diff = dist ( background , frame ) ; // C o m p u t e Local V a r i a n c e table char sig3 = muls ( diff ,3) ; var = i f _ e l s e ( diff != 0 , s e l i n c ( v a r i a n c e sig3 , s e l d e c ( var sig3 , v a r i a n c e ) ) , v a r i a n c e ) ; // G e n e r a t e M o v e m e n t Label r e t u r n i f _ z e r o _ e l s e _ o n e ( diff v a r i a n c e ) ; } 29 of 40
  • 56. numscale Unlocked software performance Motion Detection Performance 18 16 14 12 10 8 6 4 2 0 512x512 1024x1024 cycles/element Image Size (N x N) x6.8 x14.8 x16.5 x2.1 x3.6 x6.7 x15.3 x18 x2.3 x3.99 x10.8 x10.8 SCALAR HALF CORE FULL CORE SIMD JRTIP2008 SIMD + HALF CORE SIMD + FULL CORE 30 of 40
  • 57. numscale Unlocked software performance Black and Scholes Option Pricing NT2 Code table float b l a c k s c h o l e s ( table float const Sa , table float const Xa , table float const Ta , table float const ra , table float const va ) { table float da = sqrt ( Ta ) ; table float d1 = log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) ; table float d2 = d1 - va * da ; r e t u r n Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ; } 31 of 40
  • 58. numscale Unlocked software performance Black and Scholes Option Pricing NT2 Code with loop fusion table float b l a c k s c h o l e s ( table float const Sa , table float const Xa , table float const Ta , table float const ra , table float const va ) { // P r e a l l o c a t e t e m p o r a r y t a b l e s table float da ( e x t e n t ( Ta ) ) , d1 ( e x t e n t ( Ta ) ) , d2 ( e x t e n t ( Ta ) ) , R ( e x t e n t ( Ta ) ) ; // tie merge loop nest and i n c r e a s e cache l o c a l i t y tie ( da , d1 , d2 , R ) = tie ( sqrt ( Ta ) , log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) , d1 - va * da , Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ) ; r e t u r n R ; } 31 of 40
  • 59. numscale Unlocked software performance Black and Scholes Option Pricing Performance 1000000 150 100 50 0 x1.89 x2.91 x5.58 x6.30 Size cycle/value scalar SSE2 AVX2 SSE2, 4 cores AVX2, 4 cores 32 of 40
  • 60. numscale Unlocked software performance Black and Scholes Option Pricing Performance with loop fusion 1000000 150 100 50 0 x2.27 x4.13 x8.05 x11.12 Size cycle/value scalar SSE2 AVX2 SSE2, 4 cores AVX2, 4 cores 33 of 40
  • 61. numscale Unlocked software performance LU Decomposition Algorithm A00 A10 A01 A02 A20 A11 A21 A12 A22 A11 A21 A12 A22 A22 step 1 step 2 step 3 step 4 step 5 step 6 step 7 DGETRF DGESSM DTSTRF DSSSSM 34 of 40
  • 62. numscale Unlocked software performance LU Decomposition Performance 0 10 20 30 40 50 100 50 0 Number of cores Median GFLOPS 8000 8000 LU decomposition NT2 Intel MKL 35 of 40
  • 63. numscale Unlocked software performance What we learn Parallel Computing for Scientist Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, EDSL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability 36 of 40
  • 64. numscale Unlocked software performance What we learn Parallel Computing for Scientist Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, EDSL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability New Directions Toward a global generic approach to parallelism Turning hacks into language features 36 of 40
  • 65. numscale Unlocked software performance Generic Parallelism Parallel C++ Concepts Expand function hierarchization to Concepts e.g : DataParallel, AssociativeOperations, etc. Use C++1y Concept overloading to split skeletons 37 of 40
  • 66. numscale Unlocked software performance Generic Parallelism Parallel C++ Concepts Expand function hierarchization to Concepts e.g : DataParallel, AssociativeOperations, etc. Use C++1y Concept overloading to split skeletons Impact Less work for the Skeleton users Extendable through renement Static assertion of function properties 37 of 40
  • 67. numscale Unlocked software performance New C++ Language Features My C++ Christmas Land Build lazy evaluation into the language Interactions with generic function is cumbersome SIMD as part of the standard at type level 38 of 40
  • 68. numscale Unlocked software performance New C++ Language Features My C++ Christmas Land Build lazy evaluation into the language Interactions with generic function is cumbersome SIMD as part of the standard at type level Current Work Can sizeof inspires an ast_of operator Proposal N4035 for auto customization Proposal N3571 for standard SIMD computation 38 of 40
  • 69. numscale Unlocked software performance Perspectives At tools level Prototype of single source GPU support Work on distributed systems Applications to Big Data 39 of 40
  • 70. numscale Unlocked software performance Perspectives At tools level Prototype of single source GPU support Work on distributed systems Applications to Big Data At language level Formalize meta-programming DSEL verication transferance over C++ Interaction with polyhedral model 39 of 40
  • 71. Thanks for your attention