(Costless) Software Abstractions for Parallel Architectures

numscale
Unlocked software performance
Joel Falcou
NumScale SAS – LRI - INRIA
CppCon – 01/07/2014

Context
Decades of hardware improvements
Scientific Computing now drives most hardware innovations
Current Solution: Parallel architectures
Machines become more and more complex
Example: A simple laptop
CPU: Intel Core i5-2410M (2.3 GHz)
4 logical cores
SIMD Extensions: SSE2-SSE4.2, AVX
GPU: NVIDIA GeForce GT 520M (48 CUDA cores)

The Real Challenge of HPC
[Figure: performance versus expressiveness across three eras]
Single Core Era: C/Fortran, C++, Java
Multi-Core/SIMD Era: Sequential, SIMD, Threads
Heterogeneous Era: Sequential, GPU, Phi, SIMD, Threads, Distributed

Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specific Embedded Languages (DSEL) (2+3)
Use Parallel Skeletons as parallel components (4)
Use Generative Programming to deliver performance (5)

Parallel Programming Ain’t Easy

Spotting abstraction when you see one
Why Parallel Programming Models?
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Available Models
Performance centric: P-RAM, LogP, BSP
Data centric: HTA, PGAS
Pattern centric: Actors, Skeletons

Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as combinations of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support a compositional semantics
Applications become composition of state-less functions
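The functional view above can be sketched in plain C++14. This is only an illustration: make_map, make_fold and compose are hypothetical helper names, not part of NT2 or Quaff, and the code is sequential. The point is that each skeleton is a higher-order function, and that an application is obtained by composing them.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not NT2 or Quaff API): builds a 'map' skeleton from a
// stateless unary function and returns it as a new function on vectors.
template<typename F>
auto make_map(F f)
{
  return [f](std::vector<double> const& in)
  {
    std::vector<double> out;
    out.reserve(in.size());
    for(double v : in) out.push_back(f(v));
    return out;
  };
}

// Hypothetical helper: builds a 'fold' skeleton from a binary operation.
template<typename Op>
auto make_fold(Op op, double init)
{
  return [op, init](std::vector<double> const& in)
  {
    return std::accumulate(in.begin(), in.end(), init, op);
  };
}

// Composition: the output of 'first' feeds 'second'; the result is again a function.
template<typename First, typename Second>
auto compose(First first, Second second)
{
  return [first, second](std::vector<double> const& in)
  {
    return second(first(in));
  };
}

// An "application" assembled purely by combining skeletons.
inline double sum_of_squares(std::vector<double> const& data)
{
  auto square_all = make_map([](double v) { return v * v; });
  auto sum        = make_fold([](double a, double b) { return a + b; }, 0.0);
  return compose(square_all, sum)(data);
}
```

Because none of the functions hold state, a real skeleton library is free to replace the loops inside make_map and make_fold with a SIMD or multi-threaded implementation without changing sum_of_squares.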

Classic Parallel Skeletons
Data Parallel Skeletons
map: Apply an n-ary function in SIMD mode over a subset of data
fold: Perform an n-ary reduction over a subset of data
scan: Perform an n-ary prefix reduction over a subset of data
Task Parallel Skeletons
par: Independent task execution
pipe: Task dependency over time
farm: Load-balancing
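The three data parallel skeletons listed above have direct sequential counterparts in the C++ standard library, which makes their semantics easy to check. The sketch below uses std::transform for map, std::accumulate for fold and std::partial_sum for scan; the wrapper names (map_add, fold_sum, scan_sum) are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// map: apply a binary function element-wise over two inputs.
inline std::vector<int> map_add(std::vector<int> const& a, std::vector<int> const& b)
{
  assert(a.size() == b.size());
  std::vector<int> out(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out.begin(), std::plus<int>());
  return out;
}

// fold: reduce all elements to a single value.
inline int fold_sum(std::vector<int> const& a)
{
  return std::accumulate(a.begin(), a.end(), 0);
}

// scan: inclusive prefix reduction, out[i] = a[0] + ... + a[i].
inline std::vector<int> scan_sum(std::vector<int> const& a)
{
  std::vector<int> out(a.size());
  std::partial_sum(a.begin(), a.end(), out.begin());
  return out;
}
```

A skeleton library keeps these signatures and semantics but is free to pick a parallel implementation behind them.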

Why use Parallel Skeletons
Software Abstraction
Write without bothering with parallel details
Code is scalable and easy to maintain
Debuggable, Provable, Certifiable
Hardware Abstraction
Semantics are set, implementation is free
Composability => Hierarchical architecture

Generative Programming
[Figure: a Domain Specific Application Description is fed to a Generative Component, whose Translator assembles Parametric Sub-components into a Concrete Application]

Generative Programming as a Tool
Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming
Definition of Meta-programming
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
C++ meta-programming
Relies on the C++ template sub-language
Handles types and integral constants at compile-time
Proved to be Turing-complete
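A minimal illustration of the points above, using only the standard <type_traits> header. The names factorial and wider are chosen for this example. The first meta-function computes on an integral constant, the second computes on types, and both are evaluated entirely by the compiler.

```cpp
#include <type_traits>

// Integral constants: recursion over template parameters replaces a loop.
template<unsigned N>
struct factorial
  : std::integral_constant<unsigned long, N * factorial<N - 1>::value> {};

// Explicit specialization is the "base case" that stops the recursion.
template<>
struct factorial<0> : std::integral_constant<unsigned long, 1> {};

// Types: select the larger of two types at compile time.
template<typename A, typename B>
struct wider : std::conditional<(sizeof(A) >= sizeof(B)), A, B> {};

// Both results are known before the program runs.
static_assert(factorial<5>::value == 120, "computed by the compiler");
static_assert(std::is_same<wider<char, double>::type, double>::value,
              "types are manipulated like values");
```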

Domain Specific Embedded Languages
What’s a DSEL?
DSL = Domain Specific Language
Declarative language, easy-to-use, fitting the domain
DSEL = DSL within a general purpose language
EDSL in C++
Relies on operator overload abuse (Expression Templates)
Carries semantic information around code fragments
Generic implementations become self-aware of optimizations
Exploiting static AST
At the expression level: code generation
At the function level: inter-procedural optimization

Embedded Domain Specific Languages
EDSL in C++
Relies on operator overload abuse – see B.P
Carries semantic information around code fragments
Generic implementations become self-aware of optimizations
Advantages
Allow introduction of DSLs without disrupting dev. chain
Semantics defined as type information means compile-time resolution
Access to a large selection of runtime bindings

Expression Templates in A Nutshell

Expression Templates
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);

Type built by the expression:
expr<assign
    ,expr<matrix&>
    ,expr<plus
         ,expr<cos
              ,expr<matrix&>
              >
         ,expr<multiplies
              ,expr<matrix&>
              ,expr<matrix&>
              >
         >
    >(x,a,b);

[Figure: AST with = at the root, x on the left, and + over cos(a) and b*a on the right]

Generated code:
#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i))
           + ( b(j,i)
             * a(j,i)
             );
  }
}
Arbitrary Transforms applied
on the meta-AST
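To make the mechanism concrete, here is a deliberately small expression template sketch for 1-D vectors. It is not the NT2 or Boost.Proto implementation: the names (et_sketch, vec, plus_node, times_node, cos_node) are invented for the example and no AST transform is applied. Operators build a tree of lightweight nodes, and the assignment operator walks that tree in a single loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace et_sketch
{
  // CRTP base: marks a type as an expression so operators only match our types.
  template<typename E> struct expr
  {
    E const& self() const { return static_cast<E const&>(*this); }
  };

  // Binary node: stores references to its children and a static operation.
  template<typename L, typename R> struct plus_node : expr<plus_node<L, R>>
  {
    L const& lhs; R const& rhs;
    plus_node(L const& l, R const& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  };

  template<typename L, typename R> struct times_node : expr<times_node<L, R>>
  {
    L const& lhs; R const& rhs;
    times_node(L const& l, R const& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
  };

  template<typename A> struct cos_node : expr<cos_node<A>>
  {
    A const& arg;
    explicit cos_node(A const& a) : arg(a) {}
    double operator[](std::size_t i) const { return std::cos(arg[i]); }
  };

  // Terminal: owns the data.
  struct vec : expr<vec>
  {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assignment from any expression: one loop, no temporary vectors.
    template<typename E> vec& operator=(expr<E> const& e)
    {
      E const& tree = e.self();
      for(std::size_t i = 0; i < data.size(); ++i) data[i] = tree[i];
      return *this;
    }
  };

  // Operators only record the operation in the type of the returned node.
  template<typename L, typename R>
  plus_node<L, R> operator+(expr<L> const& l, expr<R> const& r)
  { return plus_node<L, R>(l.self(), r.self()); }

  template<typename L, typename R>
  times_node<L, R> operator*(expr<L> const& l, expr<R> const& r)
  { return times_node<L, R>(l.self(), r.self()); }

  template<typename A>
  cos_node<A> cos(expr<A> const& a) { return cos_node<A>(a.self()); }
}
```

With this in place, `x = cos(a) + (b * a);` on three et_sketch::vec objects compiles to a single loop over the elements. The nodes hold references, so the expression must be consumed within the statement that builds it, as it is here.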

Architecture Aware Generative Programming

Parallel DSEL in practice
Objectives
Apply DSEL generation techniques for different kinds of hardware
Demonstrate low cost of abstractions
Demonstrate applicability of skeletons
Our contribution
BSP++: Generic C++ BSP for shared/distributed memory
Quaff: DSEL for skeleton programming
B.SIMD: DSEL for portable SIMD programming
NT2: M like DSEL for scientific computing

NT2
A Scientific Computing Library
Provide a simple, M-like interface for users
Provide high-performance computing entities and primitives
Easily extendable
Components
Use Boost.SIMD for in-core optimizations
Use recursive parallel skeletons
Code is made independent of architecture and runtime

The Numerical Template Toolbox
Comparison to other libraries
Feature Armadillo Blaze Eigen MTL uBlas NT2
M-like API X X
BLAS/LAPACK binding X X X X X X
MAGMA binding X
SSE2+ support X X X X
AVX support X X X
AVX2 support X
Xeon Phi support X
Altivec support X X
ARM support X X
Threading support X
CUDA support X

Principles
table<T,S> is a simple, multidimensional array object that exactly mimics
M array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in M
How does it work
Take a .m file, copy to a .cpp file
Add #include <nt2/nt2.hpp> and do cosmetic changes
Compile the file and link with libnt2.a
??????
PROFIT!

NT2 - From M ...
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1');
rms = sqrt(sum(sqr(A1(:) - A2(:))) / numel(A1));

NT2 - ... to C++
table<double> A1 = _(1.,1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X = lu( mtimes(A1, trans(A1)) );
double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );

Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: the AST of the statement is split at the reduction into two skeleton calls]
tmp = sum(C+D);   (fold)
A = B / tmp;      (transform)
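The split shown on this slide can be written by hand with standard algorithms. This is a sketch under simplifying assumptions: the data is 1-D (so sum yields one scalar), the code is sequential, and divide_by_sum is a name made up for the example. It shows the two statements the extraction produces, a fold that absorbs the element-wise C+D, followed by a transform.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hand-written equivalent of A = B / sum(C + D) for 1-D data.
inline std::vector<double> divide_by_sum(std::vector<double> const& B,
                                         std::vector<double> const& C,
                                         std::vector<double> const& D)
{
  assert(C.size() == D.size());

  // Statement 1 (fold): tmp = sum(C + D).
  // The element-wise addition is evaluated inside the reduction,
  // so no temporary array for C + D is ever materialised.
  double const tmp = std::inner_product(C.begin(), C.end(), D.begin(), 0.0,
                                        std::plus<double>(),   // reduction step
                                        std::plus<double>());  // element-wise C + D

  // Statement 2 (transform): A = B / tmp.
  std::vector<double> A(B.size());
  std::transform(B.begin(), B.end(), A.begin(),
                 [tmp](double b) { return b / tmp; });
  return A;
}
```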

From data to task parallelism
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
From Skeletons to Actors
Upgrade NT2 to enable task parallelism
Adapt current skeletons for taskification
Use Futures (or HPX) to automatically create pipelines
Derive a dependency graph between statements
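As a rough illustration of the futures idea, the sketch below expresses A = B / sum(C+D) as two tasks linked by a future, using only std::async from the standard library. It is an assumption-laden stand-in, not how NT2 or HPX schedule work: the function name divide_by_sum_async is invented, the data is 1-D, and each statement is a single task rather than a tiled skeleton.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

inline std::vector<double> divide_by_sum_async(std::vector<double> const& B,
                                               std::vector<double> const& C,
                                               std::vector<double> const& D)
{
  assert(C.size() == D.size());

  // Task 1 (fold): tmp = sum(C + D). It does not read B.
  std::shared_future<double> tmp =
    std::async(std::launch::async, [&C, &D] {
      return std::inner_product(C.begin(), C.end(), D.begin(), 0.0,
                                std::plus<double>(), std::plus<double>());
    }).share();

  // Task 2 (transform): A = B / tmp. The future is the dependency edge:
  // the task blocks on tmp.get() only, with no global barrier.
  std::future<std::vector<double>> A =
    std::async(std::launch::async, [&B, tmp] {
      double const s = tmp.get();
      std::vector<double> out(B.size());
      std::transform(B.begin(), B.end(), out.begin(),
                     [s](double b) { return b / s; });
      return out;
    });

  return A.get();
}
```

With several statements, each one becomes a task of this kind and the futures passed between them form the dependency graph.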

Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: the fold and transform statements are tiled by column. A spawner<transform,OpenMP> dispatches, for each column k = 1..3, a worker<fold,simd> computing tmp(k) = sum(C(:,k) + D(:,k)) and a worker<transform,simd> computing A(:,k) = B(:,k) / tmp(k)]

Sigma-Delta Motion Detection
Context
Mono-modal algorithm based on background subtraction
Use local Gaussian model of lightness variation to detect motion
Target applications: robotics, video surveillance and analytics, defence
Challenge: Very low arithmetic density
Challenge: Integer-based implementation with small range

Motion Detection
NT2 Code
table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance
                       )
{
  // Estimate Raw Movement
  background = selinc( background < frame
                     , seldec( background > frame, background )
                     );
  table<char> diff = dist( background, frame );
  // Compute Local Variance
  table<char> sig3 = muls( diff, 3 );
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec( variance > sig3, variance )
                            )
                    , variance
                    );
  // Generate Movement Label
  return if_zero_else_one( diff < variance );
}

Motion Detection
Performance
[Figure: bar chart of cycles/element against image size (512x512 and 1024x1024) for SCALAR, HALF CORE, FULL CORE, SIMD, JRTIP2008, SIMD + HALF CORE and SIMD + FULL CORE, with speedup labels ranging from x2.1 to x18]

Black and Scholes Option Pricing
NT2 Code
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  // d1 = (ln(S/X) + (r + v^2/2) T) / (v sqrt(T))
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta)/(va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}

NT2 Code with loop fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges loop nests and increases cache locality
  tie(da,d1,d2,R) = tie( sqrt(Ta)
                       , (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta)/(va*da)
                       , d1 - va*da
                       , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                       );
  return R;
}
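What the fusion buys can be seen on a two-statement example in plain C++. This is a hand-written analogue, not NT2 output, and the function names are made up. The unfused version runs one loop per statement and stores the intermediate array in memory; the fused version computes the same result in one pass and keeps the intermediate value in a register.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: da = sqrt(t); r = v * da;  (two loops, one temporary array)
inline std::vector<float> scaled_roots_unfused(std::vector<float> const& t, float v)
{
  std::vector<float> da(t.size()), r(t.size());
  for(std::size_t i = 0; i < t.size(); ++i) da[i] = std::sqrt(t[i]);
  for(std::size_t i = 0; i < t.size(); ++i) r[i]  = v * da[i];
  return r;
}

// Fused: both statements share one loop, so each element of t is read once
// and the intermediate value never leaves the loop body.
inline std::vector<float> scaled_roots_fused(std::vector<float> const& t, float v)
{
  std::vector<float> r(t.size());
  for(std::size_t i = 0; i < t.size(); ++i)
  {
    float const da = std::sqrt(t[i]);
    r[i] = v * da;
  }
  return r;
}
```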

Performance
[Figure: bar chart of cycles/value at size 1000000 for scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores; speedup labels: x1.89, x2.91, x5.58, x6.30]

Performance with loop fusion
[Figure: bar chart of cycles/value at size 1000000 for scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores; speedup labels: x2.27, x4.13, x8.05, x11.12]

LU Decomposition
Algorithm
[Figure: tiled LU factorization of a 3x3 block matrix (A00 to A22), carried out in seven steps using the DGETRF, DGESSM, DTSTRF and DSSSSM kernels]

LU Decomposition
Performance
[Figure: median GFLOPS against number of cores (0 to 50) for an 8000 x 8000 LU decomposition, comparing NT2 and Intel MKL]

What we learn
Parallel Computing for Scientists
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism related problems while being easy to use.
Like regular languages, EDSLs need information about the hardware system
Integrating hardware descriptions as Generic components increases tools portability
and re-targetability
New Directions
Toward a global generic approach to parallelism
Turning hacks into language features

Generic Parallelism
Parallel C++ Concepts
Expand function hierarchization to Concepts
e.g.: DataParallel, AssociativeOperations, etc.
Use C++1y Concept overloading to split skeletons
Impact
Less work for the Skeleton users
Extendable through refinement
Static assertion of function properties
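Concept overloading was not available when this talk was given, so the sketch below emulates the idea with a type trait and tag dispatch. Everything here is hypothetical: is_associative stands in for an AssociativeOperation concept, and fold picks a tree-shaped reduction (the shape a parallel fold needs) only when the operation is declared associative, falling back to a strict left-to-right loop otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <vector>

// Hypothetical trait standing in for an AssociativeOperation concept.
template<typename Op> struct is_associative : std::false_type {};
template<typename T>  struct is_associative<std::plus<T>>       : std::true_type {};
template<typename T>  struct is_associative<std::multiplies<T>> : std::true_type {};

// Fallback: no known property, evaluate strictly left to right.
template<typename T, typename Op>
T fold_impl(std::vector<T> const& v, T init, Op op, std::false_type)
{
  for(T const& x : v) init = op(init, x);
  return init;
}

// Associative operation: combine adjacent pairs level by level.
// Each level's combinations are independent, hence parallelisable.
template<typename T, typename Op>
T fold_impl(std::vector<T> v, T init, Op op, std::true_type)
{
  std::size_t n = v.size();
  while(n > 1)
  {
    std::size_t const half = n / 2;
    for(std::size_t i = 0; i < half; ++i) v[i] = op(v[2 * i], v[2 * i + 1]);
    if(n % 2) v[half] = v[n - 1];  // odd element is carried to the next level
    n = half + n % 2;
  }
  return v.empty() ? init : op(init, v[0]);
}

// Single entry point: the property of Op selects the implementation statically.
template<typename T, typename Op>
T fold(std::vector<T> const& v, T init, Op op)
{
  return fold_impl(v, init, op, typename is_associative<Op>::type());
}

// Function properties can also be asserted at compile time.
static_assert(is_associative<std::plus<int>>::value, "plus is declared associative");
static_assert(!is_associative<std::minus<int>>::value, "minus is not");
```

The user calls fold the same way in both cases; new operations opt in by specializing the trait, which mirrors extension through refinement.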

New C++ Language Features
My C++ Christmas Land
Build lazy evaluation into the language
Interactions with generic functions are cumbersome
SIMD as part of the standard at type level
Current Work
Can sizeof inspire an ast_of operator?
Proposal N4035 for auto customization
Proposal N3571 for standard SIMD computation

Perspectives
At tools level
Prototype of single source GPU support
Work on distributed systems
Applications to Big Data
At language level
Formalize meta-programming
DSEL verification transference over C++
Interaction with polyhedral model
