Performing large, intensive or non-trivial computing on array like data structures is one of the most common task in scientific computing, video game development and other fields. This matter of fact is backed up by the large number of tools, languages and libraries to perform such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exists from BLAS/LAPACK C++ binding to template meta-programming based Blitz++ or Eigen. If all of these libraries provide good performance or good abstraction, none of them seems to fit the need of so many different user types.
Moreover, as parallel system complexity grows, the need to maintain all those components quickly become unwieldy. This talk explores various software design techniques - like Generative Programming, MetaProgramming and Generic Programming - and their application to the implementation of a parallel computing librariy in such a way that:
- abstraction and expressiveness are maximized - cost over efficiency is minimized
We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
(Costless) Software Abstractions for Parallel Architectures
1. (Costless) Software Abstractions for Parallel Architectures
numscale
Unlocked software performance
Joel Falcou
NumScale SAS – LRI - INRIA
CppCon – 01/07/2014
2. numscale
Unlocked software performance
Context
Decades of hardware improvements
Scientic Computing now drives most hardware innovations
Current Solution: Parallel architectures
Machines become more and more complex
2 of 40
3. numscale
Unlocked software performance
Context
Decades of hardware improvements
Scientic Computing now drives most hardware innovations
Current Solution: Parallel architectures
Machines become more and more complex
Example : A simple laptop
CPU: Intel Core i5-2410M (2.3 GHz) : 4 logical cores, AVX
4 logical cores
SIMD Extensions: SSE2-SSE4.2, AVX
GPU: NVIDIA GeForce GT 520M (48 CUDA cores)
2 of 40
4. numscale
Unlocked software performance
The Real Challenge of HPC
Single Core Era
Performance
Expressiveness
C/Fort.
C++
Java
Multi-Core/SIMD Era
Performance
Sequential
Expressiveness
SIMD
Threads
Heterogenous Era
Performance
Sequential
Expressiveness
GPU
Phi
SIMD
Threads
Distributed
3 of 40
6. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
4 of 40
7. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
4 of 40
8. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4 of 40
9. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
4 of 40
10. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
4 of 40
11. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
4 of 40
12. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
4 of 40
13. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)
4 of 40
14. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)
Use Parallel Skeletons as parallel components (4)
4 of 40
15. numscale
Unlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)
Use Parallel Skeletons as parallel components (4)
Use Generative Programming to deliver performance (5)
4 of 40
17. numscale
Unlocked software performance
Spotting abstraction when you see one
Why Parallel Programming Models ?
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
6 of 40
18. numscale
Unlocked software performance
Spotting abstraction when you see one
Why Parallel Programming Models ?
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Available Models
Performance centric: P-RAM, LOG-P, BSP
Data centric: HTA, PGAS
Pattern centric: Actors, Skeletons
6 of 40
19. numscale
Unlocked software performance
Spotting abstraction when you see one
Why Parallel Programming Models ?
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Available Models
Performance centric: P-RAM, LOG-P, BSP
Data centric: HTA, PGAS
Pattern centric: Actors, Skeletons
6 of 40
20. numscale
Unlocked software performance
Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as combination of such patterns
7 of 40
21. numscale
Unlocked software performance
Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as combination of such patterns
Functionnal point of view
Skeletons are Higher-Order Functions
Skeletons support a compositionnal semantic
Applications become composition of state-less functions
7 of 40
22. numscale
Unlocked software performance
Classic Parallel Skeletons
Data Parallel Skeletons
map: Apply a n-ary function in SIMD mode over subset of data
fold: Perform n-ary reduction over subset of data
scan: Perform n-ary prex reduction over subset of data
8 of 40
23. numscale
Unlocked software performance
Classic Parallel Skeletons
Data Parallel Skeletons
map: Apply a n-ary function in SIMD mode over subset of data
fold: Perform n-ary reduction over subset of data
scan: Perform n-ary prex reduction over subset of data
Task Parallel Skeletons
par: Independant task execution
pipe: Task dependency over time
farm: Load-balancing
8 of 40
24. numscale
Unlocked software performance
Why using Parallel Skeletons
Software Abstraction
Write without bothering with parallel details
Code is scalable and easy to maintain
Debuggable, Provable, Certiable
9 of 40
25. numscale
Unlocked software performance
Why using Parallel Skeletons
Software Abstraction
Write without bothering with parallel details
Code is scalable and easy to maintain
Debuggable, Provable, Certiable
Hardware Abstraction
Semantic is set, implementation is free
Composability )Hierarchical architecture
9 of 40
27. numscale
Unlocked software performance
Generative Programming as a Tool
Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming
11 of 40
28. numscale
Unlocked software performance
Generative Programming as a Tool
Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming
11 of 40
29. numscale
Unlocked software performance
Generative Programming as a Tool
Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming
Denition of Meta-programming
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
11 of 40
30. numscale
Unlocked software performance
Generative Programming as a Tool
Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming
Denition of Meta-programming
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
C++ meta-programming
Relies on the C++ sub-language
Handles types and integral constants at compile-time
Proved to be Turing-complete
11 of 40
31. numscale
Unlocked software performance
Domain Specic Embedded Languages
What’s an DSEL ?
DSL = Domain Specic Language
Declarative language, easy-to-use, tting the domain
DSEL = DSL within a general purpose language
EDSL in C++
Relies on operator overload abuse (Expression Templates)
Carry semantic information around code fragment
Generic implementation become self-aware of optimizations
Exploiting static AST
At the expression level: code generation
At the function level: inter-procedural optimization
12 of 40
32. numscale
Unlocked software performance
Embedded Domain Specic Languages
EDSL in C++
Relies on operator overload abuse – see B.P
Carry semantic information around code fragment
Generic implementation become self-aware of optimizations
Advantages
Allow introduction of DSLs without disrupting dev. chain
Semantic dened as type informations means compile-time resolution
Access to a large selection of runtime binding
13 of 40
36. numscale
Unlocked software performance
Parallel DSEL in practice
Objectives
Apply DSEL generation techniques for different kind of hardware
Demonstrate low cost of abstractions
Demonstrate applicability of skeletons
17 of 40
37. numscale
Unlocked software performance
Parallel DSEL in practice
Objectives
Apply DSEL generation techniques for different kind of hardware
Demonstrate low cost of abstractions
Demonstrate applicability of skeletons
Our contribution
BSP++ : Generic C++ BSP for shared/distributed memory
Quaff: DSEL for skeleton programming
B.SIMD: DSEL for portable SIMD programming
NT2: M like DSEL for scientic computing
17 of 40
38. numscale
Unlocked software performance
Parallel DSEL in practice
Objectives
Apply DSEL generation techniques for different kind of hardware
Demonstrate low cost of abstractions
Demonstrate applicability of skeletons
Our contribution
BSP++ : Generic C++ BSP for shared/distributed memory
Quaff: DSEL for skeleton programming
B.SIMD: DSEL for portable SIMD programming
NT2: M like DSEL for scientic computing
17 of 40
39. numscale
Unlocked software performance
NT2
A Scientic Computing Library
Provide a simple, M-like interface for users
Provide high-performance computing entities and primitives
Easily extendable
Components
Use Boost.SIMD for in-core optimizations
Use recursive parallel skeletons
Code is made independant of architecture and runtime
18 of 40
40. numscale
Unlocked software performance
The Numerical Template Toolbox
Comparison to other libraries
Feature Armadillo Blaze Eigen MTL uBlas NT2
M-like API X X
BLAS/LAPACK binding X X X X X X
MAGMA binding X
SSE2+ support X X X X
AVX support X X X
AVX2 support X
Xeon Phi support X
Altivec support X X
ARM support X X
Threading support X
CUDA support X
19 of 40
41. numscale
Unlocked software performance
The Numerical Template Toolbox
Principles
tableT,S is a simple, multidimensional array object that exactly mimics
M array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in M
20 of 40
42. numscale
Unlocked software performance
The Numerical Template Toolbox
Principles
tableT,S is a simple, multidimensional array object that exactly mimics
M array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in M
How does it works
Take a .m le, copy to a .cpp le
20 of 40
43. numscale
Unlocked software performance
The Numerical Template Toolbox
Principles
tableT,S is a simple, multidimensional array object that exactly mimics
M array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in M
How does it works
Take a .m le, copy to a .cpp le
Add #include nt2/nt2.hpp and do cosmetic changes
20 of 40
44. numscale
Unlocked software performance
The Numerical Template Toolbox
Principles
tableT,S is a simple, multidimensional array object that exactly mimics
M array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in M
How does it works
Take a .m le, copy to a .cpp le
Add #include nt2/nt2.hpp and do cosmetic changes
Compile the le and link with libnt2.a
20 of 40
45. numscale
Unlocked software performance
The Numerical Template Toolbox
Principles
tableT,S is a simple, multidimensional array object that exactly mimics
M array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in M
How does it works
Take a .m le, copy to a .cpp le
Add #include nt2/nt2.hpp and do cosmetic changes
Compile the le and link with libnt2.a
??????
PROFIT!
20 of 40
47. numscale
Unlocked software performance
NT2 - ... to C++
table double A1 = _ (1. ,1000.) ;
table double A2 = A1 + randn ( size ( A1 ) ) ;
table double X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ;
d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ;
22 of 40
48. numscale
Unlocked software performance
Parallel Skeletons extraction process
A = B / sum(C+D);
=
A =
B sum
+
C D
fold
transform
23 of 40
49. numscale
Unlocked software performance
Parallel Skeletons extraction process
A = B / sum(C+D);
; ;
=
A =
B sum
+
C D
fold
transform
=
tmp sum
+
C D
fold
)
=
A =
B tmp
transform
24 of 40
50. numscale
Unlocked software performance
From data to task parallelism
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
25 of 40
51. numscale
Unlocked software performance
From data to task parallelism
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
From Skeletons to Actors
Upgrade NT2 to enable task parallelism
Adapt current skeletons for taskication
Use Futures ( or HPX)to automatically create pipelines
Derive a dependency graph between statements
25 of 40
52. numscale
Unlocked software performance
Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
; ;
=
tmp sum
+
C D
fold
=
A =
B tmp
transform
26 of 40
54. numscale
Unlocked software performance
Sigma-Delta Motion Detection
Context
Mono-modal algorithm based on background substraction
Use local gaussian model of lightness variation to detect motion
Target applications: robotic, video survey and analytics, defence
Challenge: Very low arithmetic density
Challenge: Integer-based implementation with small range
28 of 40
55. numscale
Unlocked software performance
Motion Detection
NT2 Code
table char s i g m a _ d e l t a ( table char b a c k g r o u n d
, table char const frame
, table char v a r i a n c e
)
{
// E s t i m a t e Raw M o v e m e n t
b a c k g r o u n d = s e l i n c ( b a c k g r o u n d frame
, s e l d e c ( b a c k g r o u n d frame , b a c k g r o u n d )
) ;
table char diff = dist ( background , frame ) ;
// C o m p u t e Local V a r i a n c e
table char sig3 = muls ( diff ,3) ;
var = i f _ e l s e ( diff != 0
, s e l i n c ( v a r i a n c e sig3
, s e l d e c ( var sig3 , v a r i a n c e )
)
, v a r i a n c e
) ;
// G e n e r a t e M o v e m e n t Label
r e t u r n i f _ z e r o _ e l s e _ o n e ( diff v a r i a n c e ) ;
}
29 of 40
57. numscale
Unlocked software performance
Black and Scholes Option Pricing
NT2 Code
table float b l a c k s c h o l e s ( table float const Sa , table float const Xa
, table float const Ta
, table float const ra , table float const va
)
{
table float da = sqrt ( Ta ) ;
table float d1 = log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) ;
table float d2 = d1 - va * da ;
r e t u r n Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ;
}
31 of 40
58. numscale
Unlocked software performance
Black and Scholes Option Pricing
NT2 Code with loop fusion
table float b l a c k s c h o l e s ( table float const Sa , table float const Xa
, table float const Ta
, table float const ra , table float const va
)
{
// P r e a l l o c a t e t e m p o r a r y t a b l e s
table float da ( e x t e n t ( Ta ) ) , d1 ( e x t e n t ( Ta ) ) , d2 ( e x t e n t ( Ta ) ) , R ( e x t e n t ( Ta ) ) ;
// tie merge loop nest and i n c r e a s e cache l o c a l i t y
tie ( da , d1 , d2 , R ) = tie ( sqrt ( Ta )
, log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da )
, d1 - va * da
, Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 )
) ;
r e t u r n R ;
}
31 of 40
62. numscale
Unlocked software performance
LU Decomposition
Performance
0 10 20 30 40 50
100
50
0
Number of cores
Median GFLOPS
8000 8000 LU decomposition
NT2
Intel MKL
35 of 40
63. numscale
Unlocked software performance
What we learn
Parallel Computing for Scientist
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism related problems while being easy to use.
Like regular language, EDSL needs informations about the hardware system
Integrating hardware descriptions as Generic components increases tools portability
and re-targetability
36 of 40
64. numscale
Unlocked software performance
What we learn
Parallel Computing for Scientist
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism related problems while being easy to use.
Like regular language, EDSL needs informations about the hardware system
Integrating hardware descriptions as Generic components increases tools portability
and re-targetability
New Directions
Toward a global generic approach to parallelism
Turning hacks into language features
36 of 40
65. numscale
Unlocked software performance
Generic Parallelism
Parallel C++ Concepts
Expand function hierarchization to Concepts
e.g : DataParallel, AssociativeOperations, etc.
Use C++1y Concept overloading to split skeletons
37 of 40
66. numscale
Unlocked software performance
Generic Parallelism
Parallel C++ Concepts
Expand function hierarchization to Concepts
e.g : DataParallel, AssociativeOperations, etc.
Use C++1y Concept overloading to split skeletons
Impact
Less work for the Skeleton users
Extendable through renement
Static assertion of function properties
37 of 40
67. numscale
Unlocked software performance
New C++ Language Features
My C++ Christmas Land
Build lazy evaluation into the language
Interactions with generic function is cumbersome
SIMD as part of the standard at type level
38 of 40
68. numscale
Unlocked software performance
New C++ Language Features
My C++ Christmas Land
Build lazy evaluation into the language
Interactions with generic function is cumbersome
SIMD as part of the standard at type level
Current Work
Can sizeof inspires an ast_of operator
Proposal N4035 for auto customization
Proposal N3571 for standard SIMD computation
38 of 40
69. numscale
Unlocked software performance
Perspectives
At tools level
Prototype of single source GPU support
Work on distributed systems
Applications to Big Data
39 of 40
70. numscale
Unlocked software performance
Perspectives
At tools level
Prototype of single source GPU support
Work on distributed systems
Applications to Big Data
At language level
Formalize meta-programming
DSEL verication transferance over C++
Interaction with polyhedral model
39 of 40