Software Abstractions for Parallel Hardware

Software Abstractions for Parallel Architectures
Joel Falcou
LRI - CNRS - INRIA
HDR Thesis Defense
12/01/2014

The Paradigm Change in Science
From Experiments to Simulations
Simulation is now an integral
part of the Scientific Method
Scientific Computing enables
larger, faster, more accurate
Research
Fast Simulation is Time Travel as
scientific results are now more
readily available
Local Galaxy Cluster Simulation - Illustris project
Computing is first and foremost a mainstream science tool

The Paradigm Change in Science
The Parallel Hell
Heat Wall: Growing cores
instead of GHz
Hierarchical and heterogeneous
parallel systems are the norm
The Free Lunch is over as
hardware complexity rises faster
than the average developer's skills
Local Galaxy Cluster Simulation - Illustris project
The real challenge in HPC is the Expressiveness/Efficiency War

The Expressiveness/Efficiency War
[Figure: three Performance vs. Expressiveness charts, one per era]
Single Core Era: C/Fortran, C++, Java
Multi-Core/SIMD Era: Sequential, SIMD, Threads
Heterogeneous Era: Sequential, GPU, Phi, SIMD, Threads, Distributed
As parallel systems complexity grows, the expressiveness gap turns into an ocean

Designing tools for Scientic Computing
Objectives
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specic Embedded Languages (DSEL) (2+3)
Use Parallel Programming Abstractions as parallel components (4)
Use Generative Programming to deliver performance (5)

Talk Layout
Introduction
Abstractions Efficiency
Experimental Results
Conclusion

Why Parallel Programming Models ?
Limits of regular tools
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Contribute to the Expressiveness Gap
Available Models
Performance centric: P-RAM, LOG-P, BSP
Pattern centric: Futures, Skeletons
Data centric: HTA, PGAS

Bulk Synchronous Parallelism [Valiant, McColl 90]
Principles
Machine Model
Execution Model
Analytic Cost Model
[Figure: BSP Execution Model. Processors P0 to P3 run supersteps T and T+1; each superstep is a compute phase (cost Wmax), a communication phase (cost h.g) and a barrier (cost L).]
Advantages
Simple set of primitives
Implementable on any
kind of hardware
Possibility to reason about
BSP programs
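The analytic cost model is what makes this kind of reasoning possible. Following the labels of the execution-model figure, one superstep costs Wmax + h.g + L. The sketch below only illustrates that formula; the names bsp_machine and superstep_cost are assumptions made for this example and are not taken from BSP++ or any other library.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of a BSP machine (illustrative names).
struct bsp_machine
{
  double g; // cost of communicating one word
  double L; // cost of one barrier synchronization
};

// Cost of one superstep: Wmax + h*g + L, where Wmax is the largest
// local computation time and h the largest number of words sent or
// received by any processor.
double superstep_cost(std::vector<double> const& work, double h, bsp_machine const& m)
{
  assert(!work.empty());
  double w_max = *std::max_element(work.begin(), work.end());
  return w_max + h * m.g + m.L;
}
```

The cost of a whole program is then the sum of its superstep costs, which is why balancing the local work and limiting h are the two levers the model exposes.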

Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support compositional semantics
Applications become compositions of stateless functions
Classical Skeletons
Data parallel: map, fold, scan
Task parallel: par, pipe, farm
More complex: Distributable Homomorphism, Divide and Conquer, …
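The higher-order-function view of skeletons can be sketched in a few lines of C++. This is a sequential reference sketch only: the names map_skel, fold_skel and pipe_skel are hypothetical, and a real skeleton library would provide parallel implementations behind the same interface.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// map: apply f independently to every element (data parallel)
template<class F>
std::vector<double> map_skel(F f, std::vector<double> const& in)
{
  std::vector<double> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(), f);
  return out;
}

// fold: reduce all elements with a binary operation
template<class Op>
double fold_skel(Op op, double init, std::vector<double> const& in)
{
  return std::accumulate(in.begin(), in.end(), init, op);
}

// pipe: compose two stages; the output of f feeds g
template<class F, class G>
auto pipe_skel(F f, G g)
{
  return [f, g](std::vector<double> const& in) { return g(f(in)); };
}
```

Because each skeleton is a stateless function of its inputs, an application such as a sum of squares is just pipe_skel applied to a map stage and a fold stage.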

Relevance to our Objectives
Why use Parallel Skeletons?
Write code independent of parallel programming minutiae
Composability supports hierarchical architectures
Code is scalable and easy to maintain
Why use BSP?
The cost model guides development
Few primitives mean that the intellectual burden is low
Good medium for developing skeletons
How to ensure performance of those models’ implementations ?

Domain Specic Embedded Languages
Domain Specic Languages
Non-Turing complete declarative languages
Solve a single type of problems
Express what to do instead of how to do it
E.g: SQL, M, M, …
From DSL to DSEL [Abrahams 2004]
A DSL incorporates domain-specic notation, constructs, and abstractions as
fundamental design considerations.
A Domain Specic Embedded Languages (DSEL) is simply a library that meets the
same criteria
Generative Programming is one way to design such libraries

Generative Programming [Eisenecker 97]
[Figure: a Domain Specific Application Description is fed to a Generative Component (a Translator plus Parametric Sub-components), which produces the Concrete Application]

Meta-programming as a Tool
Denition
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
Meta-programmable Languages
metaOCAML : runtime code generation via code quoting
Template Haskell : compile-time code generation via templates
C++ : compile-time code generation via templates
C++ meta-programming
Relies on the Turing-complete C++  sub-language
Handles types and integral constants at compile-time
 classes and functions act as code quoting
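A minimal sketch of what "handling types and integral constants at compile-time" looks like in practice. The names factorial and accumulator_of are illustrative only; the point is that both the value and the type are computed by the compiler, with nothing left to do at run time.

```cpp
#include <type_traits>

// Integral constant computed at compile time through recursive
// template instantiation.
template<unsigned N>
struct factorial
  : std::integral_constant<unsigned long, N * factorial<N - 1>::value> {};

// Recursion stops at the explicit specialization for 0.
template<>
struct factorial<0> : std::integral_constant<unsigned long, 1> {};

// Type computed at compile time: choose a wider accumulator for float.
template<class T> struct accumulator_of        { using type = T; };
template<>        struct accumulator_of<float> { using type = double; };

// Both checks are evaluated by the compiler.
static_assert(factorial<5>::value == 120, "value computed at compile time");
static_assert(std::is_same<accumulator_of<float>::type, double>::value,
              "type computed at compile time");
```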

The Expression Templates Idiom
Principles
Relies on extensive operator
overloading
Carries semantic information
around code fragment
Introduces DSLs without
disrupting dev. chain
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);
expr<assign
    ,expr<matrix&>
    ,expr<plus
         ,expr<cos
              ,expr<matrix&>
              >
         ,expr<multiplies
              ,expr<matrix&>
              ,expr<matrix&>
              >
         >
    >(x,a,b);
[Figure: AST of the statement: = at the root with children x and +; + has children cos(a) and *; * has children b and a]
#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i))
           + ( b(j,i)
             * a(j,i)
             );
  }
}
Arbitrary Transforms applied on the meta-AST
General Principles of Expression Templates
Advantages
Generic implementation becomes
self-aware of optimizations
API abstraction level is arbitrarily
high
Accessible through high-level
tools like Boost.Proto

Our Contributions
Our Strategy
Applies DSEL generation techniques to parallel programming
Maintains low cost of abstractions through meta-programming
Maintains abstraction level via modern library design
Our contributions
Tools    Pub.      Scope                 Applications
Quaff    ParCo’06  MPI Skeletons         Real-time 3D reconstruction
SkellPU  PACT’08   Skeleton on Cell BE   Real-time Image processing
BSP++    IJPP’12   MPI/OpenMP BSP        Bioinformatics, Model Checking
NT2      JPDC’14   Data Parallel Matlab  Fluid Dynamics, Vision

Example of BSP++ Application
Khaled Hamidouche, PhD 2008-2011, in collaboration with Univ. Brasilia
BSP Smith Waterman
SW computes DNA sequence alignments
BSP++ implementation was written once and run on 7 different hardware platforms
Efficiency of 95+% even on a 6000-core supercomputer
Platform              MaxSize     # Elements      Speedup  GCUPs
cluster (MPI)         1,072,950   128 cores       73x      6.53
cluster (MPI/OpenMP)  1,072,950   128 cores       116x     10.41
OpenMP                1,072,950   16 cores        16x      0.40
CellBE                85,603      8 SPEs          -        0.14
cluster of CellBEs    85,603      24 SPEs (8:24)  2.8x     0.37
Hopper (MPI)          5,303,436   3072 cores      260x     3.09
Hopper (MPI+OpenMP)   24,894,269  6144 cores      5664x    15.5
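For readers unfamiliar with the kernel being measured, here is a sequential reference sketch of Smith-Waterman local alignment scoring. It is not the BSP++ implementation; the function name and the scoring parameters (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration. GCUPs in the table counts billions of updates of the inner-loop cell per second.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Best local alignment score between a and b (Smith-Waterman).
// Only two rows of the dynamic programming matrix are kept.
int smith_waterman_score(std::string const& a, std::string const& b,
                         int match = 2, int mismatch = -1, int gap = -1)
{
  std::vector<int> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
  int best = 0;
  for(std::size_t i = 1; i <= a.size(); ++i)
  {
    curr[0] = 0;
    for(std::size_t j = 1; j <= b.size(); ++j)
    {
      // One cell update: extend diagonally, open a gap, or restart at 0
      int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
      int up   = prev[j] + gap;
      int left = curr[j - 1] + gap;
      curr[j]  = std::max({0, diag, up, left});
      best     = std::max(best, curr[j]);
    }
    std::swap(prev, curr);
  }
  return best;
}
```

Each cell depends only on its left, upper and upper-left neighbours, so blocks of the matrix can be computed in a wavefront, which is the kind of structure a superstep-based parallelization can exploit.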

Second Look at our Contributions
Development Limitations
DSELs are mostly tied to the domain model
Architecture support is often an afterthought
Extensibility is difficult as many refactorings are required per architecture
Example: No proper way to support GPUs with those implementation techniques
Proposed Method
Extends Generative Programming to take the target architecture into account
Provides an architecture description DSEL
Integrates this description in the code generation process
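One common way to feed an architecture description into code generation is tag dispatching over a hierarchy of architecture tags. The sketch below is an assumption-laden illustration of that general technique: the arch namespace, the tag names and the kernel/generate functions are all hypothetical and do not reproduce the actual NT2 mechanism.

```cpp
#include <string>

namespace arch
{
  // Architecture description as a type hierarchy: each tag refines
  // its parent, so missing specializations fall back automatically.
  struct cpu_  {};
  struct sse2_ : cpu_  {};
  struct avx_  : sse2_ {};
}

// One overload per architecture that has a dedicated implementation.
// The strings stand in for the code each variant would generate.
inline std::string kernel(arch::cpu_)  { return "scalar loop"; }
inline std::string kernel(arch::sse2_) { return "sse2 loop"; }

// The generator is parameterized by the architecture tag; overload
// resolution picks the closest available implementation at compile time.
template<class Arch>
std::string generate() { return kernel(Arch()); }
```

Here generate<arch::avx_>() selects the SSE2 variant because no AVX overload exists and sse2_ is the nearest base class. Adding a new architecture then means adding a tag and the overloads that benefit from it, without touching the domain-level code.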

Architecture Aware Generative Programming

Software refactoring
Tools       Issues                     Changes
Quaff       Raw skeletons API          Re-engineered as part of NT2
SkellPU     Too architecture specific  Re-engineered as part of NT2
BSP++       Integration issues         Integrate hybrid code generation
NT2         Not easily extendable      Integrate Quaff Skeleton models
Boost.SIMD  -                          Side product of NT2 restructuring
Conclusion
Skeletons are ne as parallel middleware
Model based abstractions are not high level enough
For low level architectures, the simplest model is often the best

Boost.SIMD
Pierre Estérie PHD 2010-2014
Principles
Provides simple C++ API over SIMD
extensions
Supports every Intel, PPC and ARM
instruction set
Fully integrates with modern C++
idioms
Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. Wang

The Numerical Template Toolbox
Pierre Estérie PHD 2010-2014
NT2 as a Scientic Computing Library
Provides a simple, M-like interface for users
Provides high-performance computing entities and primitives
Is easily extendable
Components
Uses Boost.SIMD for in-core optimizations
Uses recursive parallel skeletons
Supports task parallelism through Futures

Principles
table<T,S> is a simple, multidimensional array object that exactly
mimics Matlab array behavior and functionality
500+ functions usable directly either on table or on any scalar values
as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and make cosmetic changes
Compile the file and link with libnt2.a

NT2 - From M to C++
M code
A1 = 1 : 1 0 0 0 ;
A2 = A1 + randn ( size ( A1 ) ) ;
X = lu ( A1 * A1 ’) ;
rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ;
NT2 code
table double A1 = _ (1. ,1000.) ;
table double A2 = A1 + randn ( size ( A1 ) ) ;
table double X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ;
d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ;

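Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: AST of the statement, with = at the root, A on the left and B / sum(C+D) on the right. The sum(C+D) subtree is tagged as a fold skeleton and the enclosing assignment as a transform skeleton. The statement is then split in two: tmp = sum(C+D), evaluated as a fold, followed by A = B / tmp, evaluated as a transform.]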
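The result of the extraction can be written by hand as two standard algorithm calls. This is a sequential sketch under the simplifying assumption that B, C and D are one-dimensional, so sum yields a single scalar; the function name divide_by_sum is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// A = B / sum(C + D), split at the fold/transform boundary.
std::vector<double> divide_by_sum(std::vector<double> const& B,
                                  std::vector<double> const& C,
                                  std::vector<double> const& D)
{
  assert(C.size() == D.size());

  // fold skeleton: tmp = sum(C + D); the element-wise + is fused
  // into the reduction, so C + D is never stored
  double tmp = std::inner_product(C.begin(), C.end(), D.begin(), 0.0,
                                  std::plus<double>(), std::plus<double>());

  // transform skeleton: A = B / tmp
  std::vector<double> A(B.size());
  std::transform(B.begin(), B.end(), A.begin(),
                 [tmp](double b) { return b / tmp; });
  return A;
}
```

The split is needed because the fold must finish before any element of A can be computed, while each of the two resulting statements maps onto exactly one skeleton.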

From data to task parallelism
Antoine Tran Tan PHD, 2012-2015
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
Skeletons from the Future
Adapt current skeletons for taskification
Use Futures (or HPX) to automatically pipeline
Derive a dependency graph between statements
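A rough sketch of the idea using standard C++ futures, applied column by column to A = B / sum(C+D). The function name and the column-per-task granularity are assumptions for illustration; this is not the NT2 code generator. Each transform task waits only on the fold of its own column, so there is no global barrier between the two statements.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

using column = std::vector<double>;

// For every column k: A(:,k) = B(:,k) / sum(C(:,k) + D(:,k))
std::vector<column> divide_by_column_sum(std::vector<column> const& B,
                                         std::vector<column> const& C,
                                         std::vector<column> const& D)
{
  std::vector<std::future<column>> pending;
  for(std::size_t k = 0; k < B.size(); ++k)
  {
    // fold task: tmp(k) = sum(C(:,k) + D(:,k))
    std::shared_future<double> tmp = std::async([&C, &D, k] {
      return std::inner_product(C[k].begin(), C[k].end(), D[k].begin(), 0.0,
                                std::plus<double>(), std::plus<double>());
    }).share();

    // transform task: its only dependency is tmp(k)
    pending.push_back(std::async([&B, k, tmp] {
      double s = tmp.get();
      column a(B[k].size());
      std::transform(B[k].begin(), B[k].end(), a.begin(),
                     [s](double b) { return b / s; });
      return a;
    }));
  }

  // Collect the results in column order
  std::vector<column> A;
  for(auto& f : pending) A.push_back(f.get());
  return A;
}
```

The pairs (fold k, transform k) form the dependency graph: transforms of early columns can run while folds of later columns are still in flight.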

Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: the fold statement tmp = sum(C+D) and the transform statement A = B / tmp are each split into column-wise tasks, tmp(k) = sum(C(:,k) + D(:,k)) and A(:,k) = B(:,k) / tmp(k) for k = 1 to 3. Tasks are launched by spawner<transform,OpenMP> and executed by worker<fold,simd> and worker<transform,simd>.]

Motion Detection
Lacassagne et al., ICIP 2009
Sigma-Delta algorithm based on background subtraction
Uses a local Gaussian model of lightness variation to detect motion
Challenge: Very low arithmetic density
Challenge: Integer-based implementation with small range

Motion Detection
table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance
                       )
{
  // Estimate Raw Movement
  background = selinc( background < frame
                     , seldec(background > frame, background)
                     );
  table<char> diff = dist(background, frame);
  // Compute Local Variance
  table<char> sig3 = muls(diff,3);
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec(variance > sig3, variance)
                            )
                    , variance
                    );
  // Generate Movement Label
  return if_zero_else_one( diff < variance );
}
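To make the array code easier to follow, here is a scalar, per-pixel sketch of the same update. It assumes that selinc and seldec conditionally increment and decrement their second argument, that dist is the absolute difference and that muls is a saturating multiply; the saturation bound of 255 and the names pixel_state and sigma_delta_pixel are illustrative choices, not part of NT2.

```cpp
#include <algorithm>
#include <cstdlib>

// Per-pixel state carried from frame to frame (hypothetical type).
struct pixel_state
{
  int background;
  int variance;
};

// One Sigma-Delta step for a single pixel.
// Returns 1 when motion is detected, 0 otherwise.
int sigma_delta_pixel(pixel_state& s, int frame)
{
  // Estimate raw movement: move the background one step towards the frame
  if(s.background < frame)      ++s.background;
  else if(s.background > frame) --s.background;
  int diff = std::abs(s.background - frame);

  // Compute local variance: track 3 * diff, one step at a time
  int sig3 = std::min(3 * diff, 255); // saturating multiply
  if(diff != 0)
  {
    if(s.variance < sig3)      ++s.variance;
    else if(s.variance > sig3) --s.variance;
  }

  // Generate movement label
  return diff < s.variance ? 0 : 1;
}
```

Each pixel needs only a few comparisons and increments per byte loaded, which is why the slide calls out the very low arithmetic density.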

Motion Detection
[Figure: cycles/element (0 to 18) against image size (512x512 and 1024x1024) for the SCALAR, HALF CORE, FULL CORE, SIMD, JRTIP2008, SIMD + HALF CORE and SIMD + FULL CORE versions. Speedup labels: x6.8, x14.8, x16.5, x2.1, x3.6, x6.7, x15.3, x18, x2.3, x3.99, x10.8, x10.8.]

Black and Scholes Option Pricing
NT2 Code
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta/(va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}

NT2 Code with loop fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta/(va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                          );
  return R;
}
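What loop fusion buys can be seen in a hand-written scalar equivalent: all four quantities are computed in one pass, so da, d1 and d2 stay in registers instead of being written to and re-read from full-size temporaries. This sketch is an assumption-based illustration, not generated NT2 code. It uses double precision and the textbook call-price formula, d1 = (log(S/X) + (r + v*v/2)*T) / (v*sqrt(T)), and the names blackscholes_fused and normcdf are defined here for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal cumulative distribution function
inline double normcdf(double x)
{
  return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Fused evaluation: one loop, no intermediate arrays.
std::vector<double> blackscholes_fused(std::vector<double> const& S,
                                       std::vector<double> const& X,
                                       std::vector<double> const& T,
                                       std::vector<double> const& r,
                                       std::vector<double> const& v)
{
  std::vector<double> R(S.size());
  for(std::size_t i = 0; i < S.size(); ++i)
  {
    double da = std::sqrt(T[i]);
    double d1 = (std::log(S[i] / X[i]) + (v[i] * v[i] * 0.5 + r[i]) * T[i])
              / (v[i] * da);
    double d2 = d1 - v[i] * da;
    R[i] = S[i] * normcdf(d1) - X[i] * std::exp(-r[i] * T[i]) * normcdf(d2);
  }
  return R;
}
```

Without fusion, each of the four statements would be its own loop over the whole data set, with the intermediate tables travelling through the cache hierarchy between loops.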

Performance
[Figure: Black and Scholes performance in cycles/value (0 to 150) for a size of 1,000,000 values: scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores. Speedup labels: x1.89, x2.91, x5.58, x6.30.]

Performance with loop fusion/futurisation
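[Figure: Black and Scholes performance with loop fusion and futurisation in cycles/value (0 to 150) for a size of 1,000,000 values: scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores. Speedup labels: x2.27, x4.13, x8.05, x11.12.]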

LU Decomposition
Algorithm
[Figure: tiled LU decomposition of a 3x3 block matrix (tiles A00 to A22) in seven steps, using the DGETRF, DGESSM, DTSTRF and DSSSSM kernels.]
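For reference, the factorization that the tiled algorithm computes block by block is an ordinary LU decomposition. The sketch below is a minimal unblocked version without pivoting, which is an assumption made for brevity (LAPACK's DGETRF uses partial pivoting); the name lu_inplace is hypothetical and this is not the NT2 implementation.

```cpp
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// In-place LU factorization without pivoting (Doolittle form).
// On return, the strict lower triangle holds L (unit diagonal implied)
// and the upper triangle holds U.
void lu_inplace(matrix& a)
{
  std::size_t n = a.size();
  for(std::size_t k = 0; k < n; ++k)
  {
    for(std::size_t i = k + 1; i < n; ++i)
    {
      a[i][k] /= a[k][k];                // multiplier, stored as L(i,k)
      for(std::size_t j = k + 1; j < n; ++j)
        a[i][j] -= a[i][k] * a[k][j];    // update the trailing submatrix
    }
  }
}
```

The tiled variant applies the same three actions (factor the diagonal, scale the panel, update the trailing part) to tiles instead of scalars, which exposes independent tile updates as tasks.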

LU Decomposition
Performance
[Figure: median GFLOPS (0 to 100) against number of cores (0 to 50) for an 8000 x 8000 LU decomposition, NT2 versus Intel MKL.]

Conclusion
Parallel Computing for Scientists
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism-related problems while being easy to use.
Like regular languages, DSELs need information about the hardware system
Integrating hardware descriptions as Generic components increases tool portability
and re-targetability
Our Achievements
A new method for parallel software development
Efficient libraries working on a large subset of hardware
High level of performance across a wide application spectrum

Works in Progress
Application to Accelerators
Exploration of proper skeleton implementation on GPUs
Adaptation of Future based code generator
In progress with Ian Masliah's PhD thesis
Parallelism within C++
SIMD as part of the standard library
Proposal N3571 for standard SIMD computation
Interoperability with the current parallel model of C++

Perspectives
DSEL as C++ rst class idiom
Build partial evaluation into the language
Ease transition between regular and meta C++
Mid-term Prospect: metaOCAML like quoting for C++
DSEL and compilers relationship
C++ DSEL hits a limit on their applicability
Compilers often lack high level informations for proper optimization
Mid-term Prospect: Hybrid library/compiler approaches for DSEL
