Performing large, intensive or non-trivial computations on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This fact is borne out by the large number of tools, languages and libraries built to perform such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to the template meta-programming based Blitz++ or Eigen.
While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different user types. Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This thesis explores various software design techniques - like Generative Programming, Meta-Programming and Generic Programming - and their application to the implementation of various parallel computing libraries, in such a way that abstraction and expressiveness are maximized while the efficiency overhead is minimized.
HDR Defence - Software Abstractions for Parallel Architectures
1. Software Abstractions for Parallel Architectures
Joel Falcou
LRI - CNRS - INRIA
HDR Thesis Defense
12/01/2014
2. The Paradigm Change in Science
From Experiments to Simulations
Simulation is now an integral
part of the Scientific Method
Scientific Computing enables
larger, faster, more accurate
Research
Fast Simulation is Time Travel, as
scientific results are now more
readily available
Local Galaxy Cluster Simulation - Illustris project
Computing is first and foremost a mainstream science tool
2 of 41
3. The Paradigm Change in Science
The Parallel Hell
Heat Wall: Growing cores
instead of GHz
Hierarchical and heterogeneous
parallel systems are the norm
The Free Lunch is over, as
hardware complexity rises faster
than the average developer's skills
The real challenge in HPC is the Expressiveness/Efficiency War
2 of 41
4. The Expressiveness/Efficiency War
Single Core Era
Performance
Expressiveness
C/Fort.
C++
Java
Multi-Core/SIMD Era
Performance
Expressiveness
Sequential
Threads
SIMD
Heterogenous Era
Performance
Expressiveness
Sequential
SIMD
Threads
GPU
Phi
Distributed
As parallel system complexity grows, the expressiveness gap turns into an ocean
3 of 41
6. Designing tools for Scientific Computing
Objectives
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specific Embedded Languages (DSELs) (2+3)
Use Parallel Programming Abstractions as parallel components (4)
Use Generative Programming to deliver performance (5)
4 of 41
9. Why Parallel Programming Models?
Limits of regular tools
Unstructured parallelism is error-prone
Low-level parallel tools are non-composable
Contribute to the Expressiveness Gap
Available Models
Performance centric: P-RAM, LOG-P, BSP
Pattern centric: Futures, Skeletons
Data centric: HTA, PGAS
6 of 41
11. Bulk Synchronous Parallelism [Valiant, McColl 90]
Principles
Machine Model
Execution Model
Analytic Cost Model
[Figure: BSP Execution Model. Supersteps T and T+1 run on processors P0-P3; each superstep is a Compute phase (Wmax), a Comm phase (h.g) and a Barrier (L).]
7 of 41
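In the usual BSP notation (matching the Wmax, h.g and L labels of the execution model), the cost of one superstep is:

```latex
T_{\text{superstep}} = \max_{0 \le i < p} w_i \;+\; h \cdot g \;+\; L
```

where $w_i$ is the local work of processor $i$, $h$ the maximal number of words sent or received by one processor, $g$ the per-word communication gap and $L$ the barrier latency; a program's cost is the sum of its superstep costs.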
12. Bulk Synchronous Parallelism [Valiant, McColl 90]
Advantages
Simple set of primitives
Implementable on any
kind of hardware
Possibility to reason about
BSP programs
7 of 41
13. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support a compositional semantics
Applications become compositions of state-less functions
8 of 41
14. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Classical Skeletons
Data parallel: map, fold, scan
Task parallel: par, pipe, farm
More complex: Distributable Homomorphisms, Divide & Conquer, …
8 of 41
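As an illustration of the higher-order-function view above, the data-parallel skeletons can be sketched in plain C++. The `skel::map` and `skel::fold` below are hypothetical sequential models of the skeletons (not Quaff's or NT2's actual API); a real implementation would distribute the work, but the composition interface stays the same.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical sequential models of the map and fold skeletons.
namespace skel {
  // map: apply f to every element (data-parallel in a real backend)
  template <typename F, typename T>
  std::vector<T> map(F f, std::vector<T> const& in) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
  }

  // fold: reduce all elements with a binary operation
  template <typename F, typename T>
  T fold(F f, T init, std::vector<T> const& in) {
    return std::accumulate(in.begin(), in.end(), init, f);
  }
}

// An application is a composition of state-less functions:
// sum of squares = fold(+) . map(square)
double sum_of_squares(std::vector<double> const& v) {
  auto sq   = [](double x) { return x * x; };
  auto plus = [](double a, double b) { return a + b; };
  return skel::fold(plus, 0.0, skel::map(sq, v));
}
```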
15. Relevance to our Objectives
Why use Parallel Skeletons?
Write code independent of parallel programming minutiae
Composability supports hierarchical architectures
Code is scalable and easy to maintain
Why use BSP?
The cost model guides development
Few primitives mean that the intellectual burden is low
Good medium for developing skeletons
How to ensure the performance of those models' implementations?
9 of 41
16. Domain Specific Embedded Languages
Domain Specific Languages
Non-Turing complete declarative languages
Solve a single class of problems
Express what to do instead of how to do it
E.g.: SQL, Matlab, …
From DSL to DSEL [Abrahams 2004]
A DSL incorporates domain-specic notation, constructs, and abstractions as
fundamental design considerations.
A Domain Specific Embedded Language (DSEL) is simply a library that meets the
same criteria
Generative Programming is one way to design such libraries
10 of 41
18. Meta-programming as a Tool
Definition
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
Meta-programmable Languages
MetaOCaml : runtime code generation via code quoting
Template Haskell : compile-time code generation via templates
C++ : compile-time code generation via templates
C++ meta-programming
Relies on the Turing-complete C++ sub-language
Handles types and integral constants at compile-time
classes and functions act as code quoting
12 of 41
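A minimal example of the Turing-complete compile-time sub-language mentioned above (standard C++ template metaprogramming, not tied to any particular library):

```cpp
#include <type_traits>

// Compile-time factorial: the compiler evaluates the recursion over
// integral constants entirely at compile time.
template <unsigned N>
struct factorial
  : std::integral_constant<unsigned, N * factorial<N - 1>::value> {};

template <>
struct factorial<0> : std::integral_constant<unsigned, 1> {};

// Types act as "quoted code": this metafunction transforms a type.
template <typename T>
struct pointer_of { using type = T*; };

// Both results are available before a single instruction runs.
static_assert(factorial<5>::value == 120, "evaluated at compile time");
static_assert(std::is_same<pointer_of<int>::type, int*>::value,
              "type-level transformation");
```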
19. The Expression Templates Idiom
Principles
Relies on extensive operator
overloading
Carries semantic information
around code fragments
Introduces DSLs without
disrupting dev. chain
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

The statement is captured as a typed AST:

expr< assign
    , expr<matrix&>
    , expr< plus
          , expr< cos, expr<matrix&> >
          , expr< multiplies, expr<matrix&>, expr<matrix&> >
          >
    >(x,a,b);

Arbitrary transforms applied on the meta-AST then generate, for instance:

#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

General Principles of Expression Templates
13 of 41
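The idiom itself fits in a few dozen lines. The `vec` and `expr` types below are a deliberately minimal model of the technique (not NT2's actual implementation): operator overloads build a typed AST, and assignment walks it once in a single fused loop, with no temporaries.

```cpp
#include <cstddef>
#include <vector>

// AST node: the operation Op applied to sub-expressions l and r.
template <typename L, typename Op, typename R>
struct expr {
  L const& l; R const& r;
  double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
};

struct plus  { static double apply(double a, double b) { return a + b; } };
struct times { static double apply(double a, double b) { return a * b; } };

struct vec {
  std::vector<double> data;
  explicit vec(std::size_t n, double v = 0) : data(n, v) {}
  double  operator[](std::size_t i) const { return data[i]; }
  double& operator[](std::size_t i)       { return data[i]; }

  // Assignment evaluates the whole AST element-wise in one loop.
  template <typename L, typename Op, typename R>
  vec& operator=(expr<L, Op, R> const& e) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
    return *this;
  }
};

// Overloads do not compute: they only record the operation as a type.
template <typename L, typename R>
expr<L, plus, R> operator+(L const& l, R const& r) { return {l, r}; }

template <typename L, typename R>
expr<L, times, R> operator*(L const& l, R const& r) { return {l, r}; }
```

With these definitions, `x = a + b * a;` builds an `expr<vec, plus, expr<vec, times, vec>>` whose type is the AST, exactly as in the diagram above.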
20. The Expression Templates Idiom
Advantages
Generic implementation becomes
self-aware of optimizations
API abstraction level is arbitrarily
high
Accessible through high-level
tools like Boost.Proto
13 of 41
21. Our Contributions
Our Strategy
Applies DSEL generation techniques to parallel programming
Maintains low cost of abstractions through meta-programming
Maintains abstraction level via modern library design
Our contributions
Tools       Pub.      Scope                 Applications
Quaff       ParCo'06  MPI Skeletons         Real-time 3D reconstruction
SkellBE     PACT'08   Skeletons on Cell BE  Real-time Image processing
BSP++       IJPP'12   MPI/OpenMP BSP        Bioinformatics, Model Checking
NT2         JPDC'14   Data Parallel Matlab  Fluid Dynamics, Vision
14 of 41
22. Example of BSP++ Application
Khaled Hamidouche, PhD 2008-2011, in collab. with Univ. of Brasilia
BSP Smith & Waterman
S&W computes DNA sequence alignments
The BSP++ implementation was written once and ran on 7 different hardware platforms
Efficiency of 95+% even on a 6000-core supercomputer
Platform MaxSize # Elements Speedup GCUPs
cluster (MPI) 1,072,950 128 cores 73x 6.53
cluster (MPI/OpenMP) 1,072,950 128 cores 116x 10.41
OpenMP 1,072,950 16 cores 16x 0.40
CellBE 85,603 8 SPEs — 0.14
cluster of CellBEs 85,603 24 SPEs (8:24) 2.8x 0.37
Hopper(MPI) 5,303,436 3072 cores 260x 3.09
Hopper(MPI+OpenMP) 24,894,269 6144 cores 5664x 15.5
15 of 41
24. Second Look at our Contributions
Development Limitations
DSELs are mostly tied to the domain model
Architecture support is often an afterthought
Extensibility is difficult as many refactorings are required per architecture
Example: no proper way to support GPUs with these implementation techniques
Proposed Method
Extends Generative Programming to take the architecture into account
Provides an architecture description DSEL
Integrates this description in the code generation process
16 of 41
26. Software refactoring
Tools       Issues                     Changes
Quaff       Raw skeletons API          Re-engineered as part of NT2
SkellBE     Too architecture-specific  Re-engineered as part of NT2
BSP++       Integration issues         Integrate hybrid code generation
NT2         Not easily extendable      Integrate Quaff skeleton models
Boost.SIMD  -                          Side product of the NT2 restructuring
Conclusion
Skeletons are fine as parallel middleware
Model based abstractions are not high level enough
For low level architectures, the simplest model is often the best
18 of 41
27. Boost.SIMD
Pierre Estérie, PhD 2010-2014
Principles
Provides a simple C++ API over SIMD extensions
Supports Intel, PPC and ARM instruction sets
Fully integrates with modern C++ idioms
Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. Wang
19 of 41
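The core abstraction behind such a library can be sketched as a pack-of-N value type. The `pack` below is a hypothetical portable model of what a library like Boost.SIMD maps to actual vector registers; here a plain loop stands in for the SIMD instruction, so the sketch stays architecture-independent.

```cpp
#include <array>
#include <cstddef>

// One pack models one SIMD register of N lanes; the overloaded
// operators map to single vector instructions on real hardware.
template <typename T, std::size_t N>
struct pack {
  std::array<T, N> v{};
  pack() = default;
  explicit pack(T s) { v.fill(s); }                 // splat a scalar
  T operator[](std::size_t i) const { return v[i]; }
};

template <typename T, std::size_t N>
pack<T, N> operator+(pack<T, N> a, pack<T, N> const& b) {
  for (std::size_t i = 0; i < N; ++i) a.v[i] += b.v[i];  // lane-wise add
  return a;
}

template <typename T, std::size_t N>
pack<T, N> operator*(pack<T, N> a, pack<T, N> const& b) {
  for (std::size_t i = 0; i < N; ++i) a.v[i] *= b.v[i];  // lane-wise mul
  return a;
}

// Generic code is written once against pack<T,N>; the same source
// retargets SSE, AVX, AltiVec or NEON by changing T and N.
template <typename T, std::size_t N>
pack<T, N> axpy(pack<T, N> const& a, pack<T, N> const& x, pack<T, N> const& y) {
  return a * x + y;
}
```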
29. The Numerical Template Toolbox
Pierre Estérie, PhD 2010-2014
NT2 as a Scientific Computing Library
Provides a simple, Matlab-like interface for users
Provides high-performance computing entities and primitives
Is easily extendable
Components
Uses Boost.SIMD for in-core optimizations
Uses recursive parallel skeletons
Supports task parallelism through Futures
21 of 41
33. The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly
mimics Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values,
as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and make cosmetic changes
Compile the file and link with libnt2.a
22 of 41
36. Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: the AST of the statement is split at the non-element-wise sum node: a fold skeleton computes tmp = sum(C+D), then a transform skeleton computes A = B / tmp.]
25 of 41
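The two extracted skeletons can be modelled with plain loops (illustrative code, not NT2's generated output): the fold is a reduction over `C(i)+D(i)`, the transform an element-wise division, and each can be parallelized independently.

```cpp
#include <cstddef>
#include <vector>

// Plain-loop model of the skeletons extracted from
//   A = B / sum(C + D);

// fold skeleton: tmp = sum(C + D)
double fold_sum(std::vector<double> const& C, std::vector<double> const& D) {
  double tmp = 0.0;
  for (std::size_t i = 0; i < C.size(); ++i)  // reduction over C(i)+D(i)
    tmp += C[i] + D[i];
  return tmp;
}

// transform skeleton: A = B / tmp
std::vector<double> transform_div(std::vector<double> const& B, double tmp) {
  std::vector<double> A(B.size());
  for (std::size_t i = 0; i < B.size(); ++i)  // element-wise division
    A[i] = B[i] / tmp;
  return A;
}
```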
38. From data to task parallelism
Antoine Tran Tan, PhD 2012-2015
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
Skeletons from the Future
Adapt current skeletons for taskification
Use Futures (or HPX) to automatically pipeline
Derive a dependency graph between statements
26 of 41
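A minimal sketch of the idea with standard `std::future` (HPX futures follow the same shape; this is illustrative, not NT2's actual task engine): each skeleton returns a future instead of ending with a barrier, so independent statements overlap and dependent ones synchronize exactly where the value is needed.

```cpp
#include <future>
#include <vector>

// Two independent fold skeletons run as concurrent tasks; the final
// statement depends on both and synchronizes only at the .get() calls,
// replacing the implicit fork-join barriers with a dependency graph.
double pipeline(std::vector<double> const& a, std::vector<double> const& b) {
  auto fa = std::async(std::launch::async, [&a] {
    double s = 0.0;
    for (double x : a) s += x;   // fold over a
    return s;
  });
  auto fb = std::async(std::launch::async, [&b] {
    double s = 0.0;
    for (double x : b) s += x;   // fold over b, concurrent with fa
    return s;
  });
  return fa.get() + fb.get();    // synchronization happens here only
}
```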
39. Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: the same statement as a task graph: the fold skeleton producing tmp = sum(C+D) and the transform skeleton computing A = B / tmp are linked by a data dependency rather than a barrier.]
27 of 41
41. Motion Detection
Lacassagne et al., ICIP 2009
Sigma-Delta algorithm based on background subtraction
Uses a local Gaussian model of lightness variation to detect motion
Challenge: Very low arithmetic density
Challenge: Integer-based implementation with small range
29 of 41
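One step of the Sigma-Delta kernel can be sketched as follows (a minimal scalar model following the published algorithm; the real implementation vectorizes these operations). Every operation is a small-integer compare or increment, which is exactly why the arithmetic density is so low and integer SIMD support matters.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Scalar model of one Sigma-Delta step: the background M and the
// variance V track each pixel by +/-1 updates; a pixel is flagged as
// moving when its difference to the background exceeds the variance.
struct sigma_delta {
  std::vector<std::uint8_t> M;  // background estimate
  std::vector<std::uint8_t> V;  // variance estimate

  explicit sigma_delta(std::vector<std::uint8_t> const& first)
    : M(first), V(first.size(), 1) {}

  // Returns the motion mask (1 = moving pixel) for frame I.
  std::vector<std::uint8_t> step(std::vector<std::uint8_t> const& I, int n = 2) {
    std::vector<std::uint8_t> mask(I.size());
    for (std::size_t i = 0; i < I.size(); ++i) {
      if (M[i] < I[i]) ++M[i];                  // background creeps
      else if (M[i] > I[i]) --M[i];             //   toward the frame
      int o = std::abs(int(M[i]) - int(I[i]));  // absolute difference
      if (V[i] < n * o) ++V[i];                 // variance creeps
      else if (V[i] > n * o) --V[i];            //   toward n*o
      mask[i] = (o > V[i]) ? 1 : 0;             // motion detection
    }
    return mask;
  }
};
```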
52. Conclusion
Parallel Computing for Scientists
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism related problems while being easy to use.
Like regular languages, DSELs need information about the hardware system
Integrating hardware descriptions as Generic components increases tool portability
and re-targetability
Our Achievements
A new method for parallel software development
Efficient libraries working on a large subset of hardware
High levels of performance across a wide application spectrum
38 of 41
53. Works in Progress
Application to Accelerators
Exploration of proper skeleton implementation on GPUs
Adaptation of Future based code generator
In progress with Ian Masliah's PhD thesis
Parallelism within C++
SIMD as part of the standard library
Proposal N3571 for standard SIMD computation
Interoperability with current parallel model of C++
39 of 41
54. Perspectives
DSELs as a C++ first-class idiom
Build partial evaluation into the language
Ease transition between regular and meta C++
Mid-term Prospect: MetaOCaml-like quoting for C++
DSEL and compilers relationship
C++ DSELs hit a limit on their applicability
Compilers often lack the high-level information needed for proper optimization
Mid-term Prospect: Hybrid library/compiler approaches for DSEL
40 of 41