Software Abstractions for Parallel Architectures
Joel Falcou
LRI - CNRS - INRIA
HDR Thesis Defense
12/01/2014
The Paradigm Change in Science
From Experiments to Simulations
Simulation is now an integral part of the Scientific Method
Scientific Computing enables larger, faster, more accurate Research
Fast Simulation is Time Travel as scientific results are now more readily available
Local Galaxy Cluster Simulation - Illustris project
Computing is first and foremost a mainstream science tool
The Paradigm Change in Science
The Parallel Hell
Heat Wall: Growing cores instead of GHz
Hierarchical and heterogeneous parallel systems are the norm
The Free Lunch is over as hardware complexity rises faster than the average developer skills
The real challenge in HPC is the Expressiveness/Efficiency War
The Expressiveness/Efficiency War
[Figure: Performance versus Expressiveness across three eras]
Single Core Era: C/Fortran, C++, Java
Multi-Core/SIMD Era: Sequential, Threads, SIMD
Heterogeneous Era: Sequential, SIMD, Threads, GPU, Phi, Distributed
As parallel systems complexity grows, the expressiveness gap turns into an ocean
Designing tools for Scientific Computing
Objectives
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specific Embedded Languages (DSEL) (2+3)
Use Parallel Programming Abstractions as parallel components (4)
Use Generative Programming to deliver performance (5)
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
Why Parallel Programming Models?
Limits of regular tools
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Contribute to the Expressiveness Gap
Available Models
Performance centric: P-RAM, LOG-P, BSP
Pattern centric: Futures, Skeletons
Data centric: HTA, PGAS
Bulk Synchronous Parallelism [Valiant, McColl 90]
Principles
Machine Model
Execution Model
Analytic Cost Model
[Figure: BSP Execution Model. Processors P0 to P3 run Superstep T and Superstep T+1; each superstep has a compute phase (cost Wmax), a communication phase (cost h.g) and a barrier (cost L)]
Advantages
Simple set of primitives
Implementable on any kind of hardware
Possibility to reason about BSP programs
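The analytic cost model can be made concrete with a small sketch. It assumes the usual BSP formulation, in which a superstep costs Wmax + h.g + L and a program costs the sum of its supersteps; the type and function names (`bsp_machine`, `superstep`, `cost`) are illustrative and are not taken from BSP++.

```cpp
#include <cassert>
#include <vector>

// Parameters of a BSP machine (hypothetical names, for illustration only).
struct bsp_machine {
    double g;  // cost of delivering one word, in the same unit as the work
    double L;  // cost of one barrier synchronization
};

// One superstep, summarized by the largest local work (w_max) and the
// largest number of words sent or received by any processor (h).
struct superstep {
    double w_max;
    double h;
};

// Cost of one superstep: Wmax + h*g + L.
inline double cost(bsp_machine const& m, superstep const& s) {
    assert(s.w_max >= 0 && s.h >= 0);
    return s.w_max + s.h * m.g + m.L;
}

// Cost of a whole program: the sum of the costs of its supersteps.
inline double cost(bsp_machine const& m, std::vector<superstep> const& program) {
    double total = 0.0;
    for (auto const& s : program) total += cost(m, s);
    return total;
}
```

With g = 2 and L = 10, a superstep with Wmax = 100 and h = 5 costs 100 + 10 + 10 = 120 units, which is the kind of reasoning the model is meant to support before any code is run.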
Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support a compositional semantics
Applications become composition of state-less functions
Classical Skeletons
Data parallel: map, fold, scan
Task parallel: par, pipe, farm
More complex: Distributable Homomorphism, Divide & Conquer, …
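The functional view of skeletons can be illustrated with a minimal sketch. The versions below are sequential and only show the shape of the abstraction: each skeleton is a higher-order function, and `pipe` composes two stages. The namespace `skel` and all signatures are assumptions made for this example, not the API of Quaff or NT2.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

namespace skel {

// map: apply f to every element independently.
template <class T, class F>
std::vector<T> map(std::vector<T> const& in, F f) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

// fold: reduce all elements with a binary operation.
template <class T, class Op>
T fold(std::vector<T> const& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}

// scan: inclusive prefix reduction.
template <class T, class Op>
std::vector<T> scan(std::vector<T> const& in, Op op) {
    std::vector<T> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin(), op);
    return out;
}

// pipe: compose two stages; the result is again a state-less function.
template <class F, class G>
auto pipe(F f, G g) {
    return [f, g](auto const& x) { return g(f(x)); };
}

}  // namespace skel
```

Because every skeleton takes and returns plain values, a parallel implementation can replace any of these bodies without changing the code that composes them.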
Relevance to our Objectives
Why use Parallel Skeletons?
Write code independent of parallel programming minutiae
Composability supports hierarchical architectures
Code is scalable and easy to maintain
Why use BSP?
The cost model guides development
Few primitives mean that the intellectual burden is low
Good medium for developing skeletons
How to ensure performance of those models’ implementations?
Domain Specic Embedded Languages
Domain Specic Languages
Non-Turing complete declarative languages
Solve a single type of problems
Express what to do instead of how to do it
E.g: SQL, M, M, …
From DSL to DSEL [Abrahams 2004]
A DSL incorporates domain-specic notation, constructs, and abstractions as
fundamental design considerations.
A Domain Specic Embedded Languages (DSEL) is simply a library that meets the
same criteria
Generative Programming is one way to design such libraries
10 of 41
Generative Programming [Eisenecker 97]
[Figure: a Generative Component, made of a Translator and Parametric Sub-components, turns a Domain Specific Application Description into a Concrete Application]
Meta-programming as a Tool
Denition
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
Meta-programmable Languages
metaOCAML : runtime code generation via code quoting
Template Haskell : compile-time code generation via templates
C++ : compile-time code generation via templates
C++ meta-programming
Relies on the Turing-complete C++  sub-language
Handles types and integral constants at compile-time
 classes and functions act as code quoting
12 of 41
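A short sketch of what handling types and integral constants at compile time looks like in practice. The names `factorial` and `accumulator` are made up for this example; only `std::integral_constant` and `std::is_same` come from the standard library.

```cpp
#include <type_traits>

// An integral constant computed by the compiler through recursive
// template instantiation.
template <unsigned N>
struct factorial
    : std::integral_constant<unsigned long, N * factorial<N - 1>::value> {};

// The specialization for 0 stops the recursion.
template <>
struct factorial<0> : std::integral_constant<unsigned long, 1> {};

// A function from types to types: choose a wider type for accumulation.
template <class T> struct accumulator        { using type = T; };
template <>        struct accumulator<float> { using type = double; };
template <>        struct accumulator<char>  { using type = int; };

// Both checks are resolved during compilation; no code runs at run time.
static_assert(factorial<5>::value == 120, "value computed at compile time");
static_assert(std::is_same<accumulator<float>::type, double>::value,
              "type computed at compile time");
```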
The Expression Templates Idiom
Principles
Relies on extensive operator overloading
Carries semantic information around code fragments
Introduces DSLs without disrupting dev. chain
General Principles of Expression Templates
The statement
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);
is captured as the type
expr<assign
    ,expr<matrix&>
    ,expr<plus
         ,expr<cos
              ,expr<matrix&>
              >
         ,expr<multiplies
              ,expr<matrix&>
              ,expr<matrix&>
              >
         >
    >(x,a,b);
[Figure: meta-AST with = at the root, x as its left child, and + combining cos(a) with b*a. Arbitrary transforms are applied on the meta-AST]
and the generated code is
#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}
Advantages
Generic implementation becomes self-aware of optimizations
API abstraction level is arbitrarily high
Accessible through high-level tools like Boost.Proto
Our Contributions
Our Strategy
Applies DSEL generation techniques to parallel programming
Maintains low cost of abstractions through meta-programming
Maintains abstraction level via modern library design
Our contributions
Tool     Publication  Scope                 Applications
Quaff    ParCo’06     MPI Skeletons         Real-time 3D reconstruction
SkellBE  PACT’08      Skeleton on Cell BE   Real-time Image processing
BSP++    IJPP’12      MPI/OpenMP BSP        Bioinformatics, Model Checking
NT2      JPDC’14      Data Parallel Matlab  Fluid Dynamics, Vision
Example of BSP++ Application
Khaled Hamidouche PhD 2008-2011 in collab. with Univ. Brasilia
BSP Smith & Waterman
S&W computes DNA sequence alignments
BSP++ implementation was written once and run on 7 different hardware platforms
Efficiency of 95+% even on a 6000-core supercomputer
Platform              MaxSize     # Processing elements  Speedup      GCUPs
cluster (MPI)         1,072,950   128 cores              73x          6.53
cluster (MPI/OpenMP)  1,072,950   128 cores              116x         10.41
OpenMP                1,072,950   16 cores               16x          0.40
CellBE                85,603      8 SPEs                 -            0.14
cluster of CellBEs    85,603      24 SPEs                (8:24) 2.8x  0.37
Hopper (MPI)          5,303,436   3072 cores             260x         3.09
Hopper (MPI+OpenMP)   24,894,269  6144 cores             5664x        15.5
Second Look at our Contributions
Development Limitations
DSELs are mostly tied to the domain model
Architecture support is often an afterthought
Extensibility is difficult as much refactoring is required per architecture
Example : No proper way to support GPUs with those implementation techniques
Proposed Method
Extends Generative Programming to take the architecture into account
Provides an architecture description DSEL
Integrates this description in the code generation process
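One common way to fold an architecture description into code generation is tag dispatching over a hierarchy of architecture tags. The sketch below assumes that technique; the tags `cpu_`, `sse2_`, `avx_` and the functions `pick_sum_kernel` and `sum_kernel_for` are hypothetical names chosen for the example and do not reproduce the actual NT2 or Boost.SIMD interface.

```cpp
namespace arch {
// Hypothetical architecture description: each tag refines a more
// generic one through inheritance.
struct cpu_ {};
struct sse2_ : cpu_ {};
struct avx_ : sse2_ {};
}  // namespace arch

enum class kernel { scalar, simd128 };

// Generic fallback, valid on any architecture.
constexpr kernel pick_sum_kernel(arch::cpu_) { return kernel::scalar; }

// Specialized version for architectures that provide at least SSE2.
constexpr kernel pick_sum_kernel(arch::sse2_) { return kernel::simd128; }

// Overload resolution walks up the tag hierarchy: an architecture with no
// dedicated kernel reuses the closest parent implementation.
template <class Arch>
constexpr kernel sum_kernel_for() {
    return pick_sum_kernel(Arch{});
}

static_assert(sum_kernel_for<arch::cpu_>() == kernel::scalar,
              "generic code path");
static_assert(sum_kernel_for<arch::avx_>() == kernel::simd128,
              "avx_ has no overload of its own and falls back to sse2_");
```

Adding a new architecture then means adding a tag and only the overloads that differ from its parent, instead of refactoring every component.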
Architecture Aware Generative Programming
Software refactoring
Tool        Issues                     Changes
Quaff       Raw skeletons API          Re-engineered as part of NT2
SkellBE     Too architecture specific  Re-engineered as part of NT2
BSP++       Integration issues         Integrate hybrid code generation
NT2         Not easily extendable      Integrate Quaff Skeleton models
Boost.SIMD  -                          Side product of NT2 restructuring
Conclusion
Skeletons are fine as parallel middleware
Model based abstractions are not high level enough
For low level architectures, the simplest model is often the best
Boost.SIMD
Pierre Estérie PhD 2010-2014
Principles
Provides a simple C++ API over SIMD extensions
Supports every Intel, PPC and ARM instruction set
Fully integrates with modern C++ idioms
Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. Wang
The Numerical Template Toolbox
Pierre Estérie PhD 2010-2014
NT2 as a Scientific Computing Library
Provides a simple, Matlab-like interface for users
Provides high-performance computing entities and primitives
Is easily extendable
Components
Uses Boost.SIMD for in-core optimizations
Uses recursive parallel skeletons
Supports task parallelism through Futures
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and do cosmetic changes
Compile the file and link with libnt2.a
NT2 - From M to C++
M code
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1’);
rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );
NT2
code
table <double > A1 = _(1. ,1000.);
table <double > A2 = A1 + randn(size(A1));
table <double > X = lu( mtimes(A1, trans(A1) );
double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
23 of 41
Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: AST of the statement. The subtree sum(C+D) is matched by a fold skeleton and the enclosing A = B / ... by a transform skeleton]
The statement is split into two statements, one per skeleton:
tmp = sum(C+D);   (fold)
A = B / tmp;      (transform)
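The result of the extraction can be written by hand with standard algorithms. The sketch below assumes one-dimensional data, so that sum yields a single scalar, and uses sequential `std::inner_product` and `std::transform` as stand-ins for the fold and transform skeletons; the function name `divide_by_sum` is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// A = B / sum(C + D), evaluated as a fold followed by a transform.
inline std::vector<double> divide_by_sum(std::vector<double> const& B,
                                         std::vector<double> const& C,
                                         std::vector<double> const& D) {
    assert(C.size() == D.size());

    // fold: tmp = sum(C + D). The element-wise sum C + D is computed on
    // the fly and never stored in a temporary array.
    double const tmp = std::inner_product(C.begin(), C.end(), D.begin(), 0.0,
                                          std::plus<double>(),   // reduction
                                          std::plus<double>());  // C[i] + D[i]

    // transform: A = B / tmp, one independent operation per element.
    std::vector<double> A(B.size());
    std::transform(B.begin(), B.end(), A.begin(),
                   [tmp](double b) { return b / tmp; });
    return A;
}
```

The split matters because the two parts parallelize differently: the fold needs a reduction across workers, while the transform is embarrassingly parallel once `tmp` is known.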
From data to task parallelism
Antoine Tran Tan PhD, 2012-2015
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
Skeletons from the Future
Adapt current skeletons for taskification
Use Futures (or HPX) to automatically pipeline
Derive a dependency graph between statements
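A minimal sketch of how futures replace a global barrier with explicit dependencies, using only `std::async` and `std::future` from the standard library. The example statements and the names `sum_of` and `ratio_of_sums` are assumptions for illustration; HPX offers richer futures that are not shown here.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

inline double sum_of(std::vector<double> const& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Three statements:  s1 = sum(a);  s2 = sum(b);  r = s1 / s2;
// The first two are independent, the third depends on both.
inline double ratio_of_sums(std::vector<double> const& a,
                            std::vector<double> const& b) {
    assert(!b.empty());
    // Each independent statement becomes a task; the implementation is
    // free to run them concurrently.
    std::future<double> s1 = std::async(sum_of, std::cref(a));
    std::future<double> s2 = std::async(sum_of, std::cref(b));
    // get() waits only for the value this statement needs, so the
    // dependency graph, not a barrier, drives the synchronization.
    return s1.get() / s2.get();
}
```

For example, `ratio_of_sums({1, 2, 3}, {2, 1})` waits for both partial sums and divides 6 by 3.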
Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: the statements tmp = sum(C+D) (fold) and A = B / tmp (transform) are each split per column k = 1, 2, 3. OpenMP spawners launch the per-column work, and SIMD workers compute tmp(k) = sum(C(:,k) + D(:,k)) with a fold, then A(:,k) = B(:,k) / tmp(k) with a transform]
Motion Detection
Lacassagne et al., ICIP 2009
Sigma-Delta algorithm based on background subtraction
Use local Gaussian model of lightness variation to detect motion
Challenge: Very low arithmetic density
Challenge: Integer-based implementation with small range
table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance
                       )
{
  // Estimate Raw Movement
  background = selinc( background < frame
                     , seldec(background > frame, background)
                     );
  table<char> diff = dist(background, frame);
  // Compute Local Variance
  table<char> sig3 = muls(diff, 3);
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec( variance > sig3, variance)
                            )
                    , variance
                    );
  // Generate Movement Label
  return if_zero_else_one( diff < variance );
}
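For readers unfamiliar with the helper functions, here is a scalar, per-pixel sketch of the same step. It relies on assumed meanings for the helpers: `selinc(c, x)` and `seldec(c, x)` increment or decrement `x` when `c` holds, `dist` is the absolute difference, `muls` is a saturated multiplication, and `if_zero_else_one(c)` yields 0 when `c` holds and 1 otherwise. The `pixel` alias (unsigned 8-bit) is also an assumption; the slide uses `char`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using pixel = std::uint8_t;  // assumed 8-bit unsigned pixel type

// Assumed scalar meaning of the helpers used in the array version.
inline pixel selinc(bool c, pixel x) { return c ? static_cast<pixel>(x + 1) : x; }
inline pixel seldec(bool c, pixel x) { return c ? static_cast<pixel>(x - 1) : x; }
inline pixel dist(pixel a, pixel b) {
    return static_cast<pixel>(a > b ? a - b : b - a);
}
inline pixel muls(pixel a, int k) {
    assert(k >= 0);
    return static_cast<pixel>(std::min(255, a * k));  // saturate at 255
}

// One Sigma-Delta step for a single pixel. Updates the background and
// variance estimates in place and returns the motion label (0 or 1).
inline pixel sigma_delta(pixel& background, pixel frame, pixel& variance) {
    // Move the background estimate one step toward the current frame.
    background = selinc(background < frame,
                        seldec(background > frame, background));
    pixel const diff = dist(background, frame);

    // Move the variance estimate one step toward 3 * diff, but only
    // where the pixel differs from the background.
    pixel const sig3 = muls(diff, 3);
    if (diff != 0)
        variance = selinc(variance < sig3, seldec(variance > sig3, variance));

    // Label as motion (1) when the difference is not below the variance.
    return diff < variance ? 0 : 1;
}
```

The array version applies exactly this update to every pixel at once, which is why the computation is dominated by comparisons and selections rather than arithmetic.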
Motion Detection
[Figure: cycles/element versus Image Size (N x N), for 512x512 and 1024x1024 images. Series: SCALAR, HALF CORE, FULL CORE, SIMD, JRTIP2008, SIMD + HALF CORE, SIMD + FULL CORE. Speedup labels on the bars range from x2.1 to x18]
Black and Scholes Option Pricing
NT2 Code
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta)/(va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
NT2 Code with loop fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges the loop nests and increases cache locality
  tie(da,d1,d2,R) = tie( sqrt(Ta)
                       , (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta)/(va*da)
                       , d1 - va*da
                       , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                       );
  return R;
}
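As a point of comparison, a scalar reference for the textbook Black-Scholes price of a European call can be written with the standard library alone. This is a sketch for checking values, not part of NT2; the names `normcdf` and `blackscholes_call` and the choice of `double` are assumptions of the example.

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function, expressed with erfc.
inline double normcdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Price of a European call option.
// S: spot price, X: strike, T: time to maturity in years,
// r: risk-free rate, v: volatility.
inline double blackscholes_call(double S, double X, double T, double r, double v) {
    assert(S > 0 && X > 0 && T > 0 && v > 0);
    double const da = std::sqrt(T);
    double const d1 = (std::log(S / X) + (0.5 * v * v + r) * T) / (v * da);
    double const d2 = d1 - v * da;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

For S = X = 100, T = 1, r = 0.05 and v = 0.2, d1 is 0.35 and d2 is 0.15, and the price comes out at about 10.45. The array version evaluates the same formula for every option in the input tables.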
Black and Scholes Option Pricing - Performance
[Figure: cycle/value for Size = 1000000. Series: scalar, SSE2, AVX2, SSE2 with 4 cores, AVX2 with 4 cores. Speedup labels: x1.89, x2.91, x5.58, x6.30]
Black and Scholes Option Pricing - Performance with loop fusion/futurisation
[Figure: cycle/value for Size = 1000000, same series. Speedup labels: x2.27, x4.13, x8.05, x11.12]
LU Decomposition
Algorithm
[Figure: tiled LU decomposition over a 3 x 3 grid of tiles A00 to A22, carried out in steps 1 to 7 with the kernels DGETRF, DGESSM, DTSTRF and DSSSSM]
Performance
[Figure: Median GFLOPS versus Number of cores (0 to 50) for an 8000 × 8000 LU decomposition. Series: NT2, Intel MKL]
Conclusion
Parallel Computing for Scientists
Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use.
Like regular languages, DSELs need information about the hardware system
Integrating hardware descriptions as Generic components increases tools portability and re-targetability
Our Achievements
A new method for parallel software development
Efficient libraries working on a large subset of hardware
High level of performance across a wide application spectrum
Works in Progress
Application to Accelerators
Exploration of proper skeleton implementation on GPUs
Adaptation of Future based code generator
In progress with Ian Masliah's PhD thesis
Parallelism within C++
SIMD as part of the standard library
Proposal N3571 for standard SIMD computation
Interoperability with current parallel model of C++
Perspectives
DSEL as C++ first-class idiom
Build partial evaluation into the language
Ease transition between regular and meta C++
Mid-term Prospect: MetaOCaml-like quoting for C++
DSEL and compilers relationship
C++ DSELs hit a limit on their applicability
Compilers often lack high-level information for proper optimization
Mid-term Prospect: Hybrid library/compiler approaches for DSEL
Thanks for your attention

 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 

Recently uploaded (20)

OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Data modeling 101 - Basics - Software Domain
Data modeling 101 - Basics - Software DomainData modeling 101 - Basics - Software Domain
Data modeling 101 - Basics - Software Domain
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Copilot para Microsoft 365 y Power Platform Copilot
Copilot para Microsoft 365 y Power Platform CopilotCopilot para Microsoft 365 y Power Platform Copilot
Copilot para Microsoft 365 y Power Platform Copilot
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 

HDR Defence - Software Abstractions for Parallel Architectures

  • 1. Software Abstractions for Parallel Architectures Joel Falcou LRI - CNRS - INRIA HDR Thesis Defense 12/01/2014
  • 2. The Paradigm Change in Science. From Experiments to Simulations: Simulation is now an integral part of the Scientific Method. Scientific Computing enables larger, faster, more accurate research. Fast simulation is time travel, as scientific results are now more readily available. [Figure: Local Galaxy Cluster Simulation, Illustris project] Computing is first and foremost a mainstream science tool.
  • 3. The Paradigm Change in Science. The Parallel Hell: Heat Wall: core counts grow instead of GHz. Hierarchical and heterogeneous parallel systems are the norm. The Free Lunch is over, as hardware complexity rises faster than the average developer's skills. [Figure: Local Galaxy Cluster Simulation, Illustris project] The real challenge in HPC is the Expressiveness/Efficiency War.
  • 4. The Expressiveness/Efficiency War. [Figure: three Performance versus Expressiveness charts. Single Core Era: C/Fortran, C++, Java. Multi-Core/SIMD Era: Sequential, Threads, SIMD. Heterogeneous Era: Sequential, SIMD, Threads, GPU, Phi, Distributed.] As the complexity of parallel systems grows, the expressiveness gap turns into an ocean.
  • 6. Designing tools for Scientific Computing. Objectives: 1. Be non-disruptive. 2. Domain-driven optimizations. 3. Provide an intuitive API for the user. 4. Support a wide architectural landscape. 5. Be efficient. Our Approach: Design tools as C++ libraries (1). Design these libraries as Domain Specific Embedded Languages (DSEL) (2+3). Use Parallel Programming Abstractions as parallel components (4). Use Generative Programming to deliver performance (5).
  • 7. Talk Layout: Introduction; Abstractions & Efficiency; Experimental Results; Conclusion.
  • 9. Why Parallel Programming Models? Limits of regular tools: Unstructured parallelism is error-prone. Low-level parallel tools are non-composable. Both contribute to the Expressiveness Gap. Available Models: Performance-centric: PRAM, LogP, BSP. Pattern-centric: Futures, Skeletons. Data-centric: HTA, PGAS.
  • 11. Bulk Synchronous Parallelism [Valiant, McColl 90]. Principles: Machine Model; Execution Model; Analytic Cost Model. [Figure: BSP Execution Model. Processors P0 to P3 run supersteps T and T+1; each superstep consists of a compute phase (cost Wmax), a communication phase (cost h.g) and a barrier (cost L).]
  • 12. Bulk Synchronous Parallelism [Valiant, McColl 90]. Advantages: Simple set of primitives. Implementable on any kind of hardware. BSP programs can be reasoned about. [Figure: BSP Execution Model, same diagram as on slide 11.]
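The analytic cost model named on the BSP slides can be illustrated with a few lines of C++. This is a minimal sketch, assuming the usual reading of the figure labels: a superstep costs Wmax + h.g + L, where Wmax is the largest local computation time, h the largest number of words any processor sends or receives, g the per-word communication cost and L the barrier cost. The struct and function names are hypothetical and are not taken from BSP++.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one superstep (names are illustrative only).
struct Superstep {
    std::vector<double> work;  // local computation time of each processor
    double h;                  // max words sent or received by any processor
};

// Machine parameters of the BSP cost model.
struct BspMachine {
    double g;  // communication cost per word
    double L;  // barrier synchronisation cost
};

// Cost of one superstep: Wmax + h*g + L.
double superstep_cost(const Superstep& s, const BspMachine& m) {
    assert(!s.work.empty());
    const double w_max = *std::max_element(s.work.begin(), s.work.end());
    return w_max + s.h * m.g + m.L;
}

// Cost of a whole program: the sum of the costs of its supersteps.
double program_cost(const std::vector<Superstep>& steps, const BspMachine& m) {
    double total = 0.0;
    for (const Superstep& s : steps) total += superstep_cost(s, m);
    return total;
}
```

Because the cost is a closed formula over a handful of parameters, two candidate decompositions of an algorithm can be compared before any parallel code is written.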
  • 13. Parallel Skeletons [Cole 89]. Principles: There are patterns in parallel applications. Those patterns can be generalized as Skeletons. Applications are assembled as a combination of such patterns. Functional point of view: Skeletons are Higher-Order Functions. Skeletons support a compositional semantics. Applications become compositions of stateless functions.
  • 14. Parallel Skeletons [Cole 89]. Classical Skeletons: Data parallel: map, fold, scan. Task parallel: par, pipe, farm. More complex: Distributable Homomorphism, Divide & Conquer, …
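To make "skeletons are higher-order functions" concrete, here is a minimal sequential sketch of the map and fold skeletons and of their composition. The names map_skel, fold_skel and sum_of_squares are hypothetical. A real skeleton library would hide a parallel implementation behind the same signatures; this sketch only shows the functional interface.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// map skeleton: applies a stateless function to every element.
template <typename T, typename F>
std::vector<T> map_skel(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

// fold skeleton: reduces a sequence with a binary operation.
template <typename T, typename Op>
T fold_skel(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}

// An application is a composition of skeletons: fold(+) after map(square).
inline double sum_of_squares(const std::vector<double>& v) {
    return fold_skel(map_skel(v, [](double x) { return x * x; }), 0.0,
                     [](double a, double b) { return a + b; });
}
```

Since both arguments are stateless functions, the caller never sees how the iteration is scheduled, which is what leaves the implementation free to run it in parallel.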
  • 15. Relevance to our Objectives. Why use Parallel Skeletons? Write code independent of parallel programming minutiae. Composability supports hierarchical architectures. Code is scalable and easy to maintain. Why use BSP? The cost model guides development. Few primitives mean a low intellectual burden. It is a good medium for developing skeletons. How can the performance of implementations of those models be ensured?
  • 16. Domain Specific Embedded Languages. Domain Specific Languages: Non-Turing-complete declarative languages. Solve a single type of problem. Express what to do instead of how to do it. E.g: SQL, M, M, … From DSL to DSEL [Abrahams 2004]: A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations. A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria. Generative Programming is one way to design such libraries.
  • 17. Generative Programming [Eisenecker 97]. [Figure with the following boxes: Domain Specific Application Description; Generative Component (Translator, Parametric Sub-components); Concrete Application.]
  • 18. Meta-programming as a Tool. Definition: Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data. Meta-programmable Languages: MetaOCaml: runtime code generation via code quoting. Template Haskell: compile-time code generation via templates. C++: compile-time code generation via templates. C++ meta-programming: Relies on the Turing-complete C++ template sub-language. Handles types and integral constants at compile-time. Template classes and functions act as code quoting.
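The two kinds of compile-time values mentioned above, integral constants and types, can each be shown with a small template. This is a generic illustration of C++ template meta-programming; factorial and uint_of_size are hypothetical names and do not come from any of the libraries in the talk.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Integral constant computed by the compiler through recursive instantiation.
template <unsigned N>
struct factorial {
    static constexpr unsigned long long value = N * factorial<N - 1>::value;
};

// Explicit specialisation stops the recursion.
template <>
struct factorial<0> {
    static constexpr unsigned long long value = 1;
};

// Type computed by the compiler: an unsigned integer wide enough for Bytes.
template <std::size_t Bytes>
using uint_of_size =
    typename std::conditional<(Bytes <= 4), std::uint32_t, std::uint64_t>::type;

// Both results are checked during compilation, at no runtime cost.
static_assert(factorial<5>::value == 120, "factorial evaluated at compile time");
static_assert(std::is_same<uint_of_size<8>, std::uint64_t>::value,
              "type selected at compile time");
```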
  • 19. The Expression Templates Idiom. Principles: Relies on extensive operator overloading. Carries semantic information around code fragments. Introduces DSLs without disrupting the development chain. [Figure: General Principles of Expression Templates.] The user code
      matrix x(h,w),a(h,w),b(h,w);
      x = cos(a) + (b*a);
    is captured as the type
      expr<assign, expr<matrix&>, expr<plus, expr<cos, expr<matrix&> >, expr<multiplies, expr<matrix&>, expr<matrix&> > > >(x,a,b);
    which encodes the meta-AST (= with children x and +; + with children cos(a) and b*a). Arbitrary transforms are applied on the meta-AST, which finally generates
      #pragma omp parallel for
      for(int j=0;j<h;++j) { for(int i=0;i<w;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); } }
  • 20. The Expression Templates Idiom. Advantages: Generic implementation becomes self-aware of optimizations. API abstraction level is arbitrarily high. Accessible through high-level tools like B.P [Figure: General Principles of Expression Templates, same code and diagram as on slide 19.]
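The mechanism can be reproduced in miniature. The sketch below is a deliberately small, hand-written expression template for one-dimensional vectors; it is not the NT2 implementation and performs no AST transformation or OpenMP code generation. All names (the et namespace, vec, add_node, mul_node, cos_node) are assumptions made for this illustration. Overloaded operators build a tree of lightweight node types, and the assignment operator evaluates the whole tree in a single loop without temporaries.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace et {

// CRTP base: every node of the expression tree derives from expr<Node>.
template <typename E>
struct expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Terminal node: owns its data.
struct vec : expr<vec> {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    std::size_t size() const { return data.size(); }
    double operator[](std::size_t i) const { return data[i]; }

    // Assigning an expression evaluates the whole tree in one fused loop.
    template <typename E>
    vec& operator=(const expr<E>& e) {
        const E& tree = e.self();
        assert(tree.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = tree[i];
        return *this;
    }
};

// Inner nodes only keep references to their children; nothing is computed
// until operator[] is called during the assignment.
template <typename L, typename R>
struct add_node : expr<add_node<L, R>> {
    const L& lhs;
    const R& rhs;
    add_node(const L& l, const R& r) : lhs(l), rhs(r) {}
    std::size_t size() const { return lhs.size(); }
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <typename L, typename R>
struct mul_node : expr<mul_node<L, R>> {
    const L& lhs;
    const R& rhs;
    mul_node(const L& l, const R& r) : lhs(l), rhs(r) {}
    std::size_t size() const { return lhs.size(); }
    double operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
};

template <typename A>
struct cos_node : expr<cos_node<A>> {
    const A& arg;
    explicit cos_node(const A& a) : arg(a) {}
    std::size_t size() const { return arg.size(); }
    double operator[](std::size_t i) const { return std::cos(arg[i]); }
};

// The overloads below build the tree; its shape is encoded in the return type.
template <typename L, typename R>
add_node<L, R> operator+(const expr<L>& l, const expr<R>& r) {
    return add_node<L, R>(l.self(), r.self());
}

template <typename L, typename R>
mul_node<L, R> operator*(const expr<L>& l, const expr<R>& r) {
    return mul_node<L, R>(l.self(), r.self());
}

template <typename A>
cos_node<A> cos(const expr<A>& a) {
    return cos_node<A>(a.self());
}

}  // namespace et
```

With this in place, `x = et::cos(a) + (b * a);` on three `et::vec` objects compiles to one loop over the elements. The nodes hold references to temporaries, so an expression must be assigned in the statement that builds it rather than stored for later use.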
  • 21. Our Contributions. Our Strategy: Apply DSEL generation techniques to parallel programming. Maintain a low cost of abstraction through meta-programming. Maintain the abstraction level via modern library design. Our contributions:
      Tool | Publication | Scope | Applications
      Quaff | ParCo’06 | MPI Skeletons | Real-time 3D reconstruction
      SkellBE | PACT’08 | Skeletons on Cell BE | Real-time image processing
      BSP++ | IJPP’12 | MPI/OpenMP BSP | Bioinformatics, Model Checking
      NT2 | JPDC’14 | Data Parallel Matlab | Fluid Dynamics, Vision
  • 22. Example of BSP++ Application. Khaled Hamidouche, PhD 2008-2011, in collaboration with Univ. Brasilia. BSP Smith & Waterman: S&W computes DNA sequence alignments. The BSP++ implementation was written once and ran on 7 different hardware platforms. Efficiency of 95+% even on a 6000-core supercomputer.
      Platform | MaxSize | # Elements | Speedup | GCUPs
      cluster (MPI) | 1,072,950 | 128 cores | 73x | 6.53
      cluster (MPI/OpenMP) | 1,072,950 | 128 cores | 116x | 10.41
      OpenMP | 1,072,950 | 16 cores | 16x | 0.40
      CellBE | 85,603 | 8 SPEs | - | 0.14
      cluster of CellBEs | 85,603 | 24 SPEs (8:24) | 2.8x | 0.37
      Hopper (MPI) | 5,303,436 | 3072 cores | 260x | 3.09
      Hopper (MPI+OpenMP) | 24,894,269 | 6144 cores | 5664x | 15.5
  • 24. Second Look at our Contributions. Development Limitations: DSELs are mostly tied to the domain model. Architecture support is often an afterthought. Extensibility is difficult, as a lot of refactoring is required per architecture. Example: there is no proper way to support GPUs with those implementation techniques. Proposed Method: Extend Generative Programming to take the architecture into account. Provide an architecture description DSEL. Integrate this description in the code generation process.
  • 25. Architecture Aware Generative Programming
  • 26. Software refactoring.
      Tool | Issues | Changes
      Quaff | Raw skeletons API | Re-engineered as part of NT2
      SkellBE | Too architecture specific | Re-engineered as part of NT2
      BSP++ | Integration issues | Integrate hybrid code generation
      NT2 | Not easily extendable | Integrate Quaff Skeleton models
      Boost.SIMD | - | Side product of NT2 restructuring
    Conclusion: Skeletons are fine as parallel middleware. Model-based abstractions are not high-level enough. For low-level architectures, the simplest model is often the best.
  • 27. Boost.SIMD. Pierre Estérie, PhD 2010-2014. Principles: Provides a simple C++ API over SIMD extensions. Supports every Intel, PPC and ARM instruction set. Fully integrates with modern C++ idioms. [Figure: Sparse Tridiagonal Solver, collaboration with M. Baboulin and Y. Wang]
  • 29. The Numerical Template Toolbox. Pierre Estérie, PhD 2010-2014. NT2 as a Scientific Computing Library: Provides a simple, Matlab-like interface for users. Provides high-performance computing entities and primitives. Is easily extendable. Components: Uses Boost.SIMD for in-core optimizations. Uses recursive parallel skeletons. Supports task parallelism through Futures.
  • 33. The Numerical Template Toolbox. Principles: table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities. 500+ functions are usable directly either on table or on any scalar values, as in Matlab. How does it work? Take a .m file and copy it to a .cpp file. Add #include <nt2/nt2.hpp> and make cosmetic changes. Compile the file and link with libnt2.a.
  • 34. NT2 - From Matlab to C++. Matlab code:
      A1 = 1:1000;
      A2 = A1 + randn(size(A1));
      X = lu(A1*A1');
      rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );
    NT2 code:
      table<double> A1 = _(1.,1000.);
      table<double> A2 = A1 + randn(size(A1));
      table<double> X = lu( mtimes(A1, trans(A1)) );
      double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
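  • 35. Parallel Skeletons extraction process. A = B / sum(C+D); [Figure: AST of the statement. The root = has children A and /; / has children B and sum; sum applies to + with children C and D. The sum subtree is tagged fold and the enclosing assignment is tagged transform.]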
  • 36. Parallel Skeletons extraction process. A = B / sum(C+D); [Figure: the AST is split into two statements. First tmp = sum(C + D), tagged fold; then A = B / tmp, tagged transform.]
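Written by hand, the result of this extraction corresponds to two loops: a fold that produces the scalar tmp, followed by a transform that consumes it. The following sequential sketch shows that structure; the function name divide_by_sum is hypothetical and the code is an illustration of the split, not the code NT2 generates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written equivalent of A = B / sum(C + D) after skeleton extraction.
std::vector<double> divide_by_sum(const std::vector<double>& B,
                                  const std::vector<double>& C,
                                  const std::vector<double>& D) {
    assert(C.size() == D.size());

    // fold: tmp = sum(C + D); the addition is fused into the reduction,
    // so the intermediate array C + D is never stored.
    double tmp = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

    // transform: A = B / tmp, an element-wise loop with no dependencies
    // between iterations.
    std::vector<double> A(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}
```

The split is needed because the transform cannot start before the fold has produced its single value; each of the two loops can then be mapped onto its own parallel skeleton.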
  • 38. From data to task parallelism. Antoine Tran Tan, PhD 2012-2015. Limits of the fork-join model: Synchronization cost due to implicit barriers. Under-exploitation of potential parallelism. Poor data locality and no inter-statement optimization. Skeletons from the Future: Adapt current skeletons for taskification. Use Futures (or HPX) to automatically pipeline. Derive a dependency graph between statements.
  • 39. Parallel Skeletons extraction process - Take 2. A = B / sum(C+D); [Figure: the two extracted statements, tmp = sum(C + D) tagged fold and A = B / tmp tagged transform.]
  • 40. Parallel Skeletons extraction process - Take 2. A = B / sum(C+D); [Figure: each statement is split into per-column tasks. For each column j in 1..3, a fold worker computes tmp(j) = sum(C(:, j) + D(:, j)) and a transform worker computes A(:, j) = B(:, j) / tmp(j). Workers are tagged simd and are launched by spawners tagged OpenMP.]
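The per-column decomposition can be sketched with standard C++ futures. This is an assumption-laden illustration rather than the NT2 code generator: it uses std::async in place of the OpenMP spawners and HPX mentioned in the talk, plain scalar loops in place of SIMD workers, 0-based column indices, and a hypothetical column-major matrix type. Each task chains the fold and the transform of one column, so a column's transform starts as soon as its own fold is done and there is no global barrier between the two statements.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical column-major matrix, just large enough for the example.
struct matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    matrix(std::size_t r, std::size_t c, double v = 0.0)
        : rows(r), cols(c), data(r * c, v) {}
    double& operator()(std::size_t i, std::size_t j) { return data[j * rows + i]; }
    double operator()(std::size_t i, std::size_t j) const { return data[j * rows + i]; }
};

// A(:, j) = B(:, j) / sum(C(:, j) + D(:, j)) for every column j.
matrix divide_by_column_sum(const matrix& B, const matrix& C, const matrix& D) {
    assert(B.rows == C.rows && B.cols == C.cols);
    assert(B.rows == D.rows && B.cols == D.cols);

    matrix A(B.rows, B.cols);
    std::vector<std::future<void>> tasks;

    for (std::size_t j = 0; j < B.cols; ++j) {
        // One task per column; tasks write to disjoint columns of A.
        tasks.push_back(std::async(std::launch::async, [&A, &B, &C, &D, j] {
            double tmp = 0.0;  // fold worker
            for (std::size_t i = 0; i < C.rows; ++i) tmp += C(i, j) + D(i, j);
            for (std::size_t i = 0; i < B.rows; ++i)  // transform worker
                A(i, j) = B(i, j) / tmp;
        }));
    }

    for (std::future<void>& t : tasks) t.get();  // single join at the end
    return A;
}
```

On most toolchains this needs to be linked against the platform thread library (for example with -pthread on GCC and Clang).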
  • 41. Motion Detection. Lacassagne et al., ICIP 2009. Sigma-Delta algorithm based on background subtraction. Uses a local Gaussian model of lightness variation to detect motion. Challenge: very low arithmetic density. Challenge: integer-based implementation with a small range.
  • 42. Motion Detection.
      table<char> sigma_delta( table<char>& background
                             , table<char> const& frame
                             , table<char>& variance
                             )
      {
        // Estimate Raw Movement
        background = selinc( background < frame
                           , seldec(background > frame, background) );
        table<char> diff = dist(background, frame);
        // Compute Local Variance
        table<char> sig3 = muls(diff,3);
        variance = if_else( diff != 0
                          , selinc( variance < sig3
                                  , seldec( variance > sig3, variance) )
                          , variance );
        // Generate Movement Label
        return if_zero_else_one( diff < variance );
      }
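For readers unfamiliar with the NT2 vocabulary, the same update can be written as a plain scalar loop. This reference sketch rests on assumptions: the meaning of selinc, seldec, dist, muls and if_zero_else_one is inferred from their names (conditional increment, conditional decrement, absolute difference, saturated multiplication, and "0 if true, 1 otherwise"), and pixels are stored as unsigned 8-bit values instead of table<char>.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using image = std::vector<std::uint8_t>;

// Scalar Sigma-Delta step: updates background and variance in place and
// returns the movement label computed as in the slide code.
image sigma_delta(image& background, const image& frame, image& variance) {
    assert(background.size() == frame.size());
    assert(variance.size() == frame.size());

    image label(frame.size());
    for (std::size_t i = 0; i < frame.size(); ++i) {
        // Estimate raw movement: the background follows the frame by +/-1.
        if (background[i] < frame[i]) ++background[i];
        else if (background[i] > frame[i]) --background[i];
        const int diff = std::abs(int(background[i]) - int(frame[i]));

        // Compute local variance: move towards 3*diff (saturated at 255),
        // but only where some difference was observed.
        const int sig3 = std::min(3 * diff, 255);
        if (diff != 0) {
            if (variance[i] < sig3) ++variance[i];
            else if (variance[i] > sig3) --variance[i];
        }

        // Generate movement label: 0 where diff < variance, 1 elsewhere.
        label[i] = (diff < variance[i]) ? 0 : 1;
    }
    return label;
}
```

Every pixel performs only a few comparisons and increments on 8-bit data, which is the "very low arithmetic density" challenge named on the previous slide.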
  • 43. Motion Detection. [Figure: cycles/element versus image size (N x N) for 512x512 and 1024x1024 images, comparing SCALAR, HALF CORE, FULL CORE, SIMD, JRTIP2008, SIMD + HALF CORE and SIMD + FULL CORE; annotated speedups range from x2.1 to x18.]
  • 44. Black and Scholes Option Pricing. NT2 Code:
      table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                               , table<float> const& Ta
                               , table<float> const& ra, table<float> const& va
                               )
      {
        table<float> da = sqrt(Ta);
        table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta)/(va*da);
        table<float> d2 = d1 - va*da;
        return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
      }
  • 45. Black and Scholes Option Pricing. NT2 Code with loop fusion:
      table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                               , table<float> const& Ta
                               , table<float> const& ra, table<float> const& va
                               )
      {
        // Preallocate temporary tables
        table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
        // tie merges the loop nests and increases cache locality
        tie(da,d1,d2,R) = tie( sqrt(Ta)
                             , (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta)/(va*da)
                             , d1 - va*da
                             , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                             );
        return R;
      }
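As a scalar point of comparison, the textbook closed form for a European call option can be written in standard C++. This sketch is independent of NT2 and uses d1 = (ln(S/X) + (r + v²/2)·T) / (v·√T), with the normal CDF built on std::erfc. The function names are assumptions chosen to mirror the slide code.

```cpp
#include <cassert>
#include <cmath>

// Cumulative distribution function of the standard normal law.
inline double normcdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes price of a European call option.
// S: spot price, X: strike, T: time to maturity in years,
// r: risk-free rate, v: volatility.
double blackscholes_call(double S, double X, double T, double r, double v) {
    assert(S > 0.0 && X > 0.0 && T > 0.0 && v > 0.0);
    const double da = std::sqrt(T);
    const double d1 = (std::log(S / X) + (v * v * 0.5 + r) * T) / (v * da);
    const double d2 = d1 - v * da;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

For S = X = 100, T = 1, r = 0.05 and v = 0.2 this gives a price of about 10.45. The array versions on the slides apply the same arithmetic element-wise to whole tables.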
  • 46. Black and Scholes Option Pricing. Performance. [Figure: cycles/value at size 1000000 for scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores; annotated speedups: x1.89, x2.91, x5.58, x6.30.]
  • 47. Black and Scholes Option Pricing. Performance with loop fusion/futurisation. [Figure: cycles/value at size 1000000 for scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores; annotated speedups: x2.27, x4.13, x8.05, x11.12.]
  • 48. LU Decomposition. Algorithm. [Figure: tiled LU decomposition on a 3x3 grid of tiles A00 to A22, shown in steps 1 to 7, using the kernels DGETRF, DGESSM, DTSTRF and DSSSSM.]
  • 49. LU Decomposition. Performance. [Figure: 8000 × 8000 LU decomposition, median GFLOPS versus number of cores (0 to 50), NT2 compared with Intel MKL.]
  • 52. Conclusion. Parallel Computing for Scientists: Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use. Like regular languages, DSELs need information about the hardware system. Integrating hardware descriptions as Generic components increases tool portability and re-targetability. Our Achievements: A new method for parallel software development. Efficient libraries working on a large subset of hardware. High level of performance across a wide application spectrum.
  • 53. Works in Progress. Application to Accelerators: Exploration of proper skeleton implementation on GPUs. Adaptation of the Future-based code generator. In progress with Ian Masliah's PhD thesis. Parallelism within C++: SIMD as part of the standard library. Proposal N3571 for standard SIMD computation. Interoperability with the current parallel model of C++.
  • 54. Perspectives. DSEL as a C++ first-class idiom: Build partial evaluation into the language. Ease the transition between regular and meta C++. Mid-term prospect: MetaOCaml-like quoting for C++. DSEL and compilers relationship: C++ DSELs hit a limit on their applicability. Compilers often lack high-level information for proper optimization. Mid-term prospect: hybrid library/compiler approaches for DSEL.
  • 55. Thanks for your attention