C++ on its way to exascale and beyond
– The HPX Parallel Runtime System
Thomas Heller (thomas.heller@cs.fau.de)
January 21, 2016
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671603.
What is Exascale anyway?
Exascale in numbers
• An exascale computer is supposed to execute 10^18 floating point operations per second
• Exa: 10^18 = 1,000,000,000,000,000,000
• People on Earth: 7.3 billion = 7.3 × 10^9
• Imagine each person is able to compute one operation per second. Working through 10^18 operations then takes:
⇒ 136,986,301 seconds
⇒ 2,283,105 minutes
⇒ 38,051 hours
⇒ 1,585 days
⇒ over 4 years
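A quick check of the arithmetic (added for clarity, matching the numbers above):

10^18 ops ÷ (7.3 × 10^9 persons × 1 op/s) ≈ 1.37 × 10^8 s ≈ 2.28 × 10^6 min ≈ 3.8 × 10^4 h ≈ 1,585 days ≈ 4.3 years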
Why do we need that many calculations?
Challenges
• How do we program those beasts?
⇒ Massively parallel processors
⇒ Massive numbers of compute nodes
⇒ Deep memory hierarchies
• How can we design the architecture to be affordable?
⇒ Biggest operational cost is energy
⇒ Power envelope of 20 MW
⇒ Current fastest computer (Tianhe-2): 17 MW
Current Development
Current #1 system:
• Tianhe-2: 33.9 PFLOPS
• roughly 3.4% of an exaflop
Hardware Trends
• ARM: low-power ARM64 cores (possibly with embedded GPU accelerators)
• IBM: POWER + NVIDIA accelerators
• Intel: Knights Landing (Xeon Phi) many-core processor
How will C++ deal with all that?!?
Challenges
• Programmability
• Expressing Parallelism
• Expressing Data Locality
The 4 Horsemen of the Apocalypse: SLOW
• Starvation
• Latency
• Overhead
• Waiting for contention
State of the Art
• Modern architectures impose massive challenges on programmability in
the context of performance portability
• Massive increase in on-node parallelism
• Deep memory hierarchies
• The only portable parallelization solutions for C++ programmers (today): OpenMP and MPI
• Hugely successful for years
• Widely used and supported
• Simple use for simple use cases
• Very portable
• Highly optimized
State of the Art – Parallelism in C++
• C++11 introduced lower-level abstractions
• std::thread, std::mutex, std::future, etc.
• Fairly limited, more is needed
• C++ needs stronger support for higher-level parallelism
• Several proposals to the Standardization Committee are accepted or
under consideration
• Technical Specification: Concurrency (P0159, note: misnomer)
• Technical Specification: Parallelism (P0024)
• Other smaller proposals: resumable functions, task regions, executors
• Currently there is no overarching vision related to higher-level parallelism
• Goal is to standardize a ‘big story’ by 2020
• No need for OpenMP, OpenACC, OpenCL, etc.
Stepping Aside – Introducing HPX
HPX – A general purpose parallel Runtime System
• Solidly based on a theoretical foundation – a well-defined, new execution model (ParalleX)
• Exposes a coherent and uniform, standards-oriented API for ease of programming parallel and distributed applications
• Enables writing fully asynchronous code using hundreds of millions of threads
• Provides unified syntax and semantics for local and remote operations
• Open source: published under the Boost Software License
HPX – A general purpose parallel Runtime System
HPX represents an innovative mixture of
• A global system-wide address space (AGAS - Active Global Address
Space)
• Fine grain parallelism and lightweight synchronization
• Combined with implicit, work queue based, message driven computation
• Full semantic equivalence of local and remote execution, and
• Explicit support for hardware accelerators (through percolation)
HPX 101 – The programming model

(Diagram, built up over several slides: N localities, each with its own memory and a thread scheduler running many lightweight HPX threads; the Active Global Address Space (AGAS) service spans all localities, and a parcelport connects them for message-driven remote work.)

Remote objects and remote work are addressed through the global address space:

future<id_type> id = new_<Component>(locality, ...);
future<R> result = async(id.get(), action, ...);
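For a runnable flavor of this model, a minimal sketch using a plain action on a single locality (not from the slides; HPX_PLAIN_ACTION, hpx::async on a locality id, and hpx::find_here are documented HPX facilities, but include paths differ between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>  // HPX_PLAIN_ACTION; header layout may vary
#include <hpx/include/async.hpp>
#include <iostream>

int square(int x) { return x * x; }
HPX_PLAIN_ACTION(square, square_action);  // makes square() remotely invocable

int main() {
    // Invoke the action on a locality; here the local one, but any id
    // returned by hpx::find_all_localities() works the same way.
    hpx::future<int> f = hpx::async(square_action(), hpx::find_here(), 7);
    std::cout << f.get() << "\n";  // prints 49
    return 0;
}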
HPX 101 – Overview

(Layers: HPX / C++ Standard Library / C++)

                          Synchronous               Asynchronous                     Fire & Forget
R f(p...)                 (returns R)               (returns future<R>)              (returns void)

Functions (direct)        f(p...)                   async(f, p...)                   apply(f, p...)
Functions (lazy)          bind(f, p...)(...)        async(bind(f, p...), ...)        apply(bind(f, p...), ...)
Actions (direct)          HPX_ACTION(f, a)          HPX_ACTION(f, a)                 HPX_ACTION(f, a)
                          a()(id, p...)             async(a(), id, p...)             apply(a(), id, p...)
Actions (lazy)            HPX_ACTION(f, a)          HPX_ACTION(f, a)                 HPX_ACTION(f, a)
                          bind(a(), id, p...)(...)  async(bind(a(), id, p...), ...)  apply(bind(a(), id, p...), ...)

In addition: dataflow(func, f1, f2);
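To make the plain-function rows of the table concrete, a minimal sketch (not from the slides; hpx::async and hpx::apply are the documented entry points, but include paths differ between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>  // hpx::async, hpx::apply; header layout may vary
#include <functional>
#include <iostream>

int add(int a, int b) { return a + b; }

int main() {
    // Synchronous, direct: a plain call, returns R
    int r1 = add(40, 2);

    // Asynchronous, direct: returns future<R>
    hpx::future<int> r2 = hpx::async(&add, 40, 2);

    // Asynchronous, lazy: bind first, launch later
    auto bound = std::bind(&add, 40, std::placeholders::_1);
    hpx::future<int> r3 = hpx::async(bound, 2);

    // Fire & forget: returns void, completion is not observed directly
    hpx::apply([](int v) { std::cout << "fire & forget saw " << v << "\n"; }, r1);

    std::cout << r2.get() + r3.get() << "\n";  // 84
    return 0;
}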
The Future, an example

int universal_answer() { return 42; }

void deep_thought() {
    future<int> promised_answer = async(&universal_answer);
    // do other things for 7.5 million years
    cout << promised_answer.get() << endl;  // prints 42, eventually
}
Compositional facilities
• Sequential composition of futures

future<string> make_string() {
    future<int> f1 = async([]() -> int { return 123; });
    future<string> f2 = f1.then(
        [](future<int> f) -> string
        {
            // here .get() won't block
            return to_string(f.get());
        });
    return f2;
}
Compositional facilities
• Parallel composition of futures

future<int> test_when_all() {
    future<int> future1 = async([]() -> int { return 125; });
    future<string> future2 = async([]() -> string { return string("hi"); });

    auto all_f = when_all(future1, future2);

    future<int> result = all_f.then(
        [](auto f) -> int {
            return do_work(f.get());
        });
    return result;
}
Dataflow – The new 'async' (HPX)
• What if one or more arguments to 'async' are futures themselves?
• Normal behavior: pass futures through to the function
• Extended behavior: wait for the futures to become ready before invoking the function:

template <typename F, typename... Args>
future<result_of_t<F(Args...)>>
// requires(is_callable<F(Args...)>)
dataflow(F&& f, Args&&... args);

• If ArgN is a future, then the invocation of F will be delayed
• Non-future arguments are passed through
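A minimal sketch of this behavior (not from the slides; hpx::dataflow delivers the future argument already ready, so .get() does not block; older HPX releases may spell it hpx::lcos::local::dataflow):

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>  // futures, async, dataflow; header layout may vary
#include <iostream>
#include <utility>

int main() {
    hpx::future<int> f = hpx::async([] { return 20; });

    // One future argument, one plain argument: dataflow waits for f to become
    // ready, then invokes the callable; the non-future 22 is passed through.
    hpx::future<int> sum = hpx::dataflow(
        [](hpx::future<int> a, int b) { return a.get() + b; },
        std::move(f), 22);

    std::cout << sum.get() << "\n";  // prints 42
    return 0;
}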
Parallel Algorithms
Concepts of Parallelism – Parallel Execution Properties
• The execution restrictions applicable for the work items
• In what sequence the work items have to be executed
• Where the work items should be executed
• The parameters of the execution environment
Concepts and Types of Parallelism

(Diagram, built up over several slides:)
• Application – uses Concepts such as Parallel Algorithms, Fork-Join, etc., built on Futures, Async, and Dataflow
• Execution Policies – express Restrictions
• Executors – determine the Sequence and Where of execution
• Executor Parameters – control the Grain Size
Execution Policies (std)
• Specify execution guarantees (in terms of thread-safety) for executed
parallel tasks:
• sequential_execution_policy: seq
• parallel_execution_policy: par
• parallel_vector_execution_policy: par_vec
• In parallelism TS used for parallel algorithms only
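A minimal sketch of using these policies with an HPX parallel algorithm (not from the slides; the namespace alias and header paths are assumptions, as spellings have moved between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(1000, 1);

    // Sequential: runs on the calling thread, same result as a plain loop
    parallel::for_each(parallel::seq, v.begin(), v.end(), [](int& x) { x += 1; });

    // Parallel: work is split across HPX worker threads
    parallel::for_each(parallel::par, v.begin(), v.end(), [](int& x) { x *= 2; });

    return 0;
}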
Execution Policies (Extensions)
• Asynchronous Execution Policies:
• sequential_task_execution_policy: seq(task)
• parallel_task_execution_policy: par(task)
• In both cases the formerly synchronous functions return a future<>
• Instruct the parallel construct to be executed asynchronously
• Allows integration with asynchronous control flow
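A sketch of the asynchronous (task) form, where the algorithm returns a future instead of blocking (assumed spelling par(task), matching the slide; exact names differ across HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(1000, 1);

    // Returns immediately; the future becomes ready once the loop has finished,
    // so the algorithm composes with .then(), when_all(), dataflow(), ...
    auto done = parallel::for_each(parallel::par(parallel::task),
                                   v.begin(), v.end(), [](int& x) { x *= 2; });
    done.get();  // wait for completion
    return 0;
}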
Executors
• Executors are objects responsible for
• Creating execution agents on which work is performed (P0058)
• In P0058 this is limited to parallel algorithms; here it has a much broader use
• Abstraction of the (potentially platform-specific) mechanisms for launching work
• Responsible for defining the Where and How of the execution of tasks
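A minimal sketch of handing an executor to an algorithm via a policy (not from the slides; parallel_executor is assumed as the executor type, and executor names/namespaces have changed across HPX releases):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>   // header layout may vary per HPX version
#include <hpx/include/parallel_executors.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(1000, 1);

    // The executor decides where and how execution agents are created;
    // .on() rebinds the execution policy to that executor.
    parallel::parallel_executor exec;
    parallel::for_each(parallel::par.on(exec), v.begin(), v.end(),
                       [](int& x) { x += 1; });
    return 0;
}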
Execution Parameters
• Allow controlling the grain size of work, i.e. the number of iterations of a parallel for_each run on the same thread
• Similar to OpenMP scheduling policies: static, guided, dynamic
• But with much finer control
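A sketch of controlling grain size with an executor parameter (assumed name static_chunk_size; the exact namespace has changed between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(100000, 1);

    // Each HPX task processes 4096 consecutive iterations,
    // roughly comparable to OpenMP's schedule(static, 4096).
    parallel::for_each(parallel::par.with(parallel::static_chunk_size(4096)),
                       v.begin(), v.end(), [](int& x) { x *= 2; });
    return 0;
}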
Putting it all together – SAXPY routine with data locality
• a[i] = b[i] ∗ x + c[i], for i from 0 to N − 1
• Using parallel algorithms
• Explicit Control over data locality
• No raw Loops
Putting it all together – SAXPY routine with data locality
Complete serial version:

std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

std::transform(b.begin(), b.end(), c.begin(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
Putting it all together – SAXPY routine with data locality
Parallel version, no data locality:

std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

parallel::transform(parallel::par,
    b.begin(), b.end(),
    c.begin(), c.end(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
Putting it all together – SAXPY routine with data locality
Parallel version, with data locality:

std::vector<double, numa_allocator> a = ...;
std::vector<double, numa_allocator> b = ...;
std::vector<double, numa_allocator> c = ...;
double x = ...;

for (auto& numa_executor : numa_executors) {
    parallel::transform(
        parallel::par.on(numa_executor),
        b.begin() + ..., b.begin() + ...,
        c.begin() + ..., c.begin() + ..., a.begin() + ...,
        [x](double bb, double cc)
        { return bb * x + cc; });
}
Case Studies
LibGeoDecomp
• C++ Auto-parallelizing framework
• Open Source
• High scalability
• Wide range of platform support
• http://www.libgeodecomp.org
LibGeoDecomp
Futurizing the Simulation Flow
Basic simulation flow:

for (Region r : innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid);
++step;
for (Region r : outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
for (Region r : outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
for (Region r : innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
LibGeoDecomp
Futurizing the Simulation Flow
Futurized simulation flow (each stage is attached as a continuation of the previous one):

parallel for (Region r : innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid); ++step;
parallel for (Region r : outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
parallel for (Region r : outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
parallel for (Region r : innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
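A hedged sketch of what "futurized" means here, chaining the stages as continuations with hpx::dataflow (the stage functions mirror the pseudocode above and are placeholders, not LibGeoDecomp API):

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>  // futures, dataflow; header layout may vary per HPX version
#include <utility>

// Stand-ins for the simulation stages above.
void update_inner()           { /* update interior cells           */ }
void notify_patch_providers() { /* exchange ghost zones            */ }
void update_ghost_zones()     { /* update cells in the ghost zones */ }

hpx::future<void> one_step(hpx::future<void> previous_step) {
    // Each stage starts only once the future it depends on is ready;
    // the result is again a future, so time steps compose into a chain.
    auto inner = hpx::dataflow(
        [](hpx::future<void>) { update_inner(); }, std::move(previous_step));
    auto ghosts = hpx::dataflow(
        [](hpx::future<void>) { notify_patch_providers(); }, std::move(inner));
    return hpx::dataflow(
        [](hpx::future<void>) { update_ghost_zones(); }, std::move(ghosts));
}

int main() {
    hpx::future<void> step = hpx::make_ready_future();
    for (int i = 0; i < 3; ++i)       // three futurized time steps
        step = one_step(std::move(step));
    step.get();                       // wait for the whole chain
    return 0;
}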
HPXCL – Extending the Global Address Space
• All GPU devices are addressable globally
• GPU memory can be allocated and referenced remotely
• Events are extensions of the shared state
⇒ API embedded into the already existing future facilities
From async to GPUs
Spawning single tasks is not feasible
⇒ offload a work group (think of parallel::for_each)

auto devices = hpx::opencl::find_devices(hpx::find_here(),
    CL_DEVICE_TYPE_GPU).get();

// create buffers, programs and kernels ...
hpx::opencl::buffer buf = devices[0].create_buffer(
    CL_MEM_READ_WRITE, 4711);

auto write_future = buf.enqueue_write(some_vec.begin(), some_vec.end());
auto kernel_future = kernel.enqueue(dim, write_future);
From async to GPUs
Spawning single tasks is not feasible
⇒ offload a work group (think of parallel::for_each)
• Proof of concept
• Future directions:
• Embed OpenCL devices behind execution policies and executors
• Hide OpenCL details behind parallel algorithms
• Hide OpenCL buffer management behind "distributed data structures"
Mandelbrot example

(Architecture diagram: a Generator feeds a Queue; Workers pull work items; a Webserver delivers the computed tiles to a Google Maps API client. Two follow-up slides show the rendered output.)

Acknowledgements to Martin Stumpf
LibGeoDecomp
Performance Results

(Plot: Execution Times of HPX and MPI N-Body Codes, SMP, Weak Scaling; time [s] over number of cores on one node (1–16); series: Sim HPX, Sim MPI, Comm HPX, Comm MPI.)
LibGeoDecomp
Performance Results

(Plot: Weak Scaling Results for HPX N-Body Code, Single Xeon Phi, Futurized; performance [GFLOPS] over number of cores (0–60); series: 1, 2, 3, and 4 threads/core.)
LibGeoDecomp
Performance Results

(Plot: Weak Scaling Results for HPX N-Body Codes, Host Cores and Xeon Phi Accelerator; performance [TFLOPS] over number of nodes (up to 16, each with 16 host cores and a full Xeon Phi); series: HPX, Peak.)
STREAM Benchmark

(Plot: TRIAD STREAM results, 50 million data points; bandwidth [GB/s] over number of cores per NUMA domain (1–12); series: HPX (1 NUMA domain), OpenMP (1 NUMA domain), HPX (2 NUMA domains), OpenMP (2 NUMA domains).)
Matrix Transpose

(Plot: Matrix Transpose, SMP, 24k x 24k matrices; data transfer rate [GB/s] over number of cores per NUMA domain (1–12); series: HPX (1 NUMA domain), HPX (2 NUMA domains), OMP (1 NUMA domain), OMP (2 NUMA domains).)
Matrix Transpose

(Plot: Matrix Transpose, SMP, 24k x 24k matrices; data transfer rate [GB/s] over number of cores per NUMA domain (1–12); series: HPX (2 NUMA domains), MPI (1 NUMA domain, 12 ranks), MPI (2 NUMA domains, 24 ranks), MPI+OMP (2 NUMA domains).)
Matrix Transpose

(Plot: Matrix Transpose, Xeon Phi, 24k x 24k matrices; data transfer rate [GB/s] over number of cores (0–60); series: HPX and OMP with 1, 2, and 4 PUs per core.)
Matrix Transpose

(Plot: Matrix Transpose, distributed, 18k x 18k elements per node; data transfer rate [GB/s] over number of nodes (2–8, 16 cores each); series: HPX, MPI.)
What’s beyond Exascale?
Conclusions
Higher-level parallelization abstractions in C++:
• uniform, versatile, and generic
• All of this is enabled by use of modern C++ facilities
• Runtime system (fine-grain, task-based schedulers)
• Performant, portable implementation
Parallelism is here to stay!
• Massively parallel hardware is already part of our daily lives!
• Parallelism is observable everywhere:
⇒ IoT: massive numbers of devices existing in parallel
⇒ Embedded: massively parallel, energy-aware systems (Epiphany, DSPs, FPGAs)
⇒ Automotive: massive amounts of parallel sensor data to process
• We all need solutions for dealing with this efficiently and pragmatically
More Information
• https://github.com/STEllAR-GROUP/hpx
• http://stellar-group.org
• hpx-users@stellar.cct.lsu.edu
• #STE||AR @ irc.freenode.org
Collaborations:
• FET-HPC (H2020): AllScale (https://allscale.eu)
• NSF: STORM (http://storm.stellar-group.org)
• DOE: Part of X-Stack
More Related Content

What's hot

Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
Muralidharan Deenathayalan
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisLWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
Jonas Traub
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
Flink Forward
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
Samuel Bosch
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
Vasia Kalavri
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AI
Jim Dowling
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
Hassan A-j
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
Wim Vanderbauwhede
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
Claudio Martella
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
Kostas Tzoumas
 
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PCVENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
Qin Liu
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
Turi, Inc.
 

What's hot (18)

Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisLWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AI
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PCVENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
 

Viewers also liked

Римский корсаков снегурочка
Римский корсаков снегурочкаРимский корсаков снегурочка
Римский корсаков снегурочка
Ninel Kek
 
Цветочные легенды
Цветочные легендыЦветочные легенды
Цветочные легенды
Ninel Kek
 
High Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRSHigh Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRSJonathan Oliver
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Apache Apex
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Apache Apex
 
бсп (обоб. урок)
бсп (обоб. урок)бсп (обоб. урок)
бсп (обоб. урок)
HomichAlla
 
правописание приставок урок№4
правописание приставок урок№4правописание приставок урок№4
правописание приставок урок№4
HomichAlla
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorial
james tong
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Spark Summit
 
Windowing in Apache Apex
Windowing in Apache ApexWindowing in Apache Apex
Windowing in Apache Apex
Apache Apex
 
The 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy CodeThe 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy Code
Roberto Cortez
 
Hadoop File System Shell Commands,
Hadoop File System Shell Commands,Hadoop File System Shell Commands,
Hadoop File System Shell Commands,
Hadoop online training
 
Hadoop basic commands
Hadoop basic commandsHadoop basic commands
Hadoop basic commands
bispsolutions
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
Build your shiny new pc, with Pangoly
Build your shiny new pc, with PangolyBuild your shiny new pc, with Pangoly
Build your shiny new pc, with Pangoly
Pangoly
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
Apache Apex
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
Emilio Coppa
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
Apache Apex
 
Introduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examplesIntroduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examples
Noé Fernández-Pozo
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
Apache Apex
 

Viewers also liked (20)

Римский корсаков снегурочка
Римский корсаков снегурочкаРимский корсаков снегурочка
Римский корсаков снегурочка
 
Цветочные легенды
Цветочные легендыЦветочные легенды
Цветочные легенды
 
High Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRSHigh Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRS
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
бсп (обоб. урок)
бсп (обоб. урок)бсп (обоб. урок)
бсп (обоб. урок)
 
правописание приставок урок№4
правописание приставок урок№4правописание приставок урок№4
правописание приставок урок№4
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorial
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Windowing in Apache Apex
Windowing in Apache ApexWindowing in Apache Apex
Windowing in Apache Apex
 
The 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy CodeThe 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy Code
 
Hadoop File System Shell Commands,
Hadoop File System Shell Commands,Hadoop File System Shell Commands,
Hadoop File System Shell Commands,
 
Hadoop basic commands
Hadoop basic commandsHadoop basic commands
Hadoop basic commands
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Build your shiny new pc, with Pangoly
Build your shiny new pc, with PangolyBuild your shiny new pc, with Pangoly
Build your shiny new pc, with Pangoly
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
Introduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examplesIntroduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examples
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
 

Similar to C++ on its way to exascale and beyond -- The HPX Parallel Runtime System

Software Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareSoftware Abstractions for Parallel Hardware
Software Abstractions for Parallel Hardware
Joel Falcou
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
Thomas Wuerthinger
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
HPCC Systems
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
dairsie
 
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
HPCC Systems
 
Going deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkusGoing deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkus
Red Hat Developers
 
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
SasidharaKashyapChat
 
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
Ioan Toma
 
HDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesHDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel Architectures
Joel Falcou
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
Edge AI and Vision Alliance
 
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Akihiro Hayashi
 
Deployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardwareDeployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardware
Intel IT Center
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
ExtremeEarth
 
EuroHPC AI in DAPHNE
EuroHPC AI in DAPHNEEuroHPC AI in DAPHNE
EuroHPC AI in DAPHNE
University of Maribor
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Alluxio, Inc.
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
Dirk Petersen
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
Esteban Hernandez
 
C cerin piv2017_c
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_c
Bertrand Tavitian
 
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
Edge AI and Vision Alliance
 

Similar to C++ on its way to exascale and beyond -- The HPX Parallel Runtime System (20)

Software Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareSoftware Abstractions for Parallel Hardware
Software Abstractions for Parallel Hardware
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
 
Going deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkusGoing deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkus
 
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
 
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
 
HDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesHDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel Architectures
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
 
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
 
Deployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardwareDeployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardware
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
 
EuroHPC AI in DAPHNE
EuroHPC AI in DAPHNEEuroHPC AI in DAPHNE
EuroHPC AI in DAPHNE
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
 
C cerin piv2017_c
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_c
 
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
 

Recently uploaded

Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 

Recently uploaded (20)

Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 

C++ on its way to exascale and beyond -- The HPX Parallel Runtime System

  • 18. State of the Art • Modern architectures impose massive challenges on programmability in the context of performance portability • Massive increase in on-node parallelism • Deep memory hierarchies • Only portable parallelization solution for C++ programmers (today): OpenMP and MPI • Hugely successful for years • Widely used and supported • Simple use for simple use cases • Very portable • Highly optimized C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 11/ 51
  • 19. State of the Art – Parallelism in C++ • C++11 introduced lower level abstractions • std::thread, std::mutex, std::future, etc. • Fairly limited, more is needed • C++ needs stronger support for higher-level parallelism • Several proposals to the Standardization Committee are accepted or under consideration • Technical Specification: Concurrency (P0159, note: misnomer) • Technical Specification: Parallelism (P0024) • Other smaller proposals: resumable functions, task regions, executors • Currently there is no overarching vision related to higher-level parallelism • Goal is to standardize a ‘big story’ by 2020 • No need for OpenMP, OpenACC, OpenCL, etc. C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 12/ 51
  • 20. Stepping Aside – Introducing HPX This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 21. HPX – A general purpose parallel Runtime System • Solidly based on a theoretical foundation – a well-defined, new execution model (ParalleX) • Exposes a coherent and uniform, standards-oriented API for ease of programming parallel and distributed applications. • Enables writing fully asynchronous code using hundreds of millions of threads. • Provides unified syntax and semantics for local and remote operations. • Open Source: Published under the Boost Software License C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 14/ 51
  • 22. HPX – A general purpose parallel Runtime System HPX represents an innovative mixture of • A global system-wide address space (AGAS - Active Global Address Space) • Fine grain parallelism and lightweight synchronization • Combined with implicit, work queue based, message driven computation • Full semantic equivalence of local and remote execution, and • Explicit support for hardware accelerators (through percolation) C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 15/ 51
  • 23. HPX 101 – The programming model [Diagram: the memory of Locality 0, Locality 1, …, Locality i, …, Locality N-1] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 24. HPX 101 – The programming model [Diagram: a Global Address Space spans the memory of all localities; the localities are connected through the parcelport and the Active Global Address Space (AGAS) service, and each locality runs its own thread scheduler] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 25. HPX 101 – The programming model [Diagram: as before, with each thread scheduler managing many lightweight HPX threads] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 26. HPX 101 – The programming model [Diagram: as before] future<id_type> id = new_<Component>(locality, ...); future<R> result = async(id.get(), action, ...); C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 27. HPX 101 – The programming model [Diagram: Locality 0 … Locality N-1, connected through the parcelport, the AGAS service, and per-locality thread schedulers] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
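  To make the two calls on slide 26 concrete, here is a minimal sketch (a code fragment, not from the slides): it assumes a hypothetical component type particle_container with a hypothetical get_size_action; the component/action registration boilerplate and headers are omitted.

        // Create a component instance on a chosen locality; new_ returns a
        // future to the global id of the new object.
        hpx::id_type where = hpx::find_here();
        hpx::future<hpx::id_type> id =
            hpx::new_<particle_container>(where, 1000);   // 1000: hypothetical constructor argument

        // Invoke an action on that object; async returns a future to the result.
        hpx::future<std::size_t> n =
            hpx::async<particle_container::get_size_action>(id.get());

        std::cout << n.get() << std::endl;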
  • 28. HPX 101 – Overview (HPX C++ Standard Library vs. plain C++, for a callable R f(p...))
        Synchronous (returns R)  |  Asynchronous (returns future<R>)  |  Fire & Forget (returns void)
        Functions (direct):  f(p...)  |  async(f, p...)  |  apply(f, p...)
        Functions (lazy):    bind(f, p...)(...)  |  async(bind(f, p...), ...)  |  apply(bind(f, p...), ...)
        Actions (direct):    HPX_ACTION(f, a); a()(id, p...)  |  HPX_ACTION(f, a); async(a(), id, p...)  |  HPX_ACTION(f, a); apply(a(), id, p...)
        Actions (lazy):      HPX_ACTION(f, a); bind(a(), id, p...)(...)  |  HPX_ACTION(f, a); async(bind(a(), id, p...), ...)  |  HPX_ACTION(f, a); apply(bind(a(), id, p...), ...)
        In Addition: dataflow(func, f1, f2);
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 17/ 51
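  As a minimal sketch of the table above (not from the slides): the three invocation flavours for a plain function and for the corresponding plain action, assuming the HPX_PLAIN_ACTION macro and the hpx::async/hpx::apply overloads of HPX at the time; header names are approximate.

        #include <hpx/hpx_main.hpp>
        #include <hpx/include/async.hpp>
        #include <hpx/include/actions.hpp>
        #include <iostream>

        int square(int i) { return i * i; }
        HPX_PLAIN_ACTION(square, square_action);   // make 'square' remotely invocable

        int main()
        {
            // Functions: synchronous, asynchronous, fire & forget
            int r1 = square(3);
            hpx::future<int> r2 = hpx::async(&square, 4);
            hpx::apply(&square, 5);                           // result is discarded

            // Actions: the same three flavours, addressed via a locality id
            hpx::id_type here = hpx::find_here();
            int r3 = square_action()(here, 6);                // synchronous
            hpx::future<int> r4 = hpx::async(square_action(), here, 7);
            hpx::apply(square_action(), here, 8);             // fire & forget

            std::cout << r1 << " " << r2.get() << " "
                      << r3 << " " << r4.get() << std::endl;
            return 0;
        }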
  • 29. The Future, an example
        int universal_answer() { return 42; }

        void deep_thought()
        {
            future<int> promised_answer = async(&universal_answer);
            // do other things for 7.5 million years
            cout << promised_answer.get() << endl;   // prints 42, eventually
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 18/ 51
  • 30. Compositional facilities • Sequential composition of futures
        future<string> make_string()
        {
            future<int> f1 = async([]() -> int { return 123; });

            future<string> f2 = f1.then(
                [](future<int> f) -> string
                {
                    return to_string(f.get());   // here .get() won’t block
                });

            return f2;
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 19/ 51
  • 31. Compositional facilities • Parallel composition of futures
        future<int> test_when_all()
        {
            future<int> future1 = async([]() -> int { return 125; });
            future<string> future2 = async([]() -> string { return string("hi"); });

            auto all_f = when_all(future1, future2);

            future<int> result = all_f.then(
                [](auto f) -> int { return do_work(f.get()); });

            return result;
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 20/ 51
  • 32. Dataflow – The new ’async’ (HPX) • What if one or more arguments to ’async’ are futures themselves? • Normal behavior: pass futures through to function • Extended behavior: wait for futures to become ready before invoking the function:
        template <typename F, typename... Args>
        future<result_of_t<F(Args...)>>   // requires(is_callable<F(Args...)>)
        dataflow(F&& f, Args&&... args);
        • If ArgN is a future, then the invocation of F will be delayed • Non-future arguments are passed through C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 21/ 51
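  A short usage sketch (not from the slides; the exact headers are approximations of the HPX includes of the time): the lambda passed to dataflow only runs once both futures are ready, and receives them as already-ready futures, so calling .get() inside it does not block.

        #include <hpx/hpx_main.hpp>
        #include <hpx/include/async.hpp>
        #include <hpx/include/lcos.hpp>
        #include <iostream>
        #include <utility>

        int main()
        {
            hpx::future<int>    f1 = hpx::async([] { return 17; });
            hpx::future<double> f2 = hpx::async([] { return 2.5; });

            // Invoked only when f1 and f2 are both ready.
            hpx::future<double> r = hpx::dataflow(
                [](hpx::future<int> a, hpx::future<double> b)
                {
                    return a.get() * b.get();   // does not block here
                },
                std::move(f1), std::move(f2));

            std::cout << r.get() << std::endl;   // prints 42.5, eventually
            return 0;
        }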
  • 33. Parallel Algorithms This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 34. Concepts of Parallelism – Parallel Execution Properties • The execution restrictions applicable for the work items • In what sequence the work items have to be executed • Where the work items should be executed • The parameters of the execution environment C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 23/ 51
  • 35.–40. Concepts and Types of Parallelism [Diagram, built up over six slides: an Application builds on Concepts (Futures, Async, Dataflow, Parallel Algorithms, Fork-Join, etc.), which rest on Execution Policies (restrictions), Executors (sequence, where), and Executor Parameters (grain size)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 24/ 51
  • 41. Execution Policies (std) • Specify execution guarantees (in terms of thread-safety) for executed parallel tasks: • sequential_execution_policy: seq • parallel_execution_policy: par • parallel_vector_execution_policy: par_vec • In parallelism TS used for parallel algorithms only C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 25/ 51
  • 42. Execution Policies (Extensions) • Asynchronous Execution Policies: • sequential_task_execution_policy: seq(task) • parallel_task_execution_policy: par(task) • In both cases the formerly synchronous functions return a future<> • Instruct the parallel construct to be executed asynchronously • Allows integration with asynchronous control flow C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 26/ 51
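  As a sketch (a code fragment; the abbreviated namespaces follow the slides, and the exact spelling of the task policy and header is an assumption about the HPX API of the time): invoked with par(task), a parallel algorithm returns immediately with a future that can be synchronized or composed later.

        // #include <hpx/include/parallel_for_each.hpp>   (header name approximate)
        using namespace hpx::parallel;   // par, task, for_each, as on the slides

        std::vector<double> v(1000000, 1.0);

        // Returns a future<> immediately instead of blocking until completion.
        auto f = for_each(par(task), v.begin(), v.end(),
                          [](double& d) { d *= 2.0; });

        // ... overlap other work here ...

        f.wait();   // synchronize only when the result is actually needed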
  • 43. Executors • Executors are objects responsible for • Creating execution agents on which work is performed (P0058) • In P0058 this is limited to parallel algorithms, here much broader use • Abstraction of the (potentially platform-specific) mechanisms for launching work • Responsible for defining the Where and How of the execution of tasks C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 27/ 51
  • 44. Execution Parameters • Allow controlling the grain size of work • i.e. the number of iterations of a parallel for_each run on the same thread • Similar to OpenMP scheduling policies: static, guided, dynamic • Much finer control C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 28/ 51
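  For instance (a fragment; static_chunk_size as the name of the executor parameter is an assumption about the HPX spelling of the day):

        // #include <hpx/include/parallel_for_each.hpp>   (header name approximate)
        using namespace hpx::parallel;

        std::vector<double> v(1000000, 1.0);

        // Ask the scheduler to hand out roughly 4096 iterations per task.
        for_each(par.with(static_chunk_size(4096)),
                 v.begin(), v.end(),
                 [](double& d) { d += 1.0; });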
  • 45. Putting it all together – SAXPY routine with data locality • a[i] = b[i] ∗ x + c[i], for i from 0 to N − 1 • Using parallel algorithms • Explicit Control over data locality • No raw Loops C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 29/ 51
  • 46. Putting it all together – SAXPY routine with data locality Complete serial version:
        std::vector<double> a = ...;
        std::vector<double> b = ...;
        std::vector<double> c = ...;
        double x = ...;

        std::transform(b.begin(), b.end(), c.begin(), a.begin(),
            [x](double bb, double cc) { return bb * x + cc; });
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 30/ 51
  • 47. Putting it all together – SAXPY routine with data locality Parallel version, no data locality:
        std::vector<double> a = ...;
        std::vector<double> b = ...;
        std::vector<double> c = ...;
        double x = ...;

        parallel::transform(parallel::par,
            b.begin(), b.end(), c.begin(), c.end(), a.begin(),
            [x](double bb, double cc) { return bb * x + cc; });
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 31/ 51
  • 48. Putting it all together – SAXPY routine with data locality Parallel version, with explicit data locality:
        std::vector<double, numa_allocator> a = ...;
        std::vector<double, numa_allocator> b = ...;
        std::vector<double, numa_allocator> c = ...;
        double x = ...;

        for (auto& numa_executor : numa_executors)
        {
            parallel::transform(parallel::par.on(numa_executor),
                b.begin() + ..., b.begin() + ...,
                c.begin() + ..., c.begin() + ...,
                a.begin() + ...,
                [x](double bb, double cc) { return bb * x + cc; });
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 32/ 51
  • 49. Case Studies This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 50. LibGeoDecomp • C++ Auto-parallelizing framework • Open Source • High scalability • Wide range of platform support • http://www.libgeodecomp.org C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 34/ 51
  • 51. LibGeoDecomp Futurizing the Simulation Flow Basic Simulation flow:
        for (Region r: innerRegion) {
            update(r, oldGrid, newGrid, step);
        }
        swap(oldGrid, newGrid);
        ++step;
        for (Region r: outerGhostZoneRegion) {
            notifyPatchProviders(r, oldGrid);
        }
        for (Region r: outerGhostZoneRegion) {
            update(r, oldGrid, newGrid, step);
        }
        for (Region r: innerGhostZoneRegion) {
            notifyPatchAccepters(r, oldGrid);
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 35/ 51
  • 52. LibGeoDecomp Futurizing the Simulation Flow Futurized Simulation flow (each stage is attached as a continuation of its predecessors):
        parallel for (Region r: innerRegion) {
            update(r, oldGrid, newGrid, step);
        }
        swap(oldGrid, newGrid);
        ++step;
        parallel for (Region r: outerGhostZoneRegion) {
            notifyPatchProviders(r, oldGrid);
        }
        parallel for (Region r: outerGhostZoneRegion) {
            update(r, oldGrid, newGrid, step);
        }
        parallel for (Region r: innerGhostZoneRegion) {
            notifyPatchAccepters(r, oldGrid);
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 36/ 51
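  One way such continuations can be expressed (a sketch only, with hypothetical helpers update_async and notify_providers_async returning hpx::future<void>; this is not LibGeoDecomp's actual API):

        // The inner region can be updated immediately.
        hpx::future<void> inner =
            update_async(innerRegion, oldGrid, newGrid, step);

        // The outer ghost zones are updated only after the patch providers ran.
        hpx::future<void> providers =
            notify_providers_async(outerGhostZoneRegion, oldGrid);

        hpx::future<void> outer = hpx::dataflow(
            [&](hpx::future<void>) {
                update(outerGhostZoneRegion, oldGrid, newGrid, step);
            },
            std::move(providers));

        // The step is finished when both updates are done; then notify accepters.
        hpx::future<void> step_done = hpx::dataflow(
            [&](hpx::future<void>, hpx::future<void>) {
                notifyPatchAccepters(innerGhostZoneRegion, oldGrid);
            },
            std::move(inner), std::move(outer));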
  • 53. HPXCL – Extending the Global Address Space • All GPU devices are addressable globally • GPU memory can be allocated and referenced remotely • Events are extensions of the shared state ⇒ API embedded into the already existing future facilities C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 37/ 51
  • 54. From async to GPUs Spawning single tasks not feasible ⇒ offload a work group (Think of parallel::for_each)
        auto devices = hpx::opencl::find_devices(hpx::find_here(),
            CL_DEVICE_TYPE_GPU).get();

        // create buffers, programs and kernels ...
        hpx::opencl::buffer buf =
            devices[0].create_buffer(CL_MEM_READ_WRITE, 4711);

        auto write_future = buf.enqueue_write(some_vec.begin(), some_vec.end());
        auto kernel_future = kernel.enqueue(dim, write_future);
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 38/ 51
  • 55. From async to GPUs Spawning single tasks not feasible ⇒ offload a work group (Think of parallel::for_each) • Proof of Concept • Future Directions: • Embed OpenCL devices behind Execution Policies and Executors • Hide OpenCL specifics behind parallel algorithms • Hide OpenCL buffer management behind "distributed data structures" C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 38/ 51
  • 56. Mandelbrot example [Architecture diagram with components: Webserver, Queue, Generator, Workers, Google Maps API Client] Acknowledgements to Martin Stumpf C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 39/ 51
  • 57. Mandelbrot example Acknowledgements to Martin Stumpf C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 40/ 51
  • 58. Mandelbrot example Acknowledgements to Martin Stumpf C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 41/ 51
  • 59. LibGeoDecomp Performance Results [Plot: Execution Times of HPX and MPI N-Body Codes (SMP, Weak Scaling); Time [s] vs. Number of Cores on one Node; series: Sim HPX, Sim MPI, Comm HPX, Comm MPI] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 60. LibGeoDecomp Performance Results [Plot] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 61. LibGeoDecomp Performance Results [Plot: Weak Scaling Results for HPX N-Body Code (Single Xeon Phi, Futurized); Performance in GFLOPS vs. Number of Cores; series: 1, 2, 3, and 4 Threads/Core] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 62. LibGeoDecomp Performance Results [Plot: Weak Scaling Results for HPX N-Body Codes (Host Cores and Xeon Phi Accelerator); Performance in TFLOPS vs. Number of Nodes (16 cores on host, full Xeon Phi); series: HPX, Peak] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 63. STREAM Benchmark [Plot: TRIAD STREAM Results (50 million data points); Bandwidth [GB/s] vs. Number of cores per NUMA Domain; series: HPX (1 NUMA Domain), OpenMP (1 NUMA Domain), HPX (2 NUMA Domains), OpenMP (2 NUMA Domains)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 43/ 51
  • 64. Matrix Transpose [Plot: Matrix Transpose (SMP, 24kx24k Matrices); Data transfer rate [GB/s] vs. Number of cores per NUMA domain; series: HPX (1 NUMA Domain), HPX (2 NUMA Domains), OMP (1 NUMA Domain), OMP (2 NUMA Domains)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 44/ 51
  • 65. Matrix Transpose [Plot: Matrix Transpose (SMP, 24kx24k Matrices); Data transfer rate [GB/s] vs. Number of cores per NUMA domain; series: HPX (2 NUMA Domains), MPI (1 NUMA Domain, 12 ranks), MPI (2 NUMA Domains, 24 ranks), MPI+OMP (2 NUMA Domains)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 45/ 51
  • 66. Matrix Transpose [Plot: Matrix Transpose (Xeon/Phi, 24kx24k matrices); Data transfer rate [GB/s] vs. Number of cores; series: HPX and OMP at 1, 2, and 4 PUs per core] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 46/ 51
  • 67. Matrix Transpose [Plot: Matrix Transpose (Distributed, 18kx18k elements per node); Data transfer rate [GB/s] vs. Number of nodes (16 cores each); series: HPX, MPI] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 47/ 51
  • 68. What’s beyond Exascale? This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 69. Conclusions Higher-level parallelization abstractions in C++: • uniform, versatile, and generic • All of this is enabled by use of modern C++ facilities • Runtime system (fine-grain, task-based schedulers) • Performant, portable implementation C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 49/ 51
  • 70. Parallelism is here to stay! • Massively parallel hardware is already part of our daily lives! • Parallelism is observable everywhere: ⇒ IoT: Massive numbers of devices existing in parallel ⇒ Embedded: Meet massively parallel energy-aware systems (Epiphany, DSPs, FPGAs) ⇒ Automotive: Massive amounts of parallel sensor data to process • We all need solutions for dealing with this, efficiently and pragmatically C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 50/ 51
  • 71. More Information • https://github.com/STEllAR-GROUP/hpx • http://stellar-group.org • hpx-users@stellar.cct.lsu.edu • #STE||AR @ irc.freenode.org Collaborations: • FET-HPC (H2020): AllScale (https://allscale.eu) • NSF: STORM (http://storm.stellar-group.org) • DOE: Part of X-Stack C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 51/ 51