C++ on its way to exascale and beyond
– The HPX Parallel Runtime System
Thomas Heller (thomas.heller@cs.fau.de)
January 21, 2016
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671603.
What is Exascale anyway?
Exascale in numbers
• An exascale computer is supposed to execute 10^18 floating point operations per second
• Exa: 10^18 = 1,000,000,000,000,000,000
• People on Earth: 7.3 billion = 7.3 × 10^9
• Imagine each person is able to compute one operation per second. Working through 10^18 operations then takes:
⇒ 136,986,301 seconds
⇒ 2,283,105 minutes
⇒ 38,051 hours
⇒ 1,585 days
⇒ over 4 years
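A quick check of the arithmetic (added for clarity, matching the numbers above):

10^18 ops ÷ (7.3 × 10^9 persons × 1 op/s) ≈ 1.37 × 10^8 s ≈ 2.28 × 10^6 min ≈ 3.8 × 10^4 h ≈ 1,585 days ≈ 4.3 years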
Why do we need that many calculations?
Challenges
• How do we program those beasts?
⇒ Massively parallel processors
⇒ Massive numbers of compute nodes
⇒ Deep memory hierarchies
• How can we design the architecture to be affordable?
⇒ Biggest operational cost is energy
⇒ Power envelope of 20 MW
⇒ Current fastest computer (Tianhe-2): 17 MW
Current Development
Current #1 system:
• Tianhe-2: 33.9 PFLOPS
• roughly 3.4% of an exaflop
Hardware Trends
• ARM: low-power ARM64 cores (possibly with embedded GPU accelerators)
• IBM: POWER + NVIDIA accelerators
• Intel: Knights Landing (Xeon Phi) many-core processor
How will C++ deal with all that?!?
Challenges
• Programmability
• Expressing Parallelism
• Expressing Data Locality
The 4 Horsemen of the Apocalypse: SLOW
• Starvation
• Latency
• Overhead
• Waiting for contention
State of the Art
• Modern architectures impose massive challenges on programmability in
the context of performance portability
• Massive increase in on-node parallelism
• Deep memory hierarchies
• The only portable parallelization solutions for C++ programmers (today): OpenMP and MPI
• Hugely successful for years
• Widely used and supported
• Simple use for simple use cases
• Very portable
• Highly optimized
State of the Art – Parallelism in C++
• C++11 introduced lower-level abstractions
• std::thread, std::mutex, std::future, etc.
• Fairly limited, more is needed
• C++ needs stronger support for higher-level parallelism
• Several proposals to the Standardization Committee are accepted or
under consideration
• Technical Specification: Concurrency (P0159, note: misnomer)
• Technical Specification: Parallelism (P0024)
• Other smaller proposals: resumable functions, task regions, executors
• Currently there is no overarching vision related to higher-level parallelism
• Goal is to standardize a ‘big story’ by 2020
• No need for OpenMP, OpenACC, OpenCL, etc.
Stepping Aside – Introducing HPX
HPX – A general purpose parallel Runtime System
• Solidly based on a theoretical foundation – a well-defined, new execution model (ParalleX)
• Exposes a coherent and uniform, standards-oriented API for ease of programming parallel and distributed applications
• Enables writing fully asynchronous code using hundreds of millions of threads
• Provides unified syntax and semantics for local and remote operations
• Open source: published under the Boost Software License
HPX – A general purpose parallel Runtime System
HPX represents an innovative mixture of
• A global system-wide address space (AGAS - Active Global Address
Space)
• Fine grain parallelism and lightweight synchronization
• Combined with implicit, work queue based, message driven computation
• Full semantic equivalence of local and remote execution, and
• Explicit support for hardware accelerators (through percolation)
HPX 101 – The programming model

(Diagram, built up over several slides: N localities, each with its own memory and a thread scheduler running many lightweight HPX threads; the Active Global Address Space (AGAS) service spans all localities, and a parcelport connects them for message-driven remote work.)

Remote objects and remote work are addressed through the global address space:

future<id_type> id = new_<Component>(locality, ...);
future<R> result = async(id.get(), action, ...);
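For a runnable flavor of this model, a minimal sketch using a plain action on a single locality (not from the slides; HPX_PLAIN_ACTION, hpx::async on a locality id, and hpx::find_here are documented HPX facilities, but include paths differ between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>  // HPX_PLAIN_ACTION; header layout may vary
#include <hpx/include/async.hpp>
#include <iostream>

int square(int x) { return x * x; }
HPX_PLAIN_ACTION(square, square_action);  // makes square() remotely invocable

int main() {
    // Invoke the action on a locality; here the local one, but any id
    // returned by hpx::find_all_localities() works the same way.
    hpx::future<int> f = hpx::async(square_action(), hpx::find_here(), 7);
    std::cout << f.get() << "\n";  // prints 49
    return 0;
}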
HPX 101 – Overview

(Layers: HPX / C++ Standard Library / C++)

                          Synchronous               Asynchronous                     Fire & Forget
R f(p...)                 (returns R)               (returns future<R>)              (returns void)

Functions (direct)        f(p...)                   async(f, p...)                   apply(f, p...)
Functions (lazy)          bind(f, p...)(...)        async(bind(f, p...), ...)        apply(bind(f, p...), ...)
Actions (direct)          HPX_ACTION(f, a)          HPX_ACTION(f, a)                 HPX_ACTION(f, a)
                          a()(id, p...)             async(a(), id, p...)             apply(a(), id, p...)
Actions (lazy)            HPX_ACTION(f, a)          HPX_ACTION(f, a)                 HPX_ACTION(f, a)
                          bind(a(), id, p...)(...)  async(bind(a(), id, p...), ...)  apply(bind(a(), id, p...), ...)

In addition: dataflow(func, f1, f2);
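To make the plain-function rows of the table concrete, a minimal sketch (not from the slides; hpx::async and hpx::apply are the documented entry points, but include paths differ between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>  // hpx::async, hpx::apply; header layout may vary
#include <functional>
#include <iostream>

int add(int a, int b) { return a + b; }

int main() {
    // Synchronous, direct: a plain call, returns R
    int r1 = add(40, 2);

    // Asynchronous, direct: returns future<R>
    hpx::future<int> r2 = hpx::async(&add, 40, 2);

    // Asynchronous, lazy: bind first, launch later
    auto bound = std::bind(&add, 40, std::placeholders::_1);
    hpx::future<int> r3 = hpx::async(bound, 2);

    // Fire & forget: returns void, completion is not observed directly
    hpx::apply([](int v) { std::cout << "fire & forget saw " << v << "\n"; }, r1);

    std::cout << r2.get() + r3.get() << "\n";  // 84
    return 0;
}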
The Future, an example

int universal_answer() { return 42; }

void deep_thought() {
    future<int> promised_answer = async(&universal_answer);
    // do other things for 7.5 million years
    cout << promised_answer.get() << endl;  // prints 42, eventually
}
Compositional facilities
• Sequential composition of futures

future<string> make_string() {
    future<int> f1 = async([]() -> int { return 123; });
    future<string> f2 = f1.then(
        [](future<int> f) -> string
        {
            // here .get() won't block
            return to_string(f.get());
        });
    return f2;
}
Compositional facilities
• Parallel composition of futures

future<int> test_when_all() {
    future<int> future1 = async([]() -> int { return 125; });
    future<string> future2 = async([]() -> string { return string("hi"); });

    auto all_f = when_all(future1, future2);

    future<int> result = all_f.then(
        [](auto f) -> int {
            return do_work(f.get());
        });
    return result;
}
Dataflow – The new 'async' (HPX)
• What if one or more arguments to 'async' are futures themselves?
• Normal behavior: pass futures through to the function
• Extended behavior: wait for the futures to become ready before invoking the function:

template <typename F, typename... Args>
future<result_of_t<F(Args...)>>
// requires(is_callable<F(Args...)>)
dataflow(F&& f, Args&&... args);

• If ArgN is a future, then the invocation of F will be delayed
• Non-future arguments are passed through
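A minimal sketch of this behavior (not from the slides; hpx::dataflow delivers the future argument already ready, so .get() does not block; older HPX releases may spell it hpx::lcos::local::dataflow):

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>  // futures, async, dataflow; header layout may vary
#include <iostream>
#include <utility>

int main() {
    hpx::future<int> f = hpx::async([] { return 20; });

    // One future argument, one plain argument: dataflow waits for f to become
    // ready, then invokes the callable; the non-future 22 is passed through.
    hpx::future<int> sum = hpx::dataflow(
        [](hpx::future<int> a, int b) { return a.get() + b; },
        std::move(f), 22);

    std::cout << sum.get() << "\n";  // prints 42
    return 0;
}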
Parallel Algorithms
Concepts of Parallelism – Parallel Execution Properties
• The execution restrictions applicable for the work items
• In what sequence the work items have to be executed
• Where the work items should be executed
• The parameters of the execution environment
Concepts and Types of Parallelism

(Diagram, built up over several slides:)
• Application – uses Concepts such as Parallel Algorithms, Fork-Join, etc., built on Futures, Async, and Dataflow
• Execution Policies – express Restrictions
• Executors – determine the Sequence and Where of execution
• Executor Parameters – control the Grain Size
Execution Policies (std)
• Specify execution guarantees (in terms of thread-safety) for executed
parallel tasks:
• sequential_execution_policy: seq
• parallel_execution_policy: par
• parallel_vector_execution_policy: par_vec
• In parallelism TS used for parallel algorithms only
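A minimal sketch of using these policies with an HPX parallel algorithm (not from the slides; the namespace alias and header paths are assumptions, as spellings have moved between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(1000, 1);

    // Sequential: runs on the calling thread, same result as a plain loop
    parallel::for_each(parallel::seq, v.begin(), v.end(), [](int& x) { x += 1; });

    // Parallel: work is split across HPX worker threads
    parallel::for_each(parallel::par, v.begin(), v.end(), [](int& x) { x *= 2; });

    return 0;
}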
Execution Policies (Extensions)
• Asynchronous Execution Policies:
• sequential_task_execution_policy: seq(task)
• parallel_task_execution_policy: par(task)
• In both cases the formerly synchronous functions return a future<>
• Instruct the parallel construct to be executed asynchronously
• Allows integration with asynchronous control flow
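A sketch of the asynchronous (task) form, where the algorithm returns a future instead of blocking (assumed spelling par(task), matching the slide; exact names differ across HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(1000, 1);

    // Returns immediately; the future becomes ready once the loop has finished,
    // so the algorithm composes with .then(), when_all(), dataflow(), ...
    auto done = parallel::for_each(parallel::par(parallel::task),
                                   v.begin(), v.end(), [](int& x) { x *= 2; });
    done.get();  // wait for completion
    return 0;
}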
Executors
• Executors are objects responsible for
• Creating execution agents on which work is performed (P0058)
• In P0058 this is limited to parallel algorithms; here it has a much broader use
• Abstraction of the (potentially platform-specific) mechanisms for launching work
• Responsible for defining the Where and How of the execution of tasks
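A minimal sketch of handing an executor to an algorithm via a policy (not from the slides; parallel_executor is assumed as the executor type, and executor names/namespaces have changed across HPX releases):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>   // header layout may vary per HPX version
#include <hpx/include/parallel_executors.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(1000, 1);

    // The executor decides where and how execution agents are created;
    // .on() rebinds the execution policy to that executor.
    parallel::parallel_executor exec;
    parallel::for_each(parallel::par.on(exec), v.begin(), v.end(),
                       [](int& x) { x += 1; });
    return 0;
}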
Execution Parameters
• Allow controlling the grain size of work, i.e. the number of iterations of a parallel for_each run on the same thread
• Similar to OpenMP scheduling policies: static, guided, dynamic
• But with much finer control
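A sketch of controlling grain size with an executor parameter (assumed name static_chunk_size; the exact namespace has changed between HPX versions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>  // header layout may vary per HPX version
#include <vector>

namespace parallel = hpx::parallel;

int main() {
    std::vector<int> v(100000, 1);

    // Each HPX task processes 4096 consecutive iterations,
    // roughly comparable to OpenMP's schedule(static, 4096).
    parallel::for_each(parallel::par.with(parallel::static_chunk_size(4096)),
                       v.begin(), v.end(), [](int& x) { x *= 2; });
    return 0;
}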
Putting it all together – SAXPY routine with data locality
• a[i] = b[i] ∗ x + c[i], for i from 0 to N − 1
• Using parallel algorithms
• Explicit Control over data locality
• No raw Loops
Putting it all together – SAXPY routine with data locality
Complete serial version:

std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

std::transform(b.begin(), b.end(), c.begin(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
Putting it all together – SAXPY routine with data locality
Parallel version, no data locality:

std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

parallel::transform(parallel::par,
    b.begin(), b.end(),
    c.begin(), c.end(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
Putting it all together – SAXPY routine with data locality
Parallel version, with data locality:

std::vector<double, numa_allocator> a = ...;
std::vector<double, numa_allocator> b = ...;
std::vector<double, numa_allocator> c = ...;
double x = ...;

for (auto& numa_executor : numa_executors) {
    parallel::transform(
        parallel::par.on(numa_executor),
        b.begin() + ..., b.begin() + ...,
        c.begin() + ..., c.begin() + ..., a.begin() + ...,
        [x](double bb, double cc)
        { return bb * x + cc; });
}
Case Studies
LibGeoDecomp
• C++ Auto-parallelizing framework
• Open Source
• High scalability
• Wide range of platform support
• http://www.libgeodecomp.org
LibGeoDecomp
Futurizing the Simulation Flow
Basic simulation flow:

for (Region r : innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid);
++step;
for (Region r : outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
for (Region r : outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
for (Region r : innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
LibGeoDecomp
Futurizing the Simulation Flow
Futurized simulation flow (each stage is attached as a continuation of the previous one):

parallel for (Region r : innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid); ++step;
parallel for (Region r : outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
parallel for (Region r : outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
parallel for (Region r : innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
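A hedged sketch of what "futurized" means here, chaining the stages as continuations with hpx::dataflow (the stage functions mirror the pseudocode above and are placeholders, not LibGeoDecomp API):

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>  // futures, dataflow; header layout may vary per HPX version
#include <utility>

// Stand-ins for the simulation stages above.
void update_inner()           { /* update interior cells           */ }
void notify_patch_providers() { /* exchange ghost zones            */ }
void update_ghost_zones()     { /* update cells in the ghost zones */ }

hpx::future<void> one_step(hpx::future<void> previous_step) {
    // Each stage starts only once the future it depends on is ready;
    // the result is again a future, so time steps compose into a chain.
    auto inner = hpx::dataflow(
        [](hpx::future<void>) { update_inner(); }, std::move(previous_step));
    auto ghosts = hpx::dataflow(
        [](hpx::future<void>) { notify_patch_providers(); }, std::move(inner));
    return hpx::dataflow(
        [](hpx::future<void>) { update_ghost_zones(); }, std::move(ghosts));
}

int main() {
    hpx::future<void> step = hpx::make_ready_future();
    for (int i = 0; i < 3; ++i)       // three futurized time steps
        step = one_step(std::move(step));
    step.get();                       // wait for the whole chain
    return 0;
}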
HPXCL – Extending the Global Address Space
• All GPU devices are addressable globally
• GPU memory can be allocated and referenced remotely
• Events are extensions of the shared state
⇒ API embedded into the already existing future facilities
From async to GPUs
Spawning single tasks is not feasible
⇒ offload a work group (think of parallel::for_each)

auto devices = hpx::opencl::find_devices(hpx::find_here(),
    CL_DEVICE_TYPE_GPU).get();

// create buffers, programs and kernels ...
hpx::opencl::buffer buf = devices[0].create_buffer(
    CL_MEM_READ_WRITE, 4711);

auto write_future = buf.enqueue_write(some_vec.begin(), some_vec.end());
auto kernel_future = kernel.enqueue(dim, write_future);
From async to GPUs
Spawning single tasks is not feasible
⇒ offload a work group (think of parallel::for_each)
• Proof of concept
• Future directions:
• Embed OpenCL devices behind execution policies and executors
• Hide OpenCL details behind parallel algorithms
• Hide OpenCL buffer management behind "distributed data structures"
Mandelbrot example

(Architecture diagram: a Generator feeds a Queue; Workers pull work items; a Webserver delivers the computed tiles to a Google Maps API client. Two follow-up slides show the rendered output.)

Acknowledgements to Martin Stumpf
LibGeoDecomp
Performance Results

(Plot: Execution Times of HPX and MPI N-Body Codes, SMP, Weak Scaling; time [s] over number of cores on one node (1–16); series: Sim HPX, Sim MPI, Comm HPX, Comm MPI.)
LibGeoDecomp
Performance Results

(Plot: Weak Scaling Results for HPX N-Body Code, Single Xeon Phi, Futurized; performance [GFLOPS] over number of cores (0–60); series: 1, 2, 3, and 4 threads/core.)
LibGeoDecomp
Performance Results

(Plot: Weak Scaling Results for HPX N-Body Codes, Host Cores and Xeon Phi Accelerator; performance [TFLOPS] over number of nodes (up to 16, each with 16 host cores and a full Xeon Phi); series: HPX, Peak.)
STREAM Benchmark

(Plot: TRIAD STREAM results, 50 million data points; bandwidth [GB/s] over number of cores per NUMA domain (1–12); series: HPX (1 NUMA domain), OpenMP (1 NUMA domain), HPX (2 NUMA domains), OpenMP (2 NUMA domains).)
Matrix Transpose

(Plot: Matrix Transpose, SMP, 24k x 24k matrices; data transfer rate [GB/s] over number of cores per NUMA domain (1–12); series: HPX (1 NUMA domain), HPX (2 NUMA domains), OMP (1 NUMA domain), OMP (2 NUMA domains).)
Matrix Transpose

(Plot: Matrix Transpose, SMP, 24k x 24k matrices; data transfer rate [GB/s] over number of cores per NUMA domain (1–12); series: HPX (2 NUMA domains), MPI (1 NUMA domain, 12 ranks), MPI (2 NUMA domains, 24 ranks), MPI+OMP (2 NUMA domains).)
Matrix Transpose

(Plot: Matrix Transpose, Xeon Phi, 24k x 24k matrices; data transfer rate [GB/s] over number of cores (0–60); series: HPX and OMP with 1, 2, and 4 PUs per core.)
Matrix Transpose

(Plot: Matrix Transpose, distributed, 18k x 18k elements per node; data transfer rate [GB/s] over number of nodes (2–8, 16 cores each); series: HPX, MPI.)
What’s beyond Exascale?
Conclusions
Higher-level parallelization abstractions in C++:
• uniform, versatile, and generic
• All of this is enabled by use of modern C++ facilities
• Runtime system (fine-grain, task-based schedulers)
• Performant, portable implementation
Parallelism is here to stay!
• Massively parallel hardware is already part of our daily lives!
• Parallelism is observable everywhere:
⇒ IoT: massive numbers of devices existing in parallel
⇒ Embedded: massively parallel, energy-aware systems (Epiphany, DSPs, FPGAs)
⇒ Automotive: massive amounts of parallel sensor data to process
• We all need solutions for dealing with this efficiently and pragmatically
More Information
• https://github.com/STEllAR-GROUP/hpx
• http://stellar-group.org
• hpx-users@stellar.cct.lsu.edu
• #STE||AR @ irc.freenode.org
Collaborations:
• FET-HPC (H2020): AllScale (https://allscale.eu)
• NSF: STORM (http://storm.stellar-group.org)
• DOE: Part of X-Stack
More Related Content

What's hot

Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
Muralidharan Deenathayalan
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisLWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
Jonas Traub
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
Flink Forward
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
Samuel Bosch
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
Vasia Kalavri
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AI
Jim Dowling
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
Hassan A-j
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
Wim Vanderbauwhede
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
Claudio Martella
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
Kostas Tzoumas
 
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PCVENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
Qin Liu
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
Turi, Inc.
 

What's hot (18)

Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisLWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AI
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PCVENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
 

Viewers also liked

Римский корсаков снегурочка
Римский корсаков снегурочкаРимский корсаков снегурочка
Римский корсаков снегурочка
Ninel Kek
 
Цветочные легенды
Цветочные легендыЦветочные легенды
Цветочные легенды
Ninel Kek
 
High Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRSHigh Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRSJonathan Oliver
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Apache Apex
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Apache Apex
 
бсп (обоб. урок)
бсп (обоб. урок)бсп (обоб. урок)
бсп (обоб. урок)
HomichAlla
 
правописание приставок урок№4
правописание приставок урок№4правописание приставок урок№4
правописание приставок урок№4
HomichAlla
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorial
james tong
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Spark Summit
 
Windowing in Apache Apex
Windowing in Apache ApexWindowing in Apache Apex
Windowing in Apache Apex
Apache Apex
 
The 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy CodeThe 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy Code
Roberto Cortez
 
Hadoop File System Shell Commands,
Hadoop File System Shell Commands,Hadoop File System Shell Commands,
Hadoop File System Shell Commands,
Hadoop online training
 
Hadoop basic commands
Hadoop basic commandsHadoop basic commands
Hadoop basic commands
bispsolutions
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
Build your shiny new pc, with Pangoly
Build your shiny new pc, with PangolyBuild your shiny new pc, with Pangoly
Build your shiny new pc, with Pangoly
Pangoly
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
Apache Apex
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
Emilio Coppa
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
Apache Apex
 
Introduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examplesIntroduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examples
Noé Fernández-Pozo
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
Apache Apex
 

Viewers also liked (20)

Римский корсаков снегурочка
Римский корсаков снегурочкаРимский корсаков снегурочка
Римский корсаков снегурочка
 
Цветочные легенды
Цветочные легендыЦветочные легенды
Цветочные легенды
 
High Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRSHigh Performance Distributed Systems with CQRS
High Performance Distributed Systems with CQRS
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
бсп (обоб. урок)
бсп (обоб. урок)бсп (обоб. урок)
бсп (обоб. урок)
 
правописание приставок урок№4
правописание приставок урок№4правописание приставок урок№4
правописание приставок урок№4
 
Troubleshooting mysql-tutorial
Troubleshooting mysql-tutorialTroubleshooting mysql-tutorial
Troubleshooting mysql-tutorial
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Windowing in Apache Apex
Windowing in Apache ApexWindowing in Apache Apex
Windowing in Apache Apex
 
The 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy CodeThe 5 People in your Organization that grow Legacy Code
The 5 People in your Organization that grow Legacy Code
 
Hadoop File System Shell Commands,
Hadoop File System Shell Commands,Hadoop File System Shell Commands,
Hadoop File System Shell Commands,
 
Hadoop basic commands
Hadoop basic commandsHadoop basic commands
Hadoop basic commands
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Build your shiny new pc, with Pangoly
Build your shiny new pc, with PangolyBuild your shiny new pc, with Pangoly
Build your shiny new pc, with Pangoly
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
Introduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examplesIntroduction to UNIX Command-Lines with examples
Introduction to UNIX Command-Lines with examples
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
 

Similar to C++ on its way to exascale and beyond -- The HPX Parallel Runtime System

Software Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareSoftware Abstractions for Parallel Hardware
Software Abstractions for Parallel Hardware
Joel Falcou
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
Thomas Wuerthinger
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
HPCC Systems
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
dairsie
 
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
HPCC Systems
 
Going deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkusGoing deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkus
Red Hat Developers
 
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
SasidharaKashyapChat
 
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
Ioan Toma
 
HDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesHDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel Architectures
Joel Falcou
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
Edge AI and Vision Alliance
 
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Akihiro Hayashi
 
Deployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardwareDeployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardware
Intel IT Center
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
ExtremeEarth
 
EuroHPC AI in DAPHNE
EuroHPC AI in DAPHNEEuroHPC AI in DAPHNE
EuroHPC AI in DAPHNE
University of Maribor
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Alluxio, Inc.
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
Dirk Petersen
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
Esteban Hernandez
 
C cerin piv2017_c
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_c
Bertrand Tavitian
 
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
Edge AI and Vision Alliance
 

Similar to C++ on its way to exascale and beyond -- The HPX Parallel Runtime System (20)

Software Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareSoftware Abstractions for Parallel Hardware
Software Abstractions for Parallel Hardware
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
 
Going deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkusGoing deep (learning) with tensor flow and quarkus
Going deep (learning) with tensor flow and quarkus
 
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
24-TensorFlow-Clipper.pptxnjjjjnjjjjjjmm
 
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)
 
HDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesHDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel Architectures
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
 
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
 
Deployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardwareDeployment of an HPC Cloud based on Intel hardware
Deployment of an HPC Cloud based on Intel hardware
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
 
EuroHPC AI in DAPHNE
EuroHPC AI in DAPHNEEuroHPC AI in DAPHNE
EuroHPC AI in DAPHNE
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
 
C cerin piv2017_c
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_c
 
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
“Khronos Group Standards: Powering the Future of Embedded Vision,” a Presenta...
 

Recently uploaded

Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 

Recently uploaded (20)

Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 

C++ on its way to exascale and beyond -- The HPX Parallel Runtime System

  • 18. State of the Art • Modern architectures impose massive challenges on programmability in the context of performance portability • Massive increase in on-node parallelism • Deep memory hierarchies • Only portable parallelization solution for C++ programmers (today): OpenMP and MPI • Hugely successful for years • Widely used and supported • Simple use for simple use cases • Very portable • Highly optimized C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 11/ 51
  • 19. State of the Art – Parallelism in C++ • C++11 introduced lower level abstractions • std::thread, std::mutex, std::future, etc. • Fairly limited, more is needed • C++ needs stronger support for higher-level parallelism • Several proposals to the Standardization Committee are accepted or under consideration • Technical Specification: Concurrency (P0159, note: misnomer) • Technical Specification: Parallelism (P0024) • Other smaller proposals: resumable functions, task regions, executors • Currently there is no overarching vision related to higher-level parallelism • Goal is to standardize a ‘big story’ by 2020 • No need for OpenMP, OpenACC, OpenCL, etc. C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 12/ 51
  • 20. Stepping Aside – Introducing HPX This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 21. HPX – A general purpose parallel Runtime System • Solidly based on a theoretical foundation – a well-defined, new execution model (ParalleX) • Exposes a coherent and uniform, standards-oriented API for ease of programming parallel and distributed applications. • Enables writing fully asynchronous code using hundreds of millions of threads. • Provides unified syntax and semantics for local and remote operations. • Open Source: Published under the Boost Software License C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 14/ 51
  • 22. HPX – A general purpose parallel Runtime System HPX represents an innovative mixture of • A global system-wide address space (AGAS - Active Global Address Space) • Fine grain parallelism and lightweight synchronization • Combined with implicit, work queue based, message driven computation • Full semantic equivalence of local and remote execution, and • Explicit support for hardware accelerators (through percolation) C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 15/ 51
  • 23. HPX 101 – The programming model [Diagram: the memory of Locality 0, Locality 1, …, Locality i, …, Locality N-1] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 24. HPX 101 – The programming model [Diagram: a Global Address Space spans the memory of all localities; the localities are connected through the parcelport and the Active Global Address Space (AGAS) service, and each locality runs its own thread scheduler] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 25. HPX 101 – The programming model [Diagram: as before, with each thread scheduler managing many lightweight HPX threads] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 26. HPX 101 – The programming model [Diagram: as before] future<id_type> id = new_<Component>(locality, ...); future<R> result = async(id.get(), action, ...); C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
  • 27. HPX 101 – The programming model [Diagram: Locality 0 … Locality N-1, connected through the parcelport, the AGAS service, and per-locality thread schedulers] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 16/ 51
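  To make the two calls on slide 26 concrete, here is a minimal sketch (a code fragment, not from the slides): it assumes a hypothetical component type particle_container with a hypothetical get_size_action; the component/action registration boilerplate and headers are omitted.

        // Create a component instance on a chosen locality; new_ returns a
        // future to the global id of the new object.
        hpx::id_type where = hpx::find_here();
        hpx::future<hpx::id_type> id =
            hpx::new_<particle_container>(where, 1000);   // 1000: hypothetical constructor argument

        // Invoke an action on that object; async returns a future to the result.
        hpx::future<std::size_t> n =
            hpx::async<particle_container::get_size_action>(id.get());

        std::cout << n.get() << std::endl;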
  • 28. HPX 101 – Overview (HPX C++ Standard Library vs. plain C++, for a callable R f(p...))
        Synchronous (returns R)  |  Asynchronous (returns future<R>)  |  Fire & Forget (returns void)
        Functions (direct):  f(p...)  |  async(f, p...)  |  apply(f, p...)
        Functions (lazy):    bind(f, p...)(...)  |  async(bind(f, p...), ...)  |  apply(bind(f, p...), ...)
        Actions (direct):    HPX_ACTION(f, a); a()(id, p...)  |  HPX_ACTION(f, a); async(a(), id, p...)  |  HPX_ACTION(f, a); apply(a(), id, p...)
        Actions (lazy):      HPX_ACTION(f, a); bind(a(), id, p...)(...)  |  HPX_ACTION(f, a); async(bind(a(), id, p...), ...)  |  HPX_ACTION(f, a); apply(bind(a(), id, p...), ...)
        In Addition: dataflow(func, f1, f2);
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 17/ 51
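  As a minimal sketch of the table above (not from the slides): the three invocation flavours for a plain function and for the corresponding plain action, assuming the HPX_PLAIN_ACTION macro and the hpx::async/hpx::apply overloads of HPX at the time; header names are approximate.

        #include <hpx/hpx_main.hpp>
        #include <hpx/include/async.hpp>
        #include <hpx/include/actions.hpp>
        #include <iostream>

        int square(int i) { return i * i; }
        HPX_PLAIN_ACTION(square, square_action);   // make 'square' remotely invocable

        int main()
        {
            // Functions: synchronous, asynchronous, fire & forget
            int r1 = square(3);
            hpx::future<int> r2 = hpx::async(&square, 4);
            hpx::apply(&square, 5);                           // result is discarded

            // Actions: the same three flavours, addressed via a locality id
            hpx::id_type here = hpx::find_here();
            int r3 = square_action()(here, 6);                // synchronous
            hpx::future<int> r4 = hpx::async(square_action(), here, 7);
            hpx::apply(square_action(), here, 8);             // fire & forget

            std::cout << r1 << " " << r2.get() << " "
                      << r3 << " " << r4.get() << std::endl;
            return 0;
        }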
  • 29. The Future, an example
        int universal_answer() { return 42; }

        void deep_thought()
        {
            future<int> promised_answer = async(&universal_answer);
            // do other things for 7.5 million years
            cout << promised_answer.get() << endl;   // prints 42, eventually
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 18/ 51
  • 30. Compositional facilities • Sequential composition of futures
        future<string> make_string()
        {
            future<int> f1 = async([]() -> int { return 123; });

            future<string> f2 = f1.then(
                [](future<int> f) -> string
                {
                    return to_string(f.get());   // here .get() won’t block
                });

            return f2;
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 19/ 51
  • 31. Compositional facilities • Parallel composition of futures
        future<int> test_when_all()
        {
            future<int> future1 = async([]() -> int { return 125; });
            future<string> future2 = async([]() -> string { return string("hi"); });

            auto all_f = when_all(future1, future2);

            future<int> result = all_f.then(
                [](auto f) -> int { return do_work(f.get()); });

            return result;
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 20/ 51
  • 32. Dataflow – The new ’async’ (HPX) • What if one or more arguments to ’async’ are futures themselves? • Normal behavior: pass futures through to function • Extended behavior: wait for futures to become ready before invoking the function:
        template <typename F, typename... Args>
        future<result_of_t<F(Args...)>>   // requires(is_callable<F(Args...)>)
        dataflow(F&& f, Args&&... args);
        • If ArgN is a future, then the invocation of F will be delayed • Non-future arguments are passed through C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 21/ 51
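  A short usage sketch (not from the slides; the exact headers are approximations of the HPX includes of the time): the lambda passed to dataflow only runs once both futures are ready, and receives them as already-ready futures, so calling .get() inside it does not block.

        #include <hpx/hpx_main.hpp>
        #include <hpx/include/async.hpp>
        #include <hpx/include/lcos.hpp>
        #include <iostream>
        #include <utility>

        int main()
        {
            hpx::future<int>    f1 = hpx::async([] { return 17; });
            hpx::future<double> f2 = hpx::async([] { return 2.5; });

            // Invoked only when f1 and f2 are both ready.
            hpx::future<double> r = hpx::dataflow(
                [](hpx::future<int> a, hpx::future<double> b)
                {
                    return a.get() * b.get();   // does not block here
                },
                std::move(f1), std::move(f2));

            std::cout << r.get() << std::endl;   // prints 42.5, eventually
            return 0;
        }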
  • 33. Parallel Algorithms This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 34. Concepts of Parallelism – Parallel Execution Properties • The execution restrictions applicable for the work items • In what sequence the work items have to be executed • Where the work items should be executed • The parameters of the execution environment C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 23/ 51
  • 35.–40. Concepts and Types of Parallelism [Diagram, built up over six slides: an Application builds on Concepts (Futures, Async, Dataflow, Parallel Algorithms, Fork-Join, etc.), which rest on Execution Policies (restrictions), Executors (sequence, where), and Executor Parameters (grain size)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 24/ 51
  • 41. Execution Policies (std) • Specify execution guarantees (in terms of thread-safety) for executed parallel tasks: • sequential_execution_policy: seq • parallel_execution_policy: par • parallel_vector_execution_policy: par_vec • In parallelism TS used for parallel algorithms only C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 25/ 51
  • 42. Execution Policies (Extensions) • Asynchronous Execution Policies: • sequential_task_execution_policy: seq(task) • parallel_task_execution_policy: par(task) • In both cases the formerly synchronous functions return a future<> • Instruct the parallel construct to be executed asynchronously • Allows integration with asynchronous control flow C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 26/ 51
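  As a sketch (a code fragment; the abbreviated namespaces follow the slides, and the exact spelling of the task policy and header is an assumption about the HPX API of the time): invoked with par(task), a parallel algorithm returns immediately with a future that can be synchronized or composed later.

        // #include <hpx/include/parallel_for_each.hpp>   (header name approximate)
        using namespace hpx::parallel;   // par, task, for_each, as on the slides

        std::vector<double> v(1000000, 1.0);

        // Returns a future<> immediately instead of blocking until completion.
        auto f = for_each(par(task), v.begin(), v.end(),
                          [](double& d) { d *= 2.0; });

        // ... overlap other work here ...

        f.wait();   // synchronize only when the result is actually needed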
  • 43. Executors • Executors are objects responsible for • Creating execution agents on which work is performed (P0058) • In P0058 this is limited to parallel algorithms, here much broader use • Abstraction of the (potentially platform-specific) mechanisms for launching work • Responsible for defining the Where and How of the execution of tasks C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 27/ 51
  • 44. Execution Parameters • Allow controlling the grain size of work • i.e. the number of iterations of a parallel for_each run on the same thread • Similar to OpenMP scheduling policies: static, guided, dynamic • Much finer control C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 28/ 51
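  For instance (a fragment; static_chunk_size as the name of the executor parameter is an assumption about the HPX spelling of the day):

        // #include <hpx/include/parallel_for_each.hpp>   (header name approximate)
        using namespace hpx::parallel;

        std::vector<double> v(1000000, 1.0);

        // Ask the scheduler to hand out roughly 4096 iterations per task.
        for_each(par.with(static_chunk_size(4096)),
                 v.begin(), v.end(),
                 [](double& d) { d += 1.0; });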
  • 45. Putting it all together – SAXPY routine with data locality • a[i] = b[i] ∗ x + c[i], for i from 0 to N − 1 • Using parallel algorithms • Explicit Control over data locality • No raw Loops C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 29/ 51
  • 46. Putting it all together – SAXPY routine with data locality Complete serial version:
        std::vector<double> a = ...;
        std::vector<double> b = ...;
        std::vector<double> c = ...;
        double x = ...;

        std::transform(b.begin(), b.end(), c.begin(), a.begin(),
            [x](double bb, double cc) { return bb * x + cc; });
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 30/ 51
  • 47. Putting it all together – SAXPY routine with data locality Parallel version, no data locality:
        std::vector<double> a = ...;
        std::vector<double> b = ...;
        std::vector<double> c = ...;
        double x = ...;

        parallel::transform(parallel::par,
            b.begin(), b.end(), c.begin(), c.end(), a.begin(),
            [x](double bb, double cc) { return bb * x + cc; });
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 31/ 51
  • 48. Putting it all together – SAXPY routine with data locality Parallel version, with explicit data locality:
        std::vector<double, numa_allocator> a = ...;
        std::vector<double, numa_allocator> b = ...;
        std::vector<double, numa_allocator> c = ...;
        double x = ...;

        for (auto& numa_executor : numa_executors)
        {
            parallel::transform(parallel::par.on(numa_executor),
                b.begin() + ..., b.begin() + ...,
                c.begin() + ..., c.begin() + ...,
                a.begin() + ...,
                [x](double bb, double cc) { return bb * x + cc; });
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 32/ 51
  • 49. Case Studies This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 50. LibGeoDecomp • C++ Auto-parallelizing framework • Open Source • High scalability • Wide range of platform support • http://www.libgeodecomp.org C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 34/ 51
  • 51. LibGeoDecomp Futurizing the Simulation Flow Basic Simulation flow:
        for (Region r: innerRegion) {
            update(r, oldGrid, newGrid, step);
        }
        swap(oldGrid, newGrid);
        ++step;
        for (Region r: outerGhostZoneRegion) {
            notifyPatchProviders(r, oldGrid);
        }
        for (Region r: outerGhostZoneRegion) {
            update(r, oldGrid, newGrid, step);
        }
        for (Region r: innerGhostZoneRegion) {
            notifyPatchAccepters(r, oldGrid);
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 35/ 51
  • 52. LibGeoDecomp Futurizing the Simulation Flow Futurized Simulation flow (each stage is attached as a continuation of its predecessors):
        parallel for (Region r: innerRegion) {
            update(r, oldGrid, newGrid, step);
        }
        swap(oldGrid, newGrid);
        ++step;
        parallel for (Region r: outerGhostZoneRegion) {
            notifyPatchProviders(r, oldGrid);
        }
        parallel for (Region r: outerGhostZoneRegion) {
            update(r, oldGrid, newGrid, step);
        }
        parallel for (Region r: innerGhostZoneRegion) {
            notifyPatchAccepters(r, oldGrid);
        }
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 36/ 51
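  One way such continuations can be expressed (a sketch only, with hypothetical helpers update_async and notify_providers_async returning hpx::future<void>; this is not LibGeoDecomp's actual API):

        // The inner region can be updated immediately.
        hpx::future<void> inner =
            update_async(innerRegion, oldGrid, newGrid, step);

        // The outer ghost zones are updated only after the patch providers ran.
        hpx::future<void> providers =
            notify_providers_async(outerGhostZoneRegion, oldGrid);

        hpx::future<void> outer = hpx::dataflow(
            [&](hpx::future<void>) {
                update(outerGhostZoneRegion, oldGrid, newGrid, step);
            },
            std::move(providers));

        // The step is finished when both updates are done; then notify accepters.
        hpx::future<void> step_done = hpx::dataflow(
            [&](hpx::future<void>, hpx::future<void>) {
                notifyPatchAccepters(innerGhostZoneRegion, oldGrid);
            },
            std::move(inner), std::move(outer));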
  • 53. HPXCL – Extending the Global Address Space • All GPU devices are addressable globally • GPU memory can be allocated and referenced remotely • Events are extensions of the shared state ⇒ API embedded into the already existing future facilities C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 37/ 51
  • 54. From async to GPUs Spawning single tasks not feasible ⇒ offload a work group (Think of parallel::for_each)
        auto devices = hpx::opencl::find_devices(hpx::find_here(),
            CL_DEVICE_TYPE_GPU).get();

        // create buffers, programs and kernels ...
        hpx::opencl::buffer buf =
            devices[0].create_buffer(CL_MEM_READ_WRITE, 4711);

        auto write_future = buf.enqueue_write(some_vec.begin(), some_vec.end());
        auto kernel_future = kernel.enqueue(dim, write_future);
        C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 38/ 51
  • 55. From async to GPUs Spawning single tasks not feasible ⇒ offload a work group (Think of parallel::for_each) • Proof of Concept • Future Directions: • Embed OpenCL devices behind Execution Policies and Executors • Hide OpenCL specifics behind parallel algorithms • Hide OpenCL buffer management behind "distributed data structures" C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 38/ 51
  • 56. Mandelbrot example [Architecture diagram with components: Webserver, Queue, Generator, Workers, Google Maps API Client] Acknowledgements to Martin Stumpf C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 39/ 51
  • 57. Mandelbrot example Acknowledgements to Martin Stumpf C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 40/ 51
  • 58. Mandelbrot example Acknowledgements to Martin Stumpf C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 41/ 51
  • 59. LibGeoDecomp Performance Results [Plot: Execution Times of HPX and MPI N-Body Codes (SMP, Weak Scaling); Time [s] vs. Number of Cores on one Node; series: Sim HPX, Sim MPI, Comm HPX, Comm MPI] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 60. LibGeoDecomp Performance Results [Plot] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 61. LibGeoDecomp Performance Results [Plot: Weak Scaling Results for HPX N-Body Code (Single Xeon Phi, Futurized); Performance in GFLOPS vs. Number of Cores; series: 1, 2, 3, and 4 Threads/Core] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 62. LibGeoDecomp Performance Results [Plot: Weak Scaling Results for HPX N-Body Codes (Host Cores and Xeon Phi Accelerator); Performance in TFLOPS vs. Number of Nodes (16 cores on host, full Xeon Phi); series: HPX, Peak] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 42/ 51
  • 63. STREAM Benchmark [Plot: TRIAD STREAM Results (50 million data points); Bandwidth [GB/s] vs. Number of cores per NUMA Domain; series: HPX (1 NUMA Domain), OpenMP (1 NUMA Domain), HPX (2 NUMA Domains), OpenMP (2 NUMA Domains)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 43/ 51
  • 64. Matrix Transpose [Plot: Matrix Transpose (SMP, 24kx24k Matrices); Data transfer rate [GB/s] vs. Number of cores per NUMA domain; series: HPX (1 NUMA Domain), HPX (2 NUMA Domains), OMP (1 NUMA Domain), OMP (2 NUMA Domains)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 44/ 51
  • 65. Matrix Transpose [Plot: Matrix Transpose (SMP, 24kx24k Matrices); Data transfer rate [GB/s] vs. Number of cores per NUMA domain; series: HPX (2 NUMA Domains), MPI (1 NUMA Domain, 12 ranks), MPI (2 NUMA Domains, 24 ranks), MPI+OMP (2 NUMA Domains)] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 45/ 51
  • 66. Matrix Transpose [Plot: Matrix Transpose (Xeon/Phi, 24kx24k matrices); Data transfer rate [GB/s] vs. Number of cores; series: HPX and OMP at 1, 2, and 4 PUs per core] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 46/ 51
  • 67. Matrix Transpose [Plot: Matrix Transpose (Distributed, 18kx18k elements per node); Data transfer rate [GB/s] vs. Number of nodes (16 cores each); series: HPX, MPI] C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 47/ 51
  • 68. What’s beyond Exascale? This project has received funding from the Eu- ropean Union‘s Horizon 2020 research and in- novation programme under grant agreement No. 671603
  • 69. Conclusions Higher-level parallelization abstractions in C++: • uniform, versatile, and generic • All of this is enabled by use of modern C++ facilities • Runtime system (fine-grain, task-based schedulers) • Performant, portable implementation C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 49/ 51
  • 70. Parallelism is here to stay! • Massively parallel hardware is already part of our daily lives! • Parallelism is observable everywhere: ⇒ IoT: Massive numbers of devices existing in parallel ⇒ Embedded: Meet massively parallel energy-aware systems (Epiphany, DSPs, FPGAs) ⇒ Automotive: Massive amounts of parallel sensor data to process • We all need solutions for dealing with this, efficiently and pragmatically C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 50/ 51
  • 71. More Information • https://github.com/STEllAR-GROUP/hpx • http://stellar-group.org • hpx-users@stellar.cct.lsu.edu • #STE||AR @ irc.freenode.org Collaborations: • FET-HPC (H2020): AllScale (https://allscale.eu) • NSF: STORM (http://storm.stellar-group.org) • DOE: Part of X-Stack C++ on its way to exascale and beyond – The HPX Parallel Runtime System 21.01.2016 | Thomas Heller | 51/ 51