Options and trade-offs for
parallelism and concurrency in
Modern C++
Mats Brorsson
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• Conclusions
Parallelism everywhere
• Server level parallelism
• Distributed memory
• Multicore architectures
• Shared memory
• Instruction-level parallelism
• Vector parallelism
• Thread parallelism
• Hardware vs software threads
• Simultaneous multithreading
• ”Switch-on-event” multithreading
Vector vs Thread parallelism
• Vector parallelism maps naturally to
Regular Data Parallelism
• Inner loops can (sometimes) be
vectorized
• Ways to vectorize:
• Auto-vectorization
• Vector Intrinsics
• _mm_add_ps(__m128 x, __m128 y)
• Compiler hints
• Cilk Plus array notation
• #pragma omp simd
Images courtesy Intel and Rebel Science News
Key features for Performance
• Data locality
• Chunks that fit in cache
• Reuse data locally
• Avoid cache conflicts
• Use few virtual pages
• Avoid false sharing (see the sketch after this list)
• Parallel slack
• Specify potential parallelism much
higher than the actual parallelism
• Load balance
• All threads have the same amount of
work to do
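When several threads update adjacent elements of one shared array, the elements share cache lines and every update invalidates the others' copies. A minimal sketch of the usual fix, padding per-thread data to the cache-line size (NUM_THREADS and the 64-byte line size are assumptions):

constexpr int NUM_THREADS = 8;      // hypothetical thread count
struct alignas(64) PaddedCounter {  // 64 bytes: typical cache-line size
    long value = 0;
};
// Each per-thread counter now occupies its own cache line, so updates
// by different threads no longer invalidate each other's caches.
PaddedCounter counters[NUM_THREADS];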
Example: SAXPY, scaling of a vector
• SAXPY computes y ← a·x + y: it scales a vector x by a scalar factor a and adds the vector y
• y is used for both input and output
• Single-precision floating point (DAXPY: double precision)
• Low arithmetic intensity
• Little arithmetic work compared to the amount of data consumed and
produced
• Per element: 2 FLOPs, 8 bytes read and 4 bytes written
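As a baseline, the serial version is a single loop (a minimal sketch, using the same vector types as the threaded examples that follow):

#include <vector>
using std::vector;

// Serial SAXPY: y[i] = a*x[i] + y[i] for every element.
void saxpy_serial(float a, const vector<float> &x, vector<float> &y) {
    for (std::size_t i = 0; i < x.size(); i++)
        y[i] = a * x[i] + y[i];
}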
The Map Pattern
• Applies a function to every
element of a collection of data
items
• Elemental function
• No side effects
• Embarrassingly parallel
• Often combined with collective
patterns
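In code, the pattern is just a loop applying an elemental function f with no side effects (a generic sketch):

#include <cstddef>
#include <vector>

// Map: apply the side-effect-free elemental function f to every
// element of in, producing a new collection out.
template <typename T, typename F>
std::vector<T> apply_map(const std::vector<T> &in, F f) {
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = f(in[i]);
    return out;
}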
Crash course C++11 thread programming
Compile: g++ -std=c++11 -O -lpthread -o hello-threads hello-threads.cc
#include <thread> at top of file
std::thread t; Declare a thread; acts as a thread handle
t = std::thread(foo, a1, a2); Start a new thread running function foo with arguments a1 and a2
t.join(); Join with thread t: wait for the thread with handle t to finish
std::mutex m; Declare a mutual-exclusion lock
m.lock(); Enter the critical section protected by m
m.unlock(); Leave the critical section
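Putting these pieces together, a minimal hello-threads.cc matching the compile line above (the thread function and its message are assumptions):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;

void hello(int id) {
    m.lock();                        // enter critical section
    std::cout << "Hello from thread " << id << "\n";
    m.unlock();                      // leave critical section
}

int main() {
    std::thread t(hello, 0);         // start a new thread running hello(0)
    hello(1);                        // the main thread does the same work
    t.join();                        // wait for t to finish
    return 0;
}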
Explicit threading on
multicores (in C++11)
• Create one thread per core
• Divide the work manually
• Substantial amount of extra code over the serial version
• Inflexible scheduling of work
void saxpy_t(float a, const vector<float> &x,
             vector<float> &y, int nthreads,
             int thr_id) {
  int n = x.size();
  int start = thr_id*n/nthreads;
  int end = min((thr_id+1)*n/nthreads, n);
  for (int i = start; i < end; i++)
    y[i] = a*x[i] + y[i];
}
int main(…)
…
vector<thread> tarr;
for (int i = 0; i < nthreads; i++){
tarr.push_back(thread(saxpy_t, a, ref(x),
ref(y), nthreads, i));
}
// Wait for threads to finish
for (auto & t : tarr){
t.join();
}
Load imbalance?
What happens when the iterations have
different costs?
void map_serial(
    float a,                      // scale factor
    const std::vector<float> &x,  // input vec
    std::vector<float> &y )       // output and input vec
{
  int n = x.size();
  for (int i = 0; i < n; i++)
    y[i] = map(i);                // cost of map may vary with i
}
Explicit threading with
load imbalance
• Atomic update of the index variable
• Fine granularity of load balancing
• Overhead when multiple threads want
to update the index at once
• Declaring saxpy_index atomic
guarantees no data races on it
• saxpy_index.fetch_add(1) atomically
adds 1 and returns the old value
std::atomic<int> saxpy_index {0};  // fetch_add returns the old value, so start at 0
void saxpy_t(float a, vector<float> &x,
             vector<float> &y) {
  int n = x.size();
  int i = saxpy_index.fetch_add(1);
  while (i < n) {
    y[i] = map(i, a*x[i] + y[i]);
    i = saxpy_index.fetch_add(1);
  }
}
int main(…)
…
vector<thread> tarr;
for (int i = 0; i < num_threads; i++){
  tarr.push_back(thread(saxpy_t, a,
                        ref(x), ref(y)));
}
// Wait for the threads to finish
for (auto & t : tarr){
  t.join();
}
Load imbalance with chunks
• The shared index variable might
become a bottleneck
• Use CHUNKS so that each thread
works on a range of indices per
atomic update
• Declaring saxpy_index atomic
guarantees no data races on it
std::atomic<int> saxpy_index {0};  // fetch_add returns the old value, so start at 0
void saxpy_t(float a, const std::vector<float> &x,
             std::vector<float> &y) {
  int n = x.size();
  int c = saxpy_index.fetch_add(CHUNK);
  while (c < n) {
    for (int i = c; i < min(n, c+CHUNK); i++)
      y[i] = map(i, a*x[i] + y[i]);
    c = saxpy_index.fetch_add(CHUNK);
  }
}
Sequence of maps vs Map of Sequence
• Also called: Code fusion
• Do this whenever possible!
• Increases the arithmetic intensity
• Less data to load and store
• Explicit changes needed
• Make sure consecutive elemental
functions pass intermediate results in
registers rather than through memory,
either via compiler optimizations or by
design (see the sketch below)
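For example, two maps in sequence versus the fused map of the composed function (a sketch; f and g are hypothetical elemental functions):

#include <cstddef>
#include <vector>

float f(float);  // hypothetical elemental functions
float g(float);

// Sequence of maps: every pass loads and stores the whole vector.
void two_maps(const std::vector<float> &x, std::vector<float> &y) {
  std::vector<float> t(x.size());
  for (std::size_t i = 0; i < x.size(); i++) t[i] = f(x[i]);
  for (std::size_t i = 0; i < x.size(); i++) y[i] = g(t[i]);
}

// Map of the sequence (fused): one pass; f(x[i]) stays in a register
// and the intermediate vector t disappears altogether.
void fused_map(const std::vector<float> &x, std::vector<float> &y) {
  for (std::size_t i = 0; i < x.size(); i++) y[i] = g(f(x[i]));
}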
Cache fusion optimization
• Almost as important as code
fusion
• Break down maps to sequences
of smaller maps, executed by
each thread
• Keep the aggregate data small
enough to fit in cache (see the sketch below)
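A sketch of the idea, reusing the hypothetical f and g from above: run the whole sequence of small maps on one cache-sized block before moving to the next, so the intermediate t is still in cache when g reads it (BLOCK is an assumed cache-sized chunk):

#include <algorithm>
#include <vector>

float f(float);  // the same hypothetical elemental functions as above
float g(float);

void blocked_maps(const std::vector<float> &x, std::vector<float> &y) {
  const int BLOCK = 4096;   // assumption: a block of x, t and y fits in cache
  const int n = x.size();
  std::vector<float> t(n);
  for (int b = 0; b < n; b += BLOCK) {
    int end = std::min(n, b + BLOCK);
    for (int i = b; i < end; i++) t[i] = f(x[i]);  // first small map
    for (int i = b; i < end; i++) y[i] = g(t[i]);  // second small map: t[b..end) still cached
  }
}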
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
Higher abstraction models
• OpenMP
void saxpy_par_openmp(float a,
const vector<float> & x,
vector<float> & y) {
auto n = x.size();
#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
  y[i] = a * x[i] + y[i];
}
}
• TBB
auto n = x.size();
tbb::parallel_for(size_t(0), n,
[&]( size_t i ) {
y[i] = a * x[i] + y[i];
});
• Parallel STL (C++17)
std::transform(std::execution::par,
               x.begin(),
x.end(),
y.begin(),
y.begin(),
[=](float x, float y){
return a*x + y;
});
OpenMP support for map: for loops
• Threads are assigned independent sets of iterations
• Work-sharing construct
[Figure: #pragma omp parallel forks a team of threads; the work-sharing for construct distributes iterations i=0..15 among them; an implicit barrier follows the loop.]
#pragma omp parallel
#pragma omp for
for (i=0; i < 16; i++)
  c[i] = b[i] + a[i];
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
The Reduce Pattern
• Combining the elements of a
collection of data into a single
value
• A combiner function is used to
combine elements pairwise
• The combiner function must be
associative for parallelism
• 𝑎 ⊗ (𝑏 ⊗ 𝑐) = (𝑎 ⊗ 𝑏) ⊗ 𝑐
Serial reduction
Example: Dot-product
float sdot(const vector<float> &x, const vector<float> &y){
  float sum = 0.0;
  for (size_t i = 0; i < x.size(); i++)
    sum += x[i]*y[i];
  return sum;
}
Note that this is a fusion of a map (vector element product) and the reduce (sum).
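With the C++17 Parallel STL, the same fused map-reduce is a single call (a sketch; std::transform_reduce uses multiplication as the map and addition as the reduction by default):

#include <execution>
#include <numeric>
#include <vector>

float sdot_par(const std::vector<float> &x, const std::vector<float> &y) {
  // Fuses the element-wise product (map) with the sum (reduce).
  return std::transform_reduce(std::execution::par,
                               x.begin(), x.end(), y.begin(), 0.0f);
}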
Implementation of parallel reduction
• Simple approach:
• Let each thread perform the reduction on its part of the data
• Let one (master) thread combine the partial results into a scalar value
[Figure: threads 0–3 each perform a local reduction in parallel; the master thread then reduces the partial results to a scalar value.]
Awkward way of returning results from a
thread: dot-product example
Plain C/C++:
void sprod(const vector<float> &a,
           const vector<float> &b,
           int start,
           int end,
           float &sum) {
  float lsum = 0.0;
  for (int i=start; i < end; i++)
    lsum += a[i] * b[i];
  sum = lsum;  // write the local result back through the reference
}
#include <thread>
using namespace std;
…
vector<float> sum_array(nthr, 0.0);
vector<thread> t_arr;
for (int i = 0; i < nthr; i++) {
  int start = i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i==nthr-1)
    end = size;
  t_arr.push_back(thread(
      sprod, ref(a), ref(b),
      start, end,
      ref(sum_array[i])));
}
for (int i = 0; i < nthr; i++){
  t_arr[i].join();
  sum += sum_array[i];
}
The Async function using futures
Plain C/C++:
float sprod(const vector<float> &a,
            const vector<float> &b,
            int start,
            int end) {
  float lsum = 0.0;
  for (int i=start; i < end; i++)
    lsum += a[i] * b[i];
  return lsum;
}
#include <thread>
#include <future>
using namespace std;
…
vector<future<float>> f_arr(nthr);
for (int i = 0; i < nthr; i++) {
  int start = i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i==nthr-1)
    end = size;
  f_arr[i] = async(launch::async,
                   sprod, cref(a), cref(b),
                   start, end);
}
for (int i = 0; i < nthr; i++){
  sum += f_arr[i].get();
}
Definition of async
• The template function async runs the function f asynchronously
(potentially in a separate thread) and returns a std::future that will
eventually hold the result of that function call.
• The launch::async argument makes the function run on a separate
thread (which could be held in a thread pool, or created for this call)
Example: Numerical Integration
Mathematically, we know that:
∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles:
Σᵢ₌₀ᴺ F(xᵢ)·Δx ≈ π
where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.
[Figure: plot of F(x) = 4/(1+x²) on [0, 1]; the y axis runs from 0.0 to 4.0 and the area under the curve is approximated by rectangles.]
Serial PI Program
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0;i< num_steps; i++){
    x = (i+0.5)*step;          // the map: evaluate F at the midpoint
    sum = sum + 4.0/(1.0+x*x); // the reduction: accumulate the sum
  }
  pi = step * sum;
}
Map reduce in OpenMP
static long num_steps = 100000;
double step;
int main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for reduction(+:sum) private(x)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
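For comparison, a sketch of the same computation with TBB's parallel_reduce (assuming the classic tbb/parallel_reduce.h and tbb/blocked_range.h headers are available):

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>

double pi_tbb(long num_steps) {
  double step = 1.0 / (double)num_steps;
  double sum = tbb::parallel_reduce(
      tbb::blocked_range<long>(0, num_steps),  // iteration space, split into chunks
      0.0,                                     // identity of the reduction
      [=](const tbb::blocked_range<long> &r, double local) {
        for (long i = r.begin(); i != r.end(); i++) {
          double x = (i + 0.5) * step;
          local += 4.0 / (1.0 + x * x);        // the map, fused into the local reduction
        }
        return local;
      },
      std::plus<double>());                    // combiner for partial results
  return step * sum;
}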
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
Main challenges in writing parallel software
• Difficult to write composable parallel software
• The parallel models of different languages do not work well together
• Poor resource management
• Difficult to write portable parallel software
Make Tasks a First Class Citizen
• Separation of concerns
• Concentrate on exposing parallelism
• Not how it is mapped onto hardware
[Figure: the programmer exposes tasks; a run-time scheduler maps them onto the hardware.]
An example of task-parallelism
• The (naïve) sequential Fibonacci calculation
int fib(int n){
  if( n<2 ) return n;
  else {
    int a,b;
    a = fib(n-1);
    b = fib(n-2);
    return b+a;
  }
}
Parallelism in fib:
• The two calls to fib are independent and can
be computed in any order and in parallel
• It helps that fib is side-effect free, but
disjoint side effects are also OK
The need for synchronization:
• The return statement must be executed
after both recursive calls have been
completed because of data-dependence on
a and b.
A task-parallel fib in OpenMP 3+
int fib(int n){
  if( n<2 ) return n;
  else {
    int a,b;
    #pragma omp task shared(a)
    a = fib(n-1);
    #pragma omp task shared(b)
    b = fib(n-2);
    #pragma omp taskwait
    return b+a;
  }
}
Starting code:
...
#pragma omp parallel
#pragma omp single
fib(n);
...
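For contrast, the same task structure can be expressed with std::async from C++11 (a sketch; most std::async implementations spawn a thread per call instead of scheduling tasks on a work-stealing pool, so this becomes expensive very quickly):

#include <future>

int fib_async(int n) {
  if (n < 2) return n;
  // Spawn one branch asynchronously; compute the other in the calling thread.
  auto fa = std::async(std::launch::async, fib_async, n - 1);
  int b = fib_async(n - 2);
  return fa.get() + b;   // get() plays the role of the taskwait
}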
Work-stealing schedulers
• Cores work on tasks in their
own queue
• Generated tasks are put in local
queue
• If empty, select a random
queue to steal work from
• Steal from:
• Continuation, or
• Generated tasks
[Figure: four cores, each with its own task queue; idle cores steal from the others.]
Task-centric parallel models
Gaining momentum
• C/C++:
• OpenMP, Cilk Plus, TBB, GCD...
• C#:
• Microsoft TPL
• Java:
• fork/join
• Erlang:
• processes
• X10:
• activities
• Etc…
Task model benefits
• Automatic load-balancing through work-stealing
• Serial semantics => debug in serial mode
• Composable parallelism
• Parallel libraries can be called from parallel code
• Can be mixed with data-parallelism
• SIMD/Vector instructions
• Data-parallel workers can also be tasks
• Adapts naturally to
• Different number of cores, even in run-time
• Different speeds of cores (e.g. ARM big.LITTLE)
Greatest thing since sliced bread?
• Overheads of task creation are still too large to ignore
• Tasks need high enough arithmetic intensity to amortize the cost of
creation and scheduling
• Different models do not use the same run-time
• You can’t have a task in TBB calling a function creating tasks written in
OpenMP
• Still no accepted way to target different ISAs
• Research is ongoing
• The operating system does not know about tasks
• Current OSs only schedule threads
Heterogeneous processing
• With the same ISA, performance-heterogeneous processing is transparent
• Real heterogeneity is challenging
OpenMP 4.0: extending OpenMP with
dependence annotations
sort(A);
sort(B);
sort(C);
sort(D);
merge(A,B,E);
merge(C,D,F);
[Figure: task dependence graph — the four sorts of A, B, C, D feed the merges of (A, B) and (C, D).]
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task
sort(A);
#pragma omp task
sort(B);
#pragma omp task
sort(C);
#pragma omp task
sort(D);
#pragma omp taskwait
#pragma omp task
merge(A,B,E);
#pragma omp task
merge(C,D,F);
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task depend(inout:A)
sort(A);
#pragma omp task depend(inout:B)
sort(B);
#pragma omp task depend(inout:C)
sort(C);
#pragma omp task depend(inout:D)
sort(D);
// taskwait not needed
#pragma omp task depend(in:A,B, out:E)
merge(A,B,E);
#pragma omp task depend(in:C,D, out:F)
merge(C,D,F);
Benefits of tasks in OpenMP 4.0
• More parallelism can be exposed
• Complex synchronization patterns can be avoided/automated
• When a task's memory usage/footprint is known, offloading to accelerators
can be made almost transparent to the user
- in terms of both memory handling and execution!
Writing heterogeneous code
GPU:
#pragma omp task device(gpu)
implements(inc_arr)
void cuda_inc_arr(int *A, int *B) {
cuda_inc_array_kernel <<<4,256>>>(A,B);
}
TilePRO64:
#pragma omp task device(tilera) implements(inc_arr)
void tilera_inc_arr(int *A,int *B) {
#pragma omp parallel
{
int i = omp_get_thread_num();
B[i] += A[i];
}
}
Task spawn:
#pragma omp task depend(in:A, out:B) target(tilera,gpu,host)
inc_arr(&A[0], &B[0]);
Conclusions
• Parallelism should be exploited
at all levels
• Use higher abstraction models
and compiler support
• Vector instructions
• STL Parallel Algorithms
• TBB / OpenMP
• Only then, use threads
• Measure performance
bottlenecks with profilers
• Beware of
• Granularity
• False sharing