Options and trade-offs for
parallelism and concurrency in
Modern C++
Mats Brorsson
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• Conclusions
Parallelism everywhere
• Server level parallelism
• Distributed memory
• Multicore architectures
• Shared memory
• Instruction-level parallelism
• Vector parallelism
• Thread parallelism
• Hardware vs software threads
• Simultaneous multithreading
• ”Switch-on-event” multithreading
Vector vs Thread parallelism
• Vector parallelism maps naturally to
Regular Data Parallelism
• Inner loops can (sometimes) be
vectorized
• Ways to vectorize:
• Auto-vectorization
• Vector Intrinsics
• _mm_add_ps(__m128 x, __m128 y)
• Compiler hints
• Cilk Plus array notation
• #pragma omp simd
Images courtesy Intel and Rebel Science News
Key features for Performance
• Data locality
• Chunks that fit in cache
• Reuse data locally
• Avoid cache conflicts
• Use few virtual pages
• Avoid false sharing (see the sketch after this list)
• Parallel slack
• Specify potential parallelism much
higher than the actual parallelism
• Load balance
• All threads have the same amount of
work to do
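When several threads update adjacent elements of one shared array, the elements share cache lines and every update invalidates the others' copies. A minimal sketch of the usual fix, padding per-thread data to the cache-line size (NUM_THREADS and the 64-byte line size are assumptions):

constexpr int NUM_THREADS = 8;      // hypothetical thread count
struct alignas(64) PaddedCounter {  // 64 bytes: typical cache-line size
    long value = 0;
};
// Each per-thread counter now occupies its own cache line, so updates
// by different threads no longer invalidate each other's caches.
PaddedCounter counters[NUM_THREADS];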
Example: SAXPY, scaling of a vector
• SAXPY computes y ← a·x + y: it scales a vector x by a scalar factor a and adds the vector y
• y is used for both input and output
• Single-precision floating point (DAXPY: double precision)
• Low arithmetic intensity
• Little arithmetic work compared to the amount of data consumed and
produced
• Per element: 2 FLOPs, 8 bytes read and 4 bytes written
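As a baseline, the serial version is a single loop (a minimal sketch, using the same vector types as the threaded examples that follow):

#include <vector>
using std::vector;

// Serial SAXPY: y[i] = a*x[i] + y[i] for every element.
void saxpy_serial(float a, const vector<float> &x, vector<float> &y) {
    for (std::size_t i = 0; i < x.size(); i++)
        y[i] = a * x[i] + y[i];
}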
The Map Pattern
• Applies a function to every
element of a collection of data
items
• Elemental function
• No side effects
• Embarrassingly parallel
• Often combined with collective
patterns
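In code, the pattern is just a loop applying an elemental function f with no side effects (a generic sketch):

#include <cstddef>
#include <vector>

// Map: apply the side-effect-free elemental function f to every
// element of in, producing a new collection out.
template <typename T, typename F>
std::vector<T> apply_map(const std::vector<T> &in, F f) {
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = f(in[i]);
    return out;
}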
Crash course C++11 thread programming
Compile: g++ -std=c++11 -O -lpthread -o hello-threads hello-threads.cc
#include <thread> at top of file
std::thread t; Declare a thread; acts as a thread handle
t = std::thread(foo, a1, a2); Start a new thread running function foo with arguments a1 and a2
t.join(); Join with thread t: wait for the thread with handle t to finish
std::mutex m; Declare a mutual-exclusion lock
m.lock(); Enter the critical section protected by m
m.unlock(); Leave the critical section
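Putting these pieces together, a minimal hello-threads.cc matching the compile line above (the thread function and its message are assumptions):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;

void hello(int id) {
    m.lock();                        // enter critical section
    std::cout << "Hello from thread " << id << "\n";
    m.unlock();                      // leave critical section
}

int main() {
    std::thread t(hello, 0);         // start a new thread running hello(0)
    hello(1);                        // the main thread does the same work
    t.join();                        // wait for t to finish
    return 0;
}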
Explicit threading on
multicores (in C++11)
• Create one thread per core
• Divide the work manually
• Substantial amount of extra code over the serial version
• Inflexible scheduling of work
void saxpy_t(float a, const vector<float> &x,
             vector<float> &y, int nthreads,
             int thr_id) {
  int n = x.size();
  int start = thr_id*n/nthreads;
  int end = min((thr_id+1)*n/nthreads, n);
  for (int i = start; i < end; i++)
    y[i] = a*x[i] + y[i];
}
int main(…)
…
vector<thread> tarr;
for (int i = 0; i < nthreads; i++){
tarr.push_back(thread(saxpy_t, a, ref(x),
ref(y), nthreads, i));
}
// Wait for threads to finish
for (auto & t : tarr){
t.join();
}
Load imbalance?
What happens when the iterations have
different costs?
void map_serial(
    float a,                      // scale factor
    const std::vector<float> &x,  // input vec
    std::vector<float> &y )       // output and input vec
{
  int n = x.size();
  for (int i = 0; i < n; i++)
    y[i] = map(i);                // cost of map may vary with i
}
Explicit threading with
load imbalance
• Atomic update of the index variable
• Fine granularity of load balancing
• Overhead when multiple threads want
to update the index at once
• Declaring saxpy_index atomic
guarantees no data races on it
• saxpy_index.fetch_add(1) atomically
adds 1 and returns the old value
std::atomic<int> saxpy_index {0};  // fetch_add returns the old value, so start at 0
void saxpy_t(float a, vector<float> &x,
             vector<float> &y) {
  int n = x.size();
  int i = saxpy_index.fetch_add(1);
  while (i < n) {
    y[i] = map(i, a*x[i] + y[i]);
    i = saxpy_index.fetch_add(1);
  }
}
int main(…)
…
vector<thread> tarr;
for (int i = 0; i < num_threads; i++){
  tarr.push_back(thread(saxpy_t, a,
                        ref(x), ref(y)));
}
// Wait for the threads to finish
for (auto & t : tarr){
  t.join();
}
Load imbalance with chunks
• The shared index variable might
become a bottleneck
• Use CHUNKS so that each thread
works on a range of indices per
atomic update
• Declaring saxpy_index atomic
guarantees no data races on it
std::atomic<int> saxpy_index {0};  // fetch_add returns the old value, so start at 0
void saxpy_t(float a, const std::vector<float> &x,
             std::vector<float> &y) {
  int n = x.size();
  int c = saxpy_index.fetch_add(CHUNK);
  while (c < n) {
    for (int i = c; i < min(n, c+CHUNK); i++)
      y[i] = map(i, a*x[i] + y[i]);
    c = saxpy_index.fetch_add(CHUNK);
  }
}
Sequence of maps vs Map of Sequence
• Also called: Code fusion
• Do this whenever possible!
• Increases the arithmetic intensity
• Less data to load and store
• Explicit changes needed
• Make sure consecutive elemental
functions pass intermediate results in
registers rather than through memory,
either via compiler optimizations or by
design (see the sketch below)
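For example, two maps in sequence versus the fused map of the composed function (a sketch; f and g are hypothetical elemental functions):

#include <cstddef>
#include <vector>

float f(float);  // hypothetical elemental functions
float g(float);

// Sequence of maps: every pass loads and stores the whole vector.
void two_maps(const std::vector<float> &x, std::vector<float> &y) {
  std::vector<float> t(x.size());
  for (std::size_t i = 0; i < x.size(); i++) t[i] = f(x[i]);
  for (std::size_t i = 0; i < x.size(); i++) y[i] = g(t[i]);
}

// Map of the sequence (fused): one pass; f(x[i]) stays in a register
// and the intermediate vector t disappears altogether.
void fused_map(const std::vector<float> &x, std::vector<float> &y) {
  for (std::size_t i = 0; i < x.size(); i++) y[i] = g(f(x[i]));
}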
Cache fusion optimization
• Almost as important as code
fusion
• Break down maps to sequences
of smaller maps, executed by
each thread
• Keep the aggregate data small
enough to fit in cache (see the sketch below)
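A sketch of the idea, reusing the hypothetical f and g from above: run the whole sequence of small maps on one cache-sized block before moving to the next, so the intermediate t is still in cache when g reads it (BLOCK is an assumed cache-sized chunk):

#include <algorithm>
#include <vector>

float f(float);  // the same hypothetical elemental functions as above
float g(float);

void blocked_maps(const std::vector<float> &x, std::vector<float> &y) {
  const int BLOCK = 4096;   // assumption: a block of x, t and y fits in cache
  const int n = x.size();
  std::vector<float> t(n);
  for (int b = 0; b < n; b += BLOCK) {
    int end = std::min(n, b + BLOCK);
    for (int i = b; i < end; i++) t[i] = f(x[i]);  // first small map
    for (int i = b; i < end; i++) y[i] = g(t[i]);  // second small map: t[b..end) still cached
  }
}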
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
Higher abstraction models
• OpenMP
void saxpy_par_openmp(float a,
const vector<float> & x,
vector<float> & y) {
auto n = x.size();
#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
  y[i] = a * x[i] + y[i];
}
}
• TBB
auto n = x.size();
tbb::parallel_for(size_t(0), n,
[&]( size_t i ) {
y[i] = a * x[i] + y[i];
});
• Parallel STL (C++17)
std::transform(std::execution::par,
               x.begin(),
x.end(),
y.begin(),
y.begin(),
[=](float x, float y){
return a*x + y;
});
OpenMP support for map: for loops
• Threads are assigned independent sets of iterations
• Work-sharing construct
[Figure: #pragma omp parallel forks a team of threads; the work-sharing for construct distributes iterations i=0..15 among them; an implicit barrier follows the loop.]
#pragma omp parallel
#pragma omp for
for (i=0; i < 16; i++)
  c[i] = b[i] + a[i];
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
The Reduce Pattern
• Combining the elements of a
collection of data into a single
value
• A combiner function is used to
combine elements pairwise
• The combiner function must be
associative for parallelism
• 𝑎 ⊗ (𝑏 ⊗ 𝑐) = (𝑎 ⊗ 𝑏) ⊗ 𝑐
Serial reduction
Example: Dot-product
float sdot(const vector<float> &x, const vector<float> &y){
  float sum = 0.0;
  for (size_t i = 0; i < x.size(); i++)
    sum += x[i]*y[i];
  return sum;
}
Note that this is a fusion of a map (vector element product) and the reduce (sum).
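With the C++17 Parallel STL, the same fused map-reduce is a single call (a sketch; std::transform_reduce uses multiplication as the map and addition as the reduction by default):

#include <execution>
#include <numeric>
#include <vector>

float sdot_par(const std::vector<float> &x, const std::vector<float> &y) {
  // Fuses the element-wise product (map) with the sum (reduce).
  return std::transform_reduce(std::execution::par,
                               x.begin(), x.end(), y.begin(), 0.0f);
}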
Implementation of parallel reduction
• Simple approach:
• Let each thread perform the reduction on its part of the data
• Let one (master) thread combine the partial results into a scalar value
[Figure: threads 0–3 each perform a local reduction in parallel; the master thread then reduces the partial results to a scalar value.]
Awkward way of returning results from a
thread: dot-product example
Plain C/C++:
void sprod(const vector<float> &a,
           const vector<float> &b,
           int start,
           int end,
           float &sum) {
  float lsum = 0.0;
  for (int i=start; i < end; i++)
    lsum += a[i] * b[i];
  sum = lsum;  // write the local result back through the reference
}
#include <thread>
using namespace std;
…
vector<float> sum_array(nthr, 0.0);
vector<thread> t_arr;
for (int i = 0; i < nthr; i++) {
  int start = i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i==nthr-1)
    end = size;
  t_arr.push_back(thread(
      sprod, ref(a), ref(b),
      start, end,
      ref(sum_array[i])));
}
for (int i = 0; i < nthr; i++){
  t_arr[i].join();
  sum += sum_array[i];
}
The Async function using futures
Plain C/C++:
float sprod(const vector<float> &a,
            const vector<float> &b,
            int start,
            int end) {
  float lsum = 0.0;
  for (int i=start; i < end; i++)
    lsum += a[i] * b[i];
  return lsum;
}
#include <thread>
#include <future>
using namespace std;
…
vector<future<float>> f_arr(nthr);
for (int i = 0; i < nthr; i++) {
  int start = i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i==nthr-1)
    end = size;
  f_arr[i] = async(launch::async,
                   sprod, cref(a), cref(b),
                   start, end);
}
for (int i = 0; i < nthr; i++){
  sum += f_arr[i].get();
}
Definition of async
• The template function async runs the function f asynchronously
(potentially in a separate thread) and returns a std::future that will
eventually hold the result of that function call.
• The launch::async argument makes the function run on a separate
thread (which could be held in a thread pool, or created for this call)
Example: Numerical Integration
Mathematically, we know that:
∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles:
Σᵢ₌₀ᴺ F(xᵢ)·Δx ≈ π
where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.
[Figure: plot of F(x) = 4/(1+x²) on [0, 1]; the y axis runs from 0.0 to 4.0 and the area under the curve is approximated by rectangles.]
Serial PI Program
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0;i< num_steps; i++){
    x = (i+0.5)*step;          // the map: evaluate F at the midpoint
    sum = sum + 4.0/(1.0+x*x); // the reduction: accumulate the sum
  }
  pi = step * sum;
}
Map reduce in OpenMP
static long num_steps = 100000;
double step;
int main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for reduction(+:sum) private(x)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
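For comparison, a sketch of the same computation with TBB's parallel_reduce (assuming the classic tbb/parallel_reduce.h and tbb/blocked_range.h headers are available):

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>

double pi_tbb(long num_steps) {
  double step = 1.0 / (double)num_steps;
  double sum = tbb::parallel_reduce(
      tbb::blocked_range<long>(0, num_steps),  // iteration space, split into chunks
      0.0,                                     // identity of the reduction
      [=](const tbb::blocked_range<long> &r, double local) {
        for (long i = r.begin(); i != r.end(); i++) {
          double x = (i + 0.5) * step;
          local += 4.0 / (1.0 + x * x);        // the map, fused into the local reduction
        }
        return local;
      },
      std::plus<double>());                    // combiner for partial results
  return step * sum;
}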
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
Main challenges in writing parallel software
• Difficult to write composable parallel software
• The parallel models of different languages do not work well together
• Poor resource management
• Difficult to write portable parallel software
Make Tasks a First Class Citizen
• Separation of concerns
• Concentrate on exposing parallelism
• Not how it is mapped onto hardware
[Figure: the programmer exposes tasks; a run-time scheduler maps them onto the hardware.]
An example of task-parallelism
• The (naïve) sequential Fibonacci calculation
int fib(int n){
  if( n<2 ) return n;
  else {
    int a,b;
    a = fib(n-1);
    b = fib(n-2);
    return b+a;
  }
}
Parallelism in fib:
• The two calls to fib are independent and can
be computed in any order and in parallel
• It helps that fib is side-effect free, but
disjoint side effects are also OK
The need for synchronization:
• The return statement must be executed
after both recursive calls have been
completed because of data-dependence on
a and b.
A task-parallel fib in OpenMP 3+
int fib(int n){
  if( n<2 ) return n;
  else {
    int a,b;
    #pragma omp task shared(a)
    a = fib(n-1);
    #pragma omp task shared(b)
    b = fib(n-2);
    #pragma omp taskwait
    return b+a;
  }
}
Starting code:
...
#pragma omp parallel
#pragma omp single
fib(n);
...
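For contrast, the same task structure can be expressed with std::async from C++11 (a sketch; most std::async implementations spawn a thread per call instead of scheduling tasks on a work-stealing pool, so this becomes expensive very quickly):

#include <future>

int fib_async(int n) {
  if (n < 2) return n;
  // Spawn one branch asynchronously; compute the other in the calling thread.
  auto fa = std::async(std::launch::async, fib_async, n - 1);
  int b = fib_async(n - 2);
  return fa.get() + b;   // get() plays the role of the taskwait
}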
Work-stealing schedulers
• Cores work on tasks in their
own queue
• Generated tasks are put in local
queue
• If empty, select a random
queue to steal work from
• Steal from:
• Continuation, or
• Generated tasks
[Figure: four cores, each with its own task queue; idle cores steal from the others.]
Task-centric parallel models
Gaining momentum
• C/C++:
• OpenMP, Cilk Plus, TBB, GCD...
• C#:
• Microsoft TPL
• Java:
• fork/join
• Erlang:
• processes
• X10:
• activities
• Etc…
Task model benefits
• Automatic load-balancing through work-stealing
• Serial semantics => debug in serial mode
• Composable parallelism
• Parallel libraries can be called from parallel code
• Can be mixed with data-parallelism
• SIMD/Vector instructions
• Data-parallel workers can also be tasks
• Adapts naturally to
• Different number of cores, even in run-time
• Different speeds of cores (e.g. ARM big.LITTLE)
Greatest thing since sliced bread?
• Overheads of task creation are still too large to ignore
• Tasks need high enough arithmetic intensity to amortize the cost of
creation and scheduling
• Different models do not use the same run-time
• You can’t have a task in TBB calling a function creating tasks written in
OpenMP
• Still no accepted way to target different ISAs
• Research is ongoing
• The operating system does not know about tasks
• Current OSs only schedule threads
Heterogeneous processing
• With the same ISA, performance-heterogeneous processing is transparent
• Real heterogeneity is challenging
OpenMP 4.0: extending OpenMP with
dependence annotations
sort(A);
sort(B);
sort(C);
sort(D);
merge(A,B,E);
merge(C,D,F);
[Figure: task dependence graph — the four sorts of A, B, C, D feed the merges of (A, B) and (C, D).]
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task
sort(A);
#pragma omp task
sort(B);
#pragma omp task
sort(C);
#pragma omp task
sort(D);
#pragma omp taskwait
#pragma omp task
merge(A,B,E);
#pragma omp task
merge(C,D,F);
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task depend(inout:A)
sort(A);
#pragma omp task depend(inout:B)
sort(B);
#pragma omp task depend(inout:C)
sort(C);
#pragma omp task depend(inout:D)
sort(D);
// taskwait not needed
#pragma omp task depend(in:A,B, out:E)
merge(A,B,E);
#pragma omp task depend(in:C,D, out:F)
merge(C,D,F);
Benefits of tasks in OpenMP 4.0
• More parallelism can be exposed
• Complex synchronization patterns can be avoided/automated
• When a task's memory usage/footprint is known, offloading to accelerators
can be made almost transparent to the user
- in terms of both memory handling and execution!
Writing heterogeneous code
GPU:
#pragma omp task device(gpu)
implements(inc_arr)
void cuda_inc_arr(int *A, int *B) {
cuda_inc_array_kernel <<<4,256>>>(A,B);
}
TilePRO64:
#pragma omp task device(tilera) implements(inc_arr)
void tilera_inc_arr(int *A,int *B) {
#pragma omp parallel
{
int i = omp_get_thread_num();
B[i] += A[i];
}
}
Task spawn:
#pragma omp task depend(in:A, out:B) target(tilera,gpu,host)
inc_arr(&A[0], &B[0]);
Conclusions
• Parallelism should be exploited
at all levels
• Use higher abstraction models
and compiler support
• Vector instructions
• STL Parallel Algorithms
• TBB / OpenMP
• Only then, use threads
• Measure performance
bottlenecks with profilers
• Beware of
• Granularity
• False sharing