2. Target Audience
People who are interested in parallel programming
People who want to know how to improve the performance of their code
People who want to know how to get the (possible) peak performance out of their computer
(There are many techniques / methods for reaching peak performance, and they are
outside the scope of our discussion)
People who want to know how I use my computers / servers (X
3. Outline
Why parallel programming?
What is parallel programming?
How to perform parallel programming (in C++ / Matlab / C#)
Conclusion / Further Discussions
4. Why Parallel Programming?
Consider the following C++ code. What is the output?
#include <iostream>
#include <ranges>
#include <vector>

int main()
{
    std::vector<int> test_vector = {1, 2, 3};
    int a = 3;
    for(int i = 0; i < std::ranges::size(test_vector); ++i)
    {
        test_vector[i] = test_vector[i] + a;
    }
    for(int i = 0; i < std::ranges::size(test_vector); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
}
5. Why Parallel Programming?
Answer to the question “What is the output of the following C++ code?”
// https://godbolt.org/z/fb7TdT495
// https://godbolt.org/z/3Kj1azb4h
Output of the code from the previous slide:
4, 5, 6,
6. Why Parallel Programming?
Code Structure
int main()
{
    // Part 1: Variable Initialization
    std::vector<int> test_vector = {1, 2, 3};
    int a = 3;

    // Part 2: Data Processing / Calculation
    for(int i = 0; i < std::ranges::size(test_vector); ++i)
    {
        test_vector[i] = test_vector[i] + a;
    }

    // Part 3: Output
    for(int i = 0; i < std::ranges::size(test_vector); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
}
9. Why Parallel Programming?
In the simple example above, the calculation part is a simple add operation:
test_vector[i] = test_vector[i] + a;
Output:
4, 5, 6,
10. Why Parallel Programming?
What happens with a more complicated operation?
int main()
{
    std::vector<Frame> image_frames = {img1, img2, img3};
    auto results = std::vector<Features>(3);
    for(int i = 0; i < std::ranges::size(image_frames); ++i)
    {
        results[i] = feature_extraction(image_frames[i]);
    }
    …
    return 0;
}
This example calls a function named “feature_extraction” once per image frame.
11. Why Parallel Programming?
Without Parallel Programming (one dish after another):
Dish A
Dish B
Dish C
…
With Parallel Programming (all dishes at the same time):
Dish A | Dish B | Dish C
Icon is from https://www.hiclipart.com/free-transparent-background-png-clipart-iuxpq/download
12. Why Parallel Programming?
Without Parallel Programming (one task after another):
Task A
Task B
Task C
…
With Parallel Programming (all tasks at the same time):
Task A | Task B | Task C
Icon is from https://www.flaticon.com/free-icon/cpu_1250593
13. The Steps of Execution
Let’s review the previous simple example. How is the program executed?
int main()
{
    std::vector<int> test_vector = {1, 2, 3};
    int a = 3;
    for(int i = 0; i < std::ranges::size(test_vector); ++i)
    {
        test_vector[i] = test_vector[i] + a;
    }
    for(int i = 0; i < std::ranges::size(test_vector); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
}
First, test_vector is created:
test_vector: 1 2 3
Next, a is created:
a: 3
Then the first loop runs sequentially, one iteration at a time:
i = 0: test_vector[0] = 1 + 3 = 4, so test_vector: 4 2 3
i = 1: test_vector[1] = 2 + 3 = 5, so test_vector: 4 5 3
i = 2: test_vector[2] = 3 + 3 = 6, so test_vector: 4 5 6
21. The Concept of Parallelization
Besides running sequentially, is it possible to speed the program up?
Why not make the program run in parallel (let the operations run simultaneously)?
test_vector:       1   2   3
a:                 3   3   3
New test_vector:   4   5   6
How can this be done in our program? Solution: Parallel Programming!
23. The Concept of Parallelization
Enabling parallelization
Tools in C++:
- OpenMP
- TBB(Threading Building Blocks)
- std::thread
- Execution Policy in STL
Tools in Matlab
Tools in C#
24. Parallelization Implementation
Enabling parallelization with OpenMP
#include <omp.h>
#include <iostream>
#include <vector>

int main()
{
    auto test_vector = std::vector<int>{1, 2, 3};
    int a = 3;
    #pragma omp parallel for
    for(int i = 0; i < test_vector.size(); i++)
    {
        test_vector[i] = test_vector[i] + a;
    }
    for(int i = 0; i < test_vector.size(); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
}
Conceptually, the loop iterations become independent statements that can run at the same time:
    auto test_vector = std::vector<int>{1, 2, 3};
    int a = 3;
    // The following three statements may run simultaneously.
    test_vector[0] = test_vector[0] + a;
    test_vector[1] = test_vector[1] + a;
    test_vector[2] = test_vector[2] + a;
    for(int i = 0; i < test_vector.size(); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
25. Parallelization Implementation
Enabling parallelization with OpenMP / TBB
OpenMP version:
// https://godbolt.org/z/szMc4jbqn
// https://godbolt.org/z/haM1qd6eY
#include <omp.h>
#include <iostream>
#include <vector>

int main()
{
    auto test_vector = std::vector<int>{1, 2, 3};
    int a = 3;
    #pragma omp parallel for
    for(int i = 0; i < test_vector.size(); ++i)
    {
        test_vector[i] = test_vector[i] + a;
    }
    for(int i = 0; i < test_vector.size(); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
}
TBB version:
// https://godbolt.org/z/dcssoWj8K
#include <tbb/parallel_for.h>
#include <iostream>
#include <vector>

int main()
{
    auto test_vector = std::vector<int>{1, 2, 3};
    int a = 3;
    tbb::parallel_for( tbb::blocked_range<int>(0, test_vector.size()),
        [&](const tbb::blocked_range<int>& r)   // A lambda function is here!
        {
            for (int i = r.begin(); i < r.end(); ++i)
            {
                test_vector[i] = test_vector[i] + a;
            }
        });
    for(int i = 0; i < test_vector.size(); ++i)
    {
        std::cout << test_vector[i] << ", ";
    }
    return 0;
}
27. Parallelization Methods Comparison
Comparing OpenMP / TBB and std::thread
// The following code is an example of std::thread
// https://stackoverflow.com/a/11229853/6667035
// https://godbolt.org/z/YeY9d4EeP
#include <string>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// The function we want to execute on the new thread.
void task1(std::vector<int> input)
{
    for(int i = 0; i < input.size(); i++)
    {
        std::cout << "output from task1 function: " << input[i];
    }
}

// Runs on the main thread while task1 runs on the new thread.
void function1()
{
    auto test_vector1 = std::vector<int>(100);
    std::iota(test_vector1.begin(), test_vector1.end(), 1);
    int sum = 0;
    for(int i = 0; i < test_vector1.size(); i++)
    {
        sum += test_vector1[i];
    }
    std::cout << sum << "\n";
}

int main()
{
    auto test_vector = std::vector<int>(100);
    std::iota(test_vector.begin(), test_vector.end(), 1);
    std::thread t1(task1, test_vector);   // start task1 on a new thread (the vector is copied)
    function1();                          // meanwhile, keep working on the main thread
    t1.join();                            // wait for task1 to finish
    return 0;
}
31. Parallelization in Matlab
parfor function usage example
Source: https://www.mathworks.com/help/parallel-computing/parfor.html
Program without parfor:
tic
n = 200;
A = 500;
a = zeros(1,n);
for i = 1:n
    a(i) = max(abs(eig(rand(A))));
end
toc
Elapsed time is 31.935373 seconds.
Program with parfor:
tic
n = 200;
A = 500;
a = zeros(1,n);
parfor i = 1:n
    a(i) = max(abs(eig(rand(A))));
end
toc
Elapsed time is 10.760068 seconds.
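As a rough reading of the two timings above, the speedup is the sequential time divided by the parallel time. This assumes both runs were made on the same machine under comparable conditions; the number of parallel workers is not stated here, so nothing can be said about efficiency per worker.

```latex
S = \frac{T_{\text{for}}}{T_{\text{parfor}}} = \frac{31.935373\ \text{s}}{10.760068\ \text{s}} \approx 2.97
```

In other words, the parfor version finished about three times faster in this particular run.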
32. Parallelization in C#
Documentation on Parallel.For usage: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/data-parallelism-task-parallel-library
33. Parallelization in C#
Parallel.For function usage example
Program without Parallel.For:
using System;
using System.Threading.Tasks;
public class ParallelTest
{
    public static void Main(string[] args)
    {
        for(int i = 0; i < 10; i++)
        {
            Console.WriteLine(i + "\n");
        }
    }
}
Program with Parallel.For:
using System;
using System.Threading.Tasks;
public class ParallelTest
{
    public static void Main(string[] args)
    {
        Parallel.For(0, 10, i =>   // A lambda function is here!
        {
            Console.WriteLine(i + "\n");
        }); // Parallel.For
    }
}
36. Concept of Parallelizability
What is the limitation of parallelization?
Answer: The operations to be parallelized must be independent of each other!
What does “independent” mean?
Let’s look at a dependent case first:
A → B → C
Operations A, B and C cannot be run in parallel,
because operation B needs the output of A and
operation C needs the output of B!
39. Conclusion / Further Discussions
Parallelization can improve performance when it is used properly
Parallelization can achieve higher utilization of computers / computing devices
Is there any disadvantage to using parallelization?
Memory usage issue