Analysis of Algorithms
Multi-threaded Algorithms
Andres Mendez-Vazquez
April 15, 2016
1 / 94
Outline
1 Introduction
Why Multi-Threaded Algorithms?
2 Model To Be Used
Symmetric Multiprocessor
Operations
Example
3 Computation DAG
Introduction
4 Performance Measures
Introduction
Running Time Classification
5 Parallel Laws
Work and Span Laws
Speedup and Parallelism
Greedy Scheduler
Scheduling Raises the Following Issue
6 Examples
Parallel Fibonacci
Matrix Multiplication
Parallel Merge-Sort
7 Exercises
Some Exercises you can try!!!
2 / 94
Multi-Threaded Algorithms
Motivation
Until now, our serial algorithms are quite suitable for running on a
single processor system.
However, multiprocessor algorithms are ubiquitous:
Therefore, extending our serial models to a parallel computation model
is a must.
Computational Model
There exist many competing models of parallel computation that are
essentially different:
Shared Memory
Message Passing
Etc.
4 / 94
The Model to Be Used
Symmetric Multiprocessor
The model that we will use is the Symmetric Multiprocessor (SMP) where
a shared memory exists.
[Figure: four processors connected through a bus to the main shared memory; each processor contains four CPU cores with private L1i/L1d and L2 caches, a shared L3 cache, and point-to-point links]
6 / 94
Dynamic Multi-Threading
Dynamic Multi-Threading
In reality, it can be difficult to handle multi-threaded programs in an
SMP.
Thus, we will assume a simple concurrency platform that handles all
the resources:
Scheduling
Memory
Etc.
It is called Dynamic Multi-threading.
Dynamic Multi-Threading Computing Operations
Spawn
Sync
Parallel
7 / 94
SPAWN
SPAWN
When spawn precedes a procedure call, the parent procedure may continue to
execute in parallel with the spawned child.
Note
The keyword spawn does not say anything about concurrent
execution, but it can happen.
The Scheduler decides which computations should run concurrently.
9 / 94
SYNC AND PARALLEL
SYNC
The keyword sync indicates that the procedure must wait for all its
spawned children to complete.
PARALLEL
This operation applies to loops, making it possible to execute the body
of the loop in parallel.
10 / 94
A Classic Parallel Piece of Code: Fibonacci Numbers
Fibonacci’s Definition
F0 = 0
F1 = 1
Fi = Fi−1 + Fi−2 for i > 1.
Naive Algorithm
Fibonacci(n)
1 if n ≤ 1 then
2 return n
3 else x = Fibonacci(n − 1)
4 y = Fibonacci(n − 2)
5 return x + y
12 / 94
Time Complexity
Recursion and Complexity
Recursion: T (n) = T (n − 1) + T (n − 2) + Θ (1).
Complexity: T (n) = Θ (Fn) = Θ (φ^n), where φ = (1 + √5)/2.
13 / 94
There is a Better Way
We can order the first three numbers in the sequence as
[ F2 F1 ]   [ 1 1 ]
[ F1 F0 ] = [ 1 0 ]
Then
[ F2 F1 ] [ F2 F1 ]   [ 1 1 ] [ 1 1 ]   [ 2 1 ]   [ F3 F2 ]
[ F1 F0 ] [ F1 F0 ] = [ 1 0 ] [ 1 0 ] = [ 1 1 ] = [ F2 F1 ]
14 / 94
There is a Better Way
Calculating in O(log n) when n is a power of 2
[ 1 1 ]^n   [ F (n + 1)  F (n)     ]
[ 1 0 ]   = [ F (n)      F (n − 1) ]
Thus
[ 1 1 ]^(n/2) [ 1 1 ]^(n/2)   [ F (n/2 + 1)  F (n/2)     ] [ F (n/2 + 1)  F (n/2)     ]
[ 1 0 ]       [ 1 0 ]       = [ F (n/2)      F (n/2 − 1) ] [ F (n/2)      F (n/2 − 1) ]
However...
We will use the naive version to illustrate the principles of parallel
programming.
15 / 94
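For completeness, here is a minimal serial sketch in Python (an added illustration, not part of the original slides) that computes F(n) with Θ(log n) products of 2 × 2 matrices by repeated squaring; it handles any n ≥ 0, not only powers of 2. The parallel discussion below still uses the naive version.

def mat_mult(X, Y):
    # product of two 2x2 matrices stored as tuples of rows
    return ((X[0][0] * Y[0][0] + X[0][1] * Y[1][0], X[0][0] * Y[0][1] + X[0][1] * Y[1][1]),
            (X[1][0] * Y[0][0] + X[1][1] * Y[1][0], X[1][0] * Y[0][1] + X[1][1] * Y[1][1]))

def mat_pow(X, n):
    # repeated squaring: Theta(log n) matrix products
    result = ((1, 0), (0, 1))  # identity
    while n > 0:
        if n % 2 == 1:
            result = mat_mult(result, X)
        X = mat_mult(X, X)
        n //= 2
    return result

def fib(n):
    if n == 0:
        return 0
    # [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]]
    return mat_pow(((1, 1), (1, 0)), n)[0][1]

print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]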
The Concurrent Code
Parallel Algorithm
PFibonacci(n)
1 if n ≤ 1 then
2 return n
3 else x = spawn PFibonacci(n − 1)
4 y = PFibonacci(n − 2)
5 sync
6 return x + y
16 / 94
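As a rough illustration of spawn and sync (an added sketch, not part of the original slides), the Python code below mimics PFibonacci by running the spawned call in a child thread and using join as the sync. Note that CPython threads do not execute CPU-bound work truly in parallel because of the global interpreter lock, so this only shows the control structure, not a real speedup.

import threading

def p_fib(n, out, i):
    # writes F(n) into out[i]
    if n <= 1:
        out[i] = n
        return
    sub = [0, 0]
    # "spawn": compute PFibonacci(n - 1) in a child thread
    child = threading.Thread(target=p_fib, args=(n - 1, sub, 0))
    child.start()
    # the parent continues with PFibonacci(n - 2)
    p_fib(n - 2, sub, 1)
    # "sync": wait for the spawned child before combining the results
    child.join()
    out[i] = sub[0] + sub[1]

result = [0]
p_fib(10, result, 0)
print(result[0])  # 55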
How do we compute a complexity? Computation DAG
Definition
A directed acyclic graph G = (V, E) where
The vertices V are sets of instructions.
The edges E represent dependencies between sets of instructions, i.e.
(u, v) ∈ E means instruction set u must execute before v.
Notes
A set of instructions without any parallel control is grouped into a
strand.
Thus, V represents a set of strands and E represents dependencies
between strands induced by parallel control.
A strand of maximal length will be called a thread.
18 / 94
How do we compute a complexity? Computation DAG
Thus
If there is an edge between threads u and v, then they are said to be
(logically) in series.
If there is no edge, then they are said to be (logically) in parallel.
19 / 94
Example: PFibonacci(4)
Example
[Figure: computation DAG of PFibonacci(4), whose vertices are the calls PFibonacci(4), PFibonacci(3), PFibonacci(2) (twice), PFibonacci(1) (three times) and PFibonacci(0) (twice)]
20 / 94
Edge Classification
Continuation Edge
A continuation edge (u, v) connects a thread u to its successor v within
the same procedure instance.
Spawned Edge
When a thread u spawns a new thread v, then (u, v) is called a spawned
edge.
Call Edges
Call edges represent normal procedure calls.
Return Edge
A return edge signals that a thread v returns to its calling procedure.
21 / 94
Example: PFibonacci(4)
The Different Edges
[Figure: computation DAG of PFibonacci(4) with the init thread, the final thread, and the spawn, continuation, call and return edges labeled]
22 / 94
Performance Measures
WORK
The work of a multi-threaded computation is the total time to execute the
entire computation on one processor.
Work = Σ_{i∈I} Time (Thread_i)
SPAN
The span is the longest time to execute the strands along any path of the
DAG.
In a DAG in which each strand takes unit time, the span equals the
number of vertices on a longest or critical path in the DAG.
24 / 94
Example: PFibonacci(4)
Example
[Figure: computation DAG of PFibonacci(4) with its critical path highlighted]
25 / 94
Example
Example
In Fibonacci(4), we have
17 threads.
8 vertices in the longest path
We have that
Assuming unit time
WORK=17 time units
SPAN=8 time units
Note
Running time depends not only on work and span but also on:
Available Cores
Scheduler Policies
26 / 94
Running Time Classification
Single Processor
T1 running time on a single processor.
Multiple Processors
Tp running time on P processors.
Unlimited Processors
T∞ running time on unlimited processors, also called the span, if we
run each strand on its own processor.
28 / 94
Work Law
Definition
In one step, an ideal parallel computer with P processors can do:
At most P units of work.
Thus in TP time, it can perform at most PTP work.
P · TP ≥ T1 =⇒ TP ≥ T1/P
30 / 94
Span Law
Definition
A P-processor ideal parallel computer cannot run faster than a
machine with an unlimited number of processors.
However, a computer with an unlimited number of processors can
emulate a P-processor machine by simply using P of its processors.
Therefore,
TP ≥ T∞
31 / 94
Work Calculations: Serial
Serial Computations
[Figure: computation A followed by computation B in series]
Note
Work: T1 (A ∪ B) = T1 (A) + T1 (B).
Span: T∞ (A ∪ B) = T∞ (A) + T∞ (B).
32 / 94
Work Calculations: Parallel
Parallel Computations
[Figure: computations A and B composed in parallel]
Note
Work: T1 (A ∪ B) = T1 (A) + T1 (B).
Span: T∞ (A ∪ B) = max {T∞ (A) , T∞ (B)}.
33 / 94
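For instance, with hypothetical values T1 (A) = 6, T∞ (A) = 3, T1 (B) = 4 and T∞ (B) = 2:
Serial composition: work 6 + 4 = 10, span 3 + 2 = 5.
Parallel composition: work 6 + 4 = 10, span max {3, 2} = 3.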
Speedup and Parallelism
Speed up
The speedup of a computation on P processors is defined as T1/TP.
Then, by the work law, T1/TP ≤ P. Thus, the speedup on P processors can
be at most P.
Notes
Linear Speedup when T1/TP = Θ (P).
Perfect Linear Speedup when T1/TP = P.
Parallelism
The parallelism of a computation is defined as T1/T∞.
Specifically, we are looking for a lot of parallelism.
This changes from algorithm to algorithm.
35 / 94
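For instance, with hypothetical values T1 = 120 and T4 = 40 on P = 4 processors, the speedup is T1/T4 = 3, which respects the work law bound of at most P = 4; if in addition T∞ = 10, the parallelism is T1/T∞ = 12.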
Greedy Scheduler
Definition
A greedy scheduler assigns as many strands to processors as
possible in each time step.
Note
On P processors, if at least P strands are ready to execute during a
time step, then we say that the step is a complete step.
Otherwise we say that it is an incomplete step.
This changes from Algorithm to Algorithm.
37 / 94
Greedy Scheduler Theorem and Corollaries
Theorem 27.1
On an ideal parallel computer with P processors, a greedy scheduler
executes a multi-threaded computation with work T1 and span T∞ in time
TP ≤ T1/P + T∞.
Corollary 27.2
The running time TP of any multi-threaded computation scheduled by a
greedy scheduler on an ideal parallel computer with P processors is within
a factor of 2 of optimal.
Corollary 27.3
Let TP be the running time of a multi-threaded computation produced by
a greedy scheduler on an ideal parallel computer with P processors, and let
T1 and T∞ be the work and span of the computation, respectively. Then,
if P ≪ T1/T∞, we have TP ≈ T1/P, or equivalently, a speedup of
approximately P.
38 / 94
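For instance, with hypothetical values T1 = 1024, T∞ = 16 and P = 8, Theorem 27.1 gives TP ≤ 1024/8 + 16 = 144, while the work and span laws give TP ≥ max {1024/8, 16} = 128, so the greedy schedule is within a factor of 144/128 ≈ 1.13 of optimal in this case.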
Race Conditions
Determinacy Race
A determinacy race occurs when two logically parallel instructions access
the same memory location and at least one of the instructions performs a
write.
Example
Race-Example()
1 x = 0
2 parallel for i = 1 to 3 do
3 x = x + 1
4 print x
40 / 94
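A minimal Python sketch of the same kind of determinacy race (an added illustration, not part of the original slides): several threads perform the non-atomic read-modify-write x = x + 1 on a shared variable, so updates may be lost. Whether the loss actually shows up depends on the interpreter and on timing.

import threading

x = 0

def increment(times):
    global x
    for _ in range(times):
        # read-modify-write: another thread may run between the read of x
        # and the write back, losing an update
        x = x + 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(x)  # expected 300000, but a race may print something smaller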
Example
Determinacy Race Example
[Figure: computation DAG of the parallel loop, with strands numbered 1–11]

step   x   r1   r2   r3
  1    0
  2    0   0
  3    0   1
  4    0   1    0
  5    0   1    0    0
  6    0   1    0    1
  7    0   1    1    1
  8    1   1    1    1
  9    1   1    1    1
 10    1   1    1    1
41 / 94
Example
NOTE
Although this is of great importance, it is beyond the scope of this class.
For more about this topic, see:
Maurice Herlihy and Nir Shavit, “The Art of Multiprocessor
Programming,” Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2008.
Andrew S. Tanenbaum, “Modern Operating Systems” (3rd ed.).
Prentice Hall Press, Upper Saddle River, NJ, USA, 2007.
42 / 94
Example of Complexity: PFibonacci
Complexity
T∞ (n) = max {T∞ (n − 1) , T∞ (n − 2)} + Θ (1)
Finally
T∞ (n) = T∞ (n − 1) + Θ (1) = Θ (n)
Parallelism
T1 (n) / T∞ (n) = Θ (φ^n / n)
44 / 94
Matrix Multiplication
Trick
To multiply two n × n matrices, we perform 8 matrix multiplications of
n/2 × n/2 matrices and one addition of n × n matrices.
Idea
A = [ A11 A12 ]   B = [ B11 B12 ]   C = [ C11 C12 ]
    [ A21 A22 ]       [ B21 B22 ]       [ C21 C22 ]

C = [ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]   [ A11B11  A11B12 ]   [ A12B21  A12B22 ]
    [ C21 C22 ] = [ A21 A22 ] [ B21 B22 ] = [ A21B11  A21B12 ] + [ A22B21  A22B22 ]
46 / 94
Any Idea to Parallelize the Code?
What do you think?
Did you notice the multiplications of sub-matrices?
Then What?
We have for example A11B11 and A12B21!!!
We can do the following
A11B11 + A12B21
47 / 94
The use of the recursion!!!
As always our friend!!!
48 / 94
Pseudo-code of Matrix-Multiply
Matrix − Multiply(C, A, B, n) // The result of A × B in C with n a power of 2 for simplicity
1 if (n == 1)
2 C [1, 1] = A [1, 1] · B [1, 1]
3 else
4 allocate a temporary matrix T [1...n, 1...n]
5 partition A, B, C, T into n/2 × n/2 sub-matrices
6 spawn Matrix − Multiply (C11, A11, B11, n/2)
7 spawn Matrix − Multiply (C12, A11, B12, n/2)
8 spawn Matrix − Multiply (C21, A21, B11, n/2)
9 spawn Matrix − Multiply (C22, A21, B12, n/2)
10 spawn Matrix − Multiply (T11, A12, B21, n/2)
11 spawn Matrix − Multiply (T12, A12, B22, n/2)
12 spawn Matrix − Multiply (T21, A22, B21, n/2)
13 Matrix − Multiply (T22, A22, B22, n/2)
14 sync
15 Matrix − Add(C, T, n)
49 / 94
Explanation
Lines 1 - 2
Stops the recursion once you have only two numbers to multiply
Line 4
Extra matrix for storing the second summand (labeled T) in
[ A11B11  A11B12 ]   [ A12B21  A12B22 ]
[ A21B11  A21B12 ] + [ A22B21  A22B22 ]
Line 5
Do the desired partition!!!
50 / 94
Explanation
Lines 6 to 13
Calculating the products in
[ A11B11  A11B12 ]   [ A12B21  A12B22 ]
[ A21B11  A21B12 ] + [ A22B21  A22B22 ]
using recursion and parallel computations
Line 14
A barrier to wait until all the parallel computations are done!!!
Line 15
Call Matrix − Add to add C and T.
51 / 94
Matrix ADD
Matrix Add Code
Matrix − Add(C, T, n)
// Add matrices C and T in-place to produce C = C + T
1 if (n == 1)
2 C [1, 1] = C [1, 1] + T [1, 1]
3 else
4 Partition C and T into n/2 × n/2 sub-matrices
5 spawn Matrix − Add (C11, T11, n/2)
6 spawn Matrix − Add (C12, T12, n/2)
7 spawn Matrix − Add (C21, T21, n/2)
8 Matrix − Add (C22, T22, n/2)
9 sync
52 / 94
Explanation
Line 1 - 2
Stops the recursion once you have only two numbers to add
Line 4
To Partition
C = [ A11B11  A11B12 ]   T = [ A12B21  A12B22 ]
    [ A21B11  A21B12 ]       [ A22B21  A22B22 ]
In lines 5 to 8
We do the following sum in parallel!!!
        C                       T
[ A11B11  A11B12 ]   [ A12B21  A12B22 ]
[ A21B11  A21B12 ] + [ A22B21  A22B22 ]
53 / 94
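To tie the two procedures together, here is a minimal serial Python sketch of the same divide-and-conquer scheme (an added illustration, not part of the original slides), for n a power of 2; the comments mark where the spawns and the sync of Matrix-Multiply and Matrix-Add would go in the parallel version.

def split(M):
    # quadrants M11, M12, M21, M22 of an n x n matrix, n a power of 2
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def join_quadrants(M11, M12, M21, M22):
    return [a + b for a, b in zip(M11, M12)] + [a + b for a, b in zip(M21, M22)]

def matrix_add(C, T):
    # in the parallel version the four quadrant additions are spawned (lines 5-8)
    return [[c + t for c, t in zip(rc, rt)] for rc, rt in zip(C, T)]

def matrix_multiply(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # the eight n/2 x n/2 products, spawned in lines 6-13 of the parallel version
    C = join_quadrants(matrix_multiply(A11, B11), matrix_multiply(A11, B12),
                       matrix_multiply(A21, B11), matrix_multiply(A21, B12))
    T = join_quadrants(matrix_multiply(A12, B21), matrix_multiply(A12, B22),
                       matrix_multiply(A22, B21), matrix_multiply(A22, B22))
    # the sync goes here, followed by Matrix-Add(C, T, n)
    return matrix_add(C, T)

print(matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]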
Calculating Complexity of Matrix Multiplication
Work of Matrix Multiplication
The work T1 (n) of matrix multiplication satisfies the recurrence:
T1 (n) = 8 T1 (n/2) [the sequential products] + Θ (n²) [the sequential sum] = Θ (n³).
54 / 94
Calculating Complexity of Matrix Multiplication
Span of Matrix Multiplication
T∞ (n) = T∞ (n/2) [the parallel products] + Θ (log n) [the parallel sum] = Θ (log² n)
This is because:
The T∞ (n/2) term: all the n/2 × n/2 products are computed at the same
time because of parallelism.
Θ (log n) is the span of the addition of the matrices (remember, we
are using unlimited processors), which has a critical path of length
log n.
55 / 94
Collapsing the sum
[Figure: the parallel sum, with additions collapsed pairwise in a binary tree]
56 / 94
How much Parallelism?
The Final Parallelism in this Algorithm is
T1 (n) / T∞ (n) = Θ (n³ / log² n)
Quite A Lot!!!
57 / 94
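For instance, for a hypothetical n = 1000, the work is on the order of 10^9 while the span is on the order of log² 1000 ≈ 100, so the parallelism is roughly 10^7: far more than the number of processors of any real machine.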
Merge-Sort : The Serial Version
We have
Merge − Sort (A, p, r)
Observation: Sort elements in A [p...r]
1 if (p < r) then
2 q = (p+r)/2
3 Merge − Sort (A, p, q)
4 Merge − Sort (A, q + 1, r)
5 Merge (A, p, q, r)
59 / 94
Merge-Sort : The Parallel Version
We have
Merge − Sort (A, p, r)
Observation: Sort elements in A [p...r]
1 if (p < r) then
2 q = (p+r)/2
3 spawn Merge − Sort (A, p, q)
4 Merge − Sort (A, q + 1, r) // Not necessary to spawn this
5 sync
6 Merge (A, p, q, r)
60 / 94
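A rough Python sketch of this simple parallel Merge-Sort (an added illustration, not part of the original slides): the first recursive call is "spawned" as a thread while the parent sorts the other half, and join plays the role of sync; the merge itself is still serial, which is exactly the bottleneck discussed a few slides later. The depth cutoff is an added practical detail to avoid creating too many threads.

import threading

def merge(left, right):
    # serial merge, Theta(n)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def p_merge_sort(a, depth=2):
    if len(a) <= 1:
        return a
    q = len(a) // 2
    if depth > 0:
        box = {}
        # "spawn" Merge-Sort on the first half in a child thread
        child = threading.Thread(
            target=lambda: box.update(left=p_merge_sort(a[:q], depth - 1)))
        child.start()
        right = p_merge_sort(a[q:], depth - 1)
        child.join()  # "sync"
        left = box["left"]
    else:
        left, right = p_merge_sort(a[:q], 0), p_merge_sort(a[q:], 0)
    return merge(left, right)

print(p_merge_sort([5, 2, 9, 1, 7, 3]))  # [1, 2, 3, 5, 7, 9]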
Calculating Complexity of This simple Parallel Merge-Sort
Work of Merge-Sort
The work of T1 (n) of this Parallel Merge-Sort satisfies the recurrence:
T1 (n) = Θ (1) if n = 1, and 2 T1 (n/2) + Θ (n) otherwise
       = Θ (n log n)
Because of the Master Theorem, Case 2.
Span
T∞ (n) = Θ (1) if n = 1, and T∞ (n/2) + Θ (n) otherwise
We have then
The T∞ (n/2) term: the two recursive sorts run at the same time because of parallelism.
Then, T∞ (n) = Θ (n) because of the Master Theorem, Case 3.
61 / 94
How much Parallelism?
The Final Parallelism in this Algorithm is
T1 (n) / T∞ (n) = Θ (log n)
NOT A Lot!!!
62 / 94
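For instance, for a hypothetical n = 10^6 the parallelism is only about log 10^6 ≈ 20, so no more than roughly 20 processors can be used effectively; this is the bottleneck addressed next.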
Can we improve this?
We have a problem
We have a bottleneck!!! Where?
Yes in the Merge part!!!
We need to improve that bottleneck!!!
63 / 94
Parallel Merge
Example: Here, we use an intermediate array T
64 / 94
Parallel Merge
Step 1. Find x = T [q1] where q1 = (p1+r1)/2 or the midpoint in
T [p1..r1]
65 / 94
Parallel Merge
Step 2. Use Binary Search in T [p2..r2] to find q2
66 / 94
Then
So that if we insert x between T [q2 − 1] and T [q2], then
T [p2] · · · T [q2 − 1] x T [q2] · · · T [r2] is sorted
67 / 94
Binary Search
It takes a key x and a sub-array T [p..r] and it does
1 If T [p..r] is empty (r < p), then it returns the index p.
2 if x ≤ T [p], then it returns p.
3 if x > T [p], then it returns the largest index q in the range
p < q ≤ r + 1 such that T [q − 1] < x.
68 / 94
Binary Search Code
BINARY-SEARCH(x, T, p, r)
1 low = p
2 high = max {p, r + 1}
3 while low < high
4 mid = ⌊(low + high)/2⌋
5 if x ≤ T [mid]
6 high = mid
7 else low = mid + 1
8 return high
69 / 94
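As a rough 0-based analogue (an added illustration, not part of the original slides), Python's bisect_left returns exactly this index q: the smallest position q with x ≤ T[q], equivalently the largest q whose predecessors are all strictly smaller than x.

from bisect import bisect_left

T = [3, 7, 7, 12, 20]
x = 7
q = bisect_left(T, x)  # largest q with T[q - 1] < x, i.e. smallest q with x <= T[q]
print(q)  # 1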
Parallel Merge
Step 3. Copy x in A [q3] where q3 = p3 + (q1 − p1) + (q2 − p2)
70 / 94
Parallel Merge
Step 4. Recursively merge T [p1..q1 − 1] and T [p2..q2 − 1] and place
result into A [p3..q3 − 1]
71 / 94
Parallel Merge
Step 5. Recursively merge T [q1 + 1..r1] and T [q2..r2] and place
result into A [q3 + 1..r3]
72 / 94
The Final Code for Parallel Merge
Par − Merge (T, p1, r1, p2, r2, A, p3)
1 n1 = r1 − p1 + 1, n2 = r2 − p2 + 1
2 if n1 < n2
3 Exchange p1 ↔ p2, r1 ↔ r2, n1 ↔ n2
4 if (n1 == 0)
5 return
6 else
7 q1 = (p1+r1)/2
8 q2 = BinarySearch (T [q1] , T, p2, r2)
9 q3 = p3 + (q1 − p1) + (q2 − p2)
10 A [q3] = T [q1]
11 spawn Par − Merge (T, p1, q1 − 1, p2, q2 − 1, A, p3)
12 Par − Merge (T, q1 + 1, r1, q2, r2, A, q3 + 1)
13 sync
73 / 94
Explanation
Line 1
Obtain the length of the two arrays to be merged
Line 2: If one is larger than the other
We exchange the variables so that we work with the larger array!!! In this
case we make n1 ≥ n2
Line 4
If n1 == 0, return: there is nothing to merge!!!
74 / 94
Explanation
Line 10
It copies T [q1] (the element x) directly into A [q3].
Lines 11 and 12
They use nested parallelism to recursively merge the elements that precede x and the
elements that follow x in the merged output.
Line 13
The sync ensures that both subproblems have completed before
the procedure returns.
75 / 94
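To make the index bookkeeping concrete, here is a minimal Python sketch of Par-Merge (it reuses the binary_search sketch given earlier; the recursive calls run one after the other, with comments marking where the spawn and the sync of the multithreaded version would go):

def par_merge(T, p1, r1, p2, r2, A, p3):
    # Merge the sorted ranges T[p1..r1] and T[p2..r2] into A starting at p3.
    n1 = r1 - p1 + 1
    n2 = r2 - p2 + 1
    if n1 < n2:                        # make T[p1..r1] the larger range (lines 2-3)
        p1, p2 = p2, p1
        r1, r2 = r2, r1
        n1, n2 = n2, n1
    if n1 == 0:                        # both ranges empty: nothing to merge (lines 4-5)
        return
    q1 = (p1 + r1) // 2                # midpoint of the larger range (line 7)
    x = T[q1]
    q2 = binary_search(x, T, p2, r2)   # where x splits the smaller range (line 8)
    q3 = p3 + (q1 - p1) + (q2 - p2)    # final position of x in A (line 9)
    A[q3] = x                          # line 10
    # line 11: in the multithreaded version this call is spawned
    par_merge(T, p1, q1 - 1, p2, q2 - 1, A, p3)
    # line 12: runs in parallel with the spawned call
    par_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
    # line 13: sync would wait here for the spawned call to finish

For example, with T = [1, 3, 5, 2, 4, 6] and A = [0] * 6, the call par_merge(T, 0, 2, 3, 5, A, 0) leaves A = [1, 2, 3, 4, 5, 6].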
First the Span Complexity of Parallel Merge: T∞ (n)
Suppositions
n = n1 + n2
What case should we study?
Since the two recursive calls run in parallel, the span is governed by the larger of the two
subproblems: T∞ (n) = max {T∞ (size of each recursive call)} + Θ (log n)
We notice then that
Because of the exchange in lines 2-3, n2 ≤ n1
76 / 94
Span Complexity of Parallel Merge: T∞ (n)
Then
2n2 ≤ n1 + n2 = n =⇒ n2 ≤ n/2
Thus
In the worst case, one of the recursive calls in lines 11-12 merges:
n1/2 elements of T [p1...r1] (Remember we are halving that array at its mid-point).
With all n2 elements of T [p2...r2].
77 / 94
Span Complexity of Parallel Merge: T∞ (n)
Thus, the number of elements involved in such a call is

n1/2 + n2 = n1/2 + n2/2 + n2/2
          ≤ n1/2 + n2/2 + (n/2)/2
          = (n1 + n2)/2 + n/4
          ≤ n/2 + n/4
          = 3n/4
78 / 94
Span Complexity of Parallel Merge: T∞ (n)
Knowing that the Binary Search takes
Θ (log n)
We get the span for parallel merge
T∞ (n) = T∞ (3n/4) + Θ (log n)
This can be solved using exercise 4.6-2 in Cormen's book
T∞ (n) = Θ (log² n)
79 / 94
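A quick way to see why this solves to Θ (log² n): each level of the recursion shrinks the problem by a factor of 3/4, so the recursion has log_{4/3} n = Θ (log n) levels, and every level pays the Θ (log n) binary-search cost; multiplying the two gives Θ (log² n). The cited exercise turns this sketch into a proper proof.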
Calculating Work Complexity of Parallel Merge
Ok!!! We need to calculate the WORK
T1 (n) = Θ (Something)
Thus
We need to find both an upper and a lower bound.
80 / 94
Calculating Work Complexity of Parallel Merge
Work of Parallel Merge
The work T1 (n) of this Parallel Merge satisfies:
T1 (n) = Ω (n)
Because each of the n elements must be copied from array T to array
A.
What about the Upper Bound O?
First notice that one recursive call can receive as few as n/4 elements when
the other receives the worst case of n1/2 + n2, which is at most 3n/4 elements.
Plus the work of the Binary Search, O (log n).
81 / 94
Calculating Work Complexity of Parallel Merge
Then
For some α ∈ [1/4, 3/4], we have the following recursion for
Parallel Merge when we have one processor:
T1 (n) = T1 (αn) + T1 ((1 − α) n) + Θ (log n)
where the two T1 terms are the recursive merges (Merge Part) and the Θ (log n) term is the Binary Search.
Remark: α varies at each level of the recursion!!!
82 / 94
Calculating Work Complexity of Parallel Merge
Then
Assume that T1 (n) ≤ c1n − c2 log n for positive constants c1 and c2.
Using c3 log n for the Θ (log n) term, we then have
T1 (n) ≤ T1 (αn) + T1 ((1 − α) n) + c3 log n
≤ c1αn − c2 log (αn) + c1 (1 − α) n − c2 log ((1 − α) n) + c3 log n
= c1n − c2 log (α(1 − α)) − 2c2 log n + c3 log n (since log (αn) + log ((1 − α) n) = 2 log n + log (α(1 − α)))
= c1n − c2 (log n + log (α(1 − α))) − (c2 − c3) log n
≤ c1n − (c2 − c3) log n because log n + log (α(1 − α)) > 0 for n large enough
83 / 94
Calculating Work Complexity of Parallel Merge
Now, since 0 < α(1 − α) < 1
We have log (α(1 − α)) < 0
Thus, making n large enough
log n + log (α(1 − α)) > 0 (1)
Then
T1 (n) ≤ c1n − (c2 − c3) log n
84 / 94
Calculating Work Complexity of Parallel Merge
Now, choosing c2 large enough that
c2 − c3 ≥ 0
We have that
T1 (n) ≤ c1n = O(n)
85 / 94
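As an informal numeric sanity check of this conclusion (an illustrative sketch of ours, letting the split fraction α be chosen adversarially at every node, as the earlier remark allows), one can evaluate the recurrence directly and watch the ratio T1 (n)/n stay bounded:

import math
from functools import lru_cache

@lru_cache(maxsize=None)
def worst_work(n):
    # Worst case of T1(n) = T1(k) + T1(n - k) + log2(n) over all splits
    # with n/4 <= k <= 3n/4, using log2(n) as a stand-in for the Theta(log n) term.
    if n <= 1:
        return 1.0
    lo = max(1, n // 4)
    hi = min(n - 1, (3 * n) // 4)
    best = max(worst_work(k) + worst_work(n - k) for k in range(lo, hi + 1))
    return best + math.log2(n)

for n in (64, 256, 1024, 4096):
    print(n, round(worst_work(n) / n, 3))   # the ratio stays bounded, consistent with T1(n) = Theta(n)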
Finally
Then
T1 (n) = Θ (n)
The parallelism of Parallel Merge
T1 (n) / T∞ (n) = Θ (n / log² n)
86 / 94
Then What is the Complexity of Parallel Merge-sort with
Parallel Merge?
First, the new code - Input A [p..r] - Output B [s..s + r − p]
Par − Merge − Sort (A, p, r, B, s)
1 n = r − p + 1
2 if (n == 1)
3 B [s] = A [p]
4 else let T [1..n] be a new array
5 q = ⌊(p + r)/2⌋
6 q′ = q − p + 1
7 spawn Par − Merge − Sort (A, p, q, T, 1)
8 Par − Merge − Sort (A, q + 1, r, T, q′ + 1)
9 sync
10 Par − Merge (T, 1, q′, q′ + 1, n, B, s)
87 / 94
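As with Par-Merge, a minimal Python sketch (sequential, reusing the par_merge function above, with comments marking the spawn/sync points) may help with the indices; the scratch array T is kept 1-based, as in the pseudocode:

def par_merge_sort(A, p, r, B, s):
    # Sort A[p..r] into B[s..s + r - p].
    n = r - p + 1
    if n == 1:
        B[s] = A[p]
    else:
        T = [None] * (n + 1)          # scratch array T[1..n]; index 0 unused
        q = (p + r) // 2              # midpoint of A[p..r]
        qp = q - p + 1                # q' in the pseudocode
        # line 7: spawned in the multithreaded version
        par_merge_sort(A, p, q, T, 1)
        # line 8: runs in parallel with the spawned call
        par_merge_sort(A, q + 1, r, T, qp + 1)
        # line 9: sync would wait here
        par_merge(T, 1, qp, qp + 1, n, B, s)

A = [5, 2, 4, 7, 1, 3, 2, 6]
B = [0] * len(A)
par_merge_sort(A, 0, len(A) - 1, B, 0)
print(B)   # [1, 2, 2, 3, 4, 5, 6, 7]

A genuinely parallel version could submit the spawned call to a worker pool (for instance with concurrent.futures), but that is beyond this sketch.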
Then, What is the amount of Parallelism of Parallel
Merge-sort with Parallel Merge?
Work
Using the Θ (n) work of Parallel Merge, we get the recursion
(T1^PMS is the work of Parallel Merge-Sort, T1^PM the work of Parallel Merge):

T1^PMS (n) = 2 T1^PMS (n/2) + T1^PM (n)
           = 2 T1^PMS (n/2) + Θ (n)
           = Θ (n log n)   (Case 2 of the Master Theorem)
88 / 94
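To check the Master Theorem step: here a = 2, b = 2 and f (n) = Θ (n), and since n^(log_b a) = n^(log2 2) = n we have f (n) = Θ (n^(log_b a)), which is exactly Case 2, giving Θ (n log n).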
Then, What is the amount of Parallelism of Parallel
Merge-sort with Parallel Merge?
Span
We get the following recursion for the span by taking into account that lines
7 and 8 of Parallel Merge-Sort run in parallel:

T∞^PMS (n) = T∞^PMS (n/2) + T∞^PM (n)
           = T∞^PMS (n/2) + Θ (log² n)
           = Θ (log³ n)   (Exercise 4.6-2 in Cormen's book)
89 / 94
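Informally: unrolling this recursion gives Θ (log n) levels (n is halved each time), each costing at most Θ (log² n), so the span is O (log³ n); and the first half of the levels still cost Ω (log² n) each, which gives the matching Ω (log³ n) lower bound.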
Then, What is the amount of Parallelism of Parallel
Merge-sort with Parallel Merge?
Parallelism
T1^PMS (n) / T∞^PMS (n) = Θ (n log n / log³ n) = Θ (n / log² n)
90 / 94
Plotting both Parallelisms
We get an incredible difference between the parallelism of the two algorithms
91 / 94
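Since the plot itself is not reproduced in this transcript, here is a rough way to recreate the comparison (an illustrative sketch, assuming the two curves are the parallelism Θ (log n) of merge sort with an ordinary serial merge and the parallelism Θ (n / log² n) just derived):

import math

# Hypothetical reconstruction of the comparison behind the missing plot.
for n in (2 ** 10, 2 ** 15, 2 ** 20, 2 ** 25):
    serial = math.log2(n)                   # parallelism with a serial merge
    parallel = n / math.log2(n) ** 2        # parallelism with Par-Merge
    print(f"n = {n:>9}:  log n = {serial:6.1f}   n / log^2 n = {parallel:12.1f}")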
Plotting the T∞
We see an incredible difference when running both algorithms with
an infinite number of processors!!!
92 / 94
Outline
1 Introduction
Why Multi-Threaded Algorithms?
2 Model To Be Used
Symmetric Multiprocessor
Operations
Example
3 Computation DAG
Introduction
4 Performance Measures
Introduction
Running Time Classification
5 Parallel Laws
Work and Span Laws
Speedup and Parallelism
Greedy Scheduler
Scheduling Raises the Following Issue
6 Examples
Parallel Fibonacci
Matrix Multiplication
Parallel Merge-Sort
7 Exercises
Some Exercises you can try!!!
93 / 94
Exercises
Some exercises you can try, from Chapter 27 of Cormen's book:
27.1-1
27.1-2
27.1-4
27.1-6
27.1-7
27.2-1
27.2-3
27.2-4
27.2-5
27.3-1
27.3-2
27.3-3
27.3-4
94 / 94

More Related Content

What's hot

Simulation using OMNet++
Simulation using OMNet++Simulation using OMNet++
Simulation using OMNet++
jeromy fu
 
BANKER'S ALGORITHM
BANKER'S ALGORITHMBANKER'S ALGORITHM
BANKER'S ALGORITHM
Muhammad Baqar Kazmi
 
POST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEMPOST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEM
Rajendran
 
Paging and segmentation
Paging and segmentationPaging and segmentation
Paging and segmentation
Piyush Rochwani
 
INTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptxINTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptx
LECO9
 
Concurrency Control in Database Management System
Concurrency Control in Database Management SystemConcurrency Control in Database Management System
Concurrency Control in Database Management System
Janki Shah
 
Product Cipher
Product CipherProduct Cipher
Product Cipher
SHUBHA CHATURVEDI
 
Peephole optimization techniques in compiler design
Peephole optimization techniques in compiler designPeephole optimization techniques in compiler design
Peephole optimization techniques in compiler design
Anul Chaudhary
 
Network Layer design Issues.pptx
Network Layer design Issues.pptxNetwork Layer design Issues.pptx
Network Layer design Issues.pptx
Acad
 
Information and network security 35 the chinese remainder theorem
Information and network security 35 the chinese remainder theoremInformation and network security 35 the chinese remainder theorem
Information and network security 35 the chinese remainder theorem
Vaibhav Khanna
 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
Ashikapokiya12345
 
Error Recovery strategies and yacc | Compiler Design
Error Recovery strategies and yacc | Compiler DesignError Recovery strategies and yacc | Compiler Design
Error Recovery strategies and yacc | Compiler Design
Shamsul Huda
 
DES
DESDES
Hash Function
Hash FunctionHash Function
Hash Function
Siddharth Srivastava
 
Chapter 1 Introduction of Cryptography and Network security
Chapter 1 Introduction of Cryptography and Network security Chapter 1 Introduction of Cryptography and Network security
Chapter 1 Introduction of Cryptography and Network security
Dr. Kapil Gupta
 
Rsa cryptosystem
Rsa cryptosystemRsa cryptosystem
Rsa cryptosystem
Abhishek Gautam
 
How Hashing Algorithms Work
How Hashing Algorithms WorkHow Hashing Algorithms Work
How Hashing Algorithms Work
CheapSSLsecurity
 
Introduction to Compiler design
Introduction to Compiler design Introduction to Compiler design
Introduction to Compiler design
Dr. C.V. Suresh Babu
 
Asymmetric Cryptography
Asymmetric CryptographyAsymmetric Cryptography
Asymmetric Cryptography
UTD Computer Security Group
 
Key Management and Distribution
Key Management and DistributionKey Management and Distribution
Key Management and Distribution
Syed Bahadur Shah
 

What's hot (20)

Simulation using OMNet++
Simulation using OMNet++Simulation using OMNet++
Simulation using OMNet++
 
BANKER'S ALGORITHM
BANKER'S ALGORITHMBANKER'S ALGORITHM
BANKER'S ALGORITHM
 
POST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEMPOST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEM
 
Paging and segmentation
Paging and segmentationPaging and segmentation
Paging and segmentation
 
INTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptxINTER PROCESS COMMUNICATION (IPC).pptx
INTER PROCESS COMMUNICATION (IPC).pptx
 
Concurrency Control in Database Management System
Concurrency Control in Database Management SystemConcurrency Control in Database Management System
Concurrency Control in Database Management System
 
Product Cipher
Product CipherProduct Cipher
Product Cipher
 
Peephole optimization techniques in compiler design
Peephole optimization techniques in compiler designPeephole optimization techniques in compiler design
Peephole optimization techniques in compiler design
 
Network Layer design Issues.pptx
Network Layer design Issues.pptxNetwork Layer design Issues.pptx
Network Layer design Issues.pptx
 
Information and network security 35 the chinese remainder theorem
Information and network security 35 the chinese remainder theoremInformation and network security 35 the chinese remainder theorem
Information and network security 35 the chinese remainder theorem
 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
 
Error Recovery strategies and yacc | Compiler Design
Error Recovery strategies and yacc | Compiler DesignError Recovery strategies and yacc | Compiler Design
Error Recovery strategies and yacc | Compiler Design
 
DES
DESDES
DES
 
Hash Function
Hash FunctionHash Function
Hash Function
 
Chapter 1 Introduction of Cryptography and Network security
Chapter 1 Introduction of Cryptography and Network security Chapter 1 Introduction of Cryptography and Network security
Chapter 1 Introduction of Cryptography and Network security
 
Rsa cryptosystem
Rsa cryptosystemRsa cryptosystem
Rsa cryptosystem
 
How Hashing Algorithms Work
How Hashing Algorithms WorkHow Hashing Algorithms Work
How Hashing Algorithms Work
 
Introduction to Compiler design
Introduction to Compiler design Introduction to Compiler design
Introduction to Compiler design
 
Asymmetric Cryptography
Asymmetric CryptographyAsymmetric Cryptography
Asymmetric Cryptography
 
Key Management and Distribution
Key Management and DistributionKey Management and Distribution
Key Management and Distribution
 

Viewers also liked

Multithreaded algorithms
Multithreaded algorithmsMultithreaded algorithms
Multithreaded algorithms
Soicon Karo
 
Aula sobre multithreading
Aula sobre multithreadingAula sobre multithreading
Aula sobre multithreading
Bianca Dantas
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
Dr Sandeep Kumar Poonia
 
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU ArchitectureAn OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Waqas Tariq
 
Software Projects Java Projects Ieee Projects Domains
Software Projects Java Projects Ieee Projects DomainsSoftware Projects Java Projects Ieee Projects Domains
Software Projects Java Projects Ieee Projects Domains
ncct
 
Chuong 05 de quy
Chuong 05 de quyChuong 05 de quy
Chuong 05 de quy
Cau Chu Nho
 
optimizing code in compilers using parallel genetic algorithm
optimizing code in compilers using parallel genetic algorithm optimizing code in compilers using parallel genetic algorithm
optimizing code in compilers using parallel genetic algorithm
Fatemeh Karimi
 
Comparing different concurrency models on the JVM
Comparing different concurrency models on the JVMComparing different concurrency models on the JVM
Comparing different concurrency models on the JVM
Mario Fusco
 
Parallel sorting
Parallel sortingParallel sorting
Parallel sorting
Mr. Vikram Singh Slathia
 
Многопоточные Алгоритмы (для BitByte 2014)
Многопоточные Алгоритмы (для BitByte 2014)Многопоточные Алгоритмы (для BitByte 2014)
Многопоточные Алгоритмы (для BitByte 2014)Roman Elizarov
 

Viewers also liked (10)

Multithreaded algorithms
Multithreaded algorithmsMultithreaded algorithms
Multithreaded algorithms
 
Aula sobre multithreading
Aula sobre multithreadingAula sobre multithreading
Aula sobre multithreading
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU ArchitectureAn OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
 
Software Projects Java Projects Ieee Projects Domains
Software Projects Java Projects Ieee Projects DomainsSoftware Projects Java Projects Ieee Projects Domains
Software Projects Java Projects Ieee Projects Domains
 
Chuong 05 de quy
Chuong 05 de quyChuong 05 de quy
Chuong 05 de quy
 
optimizing code in compilers using parallel genetic algorithm
optimizing code in compilers using parallel genetic algorithm optimizing code in compilers using parallel genetic algorithm
optimizing code in compilers using parallel genetic algorithm
 
Comparing different concurrency models on the JVM
Comparing different concurrency models on the JVMComparing different concurrency models on the JVM
Comparing different concurrency models on the JVM
 
Parallel sorting
Parallel sortingParallel sorting
Parallel sorting
 
Многопоточные Алгоритмы (для BitByte 2014)
Многопоточные Алгоритмы (для BitByte 2014)Многопоточные Алгоритмы (для BitByte 2014)
Многопоточные Алгоритмы (для BitByte 2014)
 

Similar to 24 Multithreaded Algorithms

Webinar slides: Top 9 Tips for building a stable MySQL Replication environment
Webinar slides: Top 9 Tips for building a stable MySQL Replication environmentWebinar slides: Top 9 Tips for building a stable MySQL Replication environment
Webinar slides: Top 9 Tips for building a stable MySQL Replication environment
Severalnines
 
The Art of Message Queues - TEKX
The Art of Message Queues - TEKXThe Art of Message Queues - TEKX
The Art of Message Queues - TEKX
Mike Willbanks
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
Zvi Avraham
 
Apache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 MistakesApache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 Mistakes
John Coggeshall
 
Java performance - not so scary after all
Java performance - not so scary after allJava performance - not so scary after all
Java performance - not so scary after all
Holly Cummins
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
dairsie
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
Ganesan Narayanasamy
 
Performance and Scalability
Performance and ScalabilityPerformance and Scalability
Performance and Scalability
Mediacurrent
 
Introduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimizationIntroduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimization
CSUC - Consorci de Serveis Universitaris de Catalunya
 
Top 10 Scalability Mistakes
Top 10 Scalability MistakesTop 10 Scalability Mistakes
Top 10 Scalability Mistakes
John Coggeshall
 
scale_perf_best_practices
scale_perf_best_practicesscale_perf_best_practices
scale_perf_best_practices
webuploader
 
Cache memory and cache
Cache memory and cacheCache memory and cache
Cache memory and cache
VISHAL DONGA
 
Cpu scheduling
Cpu schedulingCpu scheduling
Cpu scheduling
GRajendra
 
Meet a parallel, asynchronous PHP world
Meet a parallel, asynchronous PHP worldMeet a parallel, asynchronous PHP world
Meet a parallel, asynchronous PHP world
Steve Maraspin
 
MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011
MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011
MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011
Ulf Wendel
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
David Groozman
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Cal Henderson
 
Introduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimizationIntroduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimization
CSUC - Consorci de Serveis Universitaris de Catalunya
 
Applications of parellel computing
Applications of parellel computingApplications of parellel computing
Applications of parellel computing
pbhopi
 
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
Ritu Arora
 

Similar to 24 Multithreaded Algorithms (20)

Webinar slides: Top 9 Tips for building a stable MySQL Replication environment
Webinar slides: Top 9 Tips for building a stable MySQL Replication environmentWebinar slides: Top 9 Tips for building a stable MySQL Replication environment
Webinar slides: Top 9 Tips for building a stable MySQL Replication environment
 
The Art of Message Queues - TEKX
The Art of Message Queues - TEKXThe Art of Message Queues - TEKX
The Art of Message Queues - TEKX
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Apache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 MistakesApache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 Mistakes
 
Java performance - not so scary after all
Java performance - not so scary after allJava performance - not so scary after all
Java performance - not so scary after all
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
Performance and Scalability
Performance and ScalabilityPerformance and Scalability
Performance and Scalability
 
Introduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimizationIntroduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimization
 
Top 10 Scalability Mistakes
Top 10 Scalability MistakesTop 10 Scalability Mistakes
Top 10 Scalability Mistakes
 
scale_perf_best_practices
scale_perf_best_practicesscale_perf_best_practices
scale_perf_best_practices
 
Cache memory and cache
Cache memory and cacheCache memory and cache
Cache memory and cache
 
Cpu scheduling
Cpu schedulingCpu scheduling
Cpu scheduling
 
Meet a parallel, asynchronous PHP world
Meet a parallel, asynchronous PHP worldMeet a parallel, asynchronous PHP world
Meet a parallel, asynchronous PHP world
 
MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011
MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011
MySQL native driver for PHP (mysqlnd) - Introduction and overview, Edition 2011
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Introduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimizationIntroduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimization
 
Applications of parellel computing
Applications of parellel computingApplications of parellel computing
Applications of parellel computing
 
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
 

More from Andres Mendez-Vazquez

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
Andres Mendez-Vazquez
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
Andres Mendez-Vazquez
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
Andres Mendez-Vazquez
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
Zetta global
Zetta globalZetta global
Zetta global
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
Andres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
Andres Mendez-Vazquez
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
Andres Mendez-Vazquez
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
Andres Mendez-Vazquez
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
Andres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

sieving analysis and results interpretation
sieving analysis and results interpretationsieving analysis and results interpretation
sieving analysis and results interpretation
ssuser36d3051
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
SyedAbiiAzazi1
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
rpskprasana
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
mahammadsalmanmech
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
yokeleetan1
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
mamunhossenbd75
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
Mukeshwaran Balu
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
drwaing
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
Divyam548318
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 

Recently uploaded (20)

sieving analysis and results interpretation
sieving analysis and results interpretationsieving analysis and results interpretation
sieving analysis and results interpretation
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 

24 Multithreaded Algorithms

  • 1. Analysis of Algorithms Multi-threaded Algorithms Andres Mendez-Vazquez April 15, 2016 1 / 94
  • 2. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 2 / 94
  • 3. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 3 / 94
  • 4. Multi-Threaded Algorithms Motivation Until now, our serial algorithms are quite suitable for running on a single processor system. However, multiprocessor algorithms are ubiquitous: Therefore, extending our serial models to a parallel computation model is a must. Computational Model There exist many competing models of parallel computation that are essentially different: Shared Memory Message Passing Etc. 4 / 94
  • 5. Multi-Threaded Algorithms Motivation Until now, our serial algorithms are quite suitable for running on a single processor system. However, multiprocessor algorithms are ubiquitous: Therefore, extending our serial models to a parallel computation model is a must. Computational Model There exist many competing models of parallel computation that are essentially different: Shared Memory Message Passing Etc. 4 / 94
  • 6. Multi-Threaded Algorithms Motivation Until now, our serial algorithms are quite suitable for running on a single processor system. However, multiprocessor algorithms are ubiquitous: Therefore, extending our serial models to a parallel computation model is a must. Computational Model There exist many competing models of parallel computation that are essentially different: Shared Memory Message Passing Etc. 4 / 94
  • 7. Multi-Threaded Algorithms Motivation Until now, our serial algorithms are quite suitable for running on a single processor system. However, multiprocessor algorithms are ubiquitous: Therefore, extending our serial models to a parallel computation model is a must. Computational Model There exist many competing models of parallel computation that are essentially different: Shared Memory Message Passing Etc. 4 / 94
  • 8. Multi-Threaded Algorithms Motivation Until now, our serial algorithms are quite suitable for running on a single processor system. However, multiprocessor algorithms are ubiquitous: Therefore, extending our serial models to a parallel computation model is a must. Computational Model There exist many competing models of parallel computation that are essentially different: Shared Memory Message Passing Etc. 4 / 94
  • 9. Multi-Threaded Algorithms Motivation Until now, our serial algorithms are quite suitable for running on a single processor system. However, multiprocessor algorithms are ubiquitous: Therefore, extending our serial models to a parallel computation model is a must. Computational Model There exist many competing models of parallel computation that are essentially different: Shared Memory Message Passing Etc. 4 / 94
  • 10. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 5 / 94
  • 11. The Model to Be Used Symmetric Multiprocessor The model that we will use is the Symmetric Multiprocessor (SMP) where a shared memory exists. L3 Cache L2 Cache L1i Cache L1d Cache CPU Core 1 CPU Core 3 CPU Core 4 CPU Core 2 L1d Cache L1d Cache L1d Cache L1i Cache L1i Cache L1i Cache L2 Cache L2 CacheL2 Cache P-to-P L3 Cache L2 Cache L1i Cache L1d Cache CPU Core 1 CPU Core 3 CPU Core 4 CPU Core 2 L1d Cache L1d Cache L1d Cache L1i Cache L1i Cache L1i Cache L2 Cache L2 CacheL2 Cache P-to-P L3 Cache L2 Cache L1i Cache L1d Cache CPU Core 1 CPU Core 3 CPU Core 4 CPU Core 2 L1d Cache L1d Cache L1d Cache L1i Cache L1i Cache L1i Cache L2 Cache L2 CacheL2 Cache P-to-P L3 Cache L2 Cache L1i Cache L1d Cache CPU Core 1 CPU Core 3 CPU Core 4 CPU Core 2 L1d Cache L1d Cache L1d Cache L1i Cache L1i Cache L1i Cache L2 Cache L2 CacheL2 Cache P-to-P MAIN SHARED MEMORY Processor 1 Processor 2 Processor 3 Processor 4 BUS 6 / 94
  • 12. Dynamic Multi-Threading Dynamic Multi-Threading In reality it can be difficult to handle multi-threaded programs in a SMP. Thus, we will assume a simple concurrency platform that handles all the resources: Schedules Memory Etc It is Called Dynamic Multi-threading. Dynamic Multi-Threading Computing Operations Spawn Sync Parallel 7 / 94
  • 22. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 8 / 94
  • 23. SPAWN SPAWN When spawn precedes a procedure call, the parent procedure may continue to execute in parallel with the spawned child. Note The keyword spawn does not guarantee concurrent execution; it only allows it. The scheduler decides which computations actually run concurrently. 9 / 94
  • 26. SYNC AND PARALLEL SYNC The keyword sync indicates that the procedure must wait for all its spawned children to complete. PARALLEL This operation applies to loops, and it makes it possible to execute the body of the loop in parallel. 10 / 94
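As a point of reference, the spawn/sync keywords of the pseudocode have no direct C++ equivalent, but a rough analogue can be sketched with std::async and std::future: launching an asynchronous task plays the role of spawn, and calling get() on the future plays the role of sync. The functions below are purely illustrative and not part of the slides.

#include <future>
#include <iostream>

// Hypothetical work functions, used only for illustration.
int heavy_task(int x) { return x * x; }
int other_task(int y) { return y + 1; }

int main() {
    // "spawn": the child may run in parallel with the parent.
    std::future<int> child = std::async(std::launch::async, heavy_task, 21);

    // The parent keeps computing while the child runs.
    int parent_result = other_task(20);

    // "sync": wait for the spawned child before using its result.
    int child_result = child.get();

    std::cout << parent_result + child_result << '\n';
    return 0;
}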
  • 28. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 11 / 94
  • 29. A Classic Parallel Piece of Code: Fibonacci Numbers Fibonacci’s Definition F0 = 0 F1 = 1 Fi = Fi−1 + Fi−2 for i > 1. Naive Algorithm Fibonacci(n) 1 if n ≤ 1 then 2 return n 3 else x = Fibonacci(n − 1) 4 y = Fibonacci(n − 2) 5 return x + y 12 / 94
  • 31. Time Complexity Recursion and Complexity Recursion T(n) = T(n − 1) + T(n − 2) + Θ(1). Complexity T(n) = Θ(Fn) = Θ(φ^n), with φ = (1 + √5)/2. 13 / 94
  • 33. There is a Better Way We can arrange the first three numbers in the sequence as a matrix: [ F2 F1 ; F1 F0 ] = [ 1 1 ; 1 0 ]. Then [ F2 F1 ; F1 F0 ] [ F2 F1 ; F1 F0 ] = [ 1 1 ; 1 0 ] [ 1 1 ; 1 0 ] = [ 2 1 ; 1 1 ] = [ F3 F2 ; F2 F1 ]. 14 / 94
  • 35. There is a Better Way Calculating in O(log n) when n is a power of 2: [ 1 1 ; 1 0 ]^n = [ F(n+1) F(n) ; F(n) F(n−1) ]. Thus [ 1 1 ; 1 0 ]^(n/2) [ 1 1 ; 1 0 ]^(n/2) = [ F(n/2+1) F(n/2) ; F(n/2) F(n/2−1) ] [ F(n/2+1) F(n/2) ; F(n/2) F(n/2−1) ]. However... We will use the naive version to illustrate the principles of parallel programming. 15 / 94
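For completeness, a minimal sketch of the O(log n) matrix-power idea mentioned above (not part of the original slides; the repeated-squaring helper and its names are illustrative choices):

#include <array>
#include <cstdint>

using Mat = std::array<std::array<uint64_t, 2>, 2>;

Mat multiply(const Mat& a, const Mat& b) {
    Mat c{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Computes [[1,1],[1,0]]^n by repeated squaring: O(log n) matrix multiplications.
Mat matrix_power(Mat base, uint64_t n) {
    Mat result{{{1, 0}, {0, 1}}};   // identity matrix
    while (n > 0) {
        if (n & 1) result = multiply(result, base);
        base = multiply(base, base);
        n >>= 1;
    }
    return result;
}

uint64_t fib(uint64_t n) {
    if (n == 0) return 0;
    // The (0,1) entry of [[1,1],[1,0]]^n is F(n).
    return matrix_power(Mat{{{1, 1}, {1, 0}}}, n)[0][1];
}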
  • 37. The Concurrent Code Parallel Algorithm PFibonacci(n) 1 if n ≤ 1 then 2 return n 3 else x = spawn PFibonacci(n − 1) 4 y = PFibonacci(n − 2) 5 sync 6 return x + y 16 / 94
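A rough C++ translation of this pattern (a sketch only; std::async stands in for spawn/sync, and the serial cutoff is a practical detail that is not in the pseudocode):

#include <future>

// Naive parallel Fibonacci in the spawn/sync style.
long pfib(int n) {
    if (n <= 1) return n;
    if (n < 20) return pfib(n - 1) + pfib(n - 2);          // serial cutoff, purely practical
    // "spawn": compute pfib(n - 1), possibly in parallel with the parent.
    auto x = std::async(std::launch::async, pfib, n - 1);
    long y = pfib(n - 2);
    // "sync": wait for the spawned child before combining results.
    return x.get() + y;
}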
  • 40. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 17 / 94
  • 41. How do we compute a complexity? Computation DAG Definition A directed acyclic graph G = (V, E) where: The vertices V are sets of instructions. The edges E represent dependencies between sets of instructions, i.e., an edge (u, v) means u must execute before v. Notes A set of instructions without any parallel control is grouped into a strand. Thus, V represents a set of strands and E represents dependencies between strands induced by parallel control. A strand of maximal length will be called a thread. 18 / 94
  • 47. How do we compute a complexity? Computation DAG Thus If there is an edge between threads u and v, then they are said to be (logically) in series. If there is no edge, then they are said to be (logically) in parallel. 19 / 94
  • 49. Example: PFibonacci(4) Example [Figure: the computation DAG generated by PFibonacci(4), with nodes for PFibonacci(4), PFibonacci(3), two instances of PFibonacci(2), three of PFibonacci(1), and two of PFibonacci(0).] 20 / 94
  • 50. Edge Classification Continuation Edge A continuation edge (u, v) connects a thread u to its successor v within the same procedure instance. Spawned Edge When a thread u spawns a new thread v, then (u, v) is called a spawned edge. Call Edge Call edges represent normal procedure calls. Return Edge A return edge signals when a thread v returns to its calling procedure. 21 / 94
  • 54. Example: PFibonacci(4) The Different Edges [Figure: the same PFibonacci(4) computation DAG, with the initial and final threads marked and the edges labelled as spawn, continuation, call, and return edges.] 22 / 94
  • 55. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 23 / 94
  • 56. Performance Measures WORK The work of a multi-threaded computation is the total time to execute the entire computation on one processor: Work = Σ_{i∈I} Time(Thread_i). SPAN The span is the longest time to execute the strands along any path of the DAG. In a DAG in which each strand takes unit time, the span equals the number of vertices on a longest, or critical, path in the DAG. 24 / 94
  • 58. Example: PFibonacci(4) Example [Figure: the PFibonacci(4) computation DAG with its critical path highlighted.] 25 / 94
  • 59. Example Example In PFibonacci(4), we have 17 threads and 8 vertices on the longest path. We have that, assuming unit time: WORK = 17 time units, SPAN = 8 time units. Note The running time depends not only on work and span, but also on: Available Cores, Scheduler Policies. 26 / 94
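For intuition about how such numbers are obtained, here is a small sketch (mine, not from the slides) that computes the work as the number of unit-time strands and the span as the number of vertices on a longest path, assuming the DAG's vertices are already given in topological order:

#include <algorithm>
#include <utility>
#include <vector>

// dag[u] lists the strands that depend directly on strand u.
// Work = number of strands; span = vertices on a longest (critical) path.
// Assumes vertices 0..n-1 are numbered in topological order.
std::pair<int, int> work_and_span(const std::vector<std::vector<int>>& dag) {
    int n = static_cast<int>(dag.size());
    std::vector<int> longest(n, 1);          // longest path ending at each strand
    int span = 0;
    for (int u = 0; u < n; ++u) {
        span = std::max(span, longest[u]);
        for (int v : dag[u])
            longest[v] = std::max(longest[v], longest[u] + 1);
    }
    return {n, span};                        // {work, span}, in unit-time strands
}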
  • 68. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 27 / 94
  • 69. Running Time Classification Single Processor T1 running time on a single processor. Multiple Processors Tp running time on P processors. Unlimited Processors T∞ running time on unlimited processors, also called the span, if we run each strand on its own processor. 28 / 94
  • 72. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 29 / 94
  • 73. Work Law Definition In one step, an ideal parallel computer with P processors can do at most P units of work. Thus, in TP time it can perform at most P·TP work: P·TP ≥ T1 ⟹ TP ≥ T1/P. 30 / 94
  • 77. Span Law Definition A P-processor ideal parallel computer cannot run faster than a machine with an unlimited number of processors. However, a computer with an unlimited number of processors can emulate a P-processor machine by simply using P of its processors. Therefore, TP ≥ T∞. 31 / 94
  • 79. Work Calculations: Serial Serial Computations [Figure: two computations A and B composed in series.] Note Work: T1(A ∪ B) = T1(A) + T1(B). Span: T∞(A ∪ B) = T∞(A) + T∞(B). 32 / 94
  • 82. Work Calculations: Parallel Parallel Computations [Figure: two computations A and B composed in parallel.] Note Work: T1(A ∪ B) = T1(A) + T1(B). Span: T∞(A ∪ B) = max {T∞(A), T∞(B)}. 33 / 94
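A quick numeric illustration of these two composition rules (the numbers are made up for illustration, not taken from the slides): suppose $T_1(A) = 10$, $T_\infty(A) = 4$, $T_1(B) = 6$, $T_\infty(B) = 3$.

$\text{Serial: } T_1(A \cup B) = 10 + 6 = 16, \qquad T_\infty(A \cup B) = 4 + 3 = 7.$
$\text{Parallel: } T_1(A \cup B) = 10 + 6 = 16, \qquad T_\infty(A \cup B) = \max\{4, 3\} = 4.$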
  • 85. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 34 / 94
  • 86. Speedup and Parallelism Speed up The speedup of a computation on P processors is defined as T1/TP. Then, by the work law, T1/TP ≤ P. Thus, the speedup on P processors can be at most P. Notes Linear speedup when T1/TP = Θ(P). Perfect linear speedup when T1/TP = P. Parallelism The parallelism of a computation is defined as T1/T∞. Specifically, we are looking to have a lot of parallelism. This changes from algorithm to algorithm. 35 / 94
  • 93. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 36 / 94
  • 94. Greedy Scheduler Definition A greedy scheduler assigns as many strands to processors as possible in each time step. Note On P processors, if at least P strands are ready to execute during a time step, then we say that the step is a complete step. Otherwise, we say that it is an incomplete step. 37 / 94
  • 98. Greedy Scheduler Theorem and Corollaries Theorem 27.1 On an ideal parallel computer with P processors, a greedy scheduler executes a multi-threaded computation with work T1 and span T∞ in time TP ≤ T1/P + T∞. Corollary 27.2 The running time TP of any multi-threaded computation scheduled by a greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal. Corollary 27.3 Let TP be the running time of a multi-threaded computation produced by a greedy scheduler on an ideal parallel computer with P processors, and let T1 and T∞ be the work and span of the computation, respectively. Then, if P ≪ T1/T∞ (much less), we have TP ≈ T1/P, or equivalently, a speedup of approximately P. 38 / 94
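To get a feel for Corollary 27.3, a made-up numeric example (illustrative values only, not from the slides): suppose $T_1 = 10^6$, $T_\infty = 10^3$, and $P = 10$, so that $P \ll T_1 / T_\infty = 1000$.

$T_P \le \frac{T_1}{P} + T_\infty = \frac{10^6}{10} + 10^3 = 101{,}000 \approx \frac{T_1}{P} = 100{,}000,
\qquad \text{speedup } \frac{T_1}{T_P} \ge \frac{10^6}{101{,}000} \approx 9.9 \approx P.$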
  • 101. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 39 / 94
  • 102. Race Conditions Determinacy Race A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write. Example Race-Example() 1 x = 0 2 parallel for i = 1 to 3 do 3 x = x + 1 4 print x 40 / 94
  • 104. Example Determinacy Race Example [Figure: one interleaving of the three parallel increments of x, shown as a DAG together with a step-by-step table of x and the per-processor registers r1, r2, r3; because the reads and writes interleave, updates are lost and the final value of x printed is smaller than 3 (1 in the interleaving shown).] 41 / 94
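A concrete C++ illustration of the same race (a sketch, not the deck's code): incrementing a plain shared counter from several threads is a determinacy race, while declaring the counter std::atomic makes every increment indivisible, so no update is lost.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    int racy = 0;                    // plain int: x = x + 1 is a read/modify/write
    std::atomic<int> safe{0};        // atomic increments cannot be lost

    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i)
        workers.emplace_back([&] {
            racy = racy + 1;         // determinacy race: updates may be lost
            safe.fetch_add(1);       // always ends at 3
        });
    for (auto& t : workers) t.join();

    std::cout << "racy = " << racy << ", safe = " << safe.load() << '\n';
    return 0;
}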
  • 105. Example NOTE Although this is of great importance, it is beyond the scope of this class. For more about this topic, see: Maurice Herlihy and Nir Shavit, “The Art of Multiprocessor Programming,” Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. Andrew S. Tanenbaum, “Modern Operating Systems” (3rd ed.), Prentice Hall Press, Upper Saddle River, NJ, USA, 2007. 42 / 94
  • 109. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 43 / 94
  • 110. Example of Complexity: PFibonacci Complexity T∞(n) = max {T∞(n − 1), T∞(n − 2)} + Θ(1) Finally T∞(n) = T∞(n − 1) + Θ(1) = Θ(n) Parallelism T1(n)/T∞(n) = Θ(φ^n / n) 44 / 94
  • 113. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 45 / 94
  • 114. Matrix Multiplication Trick To multiply two n × n matrices, we perform 8 matrix multiplications of n/2 × n/2 matrices and one addition of n × n matrices. Idea A = [ A11 A12 ; A21 A22 ], B = [ B11 B12 ; B21 B22 ], C = [ C11 C12 ; C21 C22 ]. C = [ C11 C12 ; C21 C22 ] = [ A11 A12 ; A21 A22 ] [ B11 B12 ; B21 B22 ] = [ A11B11 A11B12 ; A21B11 A21B12 ] + [ A12B21 A12B22 ; A22B21 A22B22 ]. 46 / 94
  • 116. Any Idea to Parallelize the Code? What do you think? Did you notice the multiplications of sub-matrices? Then What? We have for example A11B11 and A12B21!!! We can do the following A11B11 + A12B21 47 / 94
  • 119. The use of recursion!!! As always, our friend!!! 48 / 94
  • 120. Pseudo-code of Matrix-Multiply Matrix-Multiply(C, A, B, n) // The result of A × B in C, with n a power of 2 for simplicity 1 if (n == 1) 2 C[1, 1] = A[1, 1] · B[1, 1] 3 else 4 allocate a temporary matrix T[1...n, 1...n] 5 partition A, B, C, T into n/2 × n/2 sub-matrices 6 spawn Matrix-Multiply(C11, A11, B11, n/2) 7 spawn Matrix-Multiply(C12, A11, B12, n/2) 8 spawn Matrix-Multiply(C21, A21, B11, n/2) 9 spawn Matrix-Multiply(C22, A21, B12, n/2) 10 spawn Matrix-Multiply(T11, A12, B21, n/2) 11 spawn Matrix-Multiply(T12, A12, B22, n/2) 12 spawn Matrix-Multiply(T21, A22, B21, n/2) 13 Matrix-Multiply(T22, A22, B22, n/2) 14 sync 15 Matrix-Add(C, T, n) 49 / 94
  • 125. Explanation Lines 1 - 2 Stop the recursion once the matrices are 1 × 1 and there are only two numbers to multiply. Line 4 Extra matrix T for storing the second term in [ A11B11 A11B12 ; A21B11 A21B12 ] + [ A12B21 A12B22 ; A22B21 A22B22 ]. Line 5 Do the desired partition!!! 50 / 94
  • 128. Explanation Lines 6 to 13 Calculate the products in [ A11B11 A11B12 ; A21B11 A21B12 ] + [ A12B21 A12B22 ; A22B21 A22B22 ] using recursion and parallel computations. Line 14 A barrier to wait until all the parallel computations are done!!! Line 15 Call Matrix-Add to add C and T. 51 / 94
  • 131. Matrix ADD Matrix Add Code Matrix-Add(C, T, n) // Add matrices C and T in-place to produce C = C + T 1 if (n == 1) 2 C[1, 1] = C[1, 1] + T[1, 1] 3 else 4 partition C and T into n/2 × n/2 sub-matrices 5 spawn Matrix-Add(C11, T11, n/2) 6 spawn Matrix-Add(C12, T12, n/2) 7 spawn Matrix-Add(C21, T21, n/2) 8 Matrix-Add(C22, T22, n/2) 9 sync 52 / 94
  • 135. Explanation Lines 1 - 2 Stop the recursion once the matrices are 1 × 1 and there are only two numbers to add. Line 4 Partition C = [ A11B11 A11B12 ; A21B11 A21B12 ] and T = [ A12B21 A12B22 ; A22B21 A22B22 ]. Lines 5 to 8 We do the sum C + T in parallel!!! 53 / 94
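A compact C++ sketch of this Matrix-Multiply / Matrix-Add pair (illustrative only: it assumes n is a power of two and row-major storage; the View structs, quad helper, serial cutoff, and all names are my choices, whereas the slides' pseudocode recurses all the way down to 1 × 1 matrices):

#include <future>
#include <vector>

struct View    { const double* a; int r, c, stride;
                 double  at(int i, int j) const { return a[(r + i) * stride + c + j]; } };
struct MutView { double* a; int r, c, stride;
                 double& at(int i, int j) const { return a[(r + i) * stride + c + j]; } };

static View    quad(View v,    int i, int j, int h) { return {v.a, v.r + i * h, v.c + j * h, v.stride}; }
static MutView quad(MutView v, int i, int j, int h) { return {v.a, v.r + i * h, v.c + j * h, v.stride}; }

// C = C + T, recursing on quadrants in parallel (the slides' Matrix-Add).
void matrix_add(MutView C, View T, int n) {
    if (n <= 64) {                                            // practical serial cutoff
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) C.at(i, j) += T.at(i, j);
        return;
    }
    int h = n / 2;
    auto a11 = std::async(std::launch::async, matrix_add, quad(C, 0, 0, h), quad(T, 0, 0, h), h);
    auto a12 = std::async(std::launch::async, matrix_add, quad(C, 0, 1, h), quad(T, 0, 1, h), h);
    auto a21 = std::async(std::launch::async, matrix_add, quad(C, 1, 0, h), quad(T, 1, 0, h), h);
    matrix_add(quad(C, 1, 1, h), quad(T, 1, 1, h), h);
    a11.get(); a12.get(); a21.get();                          // sync
}

// C = A * B: 8 half-size products (4 into C, 4 into a temporary T), then C = C + T.
void matrix_multiply(MutView C, View A, View B, int n) {
    if (n <= 64) {                                            // practical serial cutoff
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = 0.0;
                for (int k = 0; k < n; ++k) s += A.at(i, k) * B.at(k, j);
                C.at(i, j) = s;
            }
        return;
    }
    int h = n / 2;
    std::vector<double> temp(static_cast<size_t>(n) * n, 0.0);
    MutView T{temp.data(), 0, 0, n};
    std::vector<std::future<void>> spawned;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            // C_ij = A_i1 * B_1j and T_ij = A_i2 * B_2j (1-based indices as in the slides).
            spawned.push_back(std::async(std::launch::async, matrix_multiply,
                                         quad(C, i, j, h), quad(A, i, 0, h), quad(B, 0, j, h), h));
            spawned.push_back(std::async(std::launch::async, matrix_multiply,
                                         quad(T, i, j, h), quad(A, i, 1, h), quad(B, 1, j, h), h));
        }
    for (auto& f : spawned) f.get();                          // sync
    matrix_add(C, View{T.a, 0, 0, n}, n);
}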
  • 139. Calculating Complexity of Matrix Multiplication Work of Matrix Multiplication The work T1(n) of matrix multiplication satisfies the recurrence: T1(n) = 8T1(n/2) (the sequential products) + Θ(n²) (the sequential sum) = Θ(n³). 54 / 94
  • 140. Calculating Complexity of Matrix Multiplication Span of Matrix Multiplication T∞(n) = T∞(n/2) (the parallel product) + Θ(log n) (the parallel sum) = Θ(log² n). This is because: T∞(n/2): the multiplications of the n/2 × n/2 matrices all run at the same time thanks to parallelism. Θ(log n) is the span of the addition of the matrices (remember, we are using unlimited processors), which has a critical path of length log n. 55 / 94
  • 143. Collapsing the sum Parallel Sum [Figure: the parallel additions collapsing pairwise, level by level.] 56 / 94
  • 144. How much Parallelism? The Final Parallelism in this Algorithm is T1(n)/T∞(n) = Θ(n³/log² n) Quite A Lot!!! 57 / 94
  • 145. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Rises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 58 / 94
  • 146. Merge-Sort : The Serial Version We have Merge − Sort (A, p, r) Observation: Sort elements in A [p...r] 1 if (p < r) then 2 q = (p+r)/2 3 Merge − Sort (A, p, q) 4 Merge − Sort (A, q + 1, r) 5 Merge (A, p, q, r) 59 / 94
  • 147. Merge-Sort : The Parallel Version We have Merge − Sort (A, p, r) Observation: Sort elements in A [p...r] 1 if (p < r) then 2 q = (p+r)/2 3 spawn Merge − Sort (A, p, q) 4 Merge − Sort (A, q + 1, r) // Not necessary to spawn this 5 sync 6 Merge (A, p, q, r) 60 / 94
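A minimal C++ sketch of this structure (illustrative only; std::async plays the role of spawn, std::inplace_merge stands in for the serial Merge procedure, and the cutoff is a practical detail that is not in the pseudocode):

#include <algorithm>
#include <functional>
#include <future>
#include <vector>

// Parallel merge sort over A[p..r] (inclusive), mirroring the pseudocode above.
void pmerge_sort(std::vector<int>& A, int p, int r) {
    if (p >= r) return;
    if (r - p < 2048) {                       // serial cutoff, purely practical
        std::sort(A.begin() + p, A.begin() + r + 1);
        return;
    }
    int q = (p + r) / 2;
    auto left = std::async(std::launch::async, pmerge_sort, std::ref(A), p, q);  // "spawn"
    pmerge_sort(A, q + 1, r);
    left.get();                               // "sync"
    std::inplace_merge(A.begin() + p, A.begin() + q + 1, A.begin() + r + 1);     // serial merge
}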
  • 148. Calculating Complexity of This Simple Parallel Merge-Sort Work of Merge-Sort The work T1(n) of this parallel Merge-Sort satisfies the recurrence: T1(n) = Θ(1) if n = 1, and T1(n) = 2T1(n/2) + Θ(n) otherwise, so T1(n) = Θ(n log n) by Case 2 of the Master Theorem. Span T∞(n) = Θ(1) if n = 1, and T∞(n) = T∞(n/2) + Θ(n) otherwise. We get T∞(n/2) because the two recursive sorts run at the same time thanks to parallelism. Then T∞(n) = Θ(n) by Case 3 of the Master Theorem. 61 / 94
  • 152. How much Parallelism? The Final Parallelism in this Algorithm is T1(n)/T∞(n) = Θ(log n) NOT A Lot!!! 62 / 94
  • 153. Can we improve this? We have a problem We have a bottleneck!!! Where? Yes in the Merge part!!! We need to improve that bottleneck!!! 63 / 94
  • 155. Parallel Merge Example: Here, we use an intermediate array T 64 / 94
  • 156. Parallel Merge Step 1. Find x = T[q1], where q1 = ⌊(p1 + r1)/2⌋ is the midpoint of T[p1..r1] 65 / 94
  • 157. Parallel Merge Step 2. Use Binary Search in T[p2..r2] to find q2 66 / 94
  • 158. Then So that, if we insert x between T[q2 − 1] and T[q2], the resulting sequence remains sorted. [Figure: x placed between T[q2 − 1] and T[q2].] 67 / 94
  • 159. Binary Search It takes a key x and a sub-array T[p..r], and it does the following: 1 If T[p..r] is empty (r < p), then it returns the index p. 2 If x ≤ T[p], then it returns p. 3 If x > T[p], then it returns the largest index q in the range p < q ≤ r + 1 such that T[q − 1] < x. 68 / 94
  • 162. Binary Search Code BINARY-SEARCH(x, T, p, r) 1 low = p 2 high = max {p, r + 1} 3 while low < high 4 mid = ⌊(low + high)/2⌋ 5 if x ≤ T[mid] 6 high = mid 7 else low = mid + 1 8 return high 69 / 94
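For reference, a C++ sketch equivalent to this procedure (the names and index conventions are mine; the behaviour matches the three cases above, and it is essentially what std::lower_bound computes):

#include <algorithm>
#include <vector>

// Returns the smallest index q in [p, r + 1] such that every element of
// T[p..q-1] is strictly less than x (so x can be inserted at position q).
int binary_search(int x, const std::vector<int>& T, int p, int r) {
    int low = p;
    int high = std::max(p, r + 1);
    while (low < high) {
        int mid = (low + high) / 2;
        if (x <= T[mid])
            high = mid;
        else
            low = mid + 1;
    }
    return high;
}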
  • 167. Parallel Merge Step 3. Copy x in A [q3] where q3 = p3 + (q1 − p1) + (q2 − p2) 70 / 94
  • 168. Parallel Merge Step 4. Recursively merge T [p1..q1 − 1] and T [p2..q2 − 1] and place result into A [p3..q3 − 1] 71 / 94
  • 169. Parallel Merge Step 5. Recursively merge T [q1 + 1..r1] and T [q2..r2] and place result into A [q3 + 1..r3] 72 / 94
  • 170. The Final Code for Parallel Merge Par-Merge(T, p1, r1, p2, r2, A, p3) 1 n1 = r1 − p1 + 1, n2 = r2 − p2 + 1 2 if n1 < n2 3 exchange p1 ↔ p2, r1 ↔ r2, n1 ↔ n2 4 if (n1 == 0) 5 return 6 else 7 q1 = ⌊(p1 + r1)/2⌋ 8 q2 = BinarySearch(T[q1], T, p2, r2) 9 q3 = p3 + (q1 − p1) + (q2 − p2) 10 A[q3] = T[q1] 11 spawn Par-Merge(T, p1, q1 − 1, p2, q2 − 1, A, p3) 12 Par-Merge(T, q1 + 1, r1, q2, r2, A, q3 + 1) 13 sync 73 / 94
  • 176. Explanation Line 1 Obtain the lengths of the two arrays to be merged. Lines 2 - 3: If the second array is larger than the first, we exchange the variables so that we always work with the larger array first; in this way we ensure n1 ≥ n2. Line 4 If n1 == 0, we return: there is nothing to merge!!! 74 / 94
  • 179. Explanation Line 10 It copies T[q1] directly into A[q3]. Lines 11 and 12 They recurse, using nested parallelism, to merge the sub-arrays of elements less than and greater than x. Line 13 The sync is used to ensure that the subproblems have completed before the procedure returns. 75 / 94
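Putting the steps together, a C++ sketch of Par-Merge (illustrative only: it merges the sorted ranges T[p1..r1] and T[p2..r2] into A starting at p3, std::async stands in for spawn/sync, and std::lower_bound plays the role of the Binary-Search procedure above):

#include <algorithm>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Merge the sorted ranges T[p1..r1] and T[p2..r2] into A[p3..].
void par_merge(const std::vector<int>& T, int p1, int r1, int p2, int r2,
               std::vector<int>& A, int p3) {
    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    if (n1 < n2) {                            // make the first range the larger one
        std::swap(p1, p2); std::swap(r1, r2); std::swap(n1, n2);
    }
    if (n1 == 0) return;                      // both ranges empty: nothing to merge
    int q1 = (p1 + r1) / 2;
    // First index q2 in [p2, r2 + 1] with T[q2] >= T[q1] (the Binary-Search step).
    int q2 = static_cast<int>(std::lower_bound(T.begin() + p2, T.begin() + r2 + 1, T[q1])
                              - T.begin());
    int q3 = p3 + (q1 - p1) + (q2 - p2);
    A[q3] = T[q1];
    auto lower = std::async(std::launch::async, par_merge, std::cref(T),
                            p1, q1 - 1, p2, q2 - 1, std::ref(A), p3);   // "spawn"
    par_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1);
    lower.get();                              // "sync"
}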
  • 182. First the Span Complexity of Parallel Merge: T∞(n) Suppositions n = n1 + n2 What case should we study? Remember that the two recursive calls run in parallel, so the span is governed by the larger subproblem (plus the binary search). We notice then that, because of the exchange in lines 2 - 3, n2 ≤ n1. 76 / 94
  • 185. Span Complexity of Parallel Merge: T∞(n) Then 2n2 ≤ n1 + n2 = n ⟹ n2 ≤ n/2 Thus In the worst case, the recursive call in line 11 merges: ⌊n1/2⌋ elements of T[p1...r1] (remember we are halving the array at its midpoint) with all n2 elements of T[p2...r2]. 77 / 94
  • 189. Span Complexity of Parallel Merge: T∞(n) Thus, the number of elements involved in such a call is n1/2 + n2 ≤ n1/2 + n2/2 + n2/2 ≤ n1/2 + n2/2 + (n/2)/2 = (n1 + n2)/2 + n/4 ≤ n/2 + n/4 = 3n/4. 78 / 94
  • 193. Span Complexity of Parallel Merge: T∞(n) Knowing that the Binary Search takes Θ(log n), we get the span for parallel merge: T∞(n) = T∞(3n/4) + Θ(log n). This can be solved using Exercise 4.6-2 in Cormen's book: T∞(n) = Θ(log² n). 79 / 94
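As a brief sketch of the argument behind that exercise (not in the original slides; c stands for the constant hidden in the Θ (log n) term): unrolling the recursion gives

    \[
    T_\infty(n) \;\le\; \sum_{i=0}^{\log_{4/3} n} c\,\log\!\Big(\big(\tfrac{3}{4}\big)^{i} n\Big)
    \;\le\; c\,\big(\log_{4/3} n + 1\big)\log n \;=\; O(\log^{2} n),
    \]

and since the first half of the levels still work on subproblems of size at least √n, each of those Θ (log n) levels contributes Ω (log n), giving the matching Ω (log² n).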
  • 196. Calculating Work Complexity of Parallel Merge Ok!!! We need to calculate the WORK: T1 (n) = Θ (Something) Thus, we need to establish matching upper and lower bounds. 80 / 94
  • 198. Calculating Work Complexity of Parallel Merge Work of Parallel Merge The work T1 (n) of this Parallel Merge satisfies T1 (n) = Ω (n), because each of the n elements must be copied from array T to array A. What about the upper bound O? First notice that one recursive call can receive as few as n/4 elements; this happens exactly when the other call receives the worst case of n1/2 + n2 ≤ 3n/4 elements. In addition, each call pays the O (log n) work of the binary search. 81 / 94
  • 205. Calculating Work Complexity of Parallel Merge Then, for some α ∈ [1/4, 3/4], we have the following recursion for Parallel Merge when we have one processor: T1 (n) = T1 (αn) + T1 ((1 − α) n) [the two sub-merges] + Θ (log n) [the binary search] Remark: α varies at each level of the recursion!!! 82 / 94
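Not part of the original analysis, but a quick numeric sanity check of this recursion can be reassuring. The sketch below assumes the worst split α = 3/4 at every level and unit constants; the ratio T1 (n) / n stays bounded as n grows, which is what T1 (n) = Θ (n) predicts.

    import math

    def work(n):
        # T1(n) = T1(3n/4) + T1(n/4) + log n, taking the worst split at every level
        if n <= 1:
            return 1.0
        left = (3 * n) // 4                  # the larger piece
        return work(left) + work(n - left) + math.log2(n)

    for n in (10**3, 10**4, 10**5):
        print(n, round(work(n) / n, 2))      # the ratio stays bounded, i.e. the work is linear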
  • 208. Calculating Work Complexity of Parallel Merge Then, assume that T1 (n) ≤ c1 n − c2 log n for positive constants c1 and c2. Using c3 for the constant hidden in Θ (log n), we have:
      T1 (n) ≤ T1 (αn) + T1 ((1 − α) n) + c3 log n
             ≤ c1 αn − c2 log (αn) + c1 (1 − α) n − c2 log ((1 − α) n) + c3 log n
             = c1 n − c2 log (α(1 − α)) − 2c2 log n + c3 log n   (splitting terms)
             = c1 n − c2 (log n + log (α(1 − α))) − (c2 − c3) log n
             ≤ c1 n − (c2 − c3) log n,   because log n + log (α(1 − α)) > 0
  83 / 94
  • 213. Calculating Work Complexity of Parallel Merge Now, given that 0 < α(1 − α) < 1, we have log (α(1 − α)) < 0. Thus, by making n large enough, log n + log (α(1 − α)) > 0 (1) and then T1 (n) ≤ c1 n − (c2 − c3) log n 84 / 94
  • 216. Calculating Work Complexity of Parallel Merge Now, choosing c2 large enough that c2 − c3 ≥ 0, we have that T1 (n) ≤ c1 n = O(n) 85 / 94
  • 218. Finally Then T1 (n) = Θ (n). The parallelism of Parallel Merge: T1 (n) / T∞ (n) = Θ (n / log² n) 86 / 94
  • 220. Then What is the Complexity of Parallel Merge-Sort with Parallel Merge? First, the new code - Input: A [p..r] - Output: B [s..s + r − p]
  Par-Merge-Sort (A, p, r, B, s)
   1  n = r − p + 1
   2  if (n == 1)
   3      B [s] = A [p]
   4  else let T [1..n] be a new array
   5      q = ⌊(p + r)/2⌋
   6      q′ = q − p + 1
   7      spawn Par-Merge-Sort (A, p, q, T, 1)
   8      Par-Merge-Sort (A, q + 1, r, T, q′ + 1)
   9      sync
  10      Par-Merge (T, 1, q′, q′ + 1, n, B, s)
  87 / 94
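As with the merge, here is a minimal sequential Python sketch of this procedure (not from the slides). It reuses the p_merge sketch given earlier, uses 0-based inclusive indices, and simply calls the two recursive sorts one after the other where lines 7-9 would spawn and sync.

    def par_merge_sort(A, p, r, B, s):
        # Sort A[p..r] into B[s..s + r - p], mirroring the pseudocode above.
        n = r - p + 1
        if n == 1:
            B[s] = A[p]                          # line 3
        else:
            T = [None] * n                       # line 4: scratch array
            q = (p + r) // 2                     # line 5
            q_prime = q - p + 1                  # line 6: size of the left half
            par_merge_sort(A, p, q, T, 0)        # line 7 (spawn)
            par_merge_sort(A, q + 1, r, T, q_prime)  # line 8
            # line 9: sync
            p_merge(T, 0, q_prime - 1, q_prime, n - 1, B, s)  # line 10

    A = [5, 2, 9, 1, 7, 3]
    B = [None] * len(A)
    par_merge_sort(A, 0, len(A) - 1, B, 0)
    print(B)                                     # [1, 2, 3, 5, 7, 9]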
  • 226. Then, What is the amount of Parallelism of Parallel Merge-Sort with Parallel Merge? Work Using the Θ (n) work of Parallel Merge, we get the recursion: T1^PMS (n) = 2 T1^PMS (n/2) + T1^PM (n) = 2 T1^PMS (n/2) + Θ (n) = Θ (n log n) (Case 2 of the Master Theorem) 88 / 94
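As a quick check of that Master Theorem step (standard notation, with a = 2 and b = 2):

    \[
    n^{\log_b a} = n^{\log_2 2} = n,
    \qquad
    f(n) = T^{PM}_1(n) = \Theta(n) = \Theta\!\big(n^{\log_b a}\big),
    \]

so Case 2 applies and T1^PMS (n) = Θ (n^{log_b a} · log n) = Θ (n log n).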
  • 230. Then, What is the amount of Parallelism of Parallel Merge-Sort with Parallel Merge? Span We get the following recursion for the span by taking into account that lines 7 and 8 of Parallel Merge-Sort run in parallel: T∞^PMS (n) = T∞^PMS (n/2) + T∞^PM (n) = T∞^PMS (n/2) + Θ (log² n) = Θ (log³ n) (Exercise 4.6-2 in Cormen's book) 89 / 94
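A short sketch of how that recurrence is solved (again the idea behind the cited exercise): unrolling it level by level and substituting j = lg n − i,

    \[
    T^{PMS}_\infty(n) \;=\; \sum_{i=0}^{\lg n} \Theta\!\big(\log^{2}(n/2^{i})\big)
    \;=\; \Theta\!\Big(\sum_{j=0}^{\lg n} j^{2}\Big)
    \;=\; \Theta(\log^{3} n).
    \]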
  • 234. Then, What is the amount of Parallelism of Parallel Merge-Sort with Parallel Merge? Parallelism T1^PMS (n) / T∞^PMS (n) = Θ (n log n) / Θ (log³ n) = Θ (n / log² n) 90 / 94
  • 235. Plotting both Parallelisms We get the incredible difference between both algorithms 91 / 94
  • 236. Plotting the T∞ We get the incredible difference when running both algorithms with an infinite number of processors!!! 92 / 94
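Assuming, as the rest of the section suggests, that the comparison on these two slides is between merge sort with the ordinary serial merge (parallelism Θ (log n), span Θ (n)) and Parallel Merge-Sort with Parallel Merge (parallelism Θ (n / log² n), span Θ (log³ n)), a matplotlib sketch of this kind of plot would be:

    import numpy as np
    import matplotlib.pyplot as plt

    n = np.logspace(1, 6, 200)                     # problem sizes from 10 to 10^6
    lg = np.log2(n)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # parallelism T1 / T-infinity of both versions (hidden constants dropped)
    ax1.loglog(n, lg, label="serial merge: log n")
    ax1.loglog(n, n / lg**2, label="parallel merge: n / log^2 n")
    ax1.set_xlabel("n"); ax1.set_ylabel("parallelism"); ax1.legend()

    # span T-infinity of both versions
    ax2.loglog(n, n, label="serial merge: n")
    ax2.loglog(n, lg**3, label="parallel merge: log^3 n")
    ax2.set_xlabel("n"); ax2.set_ylabel("span"); ax2.legend()

    plt.tight_layout()
    plt.show()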
  • 237. Outline 1 Introduction Why Multi-Threaded Algorithms? 2 Model To Be Used Symmetric Multiprocessor Operations Example 3 Computation DAG Introduction 4 Performance Measures Introduction Running Time Classification 5 Parallel Laws Work and Span Laws Speedup and Parallelism Greedy Scheduler Scheduling Raises the Following Issue 6 Examples Parallel Fibonacci Matrix Multiplication Parallel Merge-Sort 7 Exercises Some Exercises you can try!!! 93 / 94