By
Dr. Heman Pathak
Associate Professor
KGC - Dehradun
CHAPTER-6: Elementary Parallel Algorithms
Parallel Computing: Theory and Practice
By Michael J. Quinn
CLASSIFYING PARALLEL ALGORITHMS
(Classification tree)
Parallel Algorithms
• Data Parallelism: SIMD or MIMD
• Control Parallelism: MIMD (pipelined or asynchronous)
Data Parallelism in MIMD
Pre-scheduled data-parallel algorithm
• The number of data items per functional unit is determined before any of the data items are processed.
• Prescheduling is commonly used when the time needed to process each data item is identical, or when the ratio of data items to functional units is high.
Self-scheduled data-parallel algorithm
• Data items are not assigned to functional units until run time.
• A global list of work to be done is kept; when a functional unit is without work, another task (or small set of tasks) is removed from the list and assigned to it.
• Processes schedule themselves as the program executes, hence the name self-scheduled.
Control Parallelism
• Control parallelism is achieved through the simultaneous
application of different operations to different data elements.
• The flow of data among these processes can be arbitrarily
complex.
• If the data-flow graph forms a simple directed path, then we say
the algorithm is pipelined.
• We will use the term asynchronous algorithm to refer to any
control-parallel algorithm that is not pipelined.
REDUCTION
Design Strategy 1
• If a cost-optimal CREW PRAM algorithm exists and the way the PRAM processors interact through shared variables maps onto the target architecture, a PRAM algorithm is a reasonable starting point.
 Consider the problem of performing a reduction operation on a set of n values, where n is much larger than p, the number of available processors.
 The objective is to develop a parallel algorithm that introduces the minimum number of extra operations compared with the best sequential algorithm.
REDUCTION - Where n is much larger than p
Summation is the reduction operation
• A cost-optimal PRAM algorithm for the global sum exists:
▫ n / log n processors can add n numbers in Θ(log n) time.
• The same principle can be used to develop good parallel algorithms for real SIMD and MIMD computers, even if p << n / log n.
• ⌈n/p⌉ or ⌊n/p⌋ values are allocated to each processor.
• In the first phase of the parallel algorithm each processor adds its own set of values, resulting in p partial sums.
• In the second phase the partial sums are combined into the global sum.
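A minimal serial sketch of this two-phase scheme (not from Quinn's text; the array contents and the cyclic allocation are illustrative, chosen to match the worked UMA example later in the deck):

```c
/* Two-phase reduction sketch (illustrative, serial simulation):
 * phase 1 - each of p "processors" sums its own block of roughly n/p values;
 * phase 2 - the p partial sums are combined into the global sum. */
#include <stdio.h>

#define N 16
#define P 4

int main(void) {
    int a[N] = {6, -4, 19, 2, -9, 0, 3, -5, 10, -3, -8, 1, 7, -2, 4, 5};
    int partial[P];

    /* Phase 1: processor i handles elements i, i+P, i+2P, ... (cyclic allocation) */
    for (int i = 0; i < P; i++) {
        partial[i] = 0;
        for (int j = i; j < N; j += P)
            partial[i] += a[j];
    }

    /* Phase 2: combine the p partial sums into the global sum */
    int global = 0;
    for (int i = 0; i < P; i++)
        global += partial[i];

    printf("global sum = %d\n", global);   /* prints 26 for this data */
    return 0;
}
```

Phase 2 is written as a plain serial loop here; the following slides show how real SIMD and MIMD machines combine the p partial sums in log p steps.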
REDUCTION - Where n is much larger than p
Summation is the reduction operation
• It is important to check to make sure that the constant of proportionality
associated with the cost of the PRAM algorithm is not significantly higher
than the constant of proportionality associated with an optimal sequential
algorithm.
• Make sure that the total number of operations performed by all the
processors executing the PRAM algorithm is about the same as the total
number of operations performed by a single processor executing the best
sequential algorithm.
Hypercube SIMD Model
(Figure: a three-dimensional hypercube processor array with nodes numbered 0-7.)
Sum of n Numbers: Hypercube
• If the PRAM processor interaction pattern forms a graph that embeds with
dilation-1 in a target SIMD architecture, then there is a natural translation
from the PRAM algorithm to the SIMD algorithm.
• The processors in the PRAM summation algorithm combine values in a
binomial tree pattern.
• A dilation-1 embedding of a binomial tree in a hypercube is possible.
• The hypercube processor array version follows directly from the PRAM algorithm; the only significant difference is that the hypercube processor array model has no shared memory, so processors interact by passing data.
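A hedged sketch of that translation, written as a serial simulation of the processor array (no real message passing; variable names are illustrative):

```c
/* Binomial-tree combining on a hypercube, simulated serially.
 * In pass i, a processor whose low i bits are zero and whose bit i is 1
 * sends its running sum to the partner obtained by clearing bit i, and
 * the partner adds it in.  After log2(P) passes processor 0 holds the
 * global sum. */
#include <stdio.h>

#define P 8          /* number of (simulated) hypercube processors */
#define LOG_P 3

int main(void) {
    int sum[P];
    for (int id = 0; id < P; id++)
        sum[id] = id + 1;                      /* stand-in local partial sums */

    for (int i = 0; i < LOG_P; i++) {          /* one pass per dimension */
        for (int id = 0; id < P; id++) {
            int partner = id ^ (1 << i);       /* neighbor across dimension i */
            if ((id & (1 << i)) && !(id & ((1 << i) - 1))) {
                /* this processor is a sender in pass i; its partner accumulates */
                sum[partner] += sum[id];
            }
        }
    }
    printf("processor 0 holds %d (expected %d)\n", sum[0], P * (P + 1) / 2);
    return 0;
}
```

In pass i only processors whose low i bits are zero take part, which is exactly the binomial-tree combining pattern analysed on the next slides.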
Sum of n Numbers: Hypercube
• Each processing element adds at most ⌈n/p⌉ values to find its local sum in Θ(n/p) time.
• Processor 0, which iterates the second inner for loop more than any other processor, performs log p communication steps and log p addition steps.
• The complexity of finding the sum of n values is Θ(n/p + log p) using the hypercube processor array model with p processors.
Sum of n Numbers: Hypercube
Giving every processing element a copy of the global sum
• Option 1: Add a broadcast phase to the end of the algorithm. Once processing element 0 has the global sum, the value can be transmitted to the other processors in log p communication steps by reversing the direction of the edges in the binomial tree.
• Option 2: Each processing element swaps values across every dimension of the hypercube. After log p swap-and-accumulate steps, every processing element has the global sum.
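A hedged sketch of the second option (serial simulation, illustrative names):

```c
/* Swap-and-accumulate across every dimension, simulated serially.
 * In pass i every processor exchanges its running sum with the neighbor
 * across dimension i and adds the received value; after log2(P) passes
 * every processor holds the global sum. */
#include <stdio.h>

#define P 8
#define LOG_P 3

int main(void) {
    int sum[P], recv[P];
    for (int id = 0; id < P; id++)
        sum[id] = id + 1;                    /* stand-in local partial sums */

    for (int i = 0; i < LOG_P; i++) {
        for (int id = 0; id < P; id++)       /* everyone "sends" its sum ... */
            recv[id ^ (1 << i)] = sum[id];
        for (int id = 0; id < P; id++)       /* ... and adds what it received */
            sum[id] += recv[id];
    }
    for (int id = 0; id < P; id++)
        printf("processor %d holds %d\n", id, sum[id]);   /* all print 36 */
    return 0;
}
```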
Shuffle Exchange SIMD Model
Sum of n Numbers: Shuffle Exchange
 If the PRAM processor interaction pattern does not form a graph that embeds in the target SIMD architecture, then the translation is not straightforward, but there may still be an efficient SIMD algorithm.
 There is no dilation-1 embedding of a binomial tree in a shuffle-exchange network.
 If the sums are combined in pairs, a logarithmic number of combining steps can find the grand total.
 Two data routings - a shuffle followed by an exchange - on the shuffle-exchange model are sufficient to bring together two subtotals.
 After log p shuffle-exchange steps, processor 0 has the grand total.
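One common formulation of this combining pattern, sketched as a serial simulation (illustrative names; the masking details of the actual SIMD listing are omitted):

```c
/* Shuffle-exchange summation sketch.  In each of the log2(P) steps every
 * PE first sends its running sum along its shuffle link (a cyclic left
 * rotation of the PE index), then exchange partners (indices differing in
 * the lowest bit) add their values. */
#include <stdio.h>

#define P 8
#define LOG_P 3

/* cyclic left rotation of a LOG_P-bit index: the "perfect shuffle" */
static int shuffle(int id) {
    return ((id << 1) | (id >> (LOG_P - 1))) & (P - 1);
}

int main(void) {
    int sum[P], tmp[P];
    for (int id = 0; id < P; id++)
        sum[id] = id + 1;                /* stand-in local partial sums */

    for (int step = 0; step < LOG_P; step++) {
        for (int id = 0; id < P; id++)   /* shuffle routing */
            tmp[shuffle(id)] = sum[id];
        for (int id = 0; id < P; id++)   /* exchange-and-add */
            sum[id] = tmp[id] + tmp[id ^ 1];
    }
    printf("PE 0 holds %d (expected %d)\n", sum[0], P * (P + 1) / 2);
    return 0;
}
```

In this symmetric formulation every PE ends up with the total, so in particular PE 0 holds the grand total, as the slide requires.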
Sum of n Numbers: Shuffle Exchange
• At the termination of this algorithm, the variable sum on processing element 0 holds the global sum.
• Every processing element spends Θ(n/p) time computing its local sum.
• Since there are log p iterations of the shuffle-exchange-add loop and every iteration takes constant time, the parallel algorithm has complexity Θ(n/p + log p).
2-D MESH SIMD Model
Sum of n Numbers: 2D-MESH
• No dilation-1 embedding exists for a balanced binary tree or a binomial tree in a mesh.
• Establish a lower bound on the complexity of any parallel algorithm to be used on a particular topology. Once the lower bound is established, there is no reason to search for a solution of lower complexity.
• In order to find the sum of n values spread evenly among p processors organized in a √p × √p mesh, at least one of the processors in the mesh must eventually contain the grand sum.
Sum of n Numbers: 2D-MESH
• The total number of communication steps to get the subtotals from the corner processors must be at least 2(√p − 1), assuming that during any time unit only communications in a single direction are allowed.
• Since the algorithm has at least 2(√p − 1) communication steps, the time complexity of the parallel algorithm is at least Θ(n/p + √p).
• There is no point in looking for a parallel algorithm for this model that uses only Θ(log p) communication steps.
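The bound from the two bullets above, written out compactly (a sketch; it assumes a √p × √p mesh and one communication direction per time step):

```latex
\[
T(n,p)\;\ge\;\underbrace{\Theta\!\left(\tfrac{n}{p}\right)}_{\text{local additions}}
\;+\;\underbrace{2\!\left(\sqrt{p}-1\right)}_{\text{corner-to-corner routing}}
\;=\;\Theta\!\left(\tfrac{n}{p}+\sqrt{p}\right).
\]
```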
UMA Multiprocessor Model
Sum of n Numbers: UMA
 Unlike the PRAM model, processors execute instructions
asynchronously.
 For that reason we must ensure that no processor accesses a
variable containing a partial sum until that variable has been set.
 Each element of array flags begins with the value 0.
 When the value is set to 1, the corresponding element of array
mono has a partial sum in it.
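A minimal sketch of such a flag-synchronised fan-in using POSIX threads (this is not Quinn's listing; the names a, partial and flags are illustrative, and the array the slide calls mono presumably plays the role of partial here):

```c
/* Flag-synchronised fan-in sketch.  Each thread sums its own block, then
 * the partial sums are combined in a binary-tree (fan-in) pattern;
 * flags[i] == 1 signals that partial[i] is ready to be consumed. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 16
#define P 4

static int a[N] = {6, -4, 19, 2, -9, 0, 3, -5, 10, -3, -8, 1, 7, -2, 4, 5};
static int partial[P];
static atomic_int flags[P];

static void *worker(void *arg) {
    int id = (int)(long)arg;

    /* Phase 1: local sum over a cyclically allocated block */
    int s = 0;
    for (int j = id; j < N; j += P)
        s += a[j];
    partial[id] = s;

    /* Phase 2: fan-in.  At each level, thread id < stride consumes the
     * partial sum of thread id + stride; the other threads retire after
     * publishing their result via flags. */
    for (int stride = P / 2; stride >= 1; stride /= 2) {
        if (id >= stride && id < 2 * stride) {
            atomic_store(&flags[id], 1);      /* my partial sum is final */
            break;
        }
        if (id < stride) {
            while (atomic_load(&flags[id + stride]) == 0)
                ;                             /* spin until partner is ready */
            partial[id] += partial[id + stride];
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int i = 0; i < P; i++)
        atomic_init(&flags[i], 0);
    for (long i = 0; i < P; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    printf("global sum = %d\n", partial[0]);  /* 26 for this data */
    return 0;
}
```

The fan-in above reproduces the combining pattern of the worked example that follows.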
Sum of n Numbers: UMA (n = 16, p = 4)
Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Value: 6 -4 19  2 -9  0  3 -5 10 -3 -8  1  7 -2  4  5

Local sums (each processor takes every fourth element):
P0: indices 0, 4, 8, 12  -> values 6, -9, 10, 7  -> sum 14
P1: indices 1, 5, 9, 13  -> values -4, 0, -3, -2 -> sum -9
P2: indices 2, 6, 10, 14 -> values 19, 3, -8, 4  -> sum 18
P3: indices 3, 7, 11, 15 -> values 2, -5, 1, 5   -> sum 3

Fan-in:
Step 1: P0 = 14 + 18 = 32,  P1 = -9 + 3 = -6   (P2 and P3 retire)
Step 2: P0 = 32 + (-6) = 26 -> global sum = 26
Worst-case Time Complexity
• If the initial process creates the p-1 other processes all by itself, the time complexity of process creation is Θ(p).
• In practice we do not count this cost, since processes are created only once, at the beginning of the program, and most algorithms we analyze form subroutines of larger applications.
• Sequentially initializing array flags has time complexity Θ(p).
• Each process finds the sum of n/p values. If we assume that memory-bank conflicts do not increase the complexity by more than a constant factor, the complexity of this section of code is Θ(n/p).
Worst-case Time Complexity
• The while loop executes log p times.
• Each iteration of the while loop has time complexity Θ(1), so the total complexity of the while loop is Θ(log p).
• Synchronization among all the processes occurs at the final endfor.
• The complexity of this synchronization is Θ(p).
• The overall complexity of the algorithm is
• Θ(p + n/p + log p + p) = Θ(n/p + p).
Sum of n Numbers: UMA
 Since the time complexity of the parallel algorithm is Θ(n/p + p), why bother with a complicated fan-in-style parallel addition?
 It is simpler to compute the global sum from the local sums by having each process enter a critical section where its local sum is added to the global sum.
 The resulting algorithm also has time complexity Θ(n/p + p).
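A minimal sketch of the critical-section variant with POSIX threads (illustrative names; the mutex stands in for whatever lock primitive the target system provides):

```c
/* Critical-section reduction sketch: each thread computes its local sum
 * and then adds it to a shared global sum inside a mutex-protected
 * critical section. */
#include <pthread.h>
#include <stdio.h>

#define N 16
#define P 4

static int a[N] = {6, -4, 19, 2, -9, 0, 3, -5, 10, -3, -8, 1, 7, -2, 4, 5};
static int global_sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = (int)(long)arg;

    int local = 0;
    for (int j = id; j < N; j += P)      /* local sum of this thread's block */
        local += a[j];

    pthread_mutex_lock(&lock);           /* critical section: one thread at a time */
    global_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (long i = 0; i < P; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    printf("global sum = %d\n", global_sum);   /* 26 for this data */
    return 0;
}
```

Because the p additions to global_sum are serialized by the lock, that phase alone costs Θ(p), which is why the overall bound remains Θ(n/p + p).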
Sum of n Numbers: UMA
 Both algorithms have been implemented on the Sequent Balance™ (a UMA multiprocessor) using Sequent C.
 The figure compares the execution times of the two reduction steps as a function of the number of active processes.
 The original fan-in-style algorithm is uniformly superior to the critical-section-style algorithm.
 The constant of proportionality associated with the Θ(p) term is smaller in the first program.
Sum of n Numbers: UMA
Design Strategy 2: Look for a data-parallel algorithm before considering a control-parallel algorithm.
• The only parallelism we can exploit on a SIMD computer is data parallelism.
• On MIMD computers, however, we can look for ways to exploit both data parallelism and control parallelism.
• Data-parallel algorithms are more common, easier to design and debug, and better able to scale to large numbers of processors than control-parallel algorithms.
• For this reason a data-parallel solution should be sought first, and a control-parallel implementation considered a last resort.
• When we write in data-parallel style on a MIMD machine, the result is an SPMD (Single Program, Multiple Data) program. In general, SPMD programs are easier to write and debug than arbitrary MIMD programs.
In Multicomputers
BROADCAST - in Multicomputers
• Consider one processor broadcasting a list of values to all other processors on a hypercube multicomputer.
• The execution time of the implemented algorithm has two primary components:
▫ the time needed to initiate the messages, and
▫ the time needed to perform the data transfers.
• Message start-up time is called message-passing overhead or message latency.
BROADCAST - in Multicomputers
 If the amount of data to be broadcast is small, the message-
passing overhead time dominates the data-transfer time.
 The best algorithm is the one that minimizes the number of
communications performed by any processor.
 The binomial tree is a suitable broadcast pattern because there
is a dilation-1 embedding of a binomial tree into a hypercube.
 The resulting algorithm requires only log p communication
steps.
BROADCAST - in Multicomputers
P = 8, Source = 000
Id    Position   Partner (i=0)   Partner (i=1)   Partner (i=2)
000   000        001             010             100
001   001        -               011             101
010   010        -               -               110
011   011        -               -               111
100   100        -               -               -
101   101        -               -               -
110   110        -               -               -
111   111        -               -               -
(A node lists a partner in step i only if it already holds the value; "-" means it has nothing to send yet.)
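A hedged sketch of this broadcast pattern as a serial simulation (illustrative names; source node and data value chosen to match the table):

```c
/* Binomial-tree broadcast sketch.  In pass i, every node that already
 * holds the value -- i.e. whose position (id XOR source) fits in i bits --
 * sends it to the partner across dimension i.  After log2(P) passes every
 * node holds the value; each node sends at most log2(P) times. */
#include <stdio.h>

#define P 8
#define LOG_P 3

int main(void) {
    int source = 0;                        /* node 000, as in the table */
    int value[P] = {0};
    value[source] = 42;                    /* stand-in for the broadcast data */

    for (int i = 0; i < LOG_P; i++) {      /* one pass per hypercube dimension */
        for (int id = 0; id < P; id++) {
            if ((id ^ source) < (1 << i) || id == source) {
                value[id ^ (1 << i)] = value[id];   /* send across dimension i */
            }
        }
    }
    for (int id = 0; id < P; id++)
        printf("node %d holds %d\n", id, value[id]);   /* every node prints 42 */
    return 0;
}
```

This matches the table above: in pass i only the nodes whose Position fits in i bits list a partner.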
BROADCAST - in Multicomputers
 If the amount of data to be broadcast is large, the data-transfer time dominates the message-passing overhead.
 Under these circumstances the binomial-tree-based algorithm has a serious weakness:
 at any one time no more than p/2 out of p log p
communication links are in use.
 If the time needed to pass the message from one processor
to another is M, then the broadcast algorithm requires time
M log p.
BROADCAST - in Multicomputers
 Johnsson and Ho (1989) have designed a broadcast algorithm that executes up to log p times faster than the binomial-tree algorithm.
 Their algorithm relies upon the fact that every hypercube contains log p edge-disjoint spanning trees with the same root node.
 The algorithm breaks the message into log p parts and broadcasts each part to the other nodes through a different binomial spanning tree.
 Because the spanning trees have no edges in common, all data flows concurrently, and the entire algorithm executes in approximately M log p / log p = M time.
BROADCAST - in Multicomputers
Design Strategy 3: As problem size grows, use the
algorithm that makes best use of the available resources.
In the case of broadcasting large
data sets on a hypercube
multicomputer, the most
constrained resource is the
network capacity.
Johnsson and Ho's algorithm
makes better use of this resource
than the binomial tree broadcast
algorithm and, as a result,
achieves higher performance.
Hypercube Multicomputers
Prefix Sums on Hypercube Multicomputers
Prefix Sums on Hypercube Multicomputers
 The problem: given an array A of n values and an associative operation ⊕, compute the n prefix sums of A.
 The cost-optimal PRAM algorithm requires n / log n processors to solve the problem in Θ(log n) time.
 In order to achieve cost optimality, each processor uses the best sequential algorithm to manipulate its own set of n/p elements of A.
 The same strategy is used to design an efficient multicomputer algorithm, where p << n / log n.
Prefix Sums on Hypercube Multicomputers
Design Strategy 4:
Let each processor perform the most efficient
sequential algorithm on its share of the data.
Prefix Sums on Hypercube Multicomputers
There are n elements and p processors; assume n is an integer multiple of p.
The elements of A are distributed evenly among the local memories of the p processors.
During step one each processor finds the sum of its n/p elements.
In step two the processors cooperate to find the p prefix sums of their local sums.
During step three each processor computes the prefix sums of its own n/p values, offset by the sum of the values held in lower-numbered processors (computed in step two).
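A minimal serial sketch of these three steps (illustrative names and data; step two is shown here as a plain exclusive scan of the local sums, while its hypercube realisation is discussed on the later slides):

```c
/* Three-step prefix-sums sketch with blockwise data distribution. */
#include <stdio.h>

#define N 16
#define P 4

int main(void) {
    int A[N] = {6, -4, 19, 2, -9, 0, 3, -5, 10, -3, -8, 1, 7, -2, 4, 5};
    int local_sum[P], offset[P];

    /* Step 1: each processor sums its block of n/p consecutive elements */
    for (int i = 0; i < P; i++) {
        local_sum[i] = 0;
        for (int j = 0; j < N / P; j++)
            local_sum[i] += A[i * (N / P) + j];
    }

    /* Step 2: exclusive prefix sums of the local sums
     * (offset[i] = sum of all lower-numbered processors' elements) */
    offset[0] = 0;
    for (int i = 1; i < P; i++)
        offset[i] = offset[i - 1] + local_sum[i - 1];

    /* Step 3: each processor scans its own block, starting from its offset */
    for (int i = 0; i < P; i++) {
        int running = offset[i];
        for (int j = 0; j < N / P; j++) {
            running += A[i * (N / P) + j];
            A[i * (N / P) + j] = running;       /* overwrite with prefix sum */
        }
    }

    for (int j = 0; j < N; j++)
        printf("%d ", A[j]);                    /* the n prefix sums */
    printf("\n");
    return 0;
}
```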
Prefix Sums on Hypercube Multicomputers
 The communication time required by
step two depends upon the
multicomputer's topology.
 The memory access pattern of the
PRAM algorithm does not directly
translate into a communication
pattern having a dilation-1 embedding
in a hypercube.
 For this reason we should look for a
better method of computing the prefix
sums.
Prefix Sums on Hypercube Multicomputers
 Finding prefix sums is similar to performing a reduction, except, for each element in
the list, we are only interested in values from prior elements.
 We can modify the hypercube reduction algorithm to perform prefix sums.
 As in the reduction algorithm, every processor swaps values across each dimension
of the hypercube. However, the processor maintains two variables containing totals.
 The first variable contains the total of all values received.
 The second variable contains the total of all values received from smaller-numbered
processors.
 At the end of log p swap-and-add steps, the second variable associated with each
processor contains the prefix sum for that processor.
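A hedged sketch of this modified swap-and-add, again as a serial simulation (illustrative names; the test `(id ^ (1 << d)) < id` detects a lower-numbered partner):

```c
/* Swap-and-add prefix computation sketch.  Each "processor" keeps two
 * running variables:
 * total  - sum of all values it has seen (its own plus everything received),
 * prefix - sum of the values received from smaller-numbered processors. */
#include <stdio.h>

#define P 8
#define LOG_P 3

int main(void) {
    int total[P], prefix[P], incoming[P];

    for (int id = 0; id < P; id++) {
        total[id] = id + 1;        /* stand-in local sums from step one */
        prefix[id] = 0;
    }

    for (int d = 0; d < LOG_P; d++) {
        for (int id = 0; id < P; id++)               /* exchange totals ... */
            incoming[id ^ (1 << d)] = total[id];
        for (int id = 0; id < P; id++) {             /* ... then accumulate */
            total[id] += incoming[id];
            if ((id ^ (1 << d)) < id)                /* partner is lower-numbered */
                prefix[id] += incoming[id];
        }
    }

    for (int id = 0; id < P; id++)
        printf("processor %d: offset %2d, inclusive prefix %2d\n",
               id, prefix[id], prefix[id] + (id + 1));
    return 0;
}
```

Here the second variable ends up holding the sum of all lower-numbered processors' values, i.e. exactly the offset each processor needs at the start of step three.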
Prefix Sums on Hypercube Multicomputers
 χ : the time needed to perform the ⊕ operation
 λ : the time needed to initiate a message
 β : the message transmission time per value
For example, sending a k-element message from one processor to another requires time λ + kβ.
Prefix Sums on Hypercube Multicomputers
• During step one each processor finds the sum of its n/p values in (n/p − 1)χ time units.
• During step three processor 0 computes the prefix sums of its n/p values in (n/p − 1)χ time units.
• Processors 1 through p − 1 must add the sum of the lower-numbered processors' values to the first element on their lists before computing the prefix sums.
• These processors perform step three in (n/p)χ time units.
Prefix Sums on Hypercube Multicomputers
 Step two has log p phases.
 During each phase a processor performs the ⊕ operation at most two times, so the computation time required by step two is no more than 2χ log p.
 During each phase a processor sends one value to a neighbouring processor and receives one value from that processor.
 The total communication time of step two is 2(λ + β) log p.
 Summing the computation and the communication time yields a total execution time of 2(χ + λ + β) log p for step two of the algorithm.
Prefix Sums on Hypercube Multicomputers
Estimated Execution Time = (n/p − 1)χ + 2(χ + λ + β) log p + (n/p)χ
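The 50% efficiency ceiling quoted on the next slide follows essentially from the two computation terms above; a sketch of the reasoning, assuming the sequential algorithm needs (n − 1) applications of ⊕:

```latex
\[
\text{Speedup} \;\le\;
\frac{(n-1)\,\chi}{\left(\tfrac{2n}{p}-1\right)\chi \;+\; 2(\chi+\lambda+\beta)\log p}
\;\approx\; \frac{n\,\chi}{(2n/p)\,\chi} \;=\; \frac{p}{2}
\quad\text{for large } n,
\]
```

so the efficiency (speedup divided by p) approaches at most about 1/2.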
Prefix Sums on Hypercube Multicomputers
 In other words, the efficiency of this algorithm cannot exceed 50%, no matter how large the problem size or how small the message latency.
 The figure compares the predicted speedup with the speedup actually achieved by this algorithm on the nCUBE 3200™, where the associative operator is integer addition, χ = 414 nanoseconds, λ = 363 microseconds, and β = 4.5 microseconds.