The document discusses parallel algorithms and their analysis. It introduces a simple parallel algorithm for adding n numbers using log n steps. Parallel algorithms are analyzed based on their time complexity, processor complexity, and work complexity. For adding n numbers in parallel, the time complexity is O(log n), processor complexity is O(n), and work complexity is O(n log n). The document also discusses models of parallel computation like PRAM and designs of parallel architectures like meshes and hypercubes.
3.
A simple parallel algorithm
• Example for 8 numbers: We start with 4 processors and
each of them adds 2 items in the first step.
• The number of items is halved at every subsequent step.
Hence log n steps are required for adding n numbers.
The processor requirement is O(n).
We have omitted many details from our description of the algorithm.
• How do we allocate tasks to processors?
• Where is the input stored?
• How do the processors access the input as well as intermediate
results?
We do not ask these questions while designing sequential algorithms.
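The halving scheme on this slide can be sketched in a few lines. The following is a minimal sequential Python simulation of the rounds (the function name `parallel_sum_rounds` is invented here; real processors are not modeled), assuming n is a power of two:

```python
def parallel_sum_rounds(values):
    """Simulate the rounds of pairwise parallel addition.

    In each round, "processor" i adds items 2i and 2i+1 of the current
    list, halving the number of items. Returns (total, number_of_rounds);
    for n inputs the number of rounds is log2 n.
    """
    items = list(values)
    rounds = 0
    while len(items) > 1:
        # Each processor combines one disjoint pair in parallel.
        items = [items[2 * i] + items[2 * i + 1] for i in range(len(items) // 2)]
        rounds += 1
    return items[0], rounds
```

For 8 numbers the simulation uses 4 pair-additions in the first round and finishes in 3 rounds, matching the log n bound.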
4.
How do we analyze a parallel algorithm?
A parallel algorithm is analyzed mainly in terms of its time, processor and work complexities.
• Time complexity T(n) : How many time steps are needed?
• Processor complexity P(n) : How many processors are used?
• Work complexity W(n) : What is the total work done by all the processors? For an algorithm that keeps all processors busy, W(n) = P(n) x T(n).
For our example: T(n) = O(log n)
P(n) = O(n)
W(n) = O(n log n)
5.
How do we judge efficiency?
• Consider two parallel algorithms A1 and A2 for the same problem.
A1: W1(n) work in T1(n) time.
A2: W2(n) work in T2(n) time.
• We say A1 is more efficient than A2 if W1(n) = o(W2(n)), regardless of their time complexities.
For example, W1(n) = O(n) and W2(n) = O(n log n).
• If W1(n) and W2(n) are asymptotically the same, then A1 is more efficient than A2 if T1(n) = o(T2(n)).
For example, W1(n) = W2(n) = O(n), but
T1(n) = O(log n), T2(n) = O(log^2 n).
6.
How do we judge efficiency?
• It is difficult to give a more formal definition of
efficiency.
Consider the following situation.
For A1, W1(n) = O(n log n) and T1(n) = O(n).
For A2, W2(n) = O(n log^2 n) and T2(n) = O(log n).
• It is difficult to say which one is the better algorithm.
Though A1 is more efficient in terms of work, A2 runs
much faster.
• Both algorithms are interesting and one may be better
than the other depending on a specific parallel machine.
7.
Optimal parallel algorithms
• Consider a problem, and let T(n) be the worst-case time
upper bound on a serial algorithm for an input of length
n.
• Assume also that T(n) is the lower bound for solving the
problem. Hence, we cannot have a better upper bound.
• Consider a parallel algorithm for the same problem that
does W(n) work in Tpar(n) time.
The parallel algorithm is work-optimal if W(n) = O(T(n)).
It is work-time-optimal if, in addition, Tpar(n) cannot be improved.
9.
A work-optimal algorithm for adding n
numbers
Step 1.
• Use only n/log n processors and assign log n numbers to
each processor.
• Each processor adds log n numbers sequentially in O(log n)
time.
Step 2.
• We have only n/log n numbers left. We now execute our
original algorithm on these n/log n numbers.
• Now T(n) = O(log n) and
W(n) = O((n / log n) x log n) = O(n)
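The two steps above can be sketched as follows. This is a sequential Python simulation (the name `work_optimal_sum` and the zero-padding of odd-length lists are artifacts of the sketch, not part of the slides):

```python
import math

def work_optimal_sum(values):
    """Two-phase sum using n/log n (virtual) processors.

    Phase 1: each processor sums a block of log n numbers sequentially,
    leaving n/log n partial sums. Phase 2: pairwise parallel addition of
    those partial sums in O(log n) halving rounds.
    """
    n = len(values)
    block = max(1, int(math.log2(n)))  # log n numbers per processor
    partial = [sum(values[i:i + block]) for i in range(0, n, block)]
    while len(partial) > 1:            # phase 2: halving rounds
        if len(partial) % 2:           # pad so pairs always exist
            partial.append(0)
        partial = [partial[2 * i] + partial[2 * i + 1]
                   for i in range(len(partial) // 2)]
    return partial[0]
```

Phase 1 costs O(log n) time on n/log n processors and phase 2 another O(log n), so the total work is O(n) while the time stays O(log n).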
10.
Why is parallel computing important?
• We can justify the importance of parallel computing for
two reasons.
Very large application domains, and
Physical limitations of VLSI circuits
• Though computers are getting faster and faster, user demand for solving very large problems is growing at a still faster rate.
• Some examples include weather forecasting, simulation
of protein folding, computational physics etc.
11.
Physical limitations of VLSI circuits
• The Pentium III processor uses 180 nanometer (nm) technology, i.e., a circuit element like a transistor can be etched within 180 x 10^-9 m.
• Pentium IV processor uses 160nm technology.
• Intel has recently trialed processors made by using 65nm
technology.
12.
How many transistors can we pack?
• Pentium III has about 42 million transistors and
Pentium IV about 55 million transistors.
• The number of transistors on a chip is approximately
doubling every 18 months (Moore’s Law).
• There are now 100 transistors for every ant on Earth
(Moore said so in a recent lecture).
13.
Physical limitations of VLSI circuits
• All semiconductor devices are Si based. It is fairly safe to assume that a circuit element will take at least a single Si atom.
• The covalent bond in Si has a bond length of approximately 0.2 nm.
• Hence, we will reach the limit of miniaturization very soon.
• The upper bound on the speed of electronic signals is 3 x 10^8 m/sec, the speed of light.
• Hence, communication between two adjacent transistors will take approximately 10^-18 sec in the limit.
• If we assume that a floating point operation involves switching of at least a few thousand transistors, such an operation will take about 10^-15 sec in the limit.
• Hence, we are looking at 1000 teraflop machines at the peak of this technology. (1 TFLOPS = 10^12 FLOPS; 1 flop = a floating point operation.)
This is a very optimistic scenario.
14.
Other Problems
• The most difficult problem is to control power dissipation.
• 75 watts is considered the maximum power output of a processor.
• As we pack in more transistors, the power output goes up and better cooling is necessary.
• Intel cooled its 8 GHz demo processor using liquid nitrogen!
15.
The advantages of parallel computing
• Parallel computing offers the possibility of overcoming such
physical limits by solving problems in parallel.
• In principle, thousands, even millions of processors can be
used to solve a problem in parallel and today’s fastest
parallel computers have already reached teraflop speeds.
• Today’s microprocessors already use several parallel processing techniques, like instruction-level parallelism and pipelined instruction fetching.
• Intel uses hyper-threading in the Pentium IV mainly because the processor is clocked at 3 GHz, but the memory bus operates only at about 400-800 MHz.
16.
Problems in parallel computing
• The sequential or uni-processor computing
model is based on von Neumann’s stored
program model.
• A program is written, compiled and stored in
memory and it is executed by bringing one
instruction at a time to the CPU.
17.
Problems in parallel computing
• Programs are written keeping this model in mind.
Hence, there is a close match between the software
and the hardware on which it runs.
• The theoretical RAM model captures these concepts
nicely.
• There are many different models of parallel computing
and each model is programmed in a different way.
• Hence an algorithm designer has to keep in mind a
specific model for designing an algorithm.
• Most parallel machines are suitable for solving specific
types of problems.
• Designing operating systems is also a major issue.
19.
The PRAM model
• Each processor should be able to access any memory
location in each clock cycle.
• Hence, there may be conflicts in memory access. Also,
memory management hardware needs to be very complex.
• We need some kind of hardware to connect the processors
to individual locations in the shared memory.
• The SB-PRAM developed at University of Saarlandes by Prof.
Wolfgang Paul’s group is one such model.
http://www-wjp.cs.uni-sb.de/projects/sbpram/
21. An overview of Lecture 2
• Models of parallel computation
• Characteristics of SIMD models
• Design issues for network SIMD models
• The mesh and the hypercube architectures
• Classification of the PRAM model
• Matrix multiplication on the EREW PRAM
22. Models of parallel computation
Parallel computational models can be
broadly classified into two categories,
• Single Instruction Multiple Data (SIMD)
• Multiple Instruction Multiple Data (MIMD)
23. Models of parallel computation
• SIMD models are used for solving
problems which have regular structures.
We will mainly study SIMD models in this
course.
• MIMD models are more general and used
for solving problems which lack regular
structures.
24. SIMD models
An N-processor SIMD computer has the following characteristics:
• Each processor can store both program
and data in its local memory.
• Each processor stores an identical copy
of the same program in its local memory.
25. SIMD models
• At each clock cycle, each processor
executes the same instruction from this
program. However, the data are different
in different processors.
• The processors communicate among
themselves either through an
interconnection network or through a
shared memory.
26. Design issues for network
SIMD models
• A network SIMD model is a graph. The
nodes of the graph are the processors
and the edges are the links between the
processors.
• Since each processor solves only a small
part of the overall problem, it is necessary
that processors communicate with each
other while solving the overall problem.
27. Design issues for network
SIMD models
• The main design issues for network SIMD
models are communication diameter,
bisection width, and scalability.
• We will discuss two most popular network
models, mesh and hypercube in this
lecture.
28. Communication diameter
• Communication diameter is the diameter of the graph that represents the network model. The diameter of a graph is the largest of the shortest-path distances between any pair of nodes.
• If the diameter for a model is d, the lower bound for any computation on that model is Ω(d).
29. Communication diameter
• The data can be distributed in such a way
that the two furthest nodes may need to
communicate.
31. Bisection width
• The bisection width of a network model is the minimum number of links that must be removed to decompose the graph into two equal halves.
• If the bisection width is large, more information can be exchanged between the two halves of the graph, and hence problems can be solved faster.
33. Scalability
• A network model must be scalable so that
more processors can be easily added
when new resources are available.
• The model should be regular so that each
processor has a small number of links
incident on it.
34. Scalability
• If the number of links is large for each
processor, it is difficult to add new
processors as too many new links have to
be added.
• If we want to keep the diameter small, we need more links per processor. If we want our model to be scalable, we need fewer links per processor.
35. Diameter and Scalability
• The best model in terms of diameter is the
complete graph. The diameter is 1.
However, if we need to add a new node to
an n-processor machine, we need n - 1
new links.
36. Diameter and Scalability
• The best model in terms of scalability is
the linear array. We need to add only one
link for a new processor. However, the
diameter is n for a machine with n
processors.
37. The mesh architecture
• Each internal processor of a 2-dimensional
mesh is connected to 4 neighbors.
• When we combine two different meshes,
only the processors on the boundary need
extra links. Hence it is highly scalable.
38. The mesh architecture
• Both the diameter and bisection width of an n-processor, 2-dimensional mesh are O(√n).
[Figure: A 4 x 4 mesh]
39. The hypercube architecture
[Figure: Hypercubes of 0, 1, 2 and 3 dimensions]
40. The hypercube architecture
• The diameter of a d-dimensional hypercube is d, as we need to flip at most d bits (traverse d links) to reach one processor from another.
• The bisection width of a d-dimensional hypercube is 2^(d-1).
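Both facts can be checked by brute force on small hypercubes. In this sketch (the function names are invented here), node labels are d-bit integers and two nodes are neighbours exactly when their labels differ in one bit:

```python
def hypercube_diameter(d):
    """Diameter = maximum Hamming distance between node labels, which is d."""
    n = 1 << d
    return max(bin(u ^ v).count("1") for u in range(n) for v in range(n))

def hypercube_bisection_width(d):
    """Count links crossing the split on the top label bit: exactly 2^(d-1)."""
    n = 1 << d
    half = n // 2
    crossing = 0
    for u in range(n):
        for bit in range(d):
            v = u ^ (1 << bit)           # neighbours differ in one bit
            if u < v and u < half <= v:  # link crosses the bisection
                crossing += 1
    return crossing
```

Splitting on the top bit gives two (d-1)-dimensional subcubes joined by one link per node pair, hence 2^(d-1) crossing links.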
41. The hypercube architecture
• The hypercube is a highly scalable architecture. Two d-dimensional hypercubes can be easily combined to form a (d+1)-dimensional hypercube.
• The hypercube has several variants, like the butterfly, the shuffle-exchange network and cube-connected cycles.
46.
Complexity Analysis: Given n processors
connected via a hypercube, S_Sum_Hypercube needs
log n rounds to compute the sum. Since n messages
are sent and received in each round, the total number of
messages is O(n log n).
1. Time complexity: O(log n).
2. Message complexity: O(n log n).
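S_Sum_Hypercube itself is not reproduced in this excerpt; the following is a minimal sequential simulation of hypercube summation by recursive halving, under the assumption that node u of the d-dimensional hypercube initially holds `values[u]` (the function name is invented here):

```python
def hypercube_sum(values):
    """Sum on a d-dimensional hypercube by recursive halving.

    len(values) must be a power of two. In the round for dimension t,
    every node with bit t set sends its partial sum across that
    dimension, and its partner adds it in. After d rounds (log n
    rounds for n nodes), node 0 holds the total.
    """
    vals = list(values)
    n = len(vals)
    d = n.bit_length() - 1
    for t in reversed(range(d)):
        for u in range(n):
            if u & (1 << t):                 # sender in this round
                vals[u ^ (1 << t)] += vals[u]
                vals[u] = 0
    return vals[0]
```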
47. Classification of the PRAM model
• In the PRAM model, processors
communicate by reading from and writing
to the shared memory locations.
48. Classification of the PRAM
model
• The power of a PRAM depends on the
kind of access to the shared memory
locations.
49. Classification of the PRAM
model
In every clock cycle,
• In the Exclusive Read Exclusive Write
(EREW) PRAM, each memory location
can be accessed only by one processor.
• In the Concurrent Read Exclusive Write (CREW) PRAM, multiple processors can read from the same memory location, but only one processor can write to it.
50. Classification of the PRAM
model
• In the Concurrent Read Concurrent Write (CRCW) PRAM, multiple processors can read from or write to the same memory location.
51. Classification of the PRAM
model
• It is easy to allow concurrent reading.
However, concurrent writing gives rise to
conflicts.
• If multiple processors write to the same
memory location simultaneously, it is not
clear what is written to the memory
location.
52. Classification of the PRAM
model
• In the Common CRCW PRAM, all the
processors must write the same value.
• In the Arbitrary CRCW PRAM, one of the
processors arbitrarily succeeds in writing.
• In the Priority CRCW PRAM, processors
have priorities associated with them and
the highest priority processor succeeds in
writing.
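The three write-resolution rules can be illustrated with a tiny resolver. This sketch is not part of the slides; the function name `resolve_crcw_write` and the convention that a lower processor id means higher priority are assumptions made here:

```python
def resolve_crcw_write(requests, mode):
    """Resolve simultaneous writes to one shared memory cell.

    requests: list of (processor_id, value) pairs issued in one cycle;
              a lower id is taken to mean higher priority.
    mode: 'common', 'arbitrary', or 'priority'.
    """
    if not requests:
        return None
    if mode == "common":
        values = {v for _, v in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW: all processors must write the same value")
        return values.pop()
    if mode == "arbitrary":
        return requests[0][1]    # any one writer may succeed; pick one
    if mode == "priority":
        return min(requests)[1]  # smallest id (highest priority) wins
    raise ValueError(mode)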
53. Classification of the PRAM
model
• The EREW PRAM is the weakest and the
Priority CRCW PRAM is the strongest
PRAM model.
• The relative powers of the different PRAM models are as follows:
EREW ≤ CREW ≤ Common CRCW ≤ Arbitrary CRCW ≤ Priority CRCW.
54. Classification of the PRAM
model
• An algorithm designed for a weaker
model can be executed within the same
time and work complexities on a
stronger model.
55. Classification of the PRAM
model
• We say model A is less powerful compared to model B if either:
• the time complexity for solving a problem is asymptotically less in model B as compared to model A, or
• if the time complexities are the same, the processor or work complexity is asymptotically less in model B as compared to model A.
56. Classification of the PRAM
model
An algorithm designed for a stronger PRAM
model can be simulated on a weaker model
either with asymptotically more processors
(work) or with asymptotically more time.
58. Adding n numbers on a PRAM
• This algorithm works on the EREW PRAM
model as there are no read or write
conflicts.
• We will use this algorithm to design a
matrix multiplication algorithm on the
EREW PRAM.
59. Matrix multiplication
For simplicity, we assume that n = 2^p for some integer p.
• We compute C = A x B, where ci,j = ai,1 x b1,j + ai,2 x b2,j + ... + ai,n x bn,j.
60. Matrix multiplication
• Each ci,j, 1 ≤ i, j ≤ n, can be computed in parallel.
• We allocate n processors for computing ci,j. Suppose these processors are P1, P2, ..., Pn.
• In the first time step, processor Pm, 1 ≤ m ≤ n, computes the product ai,m x bm,j.
• We now have n numbers, and we use the addition algorithm to sum these n numbers in log n time.
61. Matrix multiplication
• Computing each ci,j, 1 ≤ i, j ≤ n, takes n processors and log n time.
• Since there are n^2 such ci,j's, we need overall O(n^3) processors and O(log n) time.
• The processor requirement can be reduced to O(n^3 / log n). Exercise!
• Hence, the work complexity is O(n^3).
62. Matrix multiplication
• However, this algorithm requires
concurrent read capability.
• Note that, each element ai,j (and bi,j)
participates in computing n elements from
the C matrix.
• Hence n different processors will try to
read each ai,j (and bi,j) in our algorithm.
63. Matrix multiplication
64. Matrix multiplication
• Hence our algorithm runs on the CREW
PRAM and we need to avoid the read
conflicts to make it run on the EREW
PRAM.
• We will create n copies of each of the
elements ai,j (and bi,j). Then one copy can
be used for computing each ci,j .
65. Matrix multiplication
Creating n copies of a number in O(log n) time using O(n) processors on the EREW PRAM:
• In the first step, one processor reads the
number and creates a copy. Hence, there
are two copies now.
• In the second step, two processors read
these two copies and create four copies.
66. Matrix multiplication
• Since the number of copies doubles in every step, n copies are created in O(log n) steps.
• Though we need n processors, the processor requirement can be reduced to O(n / log n).
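The doubling scheme above can be sketched directly. In this sequential simulation (the name `erew_broadcast` is invented here), each step lets every existing copy be read by one distinct processor, so no memory location is read twice in a step:

```python
def erew_broadcast(x, n):
    """Create n copies of x in ceil(log2 n) doubling steps (EREW-safe).

    Returns (copies, steps): the list of n copies and the number of
    doubling steps used.
    """
    copies = [x]
    steps = 0
    while len(copies) < n:
        # Each processor reads a distinct existing copy and writes a new one,
        # so the number of copies (at most) doubles per step.
        copies.extend(copies[: n - len(copies)])
        steps += 1
    return copies, steps
```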
67. Matrix multiplication
• Since there are n^2 elements in the matrix A (and in B), we need O(n^3 / log n) processors and O(log n) time to create n copies of each element.
• After this, there are no read conflicts in our algorithm. The overall matrix multiplication algorithm now takes O(log n) time and O(n^3 / log n) processors on the EREW PRAM.
69.
Using n^3 Processors
Algorithm MatMult_CREW
/* Step 1 */
forall Pi,j,k, where 1 ≤ i, j, k ≤ n, do in parallel
C[i,j,k] = A[i,k] * B[k,j]
endfor
/* Step 2 */
for l = 1 to log n do
forall Pi,j,k, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
if (2k mod 2^l) = 0 then
C[i,j,2k] = C[i,j,2k] + C[i,j,2k - 2^(l-1)]
endif
endfor
endfor
/* The output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
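The algorithm can be exercised with a small sequential Python simulation: the forall loops become ordinary loops, and the 0-based indexing (plus the name `matmult_crew`) is an artifact of the sketch, not of the slides:

```python
import math

def matmult_crew(A, B):
    """Simulate MatMult_CREW: n^3 products, then log n halving rounds.

    A, B: n x n lists of lists, with n a power of two.
    C[i][j][k] plays the role of processor P(i,j,k)'s cell.
    """
    n = len(A)
    # Step 1: all n^3 products in one parallel step.
    C = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
         for i in range(n)]
    # Step 2: log n rounds of pairwise addition along the k axis.
    for l in range(1, int(math.log2(n)) + 1):
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):
                    if (2 * k) % (1 << l) == 0:
                        # Slide indices are 1-based; shift by 1 here.
                        C[i][j][2 * k - 1] += C[i][j][2 * k - 1 - (1 << (l - 1))]
    # The result ends up in C[i][j][n] (1-based), i.e. C[i][j][n-1] here.
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]
```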
70.
Complexity Analysis
• In the first step, the products are computed in parallel in constant time, that is, O(1).
• These products are summed in O(log n) time during the second step. Therefore, the run time is O(log n).
• Since the number of processors used is n^3, the cost is O(n^3 log n).
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n^3.
3. Cost, C(n) = O(n^3 log n).
71.
Reducing the Number of Processors
• In the above algorithm, all the processors were busy during the first step, but not all of them performed addition operations during the second step.
• The second step consists of log n iterations. During the first iteration, only n^3/2 processors perform addition operations, only n^3/4 perform additions in the second iteration, and so on.
• With this understanding, we may be able to use a smaller machine with only n^3/log n processors.
72.
Reducing the Number of Processors
1. Each processor Pi,j,k, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.
2. The sums of products produced in step 1 are added to produce the resulting matrix as discussed before.
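The two steps above can be sketched as follows; this sequential simulation (the name `matmult_reduced` is invented here) uses Python's `sum` to stand in for the O(log n) pairwise-addition rounds of step 2:

```python
import math

def matmult_reduced(A, B):
    """Matrix multiply with n^3/log n (virtual) processors.

    Step 1: each processor sums a block of log n of the n products for
    its (i, j) pair sequentially, leaving n/log n partial sums per entry.
    Step 2: the partial sums are added (pairwise in O(log n) rounds on
    the PRAM; plain sum() here for brevity).
    """
    n = len(A)
    g = max(1, int(math.log2(n)))  # log n products per processor
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            products = [A[i][k] * B[k][j] for k in range(n)]
            partial = [sum(products[s:s + g]) for s in range(0, n, g)]
            row.append(sum(partial))  # stands in for the addition rounds
        out.append(row)
    return out
```

Step 1 takes O(log n) time, step 2 another O(log n), so the run time stays O(log n) while the cost drops to O(n^3).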
73.
Complexity Analysis
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n^3/log n.
3. Cost, C(n) = O(n^3).