CHAPTER 7: Matrix Multiplication

Sequential Matrix Multiplication

Algorithms for Processor Arrays
Matrix Multiplication on the 2-D Mesh SIMD Model
Matrix Multiplication on the 2-D Mesh
• Gentleman (1978) has shown that multiplication of two n×n matrices on the 2-D mesh SIMD model requires Ω(n) data routing steps.

A Lower Bound (Definition)
• Given a data item originally available at a single processor in some model of parallel computation, let the function σ(k) be the maximum number of processors to which the data can be transmitted in k or fewer data routing steps.
• On the 2-D mesh: σ(0) = 1, σ(1) = 5, σ(2) = 13, and in general σ(k) = 2k² + 2k + 1.
MESH Network
σ(0) = 1, σ(1) = 5, σ(2) = 13, σ(k) = 2k² + 2k + 1
[Figure: a 2-D mesh with one source processor highlighted; the sets of processors reachable in 1, 2, 3, ... routing steps illustrate the values of σ(k).]
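A quick way to sanity-check the closed form is to count the processors of a mesh that lie within Manhattan distance k of a source. The following Python sketch is mine, not part of the slides, and assumes a mesh without wraparound whose source is far from the boundary.

    def sigma(k, n=50):
        # Count processors of an n x n mesh within Manhattan distance k of a
        # central source; valid while k is small compared with n.
        src = (n // 2, n // 2)
        return sum(1 for i in range(n) for j in range(n)
                   if abs(i - src[0]) + abs(j - src[1]) <= k)

    for k in range(5):
        assert sigma(k) == 2 * k * k + 2 * k + 1    # sigma(k) = 2k² + 2k + 1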
Matrix Multiplication on the 2-D Mesh
Lemma:
Suppose two n×n matrices A and B are to be multiplied, and assume that every element of A and B is stored exactly once and that no processing element contains more than one element of either matrix.
If we ignore any data broadcasting facility, multiplying A and B to produce the n×n matrix C requires at least s data routing steps, where σ(2s) ≥ n².
Matrix Multiplication on the 2-D Mesh
• Consider an arbitrary element Ci,j of the product matrix.
• This element is the inner product of row i of matrix A and column j of matrix B.
• There must be a path from each of the processors where these elements are stored to the processor where the result Ci,j is stored.
• Let s denote the length of the longest such path; the creation of the matrix product C then takes at least s data routing steps.
Matrix Multiplication on the 2-D Mesh
• There is a path of length at most s from bu,v to the processor where Ci,v is computed, and there is also a path of length at most s from ai,j to Ci,v.
• Hence there is a path of length at most 2s from any element bu,v to every element ai,j.
• Similarly, these paths define a set of paths of length at most 2s from any element au,v to every element bi,j, where 1 ≤ i, j ≤ n.

The n² elements of A are stored in unique processors. There is a path of length at most 2s from any element of B to every element of A and from any element of A to every element of B. Therefore σ(2s) ≥ n².
Matrix Multiplication on the 2-D Mesh
An Optimal Algorithm:
Given a 2-D mesh SIMD model with wraparound connections, it is easy to devise an algorithm that uses n² processors to multiply two n × n matrices in Θ(n) time.
Matrix multiplication requires n³ multiplications. The only way that n² processing elements can complete the multiplication in Θ(n) time is for Θ(n²) processing elements to be contributing toward the result in every step.
Matrix Multiplication on the 2-D Mesh
Initial Condition
[Figure: each processor P(i,j) of the 4×4 mesh initially holds Ai,j and Bi,j:
  A00,B00  A01,B01  A02,B02  A03,B03
  A10,B10  A11,B11  A12,B12  A13,B13
  A20,B20  A21,B21  A22,B22  A23,B23
  A30,B30  A31,B31  A32,B32  A33,B33 ]
Matrix Multiplication on the 2-D Mesh
[Figure: matrices A, B, and the result C. The goal is for processor P(i,j) to end up holding
  Ci,j = Ai,0·B0,j + Ai,1·B1,j + Ai,2·B2,j + Ai,3·B3,j,
e.g. P(0,0) must accumulate A00·B00 + A01·B10 + A02·B20 + A03·B30.]
Initial Position of matrices A and B
[Figure: the initial placement of A and B on the mesh, shown next to the 4×4 table of dot products Ci,j that must be produced.]

Matrices A and B after Rotation
[Figure: the staggering in progress; row i of A rotates i positions to the west and column j of B rotates j positions to the north, using the wraparound connections.]
Final Position of matrices A and B after Staggering
[Figure: after staggering, processor P(i,j) holds A i,(i+j) mod 4 and B (i+j) mod 4,j:
  A00,B00  A01,B11  A02,B22  A03,B33
  A11,B10  A12,B21  A13,B32  A10,B03
  A22,B20  A23,B31  A20,B02  A21,B13
  A33,B30  A30,B01  A31,B12  A32,B23 ]
Matrix Multiplication on the 2-D Mesh
Global:  n, k
Local:   a, b, c

begin
  {Stagger matrices}
  for k = 1 to n-1 do
    for all P(i,j) where 1 ≤ i,j ≤ n do
      if i > k then a = east(a) endif
      if j > k then b = south(b) endif
    endfor
  endfor
  {Compute dot products}
  for all P(i,j) where 1 ≤ i,j ≤ n do
    c = a × b
  endfor
  for k = 1 to n-1 do
    for all P(i,j) where 1 ≤ i,j ≤ n do
      a = east(a)
      b = south(b)
      c = c + a × b
    endfor
  endfor
end
• The first phase of the parallel algorithm staggers the two matrices.
• The second phase computes all the products ai,k × bk,j and accumulates the sums.
• When phase 2 is complete, Ci,j = Σk=1..n ai,k × bk,j.
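The data movement described above can be checked with a short sequential simulation. The sketch below is my own Python (0-based indices, numpy arrays standing in for the grid of processing elements), not the slides' notation; it reproduces the 4×4 example traced on the following slides.

    import numpy as np

    def mesh_matmul(A, B):
        n = A.shape[0]
        a, b = A.copy(), B.copy()
        # Stagger: rotate row i of A west by i and column j of B north by j (wraparound).
        for i in range(n):
            a[i, :] = np.roll(a[i, :], -i)
        for j in range(n):
            b[:, j] = np.roll(b[:, j], -j)
        c = a * b                        # every PE multiplies its local pair
        for _ in range(n - 1):
            a = np.roll(a, -1, axis=1)   # a = east(a): values move one PE to the west
            b = np.roll(b, -1, axis=0)   # b = south(b): values move one PE to the north
            c += a * b                   # accumulate the next partial product
        return c

    A = np.array([[1, 0, 2, 3], [4, -1, 1, 5], [-2, -3, -4, 2], [-1, 2, 0, 0]])
    B = np.array([[-1, 1, 2, -3], [-5, -4, 2, -2], [3, -1, 0, 2], [1, 0, 4, 5]])
    assert (mesh_matmul(A, B) == A @ B).all()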
Worked example (A × B = C):

       A                  B                   C
   1  0  2  3        -1  1  2 -3          8  -1  14  16
   4 -1  1  5        -5 -4  2 -2          9   7  26  17
  -2 -3 -4  2         3 -1  0  2          7  14  -2  14
  -1  2  0  0         1  0  4  5         -9  -9   2  -1

[Figures: the slides trace this example through the staggering of A and B and the four multiply-rotate-accumulate steps, showing the partial sums held by each processing element after each step; the final grid holds C.]
MATRIX MULTIPLICATION
on the Hypercube SIMD Model
Matrix Multiplication on Hypercube
  No. of Processors   Time Complexity   Topology
  1                   Θ(n³)             SISD
  n                   Θ(n²)             Linear array
  n²                  Θ(n)              Mesh with wraparound connections
  n³                  Θ(log n)          Hypercube
Matrix Multiplication on Hypercube
[Figure: the 4×4 grid of dot products Ci,j = Σk Ai,k·Bk,j to be computed, as in the mesh section.]
[Figure: a 3-dimensional hypercube with processors 000 through 111.]
To multiply two n × n matrices:
• No. of PEs = n³ = (2^q)³, where n = 2^q
• Example: C2×2 = A2×2 × B2×2, so n = 2, q = 1, No. of PEs = 2³ = 8, and the hypercube dimension is 3
• PE addresses are written x2 x1 x0

  | a00 a01 |   | b00 b01 |   | a00·b00 + a01·b10   a00·b01 + a01·b11 |
  | a10 a11 | × | b10 b11 | = | a10·b00 + a11·b10   a10·b01 + a11·b11 |
Matrix Multiplication on Hypercube
• The processing elements should be thought of as filling an n×n×n lattice.
• Processor P(x), where 0 ≤ x ≤ 2^(3q) − 1, has local memory locations a, b, c, s, and t.
• When the parallel algorithm begins execution, matrix elements ai,j and bi,j, for 0 ≤ i, j ≤ n−1, are stored in variables a and b of processor P(2^q·i + j).
• After the parallel algorithm is complete, matrix element ci,j, for 0 ≤ i, j ≤ n−1, should be stored in variable c of processor P(2^q·i + j).
The (i,j)-th elements of matrices A and B are distributed over n² distinct PEs, P(2^q·i + j), where 0 ≤ i, j < n.

  P     a     b
  000   a00   b00
  001   a01   b01
  010   a10   b10
  011   a11   b11
  100
  101
  110
  111

BIT(m,l):      returns the l-th bit in the binary representation of m
BIT_COMP(m,l): returns the value of the integer formed by complementing the l-th bit of m

[Figure: the hypercube with nodes 000–111; only the face with x2 = 0 holds data initially.]
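The two helpers are simple bit operations; this Python sketch (mine, reusing the slides' names) is one possible realization.

    def BIT(m, l):
        # Return the l-th bit of the binary representation of m.
        return (m >> l) & 1

    def BIT_COMP(m, l):
        # Return the integer formed by complementing the l-th bit of m.
        return m ^ (1 << l)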
Matrix Multiplication on Hypercube
Each ai,j and bi,j is to be replicated into both layers k = 0 and k = 1; in layer k it resides at address 4·k + 2·i + j:
  k = 0:  a[2·i + j] = ai,j        b[2·i + j] = bi,j
  k = 1:  a[4 + 2·i + j] = ai,j    b[4 + 2·i + j] = bi,j

  i, j    k = 0       k = 1
  0,0     0 = 000     4 = 100
  0,1     1 = 001     5 = 101
  1,0     2 = 010     6 = 110
  1,1     3 = 011     7 = 111
for all Pm where BIT(m,2) = 1 do
  t = BIT_COMP(m,2)
  a = [t]a
  b = [t]b
endfor

  P     a     b
  000   a00   b00
  001   a01   b01
  010   a10   b10
  011   a11   b11
  100   a00   b00
  101   a01   b01
  110   a10   b10
  111   a11   b11

[Figure: each node on the face x2 = 1 receives a and b from its neighbour on the face x2 = 0.]
Matrix Multiplication on Hypercube
Where each ai,k must end up (address = 4·k + 2·i + j, for j = 0 and j = 1):
  j = 0:  a[4·k + 2·i]     = ai,k
  j = 1:  a[4·k + 2·i + 1] = ai,k

  i, k    j = 0       j = 1
  0,0     0 = 000     1 = 001
  0,1     4 = 100     5 = 101
  1,0     2 = 010     3 = 011
  1,1     6 = 110     7 = 111
for all Pm where BIT(m,0) ≠ BIT(m,2) do
  t = BIT_COMP(m,0)
  a = [t]a
endfor

  P     a     b
  000   a00   b00
  001   a00   b01
  010   a10   b10
  011   a10   b11
  100   a01   b00
  101   a01   b01
  110   a11   b10
  111   a11   b11

[Figure: within each face x2 = k, the value ai,k is copied along the x0 (j) dimension.]
Matrix Multiplication on Hypercube
Where each bk,j must end up (address = 4·k + 2·i + j, for i = 0 and i = 1):
  i = 0:  b[4·k + j]     = bk,j
  i = 1:  b[4·k + 2 + j] = bk,j

  k, j    i = 0       i = 1
  0,0     0 = 000     2 = 010
  0,1     1 = 001     3 = 011
  1,0     4 = 100     6 = 110
  1,1     5 = 101     7 = 111
for all Pm where BIT(m,1) ≠ BIT(m,2) do
  t = BIT_COMP(m,1)
  b = [t]b
endfor

  P     a     b
  000   a00   b00
  001   a00   b01
  010   a10   b00
  011   a10   b01
  100   a01   b10
  101   a01   b11
  110   a11   b10
  111   a11   b11

[Figure: within each face x2 = k, the value bk,j is copied along the x1 (i) dimension.]
Every processor Pm, with address bits (x2, x1, x0) = (k, i, j), now holds a = ai,k and b = bk,j:

  P     a     b
  000   a00   b00
  001   a00   b01
  010   a10   b00
  011   a10   b01
  100   a01   b10
  101   a01   b11
  110   a11   b10
  111   a11   b11

  | a00 a01 |   | b00 b01 |   | a00·b00 + a01·b10   a00·b01 + a01·b11 |
  | a10 a11 | × | b10 b11 | = | a10·b00 + a11·b10   a10·b01 + a11·b11 |

Each processor forms its local product:  c = a × b
for all Pm do
  t = BIT_COMP(m,2)
  s = [t]c
  c = c + s
endfor

  P     c
  000   a00·b00 + a01·b10
  001   a00·b01 + a01·b11
  010   a10·b00 + a11·b10
  011   a10·b01 + a11·b11
  100   a01·b10 + a00·b00
  101   a01·b11 + a00·b01
  110   a11·b10 + a10·b00
  111   a11·b11 + a10·b01

After this sum step, ci,j is available in variable c of processor P(2·i + j), on the face x2 = 0.
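The whole sequence of routings can be simulated sequentially. The sketch below is my own Python, not the slides' program: it collapses the dimension-by-dimension routing into direct indexing of a dictionary of "processors", but it places the same a = ai,k and b = bk,j in every processor (k, i, j) and performs the same final reduction onto layer k = 0 (processors P(2^q·i + j)).

    import numpy as np

    def hypercube_matmul(A, B):
        n = A.shape[0]
        q = n.bit_length() - 1                    # n = 2^q, so there are n^3 processors
        a, b, c = {}, {}, {}
        # Initial distribution: P(2^q * i + j) holds ai,j and bi,j.
        for i in range(n):
            for j in range(n):
                a[(i << q) | j] = A[i, j]
                b[(i << q) | j] = B[i, j]
        # Replicate layer k = 0 into layers k = 1 .. n-1.
        for m in range(n ** 3):
            if m >> (2 * q) != 0:
                a[m] = a[m & (n * n - 1)]
                b[m] = b[m & (n * n - 1)]
        # In layer k, processor (k, i, j) fetches a = ai,k and b = bk,j, then multiplies.
        for m in range(n ** 3):
            k, i, j = m >> (2 * q), (m >> q) & (n - 1), m & (n - 1)
            a[m] = a[(k << (2 * q)) | (i << q) | k]
            b[m] = b[(k << (2 * q)) | (k << q) | j]
            c[m] = a[m] * b[m]
        # Sum the n partial products of each (i, j) across the layers.
        C = np.zeros_like(A)
        for m in range(n ** 3):
            k, i, j = m >> (2 * q), (m >> q) & (n - 1), m & (n - 1)
            C[i, j] += c[m]
        return C

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])
    assert (hypercube_matmul(A, B) == A @ B).all()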
MATRIX MULTIPLICATION
on Shuffle Exchange SIMD Model
Matrix Multiplication on Shuffle Exchange
MATRIX MULTIPLICATION
on Multiprocessors
Matrix Multiplication for Multiprocessors
• When there are a number of nested
loops, all suitable for parallelization,
which loop should be made parallel?
[Figure: the 4×4 grid of dot products Ci,j = Σk Ai,k·Bk,j that the nested loops compute.]
Matrix Multiplication for Multiprocessors
• In the case of matrix multiplication, we can parallelize the j loop or the i loop.
• If the j loop is parallelized, the parallel algorithm executes n synchronizations (one per iteration of the i loop) and the grain size of the parallel code is Θ(n²/p).
• If the i loop is parallelized, the parallel algorithm executes only one synchronization and the grain size of the parallel code is Θ(n³/p).
• On most UMA multiprocessors the version with the i loop parallelized will execute faster.
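For reference, here is the triply nested loop in question, as a sequential Python sketch of my own, with the two parallelization choices marked in comments.

    def matmul(A, B, n):
        C = [[0] * n for _ in range(n)]
        for i in range(n):        # parallelizing this loop: one synchronization,
                                  # grain size Θ(n³/p)
            for j in range(n):    # parallelizing this loop instead: n synchronizations
                                  # (one per i iteration), grain size Θ(n²/p)
                for k in range(n):
                    C[i][j] += A[i][k] * B[k][j]
        return C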
Matrix Multiplication for Multiprocessors
• Each processor computes n/p rows of C.
• The time needed to compute a single row is Θ(n²).
• The processes synchronize exactly once; the synchronization overhead, then, is Θ(p).
• Hence the complexity of this parallel algorithm is Θ(n³/p + p).
• Since there are only n rows, at most n processors may execute this algorithm.
• If we ignore memory contention, we can expect speedup to be nearly linear.
Matrix Multiplication for Multiprocessors
• In a UMA multiprocessor every global memory cell is an equal distance from every processor.
• But on loosely coupled multiprocessors, where some matrix elements may be much easier to access than others, it is important to keep as many memory references as possible local.
• In this algorithm a processor not only accesses n/p rows of A, it also accesses every element of B n/p times.
• Only a single addition and a single multiplication occur for every element of B fetched.
• This is not a good ratio: {(2·n²)·(n/p)} / {(n + n²)·(n/p)} = 2n/(1+n) ≈ 2.
• Implementation of this algorithm on a BBN TC2000 would therefore yield poor speedup.
BLOCK MATRIX
MULTIPLICATION
Matrix Multiplication for Multiprocessors
Block Matrix Multiplication for Multiprocessors
• Assume that A and B are both n × n matrices, where n = 2k. Then A and B can be thought of as conglomerates of four smaller matrices, each of size k × k.
• Examples: k = 2, n = 4;  k = 3, n = 6;  k = 4, n = 8.
Block Matrix Multiplication for Multiprocessors
[Figure: the decomposition generalizes to more blocks, e.g. k = 2, n = 6 and k = 3, n = 9 give a 3 × 3 arrangement of k × k blocks.]
Block Matrix Multiplication for Multiprocessors
• If we assign processes to do the block matrix multiplications, then the number of multiplications and additions per matrix-element fetch increases.
• Assume that there are p = (n/k)² processes (p = 4 in the 2 × 2 block case above).
• The matrix multiplication is performed by dividing A and B into p blocks of size k × k.
• Each block multiplication requires 2k² memory fetches, k³ additions, and k³ multiplications.
• The number of arithmetic operations per memory access has therefore risen from 2 to k = n/√p.
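Below is a sequential Python sketch of my own of the blocked computation; in the multiprocessor version each process would own one k × k block of C and loop over the corresponding block row of A and block column of B.

    import numpy as np

    def block_matmul(A, B, k):
        n = A.shape[0]                   # assumes n is a multiple of k
        C = np.zeros((n, n))
        for bi in range(0, n, k):        # block row of C
            for bj in range(0, n, k):    # block column of C (one process per (bi, bj) block)
                for bs in range(0, n, k):
                    C[bi:bi+k, bj:bj+k] += A[bi:bi+k, bs:bs+k] @ B[bs:bs+k, bj:bj+k]
        return C

    rng = np.random.default_rng(0)
    A, B = rng.random((8, 8)), rng.random((8, 8))
    assert np.allclose(block_matmul(A, B, 2), A @ B)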
Block Matrix Multiplication for Multiprocessors
• Ostlund, Hibbard, and Whiteside (1982) have implemented this matrix multiplication algorithm on Cm* for various matrix sizes. The results of their experiment are shown in the figure.
• p = (n/k)², i.e. k = n/√p.
ALGORITHMS FOR MULTICOMPUTERS
Matrix Multiplication for Multicomputers
Row Column Oriented Algorithm
• Multiplying two n × n matrices A and B involves the computation of n² dot products.
• Each dot product is between a row of A and a column of B.
• At any moment in time every matrix element must be stored in the local memory of exactly one processor.
• It is natural to partition A into rows and B into columns.
• Assume that n is a power of two and that we are executing the algorithm on an n-processor hypercube.
• To maximize grain size we want to parallelize the outermost for loop.
Row Column Oriented Algorithm
[Figure: a 4-processor hypercube, nodes 0–3.]
n = 4: multiply two 4 × 4 matrices using a 4-processor hypercube multicomputer.
Each processor stores exactly one row of A and one column of B.
Row Column Oriented Algorithm
• A straightforward parallelization of the outer loop would demand that all parallel processes first access column 0 of B, then column 1 of B, and so on.
• This results in a sequence of broadcast steps, each having complexity Θ(log n) on an n-processor hypercube.
• Contention for shared resources can dramatically lower the performance of a parallel algorithm.
• On a multicomputer, the processor that controls the variable must broadcast its value to the other processors.
• If the order in which the processors access the data items is unimportant, we can rewrite the parallel algorithm to eliminate the contention.
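One contention-free rewriting rotates the columns of B among the processors, so that at every step each processor works on a different column. The Python sketch below is mine, not the slides' program, and simulates that pattern sequentially; the direction of the rotation is an arbitrary choice.

    import numpy as np

    def row_column_matmul(A, B):
        n = A.shape[0]                               # one "processor" per row/column
        C = np.zeros((n, n))
        cols = [B[:, i].copy() for i in range(n)]    # processor i starts with column i of B
        for step in range(n):
            for i in range(n):                       # all processors work in parallel
                j = (i + step) % n                   # index of the column processor i now holds
                C[i, j] = A[i, :] @ cols[i]
            cols = [cols[(i + 1) % n] for i in range(n)]   # pass columns to the neighbour
        return C

    rng = np.random.default_rng(1)
    A, B = rng.random((4, 4)), rng.random((4, 4))
    assert np.allclose(row_column_matmul(A, B), A @ B)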
Row Column Oriented Algorithm
This algorithm achieves reasonable performance on hypercube multicomputers. The figure illustrates the speedup achieved by a parallel implementation of the algorithm on the nCUBE 3200.
Two points about the algorithm are worth noticing:
• First, the communication time increases linearly with the number of processors.
• Second, the number of computations performed per iteration is inversely proportional to the number of processors used.
Block Oriented Algorithm
• We are to multiply a matrix A of size l × m by a matrix B of size m × n.
• Assume p is an even power of 2 (4, 16, 64, ...).
• Assume that l, m, and n are integer multiples of √p.
• Processors are organized as a two-dimensional mesh with wraparound connections, and each processor is given an (l/√p × m/√p) subsection of A and an (m/√p × n/√p) subsection of B.
• Example: p = 16, √p = 4
• l = 4·2 = 8, m = 4·3 = 12, n = 4·4 = 16
• A8×12 × B12×16 = C8×16
• l/√p = 2, m/√p = 3, n/√p = 4
Block Oriented Algorithm
[Figure: A8×12, B12×16, and C8×16 distributed over a 4×4 mesh. Each processor holds a 2×3 sub-matrix of A and a 3×4 sub-matrix of B, and computes a 2×4 sub-matrix of C.]
Block Oriented Algorithm
• Two algorithms, block matrix multiplication and matrix multiplication on the 2-D mesh, have been combined.
• To determine the communication time required, we take into account that in each of √p − 1 iterations, every processor sends and receives a portion of matrix A and a portion of matrix B.
• In addition, both the staggering and un-staggering of matrices A and B require √p/2 − 1 iterations in which portions of A and B are sent and received.
• Unlike the SIMD algorithm, which requires √p − 1 iterations for the staggering and un-staggering steps, this MIMD algorithm requires √p/2 iterations, because some processing elements can move blocks of A to the right while others move blocks of A to the left, and some processing elements can move blocks of B down while others move blocks of B up.
• No block begins more than √p/2 moves away from its staggered position.
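Combining the two ideas gives, in effect, the mesh algorithm applied to k × k blocks. The following is a sequential Python sketch of my own for square n × n matrices on a √p × √p arrangement of processors; the staggering here is done by direct indexing rather than by the neighbour-to-neighbour moves discussed above.

    import numpy as np

    def block_mesh_matmul(A, B, s):                  # s = sqrt(p)
        n = A.shape[0]
        k = n // s                                   # block size
        blk = lambda M, i, j: M[i*k:(i+1)*k, j*k:(j+1)*k]
        # Stagger: processor (i, j) starts with A-block (i, (i+j) mod s)
        # and B-block ((i+j) mod s, j).
        a = [[blk(A, i, (i + j) % s).copy() for j in range(s)] for i in range(s)]
        b = [[blk(B, (i + j) % s, j).copy() for j in range(s)] for i in range(s)]
        c = [[a[i][j] @ b[i][j] for j in range(s)] for i in range(s)]
        for _ in range(s - 1):
            a = [[a[i][(j + 1) % s] for j in range(s)] for i in range(s)]   # blocks of A move west
            b = [[b[(i + 1) % s][j] for j in range(s)] for i in range(s)]   # blocks of B move north
            c = [[c[i][j] + a[i][j] @ b[i][j] for j in range(s)] for i in range(s)]
        return np.block(c)

    rng = np.random.default_rng(2)
    A, B = rng.random((8, 8)), rng.random((8, 8))
    assert np.allclose(block_mesh_matmul(A, B, 4), A @ B)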
[Figure: positions of the blocks of A and B after staggering, exactly as in the element-wise SIMD mesh algorithm.]
Block Oriented Algorithm
• Hence the total communication time is:
• Both the block-oriented algorithm and the row-column-oriented algorithm require the same number of computation steps. When does the second algorithm require less communication time?
Block Oriented Algorithm
• Assume that we are multiplying two n x n matrices, where n is an integer
multiple of p.
78