CHAPTER 7: Matrix Multiplication

Sequential Matrix Multiplication

Algorithms for Processor Arrays
Matrix Multiplication on the 2-D Mesh SIMD Model
Matrix Multiplication on the 2-D Mesh
• Gentleman (1978) has shown that multiplication of two n×n matrices on the 2-D mesh SIMD model requires Ω(n) data routing steps.

A Lower Bound (Definition)
• Given a data item originally available at a single processor in some model of parallel computation, let the function σ(k) be the maximum number of processors to which the data can be transmitted in k or fewer data routing steps.
• On the 2-D mesh: σ(0) = 1, σ(1) = 5, σ(2) = 13, and in general σ(k) = 2k² + 2k + 1.
MESH Network
σ(0) = 1, σ(1) = 5, σ(2) = 13, σ(k) = 2k² + 2k + 1
[Figure: a 2-D mesh with one source processor highlighted; the sets of processors reachable in 1, 2, 3, ... routing steps illustrate the values of σ(k).]
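A quick way to sanity-check the closed form is to count the processors of a mesh that lie within Manhattan distance k of a source. The following Python sketch is mine, not part of the slides, and assumes a mesh without wraparound whose source is far from the boundary.

    def sigma(k, n=50):
        # Count processors of an n x n mesh within Manhattan distance k of a
        # central source; valid while k is small compared with n.
        src = (n // 2, n // 2)
        return sum(1 for i in range(n) for j in range(n)
                   if abs(i - src[0]) + abs(j - src[1]) <= k)

    for k in range(5):
        assert sigma(k) == 2 * k * k + 2 * k + 1    # sigma(k) = 2k² + 2k + 1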
Matrix Multiplication on the 2-D Mesh
Lemma:
Suppose two n×n matrices A and B are to be multiplied, and assume that every element of A and B is stored exactly once and that no processing element contains more than one element of either matrix.
If we ignore any data broadcasting facility, multiplying A and B to produce the n×n matrix C requires at least s data routing steps, where σ(2s) ≥ n².
Matrix Multiplication on the 2-D Mesh
• Consider an arbitrary element Ci,j of the product matrix.
• This element is the inner product of row i of matrix A and column j of matrix B.
• There must be a path from each of the processors where these elements are stored to the processor where the result Ci,j is stored.
• Let s denote the length of the longest such path; the creation of the matrix product C then takes at least s data routing steps.
Matrix Multiplication on the 2-D Mesh
• There is a path of length at most s from bu,v to the processor where Ci,v is computed, and there is also a path of length at most s from ai,j to Ci,v.
• Hence there is a path of length at most 2s from any element bu,v to every element ai,j.
• Similarly, these paths define a set of paths of length at most 2s from any element au,v to every element bi,j, where 1 ≤ i, j ≤ n.

The n² elements of A are stored in unique processors. There is a path of length at most 2s from any element of B to every element of A and from any element of A to every element of B. Therefore σ(2s) ≥ n².
Matrix Multiplication on the 2-D Mesh
An Optimal Algorithm:
Given a 2-D mesh SIMD model with wraparound connections, it is easy to devise an algorithm that uses n² processors to multiply two n × n matrices in Θ(n) time.
Matrix multiplication requires n³ multiplications. The only way that n² processing elements can complete the multiplication in Θ(n) time is for Θ(n²) processing elements to be contributing toward the result in every step.
Matrix Multiplication on the 2-D Mesh
Initial Condition
[Figure: each processor P(i,j) of the 4×4 mesh initially holds Ai,j and Bi,j:
  A00,B00  A01,B01  A02,B02  A03,B03
  A10,B10  A11,B11  A12,B12  A13,B13
  A20,B20  A21,B21  A22,B22  A23,B23
  A30,B30  A31,B31  A32,B32  A33,B33 ]
Matrix Multiplication on the 2-D Mesh
[Figure: matrices A, B, and the result C. The goal is for processor P(i,j) to end up holding
  Ci,j = Ai,0·B0,j + Ai,1·B1,j + Ai,2·B2,j + Ai,3·B3,j,
e.g. P(0,0) must accumulate A00·B00 + A01·B10 + A02·B20 + A03·B30.]
Initial Position of matrices A and B
[Figure: the initial placement of A and B on the mesh, shown next to the 4×4 table of dot products Ci,j that must be produced.]

Matrices A and B after Rotation
[Figure: the staggering in progress; row i of A rotates i positions to the west and column j of B rotates j positions to the north, using the wraparound connections.]
Final Position of matrices A and B after Staggering
[Figure: after staggering, processor P(i,j) holds A i,(i+j) mod 4 and B (i+j) mod 4,j:
  A00,B00  A01,B11  A02,B22  A03,B33
  A11,B10  A12,B21  A13,B32  A10,B03
  A22,B20  A23,B31  A20,B02  A21,B13
  A33,B30  A30,B01  A31,B12  A32,B23 ]
Matrix Multiplication on the 2-D Mesh
Global:  n, k
Local:   a, b, c

begin
  {Stagger matrices}
  for k = 1 to n-1 do
    for all P(i,j) where 1 ≤ i,j ≤ n do
      if i > k then a = east(a) endif
      if j > k then b = south(b) endif
    endfor
  endfor
  {Compute dot products}
  for all P(i,j) where 1 ≤ i,j ≤ n do
    c = a × b
  endfor
  for k = 1 to n-1 do
    for all P(i,j) where 1 ≤ i,j ≤ n do
      a = east(a)
      b = south(b)
      c = c + a × b
    endfor
  endfor
end
• The first phase of the parallel algorithm staggers the two matrices.
• The second phase computes all the products ai,k × bk,j and accumulates the sums.
• When phase 2 is complete, Ci,j = Σk=1..n ai,k × bk,j.
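The data movement described above can be checked with a short sequential simulation. The sketch below is my own Python (0-based indices, numpy arrays standing in for the grid of processing elements), not the slides' notation; it reproduces the 4×4 example traced on the following slides.

    import numpy as np

    def mesh_matmul(A, B):
        n = A.shape[0]
        a, b = A.copy(), B.copy()
        # Stagger: rotate row i of A west by i and column j of B north by j (wraparound).
        for i in range(n):
            a[i, :] = np.roll(a[i, :], -i)
        for j in range(n):
            b[:, j] = np.roll(b[:, j], -j)
        c = a * b                        # every PE multiplies its local pair
        for _ in range(n - 1):
            a = np.roll(a, -1, axis=1)   # a = east(a): values move one PE to the west
            b = np.roll(b, -1, axis=0)   # b = south(b): values move one PE to the north
            c += a * b                   # accumulate the next partial product
        return c

    A = np.array([[1, 0, 2, 3], [4, -1, 1, 5], [-2, -3, -4, 2], [-1, 2, 0, 0]])
    B = np.array([[-1, 1, 2, -3], [-5, -4, 2, -2], [3, -1, 0, 2], [1, 0, 4, 5]])
    assert (mesh_matmul(A, B) == A @ B).all()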
Worked example (A × B = C):

       A                  B                   C
   1  0  2  3        -1  1  2 -3          8  -1  14  16
   4 -1  1  5        -5 -4  2 -2          9   7  26  17
  -2 -3 -4  2         3 -1  0  2          7  14  -2  14
  -1  2  0  0         1  0  4  5         -9  -9   2  -1

[Figures: the slides trace this example through the staggering of A and B and the four multiply-rotate-accumulate steps, showing the partial sums held by each processing element after each step; the final grid holds C.]
MATRIX MULTIPLICATION
on the Hypercube SIMD Model
Matrix Multiplication on Hypercube
  No. of Processors   Time Complexity   Topology
  1                   Θ(n³)             SISD
  n                   Θ(n²)             Linear array
  n²                  Θ(n)              Mesh with wraparound connections
  n³                  Θ(log n)          Hypercube
Matrix Multiplication on Hypercube
[Figure: the 4×4 grid of dot products Ci,j = Σk Ai,k·Bk,j to be computed, as in the mesh section.]
[Figure: a 3-dimensional hypercube with processors 000 through 111.]
To multiply two n × n matrices:
• No. of PEs = n³ = (2^q)³, where n = 2^q
• Example: C2×2 = A2×2 × B2×2, so n = 2, q = 1, No. of PEs = 2³ = 8, and the hypercube dimension is 3
• PE addresses are written x2 x1 x0

  | a00 a01 |   | b00 b01 |   | a00·b00 + a01·b10   a00·b01 + a01·b11 |
  | a10 a11 | × | b10 b11 | = | a10·b00 + a11·b10   a10·b01 + a11·b11 |
Matrix Multiplication on Hypercube
• The processing elements should be thought of as filling an n×n×n lattice.
• Processor P(x), where 0 ≤ x ≤ 2^(3q) − 1, has local memory locations a, b, c, s, and t.
• When the parallel algorithm begins execution, matrix elements ai,j and bi,j, for 0 ≤ i, j ≤ n−1, are stored in variables a and b of processor P(2^q·i + j).
• After the parallel algorithm is complete, matrix element ci,j, for 0 ≤ i, j ≤ n−1, should be stored in variable c of processor P(2^q·i + j).
The (i,j)-th elements of matrices A and B are distributed over n² distinct PEs, P(2^q·i + j), where 0 ≤ i, j < n.

  P     a     b
  000   a00   b00
  001   a01   b01
  010   a10   b10
  011   a11   b11
  100
  101
  110
  111

BIT(m,l):      returns the l-th bit in the binary representation of m
BIT_COMP(m,l): returns the value of the integer formed by complementing the l-th bit of m

[Figure: the hypercube with nodes 000–111; only the face with x2 = 0 holds data initially.]
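The two helpers are simple bit operations; this Python sketch (mine, reusing the slides' names) is one possible realization.

    def BIT(m, l):
        # Return the l-th bit of the binary representation of m.
        return (m >> l) & 1

    def BIT_COMP(m, l):
        # Return the integer formed by complementing the l-th bit of m.
        return m ^ (1 << l)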
Matrix Multiplication on Hypercube
Each ai,j and bi,j is to be replicated into both layers k = 0 and k = 1; in layer k it resides at address 4·k + 2·i + j:
  k = 0:  a[2·i + j] = ai,j        b[2·i + j] = bi,j
  k = 1:  a[4 + 2·i + j] = ai,j    b[4 + 2·i + j] = bi,j

  i, j    k = 0       k = 1
  0,0     0 = 000     4 = 100
  0,1     1 = 001     5 = 101
  1,0     2 = 010     6 = 110
  1,1     3 = 011     7 = 111
for all Pm where BIT(m,2) = 1 do
  t = BIT_COMP(m,2)
  a = [t]a
  b = [t]b
endfor

  P     a     b
  000   a00   b00
  001   a01   b01
  010   a10   b10
  011   a11   b11
  100   a00   b00
  101   a01   b01
  110   a10   b10
  111   a11   b11

[Figure: each node on the face x2 = 1 receives a and b from its neighbour on the face x2 = 0.]
Matrix Multiplication on Hypercube
Where each ai,k must end up (address = 4·k + 2·i + j, for j = 0 and j = 1):
  j = 0:  a[4·k + 2·i]     = ai,k
  j = 1:  a[4·k + 2·i + 1] = ai,k

  i, k    j = 0       j = 1
  0,0     0 = 000     1 = 001
  0,1     4 = 100     5 = 101
  1,0     2 = 010     3 = 011
  1,1     6 = 110     7 = 111
for all Pm where BIT(m,0) ≠ BIT(m,2) do
  t = BIT_COMP(m,0)
  a = [t]a
endfor

  P     a     b
  000   a00   b00
  001   a00   b01
  010   a10   b10
  011   a10   b11
  100   a01   b00
  101   a01   b01
  110   a11   b10
  111   a11   b11

[Figure: within each face x2 = k, the value ai,k is copied along the x0 (j) dimension.]
Matrix Multiplication on Hypercube
Where each bk,j must end up (address = 4·k + 2·i + j, for i = 0 and i = 1):
  i = 0:  b[4·k + j]     = bk,j
  i = 1:  b[4·k + 2 + j] = bk,j

  k, j    i = 0       i = 1
  0,0     0 = 000     2 = 010
  0,1     1 = 001     3 = 011
  1,0     4 = 100     6 = 110
  1,1     5 = 101     7 = 111
for all Pm where BIT(m,1) ≠ BIT(m,2) do
  t = BIT_COMP(m,1)
  b = [t]b
endfor

  P     a     b
  000   a00   b00
  001   a00   b01
  010   a10   b00
  011   a10   b01
  100   a01   b10
  101   a01   b11
  110   a11   b10
  111   a11   b11

[Figure: within each face x2 = k, the value bk,j is copied along the x1 (i) dimension.]
Every processor Pm, with address bits (x2, x1, x0) = (k, i, j), now holds a = ai,k and b = bk,j:

  P     a     b
  000   a00   b00
  001   a00   b01
  010   a10   b00
  011   a10   b01
  100   a01   b10
  101   a01   b11
  110   a11   b10
  111   a11   b11

  | a00 a01 |   | b00 b01 |   | a00·b00 + a01·b10   a00·b01 + a01·b11 |
  | a10 a11 | × | b10 b11 | = | a10·b00 + a11·b10   a10·b01 + a11·b11 |

Each processor forms its local product:  c = a × b
for all Pm do
  t = BIT_COMP(m,2)
  s = [t]c
  c = c + s
endfor

  P     c
  000   a00·b00 + a01·b10
  001   a00·b01 + a01·b11
  010   a10·b00 + a11·b10
  011   a10·b01 + a11·b11
  100   a01·b10 + a00·b00
  101   a01·b11 + a00·b01
  110   a11·b10 + a10·b00
  111   a11·b11 + a10·b01

After this sum step, ci,j is available in variable c of processor P(2·i + j), on the face x2 = 0.
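The whole sequence of routings can be simulated sequentially. The sketch below is my own Python, not the slides' program: it collapses the dimension-by-dimension routing into direct indexing of a dictionary of "processors", but it places the same a = ai,k and b = bk,j in every processor (k, i, j) and performs the same final reduction onto layer k = 0 (processors P(2^q·i + j)).

    import numpy as np

    def hypercube_matmul(A, B):
        n = A.shape[0]
        q = n.bit_length() - 1                    # n = 2^q, so there are n^3 processors
        a, b, c = {}, {}, {}
        # Initial distribution: P(2^q * i + j) holds ai,j and bi,j.
        for i in range(n):
            for j in range(n):
                a[(i << q) | j] = A[i, j]
                b[(i << q) | j] = B[i, j]
        # Replicate layer k = 0 into layers k = 1 .. n-1.
        for m in range(n ** 3):
            if m >> (2 * q) != 0:
                a[m] = a[m & (n * n - 1)]
                b[m] = b[m & (n * n - 1)]
        # In layer k, processor (k, i, j) fetches a = ai,k and b = bk,j, then multiplies.
        for m in range(n ** 3):
            k, i, j = m >> (2 * q), (m >> q) & (n - 1), m & (n - 1)
            a[m] = a[(k << (2 * q)) | (i << q) | k]
            b[m] = b[(k << (2 * q)) | (k << q) | j]
            c[m] = a[m] * b[m]
        # Sum the n partial products of each (i, j) across the layers.
        C = np.zeros_like(A)
        for m in range(n ** 3):
            k, i, j = m >> (2 * q), (m >> q) & (n - 1), m & (n - 1)
            C[i, j] += c[m]
        return C

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])
    assert (hypercube_matmul(A, B) == A @ B).all()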
MATRIX MULTIPLICATION
on Shuffle Exchange SIMD Model
Matrix Multiplication on Shuffle Exchange
MATRIX MULTIPLICATION
on Multiprocessors
Matrix Multiplication for Multiprocessors
• When there are a number of nested
loops, all suitable for parallelization,
which loop should be made parallel?
[Figure: the 4×4 grid of dot products Ci,j = Σk Ai,k·Bk,j that the nested loops compute.]
Matrix Multiplication for Multiprocessors
• In the case of matrix multiplication, we can parallelize the j loop or the i loop.
• If the j loop is parallelized, the parallel algorithm executes n synchronizations (one per iteration of the i loop) and the grain size of the parallel code is Θ(n²/p).
• If the i loop is parallelized, the parallel algorithm executes only one synchronization and the grain size of the parallel code is Θ(n³/p).
• On most UMA multiprocessors the version with the i loop parallelized will execute faster.
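For reference, here is the triply nested loop in question, as a sequential Python sketch of my own, with the two parallelization choices marked in comments.

    def matmul(A, B, n):
        C = [[0] * n for _ in range(n)]
        for i in range(n):        # parallelizing this loop: one synchronization,
                                  # grain size Θ(n³/p)
            for j in range(n):    # parallelizing this loop instead: n synchronizations
                                  # (one per i iteration), grain size Θ(n²/p)
                for k in range(n):
                    C[i][j] += A[i][k] * B[k][j]
        return C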
Matrix Multiplication for Multiprocessors
• Each processor computes n/p rows of C.
• The time needed to compute a single row is Θ(n²).
• The processes synchronize exactly once; the synchronization overhead, then, is Θ(p).
• Hence the complexity of this parallel algorithm is Θ(n³/p + p).
• Since there are only n rows, at most n processors may execute this algorithm.
• If we ignore memory contention, we can expect speedup to be nearly linear.
Matrix Multiplication for Multiprocessors
• In a UMA multiprocessor every global memory cell is an equal distance from every processor.
• But on loosely coupled multiprocessors, where some matrix elements may be much easier to access than others, it is important to keep as many memory references as possible local.
• In this algorithm a processor not only accesses n/p rows of A, it also accesses every element of B n/p times.
• Only a single addition and a single multiplication occur for every element of B fetched.
• This is not a good ratio: {(2·n²)·(n/p)} / {(n + n²)·(n/p)} = 2n/(1+n) ≈ 2.
• Implementation of this algorithm on a BBN TC2000 would therefore yield poor speedup.
BLOCK MATRIX
MULTIPLICATION
Matrix Multiplication for Multiprocessors
Block Matrix Multiplication for Multiprocessors
• Assume that A and B are both n × n matrices, where n = 2k. Then A and B can be thought of as conglomerates of four smaller matrices, each of size k × k.
• Examples: k = 2, n = 4;  k = 3, n = 6;  k = 4, n = 8.
Block Matrix Multiplication for Multiprocessors
[Figure: the decomposition generalizes to more blocks, e.g. k = 2, n = 6 and k = 3, n = 9 give a 3 × 3 arrangement of k × k blocks.]
Block Matrix Multiplication for Multiprocessors
• If we assign processes to do the block matrix multiplications, then the number of multiplications and additions per matrix-element fetch increases.
• Assume that there are p = (n/k)² processes (p = 4 in the 2 × 2 block case above).
• The matrix multiplication is performed by dividing A and B into p blocks of size k × k.
• Each block multiplication requires 2k² memory fetches, k³ additions, and k³ multiplications.
• The number of arithmetic operations per memory access has therefore risen from 2 to k = n/√p.
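Below is a sequential Python sketch of my own of the blocked computation; in the multiprocessor version each process would own one k × k block of C and loop over the corresponding block row of A and block column of B.

    import numpy as np

    def block_matmul(A, B, k):
        n = A.shape[0]                   # assumes n is a multiple of k
        C = np.zeros((n, n))
        for bi in range(0, n, k):        # block row of C
            for bj in range(0, n, k):    # block column of C (one process per (bi, bj) block)
                for bs in range(0, n, k):
                    C[bi:bi+k, bj:bj+k] += A[bi:bi+k, bs:bs+k] @ B[bs:bs+k, bj:bj+k]
        return C

    rng = np.random.default_rng(0)
    A, B = rng.random((8, 8)), rng.random((8, 8))
    assert np.allclose(block_matmul(A, B, 2), A @ B)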
Block Matrix Multiplication for Multiprocessors
• Ostlund, Hibbard, and Whiteside (1982) have implemented this matrix multiplication algorithm on Cm* for various matrix sizes. The results of their experiment are shown in the figure.
• p = (n/k)², i.e. k = n/√p.
ALGORITHMS FOR MULTICOMPUTERS
Matrix Multiplication for Multicomputers
Row Column Oriented Algorithm
• Multiplying two n × n matrices A and B involves the computation of n² dot products.
• Each dot product is between a row of A and a column of B.
• At any moment in time every matrix element must be stored in the local memory of exactly one processor.
• It is natural to partition A into rows and B into columns.
• Assume that n is a power of two and that we are executing the algorithm on an n-processor hypercube.
• To maximize grain size we want to parallelize the outermost for loop.
Row Column Oriented Algorithm
[Figure: a 4-processor hypercube, nodes 0–3.]
n = 4: multiply two 4 × 4 matrices using a 4-processor hypercube multicomputer.
Each processor stores exactly one row of A and one column of B.
Row Column Oriented Algorithm
• A straightforward parallelization of the outer loop would demand that all parallel processes first access column 0 of B, then column 1 of B, and so on.
• This results in a sequence of broadcast steps, each having complexity Θ(log n) on an n-processor hypercube.
• Contention for shared resources can dramatically lower the performance of a parallel algorithm.
• On a multicomputer, the processor that controls the variable must broadcast its value to the other processors.
• If the order in which the processors access the data items is unimportant, we can rewrite the parallel algorithm to eliminate the contention.
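One contention-free rewriting rotates the columns of B among the processors, so that at every step each processor works on a different column. The Python sketch below is mine, not the slides' program, and simulates that pattern sequentially; the direction of the rotation is an arbitrary choice.

    import numpy as np

    def row_column_matmul(A, B):
        n = A.shape[0]                               # one "processor" per row/column
        C = np.zeros((n, n))
        cols = [B[:, i].copy() for i in range(n)]    # processor i starts with column i of B
        for step in range(n):
            for i in range(n):                       # all processors work in parallel
                j = (i + step) % n                   # index of the column processor i now holds
                C[i, j] = A[i, :] @ cols[i]
            cols = [cols[(i + 1) % n] for i in range(n)]   # pass columns to the neighbour
        return C

    rng = np.random.default_rng(1)
    A, B = rng.random((4, 4)), rng.random((4, 4))
    assert np.allclose(row_column_matmul(A, B), A @ B)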
Row Column Oriented Algorithm
This algorithm achieves reasonable performance on hypercube multicomputers. The figure illustrates the speedup achieved by a parallel implementation of the algorithm on the nCUBE 3200.
Two points about the algorithm are worth noticing:
• First, the communication time increases linearly with the number of processors.
• Second, the number of computations performed per iteration is inversely proportional to the number of processors used.
Block Oriented Algorithm
• We are to multiply a matrix A of size l × m by a matrix B of size m × n.
• Assume p is an even power of 2 (4, 16, 64, ...).
• Assume that l, m, and n are integer multiples of √p.
• Processors are organized as a two-dimensional mesh with wraparound connections, and each processor is given an (l/√p × m/√p) subsection of A and an (m/√p × n/√p) subsection of B.
• Example: p = 16, √p = 4
• l = 4·2 = 8, m = 4·3 = 12, n = 4·4 = 16
• A8×12 × B12×16 = C8×16
• l/√p = 2, m/√p = 3, n/√p = 4
Block Oriented Algorithm
[Figure: A8×12, B12×16, and C8×16 distributed over a 4×4 mesh. Each processor holds a 2×3 sub-matrix of A and a 3×4 sub-matrix of B, and computes a 2×4 sub-matrix of C.]
Block Oriented Algorithm
• Two algorithms, block matrix multiplication and matrix multiplication on the 2-D mesh, have been combined.
• To determine the communication time required, we take into account that in each of √p − 1 iterations, every processor sends and receives a portion of matrix A and a portion of matrix B.
• In addition, both the staggering and un-staggering of matrices A and B require √p/2 − 1 iterations in which portions of A and B are sent and received.
• Unlike the SIMD algorithm, which requires √p − 1 iterations for the staggering and un-staggering steps, this MIMD algorithm requires √p/2 iterations, because some processing elements can move blocks of A to the right while others move blocks of A to the left, and some processing elements can move blocks of B down while others move blocks of B up.
• No block begins more than √p/2 moves away from its staggered position.
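Combining the two ideas gives, in effect, the mesh algorithm applied to k × k blocks. The following is a sequential Python sketch of my own for square n × n matrices on a √p × √p arrangement of processors; the staggering here is done by direct indexing rather than by the neighbour-to-neighbour moves discussed above.

    import numpy as np

    def block_mesh_matmul(A, B, s):                  # s = sqrt(p)
        n = A.shape[0]
        k = n // s                                   # block size
        blk = lambda M, i, j: M[i*k:(i+1)*k, j*k:(j+1)*k]
        # Stagger: processor (i, j) starts with A-block (i, (i+j) mod s)
        # and B-block ((i+j) mod s, j).
        a = [[blk(A, i, (i + j) % s).copy() for j in range(s)] for i in range(s)]
        b = [[blk(B, (i + j) % s, j).copy() for j in range(s)] for i in range(s)]
        c = [[a[i][j] @ b[i][j] for j in range(s)] for i in range(s)]
        for _ in range(s - 1):
            a = [[a[i][(j + 1) % s] for j in range(s)] for i in range(s)]   # blocks of A move west
            b = [[b[(i + 1) % s][j] for j in range(s)] for i in range(s)]   # blocks of B move north
            c = [[c[i][j] + a[i][j] @ b[i][j] for j in range(s)] for i in range(s)]
        return np.block(c)

    rng = np.random.default_rng(2)
    A, B = rng.random((8, 8)), rng.random((8, 8))
    assert np.allclose(block_mesh_matmul(A, B, 4), A @ B)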
[Figure: positions of the blocks of A and B after staggering, exactly as in the element-wise SIMD mesh algorithm.]
Block Oriented Algorithm
• Hence the total communication time is:
• Both the block-oriented algorithm and the row-column-oriented algorithm require the same number of computation steps. When does the second algorithm require less communication time?
Block Oriented Algorithm
• Assume that we are multiplying two n x n matrices, where n is an integer
multiple of p.
78