L07-1
6.5930/1
Hardware Architectures for Deep Learning
Vectorized Kernel Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 27, 2023
L07-2
Sze and Emer
Goals of Today’s Lecture
• Understand parallelism and improved efficiency through:
– loop unrolling, and
– vectorization
February 27, 2023
L07-3
Sze and Emer
Background Reading
• Vector architectures
– Computer Architecture: A Quantitative Approach,
6th edition, by Hennessy and Patterson
• Ch 4: p282-310, App G
• Ch 4: p262-288, App G
These books and their online/e-book versions are available through
MIT libraries.
February 27, 2023
L07-4
Sze and Emer
input fmaps
H
W
C
1
Fully Connected Computation
output fmaps
…
filters
1
1
1
February 27, 2023
H
C
1
W
H
W
C
M
L07-5
Sze and Emer
input fmaps
H
W
C
1
Fully Connected Computation
output fmaps
…
filters
1
1
1
February 27, 2023
H
C
1
W
H
W
C
M
L07-6
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
February 27, 2023
L07-7
Sze and Emer
Filter Memory Layout
February 27, 2023
F[M0 C0 H0 W0] F[M0 C0 H0 W1] …
F[M0 C0 H1 W0] F[M0 C0 H1 W1] …
F[M0 C0 H2 W0] F[M0 C0 H2 W1] …
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] …
F[M0 C1 H1 W0] F[M0 C1 H1 W1] …
F[M0 C1 H2 W0] F[M0 C1 H2 W1] …
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] …
F[M1 C0 H1 W0] F[M1 C0 H1 W1] …
F[M1 C0 H2 W0] F[M1 C0 H2 W1] …
.
.
.
H
C
M
W
H
C
1
W
Filter weight
Next input
channel
Next row
Next output
channel
L07-8
Sze and Emer
Flattened FC Loops
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
CHWm = -C*H*W
for m in [0, M):
o[m] = 0
CHWm += C*H*W
for chw in [0, C*H*W):
o[m] += i[chw]
* f[CHWm + chw]
Offset to start of
current output
filter
Flattened
tensors
Loop invariant
hoisted and
strength reduced
L07-9
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-10
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-11
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-12
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-13
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-14
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-15
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-16
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H+W-1
L07-17
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H+W-1
L07-18
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H+W-1
L07-19
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-20
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-21
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-22
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-23
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-24
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-25
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-26
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-1
L07-27
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-1
L07-28
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-1
L07-29
Sze and Emer
Flattened FC Loops
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
CHWm = -C*H*W;
for m in [0, M):
o[m] = 0;
CHWm += C*H*W;
for chw in [0, C*H*W)
o[m] += i[chw]
* f[CHWm + chw]
Most of the time is
spent here!
L07-30
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-31
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-32
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-33
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-34
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-35
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-36
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-37
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-38
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
Index
calculation (f)
Loop count and
index calculation (i)
How many MACs/cycle (ignoring stalls)? ~ 1/6
Where is a major source of overhead?
L07-39
Sze and Emer
FC scalar computation
February 27, 2023
L07-40
Sze and Emer
Loop Unrolling (2chw)
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
CHWm = -C*H*W
for m in [0, M):
CHWm += C*H*W
for chw in [0, C*H*W, 2):
o[m] += (i[chw]
* f[CHWm + chw])
+ (i[chw + 1]
* f[CHWm + chw + 1]
Index calculation
amortized since
i[chw+1] => &(i+1)[chw]
Loop overhead
amortized over
more computation
Operands accessed
in pairs
Operands accessed
in pairs
Step by 2
L07-41
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-42
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-43
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-44
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-45
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-46
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-47
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
Offset constant
reflects constant index
increment
Offset constant
reflects constant index
increment
L07-48
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-49
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
Offset constant
reflects constant index
increment
Offset constant
reflects constant index
increment
L07-50
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-51
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-52
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-53
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
How many MACs/cycle (ignoring stalls)? ~ 2/9
L07-54
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-55
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-56
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-57
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-58
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-59
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-60
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-61
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-62
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-63
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-64
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-65
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-66
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-67
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-68
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-69
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-70
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-71
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-72
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-73
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
×
1
Output fmaps
M
=
February 27, 2023
Can we incorporate this “pairing” into the architecture? Of course
chw = C*H*W-2, C*H*W-1
L07-74
Sze and Emer
Vector Programming Model
February 27, 2023
+ + + + + +
[0] [1] [VLR-1]
Vector Arithmetic
Instructions
ADDV v3, v1, v2 v3
v2
v1
Scalar Registers
r0
r15
Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLR
Vector Length Register
VLRMAX – number of elements in a vector register
VLR – number of elements to use in an instruction
L07-75
Sze and Emer
Vector Programming Model
February 27, 2023
v1
Vector Load and
Store Instructions
LDV v1, r1, r2
Base, r1 Stride, r2
Memory
Vector Register
Scalar Registers
r0
r15
Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLR
Vector Length Register
L07-76
Sze and Emer
…
…
…
…
Compiler-based Vectorization
February 27, 2023
Ld Ai
Ld Bi
Add
St Zi
Ld Ai+1
Ld Bi+1
Add
St Zi+1
Ld Ai
Ld Bi
Add
St Zi
Ld Ai+1
Ld Bi+1
Add
St Zi+1
Scalar code Vector code
for i in [0:N):
Z[i] = A[i] + B[i];
Compiler recognizes independent operations
with loop dependence analysis
T
i
m
e
L07-77
Sze and Emer
Loop Unrolled
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
CHWm = C*H*W*m
for chw in [0, C*H*W, 2):
o[m] += (i[chw]
* f[CHWm + chw])
+ (i[chw + 1]
* f[CHWm + chw + 1])
}
Will this vectorize? Not in the architecture presented so far
Sequential
loads
Sequential
loads
Reduction crosses
elements
L07-78
Sze and Emer
Parallel with animation
February 27, 2023
L07-79
Sze and Emer
Fully Connected – Loop Permutation
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
for chw in [0, C*H*W, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
No output is
dependent
on another output
(other than
commutative order)
L07-80
Sze and Emer
Fully Connected – Loop Permutation
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
for chw in [0, C*H*W, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
L07-81
Sze and Emer
for chw in [0, C*H*W, 2):
for m in [0, M):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
Fully Connected – Loop Permutation
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
for chw in [0, C*H*W, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
L07-82
Sze and Emer
FC – Permuted/Unrolled
February 27, 2023
// Unrolled inner loop
for chw in [0, C*H*W):
for m in [0, M, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m+1] += i[chw] * f[CHW*(m+1) + chw]
// Loops permuted
for chw in [0, C*H*W):
for m in [0, M):
o[m] += i[chw] * f[CHW*m + chw]
Unrolled
calculation
L07-83
Sze and Emer
Parallel m animation
February 27, 2023
L07-84
Sze and Emer
// Loop invariant hoisting of i[chw]
for chw in [0, C*H*W):
i_chw = i[chw]
for m in [0, M, 2):
o[m] += i_chw * f[CHW*m + chw]
o[m+1] += i_chw * f[CHW*(m+1) + chw]
FC – Permuted/Unrolled/Hoisted
February 27, 2023
// Unrolled inner loop
for chw in [0, C*H*W):
for m in [0, M, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m+1] += i[chw] * f[CHW*(m+1) + chw]
Same for all
calculations
Load hoisted
out of loop
L07-85
Sze and Emer
Fully Connection Computation
February 27, 2023
F[M0 C0 H0 W0] F[M0 C0 H0 W1] …
F[M0 C0 H1 W0] F[M0 C0 H1 W1] …
F[M0 C0 H2 W0] F[M0 C0 H2 W1] …
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] …
F[M0 C1 H1 W0] F[M0 C1 H1 W1] …
F[M0 C1 H2 W0] F[M0 C1 H2 W1] …
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] …
F[M1 C0 H1 W0] F[M1 C0 H1 W1] …
F[M1 C0 H2 W0] F[M1 C0 H2 W1] …
.
.
.
I[C0 H0 W0] I[C0 H0 W1] …
I[C0 H1 W0] I[C0 H1 W1] …
I[C0 H2 W0] I[C0 H2 W1] …
.
.
I[C1 H0 W0] I[C1 H0 W1] …
I[C1 H1 W0] I[C1 H1 W1] …
I[C1 H2 W0] I[C1 H2 W1] …
.
.
.
// Loop invariant hosting of i[chw]
for chw in [0, C*H*W):
i_chw = i[chw];
for m in [0, M, 2):
o[m] += i_chw * f[CHW*m + chw]
o[m+1] += i_chw * f[CHW*(m+1) + chw]
Weights needed together are far apart.
What can we do
Weights needed together are far apart.
What can we do?
Access with stride or rearrange memory
L07-86
Sze and Emer
FC – Layered Loops
February 27, 2023
// Unrolled inner loop
for chw in [0, C*H*W):
i_chw = i[chw]
for m in [0, M, 2):
o[m] += i_chw * f[CHW*m + chw]
o[m+1] += i_chw * f[CHW*(m+1) + chw]
// Level 2 loops
for chw in [0, C*H*W):
i_chw = i[chw]
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i_chw * f[CHW*(m1*VL+m0) + chw]
Level 0 is a set of
vector operations
Limit of m2 (M/VL)
times limit of m1 (LV)
is M
Limit of m1 (M/VL)
times limit of m0 (LV)
is M
m = m1*VL+m0
L07-87
Sze and Emer
Einsum Rank Splitting
February 27, 2023
𝑂𝑂𝑚𝑚1,𝑚𝑚0 = 𝐼𝐼𝑐𝑐ℎ𝑤𝑤 × 𝐹𝐹𝑚𝑚1,𝑚𝑚0,𝑐𝑐ℎ𝑤𝑤
→ 𝐹𝐹𝑚𝑚1,𝑚𝑚0,𝑐𝑐ℎ𝑤𝑤
𝑂𝑂𝑚𝑚 = 𝐼𝐼𝑐𝑐ℎ𝑤𝑤 × 𝐹𝐹𝑚𝑚,𝑐𝑐ℎ𝑤𝑤
𝑂𝑂𝑚𝑚 → 𝑂𝑂𝑚𝑚1×𝑉𝑉𝑉𝑉+𝑚𝑚0 → 𝑂𝑂𝑚𝑚1,𝑚𝑚0
→ 𝐹𝐹𝑚𝑚1×𝑉𝑉𝑉𝑉+𝑚𝑚0,𝑐𝑐ℎ𝑤𝑤
𝐹𝐹𝑚𝑚,𝑐𝑐ℎ𝑤𝑤
𝑂𝑂𝑚𝑚 = 𝐼𝐼𝑐𝑐ℎ𝑤𝑤 × 𝐹𝐹𝑚𝑚,𝑐𝑐ℎ𝑤𝑤
L07-88
Sze and Emer
FC – Layered Loops
// Level 2 loops
for chw in [0, C*H*W):
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1][m0] += i[chw] * f[m1][m0][chw]
// Level 2 loops
for chw in [0, C*H*W):
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw]
February 27, 2023
Flatten data structures
L07-89
Sze and Emer
FC – Layered Loops
February 27, 2023
// Level 2 loops
for chw in [0, C*H*W):
i_chw = i[chw]
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i_chw * f[VL*CWH*m1+CWH*m0+chw]
// Level 1 loops
m1VL = m1*VL
CHWVLm1_chw = CHW*VL*m1 + chw
parallel_for m0 in [0, VL):
o[m1VL+m0] += i_chw * f[CHWVLm1_chw + CHW*m0]
Invariant in inner loop!
Stride!
Hoist Loop
Invariant!
L07-90
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-91
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-92
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-93
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-94
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-95
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
L07-96
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
L07-97
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
L07-98
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
How many MACs/cycle (ignoring stalls)? ~ VL/7
Can we unroll this to get even more? Yes
L07-99
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
How many MACs/cycle (ignoring stalls)? ~ VL/7
Can we unroll this to get even more? Yes
L07-100
Sze and Emer
FC – Layered Loops
February 27, 2023
// Level 2 loops
for chw in [0, C*H*W):
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in VL):
o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw]
// Level 2 loops
for m1 in [0, M/VL):
for chw in [0, C*H*W):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw]
No constraints
on loop
permutations!
Loop order
affects where
loop invariants
can be moved
L07-101
Sze and Emer
Vector ISA Attributes
• Compact
– one short instruction encodes N operations
– many implicit bookkeeping/control operations
• Expressive, hardware knows the N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in same pattern as previous instructions
– access a contiguous block of memory
(unit-stride load/store)
– access memory in a known pattern
(strided load/store)
February 27, 2023
Vector instructions make “explicit” many things that are “implicit” with standard instructions
L07-102
Sze and Emer
Vector ISA Hardware Implications
• Large amount of work per instruction
-> Less instruction fetch bandwidth requirements
-> Allows simplified instruction fetch design
• Implicit bookkeeping operations
-> Bookkeeping can run in parallel with main compute
• Disjoint vector element accesses
-> Banked rather than multi-ported register files
• No data dependence within a vector
-> Amenable to deeply pipelined/parallel designs
• Known regular memory access pattern
-> Allows for banked memory for higher bandwidth
February 27, 2023
L07-103
Sze and Emer
• Use deep pipeline (=> fast clock) to
execute element operations
• Simplifies control of deep pipeline because
elements in vector are independent (=> no
hazards!)
Six stage multiply pipeline
Vector Arithmetic Execution
February 27, 2023
V
1
V
2
V
3
V3 <- v1 * v2
L07-104
Sze and Emer
Vector Instruction Execution
February 27, 2023
ADDV C,A,B
Execution using
one pipelined
functional unit
Execution using
four pipelined
functional units
C[1]
C[2]
C[0]
A[3] B[3]
A[4] B[4]
A[5] B[5]
A[6] B[6]
C[4]
C[8]
C[0]
A[12] B[12]
A[16] B[16]
A[20] B[20]
A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]
A[17] B[17]
A[21] B[21]
A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]
A[18] B[18]
A[22] B[22]
A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]
A[19] B[19]
A[23] B[23]
A[27] B[27]
L07-105
Sze and Emer
Vector Unit Structure
February 27, 2023
Lane
Function
Unit
(Adder)
Vector
Registers
Memory Subsystem
Elements
0, 4, 8, …
Elements
1, 5, 9, …
Elements
2, 6, 10, …
Elements
3, 7, 11, …
Function
Unit
(Mult)
L07-106
Sze and Emer
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
February 27, 2023
Complete 24 operations/cycle while issuing 1 short instruction/cycle
load
load
mul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Instruction
issue
L07-107
Sze and Emer
ISA Datatypes
FP32
FP16
Int32
Int16
Int8
S E M
1 8 23
S E M
1 5 10
M
31
S
S M
1
1 15
S M
1 7
Range Accuracy
10-38 – 1038 .000006%
6x10-5 - 6x104 .05%
0 – 2x109 ½
0 – 6x104 ½
0 – 127 ½
Image Source: B. Dally
February 27, 2023
L07-108
Sze and Emer
Intel – MMX/SSE/AVX
Width Int8 Int16 Int 32 Int64 FP16 FP32 FP64 Features
MMX 64 8 4 2 1
SSE 128 4
SSE2 128 16 8 4 2 4 2
SSE3 128 16 8 4 2 4 2 R
AVX 256 32 16 8 4 16 8 4
AVX2 256 32 16 8 4 16 8 4 GUMR
AVX3 512 64 32 16 8 ? 16 8 GUMRP
February 27, 2023
G: gather
U: unaligned
M: MAC
R: reductions/permutations
P: Predicate masks
Source: Myriad non-authoritative sources on web
L07-109
Sze and Emer
Python to C++ Chart
February 22, 2023
[Leiserson, There’s plenty of room at the top, Science, 2020]
L07-110
Sze and Emer
Next Lecture: Roofline Analysis and Transforms
Thank you!
February 27, 2023

L07.pdf

  • 1.
    L07-1 6.5930/1 Hardware Architectures forDeep Learning Vectorized Kernel Computation Joel Emer and Vivienne Sze Massachusetts Institute of Technology Electrical Engineering & Computer Science February 27, 2023
  • 2.
    L07-2 Sze and Emer Goalsof Today’s Lecture • Understand parallelism and improved efficiency through: – loop unrolling, and – vectorization February 27, 2023
  • 3.
    L07-3 Sze and Emer BackgroundReading • Vector architectures – Computer Architecture: A Quantitative Approach, 6th edition, by Hennessy and Patterson • Ch 4: p282-310, App G • Ch 4: p262-288, App G These books and their online/e-book versions are available through MIT libraries. February 27, 2023
  • 4.
    L07-4 Sze and Emer inputfmaps H W C 1 Fully Connected Computation output fmaps … filters 1 1 1 February 27, 2023 H C 1 W H W C M
  • 5.
    L07-5 Sze and Emer inputfmaps H W C 1 Fully Connected Computation output fmaps … filters 1 1 1 February 27, 2023 H C 1 W H W C M
  • 6.
    L07-6 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = • Matrix–Vector Multiply: • Multiply all inputs in all channels by a weight and sum February 27, 2023
  • 7.
    L07-7 Sze and Emer FilterMemory Layout February 27, 2023 F[M0 C0 H0 W0] F[M0 C0 H0 W1] … F[M0 C0 H1 W0] F[M0 C0 H1 W1] … F[M0 C0 H2 W0] F[M0 C0 H2 W1] … . . F[M0 C1 H0 W0] F[M0 C1 H0 W1] … F[M0 C1 H1 W0] F[M0 C1 H1 W1] … F[M0 C1 H2 W0] F[M0 C1 H2 W1] … . . . F[M1 C0 H0 W0] F[M1 C0 H0 W1] … F[M1 C0 H1 W0] F[M1 C0 H1 W1] … F[M1 C0 H2 W0] F[M1 C0 H2 W1] … . . . H C M W H C 1 W Filter weight Next input channel Next row Next output channel
  • 8.
    L07-8 Sze and Emer FlattenedFC Loops February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations CHWm = -C*H*W for m in [0, M): o[m] = 0 CHWm += C*H*W for chw in [0, C*H*W): o[m] += i[chw] * f[CHWm + chw] Offset to start of current output filter Flattened tensors Loop invariant hoisted and strength reduced
  • 9.
    L07-9 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 10.
    L07-10 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 11.
    L07-11 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 12.
    L07-12 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 13.
    L07-13 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 14.
    L07-14 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 15.
    L07-15 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 16.
    L07-16 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H+W-1
  • 17.
    L07-17 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H+W-1
  • 18.
    L07-18 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H+W-1
  • 19.
    L07-19 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 20.
    L07-20 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 21.
    L07-21 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0
  • 22.
    L07-22 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 23.
    L07-23 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 24.
    L07-24 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 25.
    L07-25 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 1
  • 26.
    L07-26 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-1
  • 27.
    L07-27 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-1
  • 28.
    L07-28 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-1
  • 29.
    L07-29 Sze and Emer FlattenedFC Loops February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations CHWm = -C*H*W; for m in [0, M): o[m] = 0; CHWm += C*H*W; for chw in [0, C*H*W) o[m] += i[chw] * f[CHWm + chw] Most of the time is spent here!
  • 30.
    L07-30 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 31.
    L07-31 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 32.
    L07-32 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 33.
    L07-33 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 34.
    L07-34 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 35.
    L07-35 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 36.
    L07-36 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 37.
    L07-37 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 38.
    L07-38 Sze and Emer LoopIteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop Index calculation (f) Loop count and index calculation (i) How many MACs/cycle (ignoring stalls)? ~ 1/6 Where is a major source of overhead?
  • 39.
    L07-39 Sze and Emer FCscalar computation February 27, 2023
  • 40.
    L07-40 Sze and Emer LoopUnrolling (2chw) February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations CHWm = -C*H*W for m in [0, M): CHWm += C*H*W for chw in [0, C*H*W, 2): o[m] += (i[chw] * f[CHWm + chw]) + (i[chw + 1] * f[CHWm + chw + 1] Index calculation amortized since i[chw+1] => &(i+1)[chw] Loop overhead amortized over more computation Operands accessed in pairs Operands accessed in pairs Step by 2
  • 41.
    L07-41 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 42.
    L07-42 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 43.
    L07-43 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 44.
    L07-44 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 45.
    L07-45 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 46.
    L07-46 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 47.
    L07-47 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop Offset constant reflects constant index increment Offset constant reflects constant index increment
  • 48.
    L07-48 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 49.
    L07-49 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop Offset constant reflects constant index increment Offset constant reflects constant index increment
  • 50.
    L07-50 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 51.
    L07-51 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 52.
    L07-52 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 53.
    L07-53 Sze and Emer FullyConnected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop How many MACs/cycle (ignoring stalls)? ~ 2/9
  • 54.
    L07-54 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 55.
    L07-55 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 56.
    L07-56 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 57.
    L07-57 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 58.
    L07-58 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 59.
    L07-59 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 60.
    L07-60 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 61.
    L07-61 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 62.
    L07-62 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 63.
    L07-63 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 64.
    L07-64 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 65.
    L07-65 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 66.
    L07-66 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 67.
    L07-67 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 68.
    L07-68 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 69.
    L07-69 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 70.
    L07-70 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 71.
    L07-71 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 72.
    L07-72 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 73.
    L07-73 Sze and Emer Fully-Connected(FC) Layer M CHW CHW 1 Filters Input fmaps × 1 Output fmaps M = February 27, 2023 Can we incorporate this “pairing” into the architecture? Of course chw = C*H*W-2, C*H*W-1
  • 74.
    L07-74 Sze and Emer VectorProgramming Model February 27, 2023 + + + + + + [0] [1] [VLR-1] Vector Arithmetic Instructions ADDV v3, v1, v2 v3 v2 v1 Scalar Registers r0 r15 Vector Registers v0 v15 [0] [1] [2] [VLRMAX-1] VLR Vector Length Register VLRMAX – number of elements in a vector register VLR – number of elements to use in an instruction
  • 75.
    L07-75 Sze and Emer VectorProgramming Model February 27, 2023 v1 Vector Load and Store Instructions LDV v1, r1, r2 Base, r1 Stride, r2 Memory Vector Register Scalar Registers r0 r15 Vector Registers v0 v15 [0] [1] [2] [VLRMAX-1] VLR Vector Length Register
  • 76.
    L07-76 Sze and Emer … … … … Compiler-basedVectorization February 27, 2023 Ld Ai Ld Bi Add St Zi Ld Ai+1 Ld Bi+1 Add St Zi+1 Ld Ai Ld Bi Add St Zi Ld Ai+1 Ld Bi+1 Add St Zi+1 Scalar code Vector code for i in [0:N): Z[i] = A[i] + B[i]; Compiler recognizes independent operations with loop dependence analysis T i m e
  • 77.
    L07-77 Sze and Emer LoopUnrolled February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): CHWm = C*H*W*m for chw in [0, C*H*W, 2): o[m] += (i[chw] * f[CHWm + chw]) + (i[chw + 1] * f[CHWm + chw + 1]) } Will this vectorize? Not in the architecture presented so far Sequential loads Sequential loads Reduction crosses elements
  • 78.
    L07-78 Sze and Emer Parallelwith animation February 27, 2023
  • 79.
    L07-79 Sze and Emer FullyConnected – Loop Permutation February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): for chw in [0, C*H*W, 2): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1] No output is dependent on another output (other than commutative order)
  • 80.
    L07-80 Sze and Emer FullyConnected – Loop Permutation February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): for chw in [0, C*H*W, 2): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1]
  • 81.
    L07-81 Sze and Emer forchw in [0, C*H*W, 2): for m in [0, M): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1] Fully Connected – Loop Permutation February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): for chw in [0, C*H*W, 2): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1]
  • 82.
    L07-82 Sze and Emer FC– Permuted/Unrolled February 27, 2023 // Unrolled inner loop for chw in [0, C*H*W): for m in [0, M, 2): o[m] += i[chw] * f[CHW*m + chw] o[m+1] += i[chw] * f[CHW*(m+1) + chw] // Loops permuted for chw in [0, C*H*W): for m in [0, M): o[m] += i[chw] * f[CHW*m + chw] Unrolled calculation
  • 83.
    L07-83 Sze and Emer Parallelm animation February 27, 2023
  • 84.
    L07-84 Sze and Emer //Loop invariant hoisting of i[chw] for chw in [0, C*H*W): i_chw = i[chw] for m in [0, M, 2): o[m] += i_chw * f[CHW*m + chw] o[m+1] += i_chw * f[CHW*(m+1) + chw] FC – Permuted/Unrolled/Hoisted February 27, 2023 // Unrolled inner loop for chw in [0, C*H*W): for m in [0, M, 2): o[m] += i[chw] * f[CHW*m + chw] o[m+1] += i[chw] * f[CHW*(m+1) + chw] Same for all calculations Load hoisted out of loop
  • 85.
    L07-85 Sze and Emer FullyConnection Computation February 27, 2023 F[M0 C0 H0 W0] F[M0 C0 H0 W1] … F[M0 C0 H1 W0] F[M0 C0 H1 W1] … F[M0 C0 H2 W0] F[M0 C0 H2 W1] … . . F[M0 C1 H0 W0] F[M0 C1 H0 W1] … F[M0 C1 H1 W0] F[M0 C1 H1 W1] … F[M0 C1 H2 W0] F[M0 C1 H2 W1] … . . . F[M1 C0 H0 W0] F[M1 C0 H0 W1] … F[M1 C0 H1 W0] F[M1 C0 H1 W1] … F[M1 C0 H2 W0] F[M1 C0 H2 W1] … . . . I[C0 H0 W0] I[C0 H0 W1] … I[C0 H1 W0] I[C0 H1 W1] … I[C0 H2 W0] I[C0 H2 W1] … . . I[C1 H0 W0] I[C1 H0 W1] … I[C1 H1 W0] I[C1 H1 W1] … I[C1 H2 W0] I[C1 H2 W1] … . . . // Loop invariant hosting of i[chw] for chw in [0, C*H*W): i_chw = i[chw]; for m in [0, M, 2): o[m] += i_chw * f[CHW*m + chw] o[m+1] += i_chw * f[CHW*(m+1) + chw] Weights needed together are far apart. What can we do Weights needed together are far apart. What can we do? Access with stride or rearrange memory
  • 86.
    L07-86 Sze and Emer FC– Layered Loops February 27, 2023 // Unrolled inner loop for chw in [0, C*H*W): i_chw = i[chw] for m in [0, M, 2): o[m] += i_chw * f[CHW*m + chw] o[m+1] += i_chw * f[CHW*(m+1) + chw] // Level 2 loops for chw in [0, C*H*W): i_chw = i[chw] for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i_chw * f[CHW*(m1*VL+m0) + chw] Level 0 is a set of vector operations Limit of m2 (M/VL) times limit of m1 (LV) is M Limit of m1 (M/VL) times limit of m0 (LV) is M m = m1*VL+m0
  • 87.
    L07-87 Sze and Emer EinsumRank Splitting February 27, 2023 𝑂𝑂𝑚𝑚1,𝑚𝑚0 = 𝐼𝐼𝑐𝑐ℎ𝑤𝑤 × 𝐹𝐹𝑚𝑚1,𝑚𝑚0,𝑐𝑐ℎ𝑤𝑤 → 𝐹𝐹𝑚𝑚1,𝑚𝑚0,𝑐𝑐ℎ𝑤𝑤 𝑂𝑂𝑚𝑚 = 𝐼𝐼𝑐𝑐ℎ𝑤𝑤 × 𝐹𝐹𝑚𝑚,𝑐𝑐ℎ𝑤𝑤 𝑂𝑂𝑚𝑚 → 𝑂𝑂𝑚𝑚1×𝑉𝑉𝑉𝑉+𝑚𝑚0 → 𝑂𝑂𝑚𝑚1,𝑚𝑚0 → 𝐹𝐹𝑚𝑚1×𝑉𝑉𝑉𝑉+𝑚𝑚0,𝑐𝑐ℎ𝑤𝑤 𝐹𝐹𝑚𝑚,𝑐𝑐ℎ𝑤𝑤 𝑂𝑂𝑚𝑚 = 𝐼𝐼𝑐𝑐ℎ𝑤𝑤 × 𝐹𝐹𝑚𝑚,𝑐𝑐ℎ𝑤𝑤
  • 88.
    L07-88 Sze and Emer FC– Layered Loops // Level 2 loops for chw in [0, C*H*W): for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1][m0] += i[chw] * f[m1][m0][chw] // Level 2 loops for chw in [0, C*H*W): for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw] February 27, 2023 Flatten data structures
  • 89.
    L07-89 Sze and Emer FC– Layered Loops February 27, 2023 // Level 2 loops for chw in [0, C*H*W): i_chw = i[chw] for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i_chw * f[VL*CWH*m1+CWH*m0+chw] // Level 1 loops m1VL = m1*VL CHWVLm1_chw = CHW*VL*m1 + chw parallel_for m0 in [0, VL): o[m1VL+m0] += i_chw * f[CHWVLm1_chw + CHW*m0] Invariant in inner loop! Stride! Hoist Loop Invariant!
  • 90.
    L07-90 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 91.
    L07-91 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 92.
    L07-92 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 93.
    L07-93 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 94.
    L07-94 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 95.
    L07-95 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced
  • 96.
    L07-96 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced
  • 97.
    L07-97 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced
  • 98.
    L07-98 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced How many MACs/cycle (ignoring stalls)? ~ VL/7 Can we unroll this to get even more? Yes
  • 99.
    L07-99 Sze and Emer FullConnected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced How many MACs/cycle (ignoring stalls)? ~ VL/7 Can we unroll this to get even more? Yes
  • 100.
    L07-100 Sze and Emer FC– Layered Loops February 27, 2023 // Level 2 loops for chw in [0, C*H*W): for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in VL): o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw] // Level 2 loops for m1 in [0, M/VL): for chw in [0, C*H*W): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw] No constraints on loop permutations! Loop order affects where loop invariants can be moved
  • 101.
    L07-101 Sze and Emer VectorISA Attributes • Compact – one short instruction encodes N operations – many implicit bookkeeping/control operations • Expressive, hardware knows the N operations: – are independent – use the same functional unit – access disjoint registers – access registers in same pattern as previous instructions – access a contiguous block of memory (unit-stride load/store) – access memory in a known pattern (strided load/store) February 27, 2023 Vector instructions make “explicit” many things that are “implicit” with standard instructions
  • 102.
    L07-102 Sze and Emer VectorISA Hardware Implications • Large amount of work per instruction -> Less instruction fetch bandwidth requirements -> Allows simplified instruction fetch design • Implicit bookkeeping operations -> Bookkeeping can run in parallel with main compute • Disjoint vector element accesses -> Banked rather than multi-ported register files • No data dependence within a vector -> Amenable to deeply pipelined/parallel designs • Known regular memory access pattern -> Allows for banked memory for higher bandwidth February 27, 2023
  • 103.
    L07-103 Sze and Emer •Use deep pipeline (=> fast clock) to execute element operations • Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!) Six stage multiply pipeline Vector Arithmetic Execution February 27, 2023 V 1 V 2 V 3 V3 <- v1 * v2
  • 104.
    L07-104 Sze and Emer VectorInstruction Execution February 27, 2023 ADDV C,A,B Execution using one pipelined functional unit Execution using four pipelined functional units C[1] C[2] C[0] A[3] B[3] A[4] B[4] A[5] B[5] A[6] B[6] C[4] C[8] C[0] A[12] B[12] A[16] B[16] A[20] B[20] A[24] B[24] C[5] C[9] C[1] A[13] B[13] A[17] B[17] A[21] B[21] A[25] B[25] C[6] C[10] C[2] A[14] B[14] A[18] B[18] A[22] B[22] A[26] B[26] C[7] C[11] C[3] A[15] B[15] A[19] B[19] A[23] B[23] A[27] B[27]
  • 105.
    L07-105 Sze and Emer VectorUnit Structure February 27, 2023 Lane Function Unit (Adder) Vector Registers Memory Subsystem Elements 0, 4, 8, … Elements 1, 5, 9, … Elements 2, 6, 10, … Elements 3, 7, 11, … Function Unit (Mult)
  • 106.
    L07-106 Sze and Emer VectorInstruction Parallelism Can overlap execution of multiple vector instructions – example machine has 32 elements per vector register and 8 lanes February 27, 2023 Complete 24 operations/cycle while issuing 1 short instruction/cycle load load mul mul add add Load Unit Multiply Unit Add Unit time Instruction issue
  • 107.
    L07-107 Sze and Emer ISADatatypes FP32 FP16 Int32 Int16 Int8 S E M 1 8 23 S E M 1 5 10 M 31 S S M 1 1 15 S M 1 7 Range Accuracy 10-38 – 1038 .000006% 6x10-5 - 6x104 .05% 0 – 2x109 ½ 0 – 6x104 ½ 0 – 127 ½ Image Source: B. Dally February 27, 2023
  • 108.
    L07-108 Sze and Emer Intel– MMX/SSE/AVX Width Int8 Int16 Int 32 Int64 FP16 FP32 FP64 Features MMX 64 8 4 2 1 SSE 128 4 SSE2 128 16 8 4 2 4 2 SSE3 128 16 8 4 2 4 2 R AVX 256 32 16 8 4 16 8 4 AVX2 256 32 16 8 4 16 8 4 GUMR AVX3 512 64 32 16 8 ? 16 8 GUMRP February 27, 2023 G: gather U: unaligned M: MAC R: reductions/permutations P: Predicate masks Source: Myriad non-authoritative sources on web
  • 109.
    L07-109 Sze and Emer Pythonto C++ Chart February 22, 2023 [Leiserson, There’s plenty of room at the top, Science, 2020]
  • 110.
    L07-110 Sze and Emer NextLecture: Roofline Analysis and Transforms Thank you! February 27, 2023