SlideShare a Scribd company logo
1 of 110
Download to read offline
L07-1
6.5930/1
Hardware Architectures for Deep Learning
Vectorized Kernel Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 27, 2023
L07-2
Sze and Emer
Goals of Todayโ€™s Lecture
โ€ข Understand parallelism and improved efficiency through:
โ€“ loop unrolling, and
โ€“ vectorization
February 27, 2023
L07-3
Sze and Emer
Background Reading
โ€ข Vector architectures
โ€“ Computer Architecture: A Quantitative Approach,
6th edition, by Hennessy and Patterson
โ€ข Ch 4: p282-310, App G
โ€ข Ch 4: p262-288, App G
These books and their online/e-book versions are available through
MIT libraries.
February 27, 2023
L07-4
Sze and Emer
input fmaps
H
W
C
1
Fully Connected Computation
output fmaps
โ€ฆ
filters
1
1
1
February 27, 2023
H
C
1
W
H
W
C
M
L07-5
Sze and Emer
input fmaps
H
W
C
1
Fully Connected Computation
output fmaps
โ€ฆ
filters
1
1
1
February 27, 2023
H
C
1
W
H
W
C
M
L07-6
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
โ€ข Matrixโ€“Vector Multiply:
โ€ข Multiply all inputs in all channels by a weight and sum
February 27, 2023
L07-7
Sze and Emer
Filter Memory Layout
February 27, 2023
F[M0 C0 H0 W0] F[M0 C0 H0 W1] โ€ฆ
F[M0 C0 H1 W0] F[M0 C0 H1 W1] โ€ฆ
F[M0 C0 H2 W0] F[M0 C0 H2 W1] โ€ฆ
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] โ€ฆ
F[M0 C1 H1 W0] F[M0 C1 H1 W1] โ€ฆ
F[M0 C1 H2 W0] F[M0 C1 H2 W1] โ€ฆ
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] โ€ฆ
F[M1 C0 H1 W0] F[M1 C0 H1 W1] โ€ฆ
F[M1 C0 H2 W0] F[M1 C0 H2 W1] โ€ฆ
.
.
.
H
C
M
W
H
C
1
W
Filter weight
Next input
channel
Next row
Next output
channel
L07-8
Sze and Emer
Flattened FC Loops
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
CHWm = -C*H*W
for m in [0, M):
o[m] = 0
CHWm += C*H*W
for chw in [0, C*H*W):
o[m] += i[chw]
* f[CHWm + chw]
Offset to start of
current output
filter
Flattened
tensors
Loop invariant
hoisted and
strength reduced
L07-9
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-10
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-11
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-12
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-13
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-14
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-15
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-16
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H+W-1
L07-17
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H+W-1
L07-18
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H+W-1
L07-19
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-20
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-21
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0
L07-22
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-23
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-24
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-25
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 1
L07-26
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-1
L07-27
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-1
L07-28
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-1
L07-29
Sze and Emer
Flattened FC Loops
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
CHWm = -C*H*W;
for m in [0, M):
o[m] = 0;
CHWm += C*H*W;
for chw in [0, C*H*W)
o[m] += i[chw]
* f[CHWm + chw]
Most of the time is
spent here!
L07-30
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-31
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-32
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-33
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-34
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-35
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-36
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-37
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
L07-38
Sze and Emer
Loop Iteration Overhead
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm)
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHWm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
add r2, r2, 1 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, 1
blt r1, M, mloop
Index
calculation (f)
Loop count and
index calculation (i)
How many MACs/cycle (ignoring stalls)? ~ 1/6
Where is a major source of overhead?
L07-39
Sze and Emer
FC scalar computation
February 27, 2023
L07-40
Sze and Emer
Loop Unrolling (2chw)
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
CHWm = -C*H*W
for m in [0, M):
CHWm += C*H*W
for chw in [0, C*H*W, 2):
o[m] += (i[chw]
* f[CHWm + chw])
+ (i[chw + 1]
* f[CHWm + chw + 1]
Index calculation
amortized since
i[chw+1] => &(i+1)[chw]
Loop overhead
amortized over
more computation
Operands accessed
in pairs
Operands accessed
in pairs
Step by 2
L07-41
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-42
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-43
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-44
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-45
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-46
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-47
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
Offset constant
reflects constant index
increment
Offset constant
reflects constant index
increment
L07-48
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-49
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
Offset constant
reflects constant index
increment
Offset constant
reflects constant index
increment
L07-50
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-51
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-52
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
L07-53
Sze and Emer
Fully Connected - Unrolled
February 27, 2023
mv r1, 0 # r1 holds m
mloop: mul r3, r1, C*H*W # r3 holds m*CHW
mv r2, 0 # r2 holds chw
mv r8, 0 # r8 holds psum (o[m])
xloop: ld r4, i(r2) # r4 = i[chw]
add r5, r2, r3 # r5 = CHMm + chw
ld r6, f(r5) # r6 = f[CHWm + chw]
mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw]
ld r7, i+1(r2) # r7 = i[chw + 1]
ld r9, f+1(r5) # r9 = f[CHWm + chw + 1]
mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1]
add r2, r2, 2 # r2 = chw + 1
blt r2, C*W*H, xloop
st r8, o(r1)
add r1, r1, 1
blt r1, M, mloop
How many MACs/cycle (ignoring stalls)? ~ 2/9
L07-54
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-55
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-56
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-57
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-58
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-59
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-60
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-61
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-62
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-63
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-64
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-65
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-66
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 0,1
L07-67
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-68
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-69
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = 2,3
L07-70
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-71
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-72
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
chw = C*H*W-2, C*H*W-1
L07-73
Sze and Emer
Fully-Connected (FC) Layer
M
CHW
CHW
1
Filters Input fmaps
ร—
1
Output fmaps
M
=
February 27, 2023
Can we incorporate this โ€œpairingโ€ into the architecture? Of course
chw = C*H*W-2, C*H*W-1
L07-74
Sze and Emer
Vector Programming Model
February 27, 2023
+ + + + + +
[0] [1] [VLR-1]
Vector Arithmetic
Instructions
ADDV v3, v1, v2 v3
v2
v1
Scalar Registers
r0
r15
Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLR
Vector Length Register
VLRMAX โ€“ number of elements in a vector register
VLR โ€“ number of elements to use in an instruction
L07-75
Sze and Emer
Vector Programming Model
February 27, 2023
v1
Vector Load and
Store Instructions
LDV v1, r1, r2
Base, r1 Stride, r2
Memory
Vector Register
Scalar Registers
r0
r15
Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLR
Vector Length Register
L07-76
Sze and Emer
โ€ฆ
โ€ฆ
โ€ฆ
โ€ฆ
Compiler-based Vectorization
February 27, 2023
Ld Ai
Ld Bi
Add
St Zi
Ld Ai+1
Ld Bi+1
Add
St Zi+1
Ld Ai
Ld Bi
Add
St Zi
Ld Ai+1
Ld Bi+1
Add
St Zi+1
Scalar code Vector code
for i in [0:N):
Z[i] = A[i] + B[i];
Compiler recognizes independent operations
with loop dependence analysis
T
i
m
e
L07-77
Sze and Emer
Loop Unrolled
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
CHWm = C*H*W*m
for chw in [0, C*H*W, 2):
o[m] += (i[chw]
* f[CHWm + chw])
+ (i[chw + 1]
* f[CHWm + chw + 1])
}
Will this vectorize? Not in the architecture presented so far
Sequential
loads
Sequential
loads
Reduction crosses
elements
L07-78
Sze and Emer
Parallel with animation
February 27, 2023
L07-79
Sze and Emer
Fully Connected โ€“ Loop Permutation
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
for chw in [0, C*H*W, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
No output is
dependent
on another output
(other than
commutative order)
L07-80
Sze and Emer
Fully Connected โ€“ Loop Permutation
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
for chw in [0, C*H*W, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
L07-81
Sze and Emer
for chw in [0, C*H*W, 2):
for m in [0, M):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
Fully Connected โ€“ Loop Permutation
February 27, 2023
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
for chw in [0, C*H*W, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m] += i[chw + 1] * f[CHW*m + chw + 1]
L07-82
Sze and Emer
FC โ€“ Permuted/Unrolled
February 27, 2023
// Unrolled inner loop
for chw in [0, C*H*W):
for m in [0, M, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m+1] += i[chw] * f[CHW*(m+1) + chw]
// Loops permuted
for chw in [0, C*H*W):
for m in [0, M):
o[m] += i[chw] * f[CHW*m + chw]
Unrolled
calculation
L07-83
Sze and Emer
Parallel m animation
February 27, 2023
L07-84
Sze and Emer
// Loop invariant hoisting of i[chw]
for chw in [0, C*H*W):
i_chw = i[chw]
for m in [0, M, 2):
o[m] += i_chw * f[CHW*m + chw]
o[m+1] += i_chw * f[CHW*(m+1) + chw]
FC โ€“ Permuted/Unrolled/Hoisted
February 27, 2023
// Unrolled inner loop
for chw in [0, C*H*W):
for m in [0, M, 2):
o[m] += i[chw] * f[CHW*m + chw]
o[m+1] += i[chw] * f[CHW*(m+1) + chw]
Same for all
calculations
Load hoisted
out of loop
L07-85
Sze and Emer
Fully Connection Computation
February 27, 2023
F[M0 C0 H0 W0] F[M0 C0 H0 W1] โ€ฆ
F[M0 C0 H1 W0] F[M0 C0 H1 W1] โ€ฆ
F[M0 C0 H2 W0] F[M0 C0 H2 W1] โ€ฆ
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] โ€ฆ
F[M0 C1 H1 W0] F[M0 C1 H1 W1] โ€ฆ
F[M0 C1 H2 W0] F[M0 C1 H2 W1] โ€ฆ
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] โ€ฆ
F[M1 C0 H1 W0] F[M1 C0 H1 W1] โ€ฆ
F[M1 C0 H2 W0] F[M1 C0 H2 W1] โ€ฆ
.
.
.
I[C0 H0 W0] I[C0 H0 W1] โ€ฆ
I[C0 H1 W0] I[C0 H1 W1] โ€ฆ
I[C0 H2 W0] I[C0 H2 W1] โ€ฆ
.
.
I[C1 H0 W0] I[C1 H0 W1] โ€ฆ
I[C1 H1 W0] I[C1 H1 W1] โ€ฆ
I[C1 H2 W0] I[C1 H2 W1] โ€ฆ
.
.
.
// Loop invariant hosting of i[chw]
for chw in [0, C*H*W):
i_chw = i[chw];
for m in [0, M, 2):
o[m] += i_chw * f[CHW*m + chw]
o[m+1] += i_chw * f[CHW*(m+1) + chw]
Weights needed together are far apart.
What can we do
Weights needed together are far apart.
What can we do?
Access with stride or rearrange memory
L07-86
Sze and Emer
FC โ€“ Layered Loops
February 27, 2023
// Unrolled inner loop
for chw in [0, C*H*W):
i_chw = i[chw]
for m in [0, M, 2):
o[m] += i_chw * f[CHW*m + chw]
o[m+1] += i_chw * f[CHW*(m+1) + chw]
// Level 2 loops
for chw in [0, C*H*W):
i_chw = i[chw]
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i_chw * f[CHW*(m1*VL+m0) + chw]
Level 0 is a set of
vector operations
Limit of m2 (M/VL)
times limit of m1 (LV)
is M
Limit of m1 (M/VL)
times limit of m0 (LV)
is M
m = m1*VL+m0
L07-87
Sze and Emer
Einsum Rank Splitting
February 27, 2023
๐‘‚๐‘‚๐‘š๐‘š1,๐‘š๐‘š0 = ๐ผ๐ผ๐‘๐‘โ„Ž๐‘ค๐‘ค ร— ๐น๐น๐‘š๐‘š1,๐‘š๐‘š0,๐‘๐‘โ„Ž๐‘ค๐‘ค
โ†’ ๐น๐น๐‘š๐‘š1,๐‘š๐‘š0,๐‘๐‘โ„Ž๐‘ค๐‘ค
๐‘‚๐‘‚๐‘š๐‘š = ๐ผ๐ผ๐‘๐‘โ„Ž๐‘ค๐‘ค ร— ๐น๐น๐‘š๐‘š,๐‘๐‘โ„Ž๐‘ค๐‘ค
๐‘‚๐‘‚๐‘š๐‘š โ†’ ๐‘‚๐‘‚๐‘š๐‘š1ร—๐‘‰๐‘‰๐‘‰๐‘‰+๐‘š๐‘š0 โ†’ ๐‘‚๐‘‚๐‘š๐‘š1,๐‘š๐‘š0
โ†’ ๐น๐น๐‘š๐‘š1ร—๐‘‰๐‘‰๐‘‰๐‘‰+๐‘š๐‘š0,๐‘๐‘โ„Ž๐‘ค๐‘ค
๐น๐น๐‘š๐‘š,๐‘๐‘โ„Ž๐‘ค๐‘ค
๐‘‚๐‘‚๐‘š๐‘š = ๐ผ๐ผ๐‘๐‘โ„Ž๐‘ค๐‘ค ร— ๐น๐น๐‘š๐‘š,๐‘๐‘โ„Ž๐‘ค๐‘ค
L07-88
Sze and Emer
FC โ€“ Layered Loops
// Level 2 loops
for chw in [0, C*H*W):
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1][m0] += i[chw] * f[m1][m0][chw]
// Level 2 loops
for chw in [0, C*H*W):
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw]
February 27, 2023
Flatten data structures
L07-89
Sze and Emer
FC โ€“ Layered Loops
February 27, 2023
// Level 2 loops
for chw in [0, C*H*W):
i_chw = i[chw]
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i_chw * f[VL*CWH*m1+CWH*m0+chw]
// Level 1 loops
m1VL = m1*VL
CHWVLm1_chw = CHW*VL*m1 + chw
parallel_for m0 in [0, VL):
o[m1VL+m0] += i_chw * f[CHWVLm1_chw + CHW*m0]
Invariant in inner loop!
Stride!
Hoist Loop
Invariant!
L07-90
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-91
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-92
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-93
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-94
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
L07-95
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
L07-96
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
L07-97
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
L07-98
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
How many MACs/cycle (ignoring stalls)? ~ VL/7
Can we unroll this to get even more? Yes
L07-99
Sze and Emer
Full Connected - Vectorized
February 27, 2023
mv r1, 0 # r1 holds chw
add r4, 0 # r4 holds CHWVLm1_chw
xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh]
mv r2, 0 # r2 holds m1VL
mloop: ldv v3, f(r4), CWH # v3 holds f[]
ldv v5, o(r2), 1 # v5 holds o[]
macv v5, v1, v3 # multiply f[] * i[]
stv v5, o(r2), 1 # store o
add r2, r2, VL # update m1VL
add r4, r4, CHWVL # update CHWVLm1_chw
blt r2, M, mloop
add r1, r1, 1 # update chw
add r4, r4, r1 # update CHWVLm1_chw
blt r1, CWH, xloop
Strength reduced
How many MACs/cycle (ignoring stalls)? ~ VL/7
Can we unroll this to get even more? Yes
L07-100
Sze and Emer
FC โ€“ Layered Loops
February 27, 2023
// Level 2 loops
for chw in [0, C*H*W):
for m1 in [0, M/VL):
// Level 1 loops
parallel_for m0 in VL):
o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw]
// Level 2 loops
for m1 in [0, M/VL):
for chw in [0, C*H*W):
// Level 1 loops
parallel_for m0 in [0, VL):
o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw]
No constraints
on loop
permutations!
Loop order
affects where
loop invariants
can be moved
L07-101
Sze and Emer
Vector ISA Attributes
โ€ข Compact
โ€“ one short instruction encodes N operations
โ€“ many implicit bookkeeping/control operations
โ€ข Expressive, hardware knows the N operations:
โ€“ are independent
โ€“ use the same functional unit
โ€“ access disjoint registers
โ€“ access registers in same pattern as previous instructions
โ€“ access a contiguous block of memory
(unit-stride load/store)
โ€“ access memory in a known pattern
(strided load/store)
February 27, 2023
Vector instructions make โ€œexplicitโ€ many things that are โ€œimplicitโ€ with standard instructions
L07-102
Sze and Emer
Vector ISA Hardware Implications
โ€ข Large amount of work per instruction
-> Less instruction fetch bandwidth requirements
-> Allows simplified instruction fetch design
โ€ข Implicit bookkeeping operations
-> Bookkeeping can run in parallel with main compute
โ€ข Disjoint vector element accesses
-> Banked rather than multi-ported register files
โ€ข No data dependence within a vector
-> Amenable to deeply pipelined/parallel designs
โ€ข Known regular memory access pattern
-> Allows for banked memory for higher bandwidth
February 27, 2023
L07-103
Sze and Emer
โ€ข Use deep pipeline (=> fast clock) to
execute element operations
โ€ข Simplifies control of deep pipeline because
elements in vector are independent (=> no
hazards!)
Six stage multiply pipeline
Vector Arithmetic Execution
February 27, 2023
V
1
V
2
V
3
V3 <- v1 * v2
L07-104
Sze and Emer
Vector Instruction Execution
February 27, 2023
ADDV C,A,B
Execution using
one pipelined
functional unit
Execution using
four pipelined
functional units
C[1]
C[2]
C[0]
A[3] B[3]
A[4] B[4]
A[5] B[5]
A[6] B[6]
C[4]
C[8]
C[0]
A[12] B[12]
A[16] B[16]
A[20] B[20]
A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]
A[17] B[17]
A[21] B[21]
A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]
A[18] B[18]
A[22] B[22]
A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]
A[19] B[19]
A[23] B[23]
A[27] B[27]
L07-105
Sze and Emer
Vector Unit Structure
February 27, 2023
Lane
Function
Unit
(Adder)
Vector
Registers
Memory Subsystem
Elements
0, 4, 8, โ€ฆ
Elements
1, 5, 9, โ€ฆ
Elements
2, 6, 10, โ€ฆ
Elements
3, 7, 11, โ€ฆ
Function
Unit
(Mult)
L07-106
Sze and Emer
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
โ€“ example machine has 32 elements per vector register and 8 lanes
February 27, 2023
Complete 24 operations/cycle while issuing 1 short instruction/cycle
load
load
mul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Instruction
issue
L07-107
Sze and Emer
ISA Datatypes
FP32
FP16
Int32
Int16
Int8
S E M
1 8 23
S E M
1 5 10
M
31
S
S M
1
1 15
S M
1 7
Range Accuracy
10-38 โ€“ 1038 .000006%
6x10-5 - 6x104 .05%
0 โ€“ 2x109 ยฝ
0 โ€“ 6x104 ยฝ
0 โ€“ 127 ยฝ
Image Source: B. Dally
February 27, 2023
L07-108
Sze and Emer
Intel โ€“ MMX/SSE/AVX
Width Int8 Int16 Int 32 Int64 FP16 FP32 FP64 Features
MMX 64 8 4 2 1
SSE 128 4
SSE2 128 16 8 4 2 4 2
SSE3 128 16 8 4 2 4 2 R
AVX 256 32 16 8 4 16 8 4
AVX2 256 32 16 8 4 16 8 4 GUMR
AVX3 512 64 32 16 8 ? 16 8 GUMRP
February 27, 2023
G: gather
U: unaligned
M: MAC
R: reductions/permutations
P: Predicate masks
Source: Myriad non-authoritative sources on web
L07-109
Sze and Emer
Python to C++ Chart
February 22, 2023
[Leiserson, Thereโ€™s plenty of room at the top, Science, 2020]
L07-110
Sze and Emer
Next Lecture: Roofline Analysis and Transforms
Thank you!
February 27, 2023

More Related Content

Similar to L07.pdf

Horizontal alignment of Roads
Horizontal alignment of RoadsHorizontal alignment of Roads
Horizontal alignment of RoadsLatif Hyder Wadho
ย 
Econometric Analysis 8th Edition Greene Solutions Manual
Econometric Analysis 8th Edition Greene Solutions ManualEconometric Analysis 8th Edition Greene Solutions Manual
Econometric Analysis 8th Edition Greene Solutions ManualLewisSimmonss
ย 
Compiler worksheet
Compiler worksheetCompiler worksheet
Compiler worksheetArthyR3
ย 
Project Management Techniques
Project Management TechniquesProject Management Techniques
Project Management Techniquesproject management
ย 
Chapter Eight(3)
Chapter Eight(3)Chapter Eight(3)
Chapter Eight(3)bolovv
ย 
A scrupulous code review - 15 bugs in C++ code
A scrupulous code review - 15 bugs in C++ codeA scrupulous code review - 15 bugs in C++ code
A scrupulous code review - 15 bugs in C++ codePVS-Studio LLC
ย 
Macro
MacroMacro
MacroGoogle
ย 
R Language Introduction
R Language IntroductionR Language Introduction
R Language IntroductionKhaled Al-Shamaa
ย 
Lambdas and Generics (long version) - Bordeaux/Toulouse JUG
Lambdas and Generics (long version) - Bordeaux/Toulouse JUGLambdas and Generics (long version) - Bordeaux/Toulouse JUG
Lambdas and Generics (long version) - Bordeaux/Toulouse JUGHenri Tremblay
ย 
RBootcam Day 2
RBootcam Day 2RBootcam Day 2
RBootcam Day 2Olga Scrivner
ย 
8085 data transfer instruction set
8085 data transfer instruction set8085 data transfer instruction set
8085 data transfer instruction setprashant1271
ย 
Rcpp11 genentech
Rcpp11 genentechRcpp11 genentech
Rcpp11 genentechRomain Francois
ย 
Final term 2012-2013 D1
Final term 2012-2013 D1Final term 2012-2013 D1
Final term 2012-2013 D1Ehab AL-Timimi
ย 
Hiroaki Shiokawa
Hiroaki ShiokawaHiroaki Shiokawa
Hiroaki ShiokawaSuurist
ย 
Lecture 1 python arithmetic (ewurc)
Lecture 1 python arithmetic (ewurc)Lecture 1 python arithmetic (ewurc)
Lecture 1 python arithmetic (ewurc)Al-Mamun Riyadh (Mun)
ย 
Chapter 1
Chapter 1Chapter 1
Chapter 1EasyStudy3
ย 

Similar to L07.pdf (20)

L10-handout.pdf
L10-handout.pdfL10-handout.pdf
L10-handout.pdf
ย 
L10-handout.pdf
L10-handout.pdfL10-handout.pdf
L10-handout.pdf
ย 
Horizontal alignment of Roads
Horizontal alignment of RoadsHorizontal alignment of Roads
Horizontal alignment of Roads
ย 
Econometric Analysis 8th Edition Greene Solutions Manual
Econometric Analysis 8th Edition Greene Solutions ManualEconometric Analysis 8th Edition Greene Solutions Manual
Econometric Analysis 8th Edition Greene Solutions Manual
ย 
Compiler worksheet
Compiler worksheetCompiler worksheet
Compiler worksheet
ย 
Project Management Techniques
Project Management TechniquesProject Management Techniques
Project Management Techniques
ย 
Chapter Eight(3)
Chapter Eight(3)Chapter Eight(3)
Chapter Eight(3)
ย 
A scrupulous code review - 15 bugs in C++ code
A scrupulous code review - 15 bugs in C++ codeA scrupulous code review - 15 bugs in C++ code
A scrupulous code review - 15 bugs in C++ code
ย 
Macro
MacroMacro
Macro
ย 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
ย 
Lambdas and Generics (long version) - Bordeaux/Toulouse JUG
Lambdas and Generics (long version) - Bordeaux/Toulouse JUGLambdas and Generics (long version) - Bordeaux/Toulouse JUG
Lambdas and Generics (long version) - Bordeaux/Toulouse JUG
ย 
RBootcam Day 2
RBootcam Day 2RBootcam Day 2
RBootcam Day 2
ย 
8085 data transfer instruction set
8085 data transfer instruction set8085 data transfer instruction set
8085 data transfer instruction set
ย 
L7 pointers
L7 pointersL7 pointers
L7 pointers
ย 
Rcpp11 genentech
Rcpp11 genentechRcpp11 genentech
Rcpp11 genentech
ย 
Final term 2012-2013 D1
Final term 2012-2013 D1Final term 2012-2013 D1
Final term 2012-2013 D1
ย 
Hiroaki Shiokawa
Hiroaki ShiokawaHiroaki Shiokawa
Hiroaki Shiokawa
ย 
Lecture 1 python arithmetic (ewurc)
Lecture 1 python arithmetic (ewurc)Lecture 1 python arithmetic (ewurc)
Lecture 1 python arithmetic (ewurc)
ย 
Bcsl 033 solve assignment
Bcsl 033 solve assignmentBcsl 033 solve assignment
Bcsl 033 solve assignment
ย 
Chapter 1
Chapter 1Chapter 1
Chapter 1
ย 

More from TRNHONGLINHBCHCM (17)

L06-handout.pdf
L06-handout.pdfL06-handout.pdf
L06-handout.pdf
ย 
L09-handout.pdf
L09-handout.pdfL09-handout.pdf
L09-handout.pdf
ย 
L06.pdf
L06.pdfL06.pdf
L06.pdf
ย 
L03.pdf
L03.pdfL03.pdf
L03.pdf
ย 
L11.pdf
L11.pdfL11.pdf
L11.pdf
ย 
L09.pdf
L09.pdfL09.pdf
L09.pdf
ย 
L04.pdf
L04.pdfL04.pdf
L04.pdf
ย 
L12.pdf
L12.pdfL12.pdf
L12.pdf
ย 
L01.pdf
L01.pdfL01.pdf
L01.pdf
ย 
L02.pdf
L02.pdfL02.pdf
L02.pdf
ย 
L14.pdf
L14.pdfL14.pdf
L14.pdf
ย 
Projects.pdf
Projects.pdfProjects.pdf
Projects.pdf
ย 
L15.pdf
L15.pdfL15.pdf
L15.pdf
ย 
L10.pdf
L10.pdfL10.pdf
L10.pdf
ย 
PaperReview.pdf
PaperReview.pdfPaperReview.pdf
PaperReview.pdf
ย 
L07-handout.pdf
L07-handout.pdfL07-handout.pdf
L07-handout.pdf
ย 
L10.pdf
L10.pdfL10.pdf
L10.pdf
ย 

Recently uploaded

Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...soginsider
ย 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
ย 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
ย 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
ย 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
ย 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
ย 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesRAJNEESHKUMAR341697
ย 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
ย 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
ย 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
ย 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
ย 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
ย 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwaitjaanualu31
ย 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086anil_gaur
ย 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
ย 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
ย 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stageAbc194748
ย 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksMagic Marks
ย 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
ย 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
ย 

Recently uploaded (20)

Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
ย 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
ย 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ย 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
ย 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
ย 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
ย 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
ย 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
ย 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
ย 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
ย 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
ย 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
ย 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
ย 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
ย 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
ย 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
ย 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
ย 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
ย 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
ย 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
ย 

L07.pdf

  • 1. L07-1 6.5930/1 Hardware Architectures for Deep Learning Vectorized Kernel Computation Joel Emer and Vivienne Sze Massachusetts Institute of Technology Electrical Engineering & Computer Science February 27, 2023
  • 2. L07-2 Sze and Emer Goals of Todayโ€™s Lecture โ€ข Understand parallelism and improved efficiency through: โ€“ loop unrolling, and โ€“ vectorization February 27, 2023
  • 3. L07-3 Sze and Emer Background Reading โ€ข Vector architectures โ€“ Computer Architecture: A Quantitative Approach, 6th edition, by Hennessy and Patterson โ€ข Ch 4: p282-310, App G โ€ข Ch 4: p262-288, App G These books and their online/e-book versions are available through MIT libraries. February 27, 2023
  • 4. L07-4 Sze and Emer input fmaps H W C 1 Fully Connected Computation output fmaps โ€ฆ filters 1 1 1 February 27, 2023 H C 1 W H W C M
  • 5. L07-5 Sze and Emer input fmaps H W C 1 Fully Connected Computation output fmaps โ€ฆ filters 1 1 1 February 27, 2023 H C 1 W H W C M
  • 6. L07-6 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = โ€ข Matrixโ€“Vector Multiply: โ€ข Multiply all inputs in all channels by a weight and sum February 27, 2023
  • 7. L07-7 Sze and Emer Filter Memory Layout February 27, 2023 F[M0 C0 H0 W0] F[M0 C0 H0 W1] โ€ฆ F[M0 C0 H1 W0] F[M0 C0 H1 W1] โ€ฆ F[M0 C0 H2 W0] F[M0 C0 H2 W1] โ€ฆ . . F[M0 C1 H0 W0] F[M0 C1 H0 W1] โ€ฆ F[M0 C1 H1 W0] F[M0 C1 H1 W1] โ€ฆ F[M0 C1 H2 W0] F[M0 C1 H2 W1] โ€ฆ . . . F[M1 C0 H0 W0] F[M1 C0 H0 W1] โ€ฆ F[M1 C0 H1 W0] F[M1 C0 H1 W1] โ€ฆ F[M1 C0 H2 W0] F[M1 C0 H2 W1] โ€ฆ . . . H C M W H C 1 W Filter weight Next input channel Next row Next output channel
  • 8. L07-8 Sze and Emer Flattened FC Loops February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations CHWm = -C*H*W for m in [0, M): o[m] = 0 CHWm += C*H*W for chw in [0, C*H*W): o[m] += i[chw] * f[CHWm + chw] Offset to start of current output filter Flattened tensors Loop invariant hoisted and strength reduced
  • 9. L07-9 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 10. L07-10 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 11. L07-11 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 12. L07-12 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 13. L07-13 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 14. L07-14 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 15. L07-15 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 16. L07-16 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H+W-1
  • 17. L07-17 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H+W-1
  • 18. L07-18 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H+W-1
  • 19. L07-19 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 20. L07-20 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 21. L07-21 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0
  • 22. L07-22 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 23. L07-23 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 24. L07-24 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 25. L07-25 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 1
  • 26. L07-26 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-1
  • 27. L07-27 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-1
  • 28. L07-28 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-1
  • 29. L07-29 Sze and Emer Flattened FC Loops February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations CHWm = -C*H*W; for m in [0, M): o[m] = 0; CHWm += C*H*W; for chw in [0, C*H*W) o[m] += i[chw] * f[CHWm + chw] Most of the time is spent here!
  • 30. L07-30 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 31. L07-31 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 32. L07-32 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 33. L07-33 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 34. L07-34 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 35. L07-35 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 36. L07-36 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 37. L07-37 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop
  • 38. L07-38 Sze and Emer Loop Iteration Overhead February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW (CHWm) mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHWm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] add r2, r2, 1 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, 1 blt r1, M, mloop Index calculation (f) Loop count and index calculation (i) How many MACs/cycle (ignoring stalls)? ~ 1/6 Where is a major source of overhead?
  • 39. L07-39 Sze and Emer FC scalar computation February 27, 2023
  • 40. L07-40 Sze and Emer Loop Unrolling (2chw) February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations CHWm = -C*H*W for m in [0, M): CHWm += C*H*W for chw in [0, C*H*W, 2): o[m] += (i[chw] * f[CHWm + chw]) + (i[chw + 1] * f[CHWm + chw + 1] Index calculation amortized since i[chw+1] => &(i+1)[chw] Loop overhead amortized over more computation Operands accessed in pairs Operands accessed in pairs Step by 2
  • 41. L07-41 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 42. L07-42 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 43. L07-43 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 44. L07-44 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 45. L07-45 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 46. L07-46 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 47. L07-47 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop Offset constant reflects constant index increment Offset constant reflects constant index increment
  • 48. L07-48 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 49. L07-49 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop Offset constant reflects constant index increment Offset constant reflects constant index increment
  • 50. L07-50 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 51. L07-51 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 52. L07-52 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop
  • 53. L07-53 Sze and Emer Fully Connected - Unrolled February 27, 2023 mv r1, 0 # r1 holds m mloop: mul r3, r1, C*H*W # r3 holds m*CHW mv r2, 0 # r2 holds chw mv r8, 0 # r8 holds psum (o[m]) xloop: ld r4, i(r2) # r4 = i[chw] add r5, r2, r3 # r5 = CHMm + chw ld r6, f(r5) # r6 = f[CHWm + chw] mac r8, r4, r6 # r8 += i[chw] * f[CHWm+chw] ld r7, i+1(r2) # r7 = i[chw + 1] ld r9, f+1(r5) # r9 = f[CHWm + chw + 1] mac r8, r7, r9 # r11 += i[chw] * f[CHWm + chw +1] add r2, r2, 2 # r2 = chw + 1 blt r2, C*W*H, xloop st r8, o(r1) add r1, r1, 1 blt r1, M, mloop How many MACs/cycle (ignoring stalls)? ~ 2/9
  • 54. L07-54 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 55. L07-55 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 56. L07-56 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 57. L07-57 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 58. L07-58 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 59. L07-59 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 60. L07-60 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 61. L07-61 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 62. L07-62 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 63. L07-63 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 64. L07-64 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 65. L07-65 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 66. L07-66 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 0,1
  • 67. L07-67 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 68. L07-68 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 69. L07-69 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = 2,3
  • 70. L07-70 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 71. L07-71 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 72. L07-72 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 chw = C*H*W-2, C*H*W-1
  • 73. L07-73 Sze and Emer Fully-Connected (FC) Layer M CHW CHW 1 Filters Input fmaps ร— 1 Output fmaps M = February 27, 2023 Can we incorporate this โ€œpairingโ€ into the architecture? Of course chw = C*H*W-2, C*H*W-1
  • 74. L07-74 Sze and Emer Vector Programming Model February 27, 2023 + + + + + + [0] [1] [VLR-1] Vector Arithmetic Instructions ADDV v3, v1, v2 v3 v2 v1 Scalar Registers r0 r15 Vector Registers v0 v15 [0] [1] [2] [VLRMAX-1] VLR Vector Length Register VLRMAX โ€“ number of elements in a vector register VLR โ€“ number of elements to use in an instruction
  • 75. L07-75 Sze and Emer Vector Programming Model February 27, 2023 v1 Vector Load and Store Instructions LDV v1, r1, r2 Base, r1 Stride, r2 Memory Vector Register Scalar Registers r0 r15 Vector Registers v0 v15 [0] [1] [2] [VLRMAX-1] VLR Vector Length Register
  • 76. L07-76 Sze and Emer โ€ฆ โ€ฆ โ€ฆ โ€ฆ Compiler-based Vectorization February 27, 2023 Ld Ai Ld Bi Add St Zi Ld Ai+1 Ld Bi+1 Add St Zi+1 Ld Ai Ld Bi Add St Zi Ld Ai+1 Ld Bi+1 Add St Zi+1 Scalar code Vector code for i in [0:N): Z[i] = A[i] + B[i]; Compiler recognizes independent operations with loop dependence analysis T i m e
  • 77. L07-77 Sze and Emer Loop Unrolled February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): CHWm = C*H*W*m for chw in [0, C*H*W, 2): o[m] += (i[chw] * f[CHWm + chw]) + (i[chw + 1] * f[CHWm + chw + 1]) } Will this vectorize? Not in the architecture presented so far Sequential loads Sequential loads Reduction crosses elements
  • 78. L07-78 Sze and Emer Parallel with animation February 27, 2023
  • 79. L07-79 Sze and Emer Fully Connected โ€“ Loop Permutation February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): for chw in [0, C*H*W, 2): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1] No output is dependent on another output (other than commutative order)
  • 80. L07-80 Sze and Emer Fully Connected โ€“ Loop Permutation February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): for chw in [0, C*H*W, 2): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1]
  • 81. L07-81 Sze and Emer for chw in [0, C*H*W, 2): for m in [0, M): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1] Fully Connected โ€“ Loop Permutation February 27, 2023 int i[C*H*W]; # Input activations int f[M*C*H*W]; # Filter Weights int o[M]; # Output activations for m in [0, M): for chw in [0, C*H*W, 2): o[m] += i[chw] * f[CHW*m + chw] o[m] += i[chw + 1] * f[CHW*m + chw + 1]
  • 82. L07-82 Sze and Emer FC โ€“ Permuted/Unrolled February 27, 2023 // Unrolled inner loop for chw in [0, C*H*W): for m in [0, M, 2): o[m] += i[chw] * f[CHW*m + chw] o[m+1] += i[chw] * f[CHW*(m+1) + chw] // Loops permuted for chw in [0, C*H*W): for m in [0, M): o[m] += i[chw] * f[CHW*m + chw] Unrolled calculation
  • 83. L07-83 Sze and Emer Parallel m animation February 27, 2023
  • 84. L07-84 Sze and Emer // Loop invariant hoisting of i[chw] for chw in [0, C*H*W): i_chw = i[chw] for m in [0, M, 2): o[m] += i_chw * f[CHW*m + chw] o[m+1] += i_chw * f[CHW*(m+1) + chw] FC โ€“ Permuted/Unrolled/Hoisted February 27, 2023 // Unrolled inner loop for chw in [0, C*H*W): for m in [0, M, 2): o[m] += i[chw] * f[CHW*m + chw] o[m+1] += i[chw] * f[CHW*(m+1) + chw] Same for all calculations Load hoisted out of loop
  • 85. L07-85 Sze and Emer Fully Connection Computation February 27, 2023 F[M0 C0 H0 W0] F[M0 C0 H0 W1] โ€ฆ F[M0 C0 H1 W0] F[M0 C0 H1 W1] โ€ฆ F[M0 C0 H2 W0] F[M0 C0 H2 W1] โ€ฆ . . F[M0 C1 H0 W0] F[M0 C1 H0 W1] โ€ฆ F[M0 C1 H1 W0] F[M0 C1 H1 W1] โ€ฆ F[M0 C1 H2 W0] F[M0 C1 H2 W1] โ€ฆ . . . F[M1 C0 H0 W0] F[M1 C0 H0 W1] โ€ฆ F[M1 C0 H1 W0] F[M1 C0 H1 W1] โ€ฆ F[M1 C0 H2 W0] F[M1 C0 H2 W1] โ€ฆ . . . I[C0 H0 W0] I[C0 H0 W1] โ€ฆ I[C0 H1 W0] I[C0 H1 W1] โ€ฆ I[C0 H2 W0] I[C0 H2 W1] โ€ฆ . . I[C1 H0 W0] I[C1 H0 W1] โ€ฆ I[C1 H1 W0] I[C1 H1 W1] โ€ฆ I[C1 H2 W0] I[C1 H2 W1] โ€ฆ . . . // Loop invariant hosting of i[chw] for chw in [0, C*H*W): i_chw = i[chw]; for m in [0, M, 2): o[m] += i_chw * f[CHW*m + chw] o[m+1] += i_chw * f[CHW*(m+1) + chw] Weights needed together are far apart. What can we do Weights needed together are far apart. What can we do? Access with stride or rearrange memory
  • 86. L07-86 Sze and Emer FC โ€“ Layered Loops February 27, 2023 // Unrolled inner loop for chw in [0, C*H*W): i_chw = i[chw] for m in [0, M, 2): o[m] += i_chw * f[CHW*m + chw] o[m+1] += i_chw * f[CHW*(m+1) + chw] // Level 2 loops for chw in [0, C*H*W): i_chw = i[chw] for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i_chw * f[CHW*(m1*VL+m0) + chw] Level 0 is a set of vector operations Limit of m2 (M/VL) times limit of m1 (LV) is M Limit of m1 (M/VL) times limit of m0 (LV) is M m = m1*VL+m0
  • 87. L07-87 Sze and Emer Einsum Rank Splitting February 27, 2023 ๐‘‚๐‘‚๐‘š๐‘š1,๐‘š๐‘š0 = ๐ผ๐ผ๐‘๐‘โ„Ž๐‘ค๐‘ค ร— ๐น๐น๐‘š๐‘š1,๐‘š๐‘š0,๐‘๐‘โ„Ž๐‘ค๐‘ค โ†’ ๐น๐น๐‘š๐‘š1,๐‘š๐‘š0,๐‘๐‘โ„Ž๐‘ค๐‘ค ๐‘‚๐‘‚๐‘š๐‘š = ๐ผ๐ผ๐‘๐‘โ„Ž๐‘ค๐‘ค ร— ๐น๐น๐‘š๐‘š,๐‘๐‘โ„Ž๐‘ค๐‘ค ๐‘‚๐‘‚๐‘š๐‘š โ†’ ๐‘‚๐‘‚๐‘š๐‘š1ร—๐‘‰๐‘‰๐‘‰๐‘‰+๐‘š๐‘š0 โ†’ ๐‘‚๐‘‚๐‘š๐‘š1,๐‘š๐‘š0 โ†’ ๐น๐น๐‘š๐‘š1ร—๐‘‰๐‘‰๐‘‰๐‘‰+๐‘š๐‘š0,๐‘๐‘โ„Ž๐‘ค๐‘ค ๐น๐น๐‘š๐‘š,๐‘๐‘โ„Ž๐‘ค๐‘ค ๐‘‚๐‘‚๐‘š๐‘š = ๐ผ๐ผ๐‘๐‘โ„Ž๐‘ค๐‘ค ร— ๐น๐น๐‘š๐‘š,๐‘๐‘โ„Ž๐‘ค๐‘ค
  • 88. L07-88 Sze and Emer FC โ€“ Layered Loops // Level 2 loops for chw in [0, C*H*W): for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1][m0] += i[chw] * f[m1][m0][chw] // Level 2 loops for chw in [0, C*H*W): for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw] February 27, 2023 Flatten data structures
  • 89. L07-89 Sze and Emer FC โ€“ Layered Loops February 27, 2023 // Level 2 loops for chw in [0, C*H*W): i_chw = i[chw] for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i_chw * f[VL*CWH*m1+CWH*m0+chw] // Level 1 loops m1VL = m1*VL CHWVLm1_chw = CHW*VL*m1 + chw parallel_for m0 in [0, VL): o[m1VL+m0] += i_chw * f[CHWVLm1_chw + CHW*m0] Invariant in inner loop! Stride! Hoist Loop Invariant!
  • 90. L07-90 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 91. L07-91 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 92. L07-92 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 93. L07-93 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 94. L07-94 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop
  • 95. L07-95 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced
  • 96. L07-96 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced
  • 97. L07-97 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced
  • 98. L07-98 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced How many MACs/cycle (ignoring stalls)? ~ VL/7 Can we unroll this to get even more? Yes
  • 99. L07-99 Sze and Emer Full Connected - Vectorized February 27, 2023 mv r1, 0 # r1 holds chw add r4, 0 # r4 holds CHWVLm1_chw xloop: ldv v1, i(r1), 0 # fill v1 with i[cwh] mv r2, 0 # r2 holds m1VL mloop: ldv v3, f(r4), CWH # v3 holds f[] ldv v5, o(r2), 1 # v5 holds o[] macv v5, v1, v3 # multiply f[] * i[] stv v5, o(r2), 1 # store o add r2, r2, VL # update m1VL add r4, r4, CHWVL # update CHWVLm1_chw blt r2, M, mloop add r1, r1, 1 # update chw add r4, r4, r1 # update CHWVLm1_chw blt r1, CWH, xloop Strength reduced How many MACs/cycle (ignoring stalls)? ~ VL/7 Can we unroll this to get even more? Yes
  • 100. L07-100 Sze and Emer FC โ€“ Layered Loops February 27, 2023 // Level 2 loops for chw in [0, C*H*W): for m1 in [0, M/VL): // Level 1 loops parallel_for m0 in VL): o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw] // Level 2 loops for m1 in [0, M/VL): for chw in [0, C*H*W): // Level 1 loops parallel_for m0 in [0, VL): o[m1*VL+m0] += i[chw] * f[VL*CWH*m1+CWH*m0+chw] No constraints on loop permutations! Loop order affects where loop invariants can be moved
  • 101. L07-101 Sze and Emer Vector ISA Attributes โ€ข Compact โ€“ one short instruction encodes N operations โ€“ many implicit bookkeeping/control operations โ€ข Expressive, hardware knows the N operations: โ€“ are independent โ€“ use the same functional unit โ€“ access disjoint registers โ€“ access registers in same pattern as previous instructions โ€“ access a contiguous block of memory (unit-stride load/store) โ€“ access memory in a known pattern (strided load/store) February 27, 2023 Vector instructions make โ€œexplicitโ€ many things that are โ€œimplicitโ€ with standard instructions
  • 102. L07-102 Sze and Emer Vector ISA Hardware Implications โ€ข Large amount of work per instruction -> Less instruction fetch bandwidth requirements -> Allows simplified instruction fetch design โ€ข Implicit bookkeeping operations -> Bookkeeping can run in parallel with main compute โ€ข Disjoint vector element accesses -> Banked rather than multi-ported register files โ€ข No data dependence within a vector -> Amenable to deeply pipelined/parallel designs โ€ข Known regular memory access pattern -> Allows for banked memory for higher bandwidth February 27, 2023
  • 103. L07-103 Sze and Emer โ€ข Use deep pipeline (=> fast clock) to execute element operations โ€ข Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!) Six stage multiply pipeline Vector Arithmetic Execution February 27, 2023 V 1 V 2 V 3 V3 <- v1 * v2
  • 104. L07-104 Sze and Emer Vector Instruction Execution February 27, 2023 ADDV C,A,B Execution using one pipelined functional unit Execution using four pipelined functional units C[1] C[2] C[0] A[3] B[3] A[4] B[4] A[5] B[5] A[6] B[6] C[4] C[8] C[0] A[12] B[12] A[16] B[16] A[20] B[20] A[24] B[24] C[5] C[9] C[1] A[13] B[13] A[17] B[17] A[21] B[21] A[25] B[25] C[6] C[10] C[2] A[14] B[14] A[18] B[18] A[22] B[22] A[26] B[26] C[7] C[11] C[3] A[15] B[15] A[19] B[19] A[23] B[23] A[27] B[27]
  • 105. L07-105 Sze and Emer Vector Unit Structure February 27, 2023 Lane Function Unit (Adder) Vector Registers Memory Subsystem Elements 0, 4, 8, โ€ฆ Elements 1, 5, 9, โ€ฆ Elements 2, 6, 10, โ€ฆ Elements 3, 7, 11, โ€ฆ Function Unit (Mult)
  • 106. L07-106 Sze and Emer Vector Instruction Parallelism Can overlap execution of multiple vector instructions โ€“ example machine has 32 elements per vector register and 8 lanes February 27, 2023 Complete 24 operations/cycle while issuing 1 short instruction/cycle load load mul mul add add Load Unit Multiply Unit Add Unit time Instruction issue
  • 107. L07-107 Sze and Emer ISA Datatypes FP32 FP16 Int32 Int16 Int8 S E M 1 8 23 S E M 1 5 10 M 31 S S M 1 1 15 S M 1 7 Range Accuracy 10-38 โ€“ 1038 .000006% 6x10-5 - 6x104 .05% 0 โ€“ 2x109 ยฝ 0 โ€“ 6x104 ยฝ 0 โ€“ 127 ยฝ Image Source: B. Dally February 27, 2023
  • 108. L07-108 Sze and Emer Intel โ€“ MMX/SSE/AVX Width Int8 Int16 Int 32 Int64 FP16 FP32 FP64 Features MMX 64 8 4 2 1 SSE 128 4 SSE2 128 16 8 4 2 4 2 SSE3 128 16 8 4 2 4 2 R AVX 256 32 16 8 4 16 8 4 AVX2 256 32 16 8 4 16 8 4 GUMR AVX3 512 64 32 16 8 ? 16 8 GUMRP February 27, 2023 G: gather U: unaligned M: MAC R: reductions/permutations P: Predicate masks Source: Myriad non-authoritative sources on web
  • 109. L07-109 Sze and Emer Python to C++ Chart February 22, 2023 [Leiserson, Thereโ€™s plenty of room at the top, Science, 2020]
  • 110. L07-110 Sze and Emer Next Lecture: Roofline Analysis and Transforms Thank you! February 27, 2023