1. L08-1
Sze and Emer
6.5930/1
Hardware Architectures for Deep Learning
Vectorized Kernel Computation
(continued)
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 1, 2023
2. L08-2
ISA Datatypes

Type    S  E  M    Range                Accuracy
FP32    1  8  23   10^-38 – 10^38       .000006%
FP16    1  5  10   6x10^-5 – 6x10^4     .05%
Int32   1  –  31   0 – 2x10^9           ½
Int16   1  –  15   0 – 6x10^4           ½
Int8    1  –  7    0 – 127              ½

(S = sign bits, E = exponent bits, M = mantissa/magnitude bits)
Image Source: B. Dally
4. L08-4
Computational Transforms
5. L08-5
FC – Vector Operation Counts
// Level 2 loops
for m in [0, M):
for chw1 in [0, C*H*W/L):
// Level 1 loops
parallel_for chw0 in [0, L):
o[m] += i[L*chw1+chw0] * f[C*H*W*m + L*chw1+chw0];
L == Lanes
Vector operation on L lanes
Order: m, chw1, chw0 Factors: M, C*H*W/L, L
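The loop nest above can be written as a minimal runnable Python sketch (the function name is mine, not from the slides; the parallel_for over lanes is simulated serially):

```python
def fc_vectorized(i, f, M, CHW, L):
    """Fully-connected layer o[m] = sum_chw i[chw] * f[CHW*m + chw],
    with the chw dimension split into CHW/L vector operations of L lanes."""
    assert CHW % L == 0
    o = [0] * M
    for m in range(M):                      # Level 2 loops
        for chw1 in range(CHW // L):
            for chw0 in range(L):           # Level 1: one vector op on L lanes
                o[m] += i[L*chw1 + chw0] * f[CHW*m + L*chw1 + chw0]
    return o
```

Each (m, chw1) iteration corresponds to one vector MAC across L lanes, so the total MAC count is M*(C*H*W/L)*L = M*C*H*W.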
6. L08-6
FC – Vector Operation Counts
How many MACs? M*(C*H*W/L)*L = M*C*H*W
How many reads of “inputs”? C*H*W*M
How many reads of “weights”? C*H*W*M
How many writes of “outputs”? M
Measuring reads/writes in units of 32-bit integers
// Level 2 loops
for m in [0, M):
for chw1 in [0, C*H*W/L):
// Level 1 loops
parallel_for chw0 in [0, L):
o[m] += i[L*chw1+chw0] * f[C*H*W*m + L*chw1+chw0];
L == Lanes
7. L08-7
Compute Intensity (MACs/Read)
MACs/Read = C*H*W*M / (C*H*W*M + C*H*W*M) ≈ 1/2
If the system can support 1 Read/MAC, will it run at full throttle? No
// Level 2 loops
for m in [0, M):
for chw1 in [0, C*H*W/L):
// Level 1 loops
parallel_for chw0 in [0, L):
o[m] += i[L*chw1+chw0] * f[C*H*W*m + L*chw1+chw0];
L == Lanes
8. L08-8
Roofline Model
[Roofline plot: MACs/Cycle (0–10) vs. Compute Intensity in MACs/read (0–2); 8 MAC Lanes with 1 Read/MAC]
MACs/cycle limited by
number of lanes
Williams, Samuel, Andrew Waterman, and David Patterson. "Roofline: an insightful visual performance model for multicore architectures."
Communications of the ACM 52.4 (2009): 65-76.
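The roofline bound itself is one line of arithmetic; the sketch below (my naming) encodes the slide's example machine of 8 MAC lanes with bandwidth for 1 read per MAC, i.e., 8 reads/cycle:

```python
def roofline_macs_per_cycle(ci, peak_macs=8.0, reads_per_cycle=8.0):
    """Attainable MACs/cycle = min(peak compute, compute intensity * read bandwidth)."""
    return min(peak_macs, ci * reads_per_cycle)
```

At the earlier code's compute intensity of 1/2 this machine attains only 4 MACs/cycle; the ridge point, where operand delivery stops being the limit, sits at a compute intensity of 1.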
9. L08-9
Roofline Model
[Roofline plot: MACs/Cycle (0–10) vs. Compute Intensity in MACs/read (0–2); 8 MAC Lanes with 1 Read/MAC]
Where will the previous slide’s code be? Compute intensity = 1/2
MACs/cycle limited by
number of lanes
MACs/cycle limited
by operand delivery
10. L08-10
Roofline Model
[Roofline plot: MACs/Cycle (0–10) vs. Compute Intensity in MACs/read (0–2); 8 MAC Lanes with 1 Read/MAC]
Where will the previous slide’s code be? Compute intensity = 1/2
How can we change the compute intensity? Code changes, e.g., different splitting,
loop inversions?
11. L08-11
FC – Splitting/Reordered
// Level 2 loops
for m1 in [0, M/L):
for chw in [0, C*H*W):
// Level 1 loops
parallel_for m0 in [0, L):
o[m1*L+m0] += i[chw] * f[CHW*(m1*L+m0) + chw]
L == Lanes
Partition “m”
Order: m1, chw, m0 Factors: M/L, C*H*W, L
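The same split, as a runnable Python sketch (function name is mine): with the output dimension spread across lanes, every i[chw] read is shared by all L lanes, which is what hoisting it out of the lane loop exploits.

```python
def fc_split_m(i, f, M, CHW, L):
    """FC layer with the output (m) dimension split across L lanes."""
    assert M % L == 0
    o = [0] * M
    for m1 in range(M // L):                # Level 2 loops
        for chw in range(CHW):
            for m0 in range(L):             # Level 1: one vector op across outputs
                o[m1*L + m0] += i[chw] * f[CHW*(m1*L + m0) + chw]
    return o
```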
12. L08-12
FC – Reordered + loop invariant hoisted
// Level 2 loops
for m2 in [0, M/L):
for chw in [0, C*H*W):
// Level 1 loops
parallel_for m1 in [0, L):
o[m2*L+m1] += i[chw] * f[CHW*(m2*L+m1) + chw]
// Level 2 loops
for m2 in [0, M/L):
for chw in [0, C*H*W):
i_chw = i[chw]
// Level 1 loops
parallel_for m1 in [0, L):
o[m2*L+m1] += i_chw * f[CHW*(m2*L+m1) + chw]
L == Lanes
13. L08-13
FC – Operation Counts
How many MACs? C*H*W*M
How many reads of “inputs”? C*H*W*M/L
How many reads of “weights”? C*H*W*M
How many writes of “outputs”? M
Measuring reads/writes in units of 32-bit integers
// Level 2 loops
for m1 in [0, M/L):
for chw in [0, C*H*W):
i_chw = i[chw]
// Level 1 loops
parallel_for m0 in [0, L):
o[m1*L+m0] += i_chw * f[CHW*(m1*L+m0) + chw]
L == Lanes
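The counts above can be checked with an instrumented sketch of the hoisted loop nest (counting accesses only; the function name is mine):

```python
def fc_hoisted_counts(M, CHW, L):
    """Count scalar reads and MACs in the hoisted FC loop nest."""
    input_reads = weight_reads = macs = 0
    for m1 in range(M // L):
        for chw in range(CHW):
            input_reads += 1                # i_chw = i[chw]: one read shared by L lanes
            for m0 in range(L):
                weight_reads += 1           # f[CHW*(m1*L+m0) + chw]
                macs += 1
    return macs, input_reads, weight_reads
```

For M=8, C*H*W=16, L=4 this gives 128 MACs, 32 input reads (= C*H*W*M/L) and 128 weight reads (= C*H*W*M), matching the slide.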
14. L08-14
Compute Intensity
MACs/Read = C*H*W*M / (C*H*W*M + C*H*W*M/L) ≈ L/(1+L)
If the system can support 1 Read/MAC, will it run at full throttle? No
// Level 2 loops
for m1 in [0, M/L):
for chw in [0, C*H*W):
i_chw = i[chw]
// Level 1 loops
parallel_for m0 in [0, L):
o[m1*L+m0] += i_chw * f[CHW*(m1*L+m0) + chw]
15. L08-15
Roofline Model
[Roofline plot: MACs/Cycle (0–10) vs. Compute Intensity in MACs/read (0–2); 1 MAC/Read, 8 lanes]
Where will the previous slide’s code be? Compute intensity = L/(1+L) = 8/9
Why might points be below the line? Other overheads (e.g. instructions, stalls)
Is being on the flat part always best? Not necessarily…
16. L08-16
Computation Transformations
• Goal: bitwise same result, but reduce the number of operations
• Focuses mostly on compute
17. L08-17
Gauss’s Multiplication Algorithm
4 multiplications + 3 additions
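The slide's worked example is a figure that did not survive the export; the standard form of Gauss's trick computes a complex product with 3 real multiplies instead of 4, at the cost of extra additions. A sketch with my own naming:

```python
def gauss_complex_mult(a, b, c, d):
    """(a + bj) * (c + dj) using 3 multiplies: k1 = c(a+b), k2 = a(d-c), k3 = b(c+d)."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return (k1 - k3, k1 + k2)               # (real part, imaginary part)
```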
18. L08-18
Strassen
P1 = a(f – h)
P2 = (a + b)h
P3 = (c + d)e
P4 = d(g – e)
P5 = (a + d)(e + h)
P6 = (b - d)(g + h)
P7 = (a – c)(e + f)
8 multiplications + 4 additions
7 multiplications + 18 additions
7 multiplications + 13 additions (for constant B matrix – weights)
[Cong et al., ICANN, 2014]
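The products P1..P7 above combine into the 2x2 result via the standard Strassen recombination (the recombination formulas are not on the slide); a one-level sketch:

```python
def strassen_2x2(A, B):
    """One level of Strassen on 2x2 matrices [[a,b],[c,d]] x [[e,f],[g,h]]:
    7 multiplies (the slide's P1..P7) instead of 8."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4,           p1 + p5 - p3 - p7]]
```

Applied recursively to matrix blocks, this is what yields the Θ(N^2.807) complexity on the next slide.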
19. L08-19
Strassen
Comes at the price of reduced numerical stability and
requires significantly more memory
[Chart: operation-count complexity vs. matrix size N, naïve vs. Strassen]
Image Source: http://www.stoimen.com/blog/2012/11/26/computer-algorithms-strassens-matrix-multiplication/
• Reduce the complexity of matrix multiplication from Θ(N^3) to Θ(N^2.807) by reducing multiplications
20. L08-20
Python to C++ Chart
[Leiserson, There’s plenty of room at the top, Science, 2020]
21. L08-21
Tensor Computations
Matrix Multiply
CONV Layer
22. L08-22
Convolution (CONV) Layer
[Figure: N input fmaps (C×H×W) are convolved with M filters (C×R×S) to produce N output fmaps (M×P×Q); batch size N]
23. L08-23
CONV Layer Implementation
Naïve 7-layer for-loop implementation:
for n in [0..N):
  for m in [0..M):
    for q in [0..Q):
      for p in [0..P):
        O[n][m][p][q] = B[m];
        for c in [0..C):
          for r in [0..R):
            for s in [0..S):
              O[n][m][p][q] += I[n][c][Up+r][Uq+s] × F[m][c][r][s];
        O[n][m][p][q] = Activation(O[n][m][p][q]);
For each output fmap value: convolve a window and apply activation
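A runnable version of the 7-loop nest (stride U defaults to 1; the activation is assumed to be ReLU here, which the slide leaves abstract):

```python
def conv_layer(I, F, B, N, M, C, H, W, R, S, U=1):
    """Naive CONV layer: I[n][c][h][w] inputs, F[m][c][r][s] filters, B[m] biases."""
    P = (H - R) // U + 1
    Q = (W - S) // U + 1
    O = [[[[0.0] * Q for _ in range(P)] for _ in range(M)] for _ in range(N)]
    for n in range(N):
        for m in range(M):
            for p in range(P):
                for q in range(Q):
                    acc = B[m]
                    for c in range(C):                  # convolve a window
                        for r in range(R):
                            for s in range(S):
                                acc += I[n][c][U*p + r][U*q + s] * F[m][c][r][s]
                    O[n][m][p][q] = max(acc, 0.0)       # activation (ReLU assumed)
    return O
```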
24. L08-24
Winograd 1D – F(2,3)
• Targeting convolutions instead of matrix multiply
• Notation: F(size of output, filter size)
[Lavin et al., CVPR 2016]
6 multiplications + 4 additions
[Figure: F(2,3) — a 3-tap filter applied to 4 inputs, producing 2 outputs]
25. L08-25
Winograd 1D – F(2,3)
• Targeting convolutions instead of matrix multiply
• Notation: F(size of output, filter size)
[Lavin et al., CVPR 2016]
[Figure: F(2,3) — a 3-tap filter applied to 4 inputs, producing 2 outputs]
4 multiplications + 12 additions + 2 shifts
4 multiplications + 8 additions (for constant weights)
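The F(2,3) formulas themselves did not survive the export; the sketch below uses the standard minimal-filtering form from Lavin & Gray. The two divisions by 2 correspond to the slide's shifts, and both sit on the filter side, so with constant weights those terms are precomputed (hence the 4-multiply, 8-addition count):

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap convolution from four inputs, 4 multiplies.
    d = (d0, d1, d2, d3) inputs, g = (g0, g1, g2) filter taps."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2     # filter factor precomputable
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2     # filter factor precomputable
    m4 = (d1 - d3) * g2
    return (m1 + m2 + m3, m2 - m3 - m4)     # (y0, y1)
```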
33. L08-33
Winograd Summary
• Winograd is an optimized computation for
convolutions
• It can significantly reduce multiplies
– For example, by 2.25X for a 3x3 filter
• But, each filter size (and output size) is a
different computation.
34. L08-34
Winograd as a Transform
Transform inputs
Element-wise
Multiplication
Transform output
[Lavin, ArXiv 2015]
Note: G f G^T (the filter transform) can be precomputed
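In 1-D matrix form, the flow on this slide is Y = A^T [ (G g) ⊙ (B^T d) ]; the matrices in the sketch below are the standard F(2,3) ones from Lavin, and G g is exactly the precomputable filter transform:

```python
def winograd_f23_transform(d, g):
    """F(2,3) as transforms: filter transform G g, input transform B^T d,
    element-wise multiply (4 multiplies), then output transform A^T."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    G  = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
    Bt = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
    At = [[1, 1, 1, 0], [0, 1, -1, -1]]
    U = matvec(G, g)                        # precomputable for constant weights
    V = matvec(Bt, d)                       # transform inputs
    return matvec(At, [u * v for u, v in zip(U, V)])   # transform output
```

The 2-D version used for CONV layers nests the same transforms on both sides of the tile.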
35. L08-35
Fast Fourier Transform (FFT) Flow
[Figure: filter (R×S weights) * input fmap (H×W) = output fmap (E×F), one output activation per position. Flow: FFT(W) × FFT(I) = FFT(O), then an IFFT recovers the output]
36. L08-36
FFT Overview
• Convert filter and input to the frequency domain to make convolution a simple multiply, then convert back to the space domain
• Convert direct convolution’s O(No^2 Nf^2) computation to O(No^2 log2 No)
• Note that the computational benefit of FFT decreases with decreasing filter size
[Mathieu, ArXiv 2013], [Vasilache, ArXiv 2014]
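The identity behind the flow can be demonstrated end to end; the sketch below (my naming) uses a naive O(n^2) DFT in place of a real FFT to stay dependency-free:

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (an FFT stand-in for illustration)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def fft_conv(i, w):
    """Linear convolution via FFT(I) * FFT(W) = FFT(O), after zero-padding
    both operands to length len(i) + len(w) - 1 (the '0-completion' below)."""
    n = len(i) + len(w) - 1
    I = dft(list(i) + [0] * (n - len(i)))
    W = dft(list(w) + [0] * (n - len(w)))
    O = dft([a * b for a, b in zip(I, W)], inverse=True)
    return [round(v.real, 9) for v in O]
```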
37. L08-37
FFT Costs
• Input and Filter matrices are ‘0-completed’, i.e., expanded to size (P+R-1) × (Q+S-1)
• Frequency domain matrices are same dimensions as
input, but complex.
• FFT often reduces computation, but requires much more
memory space and bandwidth
38. L08-38
Optimization opportunities
• FFT of real matrix is symmetric allowing one to save ½ the computes
• Filters can be pre-computed and stored, but convolutional filter in
frequency domain is much larger than in space domain
• Can reuse frequency domain version of input for creating different
output channels to avoid FFT re-computations
• Can accumulate across channels before performing the inverse transform to reduce the number of IFFTs
39. L08-39
cuDNN: Speed up with Transformations
Source: Nvidia
40. L08-40
UCNN – Convolution (Simplified)
Filters * Input Fmap = Output Fmap
Filters: [a b; c d] and [e f; g h]
Input Fmap:
i00 i01 i02 i03
i10 i11 i12 i13
i20 i21 i22 i23
i30 i31 i32 i33
Output Fmap: [o00 o01; o10 o11]
o = a*i1 + b*i2 + c*i3 + d*i4 + e*i5 + f*i6 + g*i7 + h*i8
7 additions
8 multiplications
[Hegde, ISCA 2018]
41. L08-41
UCNN – Convolution (Simplified)
Filters * Input Fmap = Output Fmap
Filters: [a b; c d] and [b a; g h]
Input Fmap:
i00 i01 i02 i03
i10 i11 i12 i13
i20 i21 i22 i23
i30 i31 i32 i33
Output Fmap: [o00 o01; o10 o11]
o = a*i1 + b*i2 + c*i3 + d*i4 + e*i5 + f*i6 + g*i7 + h*i8
7 additions
8 multiplications
[Hegde, ISCA 2018]
42. L08-42
UCNN – Convolution (Simplified)
Filters * Input Fmap = Output Fmap
Filters: [a b; c d] and [b a; g h]
Input Fmap:
i00 i01 i02 i03
i10 i11 i12 i13
i20 i21 i22 i23
i30 i31 i32 i33
Output Fmap: [o00 o01; o10 o11]
o = a*i1 + b*i2 + c*i3 + d*i4 + b*i5 + a*i6 + g*i7 + h*i8
o = a*i1 + b*i2 + c*i3 + d*i4 + e*i5 + f*i6 + g*i7 + h*i8
7 additions
8 multiplications
[Hegde, ISCA 2018]
43. L08-43
UCNN – Convolution (Simplified)
Filters * Input Fmap = Output Fmap
Filters: [a b; c d] and [b a; g h]
Input Fmap:
i00 i01 i02 i03
i10 i11 i12 i13
i20 i21 i22 i23
i30 i31 i32 i33
Output Fmap: [o00 o01; o10 o11]
o = a*i1 + b*i2 + c*i3 + d*i4 + b*i5 + a*i6 + g*i7 + h*i8
o = a*i1 + b*i2 + c*i3 + d*i4 + e*i5 + f*i6 + g*i7 + h*i8
o = a*(i1 + i6) + b*(i2 + i5) + c*i3 + d*i4 + g*i7 + h*i8
7 additions
8 multiplications
[Hegde, ISCA 2018]
44. L08-44
UCNN – Convolution (Simplified)
Filters * Input Fmap = Output Fmap
Filters: [a b; c d] and [b a; g h]
Input Fmap:
i00 i01 i02 i03
i10 i11 i12 i13
i20 i21 i22 i23
i30 i31 i32 i33
Output Fmap: [o00 o01; o10 o11]
o = a*i1 + b*i2 + c*i3 + d*i4 + b*i5 + a*i6 + g*i7 + h*i8
o = a*i1 + b*i2 + c*i3 + d*i4 + e*i5 + f*i6 + g*i7 + h*i8
o = a*(i1 + i6) + b*(i2 + i5) + c*i3 + d*i4 + g*i7 + h*i8
Before: 8 multiplications + 7 additions; After: 6 multiplications + 6 additions
[Hegde, ISCA 2018]
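The factorization generalizes to any dot product whose weights repeat, which is the core of UCNN's grouping idea; a minimal sketch with my own naming:

```python
def dot_factorized(weights, inputs):
    """Sum the inputs that share a weight value first, then multiply once per
    distinct weight: multiplies drop from len(weights) to len(set(weights))."""
    groups = {}
    for w, x in zip(weights, inputs):
        groups[w] = groups.get(w, 0) + x    # input sum per distinct weight
    return sum(w * s for w, s in groups.items())
```

For the slide's weights (a, b, c, d, b, a, g, h) there are six distinct values, so eight multiplies become six.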
103. L08-103
Convolution (CONV) Layer
Filters (M × CRS) × Input fmaps (CRS × (H-R+1)(W-S+1)N) = Output fmaps (M × (H-R+1)(W-S+1)N)
• Dimensions of matrices for matrix multiply in
convolution layers with batch size N
N=2 in example
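The Toeplitz construction behind these dimensions can be sketched directly (the function name is mine; N=1 for brevity): unfold the input into a (C*R*S) × (H-R+1)(W-S+1) matrix, flatten the filters to M × (C*R*S), and multiply.

```python
def im2col_conv(I, F, C, H, W, M, R, S):
    """Convolution as matrix multiply (N=1): filters flatten to M x (C*R*S),
    the input unfolds into a (C*R*S) x (H-R+1)*(W-S+1) Toeplitz matrix."""
    P, Q = H - R + 1, W - S + 1
    toeplitz = [[I[c][p + r][q + s] for p in range(P) for q in range(Q)]
                for c in range(C) for r in range(R) for s in range(S)]
    filt = [[F[m][c][r][s] for c in range(C) for r in range(R) for s in range(S)]
            for m in range(M)]
    return [[sum(f * x for f, x in zip(frow, col)) for col in zip(*toeplitz)]
            for frow in filt]
```

Note the input expansion: overlapping windows are replicated across columns, which is the memory cost this transform trades for a plain matrix multiply.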
104. L08-104
1-D Toeplitz Convolution Einsum
Simplify to 1-D with N=1, C=1, M=1, U=1
Break into two steps