1. L06-1
6.5930/1 Hardware Architectures for Deep Learning
Kernel Computation – Impact of Memory Hierarchy
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 22, 2023
2. L06-2
Goals of Today’s Lecture
• Understand impact of memory hierarchy
– Overview of caches
– Structuring algorithms to work well in caches using tiling
– Storage technologies
3. L06-3
Readings for this Week
• Efficient Processing of Deep Neural Networks
– Chapter 4 of https://doi.org/10.1007/978-3-031-01766-7
4. L06-4
Simple Pipelined µArchitecture
[Pipeline diagram: PC → IMEM → IR, GPR → X/Y operand registers → ALU (+, ×) → DMEM]
Warning: objects in PowerPoint may be larger than they appear.
What are the consequences of putting a large memory (e.g., megabytes) directly in the pipeline?
– Long latency => dependency stalls
– Large energy consumption
5. L06-5
Pipelined µArchitecture with Caches
[Pipeline diagram: PC → I$ → IR, GPR → X/Y → +, × → D$ → Memory]
The instruction cache (I$) and data cache (D$) hold memory data for reuse in a small, energy-efficient buffer.
6. L06-6
Direct Mapped Cache
[Direct-mapped cache diagram: the address is split into Tag (t bits), Index (k bits), and Offset (b bits); the cache is an array of 2^k lines, each holding a valid bit (V), a tag, and a data block.]
– The address is partitioned into multiple fields: a block number (tag + index) and a block offset.
– The index picks a row among the 2^k lines.
– The valid bit indicates that the data block is valid.
– A data block consists of multiple words.
– A valid bit and a tag match mean the data is in the cache (HIT).
– The offset selects the desired word or byte from the data block.
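To make the field partitioning concrete, here is a minimal Python sketch that splits an address into tag, index, and offset; the 64 KB capacity and 64-byte block size are assumptions chosen to match the in-pipeline cache discussed later.

# Minimal sketch: split an address into tag / index / offset for a
# direct-mapped cache. Capacity and block size are illustrative assumptions.
CACHE_BYTES = 64 * 1024                    # total capacity
BLOCK_BYTES = 64                           # block (line) size
NUM_LINES = CACHE_BYTES // BLOCK_BYTES     # 2^k lines

b = BLOCK_BYTES.bit_length() - 1           # offset bits
k = NUM_LINES.bit_length() - 1             # index bits

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> b) & (NUM_LINES - 1)
    tag = addr >> (b + k)
    return tag, index, offset

print(split_address(0x12345678))           # -> (tag, index, offset)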
7. L06-7
Cache Operation
Look at the data address and search the cache tags to find a match. Then, if…
– Found in cache (a.k.a. HIT): return a copy of the data from the cache.
– Not in cache (a.k.a. MISS): read the block of data from main memory, wait…, then return the data to the processor and update the cache.
Metric: Hit Rate = #Hits / (#Hits + #Misses)
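As a sanity check on these definitions, here is a minimal sketch of a direct-mapped cache model that only tracks tags and counts hits and misses; sizes are illustrative assumptions.

# Minimal sketch of a direct-mapped cache model that counts hits and misses.
class DirectMappedCache:
    def __init__(self, cache_bytes=64 * 1024, block_bytes=64):
        self.block_bytes = block_bytes
        self.num_lines = cache_bytes // block_bytes
        self.tags = [None] * self.num_lines   # one tag per line (no data stored)
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.block_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1            # HIT: data already in cache
        else:
            self.misses += 1          # MISS: fetch block and update cache
            self.tags[index] = tag

    def hit_rate(self):
        return self.hits / (self.hits + self.misses)

# Long sequential stream of 4-byte words: hit rate ~ 15/16 (spatial locality).
cache = DirectMappedCache()
for addr in range(0, 1 << 20, 4):
    cache.access(addr)
print(f"hit rate = {cache.hit_rate():.3f}")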
8. L06-8
Treatment of Writes
• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache in processor pipeline
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
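To illustrate the common write-back with write-allocate combination, here is a minimal sketch that adds a dirty bit per line and counts traffic to main memory; the structure and sizes are assumptions for illustration, not the lecture's code.

# Sketch: write-back, write-allocate policy with a dirty bit per line.
class WriteBackCache:
    def __init__(self, cache_bytes=64 * 1024, block_bytes=64):
        self.block_bytes = block_bytes
        self.num_lines = cache_bytes // block_bytes
        self.tags = [None] * self.num_lines
        self.dirty = [False] * self.num_lines
        self.mem_reads = 0    # blocks fetched from main memory
        self.mem_writes = 0   # dirty blocks written back on eviction

    def _lookup(self, addr):
        block = addr // self.block_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] != tag:      # miss: allocate (fetch on write too)
            if self.dirty[index]:        # evicted block is dirty -> write back
                self.mem_writes += 1
                self.dirty[index] = False
            self.tags[index] = tag
            self.mem_reads += 1
        return index

    def read(self, addr):
        self._lookup(addr)

    def write(self, addr):
        index = self._lookup(addr)       # write allocate
        self.dirty[index] = True         # memory is updated only on eviction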
9. L06-9
Cache Locality
Caches implicitly optimize data movement by exploiting two common properties of memory references:
– Spatial Locality: If a location is referenced it is likely that locations near it
will be referenced in the near future.
• Exploited by having block size larger than a word, which also amortizes fill
overheads by getting more bytes with one access
– Temporal Locality: If a location is referenced it is likely to be referenced
again in the near future.
• Exploited by holding blocks for future access
10. L06-10
Fully Connected (FC) Computation
int i[C*H*W];   # Input activations
int f[M*C*H*W]; # Filter weights
int o[M];       # Output activations

CHWm = -C*H*W
for m in [0, M):            # M iterations
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):  # C*H*W iterations
        o[m] += i[chw] * f[CHWm + chw]

Total: M*C*H*W loads of filter weights and M*C*H*W loads of input activations
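A runnable NumPy version of this pseudocode, checked against a matrix-vector product; the sizes are small illustrative assumptions.

import numpy as np

# Illustrative sizes (assumptions, not the layer sizes quoted later)
C, H, W, M = 2, 3, 3, 4
CHW = C * H * W

i = np.random.rand(CHW)        # input activations, flattened
f = np.random.rand(M, CHW)     # filter weights, one row per output
o = np.zeros(M)                # output activations

for m in range(M):                 # M iterations
    for chw in range(CHW):         # C*H*W iterations
        o[m] += i[chw] * f[m, chw] # one MAC; M*C*H*W total

assert np.allclose(o, f @ i)       # same result as a matrix-vector product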
11. L06-11
Impact of spatial locality
• Typical in-pipeline cache size
– 64 KB => 16K FP32 words
– 64-byte blocks => 16 FP32 words/block
Hit rate of long sequential reference streams due to spatial locality? 15/16 => ~94%
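The hit-rate estimate follows directly from the block size; a short sketch of the arithmetic:

# Hit rate of a long sequential FP32 stream from spatial locality alone:
block_bytes, word_bytes = 64, 4
words_per_block = block_bytes // word_bytes          # 16
hit_rate = (words_per_block - 1) / words_per_block   # first word of each block misses
print(f"{hit_rate:.1%}")                             # ~93.8%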
12. L06-12
FC – Data Reference Pattern
[Figure: address-reference pattern over time (not drawn to scale). Weight references F[M0 ------] … F[M4 ------] stream through M*CHW addresses; input activation references I[C0 H0 W0], I[C0 H0 W1], … span CHW addresses and are re-read for each m = 0, 1, 2, 3, …]
– Weight locality: Spatial? Yes. Temporal? No. No reuse!
– Input activation locality: Spatial? Yes. Temporal? It depends.
14. L06-14
Amount of temporal locality
• Typical layer size
– H, W = 256; C = 128
• Typical in-pipeline cache size
– 64 KB => 16K FP32 words
– 64-byte blocks => 16 FP32 words/block
Size of input activations? 256x256x128x4 bytes => 32 MB
What does this imply for input activation hit rate? No temporal locality, since 32 MB > 64 KB
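A quick check of the input-activation footprint against the cache capacity, using the values from the slide:

# Input-activation footprint for a typical layer vs. the in-pipeline cache:
H, W, C, bytes_per_word = 256, 256, 128, 4
footprint = H * W * C * bytes_per_word
print(footprint / 2**20, "MiB")   # 32.0 MiB, far larger than a 64 KB cache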
15. L06-15
Computational Intensity – Naïve FC
Number of MACs: M*C*H*W
Input activation accesses: M*C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M
Computational Intensity = MACs / Data Words
                        = (M × C × H × W) / (M × C × H × W + M × C × H × W + M)
                        = 1 / (2 + 1/(C × H × W)) ≈ 1/2
CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
16. L06-16
Computational Intensity – Ideal FC
Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M
Computational Intensity = MACs / Data Words
                        = (M × C × H × W) / (M × C × H × W + C × H × W + M)
                        = 1 / (1 + 1/M + 1/(C × H × W)) ≈ 1
CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
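A small sketch that evaluates both expressions numerically; C, H, W match the typical layer quoted earlier, while M is an assumed value for illustration.

# Computational intensity (MACs per data word moved) for the FC layer:
# naive case (every operand re-fetched) vs. ideal case (inputs fetched once).
M, C, H, W = 4096, 128, 256, 256   # M is an assumption for illustration
macs = M * C * H * W

naive_words = M * C * H * W + M * C * H * W + M   # inputs + weights + outputs
ideal_words = C * H * W + M * C * H * W + M       # inputs fetched only once

print("naive CI:", macs / naive_words)   # ~0.5
print("ideal CI:", macs / ideal_words)   # ~1.0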
17. L06-17
Einsum for strip mined FC
Original Einsum:      O_m = I_chw × F_m,chw
Split the chw rank into tiles of size T (chw1 = chw/T, chw0 = chw%T):
    I_chw1,chw0 = I_chw
    F_m,chw1,chw0 = F_m,chw
Strip-mined Einsum:   O_m = I_chw1,chw0 × F_m,chw1,chw0
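A NumPy sketch of the rank split, checking that the strip-mined Einsum computes the same output; the sizes and the tile size T = CHW0 are illustrative assumptions.

import numpy as np

# Flatten chw into (chw1, chw0) tiles of size CHW0.
CHW, CHW0, M = 12, 4, 3          # illustrative sizes (assumptions)
CHW1 = CHW // CHW0
i = np.random.rand(CHW)
f = np.random.rand(M, CHW)

i_t = i.reshape(CHW1, CHW0)              # I[chw1, chw0]
f_t = f.reshape(M, CHW1, CHW0)           # F[m, chw1, chw0]
o = np.einsum('ab,mab->m', i_t, f_t)     # O[m], summed over chw1 and chw0

assert np.allclose(o, f @ i)             # same result as the unsplit Einsum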
18. L06-18
Fully Connected – Strip Mined

Untiled:
for m in [0, M):
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHW*m + chw]
Inner loop working set = C*H*W input activations

Strip mined (CHW1*CHW0 = C*H*W):
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1 + chw0
            o[m] += i[chw] * f[CHW*m + chw]
Inner loop working set = CHW0 input activations

Just considering input activations, what value should CHW0 be? Less than the cache size.
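A runnable version of the strip-mined loop nest, verified against the untiled result; the sizes are illustrative assumptions.

import numpy as np

M, CHW, CHW0 = 3, 12, 4
CHW1 = CHW // CHW0                 # CHW1 * CHW0 = C*H*W
i = np.random.rand(CHW)
f = np.random.rand(M * CHW)
o = np.zeros(M)

for chw1 in range(CHW1):           # Level 1: one input tile per iteration
    for m in range(M):
        for chw0 in range(CHW0):   # Level 0: tile of CHW0 input activations
            chw = CHW0 * chw1 + chw0
            o[m] += i[chw] * f[CHW * m + chw]

assert np.allclose(o, f.reshape(M, CHW) @ i)   # same result as the untiled loop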
19. L06-19
FC - Strip Mined Data Reference Pattern
[Figure: address-reference pattern (not drawn to scale), untiled vs. tiled. Weights F[M0 ------] … F[M4 ------] span M*CHW addresses; input activations I[C0 H0 W0], I[C0 H0 W1], … span CHW addresses. Which input activation references become cache hits in the tiled order?]
21. L06-21
Computational Intensity – Strip Mined
Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M
Computational Intensity = MACs / Data Words
                        = (M × C × H × W) / (M × C × H × W + C × H × W + M)
                        = 1 / (1 + 1/M + 1/(C × H × W)) ≈ 1
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1 + chw0
            o[m] += i[chw] * f[CHW*m + chw]
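To see the effect of strip mining on the cache, here is a self-contained sketch that traces only the input-activation references and measures their hit rate in a tiny direct-mapped cache model; the sizes are scaled-down assumptions chosen so the full input does not fit in the cache.

# Input-activation hit rate, untiled vs. strip-mined, in a small cache model.
def hit_rate(trace, cache_bytes=1024, block_bytes=64):
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines
    hits = 0
    for addr in trace:
        block = addr // block_bytes
        index, tag = block % num_lines, block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag
    return hits / len(trace)

M, CHW, CHW0 = 8, 4096, 128          # 16 KB of FP32 inputs >> 1 KB cache
untiled = [4 * chw for m in range(M) for chw in range(CHW)]
tiled = [4 * (CHW0 * chw1 + chw0)
         for chw1 in range(CHW // CHW0)
         for m in range(M)
         for chw0 in range(CHW0)]

print("untiled input hit rate:", hit_rate(untiled))   # ~15/16 (spatial only)
print("tiled input hit rate:  ", hit_rate(tiled))     # much higher (temporal reuse)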
22. L06-22
Associative Cache
[Diagram: set-associative cache with multiple ways. The index (k bits) selects a set; each way holds its own valid bit, tag, and data block; the address tag (t bits) is compared against every way in parallel, and the block offset (b bits) selects the data word or byte.]
– Allows multiple streams to be resident at the same time
– Data is picked from the 'way' that 'hits'
23. L06-23
Cache Miss Pipeline Diagram
[Pipeline diagrams (time in cycles):
HIT: ld r6, w(r5) flows IF ID RF EX D$ WB, and the dependent mul r7,r4,r6 follows one stage behind.
MISS: the load's D$ stage expands to MISS - MEM before WB, and the dependent mul stalls until the data returns.]
24. L06-24
Avoiding Cache Miss Stalls
• Reorganize code so loads are far ahead of use
– Requires huge amount of unrolling
– Consumes lots of registers
• Add ‘prefetch’ instructions that just load cache
– Consumes instruction issue slots
• Add hardware that automatically loads cache
25. L06-25
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b
• One Block Lookahead (OBL) scheme
– Initiate prefetch for block b + 1 when block b is accessed
– Can extend to N block lookahead
• Strided prefetch
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
Example: the IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
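A minimal sketch of the stride-detection idea; purely illustrative, since real prefetchers such as the Power 5's track many independent streams in hardware.

# Detect a steady stride from the last three block numbers and predict the next.
def predict_next_block(block_history):
    if len(block_history) < 3:
        return None
    b0, b1, b2 = block_history[-3:]
    stride = b1 - b0
    if stride != 0 and (b2 - b1) == stride:   # two equal strides -> steady stream
        return b2 + stride                    # prefetch the next block
    return None

print(predict_next_block([7, 9, 11]))   # -> 13
print(predict_next_block([7, 9, 14]))   # -> None (no steady stride)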
26. L06-26
Multi-level Caches
• A memory cannot be large and fast
• Add level of cache to reduce miss penalty
– Each level can have longer latency than level above
– So, increase sizes of cache at each level
[Hierarchy: CPU → L1 → L2 → DRAM]
Metrics:
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
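A small sketch evaluating these metrics for a two-level hierarchy; the counts are made-up illustrative numbers.

# Local vs. global miss rates and misses per instruction for CPU -> L1 -> L2 -> DRAM.
cpu_accesses = 1000          # memory accesses issued by the CPU
l1_misses = 100              # L1 misses -> accesses presented to L2
l2_misses = 20               # L2 misses -> go to DRAM
instructions = 2500

l1_local = l1_misses / cpu_accesses      # 10%
l2_local = l2_misses / l1_misses         # 20% of the accesses L2 actually sees
l2_global = l2_misses / cpu_accesses     # 2% of all CPU memory accesses
l2_mpi = l2_misses / instructions        # misses per instruction

print(l1_local, l2_local, l2_global, l2_mpi)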
28. L06-28
FC Layer – Multichannel
[Figure: N input fmaps, each C×H×W; M filters, each C×H×W (same size as an input fmap); N output fmaps, each 1×1 with M channels]
29. L06-29
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)
30. L06-30
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
31. L06-31
FC Einsum Notation
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
O_n,m = F_m,chw × I_n,chw
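The same Einsum written with np.einsum and checked against the equivalent matrix-matrix multiply; the sizes are illustrative assumptions.

import numpy as np

M, CHW, N = 4, 18, 2
F = np.random.rand(M, CHW)     # filters
I = np.random.rand(N, CHW)     # flattened input fmaps (batch of N)

O = np.einsum('mk,nk->nm', F, I)   # O[n,m] = sum over chw of F[m,chw] * I[n,chw]
assert np.allclose(O, I @ F.T)     # equivalent matrix-matrix multiply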
35. L06-35
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
How much temporal locality for naïve implementation? None
38. L06-38
Tiled Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
[ F0,0  F0,1 ]   [ I0,0  I0,1 ]   [ F0,0·I0,0 + F0,1·I1,0    F0,0·I0,1 + F0,1·I1,1 ]
[ F1,0  F1,1 ] × [ I1,0  I1,1 ] = [ F1,0·I0,0 + F1,1·I1,0    F1,0·I0,1 + F1,1·I1,1 ]
Matrix multiply tiled to fit in the cache, with the computation ordered to maximize reuse of data in the cache
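A sketch of the blocked (tiled) matrix multiply in the picture above, written in NumPy; the matrix and tile sizes are assumptions chosen so that each F, I, and O tile could plausibly fit in a cache.

import numpy as np

M, K, N, T = 8, 8, 8, 4            # K plays the role of CHW; T = tile size
F = np.random.rand(M, K)
I = np.random.rand(K, N)
O = np.zeros((M, N))

for m1 in range(0, M, T):          # loop over output tiles
    for n1 in range(0, N, T):
        for k1 in range(0, K, T):  # accumulate partial products tile by tile
            O[m1:m1+T, n1:n1+T] += F[m1:m1+T, k1:k1+T] @ I[k1:k1+T, n1:n1+T]

assert np.allclose(O, F @ I)       # same result as the untiled multiply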
42. L06-42
Einsum for tiled FC
Original Einsum:   O_n,m = I_n,chw × F_m,chw
Split the batch, output, and chw ranks into tiles:
    I_n,chw → I_n1,chw1,n0,chw0
    F_m,chw → F_m1,chw1,m0,chw0
Tiled Einsum:      O_n1,m1,n0,m0 = I_n1,chw1,n0,chw0 × F_m1,chw1,m0,chw0
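A NumPy sketch of the tiled Einsum, reshaping into (tile, within-tile) ranks and checking against the untiled result; all sizes are illustrative assumptions.

import numpy as np

N, M, CHW = 4, 6, 8
N0, M0, CHW0 = 2, 3, 4             # tile sizes
I = np.random.rand(N, CHW)
F = np.random.rand(M, CHW)

I_t = I.reshape(N // N0, N0, CHW // CHW0, CHW0).transpose(0, 2, 1, 3)   # I[n1,chw1,n0,chw0]
F_t = F.reshape(M // M0, M0, CHW // CHW0, CHW0).transpose(0, 2, 1, 3)   # F[m1,chw1,m0,chw0]

O_t = np.einsum('acbd,ecfd->aebf', I_t, F_t)   # O[n1,m1,n0,m0], summed over chw1,chw0
O = O_t.transpose(0, 2, 1, 3).reshape(N, M)    # back to O[n,m]
assert np.allclose(O, I @ F.T)                 # same result as the untiled Einsum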
43. L06-43
Fully-Connected (FC) Layer
• Implementation: Matrix Multiplication (GEMM)
• CPU: OpenBLAS, Intel MKL, etc.
• GPU: cuBLAS, cuDNN, etc.
• The library will note the shape of the matrix multiply and select an implementation optimized for that shape.
• Optimization usually involves proper tiling to the storage hierarchy.
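In practice the whole FC layer is a single GEMM call; for example, the NumPy matrix multiply below is dispatched to whatever BLAS the installation was built with (e.g., OpenBLAS or MKL). Sizes are illustrative assumptions.

import numpy as np

N, M, CHW = 32, 512, 1024                     # illustrative layer/batch sizes
F = np.random.rand(M, CHW).astype(np.float32)  # filters
I = np.random.rand(N, CHW).astype(np.float32)  # flattened input fmaps
O = I @ F.T                                    # (N, CHW) x (CHW, M) -> (N, M)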
45. L06-45
Overview of Memories
Memories consist of arrays of cells that hold a value.
• Types of Memories/Storage
– Latches/Flip Flops (Registers)
– SRAM (Register File, Caches)
– DRAM (Main Memory)
– Flash (Storage)
46. L06-46
Elements of Memory Operation
Implementations vary based on:
– How does a memory cell hold a value?
– How is a value obtained from a memory cell?
– How is a value set in a memory cell?
– How is an array constructed out of individual cells?
• Results in tradeoffs between cost, density, speed, energy and
power consumption
47. L06-47
Latches/Flip Flops
• Fast and low latency
• Located with logic
[Figures: a D flip-flop (image source: 6.111) and an example from the CPU pipeline, where pipeline registers (PC, IR, GPR, X, Y) sit with the logic between I$, the + and × units, and D$]
48. L06-48
Latches/Flip Flops (< 0.5 kB)
• Fast and low latency
• Located with logic
• Not very dense
– 10+ transistors per bit
– Usually used for arrays smaller than 0.5 kB
[Figure: an array of D flip-flops with a read-address multiplexer, Read Address [A2:A0]. Image source: 6.111]
49. L06-49
Latches/Flip Flops (< 0.5 kB)
[Figure: the array of flip-flops (Read Address [A2:A0]) used within the CPU pipeline (PC, I$, IR, GPR, X, Y, +, ×, D$)]
50. L06-50
SRAM
• Higher density than registers
– Usually 6 transistors per bit-cell
• Less robust and slower than latches/flip-flops
Bit-cell size: 0.75 µm² in 14nm [image: IC wafer]
53. L06-53
SRAM Power Dominated by Bit Line
Measured SRAM power breakdown @ VDD = 0.6V:
– Bit-lines (BL): 56%
– Word-line (WL): 6%
– Sensing network: 15%
– Other: 22%
Larger array → longer bit-lines → higher capacitance → higher power
Image source: Mahmut Sinangil
54. L06-54
DRAM
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
55. L06-55
DRAM (GB)
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
– Usually off-chip (except eDRAM – which is pricey!)
– Off-chip interconnect has much higher capacitance
56. L06-56
Flash (100GB to TB)
• More dense than DRAM
• Non-volatile
– Needs a high-powered write (changes the VTH of the transistor)
57. L06-57
Flash Memory
Cell types: Single-Level Cell (SLC), Multi-Level Cell (MLC), Triple-Level Cell (TLC)
Example: 48-layer TLC, 256 Gb per die (for SSD), Aug 2015
58. L06-58
Memory Tradeoffs
• Density: function of circuit type (smaller → denser)
• Cost/bit: function of circuit type (smaller → cheaper)
• Energy/access/bit: function of total capacity (smaller → less energy) and circuit type (smaller → less energy)
• Latency: function of circuit type (smaller → slower) and total capacity (smaller → faster)
• Bandwidth: increases with parallelism
Most attributes tend to improve with technology scaling, lower voltage, and sometimes smaller capacitors.
59. L06-59
Summary
• Reduce main memory access with caches
– Main memory (i.e., DRAM) is slow and has high energy consumption
– Exploits spatial and temporal locality
• Tiling to reduce cache misses
– Possible since processing order does not affect result (MACs are commutative)
– Add levels to loop nest to improve temporal locality
– Size of tile depends on cache size and cache associativity
• Tradeoffs in storage technology
– Various tradeoffs in cost, speed, energy, capacity…
– Different technologies appropriate at different spots in the design