6.5930/1
Hardware Architectures for Deep Learning
Kernel Computation -
Impact of Memory Hierarchy
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 22, 2023
Goals of Today’s Lecture
• Understand impact of memory hierarchy
– Overview of caches
– Structuring algorithms to work well in caches using tiling
– Storage technologies
Readings for this Week
• Efficient Processing of Deep Neural Networks
– Chapter 4 of https://doi.org/10.1007/978-3-031-01766-7
Simple Pipelined µArchitecture
[Pipeline diagram: PC → IMEM → IR, GPR → X/Y operand registers → + and * functional units → DMEM]
Warning: Objects in PowerPoint may
be larger than they appear
What are the consequences of putting a large memory (e.g., megabytes) directly in the pipeline?
Long latency => dependency stalls
Large energy consumption
Pipelined µArchitecture with Caches
[Pipeline diagram: PC → I$ → IR, GPR → X/Y → + and * → D$, with both caches backed by Memory]
The instruction cache (I$) and data cache (D$) hold memory data for reuse in a small, energy-efficient buffer.
Direct Mapped Cache
[Diagram: the address is partitioned into multiple fields — Tag (t bits), Index (k bits), Offset (b bits); the tag and index together form the block number, and the offset is the block offset. The cache is an array of 2^k lines, each holding a valid bit (V), a tag, and a data block.]
– Index picks the row
– Valid bit indicates the data block is valid
– Data block consists of multiple words
– Valid bit set and a tag match means the data is in the cache (HIT)
– Offset selects the desired word or byte
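As a hedged illustration of the field decomposition (the specific widths below are assumptions, e.g., 64-byte blocks and 1024 lines, not values from the lecture):

BLOCK_BITS = 6     # b: byte offset within a 2^6 = 64-byte block
INDEX_BITS = 10    # k: picks one of 2^10 = 1024 lines

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)                 # block offset
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)  # picks the row
    tag = addr >> (BLOCK_BITS + INDEX_BITS)                 # compared for HIT
    return tag, index, offset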
Cache Operation
Look at the data address and search the cache tags to find a match. Then if…
– Found in cache (a.k.a. HIT): return a copy of the data from the cache.
– Not in cache (a.k.a. MISS): read the block of data from main memory, wait…, then return the data to the processor and update the cache.
Metric: Hit Rate = #Hits / (#Hits + #Misses)
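The hit/miss flow and the hit-rate metric can be made concrete with a small simulation; this is a minimal sketch of a direct-mapped cache with assumed parameters, not any particular design:

def hit_rate(addresses, num_lines=1024, block_bytes=64):
    tags = [None] * num_lines        # one tag per cache line
    hits = misses = 0
    for a in addresses:
        block = a // block_bytes     # block number
        index = block % num_lines    # index picks the row
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1                # HIT: return copy of data from cache
        else:
            misses += 1              # MISS: fetch from main memory, update cache
            tags[index] = tag
    return hits / (hits + misses)

# Long sequential stream of 4-byte words: one miss per 16-word block
# hit_rate(range(0, 1 << 20, 4)) -> 0.9375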
Treatment of Writes
• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache in processor pipeline
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
Cache Locality
Caches implicitly try to optimize data movement by exploiting two common properties of memory references:
– Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future.
• Exploited by making the block size larger than a word, which also amortizes fill overheads by getting more bytes with one access.
– Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future.
• Exploited by holding blocks for future access.
Fully Connected (FC) Computation
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations

CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
Outer loop: M iterations; inner loop: C*H*W iterations
=> M*C*H*W loads of the filter weights and M*C*H*W loads of the input activations
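A runnable Python version of the loop nest above (array sizes are illustrative assumptions; numpy is used only to hold the arrays and cross-check the result):

import numpy as np

M, C, H, W = 8, 2, 4, 4         # illustrative sizes, not a real layer
CHW = C * H * W
i = np.random.rand(CHW)         # input activations
f = np.random.rand(M * CHW)     # filter weights
o = np.zeros(M)                 # output activations
for m in range(M):
    for chw in range(CHW):
        o[m] += i[chw] * f[CHW * m + chw]
assert np.allclose(o, f.reshape(M, CHW) @ i)  # equals a matrix-vector multiply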
Impact of spatial locality
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64-byte blocks => 16 FP32 words/block
Hit rate of long sequential reference streams due to spatial locality? 15/16 => ~94%
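The arithmetic behind that answer:

words_per_block = 64 // 4       # 64-byte blocks, 4-byte FP32 words => 16
print((words_per_block - 1) / words_per_block)  # 15/16 = 0.9375 => ~94%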
FC – Data Reference Pattern
[Plot of address vs. time for the reference stream, not drawn to scale: filter weights F[M0 ------], F[M1 ------], F[M2 ------], F[M3 ------], F[M4 ------] sweep through M*CHW words as m = 0, 1, 2, 3, …; input activations I[C0 H0 W0], I[C0 H0 W1], … repeat over the same CHW words for every m.]
Weight locality… Spatial? Yes. Temporal? No. No reuse!
Input activation locality… Spatial? Yes. Temporal? It depends.
Amount of temporal locality
• Typical layer size:
– H, W = 256; C = 128
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64-byte blocks => 16 FP32 words/block
Size of input activations? 256 × 256 × 128 × 4 bytes => 32MB
What does this imply for input activation hit rate? No temporal locality, since 32MB > 64K bytes
Computational Intensity – Naïve FC
Computational Intensity = MACs / Data Words

Number of MACs: M*C*H*W
Input activation accesses: M*C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

Computational Intensity = $\frac{M \times C \times H \times W}{M \times C \times H \times W + M \times C \times H \times W + M} = \frac{1}{2 + \frac{1}{C \times H \times W}} \approx \frac{1}{2}$

CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
Computational Intensity – Ideal FC
Computational Intensity = MACs / Data Words

Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

Computational Intensity = $\frac{M \times C \times H \times W}{M \times C \times H \times W + C \times H \times W + M} = \frac{1}{1 + \frac{1}{M} + \frac{1}{C \times H \times W}} \approx 1$

CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
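A quick numeric check of the naïve and ideal intensities (layer sizes are illustrative assumptions):

M, C, H, W = 512, 128, 256, 256
macs = M * C * H * W
naive = macs / (macs + macs + M)        # weights and inputs reloaded per MAC
ideal = macs / (macs + C * H * W + M)   # each input activation loaded once
print(round(naive, 4), round(ideal, 4)) # ~0.5 vs ~1.0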
Einsum for strip mined FC
$O_m = I_{chw} \times F_{m,chw}$

Split the chw rank with tile size T into $chw_1 = chw/T$ and $chw_0 = chw\%T$:

$I_{chw/T,\,chw\%T} = I_{chw}$
$F_{m,\,chw/T,\,chw\%T} = F_{m,chw}$

$O_m = I_{chw_1,chw_0} \times F_{m,chw_1,chw_0}$
Fully Connected – Strip Mined

Untiled:
for m in [0, M):
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHW*m + chw]
Inner loop working set = C*H*W

Strip mined (CHW1*CHW0 = C*H*W):
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1 + chw0
            o[m] += i[chw] * f[CHW*m + chw]
Inner loop working set = CHW0

Just considering input activations, what value should CHW0 be? Less than the cache size.
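A sketch checking that the strip-mined nest computes the same result as the untiled one (sizes and the tile choice are illustrative assumptions):

import numpy as np

M, CHW, CHW0 = 4, 32, 8
CHW1 = CHW // CHW0                  # CHW1 * CHW0 = C*H*W
i = np.random.rand(CHW)
f = np.random.rand(M * CHW)
o = np.zeros(M)
for chw1 in range(CHW1):            # Level 1
    for m in range(M):
        for chw0 in range(CHW0):    # Level 0: working set = CHW0 inputs
            chw = CHW0 * chw1 + chw0
            o[m] += i[chw] * f[CHW * m + chw]
assert np.allclose(o, f.reshape(M, CHW) @ i)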
FC - Strip Mined Data Reference Pattern
Untiled vs. Tiled [address-vs-time plots, not drawn to scale: weights F[M0 ------] … F[M4 ------] sweep through M*CHW words; input activations I[C0 H0 W0], I[C0 H0 W1], … repeat over CHW words. With tiling, each CHW0-sized strip of inputs stays cache-resident while all M filters use it. Cache hits?]
Matrix-Vector Multiply – Strip Mined
Computational Intensity – Strip Mined
Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

Computational Intensity = $\frac{M \times C \times H \times W}{M \times C \times H \times W + C \times H \times W + M} = \frac{1}{1 + \frac{1}{M} + \frac{1}{C \times H \times W}} \approx 1$

// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1 + chw0
            o[m] += i[chw] * f[CHW*m + chw]
Associative Cache
[Diagram of a two-way set-associative cache: the address is split into Tag (t bits), Index (k bits), and Block Offset (b bits). Each 'way' has its own V, Tag, and Data Block arrays; the tags of both ways are compared in parallel, HIT is asserted if either matches, and the data word or byte is picked from the 'way' that 'hits'.]
Allows multiple streams to be resident at the same time.
Cache Miss Pipeline Diagram
Time (cycles) →

HIT:
ld r6, w(r5):   IF ID RF EX D$ WB
mul r7,r4,r6:   IF ID RF EX D$ WB

MISS:
ld r6, w(r5):   IF ID RF EX D$ MISS - MEM … WB
mul r7,r4,r6:   IF ID RF EX stall stall D$ WB
Avoiding Cache Miss Stalls
• Reorganize code so loads are far ahead of use
– Requires huge amount of unrolling
– Consumes lots of registers
• Add ‘prefetch’ instructions that just load cache
– Consumes instruction issue slots
• Add hardware that automatically loads cache
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch block b + 1 upon a miss on block b
• One Block Lookahead (OBL) scheme:
– Initiate a prefetch for block b + 1 when block b is accessed
– Can extend to N-block lookahead
• Strided prefetch:
– If we observe a sequence of accesses to blocks b, b+N, b+2N, then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access.
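A hedged sketch of the stride-detection idea for a single stream (real prefetchers such as the Power 5's track several streams in hardware; this is not that design):

def predict_next_block(history):
    # After accesses to blocks b, b+N, b+2N, predict b+3N.
    if len(history) >= 3:
        s1 = history[-1] - history[-2]
        s2 = history[-2] - history[-3]
        if s1 == s2 and s1 != 0:
            return history[-1] + s1   # candidate block to prefetch
    return None                       # no stable stride observed

# predict_next_block([7, 10, 13]) -> 16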
Multi-level Caches
• A memory cannot be both large and fast
• Add levels of cache to reduce the miss penalty
– Each level can have longer latency than the level above
– So, cache sizes increase at each level
CPU ↔ L1 ↔ L2 ↔ DRAM
Metrics:
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
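A worked example of the metrics (the counts are assumptions):

accesses = 1000                     # CPU memory accesses
l1_misses = 100                     # only these reach L2
l2_misses = 20
l1_local = l1_misses / accesses     # 0.10 = misses in L1 / accesses to L1
l2_local = l2_misses / l1_misses    # 0.20 = misses in L2 / accesses to L2
l2_global = l2_misses / accesses    # 0.02 = misses in L2 / CPU memory accesses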
Contemporary CPU Cache Hierarchy
FC Layer – Multichannel

[Diagram: N input fmaps, each C × H × W; M filters, each C × H × W (filter size equals input size); producing N output fmaps, each M × 1 × 1]
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × 1) = Output fmaps (M × 1)
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
FC Einsum Notation
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)

$O_{n,m} = F_{m,chw} \times I_{n,chw}$
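This Einsum maps directly onto numpy's einsum (shapes are illustrative assumptions):

import numpy as np

M, CHW, N = 16, 64, 4
F = np.random.rand(M, CHW)          # filters
I = np.random.rand(N, CHW)          # flattened input fmaps
O = np.einsum('mk,nk->nm', F, I)    # O[n,m] = sum over chw of F[m,chw]*I[n,chw]
assert np.allclose(O, I @ F.T)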
How much temporal locality does the naïve implementation have? None.
Matrix-Matrix Multiply
Matrix-Matrix Multiply Tiled
Tiled Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), with each matrix split into 2×2 tiles:

$\begin{bmatrix} F_{0,0} & F_{0,1} \\ F_{1,0} & F_{1,1} \end{bmatrix} \times \begin{bmatrix} I_{0,0} & I_{0,1} \\ I_{1,0} & I_{1,1} \end{bmatrix} = \begin{bmatrix} F_{0,0}I_{0,0} + F_{0,1}I_{1,0} & F_{0,0}I_{0,1} + F_{0,1}I_{1,1} \\ F_{1,0}I_{0,0} + F_{1,1}I_{1,0} & F_{1,0}I_{0,1} + F_{1,1}I_{1,1} \end{bmatrix}$

Matrix multiply tiled to fit in the cache, and computation ordered to maximize reuse of data in the cache.
*Dotted line (in the slide animation) means a partial result.
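A sketch of that tiling as a loop nest (the tile size T and matrix shapes are illustrative assumptions; each F-tile/I-tile pair is reused while it is cache-resident, and the output tiles accumulate partial results):

import numpy as np

M, K, N, T = 4, 6, 8, 2             # K plays the role of CHW; T divides all three
F = np.random.rand(M, K)
I = np.random.rand(K, N)
O = np.zeros((M, N))
for m1 in range(0, M, T):
    for n1 in range(0, N, T):
        for k1 in range(0, K, T):   # accumulate partial results tile by tile
            O[m1:m1+T, n1:n1+T] += F[m1:m1+T, k1:k1+T] @ I[k1:k1+T, n1:n1+T]
assert np.allclose(O, F @ I)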
Einsum for tiled FC
$O_{n,m} = I_{n,chw} \times F_{m,chw}$

$I_{n,chw} \rightarrow I_{n_1,chw_1,n_0,chw_0}$
$F_{m,chw} \rightarrow F_{m_1,chw_1,m_0,chw_0}$

$O_{n_1,m_1,n_0,m_0} = I_{n_1,chw_1,n_0,chw_0} \times F_{m_1,chw_1,m_0,chw_0}$
Fully-Connected (FC) Layer
February 22, 2023
• Implementation: Matrix Multiplication (GEMM)
• CPU: OpenBLAS, Intel MKL, etc.
• GPU: cuBLAS, cuDNN, etc.
• The library will note the shape of the matrix multiply and select an implementation optimized for that shape.
• Optimization usually involves proper tiling to the storage hierarchy.
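For example, the whole layer is a single GEMM call in numpy, which dispatches to whatever BLAS it was built against (OpenBLAS, MKL, …) and handles the tiling internally:

import numpy as np

M, CHW, N = 512, 4096, 32           # illustrative shapes
F = np.random.rand(M, CHW).astype(np.float32)
I = np.random.rand(CHW, N).astype(np.float32)
O = F @ I                           # output fmaps: M x N, one sgemm call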
Tradeoffs in Memories
Overview of Memories
Memories consist of arrays of cells that hold a value.
• Types of Memories/Storage
– Latches/Flip Flops (Registers)
– SRAM (Register File, Caches)
– DRAM (Main Memory)
– Flash (Storage)
Elements of Memory Operation
Implementations vary based on:
– How does a memory cell hold a value?
– How is a value obtained from a memory cell?
– How is a value set in a memory cell?
– How is an array constructed out of individual cells?
• Results in tradeoffs between cost, density, speed, energy and
power consumption
Latches/Flip Flops
• Fast and low latency
• Located with logic
[Example from the CPU pipeline: the PC, IR, GPR, and X/Y operand registers are flip-flops located with the logic. D flip-flop schematic; image source: 6.111]
Latches/Flip Flops (< 0.5 kB)
• Fast and low latency
• Located with logic
• Not very dense
– 10+ transistors per bit
– Usually used for arrays smaller than 0.5 kB
[Array of flip-flops read via an address [A2:A0] multiplexer; D flip-flop schematic; image source: 6.111]
SRAM
• Higher density than registers
– Usually 6 transistors per bit-cell
• Less robust and slower than latches/flip-flops
[Photo: IC wafer; bit-cell size 0.75 µm² in 14 nm]
SRAM (kB – MB)
[SRAM array diagram, addressed by [Ak:A0]]
SRAM Power Dominated by Bit Line
Measured SRAM power breakdown @ VDD = 0.6V:
– Bit-lines (BL): 56%
– Word-line (WL): 6%
– Sensing network: 15%
– Other: 22%
Larger array → longer bit-lines → higher capacitance → higher power
(Image source: Mahmut Sinangil)
DRAM
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
DRAM (GB)
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
– Usually off-chip (except eDRAM – which is pricey!)
– Off-chip interconnect has much higher capacitance
Flash (100GB to TB)
• Denser than DRAM
• Non-volatile
– Needs a high-powered write (changes the V_TH of a transistor)
Flash Memory
[Bit-cell evolution: Single-Level Cell (SLC) → Multi-Level Cell (MLC) → 48-layer Ternary-Level Cell (TLC); Aug 2015: 256 Gb per die (for SSD)]
Memory Tradeoffs
Density: function of circuit type (smaller → denser)
Cost/bit: function of circuit type (smaller → cheaper)
Energy/access/bit: function of total capacity (smaller → less energy) and circuit type (smaller → less energy)
Latency: function of circuit type (smaller → slower) and total capacity (smaller → faster)
Bandwidth: increases with parallelism

Most attributes tend to improve with technology scaling, lower voltage, and sometimes smaller capacitors.
Summary
• Reduce main memory access with caches
– Main memory (i.e., DRAM) is slow and has high energy consumption
– Exploits spatial and temporal locality
• Tiling to reduce cache misses
– Possible since processing order does not affect result (MACs are commutative)
– Add levels to loop nest to improve temporal locality
– Size of tile depends on cache size and cache associativity
• Tradeoffs in storage technology
– Various tradeoffs in cost, speed, energy, capacity…
– Different technologies appropriate at different spots in the design
Next Lecture: Vectorization
Thank you!