L06-1
Sze and Emer
6.5930/1
Hardware Architectures for Deep Learning
Kernel Computation -
Impact of Memory Hierarchy
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 22, 2023
L06-2
Goals of Today’s Lecture
• Understand impact of memory hierarchy
– Overview of caches
– Structuring algorithms to work well in caches using tiling
– Storage technologies
L06-3
Readings for this Week
• Efficient Processing of Deep Neural Networks
– Chapter 4 of https://doi.org/10.1007/978-3-031-01766-7
L06-4
Simple Pipelined µArchitecture
[Pipeline diagram: PC → IMEM → IR, GPR → X, Y → +, * → DMEM]
Warning: Objects in PowerPoint may
be larger than they appear
What are the consequences of putting a large memory (e.g., megabytes) directly in the pipeline?
Long latency => dependency stalls
Large energy consumption
L06-5
Pipelined µArchitecture with Caches
[Pipeline diagram: PC → I$ → IR, GPR → X, Y → +, * → D$ → Memory]
Instruction cache (I$) and data cache (D$) hold memory data for reuse in a small, energy-efficient buffer
L06-6
Direct Mapped Cache
Address partitioned into multiple fields: Tag (t bits) | Index (k bits) | Offset (b bits)
The index picks one of the 2^k rows (lines); each line holds a valid bit (V), a tag, and a data block.
The valid bit indicates the data block is valid; the data block consists of multiple words.
A valid bit and tag match means the data is in the cache (HIT); the offset selects the desired word or byte.
L06-7
Cache Operation
Look at the data address and search the cache tags to find a match. Then if…
Found in cache (a.k.a. HIT): return a copy of the data from the cache.
Not in cache (a.k.a. MISS): read the block of data from main memory, wait…, then return the data to the processor and update the cache.
Metric: Hit Rate = #Hits / (#Hits + #Misses)
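The hit/miss flow and the hit-rate metric above can be sketched with a toy direct-mapped cache model. The sizes below (8 lines, 16-byte blocks) are arbitrary illustrative choices, much smaller than a real cache:

```python
# Toy direct-mapped cache model illustrating the hit/miss flow above.
# Hypothetical parameters: 8 lines, 16-byte blocks.
NUM_LINES = 8
BLOCK_BYTES = 16

class DirectMappedCache:
    def __init__(self):
        # Each line holds (valid, tag); the data itself is not modeled.
        self.lines = [(False, None)] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // BLOCK_BYTES   # strip the block offset
        index = block % NUM_LINES     # index picks the row
        tag = block // NUM_LINES      # remaining bits form the tag
        valid, stored_tag = self.lines[index]
        if valid and stored_tag == tag:   # valid + tag match => HIT
            self.hits += 1
        else:                             # MISS: fetch block, update cache
            self.misses += 1
            self.lines[index] = (True, tag)

    def hit_rate(self):
        return self.hits / (self.hits + self.misses)

cache = DirectMappedCache()
for addr in range(0, 512, 4):   # long sequential stream of 4-byte words
    cache.access(addr)
print(cache.hit_rate())   # 4 words per block => 1 miss + 3 hits per block
```

With 16 FP32 words per block, as on the next slide, the same sequential stream would hit 15 out of 16 times.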
L06-8
Treatment of Writes
• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache in processor pipeline
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
L06-9
Cache Locality
Caches implicitly try to optimize data movement by trying to exploit
two common properties of memory references:
– Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future.
• Exploited by having a block size larger than a word, which also amortizes fill overheads by getting more bytes with one access
– Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future.
• Exploited by holding blocks for future access
L06-10
Fully Connected (FC) Computation
int i[C*H*W];   # Input activations
int f[M*C*H*W]; # Filter weights
int o[M];       # Output activations

CHWm = -C*H*W
for m in [0, M):               # M iterations
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):     # C*H*W iterations
        o[m] += i[chw] * f[CHWm + chw]

Total: M*C*H*W loads of filter weights and M*C*H*W loads of input activations
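The pseudocode above runs almost directly as Python; a minimal sketch with small, arbitrary layer sizes, cross-checked against a plain dot product:

```python
import random

# Runnable version of the naive FC pseudocode above.
# C, H, W, M here are small arbitrary sizes for illustration.
C, H, W, M = 2, 3, 3, 4
CHW = C * H * W

i = [random.random() for _ in range(CHW)]      # input activations
f = [random.random() for _ in range(M * CHW)]  # filter weights
o = [0.0] * M                                  # output activations

CHWm = -CHW
for m in range(M):              # M iterations
    o[m] = 0.0
    CHWm += CHW                 # start of filter m's weights
    for chw in range(CHW):      # C*H*W iterations
        o[m] += i[chw] * f[CHWm + chw]

# Cross-check each output against a direct dot product.
expected = [sum(i[c] * f[m * CHW + c] for c in range(CHW)) for m in range(M)]
```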
L06-11
Impact of spatial locality
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64 byte blocks => 16 FP32 words/block
Hit rate of long sequential reference
streams due to spatial locality?
15/16 => ~94%
L06-12
FC – Data Reference Pattern
Weight reference stream: F[M0 ------], F[M1 ------], F[M2 ------], F[M3 ------], F[M4 ------], … spanning M*CHW addresses (m = 0, 1, 2, 3, …)
Input activation reference stream: I[C0 H0 W0], I[C0 H0 W1], … spanning CHW addresses
(Not drawn to scale)
Weight locality… Spatial? Yes. Temporal? No. No reuse!
Input activation locality… Spatial? Yes. Temporal? It depends.
L06-13
FC – Data Reference Pattern
L06-14
Amount of temporal locality
• Typical layer size:
– H, W = 256; C = 128
• Typical in-pipeline cache size:
– 64K bytes => 16K FP32 words
– 64 byte blocks => 16 FP32 words/block

Size of input activations? 256 x 256 x 128 x 4 bytes => 32MB
What does this imply for input activation hit rate?
No temporal locality, since 32MB > 64K bytes
L06-15
Computational Intensity – Naïve FC
Computational Intensity = MACs / Data Words

Number of MACs: M*C*H*W
Input activation accesses: M*C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]

Computational Intensity = (M × C × H × W) / (M × C × H × W + M × C × H × W + M)
                        = 1 / (2 + 1/(C × H × W))
                        ≈ 1/2
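The naïve and ideal intensity expressions can be evaluated numerically; a small sketch using the lecture's typical layer size (H = W = 256, C = 128) and an arbitrary illustrative M:

```python
# Evaluate the computational-intensity expressions for a typical layer.
# H, W, C come from the lecture's example; M = 1024 is an arbitrary choice.
H, W, C, M = 256, 256, 128, 1024

macs = M * C * H * W
naive_words = M * C * H * W + M * C * H * W + M  # inputs + weights + outputs
ideal_words = M * C * H * W + C * H * W + M      # each input fetched once

naive_ci = macs / naive_words   # approaches 1/2
ideal_ci = macs / ideal_words   # approaches 1
print(naive_ci, ideal_ci)
```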
L06-16
Computational Intensity – Ideal FC
Computational Intensity = MACs / Data Words

Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]

Computational Intensity = (M × C × H × W) / (M × C × H × W + C × H × W + M)
                        = 1 / (1 + 1/M + 1/(C × H × W))
                        ≈ 1
L06-17
Einsum for strip mined FC
O_m = I_chw × F_m,chw

Split chw into (chw1, chw0), with chw1 = chw / T and chw0 = chw mod T:
I_chw1,chw0 = I_chw
F_m,chw1,chw0 = F_m,chw

O_m = I_chw1,chw0 × F_m,chw1,chw0
L06-18
Fully Connected – Strip Mined

Untiled (inner loop working set = C*H*W):
for m in [0, M):
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHW*m + chw]

Strip mined (inner loop working set = CHW0, where CHW1*CHW0 = C*H*W):
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1 + chw0
            o[m] += i[chw] * f[CHW*m + chw]

Just considering input activations, what value should CHW0 be?
Less than the cache size
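A quick sketch confirming that the strip-mined loop nest produces the same outputs as the untiled loop. Sizes and CHW0 below are arbitrary, with values chosen so the floating-point sums are exact:

```python
# Check that the strip-mined loop nest matches the untiled loop.
# Small arbitrary sizes; CHW0 stands in for a tile that fits in cache.
C, H, W, M = 2, 2, 4, 3
CHW = C * H * W
CHW0 = 4              # tile size (must divide CHW here)
CHW1 = CHW // CHW0

i = [float(x) for x in range(CHW)]           # small exact values
f = [float(x % 7) for x in range(M * CHW)]

# Untiled reference
o_ref = [0.0] * M
for m in range(M):
    for chw in range(CHW):
        o_ref[m] += i[chw] * f[CHW * m + chw]

# Strip mined: Level 1 over tiles, Level 0 within a tile
o = [0.0] * M
for chw1 in range(CHW1):
    for m in range(M):
        for chw0 in range(CHW0):
            chw = CHW0 * chw1 + chw0
            o[m] += i[chw] * f[CHW * m + chw]

assert o == o_ref   # reordering the MAC additions does not change the result
```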
L06-19
FC – Strip Mined Data Reference Pattern
Untiled vs. tiled reference patterns (not drawn to scale):
the weight stream F[M0 ------] … F[M4 ------] spans M*CHW addresses,
while the input stream I[C0 H0 W0], I[C0 H0 W1], … spans CHW addresses.
Cache hits?
L06-20
Matrix-Vector Multiply – Strip Mined
L06-21
Computational Intensity – Strip Mined
Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1 + chw0
            o[m] += i[chw] * f[CHW*m + chw]

Computational Intensity = (M × C × H × W) / (M × C × H × W + C × H × W + M)
                        = 1 / (1 + 1/M + 1/(C × H × W))
                        ≈ 1
L06-22
Associative Cache
Address partitioned into Tag (t bits) | Index (k bits) | Block Offset (b bits).
Each 'way' has its own valid bit (V), tag, and data block; the tags of all ways at the indexed set are compared in parallel (HIT on a match).
Pick the data word or byte from the 'way' that 'hits'.
Allows multiple streams to be resident at the same time.
L06-23
Cache Miss Pipeline Diagram
Time (cycles) →

HIT:
ld r6, w(r5)    IF ID RF EX D$ WB
mul r7,r4,r6    IF ID RF EX D$ WB

MISS:
ld r6, w(r5)    IF ID RF EX D$ MISS - MEM WB
mul r7,r4,r6    IF ID RF stall stall EX D$ WB
L06-24
Avoiding Cache Miss Stalls
• Reorganize code so loads are far ahead of use
– Requires huge amount of unrolling
– Consumes lots of registers
• Add ‘prefetch’ instructions that just load cache
– Consumes instruction issue slots
• Add hardware that automatically loads cache
L06-25
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b
• One Block Lookahead (OBL) scheme
– Initiate prefetch for block b + 1 when block b is accessed
– Can extend to N block lookahead
• Strided prefetch
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor,
prefetching 12 lines ahead of current access
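The strided-prefetch idea can be sketched as a toy detector; this is an illustrative model only, not the Power 5 (or any real machine's) prefetcher:

```python
# Toy strided-prefetch detector: after seeing accesses to blocks
# b, b+N, b+2N with a consistent stride, it predicts b+3N.
class StridedPrefetcher:
    def __init__(self):
        self.last = None
        self.stride = None

    def observe(self, block):
        """Return the block number to prefetch, or None."""
        prefetch = None
        if self.last is not None:
            stride = block - self.last
            if stride != 0 and stride == self.stride:
                prefetch = block + stride   # stride confirmed: run ahead
            self.stride = stride
        self.last = block
        return prefetch

pf = StridedPrefetcher()
prefetches = [pf.observe(b) for b in [10, 12, 14, 16]]
print(prefetches)   # [None, None, 16, 18]
```

A real prefetcher would track multiple streams at once (eight on the Power 5) and run several lines ahead of the demand stream.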
L06-26
Multi-level Caches
• A memory cannot be large and fast
• Add level of cache to reduce miss penalty
– Each level can have longer latency than level above
– So, increase sizes of cache at each level
CPU → L1 → L2 → DRAM

Metrics:
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
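A quick sketch of the three metrics on hypothetical counts (all numbers below are made up for illustration):

```python
# Multi-level cache metrics on hypothetical counts.
cpu_accesses = 1_000_000
l1_misses = 50_000        # L1 sees all CPU memory accesses
l2_accesses = l1_misses   # L2 only sees L1 misses
l2_misses = 10_000
instructions = 2_000_000

l1_local_miss_rate = l1_misses / cpu_accesses    # 0.05
l2_local_miss_rate = l2_misses / l2_accesses     # 0.2: looks bad in isolation
l2_global_miss_rate = l2_misses / cpu_accesses   # 0.01: only 1% reach DRAM
l2_misses_per_instr = l2_misses / instructions   # 0.005
```

Note how the local and global miss rates of the same L2 differ: the L2 only sees references the L1 already filtered.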
L06-27
Contemporary CPU Cache Hierarchy
L06-28
FC Layer – Multichannel
[Figure: N input fmaps, each H × W × C; M filters, each H × W × C; N output fmaps, each 1 × 1 × M]
L06-29
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × 1) = Output fmaps (M × 1)
L06-30
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
L06-31
FC Einsum Notation
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
O_n,m = F_m,chw × I_n,chw
L06-35
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
How much temporal locality for naïve implementation? None
L06-36
Matrix-Matrix Multiply
L06-37
Matrix-Matrix Multiply Tiled
L06-38
Tiled Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)

With each matrix split into 2 × 2 tiles:

| F0,0 F0,1 |   | I0,0 I0,1 |   | F0,0 I0,0 + F0,1 I1,0   F0,0 I0,1 + F0,1 I1,1 |
| F1,0 F1,1 | × | I1,0 I1,1 | = | F1,0 I0,0 + F1,1 I1,0   F1,0 I0,1 + F1,1 I1,1 |

Matrix multiply tiled to fit in cache,
and computation ordered to maximize reuse of data in cache
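The tiling above can be checked with a small sketch: a tiled matrix-matrix multiply compared against a plain triple loop. All sizes and the tile size T below are arbitrary illustrative choices:

```python
# Tiled matrix-matrix multiply corresponding to the 2x2-tile picture above,
# checked against an untiled triple loop. Sizes and tile size are arbitrary.
M, K, N = 8, 8, 8   # K plays the role of CHW
T = 4               # tile size (divides M, K, N here)

F = [[(r * K + c) % 5 for c in range(K)] for r in range(M)]   # filters
I = [[(r * N + c) % 7 for c in range(N)] for r in range(K)]   # input fmaps

# Reference: untiled
O_ref = [[sum(F[m][k] * I[k][n] for k in range(K)) for n in range(N)]
         for m in range(M)]

# Tiled: outer loops walk tile coordinates, inner loops stay within one
# T x T tile, so each tile of F and I is reused while it is hot in cache.
O = [[0] * N for _ in range(M)]
for m1 in range(0, M, T):
    for n1 in range(0, N, T):
        for k1 in range(0, K, T):
            for m in range(m1, m1 + T):
                for n in range(n1, n1 + T):
                    for k in range(k1, k1 + T):
                        O[m][n] += F[m][k] * I[k][n]

assert O == O_ref   # same result, different (cache-friendly) order
```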
L06-42
Einsum for tiled FC
O_n,m = I_n,chw × F_m,chw

I_n,chw → I_n1,chw1,n0,chw0
F_m,chw → F_m1,chw1,m0,chw0

O_n1,m1,n0,m0 = I_n1,chw1,n0,chw0 × F_m1,chw1,m0,chw0
L06-43
Fully-Connected (FC) Layer
• Implementation: Matrix Multiplication (GEMM)
• CPU: OpenBLAS, Intel MKL, etc.
• GPU: cuBLAS, cuDNN, etc.
• The library will note the shape of the matrix multiply and select an implementation optimized for that shape.
• Optimization usually involves proper tiling to the storage hierarchy
L06-44
Tradeoffs in Memories
L06-45
Overview of Memories
Memories consist of arrays of cells that each hold a value.
• Types of Memories/Storage
– Latches/Flip Flops (Registers)
– SRAM (Register File, Caches)
– DRAM (Main Memory)
– Flash (Storage)
L06-46
Elements of Memory Operation
Implementations vary based on:
– How does a memory cell hold a value?
– How is a value obtained from a memory cell?
– How is a value set in a memory cell?
– How is an array constructed out of individual cells?
• Results in tradeoffs between cost, density, speed, energy and
power consumption
L06-47
Latches/Flip Flops
• Fast and low latency
• Located with logic
[Example from CPU pipeline; D-flip flop image source: 6.111]
L06-48
Latches/Flip Flops (< 0.5 kB)
• Fast and low latency
• Located with logic
• Not very dense
– 10+ transistors per bit
– Usually used for arrays smaller than 0.5kB
[Array of flip flops with read address [A2:A0]; D-flip flop image source: 6.111]
L06-49
Latches/Flip Flops (< 0.5 kB)
[Array of flip flops, read address [A2:A0], shown as the PC, IR, GPR, X, and Y registers in the CPU pipeline]
L06-50
SRAM
• Higher density than registers
– Usually, 6 transistors per bit-cell
• Less robust and slower than latches/flip-flops
[Bit cell size 0.75 µm² in 14nm; IC wafer image]
L06-51
SRAM (kB – MB)
[SRAM array, addressed by [Ak:A0]]
L06-52
SRAM
[SRAM shown as the I$ and D$ in the CPU pipeline]
L06-53
SRAM Power Dominated by Bit Line
Measured SRAM power breakdown @ VDD=0.6V:
– Bit-lines (BL): 56%
– Word-line (WL): 6%
– Sensing network: 15%
– Other: 22%
Larger array → longer bit-lines → higher capacitance → higher power
(Image source: Mahmut Sinangil)
L06-54
DRAM (GB)
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
– Usually off-chip (except eDRAM – which is pricey!)
– Off-chip interconnect has much higher capacitance
L06-56
Flash (100GB to TB)
• More dense than DRAM
• Non-volatile
– Needs a high-powered write (changes the VTH of the transistor)
L06-57
Flash Memory
[Flash die photos: single-level cell (SLC), multi-level cell (MLC), and ternary-level cell (TLC); 48-layer TLC, Aug 2015, 256 Gb per die (for SSD)]
L06-58
Memory Tradeoffs
Density: function of circuit type (smaller → denser)
Cost/bit: function of circuit type (smaller → cheaper)
Energy/access/bit: function of total capacity (smaller → less energy) and circuit type (smaller → less energy)
Latency: function of circuit type (smaller → slower) and total capacity (smaller → faster)
Bandwidth: increases with parallelism

Most attributes tend to improve with technology scaling, lower voltage and sometimes smaller capacitors
L06-59
Summary
• Reduce main memory access with caches
– Main memory (i.e., DRAM) is slow and has high energy consumption
– Exploits spatial and temporal locality
• Tiling to reduce cache misses
– Possible since processing order does not affect the result (the additions of the MACs are associative and commutative)
– Add levels to loop nest to improve temporal locality
– Size of tile depends on cache size and cache associativity
• Tradeoffs in storage technology
– Various tradeoffs in cost, speed, energy, capacity…
– Different technologies appropriate at different spots in the design
February 22, 2023
L06-60
Next Lecture: Vectorization
Thank you!
February 22, 2023

More Related Content

Similar to L06.pdf

L04.pdf
L04.pdfL04.pdf
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
eSAT Publishing House
 
Write miss
Write missWrite miss
Write miss
marangburu42
 
L08-handout.pdf
L08-handout.pdfL08-handout.pdf
L08-handout.pdf
TRNHONGLINHBCHCM
 
lec16-memory.ppt
lec16-memory.pptlec16-memory.ppt
lec16-memory.ppt
AshokRachapalli1
 
Computer structurepowerpoint
Computer structurepowerpointComputer structurepowerpoint
Computer structurepowerpoint
hamid ali
 
Cache memory
Cache memoryCache memory
Cache memory
Ansari Maviya
 
L05.pdf
L05.pdfL05.pdf
CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...
CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...
CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...
rameshreddybattini
 
cashe introduction, and heirarchy basics
cashe introduction, and heirarchy basicscashe introduction, and heirarchy basics
cashe introduction, and heirarchy basics
vedangmanuvarmaneo
 
Cache memory
Cache memoryCache memory
Cache memory
Shailesh Tanwar
 
CA UNIT I.pptx
CA UNIT I.pptxCA UNIT I.pptx
CA UNIT I.pptx
ssuser9dbd7e
 
Cache recap
Cache recapCache recap
Cache recap
Fraboni Ec
 
Cache recap
Cache recapCache recap
Cache recap
James Wong
 
Cache recap
Cache recapCache recap
Cache recap
Hoang Nguyen
 

Similar to L06.pdf (20)

L04.pdf
L04.pdfL04.pdf
L04.pdf
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
Write miss
Write missWrite miss
Write miss
 
L08-handout.pdf
L08-handout.pdfL08-handout.pdf
L08-handout.pdf
 
lec16-memory.ppt
lec16-memory.pptlec16-memory.ppt
lec16-memory.ppt
 
Computer structurepowerpoint
Computer structurepowerpointComputer structurepowerpoint
Computer structurepowerpoint
 
Cache memory
Cache memoryCache memory
Cache memory
 
L05.pdf
L05.pdfL05.pdf
L05.pdf
 
CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...
CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...
CMOS VLSI PROJECT || CMOS 3-Bit Binary to Square of the given Input || MULTIP...
 
cashe introduction, and heirarchy basics
cashe introduction, and heirarchy basicscashe introduction, and heirarchy basics
cashe introduction, and heirarchy basics
 
Cache memory
Cache memoryCache memory
Cache memory
 
CA UNIT I.pptx
CA UNIT I.pptxCA UNIT I.pptx
CA UNIT I.pptx
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
pramod
pramodpramod
pramod
 

More from TRNHONGLINHBCHCM

L07.pdf
L07.pdfL07.pdf
L03.pdf
L03.pdfL03.pdf
L11.pdf
L11.pdfL11.pdf
L12.pdf
L12.pdfL12.pdf
L01.pdf
L01.pdfL01.pdf
L02.pdf
L02.pdfL02.pdf
L14.pdf
L14.pdfL14.pdf
Projects.pdf
Projects.pdfProjects.pdf
Projects.pdf
TRNHONGLINHBCHCM
 
L08.pdf
L08.pdfL08.pdf
PaperReview.pdf
PaperReview.pdfPaperReview.pdf
PaperReview.pdf
TRNHONGLINHBCHCM
 
L13.pdf
L13.pdfL13.pdf
L07-handout.pdf
L07-handout.pdfL07-handout.pdf
L07-handout.pdf
TRNHONGLINHBCHCM
 

More from TRNHONGLINHBCHCM (12)

L07.pdf
L07.pdfL07.pdf
L07.pdf
 
L03.pdf
L03.pdfL03.pdf
L03.pdf
 
L11.pdf
L11.pdfL11.pdf
L11.pdf
 
L12.pdf
L12.pdfL12.pdf
L12.pdf
 
L01.pdf
L01.pdfL01.pdf
L01.pdf
 
L02.pdf
L02.pdfL02.pdf
L02.pdf
 
L14.pdf
L14.pdfL14.pdf
L14.pdf
 
Projects.pdf
Projects.pdfProjects.pdf
Projects.pdf
 
L08.pdf
L08.pdfL08.pdf
L08.pdf
 
PaperReview.pdf
PaperReview.pdfPaperReview.pdf
PaperReview.pdf
 
L13.pdf
L13.pdfL13.pdf
L13.pdf
 
L07-handout.pdf
L07-handout.pdfL07-handout.pdf
L07-handout.pdf
 

Recently uploaded

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 

Recently uploaded (20)

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 

L06.pdf

  • 1. L06-1 Sze and Emer 6.5930/1 Hardware Architectures for Deep Learning Kernel Computation - Impact of Memory Hierarchy Joel Emer and Vivienne Sze Massachusetts Institute of Technology Electrical Engineering & Computer Science February 22, 2023
  • 2. L06-2 Sze and Emer Sze and Emer Goals of Today’s Lecture • Understand impact of memory hierarchy – Overview of caches – Structuring algorithms to work well in caches using tiling – Storage technologies February 22, 2023
  • 3. L06-3 Sze and Emer Sze and Emer Readings for this Week • Efficient Processing of Deep Neural Networks – Chapter 4 of https://doi.org/10.1007/978-3-031-01766-7 . February 22, 2023
  • 4. L06-4 Sze and Emer Sze and Emer Simple Pipelined µArchitecture February 22, 2023 PC I M E M IR GPR X Y + * D M E M Warning: Objects in PowerPoint may be larger than they appear What are consequences of putting large memory (e.g., megabytes) directly in pipeline? Long latency => dependency stalls Large energy consumption
  • 5. L06-5 Sze and Emer Sze and Emer Pipelined µArchitecture with Caches February 22, 2023 PC I $ IR GPR X Y + * D $ Memory Instruction cache (I$) and data cache (D$) hold memory data for reuse in small energy efficient buffer
  • 6. L06-6 Sze and Emer Direct Mapped Cache February 22, 2023
    [Figure: direct-mapped cache with 2^k lines; each line holds a valid bit (V), a tag, and a data block]
    The address is partitioned into multiple fields: tag (t bits), index (k bits), and block offset (b bits).
    The index picks the row. The valid bit indicates the data block is valid; valid plus a tag match (HIT) means the data is in the cache.
    The data block consists of multiple words; the offset selects the desired word or byte.
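The field partitioning can be made concrete with a short sketch. The parameters here (64-byte blocks and 1024 lines, so b = 6 and k = 10) are illustrative assumptions, not values from the slide:

```python
# Sketch: splitting a byte address into tag/index/offset fields for a
# direct-mapped cache. Assumed geometry: 64-byte blocks (b = 6 offset bits),
# 1024 lines (k = 10 index bits); the tag is whatever bits remain.
BLOCK_BYTES = 64
NUM_LINES = 1024

def split_address(addr):
    b = BLOCK_BYTES.bit_length() - 1        # offset bits
    k = NUM_LINES.bit_length() - 1          # index bits
    offset = addr & (BLOCK_BYTES - 1)       # byte within the block
    index = (addr >> b) & (NUM_LINES - 1)   # picks the cache row
    tag = addr >> (b + k)                   # compared against stored tag
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

Addresses that differ only in the tag field map to the same row, which is why a direct-mapped cache can suffer conflict misses between such addresses.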
  • 7. L06-7 Sze and Emer Sze and Emer Cache Operation February 22, 2023 Look at data address, search cache tags to find match. Then if… Found in cache a.k.a. HIT Return copy of data from cache Not in cache a.k.a. MISS Read block of data from Main Memory Wait … Return data to processor and update cache Metric: Hit Rate = #Hits / (#Hits + #Misses)
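The hit-rate metric can be exercised with a minimal tag-only model of a direct-mapped cache. The geometry (1024 lines of 64 bytes) is an assumption; a long sequential stream of 4-byte word reads hits 15 of every 16 accesses, matching the spatial-locality figure quoted a few slides later:

```python
# Minimal direct-mapped cache model that tracks only tags, to illustrate
# Hit Rate = #Hits / (#Hits + #Misses). Geometry is an assumed example.
class DirectMappedCache:
    def __init__(self, num_lines=1024, block_bytes=64):
        self.num_lines = num_lines
        self.block_bytes = block_bytes
        self.tags = [None] * num_lines      # None == invalid line
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.block_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1                  # found in cache: HIT
        else:
            self.misses += 1                # MISS: fetch and update cache
            self.tags[index] = tag

    def hit_rate(self):
        return self.hits / (self.hits + self.misses)

cache = DirectMappedCache()
for a in range(0, 4096, 4):                 # sequential 4-byte word reads
    cache.access(a)
```

Each 64-byte block misses once on its first word and then hits on the remaining 15 words, so the sequential stream lands at a 15/16 ≈ 94% hit rate.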
  • 8. L06-8 Sze and Emer Sze and Emer Treatment of Writes • Cache hit: – write through: write both cache & memory • generally higher traffic but simplifies cache in processor pipeline – write back: write cache only (memory is written only when the entry is evicted) • a dirty bit per block can further reduce the traffic • Cache miss: – no write allocate: only write to main memory – write allocate (aka fetch on write): fetch into cache • Common combinations: – write through and no write allocate – write back with write allocate February 22, 2023
  • 9. L06-9 Sze and Emer Sze and Emer Cache Locality Caches implicitly try to optimize data movement by trying to exploit two common properties of memory references: – Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future. • Exploited by having block size larger than a word, which also amortizes fill overheads by getting more bytes with one access – Temporal Locality: If a location is referenced it is likely to be referenced again in the near future. • Exploited by holding blocks for future access February 22, 2023
  • 10. L06-10 Sze and Emer Fully Connected (FC) Computation February 22, 2023
    int i[C*H*W];   # Input activations
    int f[M*C*H*W]; # Filter Weights
    int o[M];       # Output activations
    CHWm = -C*H*W
    for m in [0, M):            # M iterations
      o[m] = 0
      CHWm += CHW
      for chw in [0, CHW):      # C*H*W iterations
        o[m] += i[chw] * f[CHWm + chw]
    M*C*H*W loads of each weight and input activation
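The loop nest on the slide runs as-is once given concrete sizes; the tiny dimensions below (C = H = W = 2, M = 3) are assumptions for illustration only:

```python
# Runnable version of the slide's naïve fully-connected loop, using plain
# Python lists. Sizes are small illustrative assumptions.
C, H, W, M = 2, 2, 2, 3
CHW = C * H * W
i = list(range(CHW))            # input activations, flattened
f = [1] * (M * CHW)             # filter weights, flattened (all ones here)
o = [0] * M                     # output activations

CHWm = -CHW
for m in range(M):
    o[m] = 0
    CHWm += CHW                 # CHWm == m * CHW: base of filter row m
    for chw in range(CHW):
        o[m] += i[chw] * f[CHWm + chw]
```

With all-ones weights each output is just the sum of the inputs, which makes the result easy to check by hand.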
  • 11. L06-11 Sze and Emer Sze and Emer Impact of spatial locality February 22, 2023 • Typical in-pipeline cache size – 64K bytes => 16K FP32 words – 64 byte blocks => 16 FP32 words/block Hit rate of long sequential reference streams due to spatial locality? 15/16 => ~94%
  • 12. L06-12 Sze and Emer FC – Data Reference Pattern February 22, 2023
    [Figure, not drawn to scale: address vs. time; weight references sweep F[M0 ------], F[M1 ------], ..., F[M4 ------] across M*CHW addresses for m = 0, 1, 2, 3, while input activation references repeat I[C0 H0 W0], I[C0 H0 W1], ... over CHW addresses]
    Weight locality: Spatial? Yes. Temporal? No. No reuse!
    Input activation locality: Spatial? Yes. Temporal? It depends.
  • 13. L06-13 Sze and Emer Sze and Emer FC – Data Reference Pattern February 22, 2023
  • 14. L06-14 Sze and Emer Sze and Emer Amount of temporal locality • Typical layer size: – H, W = 256 C = 128 February 22, 2023 Size of input activations? 256x256x128x4 => 32MB What does this imply for input activation hit rate? No temporal locality since 32MB > 64K bytes • Typical in-pipeline cache size – 64K bytes => 16K FP32 words – 64 byte blocks => 16 FP32 words/block
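The working-set arithmetic on this slide can be checked directly; the 64 KiB cache size is the in-pipeline cache assumed throughout:

```python
# Checking the slide's arithmetic: H = W = 256, C = 128 FP32 input
# activations versus an assumed 64 KiB in-pipeline cache.
H, W, C = 256, 256, 128
bytes_per_fp32 = 4
input_bytes = H * W * C * bytes_per_fp32    # full input activation footprint
cache_bytes = 64 * 1024
```

The input activations are 512x larger than the cache, so between two uses of the same activation the cache has been completely overwritten: no temporal locality survives.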
  • 15. L06-15 Sze and Emer Computational Intensity – Naïve FC February 22, 2023
    Computational Intensity = MACs / Data Words
    Number of MACs: M*C*H*W
    Input activation accesses: M*C*H*W
    Filter weight accesses: M*C*H*W
    Output activation accesses: M
    Computational Intensity = M*C*H*W / (M*C*H*W + M*C*H*W + M) = 1 / (2 + 1/(C*H*W)) ~ 1/2
    CHWm = -C*H*W
    for m in [0, M):
      o[m] = 0
      CHWm += C*H*W
      for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
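A quick numeric check of the intensity formula, using exact rational arithmetic; the layer sizes are assumptions chosen to match the earlier example:

```python
# Numeric check of the naïve-FC computational intensity:
# MACs / data words moved = 1 / (2 + 1/(C*H*W)) ~ 1/2.
from fractions import Fraction

M, C, H, W = 512, 128, 256, 256
macs = M * C * H * W
accesses = macs + macs + M          # inputs + weights + outputs
ci = Fraction(macs, accesses)       # exact computational intensity
```

Less than one MAC per word moved means the naïve loop is memory-bound on any machine whose compute-to-bandwidth ratio exceeds 1/2.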
  • 16. L06-16 Sze and Emer Computational Intensity – Ideal FC February 22, 2023
    Computational Intensity = MACs / Data Words
    Number of MACs: M*C*H*W
    Input activation accesses: C*H*W
    Filter weight accesses: M*C*H*W
    Output activation accesses: M
    Computational Intensity = M*C*H*W / (M*C*H*W + C*H*W + M) = 1 / (1 + 1/M + 1/(C*H*W)) ~ 1
    CHWm = -C*H*W
    for m in [0, M):
      o[m] = 0
      CHWm += C*H*W
      for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
  • 17. L06-17 Sze and Emer Einsum for strip mined FC February 22, 2023
    O_m = I_chw × F_m,chw
    Split the chw rank by tile size T:
    I_chw → I_chw1,chw0   with chw1 = chw / T, chw0 = chw mod T
    F_m,chw → F_m,chw1,chw0
    O_m = I_chw1,chw0 × F_m,chw1,chw0
  • 18. L06-18 Sze and Emer Fully Connected – Strip Mined February 22, 2023
    Untiled:
    for m in [0, M):
      for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHW*m + chw]
    Inner loop working set = C*H*W
    Strip mined (CHW1*CHW0 = C*H*W):
    // Level 1
    for chw1 in [0, CHW1):
      for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
          chw = CHW0*chw1 + chw0
          o[m] += i[chw] * f[CHW*m + chw]
    Inner loop working set = CHW0
    Just considering input activations, what value should CHW0 be? Less than the cache size.
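The strip-mined nest computes exactly the same outputs as the untiled loop, since the MAC additions are merely reordered. A small sketch with assumed sizes (CHW0 chosen to divide CHW evenly) verifies this:

```python
# Strip-mined FC loop nest checked against the untiled loop.
# Sizes are assumptions; CHW0 plays the role of the tile that fits in cache.
M, CHW, CHW0 = 4, 12, 3
CHW1 = CHW // CHW0                  # assumes CHW0 divides CHW evenly

i = [v % 7 for v in range(CHW)]                 # arbitrary test inputs
f = [(v % 5) - 2 for v in range(M * CHW)]       # arbitrary test weights

# Untiled reference
ref = [sum(i[chw] * f[CHW * m + chw] for chw in range(CHW)) for m in range(M)]

# Strip mined: Level-1 loop over tiles, Level-0 loop within a tile
o = [0] * M
for chw1 in range(CHW1):
    for m in range(M):
        for chw0 in range(CHW0):
            chw = CHW0 * chw1 + chw0
            o[m] += i[chw] * f[CHW * m + chw]
```

Only the visit order changes: each input tile of CHW0 activations is now reused across all M outputs before the next tile is touched.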
  • 19. L06-19 Sze and Emer FC - Strip Mined Data Reference Pattern February 22, 2023
    [Figure, not drawn to scale: untiled vs. tiled reference patterns; weight references F[M0 ------] ... F[M4 ------] span M*CHW addresses, input activation references I[C0 H0 W0], I[C0 H0 W1], ... span CHW addresses]
    Cache Hits?
  • 20. L06-20 Sze and Emer Sze and Emer Matrix-Vector Multiply – Strip Mined February 22, 2023
  • 21. L06-21 Sze and Emer Computational Intensity – Strip Mined February 22, 2023
    Number of MACs: M*C*H*W
    Input activation accesses: C*H*W
    Filter weight accesses: M*C*H*W
    Output activation accesses: M
    Computational Intensity = M*C*H*W / (M*C*H*W + C*H*W + M) = 1 / (1 + 1/M + 1/(C*H*W)) ~ 1
    // Level 1
    for chw1 in [0, CHW1):
      for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
          chw = CHW0*chw1 + chw0
          o[m] += i[chw] * f[CHW*m + chw]
  • 22. L06-22 Sze and Emer Associative Cache February 22, 2023
    [Figure: set-associative cache; each way holds a valid bit (V), tag, and data block; the address splits into tag (t), index (k), and block offset (b); the tags of all ways in the indexed set are compared in parallel (HIT), and the data word or byte is picked from the 'way' that 'hits']
    Allows multiple streams to be resident at the same time.
  • 23. L06-23 Sze and Emer Cache Miss Pipeline Diagram February 22, 2023
    [Pipeline diagram, time in cycles, for ld r6, w(r5) followed by mul r7,r4,r6]
    HIT: ld flows IF ID RF EX D$ WB, and the dependent mul follows without stalling.
    MISS: ld waits in D$ for memory (MISS - MEM) before WB, and the dependent mul stalls behind it.
  • 24. L06-24 Sze and Emer Sze and Emer Avoiding Cache Miss Stalls • Reorganize code so loads are far ahead of use – Requires huge amount of unrolling – Consumes lots of registers • Add ‘prefetch’ instructions that just load cache – Consumes instruction issue slots • Add hardware that automatically loads cache February 22, 2023
  • 25. L06-25 Sze and Emer Sze and Emer Hardware Data Prefetching February 22, 2023 • Prefetch-on-miss: – Prefetch b + 1 upon miss on b • One Block Lookahead (OBL) scheme – Initiate prefetch for block b + 1 when block b is accessed – Can extend to N block lookahead • Strided prefetch – If observe sequence of accesses to block b, b+N, b+2N, then prefetch b+3N etc. Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access
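The strided scheme can be sketched as a toy single-stream predictor: after seeing blocks b, b+N, b+2N it predicts b+3N. This is only a sketch; a real prefetcher (like the Power 5's) tracks many independent streams with confidence counters, which this omits:

```python
# Toy strided-prefetch predictor: detects a stable stride in the last three
# block addresses and predicts the next block to prefetch.
def predict_next(blocks):
    """Return the block to prefetch, or None if no stable stride is seen."""
    if len(blocks) >= 3:
        s1 = blocks[-1] - blocks[-2]
        s2 = blocks[-2] - blocks[-3]
        if s1 == s2 and s1 != 0:        # two equal, nonzero strides in a row
            return blocks[-1] + s1      # e.g. b, b+N, b+2N -> prefetch b+3N
    return None
```

A unit stride (s1 == 1) reduces to the One Block Lookahead scheme; requiring two matching strides keeps the predictor from firing on random access patterns.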
  • 26. L06-26 Sze and Emer Sze and Emer Multi-level Caches February 22, 2023 • A memory cannot be large and fast • Add level of cache to reduce miss penalty – Each level can have longer latency than level above – So, increase sizes of cache at each level CPU L1 L2 DRAM Metrics: Local miss rate = misses in cache/ accesses to cache Global miss rate = misses in cache / CPU memory accesses Misses per instruction = misses in cache / number of instructions
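The three metric definitions can be illustrated with assumed counts for a two-level hierarchy (1000 CPU accesses, 100 L1 misses, 20 L2 misses):

```python
# Local vs. global miss-rate arithmetic for a two-level cache hierarchy.
# All counts below are assumed example numbers.
cpu_accesses = 1000
l1_misses = 100
l2_misses = 20

l1_local = l1_misses / cpu_accesses     # L1 sees every CPU memory access
l2_local = l2_misses / l1_misses        # L2 sees only the L1 misses
l2_global = l2_misses / cpu_accesses    # misses in L2 per CPU access
```

Note that the global miss rate of a level is the product of the local miss rates above it, which is why local miss rates alone can look deceptively high for lower levels.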
  • 27. L06-27 Sze and Emer Sze and Emer Contemporary CPU Cache Hierarchy February 22, 2023
  • 28. L06-28 Sze and Emer FC Layer – Multichannel February 22, 2023
    [Figure: N input fmaps, each H × W × C; M filters, each H × W × C; N output fmaps, each 1 × 1 × M. Each filter spans the full input fmap, so each filter produces a single 1 × 1 output value per input fmap.]
  • 29. L06-29 Sze and Emer Fully-Connected (FC) Layer February 22, 2023
    Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)
  • 30. L06-30 Sze and Emer Fully-Connected (FC) Layer February 22, 2023
    Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
    After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply.
  • 31. L06-31 Sze and Emer FC Einsum Notation February 22, 2023
    Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
    O_n,m = F_m,chw × I_n,chw
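The Einsum O_n,m = F_m,chw × I_n,chw written out as plain loops (equivalently, O = I · F^T on the flattened operands); the small sizes are assumptions for illustration:

```python
# Batched FC as an einsum over the shared chw rank, in plain Python.
# Sizes are small assumed examples; values chosen so results check by hand.
N, M, CHW = 2, 3, 4
F = [[m * CHW + k for k in range(CHW)] for m in range(M)]      # M x CHW
I = [[(n + 1) * (k + 1) for k in range(CHW)] for n in range(N)]  # N x CHW

# O_n,m = sum over chw of F_m,chw * I_n,chw
O = [[sum(F[m][k] * I[n][k] for k in range(CHW)) for m in range(M)]
     for n in range(N)]
```

Each output element reduces over the shared chw index, exactly the contraction the einsum notation expresses implicitly.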
  • 35. L06-35 Sze and Emer Fully-Connected (FC) Layer February 22, 2023
    Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
    How much temporal locality for the naïve implementation? None.
  • 36. L06-36 Sze and Emer Sze and Emer Matrix-Matrix Multiply February 22, 2023
  • 37. L06-37 Sze and Emer Sze and Emer Matrix-Matrix Multiply Tiled February 22, 2023
  • 38. L06-38 Sze and Emer Tiled Fully-Connected (FC) Layer February 22, 2023
    Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), in 2 × 2 tiles:
    | F0,0 F0,1 |   | I0,0 I0,1 |   | F0,0·I0,0 + F0,1·I1,0   F0,0·I0,1 + F0,1·I1,1 |
    | F1,0 F1,1 | × | I1,0 I1,1 | = | F1,0·I0,0 + F1,1·I1,0   F1,0·I0,1 + F1,1·I1,1 |
    Matrix multiply tiled to fit in cache and computation ordered to maximize reuse of data in cache.
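The tiled multiply can be sketched as six nested loops: three tile loops ordered so each pair of tiles is reused from cache, and three intra-tile loops. The sizes and tile edge T below are assumptions; the result matches an untiled reference:

```python
# Sketch of a tiled matrix multiply O = F · I with 2x2 tiles, checked
# against the untiled computation. Sizes and tile edge T are assumptions.
M, K, N, T = 4, 4, 4, 2            # O is M x N, contraction over K
F = [[m * K + k for k in range(K)] for m in range(M)]   # M x K
I = [[k * N + n for n in range(N)] for k in range(K)]   # K x N

O = [[0] * N for _ in range(M)]
for m1 in range(0, M, T):          # tile loops: pick one tile of each operand
    for n1 in range(0, N, T):
        for k1 in range(0, K, T):
            for m in range(m1, m1 + T):      # intra-tile loops: the tiles
                for n in range(n1, n1 + T):  # stay resident in cache here
                    for k in range(k1, k1 + T):
                        O[m][n] += F[m][k] * I[k][n]

# Untiled reference for comparison
ref = [[sum(F[m][k] * I[k][n] for k in range(K)) for n in range(N)]
       for m in range(M)]
```

Each output tile accumulates partial results across the k1 tile loop, which is what the dotted partial products in the slide's animation depict.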
  • 42. L06-42 Sze and Emer Einsum for tiled FC February 22, 2023
    O_n,m = I_n,chw × F_m,chw
    Split the n, m, and chw ranks into tile coordinates:
    I_n,chw → I_n1,chw1,n0,chw0
    F_m,chw → F_m1,chw1,m0,chw0
    O_n1,m1,n0,m0 = I_n1,chw1,n0,chw0 × F_m1,chw1,m0,chw0
  • 43. L06-43 Sze and Emer Sze and Emer Fully-Connected (FC) Layer February 22, 2023 • Implementation: Matrix Multiplication (GEMM) • CPU: OpenBLAS, Intel MKL, etc • GPU: cuBLAS, cuDNN, etc • Library will note shape of the matrix multiply and select implementation optimized for that shape. • Optimization usually involves proper tiling to storage hierarchy
  • 44. L06-44 Sze and Emer Sze and Emer Tradeoffs in Memories February 22, 2023
  • 45. L06-45 Sze and Emer Sze and Emer Overview of Memories Memory consist of arrays of cells that hold a value. • Types of Memories/Storage – Latches/Flip Flops (Registers) – SRAM (Register File, Caches) – DRAM (Main Memory) – Flash (Storage) February 22, 2023
  • 46. L06-46 Sze and Emer Elements of Memory Operation Implementations vary based on: – How does a memory cell hold a value? – How is a value obtained from a memory cell? – How is a value set in a memory cell? – How is the array constructed out of individual cells? • Results in tradeoffs between cost, density, speed, energy and power consumption February 22, 2023
  • 47. L06-47 Sze and Emer Sze and Emer Latches/Flip Flops • Fast and low latency • Located with logic February 22, 2023 D$ PC I$ IR GPR X Y + * Example from CPU pipeline D-flip flop Image source: 6.111
  • 48. L06-48 Sze and Emer Latches/Flip Flops (< 0.5 kB) • Fast and low latency • Located with logic • Not very dense – 10+ transistors per bit – Usually used for arrays smaller than 0.5 kB February 22, 2023 [Figure: array of flip flops addressed by Read Address [A2:A0]; D-flip flop image source: 6.111]
  • 49. L06-49 Sze and Emer Sze and Emer Latches/Flip Flops (< 0.5 kB) February 22, 2023 Array of Flip flops Read Address [A2:A0] PC I$ IR GPR X Y + * D$
  • 50. L06-50 Sze and Emer Sze and Emer SRAM • Higher density than register – Usually, 6 transistors per bit-cell • Less robust and slower than latches/flip-flop February 22, 2023 Bit cell size 0.75um2 in 14nm IC wafer
  • 51. L06-51 Sze and Emer Sze and Emer SRAM (kB – MB) February 22, 2023 Address [Ak:A0]
  • 52. L06-52 Sze and Emer Sze and Emer SRAM February 22, 2023 PC I$ IR GPR X Y + * D$
  • 53. L06-53 Sze and Emer SRAM Power Dominated by Bit Line February 22, 2023 Measured SRAM power breakdown @ VDD=0.6V: bit-lines (BL) 56%, word-line (WL) 6%, sensing network 15%, other 22%. Larger array → longer bit-lines → higher capacitance → higher power. Image source: Mahmut Sinangil
  • 54. L06-54 Sze and Emer Sze and Emer DRAM • Higher density than SRAM – 1 transistor per bit-cell – Needs periodic refresh • Special device process February 22, 2023
  • 55. L06-55 Sze and Emer DRAM (GB) • Higher density than SRAM – 1 transistor per bit-cell – Needs periodic refresh • Special device process – Usually off-chip (except eDRAM – which is pricey!) – Off-chip interconnect has much higher capacitance February 22, 2023 [Figure: access energy scale from pJ (on-chip) to nJ (off-chip)]
  • 56. L06-56 Sze and Emer Sze and Emer Flash (100GB to TB) • More dense than DRAM • Non-volatile – Needs high powered write (change VTH of transistor) February 22, 2023
  • 57. L06-57 Sze and Emer Flash Memory February 22, 2023 [Figure: flash cell evolution from Single-Level Cell (SLC) to Multi-Level Cell (MLC) to Triple-Level Cell (TLC); example: 48-layer TLC, Aug 2015, 256 Gb per die (for SSD)]
  • 58. L06-58 Sze and Emer Memory Tradeoffs February 22, 2023
    Density: function of circuit type (smaller → denser)
    Cost/bit: function of circuit type (smaller → cheaper)
    Energy/access/bit: function of total capacity (smaller → less energy) and circuit type (smaller → less energy)
    Latency: function of circuit type (smaller → slower) and total capacity (smaller → faster)
    Bandwidth: increases with parallelism
    Most attributes tend to improve with technology scaling, lower voltage, and sometimes smaller capacitors.
  • 59. L06-59 Sze and Emer Sze and Emer Summary • Reduce main memory access with caches – Main memory (i.e., DRAM) is slow and has high energy consumption – Exploits spatial and temporal locality • Tiling to reduce cache misses – Possible since processing order does not affect result (MACs are commutative) – Add levels to loop nest to improve temporal locality – Size of tile depends on cache size and cache associativity • Tradeoffs in storage technology – Various tradeoffs in cost, speed, energy, capacity… – Different technologies appropriate at different spots in the design February 22, 2023
  • 60. L06-60 Sze and Emer Sze and Emer Next Lecture: Vectorization Thank you! February 22, 2023