6.5930/1 Hardware Architectures for Deep Learning
Kernel Computation - Impact of Memory Hierarchy
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 22, 2023
L06-2
Goals of Today’s Lecture
• Understand impact of memory hierarchy
– Overview of caches
– Structuring algorithms to work well in caches using tiling
– Storage technologies
L06-3
Readings for this Week
• Efficient Processing of Deep Neural Networks
– Chapter 4 of https://doi.org/10.1007/978-3-031-01766-7
L06-4
Simple Pipelined µArchitecture
[Pipeline diagram: PC → IMEM → IR → GPR read into X and Y → + / * → DMEM]
Warning: Objects in PowerPoint may be larger than they appear.
What are the consequences of putting a large memory (e.g., megabytes) directly in the pipeline?
L06-5
Pipelined µArchitecture with Caches
[Pipeline diagram with caches: PC → I$ → IR → GPR read into X and Y → + / * → D$ → Memory]
Instruction cache (I$) and data cache (D$) hold memory data for reuse in a small, energy-efficient buffer.
L06-6
Direct Mapped Cache
[Direct-mapped cache diagram] The address is partitioned into multiple fields: a tag (t bits), an index (k bits), and a block offset (b bits). The index picks one of the 2^k rows; each row holds a valid bit, a tag, and a data block consisting of multiple words. The valid bit indicates the data block is valid; a valid bit plus a tag match means the data is in the cache (HIT), and the offset selects the desired data word or byte.
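As a concrete illustration (a minimal sketch, not from the slides), the Python below splits a byte address into tag, index, and offset for a direct-mapped cache using the 64 KB capacity and 64-byte blocks quoted later in this lecture; the constants and example address are assumptions.

import numpy as np  # not needed here; shown for consistency with later sketches

CACHE_BYTES = 64 * 1024
BLOCK_BYTES = 64
NUM_LINES = CACHE_BYTES // BLOCK_BYTES     # 2^k = 1024 rows

def split_address(addr):
    """Return (tag, index, offset) fields of a byte address."""
    offset = addr % BLOCK_BYTES                   # b bits: selects the word/byte within the block
    index  = (addr // BLOCK_BYTES) % NUM_LINES    # k bits: picks the row
    tag    = addr // (BLOCK_BYTES * NUM_LINES)    # t bits: compared against the stored tag
    return tag, index, offset

print(split_address(0x12345678))   # (4660, 345, 56) == (0x1234, 0x159, 0x38)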
L06-7
Cache Operation
Look at the data address and search the cache tags for a match. Then if…
– Found in cache (a.k.a. HIT): return a copy of the data from the cache.
– Not in cache (a.k.a. MISS): read the block of data from main memory, wait…, then return the data to the processor and update the cache.
Metric: Hit Rate = #Hits / (#Hits + #Misses)
L06-8
Treatment of Writes
• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache in processor pipeline
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
L06-9
Cache Locality
Caches implicitly try to optimize data movement by trying to exploit
two common properties of memory references:
– Spatial Locality: If a location is referenced it is likely that locations near it
will be referenced in the near future.
• Exploited by having block size larger than a word, which also amortizes fill
overheads by getting more bytes with one access
– Temporal Locality: If a location is referenced it is likely to be referenced
again in the near future.
• Exploited by holding blocks for future access
L06-10
Impact of spatial locality
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64 byte blocks => 16 FP32 words/block
Hit rate of long sequential reference
streams due to spatial locality?
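One way to reason about this (an added note, not the slide's answer): with 64-byte blocks holding 16 FP32 words, a long sequential stream misses once per block and then hits on the remaining 15 words, so spatial locality alone gives a hit rate of about 15/16 ≈ 94% (ignoring any prefetching).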
L06-11
Fully Connected (FC) Computation
int i[C*H*W];    # Input activations
int f[M*C*H*W];  # Filter Weights
int o[M];        # Output activations

for m in [0, M):                      # M iterations
  o[m] = 0
  CHWm = CHW*m
  for chw in [0, CHW):                # C*H*W iterations
    o[m] += i[chw] * f[CHWm + chw]    # M*C*H*W loads of each weight and input activation
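Below is a minimal runnable version of the loop nest above (a sketch; the toy sizes and NumPy usage are assumptions, but the loop structure follows the slide's pseudocode).

import numpy as np

M, C, H, W = 4, 3, 2, 2
CHW = C * H * W

i = np.random.rand(CHW).astype(np.float32)       # input activations
f = np.random.rand(M * CHW).astype(np.float32)   # filter weights
o = np.zeros(M, dtype=np.float32)                # output activations

for m in range(M):                     # M iterations
    CHWm = CHW * m
    for chw in range(CHW):             # C*H*W iterations per output
        o[m] += i[chw] * f[CHWm + chw]

# cross-check against a library matrix-vector multiply
assert np.allclose(o, f.reshape(M, CHW) @ i, rtol=1e-5)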
L06-12
FC - Computational Intensity
Number of MACs:
Amount of input activation data:
Amount of filter weight data:
Amount of output activations:

CHWm = -C*H*W;
for m in [0, M):
  o[m] = 0;
  CHWm += C*H*W
  for chw in [0, C*H*W):
    o[m] += i[chw] * f[CHWm + chw]
Computational Intensity = MACs / Data Words

Computational Intensity =
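One way to fill in the blanks above (added working, not the official answer): the loop nest performs M·C·H·W MACs and, per the previous slide, issues M·C·H·W loads of weights and M·C·H·W loads of input activations (plus M output writes). If every reference went to memory, the computational intensity would be roughly M·C·H·W / (2·M·C·H·W + M) ≈ 0.5 MACs per data word.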
L06-13
FC – Data Reference Pattern
[Plot of data address vs. time, not drawn to scale: for each m = 0, 1, 2, 3, … the references step through the input activations I[C0 H0 W0], I[C0 H0 W1], … (an address range of CHW) and through the filter weights F[M0 ------], F[M1 ------], F[M2 ------], … (an address range of M*CHW).]
Weight locality… Spatial? What about temporal locality?
Input activation locality… Spatial?
L06-14
FC – Data Reference Pattern
L06-15
Amount of temporal locality
• Typical layer size:
– H, W = 256; C = 128
Size of input activations?
What does this imply for hit rate?
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64 byte blocks => 16 FP32 words/block
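Worked numbers (added note): with H = W = 256 and C = 128, the input activations alone are 256 × 256 × 128 ≈ 8.4 M FP32 words ≈ 33.5 MB, roughly 500× the 64 KB cache. By the time the next filter m+1 re-reads i[0], it has long been evicted, so the temporal reuse of input activations across filters is not captured and the hit rate is limited to what spatial locality provides.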
L06-16
FC - Computational Intensity - Ideal
Number of MACs:
Amount of input activation data:
Amount of filter weight data:
Amount of output activations:

CHWm = -C*H*W;
for m in [0, M):
  o[m] = 0;
  CHWm += C*H*W
  for chw in [0, C*H*W):
    o[m] += i[chw] * f[CHWm + chw]
Computational Intensity = MACs / Data Words

Computational Intensity =
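In the ideal case (added working, not the official answer), each datum is moved from memory only once: M·C·H·W MACs over C·H·W input activation words + M·C·H·W weight words + M output words, giving a computational intensity of M·C·H·W / (C·H·W + M·C·H·W + M) ≈ 1 MAC per data word. Even with perfect reuse, a batch-1 FC layer is limited to about one MAC per word because every weight is used exactly once.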
L06-17
Einsum for strip mined FC
O_m = I_chw × F_m,chw

Splitting the chw rank with tile size T (chw1 = chw / T, chw0 = chw mod T):
I_chw1,chw0 = I_chw
F_m,chw1,chw0 = F_m,chw

O_m = I_chw1,chw0 × F_m,chw1,chw0
L06-18
Fully Connected – Strip Mined

Original (inner loop working set = X):
for m in [0, M):
  for chw in [0, C*H*W):
    o[m] += i[chw] * f[CHW*m + chw]

Strip mined, with CHW1*CHW0 = C*H*W (inner loop working set = CHW0):
// Level 1
for chw1 in [0, CHW1):
  for m in [0, M):
    // Level 0
    for chw0 in [0, CHW0):
      chw = CHW0*chw1+chw0
      o[m] += i[chw] * f[CHW*m + chw]

Just considering input activations, what value should CHW0 be?
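A runnable version of the strip-mined loop nest (a sketch with toy sizes chosen so CHW0 divides C*H*W; in practice CHW0 would be picked so the inner working set fits in the cache):

import numpy as np

M, CHW = 4, 16
CHW0 = 4                # tile size (Level 0 working set)
CHW1 = CHW // CHW0      # number of tiles (Level 1)

i = np.random.rand(CHW).astype(np.float32)
f = np.random.rand(M * CHW).astype(np.float32)
o = np.zeros(M, dtype=np.float32)

for chw1 in range(CHW1):          # Level 1: step through input tiles
    for m in range(M):            # every output reuses the same CHW0-word tile of i[]
        for chw0 in range(CHW0):  # Level 0
            chw = CHW0 * chw1 + chw0
            o[m] += i[chw] * f[CHW * m + chw]

assert np.allclose(o, f.reshape(M, CHW) @ i, rtol=1e-5)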
L06-19
FC - Strip Mined Data Reference Pattern
[Plot comparing the untiled and tiled reference patterns (addresses of F[M0 ------], F[M1 ------], … over a range of M*CHW and of I[C0 H0 W0], I[C0 H0 W1], … over a range of CHW, vs. time; not drawn to scale). Cache hits?]
L06-20
Matrix-Vector Multiply – Strip Mined
L06-21
Computational Intensity – Strip Mined
Number of MACs:
Amount of input activation data:
Amount of filter weight data:
Amount of output activations:

// Level 1
for chw1 in [0, CHW1):
  for m in [0, M):
    // Level 0
    for chw0 in [0, CHW0):
      chw = CHW0*chw1+chw0
      o[m] += i[chw] * f[CHW*m + chw]
Computational Intensity =
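For comparison (added working, not the official answer): with the strip-mined ordering each CHW0-word tile of input activations can stay in the cache while it is reused for all M outputs, so input activation traffic drops from M·C·H·W words toward C·H·W, while weight traffic remains M·C·H·W words because each weight is still used only once. The intensity therefore approaches M·C·H·W / (M·C·H·W + C·H·W + M) ≈ 1 MAC per data word; for a batch-1 FC layer, the weights bound the achievable reuse.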
L06-22
Associative Cache
[Set-associative cache diagram] The address is again split into a tag (t bits), an index (k bits), and a block offset (b bits), but the index now selects a set containing multiple 'ways', each with its own valid bit, tag, and data block. The tags of all ways are compared in parallel, and the data word or byte is picked from the 'way' that 'hits'. Associativity allows multiple streams to be resident at the same time.
L06-23
Cache Miss Pipeline Diagram
Time (cycles) →

HIT:
  ld r6, w(r5)    IF ID RF EX D$ WB
  mul r7,r4,r6       IF ID RF EX D$ WB

MISS:
  ld r6, w(r5)    IF ID RF EX D$ MISS - MEM ... WB
  mul r7,r4,r6       IF ID RF EX stall stall ... D$ WB
L06-24
Avoiding Cache Miss Stalls
• Reorganize code so loads are far ahead of use
– Requires huge amount of unrolling
– Consumes lots of registers
• Add ‘prefetch’ instructions that just load cache
– Consumes instruction issue slots
• Add hardware that automatically loads cache
L06-25
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b
• One Block Lookahead (OBL) scheme
– Initiate prefetch for block b + 1 when block b is accessed
– Can extend to N block lookahead
• Strided prefetch
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor,
prefetching 12 lines ahead of current access
L06-26
Multi-level Caches
• A memory cannot be large and fast
• Add level of cache to reduce miss penalty
– Each level can have longer latency than level above
– So, increase sizes of cache at each level
CPU ↔ L1 ↔ L2 ↔ DRAM
Metrics:
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
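A small made-up example of these metrics (a sketch; the numbers are illustrative only):

cpu_accesses = 1000     # memory accesses issued by the CPU (all go to L1)
l1_misses    = 100      # misses in L1
l2_misses    = 20       # misses in L2 (L2 is only accessed on L1 misses)

l1_local_miss_rate  = l1_misses / cpu_accesses   # 0.10
l2_local_miss_rate  = l2_misses / l1_misses      # 0.20  (misses / accesses to that cache)
l2_global_miss_rate = l2_misses / cpu_accesses   # 0.02  (misses / CPU memory accesses)
print(l1_local_miss_rate, l2_local_miss_rate, l2_global_miss_rate)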
L06-27
Contemporary CPU Cache Hierarchy
L06-28
FC Layer – Multichannel
[Figure: N input fmaps, each C channels of H×W; M filters, each C channels of H×W; producing N output fmaps, each 1×1 with M channels.]
L06-29
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × 1) = Output fmaps (M × 1)]
L06-30
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
L06-31
FC Einsum Notation
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
O_n,m = F_m,chw × I_n,chw
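The same Einsum written with NumPy (a sketch with toy sizes as assumptions):

import numpy as np

N, M, CHW = 2, 4, 8
I = np.random.rand(N, CHW).astype(np.float32)   # flattened input fmaps
F = np.random.rand(M, CHW).astype(np.float32)   # flattened filters

O = np.einsum('mk,nk->nm', F, I)   # O[n,m] = sum over chw of F[m,chw] * I[n,chw]
assert np.allclose(O, I @ F.T, rtol=1e-5)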
L06-35
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
How much temporal locality for naïve implementation? None
L06-36
Matrix-Matrix Multiply
L06-37
Matrix-Matrix Multiply Tiled
L06-38
Tiled Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), partitioned into 2×2 blocks of tiles]

[F0,0 F0,1]   [I0,0 I0,1]   [F0,0·I0,0 + F0,1·I1,0   F0,0·I0,1 + F0,1·I1,1]
[F1,0 F1,1] × [I1,0 I1,1] = [F1,0·I0,0 + F1,1·I1,0   F1,0·I0,1 + F1,1·I1,1]
Matrix multiply tiled to fit in cache
and computation ordered to maximize reuse of data in cache
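A minimal cache-blocked GEMM corresponding to the picture above (a sketch; the tile size T is a free parameter that would normally be chosen so the working tiles fit in the cache, and the toy sizes are assumptions):

import numpy as np

def tiled_matmul(F, I, T):
    M, K = F.shape               # K plays the role of CHW
    _, N = I.shape
    O = np.zeros((M, N), dtype=F.dtype)
    for m1 in range(0, M, T):
        for n1 in range(0, N, T):
            for k1 in range(0, K, T):   # accumulate partial products tile by tile
                O[m1:m1+T, n1:n1+T] += F[m1:m1+T, k1:k1+T] @ I[k1:k1+T, n1:n1+T]
    return O

F = np.random.rand(8, 12).astype(np.float32)
I = np.random.rand(12, 6).astype(np.float32)
assert np.allclose(tiled_matmul(F, I, T=2), F @ I, rtol=1e-5)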
L06-42
Einsum for tiled FC
O_n,m = I_n,chw × F_m,chw

I_n,chw → I_n1,chw1,n0,chw0
F_m,chw → F_m1,chw1,m0,chw0

O_n1,m1,n0,m0 = I_n1,chw1,n0,chw0 × F_m1,chw1,m0,chw0
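The rank splits above can be checked numerically (a sketch; the tile sizes are assumptions and must divide the corresponding dimensions):

import numpy as np

N, M, CHW = 4, 6, 8
N0, M0, CHW0 = 2, 3, 4     # tile sizes

I = np.random.rand(N, CHW).astype(np.float32)
F = np.random.rand(M, CHW).astype(np.float32)
O = np.einsum('nk,mk->nm', I, F)    # untiled: O[n,m]

# split each rank into (outer, inner) and reorder to [outer..., inner...]
I_t = I.reshape(N//N0, N0, CHW//CHW0, CHW0).transpose(0, 2, 1, 3)   # [n1, chw1, n0, chw0]
F_t = F.reshape(M//M0, M0, CHW//CHW0, CHW0).transpose(0, 2, 1, 3)   # [m1, chw1, m0, chw0]

O_t = np.einsum('acbd,ecfd->aebf', I_t, F_t)                        # [n1, m1, n0, m0]
assert np.allclose(O, O_t.transpose(0, 2, 1, 3).reshape(N, M), rtol=1e-5)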
L06-43
Fully-Connected (FC) Layer
• Implementation: Matrix Multiplication (GEMM)
• CPU: OpenBLAS, Intel MKL, etc.
• GPU: cuBLAS, cuDNN, etc.
• Library will note the shape of the matrix multiply and select an implementation optimized for that shape.
• Optimization usually involves proper tiling to the storage hierarchy
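For example (a sketch), NumPy's matmul dispatches to whatever BLAS it was built against (e.g., OpenBLAS or MKL), so the flattened FC layer can be handed to the library as a single GEMM:

import numpy as np

N, M, CHW = 32, 1000, 4096
I = np.random.rand(N, CHW).astype(np.float32)   # flattened input fmaps
F = np.random.rand(M, CHW).astype(np.float32)   # flattened filters

O = I @ F.T    # (N, M) output fmaps computed by the library's SGEMM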
L06-44
Tradeoffs in Memories
L06-45
Overview of Memories
Memories consist of arrays of cells that hold a value.
• Types of Memories/Storage
– Latches/Flip Flops (Registers)
– SRAM (Register File, Caches)
– DRAM (Main Memory)
– Flash (Storage)
L06-46
Elements of Memory Operation
Implementations vary based on:
– How does a memory cell hold a value?
– How is a value obtained from a memory cell?
– How is a value set in a memory cell?
– How is an array constructed out of individual cells?
• Results in tradeoffs between cost, density, speed, energy and
power consumption
L06-47
Latches/Flip Flops
• Fast and low latency
• Located with logic
[Example from the CPU pipeline: diagram highlighting PC, I$, IR, GPR, X, Y, +, *, D$; D-flip-flop schematic. Image source: 6.111]
L06-48
Latches/Flip Flops (< 0.5 kB)
• Fast and low latency
• Located with logic
• Not very dense
– 10+ transistors per bit
– Usually used for arrays smaller than 0.5 kB
[Figure: array of flip-flops selected by a read address [A2:A0]; D-flip-flop image source: 6.111]
L06-49
SRAM
• Higher density than registers
– Usually, 6 transistors per bit-cell
• Less robust and slower than latches/flip-flops
[Photos: IC wafer; bit-cell size 0.75 µm² in 14 nm]
L06-50
SRAM (kB – MB)
[Figure: SRAM array selected by address bits [Ak:A0]]
L06-51
SRAM Power Dominated by Bit Line
Measured SRAM power breakdown @ VDD = 0.6 V: bit-lines (BL) 56%, word-line (WL) 6%, sensing network 15%, other 22%.
Larger array → longer bit-lines → higher capacitance → higher power
(Image source: Mahmut Sinangil)
L06-52
DRAM
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
L06-53
DRAM (GB)
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
– Usually off-chip (except eDRAM – which is pricey!)
– Off-chip interconnect has much higher capacitance
L06-54
Flash (100GB to TB)
• More dense than DRAM
• Non-volatile
– Needs high powered write (change VTH of transistor)
L06-55
Flash Memory
[Figure: flash cell types — single-level cell (SLC), multi-level cell (MLC), and ternary/triple-level cell (TLC); e.g., a 48-layer TLC device (Aug 2015) stores 256 Gb per die (for SSD).]
L06-56
Memory Tradeoffs
– Density: function of circuit type (smaller → denser)
– Cost/bit: function of circuit type (smaller → cheaper)
– Energy/access/bit: function of total capacity (smaller → less energy) and circuit type (smaller → less energy)
– Latency: function of circuit type (smaller → slower) and total capacity (smaller → faster)
– Bandwidth: increases with parallelism

Most attributes tend to improve with technology scaling, lower voltage and sometimes smaller capacitors.
L06-57
Summary
• Reduce main memory access with caches
– Main memory (i.e., DRAM) is slow and has high energy consumption
– Exploits spatial and temporal locality
• Tiling to reduce cache misses
– Possible since processing order does not affect result (MACs are commutative)
– Add levels to loop nest to improve temporal locality
– Size of tile depends on cache size and cache associativity
• Tradeoffs in storage technology
– Various tradeoffs in cost, speed, energy, capacity…
– Different technologies appropriate at different spots in the design
L06-58
Next Lecture: Vectorization
Thank you!