SlideShare a Scribd company logo
1 of 60
Download to read offline
L13-1
6.5930/1
Hardware Architectures for Deep Learning
Mapping to Hardware
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 20, 2023
L13-2
Sze and Emer
Data Orchestration
March 20, 2023
L13-4
Sze and Emer
Approaches: Implicit versus Explicit
Explicit:
Implicit:
March 20, 2023
L13-5
Sze and Emer
Approaches: Coupled versus Decoupled
Implicit + Coupled Implicit + Decoupled
March 20, 2023
L13-6
Sze and Emer
Explicit Decoupled Data Orchestration
Implicit + Decoupled Explicit + Decoupled
March 20, 2023
L13-7
Sze and Emer
Classifying Orchestration Approaches
implicit data
placement
decoupled
coupled
Load...
Math...
Store...
HW Pre-
fetching
ASIC-style fill
and iterate
engines
GPU
shared
memory
Cache-
level
specific
loads
SW-
controlled
replace.
policy
explicit data
placement
SW pre-
fetching
Decoupled
Access
Execute
March 20, 2023
L13-8
Sze and Emer
EDDO Strategies
March 20, 2023
Fill and use:
Buffer Storage
Drain values
Fill values
Double buffer
Buffer Storage
Drain values
Fill values
Rolling use
Buffer Storage
Drain values
Fill values
L13-9
Sze and Emer
Buffets
• Compared to FIFO
– Allows random access into live window
– Allows updates of values in live window
• Compared to scratchpad:
– Adds scoreboarding for synchronization
– Allows arbitrary degrees of buffering
• Compared to cache
– Addresses local (therefore fewer bits) and no tag store
– Push model versus pull model for smaller landing zone
– Raises level of abstraction beyond single value transfers
Buffet
Fill()
from
memory
Read() Update() Shrink()
local_offset data local_offset data
data
num to
datapath
Can be specialized, for example: “read-only” buffets that have fills but not updates
March 20, 2023
L13-11
Sze and Emer
Buffet Usage Model
March 20, 2023
Buffet
Storage
Read sequence
Read values
Update sequence
Update values
Fill values
Fill sequence
L13-12
Sze and Emer
Buffet Behavioral Attributes
• Based on ‘fill’ address sequence, will pop values from ‘fill’ channel until
fill buffer is full.
• Based on ‘read’ address sequence will try to push values down ‘read’
LI channel, but only if the value has been ‘filled’.
• ‘Read’ address sequence can also inform buffer that a value can be
dropped, i.e., space freed.
• Based on ‘update’ address sequence will try to pop values from
‘update’ channel
• Implementation may include multiple logical buffers inside a single
physical buffer.
March 20, 2023
L13-13
Sze and Emer
Buffets: Composable Idiom for E.D.D.O.
Transfers between levels only depend on a credit
flow from the adjacent level.
March 20, 2023
L13-14
Sze and Emer
Data Communications
March 20, 2023
L13-15
Sze and Emer
Configurable Systolic Array - WARP
March 20, 2023
Source: WARP Architecture and Implementation, ISCA 1986
L13-16
Sze and Emer
Latency-Insensitive (LI) Channels
March 20, 2023
IsFull()
Push() IsEmpty()
Peek()
Pop()
OUT IN
Connections
Producer Consumer
Credit: Pellauer
Inter-unit communication can use Latency-Insensitive
(LI) Channels that, for functionality purposes, have no
latency requirements
Semantics: FIFO abstraction with flow control on all interfaces
L13-17
Sze and Emer
March 20, 2023
Next:
Mapping
L13-18
Sze and Emer
This Lecture
• Continue understanding representation of a convolution using loop
nests, including mapping
• See how costs of a mapping can be determined from the loop nest
representation.
• See how loop nest can guide configuration of an accelerator.
• Consider a loop nest representation for a full CNN layer and how to
search for an optimal mapping
• Reading: Efficient Processing of Deep Neural Networks – Chap 5/6
March 20, 2023
L13-19
Sze and Emer
Mapping Output Stationary to Hardware
March 20, 2023
L13-20
Sze and Emer
1-D Convolution – Output Stationary
March 20, 2023
S
Weights
W
Inputs
Q = W-ceil(S/2)†
Outputs
* =
int i[W]; # Input activations
int f[S]; # Filter weights
int o[Q]; # Output activations
for q in [0, Q):
for s in (0, S):
o[q] += i[q+s]*f[s]
𝑂! = 𝐼!"#× 𝐹#
Traversal order (fastest to slowest): S, Q
L13-21
Sze and Emer
Single PE Output Stationary Flow
March 20, 2023
L1
Weights
L1
Inputs
L1
Outputs
MAC
>f[0]
>i[0]
>f[1]
>i[1]
>f[2]
>i[2]
<o[0]
>f[0]
>i[1]
>f[1]
>i[2]
>f[2]
>i[3]
<o[1]
Assuming S=3
Buffers
PE
L13-22
Sze and Emer
Single PE Output Stationary Flow
March 20, 2023
L1
Weights
L1
Inputs
L1
Outputs
MAC
>f[0]
>i[0]
>f[1]
>i[1]
>f[2]
>i[2]
<o[0]
>f[0]
>i[1]
>f[1]
>i[2]
>f[2]
>i[3]
<o[1]
LI
channels
LI
channels
LI
channels
Assuming S=3
Buffers
PE
Buffets
L13-23
Sze and Emer
Single PE Output Stationary Flow
March 20, 2023
L1
Weights
L1
Inputs
L1
Outputs
MAC
Assuming S=3
Buffets
PE
t = f[*]*i[*]
t += f[*]*i[*]
t += f[*]*i[*]
o[*] = t
t = f[*]*i[*]
t += f[*]*i[*]
t += f[*]*i[*]
o[*] = t
●
●
●
x[*] is input or output on LI channel stream
L13-24
Sze and Emer
LI Channel-based Buffer vs Scratchpads
• PE does not generate addresses, i.e., no load or store address
calculations
• Address generation is not serialized with arithmetic operations,
e.g., as loads or stores
• PE does not need register target for each scratchpad request in
flight
• If the channel operations are guaranteed never to block then the
channel logic can be optimized away and the reads/writes can
happen systolically.
March 20, 2023
L13-25
Sze and Emer
Output Stationary – Reference Pattern
March 20, 2023
Layer Shape: - S = 4
- Q = 9
- W = 12
for q in [0, Q):
for s in [0, S):
o[q] += i[q+s]*f[s]
L13-26
Sze and Emer
Output Stationary – Reference Pattern
March 20, 2023
Observations: - Single output is reused many times (S)
L13-27
Sze and Emer
Output Stationary – Reference Pattern
March 20, 2023
Observations: - Single output is reused many times (S)
- All weights reused repeatedly
L13-28
Sze and Emer
Observations: - Single output is reused many times (S)
- All weights reused repeatedly
- Sliding window of inputs (size = S)
Observations: - Single output is reused many times (S)
- All weights reused repeatedly
Output Stationary – Reference Pattern
March 20, 2023
L13-29
Sze and Emer
L1 Data Accesses - Weights
March 20, 2023
OS
MACs
Weight
Reads
Input
Reads
Output
Reads
Output
Writes
S
Q
Q*S
Q*S
for q in [0, Q):
for s in [0, S):
o[q] += i[q+s]*f[s]
L13-30
Sze and Emer
L1 Data Accesses - Inputs
March 20, 2023
OS
MACs Q*S
Weight
Reads
Q*S
Input
Reads
Output
Reads
Output
Writes
Q
S
Q*S
for q in [0, Q):
for s in [0, S):
o[q] += i[q+s]*f[s]
L13-31
Sze and Emer
L1 Data Accesses - Outputs
March 20, 2023
OS
MACs Q*S
Weight
Reads
Q*S
Input
Reads
Q*S
Output
Reads
Output
Writes
Q
S
0
Q
for q in [0, Q):
for s in [0, S):
o[q] += i[q+s]*f[s]
L13-39
Sze and Emer
Intermediate Buffering
March 20, 2023
L13-40
Sze and Emer
Intermediate Buffering
March 20, 2023
L0
Weights
L0
Inputs
L0
Outputs
PE
L0
Weights
L0
Inputs
L0
Outputs
L1
Weights
L1
Inputs
L1
Outputs
How will this be reflected
in the loop nest?
New ‘level’ of loops
L13-41
Sze and Emer
1-D Convolution – Buffered
March 20, 2023
int i[W]; # Input activations
int f[S]; # Filter Weights
int o[Q]; # Output activations
// Level 1
for q1 in [0, Q1):
for s1 in 0, S1):
// Level 0
for q0 in [0, Q0):
for s0 in [0, S0):
o[q1*Q0+q0] += i[q1*Q0+q0 + s1*S0+s0]
* f[s1*S0+s0]
† Assuming: ‘valid’ style convolution
Note Q and S are
factored so:
Q0*Q1 = Q
S0*S1 = S
S
Weights
W
Inputs
Q = W-ceil(S/2)†
Outputs
* =
L13-42
Sze and Emer
Buffer sizes
• Level 0 buffer size is volume needed in each Level 1 iteration.
• Level 1 buffer size is volume needed to be preserved and re-
delivered in future (usually successive) Level 1 iterations.
• A legal assignment of loop limits will fit into the hardware’s buffer
sizes
March 20, 2023
// Level 1
for q1 in 0, Q1):
for s1 in [0 to S1):
// Level 0
for (q0 in [0, Q0):
for s0 in [0, S0):
o[q1*Q0+q0] += i[q1*Q0+q0 + s1*S0+s0]* f[s1*S0+s0];
}
Constant over each level 1 iteration
L13-43
Sze and Emer
Buffered – 1D Convolution Einsum
March 20, 2023
𝑂! = 𝐼!"#× 𝐹#
Split: S by S0 and Q by Q0
Traversal order (fastest to slowest): S0, Q0, S1, Q1
𝑂!$∗&'"!' = 𝐼!$∗&'"!'"#$∗('"#'× 𝐹#$∗('"#'
L13-44
Sze and Emer
Spatial Mapping
March 20, 2023
L13-45
Sze and Emer
Spatial PEs
March 20, 2023
L0
Weights
L0
Inputs
L0
Outputs
PE0
PE1
How will this be reflected
in the loop nest?
New ‘level’ of loops
L13-46
Sze and Emer
1-D Convolution – Spatial
March 20, 2023
S
Weights
W
Inputs
Q = W-ceil(S/2)†
Outputs
* =
int i[W]; # Input activations
int f[S]; # Filter Weights
int o[Q]; # Output activations
// Level 1
parallel-for q1 in [0, Q1):
parallel-for s1 in [0, S1):
// Level 0
for s0 in S0):
for q0 in 0, Q0):
o[q1*Q0+q0] += i[q1*Q0+q0 + s1*S0+s0]
* f[s1*S0+s0];
}
† Assuming: ‘valid’ style convolution
Note:
Q0*Q1 = Q
S0*S1 = S
Q1 = 1 => q1 = 0
S0 = 1, S1 = 2
L13-47
Sze and Emer
1-D Convolution – Spatial
March 20, 2023
S
Weights
W
Inputs
Q = W-ceil(S/2)†
Outputs
* =
int i[W]; # Input activations
int f[S]; # Filter Weights
int o[Q]; # Output activations
// Level 1
parallel-for s1 in [0, S1):
// Level 0
for s0 in [0, S0):
for q in [0, Q):
o[q] += i[q+s1*S0+s0] * f[s1*S0+s0]
† Assuming: ‘valid’ style convolution
Note:
Q0*Q1 = Q
S0*S1 = S
L13-48
Sze and Emer
Buffered – 1D Convolution Einsum
March 20, 2023
𝑂! = 𝐼!"#× 𝐹#
Split: S by S0
Traversal order (fastest to slowest): S0, Q
𝑂! = 𝐼!"#$∗('"#'× 𝐹#$∗('"#'
Parallel: S1
L13-49
Sze and Emer
Spatial Weight Stationary References
March 20, 2023
Shape: - S = 4
- Q = 9
- W = 12
Loop limits: - S1 = 2
- S0 = 2
- Q0 = 9
S1 = 2
S0 = 2
Q0 = 9
L13-50
Sze and Emer
Spatial Weight Stationary References
March 20, 2023
• Number of cycles half that of before
• Single weight per PE used for a long time (unicast to each PE)
• Inputs reused in next cycle (opportunity for inter-PE communication)
• Inputs are also reused after a long interval, implying a large window (Q)
• Partial sums are reused in same cycle (opportunity for spatial sum)
• Partial sums reused after a long interval, very large window (size = Q)
S1 = 2
S0 = 2
Q0 = 9
L13-51
Sze and Emer
Spatial PEs
March 20, 2023
L1
Weights
L1
Inputs
L1
Outputs
PE0
PE1
+
Sum may be
embedded in PEs
Illegal mapping!
What if hardware cannot do a spatial sum?
For
resuming
partial
sum
Inter-PE
path
Unique values
L13-52
Sze and Emer
Weight Stationary (WS)
• Note that activations are multi-cast.
• To achieve this behavior we need to “skew”
the activity in the PEs so instead of needing
activation in adjacent cycles there are needed
in the same cycle!
Global Buffer
W0 W1 W2 W3 W4 W5 W6 W7
Psum Activation
PE
Weight
March 20, 2023
L13-53
Sze and Emer
Buffered – 1D Convolution Einsum
March 20, 2023
𝑂! = 𝐼!"#× 𝐹#
Split: S by S0
Traversal order (fastest to slowest): S0, Q
𝑂! = 𝐼!"#$∗('"#'× 𝐹#$∗('"#'
Parallel: S1
Time Skew: +s1
L13-54
Sze and Emer
Spatial Weight Stationary (Skewed)
March 20, 2023
• Single weight per PE used for a long time (unicast to each PE)
• Inputs used simultaneously at both PEs (opportunity for multicast)
• Inputs are also reused after a long interval, implying a large window (Q)
• Partial sums are reused are reused in adjacent cycles in adjacent PEs
opportunity for inter-PE communication and temporal sum
• Partial sums reused after a long interval, very large window (size = Q)
S1 = 2
S0 = 2
Q0 = 9
L13-55
Sze and Emer
Spatial PEs
March 20, 2023
L1
Weights
L1
Inputs
L1
Outputs
PE0
PE1
Sum is embedded
in PE
Weights are still unique, but note lower bandwidth and bursty demand.
For
resuming
partial
sum
Inter-PE
path
Same
values
Is there a way to avoid the large input and psum window? Make S1 = S
L13-56
Sze and Emer
With S1 = S
March 20, 2023
Shape: - S = 4
- Q = 9
- W = 12
Loop limits: - S1 = 4
- S0 = 1
- Q0 = 9
L13-57
Sze and Emer
Mapping Process
March 20, 2023
L13-58
Sze and Emer
Mapping
Definition: selecting the placement and scheduling in space and time of every
operation (including delivering the appropriate operands) required for a DNN
computation onto the hardware function units of the accelerator.
Steps: Within the constraints of the hardware, select for each level of the
storage hierarchy:
• A dataflow (for loop order)
• A partitioning (for loop limits) both spatial and temporal
• Other behavioral details…, e.g., bypassing
• A binding computation to specific hardware units
March 20, 2023
L13-59
Sze and Emer
CPU Compute Model
March 20, 2023
Processor
Compiler
Architecture
µArchitecture
Input
Data
Binary Program
Processed
Data
Program
Behavioral Statistics
L13-60
Sze and Emer
DNN Compute Model
March 20, 2023
Processor
Compiler
Input
Data
Processed
Data
Program
Behavioral Statistics
Problem Spec
and Shape
Configuration
Dataflow(s) and
Constraints
Implementation
DNN
Accelerator
Mapper
Input
Activations
and Weights
Output
Activations
L13-61
Sze and Emer
Convolution (CONV) Layer
…
M
…
Many
Input fmaps (N) Many
Output fmaps (N)
…
R
S
R
S
…
…
…
C
…
C
…
…
…
filters
…
P
Q
…
…
H
…
…
C
…
H
W
…
…
…
…
C
…
…
P
…
…
1 1
N
N
W Q
March 20, 2023
Batch Size (N)
L13-62
Sze and Emer
Mapping Choices
480,000 mappings shown
Spread: 19x in energy efficiency
Only 1 is optimal, 9 others within 1%
Energy-efficiency of peak-perf mappings of a single problem
A mapper needs a good cost model to
find an optimal mapping
A model needs a mapper to evaluate a
DNN workload on an architecture
6,582 mappings have min. DRAM accesses
but vary 11x in energy efficiency
Source: Parashar, Timeloop
March 20, 2023
L13-63
Sze and Emer
Mapper Organization
March 20, 2023
Mapper
Optimizing
Search
Perf/Energy
Model
Problem/Shape
Map space
Construction
Dataflow(s)/Constraints
Implementation
Mapspace
Mapping
Config
Mapping to
Config
L13-65
Sze and Emer
Timeloop Accelergy
March 20, 2023
Accelergy
(Energy Estimator Tool)
Architecture &
Implementation
description
Action
counts
…
Energy
estimation
Energy
estimation
plug-in 0
Energy
estimation
plug-in 1
Timeloop
(DNN Mapping Tool:
map space creation, search,
performance model)
Problem Spec /
Shape
(DNN Model)
L13-66
Sze and Emer
Workload Specification
• Deep Loop Nest
.
.
.
N
.
.
.
N
C
C
M
M
C
P
Weights Inputs
Outputs
S
R
H=
Q+S-1
W=P+R-1
Q
Weights
Inputs
S
W=Q+S-1
Operation Space
Data Spaces
Provided to Timeloop Inferred by Timeloop
Projection
P
r
o
j
e
c
t
i
o
n
Source: Parashar, Timeloop
March 20, 2023
𝑂!,#,$,% = 𝐼!,&,'$(),'%(*
×𝐹#,&,),*
L13-67
Sze and Emer
Architecture Specifications
• Temporal reuse features
– Number of buffer levels and buffer sizes
– Buffer bypassing capabilities
– Network topology
– …
• Parallelism and spatial reuse features
– Topology of spatial fractures
– Multicast capabilities
– Inter-PE network, e.g., spatial sum and forwarding reuse
– …
• Constraints
– Index sequence restrictions, e.g., allowable strides
– Fixed level 0 loop nest, e.g., fixed vector width
– Fixed level 1 spatial mappings, e.g., input/output channel array
– …
March 20, 2023
Determines the legal mappings:
loop permutations (dataflows) and associated loop limits
L13-68
Sze and Emer
Implementation Specifications
• Buffer bandwidth and latency
• Buffer port and banking organization
• PE vectorization
• Network bandwidth and latency, e.g., router costs
• Shared or per-datatype network links
• …
March 20, 2023
Determines the latency and energy consumption of a mapping.
L13-69
Sze and Emer
Timeloop
• Tool for Evaluation and Architectural Design-Space Exploration of DNN
Accelerators
Model variety of DNN accelerators
Target every architecture supported by Model
Source: Parashar, Timeloop
March 20, 2023
L13-75
Sze and Emer
Next Lecture: Sparsity
Thank you!
March 20, 2023

More Related Content

Similar to L13.pdf

stack & queues .pptx
stack & queues .pptxstack & queues .pptx
stack & queues .pptxkaishsahu
 
Ceng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer AdderCeng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer Addergueste731a4
 
cualculo de flechas en curvas ferroviarias
cualculo de flechas en curvas ferroviariascualculo de flechas en curvas ferroviarias
cualculo de flechas en curvas ferroviariasreynaldoxxxxxxx
 
DISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEM
DISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEMDISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEM
DISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEMIAEME Publication
 
Section 4 3_the_scattering_matrix_package
Section 4 3_the_scattering_matrix_packageSection 4 3_the_scattering_matrix_package
Section 4 3_the_scattering_matrix_packageJamal Kazazi
 
Parallel sorting algorithm
Parallel sorting algorithmParallel sorting algorithm
Parallel sorting algorithmRicha Kumari
 
Implementation of Stronger S-Box for Advanced Encryption Standard
Implementation of Stronger S-Box for Advanced Encryption StandardImplementation of Stronger S-Box for Advanced Encryption Standard
Implementation of Stronger S-Box for Advanced Encryption Standardtheijes
 
Comparison among Different Adders
Comparison among Different Adders Comparison among Different Adders
Comparison among Different Adders iosrjce
 
Design of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueDesign of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueKumar Goud
 

Similar to L13.pdf (20)

Sketch root locus
Sketch root locusSketch root locus
Sketch root locus
 
L10-handout.pdf
L10-handout.pdfL10-handout.pdf
L10-handout.pdf
 
L10-handout.pdf
L10-handout.pdfL10-handout.pdf
L10-handout.pdf
 
stack & queues .pptx
stack & queues .pptxstack & queues .pptx
stack & queues .pptx
 
L07.pdf
L07.pdfL07.pdf
L07.pdf
 
L07-handout.pdf
L07-handout.pdfL07-handout.pdf
L07-handout.pdf
 
Ceng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer AdderCeng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer Adder
 
07Decoders121
07Decoders12107Decoders121
07Decoders121
 
cualculo de flechas en curvas ferroviarias
cualculo de flechas en curvas ferroviariascualculo de flechas en curvas ferroviarias
cualculo de flechas en curvas ferroviarias
 
DISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEM
DISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEMDISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEM
DISTRIBUTION LOAD FLOW ANALYSIS FOR RDIAL & MESH DISTRIBUTION SYSTEM
 
L02.pdf
L02.pdfL02.pdf
L02.pdf
 
L06.pdf
L06.pdfL06.pdf
L06.pdf
 
Jy3517271730
Jy3517271730Jy3517271730
Jy3517271730
 
Section 4 3_the_scattering_matrix_package
Section 4 3_the_scattering_matrix_packageSection 4 3_the_scattering_matrix_package
Section 4 3_the_scattering_matrix_package
 
Ee321s3.1
Ee321s3.1Ee321s3.1
Ee321s3.1
 
Parallel sorting algorithm
Parallel sorting algorithmParallel sorting algorithm
Parallel sorting algorithm
 
Implementation of Stronger S-Box for Advanced Encryption Standard
Implementation of Stronger S-Box for Advanced Encryption StandardImplementation of Stronger S-Box for Advanced Encryption Standard
Implementation of Stronger S-Box for Advanced Encryption Standard
 
Comparison among Different Adders
Comparison among Different Adders Comparison among Different Adders
Comparison among Different Adders
 
Transmission lines
Transmission linesTransmission lines
Transmission lines
 
Design of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueDesign of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition Technique
 

More from TRNHONGLINHBCHCM (11)

L09-handout.pdf
L09-handout.pdfL09-handout.pdf
L09-handout.pdf
 
L03.pdf
L03.pdfL03.pdf
L03.pdf
 
L11.pdf
L11.pdfL11.pdf
L11.pdf
 
L09.pdf
L09.pdfL09.pdf
L09.pdf
 
L04.pdf
L04.pdfL04.pdf
L04.pdf
 
L01.pdf
L01.pdfL01.pdf
L01.pdf
 
L14.pdf
L14.pdfL14.pdf
L14.pdf
 
Projects.pdf
Projects.pdfProjects.pdf
Projects.pdf
 
L10.pdf
L10.pdfL10.pdf
L10.pdf
 
PaperReview.pdf
PaperReview.pdfPaperReview.pdf
PaperReview.pdf
 
L10.pdf
L10.pdfL10.pdf
L10.pdf
 

Recently uploaded

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEselvakumar948
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwaitjaanualu31
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Servicemeghakumariji156
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesRAJNEESHKUMAR341697
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxMuhammadAsimMuhammad6
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...Amil baba
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 

Recently uploaded (20)

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 

L13.pdf

  • 1. L13-1 6.5930/1 Hardware Architectures for Deep Learning Mapping to Hardware Joel Emer and Vivienne Sze Massachusetts Institute of Technology Electrical Engineering & Computer Science March 20, 2023
  • 2. L13-2 Sze and Emer Data Orchestration March 20, 2023
  • 3. L13-4 Sze and Emer Approaches: Implicit versus Explicit Explicit: Implicit: March 20, 2023
  • 4. L13-5 Sze and Emer Approaches: Coupled versus Decoupled Implicit + Coupled Implicit + Decoupled March 20, 2023
  • 5. L13-6 Sze and Emer Explicit Decoupled Data Orchestration Implicit + Decoupled Explicit + Decoupled March 20, 2023
  • 6. L13-7 Sze and Emer Classifying Orchestration Approaches implicit data placement decoupled coupled Load... Math... Store... HW Pre- fetching ASIC-style fill and iterate engines GPU shared memory Cache- level specific loads SW- controlled replace. policy explicit data placement SW pre- fetching Decoupled Access Execute March 20, 2023
  • 7. L13-8 Sze and Emer EDDO Strategies March 20, 2023 Fill and use: Buffer Storage Drain values Fill values Double buffer Buffer Storage Drain values Fill values Rolling use Buffer Storage Drain values Fill values
  • 8. L13-9 Sze and Emer Buffets • Compared to FIFO – Allows random access into live window – Allows updates of values in live window • Compared to scratchpad: – Adds scoreboarding for synchronization – Allows arbitrary degrees of buffering • Compared to cache – Addresses local (therefore fewer bits) and no tag store – Push model versus pull model for smaller landing zone – Raises level of abstraction beyond single value transfers Buffet Fill() from memory Read() Update() Shrink() local_offset data local_offset data data num to datapath Can be specialized, for example: “read-only” buffets that have fills but not updates March 20, 2023
  • 9. L13-11 Sze and Emer Buffet Usage Model March 20, 2023 Buffet Storage Read sequence Read values Update sequence Update values Fill values Fill sequence
  • 10. L13-12 Sze and Emer Buffet Behavioral Attributes • Based on ‘fill’ address sequence, will pop values from ‘fill’ channel until fill buffer is full. • Based on ‘read’ address sequence will try to push values down ‘read’ LI channel, but only if the value has been ‘filled’. • ‘Read’ address sequence can also inform buffer that a value can be dropped, i.e., space freed. • Based on ‘update’ address sequence will try to pop values from ‘update’ channel • Implementation may include multiple logical buffers inside a single physical buffer. March 20, 2023
  • 11. L13-13 Sze and Emer Buffets: Composable Idiom for E.D.D.O. Transfers between levels only depend on a credit flow from the adjacent level. March 20, 2023
  • 12. L13-14 Sze and Emer Data Communications March 20, 2023
  • 13. L13-15 Sze and Emer Configurable Systolic Array - WARP March 20, 2023 Source: WARP Architecture and Implementation, ISCA 1986
  • 14. L13-16 Sze and Emer Latency-Insensitive (LI) Channels March 20, 2023 IsFull() Push() IsEmpty() Peek() Pop() OUT IN Connections Producer Consumer Credit: Pellauer Inter-unit communication can use Latency-Insensitive (LI) Channels that, for functionality purposes, have no latency requirements Semantics: FIFO abstraction with flow control on all interfaces
  • 15. L13-17 Sze and Emer March 20, 2023 Next: Mapping
  • 16. L13-18 Sze and Emer This Lecture • Continue understanding representation of a convolution using loop nests, including mapping • See how costs of a mapping can be determined from the loop nest representation. • See how loop nest can guide configuration of an accelerator. • Consider a loop nest representation for a full CNN layer and how to search for an optimal mapping • Reading: Efficient Processing of Deep Neural Networks – Chap 5/6 March 20, 2023
  • 17. L13-19 Sze and Emer Mapping Output Stationary to Hardware March 20, 2023
  • 18. L13-20 Sze and Emer 1-D Convolution – Output Stationary March 20, 2023 S Weights W Inputs Q = W-ceil(S/2)† Outputs * = int i[W]; # Input activations int f[S]; # Filter weights int o[Q]; # Output activations for q in [0, Q): for s in (0, S): o[q] += i[q+s]*f[s] 𝑂! = 𝐼!"#× 𝐹# Traversal order (fastest to slowest): S, Q
  • 19. L13-21 Sze and Emer Single PE Output Stationary Flow March 20, 2023 L1 Weights L1 Inputs L1 Outputs MAC >f[0] >i[0] >f[1] >i[1] >f[2] >i[2] <o[0] >f[0] >i[1] >f[1] >i[2] >f[2] >i[3] <o[1] Assuming S=3 Buffers PE
  • 20. L13-22 Sze and Emer Single PE Output Stationary Flow March 20, 2023 L1 Weights L1 Inputs L1 Outputs MAC >f[0] >i[0] >f[1] >i[1] >f[2] >i[2] <o[0] >f[0] >i[1] >f[1] >i[2] >f[2] >i[3] <o[1] LI channels LI channels LI channels Assuming S=3 Buffers PE Buffets
  • 21. L13-23 Sze and Emer Single PE Output Stationary Flow March 20, 2023 L1 Weights L1 Inputs L1 Outputs MAC Assuming S=3 Buffets PE t = f[*]*i[*] t += f[*]*i[*] t += f[*]*i[*] o[*] = t t = f[*]*i[*] t += f[*]*i[*] t += f[*]*i[*] o[*] = t ● ● ● x[*] is input or output on LI channel stream
  • 22. L13-24 Sze and Emer LI Channel-based Buffer vs Scratchpads • PE does not generate addresses, i.e., no load or store address calculations • Address generation is not serialized with arithmetic operations, e.g., as loads or stores • PE does not need register target for each scratchpad request in flight • If the channel operations are guaranteed never to block then the channel logic can be optimized away and the reads/writes can happen systolically. March 20, 2023
  • 23. L13-25 Sze and Emer Output Stationary – Reference Pattern March 20, 2023 Layer Shape: - S = 4 - Q = 9 - W = 12 for q in [0, Q): for s in [0, S): o[q] += i[q+s]*f[s]
  • 24. L13-26 Sze and Emer Output Stationary – Reference Pattern March 20, 2023 Observations: - Single output is reused many times (S)
  • 25. L13-27 Sze and Emer Output Stationary – Reference Pattern March 20, 2023 Observations: - Single output is reused many times (S) - All weights reused repeatedly
  • 26. L13-28 Sze and Emer Observations: - Single output is reused many times (S) - All weights reused repeatedly - Sliding window of inputs (size = S) Observations: - Single output is reused many times (S) - All weights reused repeatedly Output Stationary – Reference Pattern March 20, 2023
  • 27. L13-29 Sze and Emer L1 Data Accesses - Weights March 20, 2023 OS MACs Weight Reads Input Reads Output Reads Output Writes S Q Q*S Q*S for q in [0, Q): for s in [0, S): o[q] += i[q+s]*f[s]
  • 28. L13-30 Sze and Emer L1 Data Accesses - Inputs March 20, 2023 OS MACs Q*S Weight Reads Q*S Input Reads Output Reads Output Writes Q S Q*S for q in [0, Q): for s in [0, S): o[q] += i[q+s]*f[s]
  • 29. L13-31 Sze and Emer L1 Data Accesses - Outputs March 20, 2023 OS MACs Q*S Weight Reads Q*S Input Reads Q*S Output Reads Output Writes Q S 0 Q for q in [0, Q): for s in [0, S): o[q] += i[q+s]*f[s]
  • 30. L13-39 Sze and Emer Intermediate Buffering March 20, 2023
  • 31. L13-40 Sze and Emer Intermediate Buffering March 20, 2023 L0 Weights L0 Inputs L0 Outputs PE L0 Weights L0 Inputs L0 Outputs L1 Weights L1 Inputs L1 Outputs How will this be reflected in the loop nest? New ‘level’ of loops
  • 32. L13-41 Sze and Emer 1-D Convolution – Buffered March 20, 2023 int i[W]; # Input activations int f[S]; # Filter Weights int o[Q]; # Output activations // Level 1 for q1 in [0, Q1): for s1 in 0, S1): // Level 0 for q0 in [0, Q0): for s0 in [0, S0): o[q1*Q0+q0] += i[q1*Q0+q0 + s1*S0+s0] * f[s1*S0+s0] † Assuming: ‘valid’ style convolution Note Q and S are factored so: Q0*Q1 = Q S0*S1 = S S Weights W Inputs Q = W-ceil(S/2)† Outputs * =
  • 33. L13-42 Sze and Emer Buffer sizes • Level 0 buffer size is volume needed in each Level 1 iteration. • Level 1 buffer size is volume needed to be preserved and re- delivered in future (usually successive) Level 1 iterations. • A legal assignment of loop limits will fit into the hardware’s buffer sizes March 20, 2023 // Level 1 for q1 in 0, Q1): for s1 in [0 to S1): // Level 0 for (q0 in [0, Q0): for s0 in [0, S0): o[q1*Q0+q0] += i[q1*Q0+q0 + s1*S0+s0]* f[s1*S0+s0]; } Constant over each level 1 iteration
  • 34. L13-43 Sze and Emer Buffered – 1D Convolution Einsum March 20, 2023 𝑂! = 𝐼!"#× 𝐹# Split: S by S0 and Q by Q0 Traversal order (fastest to slowest): S0, Q0, S1, Q1 𝑂!$∗&'"!' = 𝐼!$∗&'"!'"#$∗('"#'× 𝐹#$∗('"#'
  • 35. L13-44 Sze and Emer Spatial Mapping March 20, 2023
  • 36. L13-45 Sze and Emer Spatial PEs March 20, 2023 L0 Weights L0 Inputs L0 Outputs PE0 PE1 How will this be reflected in the loop nest? New ‘level’ of loops
  • 37. L13-46 Sze and Emer 1-D Convolution – Spatial March 20, 2023 S Weights W Inputs Q = W-ceil(S/2)† Outputs * = int i[W]; # Input activations int f[S]; # Filter Weights int o[Q]; # Output activations // Level 1 parallel-for q1 in [0, Q1): parallel-for s1 in [0, S1): // Level 0 for s0 in S0): for q0 in 0, Q0): o[q1*Q0+q0] += i[q1*Q0+q0 + s1*S0+s0] * f[s1*S0+s0]; } † Assuming: ‘valid’ style convolution Note: Q0*Q1 = Q S0*S1 = S Q1 = 1 => q1 = 0 S0 = 1, S1 = 2
  • 38. L13-47 Sze and Emer 1-D Convolution – Spatial March 20, 2023 S Weights W Inputs Q = W-ceil(S/2)† Outputs * = int i[W]; # Input activations int f[S]; # Filter Weights int o[Q]; # Output activations // Level 1 parallel-for s1 in [0, S1): // Level 0 for s0 in [0, S0): for q in [0, Q): o[q] += i[q+s1*S0+s0] * f[s1*S0+s0] † Assuming: ‘valid’ style convolution Note: Q0*Q1 = Q S0*S1 = S
  • 39. L13-48 Sze and Emer Buffered – 1D Convolution Einsum March 20, 2023 𝑂! = 𝐼!"#× 𝐹# Split: S by S0 Traversal order (fastest to slowest): S0, Q 𝑂! = 𝐼!"#$∗('"#'× 𝐹#$∗('"#' Parallel: S1
  • 40. L13-49 Sze and Emer Spatial Weight Stationary References March 20, 2023 Shape: - S = 4 - Q = 9 - W = 12 Loop limits: - S1 = 2 - S0 = 2 - Q0 = 9 S1 = 2 S0 = 2 Q0 = 9
  • 41. L13-50 Sze and Emer Spatial Weight Stationary References March 20, 2023 • Number of cycles half that of before • Single weight per PE used for a long time (unicast to each PE) • Inputs reused in next cycle (opportunity for inter-PE communication) • Inputs are also reused after a long interval, implying a large window (Q) • Partial sums are reused in same cycle (opportunity for spatial sum) • Partial sums reused after a long interval, very large window (size = Q) S1 = 2 S0 = 2 Q0 = 9
  • 42. L13-51 Sze and Emer Spatial PEs March 20, 2023 L1 Weights L1 Inputs L1 Outputs PE0 PE1 + Sum may be embedded in PEs Illegal mapping! What if hardware cannot do a spatial sum? For resuming partial sum Inter-PE path Unique values
  • 43. L13-52 Sze and Emer Weight Stationary (WS) • Note that activations are multi-cast. • To achieve this behavior we need to “skew” the activity in the PEs so instead of needing activation in adjacent cycles there are needed in the same cycle! Global Buffer W0 W1 W2 W3 W4 W5 W6 W7 Psum Activation PE Weight March 20, 2023
  • 44. L13-53 Sze and Emer Buffered – 1D Convolution Einsum March 20, 2023 𝑂! = 𝐼!"#× 𝐹# Split: S by S0 Traversal order (fastest to slowest): S0, Q 𝑂! = 𝐼!"#$∗('"#'× 𝐹#$∗('"#' Parallel: S1 Time Skew: +s1
  • 45. L13-54 Sze and Emer Spatial Weight Stationary (Skewed) March 20, 2023 • Single weight per PE used for a long time (unicast to each PE) • Inputs used simultaneously at both PEs (opportunity for multicast) • Inputs are also reused after a long interval, implying a large window (Q) • Partial sums are reused are reused in adjacent cycles in adjacent PEs opportunity for inter-PE communication and temporal sum • Partial sums reused after a long interval, very large window (size = Q) S1 = 2 S0 = 2 Q0 = 9
  • 46. L13-55 Sze and Emer Spatial PEs March 20, 2023 L1 Weights L1 Inputs L1 Outputs PE0 PE1 Sum is embedded in PE Weights are still unique, but note lower bandwidth and bursty demand. For resuming partial sum Inter-PE path Same values Is there a way to avoid the large input and psum window? Make S1 = S
  • 47. L13-56 Sze and Emer With S1 = S March 20, 2023 Shape: - S = 4 - Q = 9 - W = 12 Loop limits: - S1 = 4 - S0 = 1 - Q0 = 9
  • 48. L13-57 Sze and Emer Mapping Process March 20, 2023
  • 49. L13-58 Sze and Emer Mapping Definition: selecting the placement and scheduling in space and time of every operation (including delivering the appropriate operands) required for a DNN computation onto the hardware function units of the accelerator. Steps: Within the constraints of the hardware, select for each level of the storage hierarchy: • A dataflow (for loop order) • A partitioning (for loop limits) both spatial and temporal • Other behavioral details…, e.g., bypassing • A binding computation to specific hardware units March 20, 2023
  • 50. L13-59 Sze and Emer CPU Compute Model March 20, 2023 Processor Compiler Architecture µArchitecture Input Data Binary Program Processed Data Program Behavioral Statistics
  • 51. L13-60 Sze and Emer DNN Compute Model March 20, 2023 Processor Compiler Input Data Processed Data Program Behavioral Statistics Problem Spec and Shape Configuration Dataflow(s) and Constraints Implementation DNN Accelerator Mapper Input Activations and Weights Output Activations
  • 52. L13-61 Sze and Emer Convolution (CONV) Layer … M … Many Input fmaps (N) Many Output fmaps (N) … R S R S … … … C … C … … … filters … P Q … … H … … C … H W … … … … C … … P … … 1 1 N N W Q March 20, 2023 Batch Size (N)
  • 53. L13-62 Sze and Emer Mapping Choices 480,000 mappings shown Spread: 19x in energy efficiency Only 1 is optimal, 9 others within 1% Energy-efficiency of peak-perf mappings of a single problem A mapper needs a good cost model to find an optimal mapping A model needs a mapper to evaluate a DNN workload on an architecture 6,582 mappings have min. DRAM accesses but vary 11x in energy efficiency Source: Parashar, Timeloop March 20, 2023
  • 54. L13-63 Sze and Emer Mapper Organization March 20, 2023 Mapper Optimizing Search Perf/Energy Model Problem/Shape Map space Construction Dataflow(s)/Constraints Implementation Mapspace Mapping Config Mapping to Config
  • 55. L13-65 Sze and Emer Timeloop Accelergy March 20, 2023 Accelergy (Energy Estimator Tool) Architecture & Implementation description Action counts … Energy estimation Energy estimation plug-in 0 Energy estimation plug-in 1 Timeloop (DNN Mapping Tool: map space creation, search, performance model) Problem Spec / Shape (DNN Model)
  • 56. L13-66 Sze and Emer Workload Specification • Deep Loop Nest . . . N . . . N C C M M C P Weights Inputs Outputs S R H= Q+S-1 W=P+R-1 Q Weights Inputs S W=Q+S-1 Operation Space Data Spaces Provided to Timeloop Inferred by Timeloop Projection P r o j e c t i o n Source: Parashar, Timeloop March 20, 2023 𝑂!,#,$,% = 𝐼!,&,'$(),'%(* ×𝐹#,&,),*
  • 57. L13-67 Sze and Emer Architecture Specifications • Temporal reuse features – Number of buffer levels and buffer sizes – Buffer bypassing capabilities – Network topology – … • Parallelism and spatial reuse features – Topology of spatial fractures – Multicast capabilities – Inter-PE network, e.g., spatial sum and forwarding reuse – … • Constraints – Index sequence restrictions, e.g., allowable strides – Fixed level 0 loop nest, e.g., fixed vector width – Fixed level 1 spatial mappings, e.g., input/output channel array – … March 20, 2023 Determines the legal mappings: loop permutations (dataflows) and associated loop limits
  • 58. L13-68 Sze and Emer Implementation Specifications • Buffer bandwidth and latency • Buffer port and banking organization • PE vectorization • Network bandwidth and latency, e.g., router costs • Shared or per-datatype network links • … March 20, 2023 Determines the latency and energy consumption of a mapping.
  • 59. L13-69 Sze and Emer Timeloop • Tool for Evaluation and Architectural Design-Space Exploration of DNN Accelerators Model variety of DNN accelerators Target every architecture supported by Model Source: Parashar, Timeloop March 20, 2023
  • 60. L13-75 Sze and Emer Next Lecture: Sparsity Thank you! March 20, 2023