SlideShare a Scribd company logo
1 of 31
Download to read offline
1
DNN Model and
Hardware Co-Design
ISCA Tutorial (2019)
Website: http://eyeriss.mit.edu/tutorial.html
Joel Emer, Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang
2
Reduce Number of Ops and Weights
• Exploit Activation Statistics
• Exploit Weight Statistics
• Exploit Dot Product Computation
• Decomposed Trained Filters
• Knowledge Distillation
3
Sparsity in Fmaps
9 -1 -3
1 -5 5
-2 6 -1
Many zeros in output fmaps after ReLU
ReLU
9 0 0
1 0 5
0 6 0
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5
CONV Layer
# of activations # of non-zero activations
(Normalized)
4
…
…
…
…
…
…
ReL
U
Input Image
Output Image
Filter Filt
Img
Psum
Psum
Buffer
SRAM
108K
B
14×12 PE Array
Link Clock Core Clock
I/O Compression in Eyeriss
Run-Length Compression (RLC)
Example:
Output (64b):
Input: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, …
5b 16b 1b
5b 16b 5b 16b
2 12 4 53 2 22 0
RunLevelRunLevelRunLevelTerm
Off-Chip DRAM
64 bits
Decomp
Comp
[Chen et al., ISSCC 2016]
DCNN Accelerator
5
Compression Reduces DRAM BW
0
1
2
3
4
5
6
1 2 3 4 5
DRAM
Access
(MB)
AlexNet Conv Layer
Uncompressed Compressed
1 2 3 4 5
AlexNet Conv Layer
DRAM
Access
(MB)
0
2
4
6
1.2×
1.4×
1.7×
1.8×
1.9×
Uncompressed
Fmaps + Weights
RLE Compressed
Fmaps + Weights
[Chen et al., ISSCC 2016]
Simple RLC within 5% - 10% of theoretical entropy limit
6
Data Gating / Zero Skipping in Eyeriss
Filter
Scratch Pad
(225x16b SRAM)
Partial Sum
Scratch Pad
(24x16b REG)
Filt
Im
g
Input
Psum
2-stage
pipelined
multiplier
Output
Psum
0
Accumulate
Input Psum
1
0
== 0 Zero
Buffer
Enable
Image
Scratch Pad
(12x16b REG)
0
1
Skip MAC and mem reads
when image data is zero.
Reduce PE power by 45%
Reset
[Chen et al., ISSCC 2016]
7
Cnvlutin
• Process Convolution Layers
• Built on top of DaDianNao (4.49% area overhead)
• Speed up of 1.37x (1.52x with activation pruning)
[Albericio et al., ISCA 2016]
8
Pruning Activations
[Reagen et al., ISCA 2016]
Remove small activation values
[Albericio et al., ISCA 2016]
Speed up 11% (ImageNet) Reduce power 2x (MNIST)
Minerva
Cnvlutin
9
Exploit Correlation in Input Data
• Exploit Temporal Correlation of Inputs
– Reduce amount of computation if there is temporal correlation between
frames
– Requires additional storage and need to measure redundancy (e.g.
motion vector for videos)
– Application specific (e.g. videos) – requires that the same operation is
done for each frame (not always the case)
[Zhang et al., FAST, CVPRW 2017], [EVA2, ISCA 2018],
[Euphrates, ISCA 2018], [Riera et al., ISCA 2018]
10
Exploit Correlation in Input Data
• Exploit Spatial Correlation of Inputs
– Delta code neighboring values (activation) resulting in sparse inputs to
each layer
– Reduces storage cost and data movement for improvement in energy-
efficiency and throughput
[Diffy, MICRO 2018]
11
Pruning – Make Weights Sparse
• Optimal Brain Damage
1. Choose a reasonable network
architecture
2. Train network until reasonable
solution obtained
3. Compute the second derivative
for each weight
4. Compute saliencies (i.e. impact
on training error) for each weight
5. Sort weights by saliency and
delete low-saliency weights
6. Iterate to step 2
[Lecun et al., NeurIPS 1989]
retraining
12
Pruning – Make Weights Sparse
pruning
neurons
pruning
synapses
after pruning
before pruning
Prune based on magnitude of weights
Train Connectivity
Prune Connections
Train Weights
Example: AlexNet
Weight Reduction: CONV layers 2.7x, FC layers 9.9x
(Most reduction on fully connected layers)
Overall: 9x weight reduction, 3x MAC reduction
[Han et al., NeurIPS 2015]
13
Speed up of Weight Pruning on CPU/GPU
Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
Batch size = 1
On Fully Connected Layers Only
Average Speed up of 3.2x on GPU, 3x on CPU, 5x on mGPU
[Han et al., NeurIPS 2015]
14
Design of Efficient DNN Algorithms
• Popular efficient DNN algorithm approaches
pruning
neurons
pruning
synapses
after pruning
before pruning
Network Pruning
C
1
1
S
R
1
R
S
C
Compact Network Architectures
Examples: SqueezeNet, MobileNet
... also reduced precision
• Focus on reducing number of MACs and weights
• Does it translate to energy savings and reduced latency?
15
Energy-Evaluation Methodology
CNN Shape Configuration
(# of channels, # of filters, etc.)
CNN Weights and Input Data
[0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]
CNN Energy Consumption
L1 L2 L3
Energy
…
Memory
Accesses
Optimization
# of MACs
Calculation
…
# acc. at mem. level 1
# acc. at mem. level 2
# acc. at mem. level n
# of MACs
Hardware Energy Costs of each
MAC and Memory Access
Ecomp
Edata
Evaluation tool available at http://eyeriss.mit.edu/energy.html
16
Key Observations
• Number of weights alone is not a good metric for energy
• All data types should be considered
Output Feature Map
43%
Input Feature Map
25%
Weights
22%
Computation
10%
Energy Consumption
of GoogLeNet
[Yang et al., CVPR 2017]
17
Energy-Aware Pruning
Directly target energy and
incorporate it into the
optimization of DNNs to
provide greater energy savings
• Sort layers based on energy and
prune layers that consume most
energy first
• EAP reduces AlexNet energy by
3.7x and outperforms the
previous work that uses
magnitude-based pruning by 1.7x
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
Ori. DC EAP
Normalized Energy (AlexNet)
2.1x
3.7x
x109
Magnitude
Based Pruning
Energy Aware
Pruning
Pruned models available at
http://eyeriss.mit.edu/energy.html
[Yang et al., CVPR 2017]
18
# of Operations vs. Latency
• # of operations (MACs) does not approximate latency well
Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)
19
NetAdapt: Platform-Aware DNN Adaptation
• Automatically adapt DNN to a mobile platform to reach a
target latency or energy budget
• Use empirical measurements to guide optimization (avoid
modeling of tool chain or platform architecture)
[Yang et al., ECCV 2018]
NetAdapt Measure
…
Network Proposals
Empirical Measurements
Metric Proposal A … Proposal Z
Latency 15.6 … 14.3
Energy 41 … 46
…
…
…
Pretrained
Network Metric Budget
Latency 3.8
Energy 10.5
Budget
Adapted
Network
…
…
Platform
A B C D Z
Code to be released at http://netadapt.mit.edu
20
Improved Latency vs. Accuracy Tradeoff
• NetAdapt boosts the real inference speed of MobileNet
by up to 1.7x with higher accuracy
+0.3% accuracy
1.7x faster
+0.3% accuracy
1.6x faster
Reference:
MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017
MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018
*Tested on the ImageNet dataset and a Google Pixel 1 CPU
[Yang et al., ECCV 2018]
21
Compression of Weights & Activations
• Compress weights and activations between DRAM
and accelerator
• Variable Length / Huffman Coding
• Tested on AlexNet à 2× overall BW Reduction
[Moons et al., VLSI 2016; Han et al., ICLR 2016]
Value: 16’b0 à Compressed Code: {1’b0}
Value: 16’bx à Compressed Code: {1’b1, 16’bx}
Example:
22
Compression Overhead
Index (non-zero position info – e.g., IA and JA for CSR) accounts for
approximately half of storage for fine grained pruning
[Han et al., ICLR 2016]
23
Coarse-Grained Pruning
…
E
output fmap
…
…
many
filters (M)
Many
Output Channels (M)
M
…
R
S
1
R
S
…
…
…
C
…
M
H
input fmap
…
…
…
…
C
…
C
…
…
…
W F
May prune by eliminating entire filter
planes or extremely sparse input
activation planes or just a tile of either.
24
Structured/Coarse-Grained Pruning
• Scalpel
– Prune to match the underlying data-parallel hardware
organization for speed up
[Yu et al., ISCA 2017]
Dense weights Sparse weights
Example: 2-way SIMD
25
Exploit Redundant Weights
• Preprocessing to reorder weights (ok since weights are
known)
• Perform addition before multiplication to reduce number
of multiplies and reads of weights
• Example: Input = [1 2 3 ] and filter [A B A]
Typical processing: Output = A*1+B*2+A*3
If reorder as [A A B]: Output = A*(1+3)+B*1
3 multiplies and 3 weight reads
2 multiplies and 2 weight reads
Note: Bitwidth of multiplication may need to increase
[UCNN, ISCA 2018]
26
Exploit ReLU
• Reduce number operations when if resulting activation will be
negative as ReLU will set to zero
• Need to either perform preprocess (sort weights) or minimize
prediction overhead and error
[PredictiveNet, ISCAS 2017], [SnaPEA, ISCA 2018], [Song et al., ISCA 2018]
27
Compact Network Architectures
• Break large convolutional layers into a series of
smaller convolutional layers
– Fewer weights, but same effective receptive field
• Before Training: Network Architecture Design
(already discussed this morning; e.g., MobileNet)
• After Training: Decompose Trained Filters
28
Decompose Trained Filters
After training, perform low-rank approximation by applying tensor
decomposition to weight kernel; then fine-tune weights for accuracy
[Lebedev et al., ICLR 2015]
R = canonical rank
29
Decompose Trained Filters
[Denton et al., NeurIPS 2014]
• Speed up by 1.6 – 2.7x on CPU/GPU for CONV1,
CONV2 layers
• Reduce size by 5 - 13x for FC layer
• < 1% drop in accuracy
Original Approx.
Visualization of Filters
30
Decompose Trained Filters on Phone
[Kim et al., ICLR 2016]
Tucker Decomposition
31
Knowledge Distillation
[Bucilu et al., KDD 2006],[Hinton et al., arXiv 2015]
Complex
DNN B
(teacher)
Simple DNN
(student)
softmax
softmax
Complex
DNN A
(teacher)
softmax
scores
class
probabilities
Try to match

More Related Content

Similar to Tutorial-on-DNN-09A-Co-design-Sparsity.pdf

Instrumenting Open vSwitch with Monitoring Capabilities: Designs and Challenges
Instrumenting Open vSwitch with Monitoring Capabilities: Designs and ChallengesInstrumenting Open vSwitch with Monitoring Capabilities: Designs and Challenges
Instrumenting Open vSwitch with Monitoring Capabilities: Designs and ChallengesAJAY KHARAT
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...csandit
 
IRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
IRJET- Handwritten Decimal Image Compression using Deep Stacked AutoencoderIRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
IRJET- Handwritten Decimal Image Compression using Deep Stacked AutoencoderIRJET Journal
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...Edge AI and Vision Alliance
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...NECST Lab @ Politecnico di Milano
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
Standardising the compressed representation of neural networks
Standardising the compressed representation of neural networksStandardising the compressed representation of neural networks
Standardising the compressed representation of neural networksFörderverein Technische Fakultät
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
A White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationA White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationApril Knyff
 
Efficient de cvpr_2020_paper
Efficient de cvpr_2020_paperEfficient de cvpr_2020_paper
Efficient de cvpr_2020_papershanullah3
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platforma3labdsp
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model CompressionApache MXNet
 
Implementation of area optimized low power multiplication and accumulation
Implementation of area optimized low power multiplication and accumulationImplementation of area optimized low power multiplication and accumulation
Implementation of area optimized low power multiplication and accumulationkarthik annam
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...ijsrd.com
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...ijsrd.com
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcscpconf
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructioncsandit
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcsandit
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 

Similar to Tutorial-on-DNN-09A-Co-design-Sparsity.pdf (20)

Instrumenting Open vSwitch with Monitoring Capabilities: Designs and Challenges
Instrumenting Open vSwitch with Monitoring Capabilities: Designs and ChallengesInstrumenting Open vSwitch with Monitoring Capabilities: Designs and Challenges
Instrumenting Open vSwitch with Monitoring Capabilities: Designs and Challenges
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
 
IRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
IRJET- Handwritten Decimal Image Compression using Deep Stacked AutoencoderIRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
IRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Standardising the compressed representation of neural networks
Standardising the compressed representation of neural networksStandardising the compressed representation of neural networks
Standardising the compressed representation of neural networks
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
A White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationA White Paper On Neural Network Quantization
A White Paper On Neural Network Quantization
 
Efficient de cvpr_2020_paper
Efficient de cvpr_2020_paperEfficient de cvpr_2020_paper
Efficient de cvpr_2020_paper
 
computer architecture.
computer architecture.computer architecture.
computer architecture.
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platform
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model Compression
 
Implementation of area optimized low power multiplication and accumulation
Implementation of area optimized low power multiplication and accumulationImplementation of area optimized low power multiplication and accumulation
Implementation of area optimized low power multiplication and accumulation
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstruction
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 

Recently uploaded

APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 

Tutorial-on-DNN-09A-Co-design-Sparsity.pdf

  • 1. 1 DNN Model and Hardware Co-Design ISCA Tutorial (2019) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang
  • 2. 2 Reduce Number of Ops and Weights • Exploit Activation Statistics • Exploit Weight Statistics • Exploit Dot Product Computation • Decomposed Trained Filters • Knowledge Distillation
  • 3. 3 Sparsity in Fmaps 9 -1 -3 1 -5 5 -2 6 -1 Many zeros in output fmaps after ReLU ReLU 9 0 0 1 0 5 0 6 0 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 CONV Layer # of activations # of non-zero activations (Normalized)
  • 4. 4 … … … … … … ReL U Input Image Output Image Filter Filt Img Psum Psum Buffer SRAM 108K B 14×12 PE Array Link Clock Core Clock I/O Compression in Eyeriss Run-Length Compression (RLC) Example: Output (64b): Input: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, … 5b 16b 1b 5b 16b 5b 16b 2 12 4 53 2 22 0 RunLevelRunLevelRunLevelTerm Off-Chip DRAM 64 bits Decomp Comp [Chen et al., ISSCC 2016] DCNN Accelerator
  • 5. 5 Compression Reduces DRAM BW 0 1 2 3 4 5 6 1 2 3 4 5 DRAM Access (MB) AlexNet Conv Layer Uncompressed Compressed 1 2 3 4 5 AlexNet Conv Layer DRAM Access (MB) 0 2 4 6 1.2× 1.4× 1.7× 1.8× 1.9× Uncompressed Fmaps + Weights RLE Compressed Fmaps + Weights [Chen et al., ISSCC 2016] Simple RLC within 5% - 10% of theoretical entropy limit
  • 6. 6 Data Gating / Zero Skipping in Eyeriss Filter Scratch Pad (225x16b SRAM) Partial Sum Scratch Pad (24x16b REG) Filt Im g Input Psum 2-stage pipelined multiplier Output Psum 0 Accumulate Input Psum 1 0 == 0 Zero Buffer Enable Image Scratch Pad (12x16b REG) 0 1 Skip MAC and mem reads when image data is zero. Reduce PE power by 45% Reset [Chen et al., ISSCC 2016]
  • 7. 7 Cnvlutin • Process Convolution Layers • Built on top of DaDianNao (4.49% area overhead) • Speed up of 1.37x (1.52x with activation pruning) [Albericio et al., ISCA 2016]
  • 8. 8 Pruning Activations [Reagen et al., ISCA 2016] Remove small activation values [Albericio et al., ISCA 2016] Speed up 11% (ImageNet) Reduce power 2x (MNIST) Minerva Cnvlutin
  • 9. 9 Exploit Correlation in Input Data • Exploit Temporal Correlation of Inputs – Reduce amount of computation if there is temporal correlation between frames – Requires additional storage and need to measure redundancy (e.g. motion vector for videos) – Application specific (e.g. videos) – requires that the same operation is done for each frame (not always the case) [Zhang et al., FAST, CVPRW 2017], [EVA2, ISCA 2018], [Euphrates, ISCA 2018], [Riera et al., ISCA 2018]
  • 10. 10 Exploit Correlation in Input Data • Exploit Spatial Correlation of Inputs – Delta code neighboring values (activation) resulting in sparse inputs to each layer – Reduces storage cost and data movement for improvement in energy- efficiency and throughput [Diffy, MICRO 2018]
  • 11. 11 Pruning – Make Weights Sparse • Optimal Brain Damage 1. Choose a reasonable network architecture 2. Train network until reasonable solution obtained 3. Compute the second derivative for each weight 4. Compute saliencies (i.e. impact on training error) for each weight 5. Sort weights by saliency and delete low-saliency weights 6. Iterate to step 2 [Lecun et al., NeurIPS 1989] retraining
  • 12. 12 Pruning – Make Weights Sparse pruning neurons pruning synapses after pruning before pruning Prune based on magnitude of weights Train Connectivity Prune Connections Train Weights Example: AlexNet Weight Reduction: CONV layers 2.7x, FC layers 9.9x (Most reduction on fully connected layers) Overall: 9x weight reduction, 3x MAC reduction [Han et al., NeurIPS 2015]
  • 13. 13 Speed up of Weight Pruning on CPU/GPU Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV Batch size = 1 On Fully Connected Layers Only Average Speed up of 3.2x on GPU, 3x on CPU, 5x on mGPU [Han et al., NeurIPS 2015]
  • 14. 14 Design of Efficient DNN Algorithms • Popular efficient DNN algorithm approaches pruning neurons pruning synapses after pruning before pruning Network Pruning C 1 1 S R 1 R S C Compact Network Architectures Examples: SqueezeNet, MobileNet ... also reduced precision • Focus on reducing number of MACs and weights • Does it translate to energy savings and reduced latency?
  • 15. 15 Energy-Evaluation Methodology CNN Shape Configuration (# of channels, # of filters, etc.) CNN Weights and Input Data [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …] CNN Energy Consumption L1 L2 L3 Energy … Memory Accesses Optimization # of MACs Calculation … # acc. at mem. level 1 # acc. at mem. level 2 # acc. at mem. level n # of MACs Hardware Energy Costs of each MAC and Memory Access Ecomp Edata Evaluation tool available at http://eyeriss.mit.edu/energy.html
  • 16. 16 Key Observations • Number of weights alone is not a good metric for energy • All data types should be considered Output Feature Map 43% Input Feature Map 25% Weights 22% Computation 10% Energy Consumption of GoogLeNet [Yang et al., CVPR 2017]
  • 17. 17 Energy-Aware Pruning Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings • Sort layers based on energy and prune layers that consume most energy first • EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Ori. DC EAP Normalized Energy (AlexNet) 2.1x 3.7x x109 Magnitude Based Pruning Energy Aware Pruning Pruned models available at http://eyeriss.mit.edu/energy.html [Yang et al., CVPR 2017]
  • 18. 18 # of Operations vs. Latency • # of operations (MACs) does not approximate latency well Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)
  • 19. 19 NetAdapt: Platform-Aware DNN Adaptation • Automatically adapt DNN to a mobile platform to reach a target latency or energy budget • Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture) [Yang et al., ECCV 2018] NetAdapt Measure … Network Proposals Empirical Measurements Metric Proposal A … Proposal Z Latency 15.6 … 14.3 Energy 41 … 46 … … … Pretrained Network Metric Budget Latency 3.8 Energy 10.5 Budget Adapted Network … … Platform A B C D Z Code to be released at http://netadapt.mit.edu
  • 20. 20 Improved Latency vs. Accuracy Tradeoff • NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy +0.3% accuracy 1.7x faster +0.3% accuracy 1.6x faster Reference: MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017 MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018 *Tested on the ImageNet dataset and a Google Pixel 1 CPU [Yang et al., ECCV 2018]
  • 21. 21 Compression of Weights & Activations • Compress weights and activations between DRAM and accelerator • Variable Length / Huffman Coding • Tested on AlexNet à 2× overall BW Reduction [Moons et al., VLSI 2016; Han et al., ICLR 2016] Value: 16’b0 à Compressed Code: {1’b0} Value: 16’bx à Compressed Code: {1’b1, 16’bx} Example:
  • 22. 22 Compression Overhead Index (non-zero position info – e.g., IA and JA for CSR) accounts for approximately half of storage for fine grained pruning [Han et al., ICLR 2016]
  • 23. 23 Coarse-Grained Pruning … E output fmap … … many filters (M) Many Output Channels (M) M … R S 1 R S … … … C … M H input fmap … … … … C … C … … … W F May prune by eliminating entire filter planes or extremely sparse input activation planes or just a tile of either.
  • 24. 24 Structured/Coarse-Grained Pruning • Scalpel – Prune to match the underlying data-parallel hardware organization for speed up [Yu et al., ISCA 2017] Dense weights Sparse weights Example: 2-way SIMD
  • 25. 25 Exploit Redundant Weights • Preprocessing to reorder weights (ok since weights are known) • Perform addition before multiplication to reduce number of multiplies and reads of weights • Example: Input = [1 2 3 ] and filter [A B A] Typical processing: Output = A*1+B*2+A*3 If reorder as [A A B]: Output = A*(1+3)+B*1 3 multiplies and 3 weight reads 2 multiplies and 2 weight reads Note: Bitwidth of multiplication may need to increase [UCNN, ISCA 2018]
  • 26. 26 Exploit ReLU • Reduce number operations when if resulting activation will be negative as ReLU will set to zero • Need to either perform preprocess (sort weights) or minimize prediction overhead and error [PredictiveNet, ISCAS 2017], [SnaPEA, ISCA 2018], [Song et al., ISCA 2018]
  • 27. 27 Compact Network Architectures • Break large convolutional layers into a series of smaller convolutional layers – Fewer weights, but same effective receptive field • Before Training: Network Architecture Design (already discussed this morning; e.g., MobileNet) • After Training: Decompose Trained Filters
  • 28. 28 Decompose Trained Filters After training, perform low-rank approximation by applying tensor decomposition to weight kernel; then fine-tune weights for accuracy [Lebedev et al., ICLR 2015] R = canonical rank
  • 29. 29 Decompose Trained Filters [Denton et al., NeurIPS 2014] • Speed up by 1.6 – 2.7x on CPU/GPU for CONV1, CONV2 layers • Reduce size by 5 - 13x for FC layer • < 1% drop in accuracy Original Approx. Visualization of Filters
  • 30. 30 Decompose Trained Filters on Phone [Kim et al., ICLR 2016] Tucker Decomposition
  • 31. 31 Knowledge Distillation [Bucilu et al., KDD 2006],[Hinton et al., arXiv 2015] Complex DNN B (teacher) Simple DNN (student) softmax softmax Complex DNN A (teacher) softmax scores class probabilities Try to match