Neural Network Approximation
Low rank, Sparsity, and Quantization
zsc@megvii.com
Oct. 2017
Motivation
● Faster Inference
○ Latency critical scenarios
■ VR/AR, UGV/UAV
○ Saves time and energy
● Faster Training
○ Higher iteration speed
○ Saves life
● Smaller
○ storage size
○ memory footprint
Lee Sedol vs. AlphaGo: ~83 W vs. ~77 kW
Neural Network
Machine Learning as Optimization
● Supervised learning: min_W d(f(X; W), y)
○ W is the parameter
○ ŷ = f(X; W) is the output, X is the input
○ y is the ground truth
○ d is the objective function
● Unsupervised learning
○ some oracle function r: low rank, sparse, K
Machine Learning as Optimization
● Regularized supervised learning
○ min_W d(f(X; W), y) + r(W)
● Probabilistic interpretation
○ d measures the conditional probability
○ r measures the prior probability
○ The probabilistic approach is more constrained than the optimization approach due to the normalization requirement
■ e.g. it is not easy to represent a uniform distribution over [0, infty)
● Can be solved by an ODE: dW/dt = -∇C(W), where C = d + r is the cost
○ Discretizing with step length η, we get gradient descent with learning rate η
○ W ← W - η ∇C(W)
● Convergence proof
Gradient descent
Derive
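Below is a minimal numpy sketch of that discretization; the quadratic objective, step count, and learning rate are made up for illustration, not taken from the slides.

```python
import numpy as np

# Toy objective C(w) = 0.5 * ||A w - b||^2, with gradient A^T (A w - b).
A = np.random.randn(20, 5)
b = np.random.randn(20)

def grad(w):
    return A.T @ (A @ w - b)

w = np.zeros(5)
lr = 0.01                  # learning rate = step length of the Euler discretization
for _ in range(500):
    w -= lr * grad(w)      # w_{t+1} = w_t - lr * grad C(w_t)

print(np.linalg.norm(A @ w - b))   # residual shrinks toward the least-squares optimum
```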
Linear Regression
● ŷ = W x
○ x is the input, ŷ is the prediction, y is the ground truth
○ W has dimension (m, n)
○ #param = m n, #OPs = m n
Fully-connected
● ŷ = W_2 f(W_1 x)
○ In general, f is a nonlinearity used to increase “model capacity”.
○ Does it make sense if f is the identity, i.e. f(x) = x?
■ Sometimes: if W_2 is m by r and W_1 is r by n, then W_2 W_1 is a matrix of rank at most r, which differs from a general m by n matrix.
● Stacking many such layers
○ Deep learning!
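As a quick illustration of the rank-r point (sizes and the rank are made-up numbers): replacing an m×n weight with the product W_2 W_1 cuts the parameter count from m·n to r·(m+n).

```python
import numpy as np

m, n, r = 1024, 1024, 64
W2 = np.random.randn(m, r)
W1 = np.random.randn(r, n)

x = np.random.randn(n)
y = W2 @ (W1 @ x)              # same output shape as a full (m, n) layer

full_params = m * n            # 1,048,576
lowrank_params = r * (m + n)   # 131,072
print(full_params, lowrank_params)
```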
Neural Network
(Diagram: the input X flows through the network to the output, which is compared with y by the cost d + r; the intermediate values are the activations / feature maps / neurons.)
● Can be solved by an ODE: dW/dt = -∇C(W), where C = d + r is the cost
○ Discretizing with step length η, we get gradient descent with learning rate η
○ W ← W - η ∇C(W)
● Convergence proof
Gradient descent
Derive
Backpropagation
Neural Network Training
(Diagram: as above, with gradients flowing backward from the cost through the activations / feature maps / neurons to the weights.)
CNN: AlexNet-like
Method 2: Convolution as matrix product
● Convolution
○ feature map <N, C, H’, W’>
○ weights <K, C, H, W>
● Convolution as FC
○ under proper padding, we can extract patches
○ feature map <N H’ W’, C H W>
○ weights <C H W, K>
(Diagram: the kernel stride determines how much the extracted patches overlap along the height and width.)
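A minimal im2col-style sketch of “convolution as matrix product” (stride 1, no padding; shapes follow the <N H’ W’, C H W> layout above).

```python
import numpy as np

def conv_as_matmul(x, w):
    """x: (N, C, Hin, Win) feature map, w: (K, C, H, W) weights."""
    N, C, Hin, Win = x.shape
    K, _, H, W = w.shape
    Hout, Wout = Hin - H + 1, Win - W + 1
    # Extract patches into an (N H' W', C H W) matrix.
    patches = np.empty((N, Hout, Wout, C * H * W))
    for i in range(Hout):
        for j in range(Wout):
            patches[:, i, j, :] = x[:, :, i:i+H, j:j+W].reshape(N, -1)
    patches = patches.reshape(N * Hout * Wout, C * H * W)
    out = patches @ w.reshape(K, -1).T        # (N H' W', K)
    return out.reshape(N, Hout, Wout, K).transpose(0, 3, 1, 2)

x = np.random.randn(2, 3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
print(conv_as_matmul(x, w).shape)             # (2, 4, 6, 6)
```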
Importance of Convolutions and FC
(Chart: per-layer feature map sizes for an AlexNet-like net; the FC layers account for most of the storage size, the convolutions for most of the computation.)
Neupack: inspect_model.py
NeuPeak: npk-model-manip XXX info
The Matrix View of Neural Network
● Weights of fully-connected and convolution layers
○ take up most computation and storage size
○ are representable as matrices
● Approximating the matrices approximates
the network
○ The approximation error accumulates.
Low-rank Approximation
Singular Value Decomposition
● Matrix decomposition view
○ A = U S V^T
○ Columns of U and V are orthonormal. S is diagonal.
■ u, s, vt = np.linalg.svd(x, full_matrices=False, compute_uv=True)
■ The diagonal entries are non-negative and in descending order.
■ U^T U = I, but U U^T is not full rank
Compact SVD
Truncated SVD
● Assume the diagonal entries of S are in descending order
○ Always achievable
○ To truncate to rank r, just drop the trailing (blue) segments of U, S, and V.
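A numpy sketch of compact and truncated SVD as described above (the matrix and the rank r are arbitrary examples):

```python
import numpy as np

A = np.random.randn(256, 512)
u, s, vt = np.linalg.svd(A, full_matrices=False, compute_uv=True)
assert np.allclose(u.T @ u, np.eye(u.shape[1]))       # U^T U = I
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)     # non-negative, descending

r = 32                                    # keep only the r largest singular values
A_r = (u[:, :r] * s[:r]) @ vt[:r, :]      # best rank-r approximation in F-norm
print(np.linalg.norm(A - A_r) / np.linalg.norm(A))
```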
Matrix factorization => Convolution factorization
● Factorization into HxW followed by 1x1
○ feature map (N H’ W’, C H W)
○ first conv weights (C H W, R)
○ feature map (N H’ W’, R)
○ second conv weights (R, K)
○ feature map (N H’ W’, K)
(Diagram: an HxW conv with C input and K output channels is replaced by an HxW conv from C to R channels followed by a 1x1 conv from R to K; equivalently, the (CHW, K) weight matrix is factored into (CHW, R) times (R, K).)
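A hedged sketch of this factorization (shapes and the rank R are illustrative): reshape the (K, C, H, W) weight to a (CHW, K) matrix, truncate its SVD at rank R, and read off an HxW conv with R outputs followed by a 1x1 conv with K outputs.

```python
import numpy as np

K, C, H, W, R = 64, 32, 3, 3, 16
Wt = np.random.randn(K, C, H, W)

M = Wt.transpose(1, 2, 3, 0).reshape(C * H * W, K)    # (CHW, K)
u, s, vt = np.linalg.svd(M, full_matrices=False)
W_hxw = (u[:, :R] * s[:R]).reshape(C, H, W, R).transpose(3, 0, 1, 2)  # (R, C, H, W)
W_1x1 = vt[:R, :].T.reshape(K, R, 1, 1)                               # (K, R, 1, 1)

# F-norm error of the factorized pair vs. the original weight:
M_approx = (u[:, :R] * s[:R]) @ vt[:R, :]
print(np.linalg.norm(M - M_approx) / np.linalg.norm(M))
```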
Approximating Convolution Weight
● W is a (K, C, H, W) 4-tensor
○ can be reshaped to a (CHW, K) matrix, etc.
● F-norm is invariant under reshape
○ ||W - W_a||_F = ||M - M_a||_F
(Diagram: reshape W to M, approximate M by M_a, then reshape M_a back to obtain W_a.)
Matrix factorization => Convolution factorization
● Factorization into 1x1 followed by HxW
○ feature map (N H’ W’ H W, C)
○ first conv weights (C, R)
○ feature map (N H’ W’ H W, R) = (N H’ W’, R H W)
○ second conv weights (R H W, K)
○ feature map (N H’ W’, K)
● Steps
○ Reshape (CHW, K) to (C, HW, K)
○ (C, HW, K) = (C, R) (R, HW, K)
○ Reshape (R, HW, K) to (RHW, K)
(Diagram: an HxW conv with C input and K output channels is replaced by a 1x1 conv from C to R channels followed by an HxW conv from R to K.)
Horizontal-Vertical Decomposition
● Approximating with Separable Filters
● Original Convolution
○ feature map (N H’ W’, C H W)
○ weights (C H W, K)
● Factorization into Hx1 followed by 1xW
○ feature map (N H’ W’ W, C H)
○ first conv weights (C H, R)
○ feature map (N H’ W’ W, R) = (N H’ W’, R W)
○ second conv weights (R W, K)
○ feature map (N H’ W’, K)
● Steps
○ Reshape (CHW, K) to (CH, WK)
○ (CH, WK) = (CH, R) (R, WK)
○ Reshape (R, WK) to (RW, K)
(Diagram: an HxW conv with C input and K output channels is replaced by an Hx1 conv from C to R channels followed by a 1xW conv from R to K.)
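A sketch of the separable variant under the same illustrative assumptions: reshape (CHW, K) to (CH, WK), truncate its SVD at rank R, and read off an Hx1 conv with R outputs followed by a 1xW conv with K outputs.

```python
import numpy as np

K, C, H, W, R = 64, 32, 3, 3, 16
Wt = np.random.randn(K, C, H, W)

M = Wt.transpose(1, 2, 3, 0).reshape(C * H, W * K)    # (CH, WK)
u, s, vt = np.linalg.svd(M, full_matrices=False)
W_hx1 = (u[:, :R] * s[:R]).reshape(C, H, 1, R).transpose(3, 0, 1, 2)  # (R, C, H, 1)
W_1xw = vt[:R, :].reshape(R, 1, W, K).transpose(3, 0, 1, 2)           # (K, R, 1, W)

print(np.linalg.norm(M - (u[:, :R] * s[:R]) @ vt[:R, :]) / np.linalg.norm(M))
```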
Factorizing N-D convolution
● Original Convolution
○ let dimension be N
○ feature map (N D’_1 D’_2 … D’_Z, C D_1 D_2 … D_N)
○ weights (C D_1 D_2 … D_N, K)
● Factorization into N convolutions, the i-th of shape D_i x 1 x ... x 1
○ R_0 = C, R_Z = K
○ feature map (N D’_1 D’_2 … D’_Z, C D_1 D_2 … D_N)
○ weights (R_0 D_1, R_1)
○ feature map (N D’_1 D’_2 … D’_Z, R_1 D_2 … D_N)
○ weights (R_1 D_2, R_2)
○ ...
(Diagram: an HxWxZ conv with C input and K output channels is replaced by an Hx1x1 conv (C→R1), a 1xWx1 conv (R1→R2), and a 1x1xZ conv (R2→K).)
(Diagram: SVD writes a matrix as a sum of outer products; the Kronecker decomposition writes it as a sum of Kronecker products.)
Kronecker Conv
● (C H W, K)
● Reshape as (C_1 C_2 H W, K_1 K_2)
● Steps
○ Feature map is (N C H’ W’)
○ Extract patches and reshape (N H’ W’ C_2, C_1 H)
○ apply (C_1 H, K_1 R)
○ Feature map is (N K_1 R H’ W’ C_2)
○ Extract patches and reshape (N K_1 H’ W’, R C_2 W)
○ apply (R C_2 W, K_2)
● For rank efficiency, we should have
○ R C_2 ≈ C_1
Exploiting Local Structures with the Kronecker Layer
in Convolutional Networks 1512
Shared Group Convolution is a Kronecker Layer
(Diagram: AlexNet partitioned a conv into groups; a plain Conv/FC layer vs. a shared-group Conv/FC layer.)
CP-decomposition and Xception
● Xception: Deep Learning with Depthwise Separable Convolutions 1610
● CP-decomposition with Tensor Power Method for Convolutional Neural Networks Compression 1701
● MobileNets: Efficient Convolutional Neural Networks for Mobile Vision
Applications 1704
○ They submitted the paper to CVPR about the same time as Xception.
Matrix Joint Diagonalization = CP
(Diagram: CP decomposition of a tensor T is equivalent to matrix joint diagonalization (MJD) of its slices S1, S2, S3; applied to an HxW conv with C input and K output channels.)
CP-decomposition with Tensor Power Method for
Convolutional Neural Networks Compression 1701
(Diagram: a standard convolution vs. the Xception-style factorization into channel-wise (depthwise) and pointwise convolutions.)
Tensor Train Decomposition
Tensor Train Decomposition: just a few SVD’s
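A minimal sketch of the “just a few SVDs” idea (TT-SVD): sweep over the tensor modes, at each step reshaping and taking a truncated SVD, and carry the remainder forward. The tensor shape and maximum rank are made up.

```python
import numpy as np

def tt_svd(T, max_rank):
    """Decompose an n-d tensor into TT (matrix product state) cores via sequential SVDs."""
    dims = T.shape
    cores, r_prev, M = [], 1, T
    for k in range(len(dims) - 1):
        M = M.reshape(r_prev * dims[k], -1)
        u, s, vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))   # k-th TT core
        M = np.diag(s[:r]) @ vt[:r, :]                       # remainder, carried forward
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))             # last core
    return cores

T = np.random.randn(4, 8, 8, 4)
print([c.shape for c in tt_svd(T, max_rank=6)])   # [(1, 4, 4), (4, 8, 6), (6, 8, 4), (4, 4, 1)]
```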
Tensor Train Decomposition on FC
Graph Summary of SVD variants
Matrix Product State
CNN layers as Multilinear Maps
Sparse Approximation
Distribution of Weights
● Universal across convolutions and FC
● Concentration of values near 0
● Large values cannot be dropped
Sparsity of NN: statistics
Sparsity of NN: statistics
Weight Pruning: from DeepCompression
Train Network → Extract Mask M → Train W => M o W → ...
The model ends up being trained for an excessive number of epochs.
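A minimal sketch of this pruning loop (the sparsity target, threshold rule, and the stand-in gradient are assumptions for illustration): train, extract a magnitude mask M, then keep training the masked weight M o W.

```python
import numpy as np

def magnitude_mask(W, sparsity=0.7):
    """Binary mask that zeroes out the smallest-magnitude fraction of W."""
    thresh = np.quantile(np.abs(W), sparsity)
    return (np.abs(W) >= thresh).astype(W.dtype)

W = np.random.randn(256, 256)           # pretend this is an already-trained weight
M = magnitude_mask(W, sparsity=0.7)     # step 2: extract mask M

for step in range(100):                 # step 3: retrain W => M o W
    grad = np.random.randn(*W.shape)    # placeholder for a real backprop gradient
    W -= 0.01 * grad * M                # masked update keeps pruned entries at zero
    W *= M                              # re-apply the mask
print("zero ratio:", 1.0 - M.mean())
```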
Sparse Matrix at Runtime
● Sparse Matrix = Discrete Mask + Continuous
values
○ Mask cannot be learnt the normal way
○ The values have well-defined gradients
● Looking up a matrix value must go through an index structure (a LUT)
○ CSR format
■ A: the non-zero (NNZ) values
■ IA: cumulative #NNZ per row (row pointers)
■ JA: the column index of each value
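A tiny CSR example for concreteness; scipy's attribute names map onto the A/IA/JA convention above as data/indptr/indices.

```python
import numpy as np
from scipy.sparse import csr_matrix

D = np.array([[0., 2., 0., 0.],
              [1., 0., 0., 3.],
              [0., 0., 0., 0.]])
S = csr_matrix(D)
print(S.data)      # A:  non-zero values            -> [2. 1. 3.]
print(S.indptr)    # IA: cumulative #NNZ per row    -> [0 1 3 3]
print(S.indices)   # JA: column index of each value -> [1 0 3]

x = np.ones(4)
print(S @ x)       # sparse matrix-vector product   -> [2. 4. 0.]
```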
Burden of Sparseness
● Loss of regularity of memory access and computation
○ Needs special hardware for efficient access
○ May need a high zero ratio to match dense matrices
■ Matrices with less than ~70% zero values are often better treated as dense matrices.
Convolution layers are harder to compress than FC
Dynamic Generation of Code
● CVPR’15: Sparse Convolutional Neural Networks
● Relies on compiler for
○ register allocation
○ scheduling
● Good on CPU
Channel Pruning
● Learning the Number of Neurons in Deep Networks 1611
● Channel Pruning for Accelerating Very Deep Neural Networks 1707
○ Also exploits low-rankness of features
Sparse Communication for Distributed Gradient
Descent 1704
Quantization
Precursor: Ising model & Boltzmann machine
● Ising model
○ used to model magnetism
○ 1D has trivial analytic solution
○ 2D exhibits phase-transition
○ 2D Ising model can be used for denoising
■ when the mean signal is reliable
● Inference also requires optimization
Neural Network Training
(Diagram: the training graph as before, with the weights, activations, and gradients each passing through a quantization step.)
Backpropagation
There will be no gradient flow if we quantize somewhere: the quantizer is piecewise constant, so its derivative is zero almost everywhere.
Differentiable Quantization
● Bengio ’13: Estimating or Propagating Gradients Through
Stochastic Neurons for Conditional Computation
○ REINFORCE algorithm
○ Decompose binary stochastic neuron into stochastic and
differentiable part
○ Injection of additive/multiplicative noise
○ Straight-through estimator
Gradient vanishes after quantization.
Quantization also at Train time
● Neural Network can adapt to the constraints imposed by quantization
● Exploits “Straight-through estimator” (Hinton, Coursera lecture, 2012)
○ Forward: y = quantize(x)
○ Backward: treat ∂y/∂x as 1, i.e. pass the gradient straight through
● Example
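Below is a minimal numpy sketch of the straight-through estimator (a generic illustration, not the exact formulation from the slides): quantize in the forward pass, but treat the quantizer as the identity in the backward pass.

```python
import numpy as np

def quantize_forward(x, k=2):
    """k-bit uniform quantization of x assumed to lie in [0, 1]."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def quantize_backward(grad_out):
    """Straight-through estimator: d(quantize)/dx is treated as 1."""
    return grad_out

x = np.clip(np.random.randn(4), 0, 1)
y = quantize_forward(x)             # forward: discrete levels
g = quantize_backward(np.ones(4))   # backward: gradient passes through unchanged
print(x, y, g)
```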
Bit Neural Network
● Matthieu Courbariaux et al. BinaryConnect: Training Deep Neural Networks with binary
weights during propagations. http://arxiv.org/abs/1511.00363
● Itay Hubara et al. Binarized Neural Networks https://arxiv.org/abs/1602.02505v3
● Matthieu Courbariaux et al. Binarized Neural Networks: Training Neural Networks with
Weights and Activations Constrained to +1 or -1. http://arxiv.org/pdf/1602.02830v3.pdf
● Rastegari et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural
Networks http://arxiv.org/pdf/1603.05279v1.pdf
● Zhou et al. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with
Low Bitwidth Gradients https://arxiv.org/abs/1606.06160
● Hubara et al. Quantized Neural Networks: Training Neural Networks with Low Precision
Weights and Activations https://arxiv.org/abs/1609.07061
Binarizing AlexNet
Theoretical
Scaled binarization
● Approximate W ≈ α ∘ B, with B ∈ {-1, +1} and one scale α per row of W
● Minimize ||W - α ∘ B||_F over α and B
● Sol: B = sign(W), α = mean of the rows of W o B (i.e. the mean of |W| per row)
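A sketch of XNOR-Net-style scaled binarization using that closed form; the matrix shape is arbitrary, and treating this as exactly the slide's formulation is an assumption.

```python
import numpy as np

W = np.random.randn(64, 288)                   # (K, C*H*W) weight matrix
B = np.sign(W)                                 # binary part, entries in {-1, +1}
alpha = np.abs(W).mean(axis=1, keepdims=True)  # per-row (per-filter) scale = mean of |W|
W_bin = alpha * B                              # approximation W ≈ alpha ∘ B

print("relative F-norm error:", np.linalg.norm(W - W_bin) / np.linalg.norm(W))
```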
XNOR-Net
Binary weights network
● Filter repetition
○ 3x3 binary kernel has only 256 patterns modulo sign.
○ 3x1 binary kernel has only 4 patterns modulo sign.
○ Not easily exploitable, since the filter is applied over the whole C x H x W volume.
Binarizing AlexNet
Theoretical
Scaled binarization is no longer exact and was not found to be useful.
The solution below can be quite bad, e.g. when Y = [-4, 1].
Quantization of Activations
● XNOR-Net adopted the STE method in their open-source code
(Diagram: input → ReLU vs. input → capped ReLU → quantization.)
DoReFa-Net: Training Low Bitwidth Convolutional
Neural Networks with Low Bitwidth Gradients
● Uniform stochastic quantization of gradients
○ 6 bit for ImageNet, 4 bit for SVHN
● Simplified scaled binarization: only scalar
○ Forward and backward multiplies the bit matrices from different sides.
○ Using scalar binarization allows using bit operations
● Floating-point-free inference, even with BN
● Future work
○ BN requires FP computation during training
○ FP weights are still required for accumulating gradients
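A hedged sketch in the DoReFa spirit (following the paper's quantize_k and tanh-based weight mapping; exact agreement with the released implementation is an assumption), with the backward pass understood to use the STE as above.

```python
import numpy as np

def quantize_k(x, k):
    """Uniform k-bit quantization of x in [0, 1]; the backward pass would use STE."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def dorefa_weight(W, k=2):
    """Map weights to [0, 1] via tanh, quantize to k bits, then rescale to [-1, 1]."""
    t = np.tanh(W)
    x = t / (2 * np.abs(t).max()) + 0.5      # in [0, 1]
    return 2 * quantize_k(x, k) - 1          # in [-1, 1]

W = np.random.randn(64, 288)
print(np.unique(dorefa_weight(W, k=2)))      # at most 2^2 = 4 distinct levels
```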
SVHN
(Models A, B, C, D: A has twice as many channels as B, B has twice as many channels as C, and so on.)
Quantization Methods
● Deterministic Quantization
○ e.g. round to the nearest level: q(x) = round(x n) / n
● Stochastic Quantization
○ round up or down at random, with probability given by the fractional part, so that E[q(x)] = x
Injection of noise realizes the sampling.
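A sketch of stochastic (unbiased) rounding, one way the noise injection can realize the sampling (the k-bit wrapper is an illustrative assumption):

```python
import numpy as np

def stochastic_round(x):
    """Unbiased rounding: E[stochastic_round(x)] == x."""
    floor = np.floor(x)
    return floor + (np.random.rand(*x.shape) < (x - floor))

def stochastic_quantize(x, k=2):
    """Stochastic k-bit quantization of x assumed to lie in [0, 1]."""
    n = 2 ** k - 1
    return stochastic_round(x * n) / n

x = np.full(100000, 0.4)
print(stochastic_quantize(x, k=1).mean())   # ≈ 0.4 on average, though each sample is 0 or 1
```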
Quantization of Weights, Activations and Gradients
● A 2-bit AlexNet with half the #channels (same bit complexity as XNOR-Net)
Quantization Error measured by Cosine Similarity
● Wb_sn is the n-bit quantization of the real-valued W
● x is a Gaussian random variable clipped by tanh
Saturates
Effective Quantization Methods for Recurrent Neural
Networks 2016
Our FP baseline is worse than that of Hubara.
Training Bit Fully Convolutional Network for Fast
Semantic Segmentation 2016
An FPGA is made up of many LUTs
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning 1705
● Weights and activations not quantized.
More References
● Xiangyu Zhang, Jianhua Zou, Kaiming He, Jian Sun: Accelerating Very Deep Convolutional Networks for Classification and Detection.
IEEE Trans. Pattern Anal. Mach. Intell. 38(10): 1943-1955 (2016)
● ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices https://arxiv.org/abs/1707.01083
● Aggregated Residual Transformations for Deep Neural Networks https://arxiv.org/abs/1611.05431
● Convolutional neural networks with low-rank regularization https://arxiv.org/abs/1511.06067
Backup after this slide
Slide also available at my home page:
https://zsc.github.io/
Low-rankness of Activations
● Accelerating Very Deep Convolutional Networks for Classification and
Detection