Adaptive Proximal Gradient Methods for
Structured Neural Networks
Jihun Yun1, Aurelie C. Lozano2, Eunho Yang1,3
1KAIST 2IBM T.J. Watson Research Center 3AITRICS
arcprime@kaist.ac.kr
Conference on Neural Information Processing Systems (NeurIPS) 2021
Regularized Training in Classical ML
• Regularized training is ubiquitous in machine learning problems
• Such tasks usually solve optimization problems of the form

$$\min_{\theta} \; \underbrace{\mathcal{L}(\theta)}_{\text{loss function}} \; + \; \underbrace{\lambda\, \mathcal{R}(\theta)}_{\text{suitable regularizer}}$$

[Figure: (a) Lasso, (b) Graphical Lasso, (c) Matrix completion]
Non-smoothness of Regularizers
• In many cases, the regularizer ℛ(⋅) can be non-smooth
• For example, ℓ𝑞 regularization is non-smooth at the origin
• In this case, we CANNOT use the gradient descent algorithm, due to non-differentiability at the origin
• One can use a subgradient instead of the gradient, but this slows down the convergence of the optimization algorithm

[Figure: ℓ𝑞 penalty curves, non-smooth at the origin]
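To make the non-differentiability concrete, consider the ℓ1 building block |θ|: at the origin there is no gradient, only a whole interval of valid subgradients, and a subgradient method must pick an arbitrary element of it:

```latex
% Subdifferential of the absolute value, the l1 building block:
% a single slope away from 0, but a whole interval of slopes at 0.
\partial\,|\theta| =
\begin{cases}
  \{\operatorname{sign}(\theta)\}, & \theta \neq 0,\\[2pt]
  [-1,\, 1], & \theta = 0.
\end{cases}
```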
Proximal Gradient Descent (PGD)
• Bypass the non-smoothness via the proximal operator
• This operator avoids taking the gradient w.r.t. ℛ(⋅)
• Parameter update rule with proximal gradient descent:

$$\theta_{t+1} = \mathrm{prox}_{\eta\mathcal{R}}\big(\theta_t - \eta \nabla \mathcal{L}(\theta_t)\big), \qquad \mathrm{prox}_{\eta\mathcal{R}}(z) = \arg\min_{\theta}\; \frac{1}{2\eta}\|\theta - z\|_2^2 + \mathcal{R}(\theta)$$

• The gradient is taken w.r.t. only the loss function; the proximal operator corresponds to the regularizer ℛ(⋅)
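As a concrete illustration (not from the slides), here is a minimal NumPy sketch of PGD for the ℓ1 case, where the prox has the closed-form soft-thresholding solution; `grad_loss`, `eta`, and `lam` are illustrative placeholders:

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal operator of thresh * ||.||_1: shrink each coordinate toward 0."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def proximal_gradient_descent(grad_loss, theta0, eta=0.1, lam=0.01, n_steps=100):
    """PGD for min_theta L(theta) + lam * ||theta||_1:
    a gradient step on the loss only, then the prox of the regularizer."""
    theta = theta0.copy()
    for _ in range(n_steps):
        theta = soft_threshold(theta - eta * grad_loss(theta), eta * lam)
    return theta

# Example: a tiny Lasso-style problem, min 0.5*||X @ t - y||^2 + lam*||t||_1
X, y = np.random.randn(50, 10), np.random.randn(50)
theta_hat = proximal_gradient_descent(lambda t: X.T @ (X @ t - y),
                                      np.zeros(10), eta=0.01, lam=0.5)
```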
What about Regularized Training in Deep Learning?
• Still important in many practical applications!
• These tasks also solve optimization problems of the form

$$\min_{\theta} \; \underbrace{\mathcal{L}(\theta)}_{\text{network loss function}} \; + \; \underbrace{\lambda\, \mathcal{R}(\theta)}_{\text{suitable regularizer}}$$

[Figures: Network Quantization, Network Pruning]
Optimization in Deep Learning
• Network loss is quite complex!
• Generally, deep models are trained via adaptive gradient methods!
• AdaGrad, RMSprop, Adam, …
How to Solve the Regularized Problems in Deep Learning?
• Modern deep learning libraries employ subgradient-based solvers
• However, as we mentioned, the regularized problems should be solved via PGD
How to solve the regularized problems with
adaptive PGD?
AdaGrad: (Online) PGD with Adaptive Learning Rates
• AdaGrad (the first algorithm with coordinate-wise adaptive learning rates)
• Exploits the past gradient history
• AdaGrad provides a proximal update for its update rule below, with a specific preconditioner
AdaGrad update rule (preconditioning the gradient):

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \frac{\eta}{\sqrt{G_{t,\,i}}}\, g_{t,\,i}, \qquad \text{with stepsize } \eta \text{ and } G_{t,\,i} = \sum_{s=1}^{t} g_{s,\,i}^{2}$$
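A minimal NumPy sketch of this update (the `eps` term is the usual numerical-stability constant, our addition):

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-8):
    """One AdaGrad step: G accumulates squared gradients per coordinate,
    so each coordinate gets its own effective learning rate eta / sqrt(G_i)."""
    G = G + grad ** 2
    theta = theta - eta * grad / (np.sqrt(G) + eps)
    return theta, G
```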
However, the proximal update for the most popular optimizers, such as Adam, has not been studied so far.

1) AdaGrad: J. Duchi et al., 2011.
ProxGen: A Unified Framework for Stochastic PGD
• In this paper, going beyond Adam, we propose a unified framework for arbitrary preconditioners and arbitrary (possibly non-convex) regularizers!
• Why arbitrary preconditioner?
• There are various preconditioning methods for deep learning, such as..
• AdaGrad
• Adam
• KFAC
• Etc…
• Why non-convex regularizer?
• Many regularizers are non-convex
• In many applications, non-convex regularizers show superiority in both theory and practice
ProxGen: A Unified Framework for Stochastic PGD
• We consider the following general family of updates, with gradient (momentum) estimate 𝑚𝑡 and an arbitrary preconditioner 𝐶𝑡:

$$\theta_{t+1} \in \arg\min_{\theta}\; \langle m_t, \theta \rangle + \frac{1}{2\alpha_t}\,(\theta - \theta_t)^{\top} C_t\, (\theta - \theta_t) + \mathcal{R}(\theta)$$
ProxGen: Detailed Algorithms
• ProxGen Algorithm
• In theory, our framework guarantees convergence for any optimizer used in deep learning
• In practice, our update rule is superior to subgradient-based methods
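To make the update rule concrete, here is a minimal NumPy sketch of one ProxGen-style step with a diagonal Adam-style preconditioner and an ℓ1 regularizer. The hyperparameter values are illustrative and bias correction is omitted, so this is a simplification, not the paper's exact pseudocode; the key point is that the prox threshold is scaled by the preconditioner:

```python
import numpy as np

def proxgen_step(theta, grad, state, lr=1e-3, lam=1e-4,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One ProxGen-style step: solves, per coordinate i,
        argmin_x  m_i*x + (c_i / (2*lr)) * (x - theta_i)**2 + lam*|x|
    where m is the momentum and c = sqrt(v) + eps is the diagonal preconditioner."""
    m = beta1 * state["m"] + (1 - beta1) * grad        # first moment
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    c = np.sqrt(v) + eps                               # diagonal preconditioner C_t
    u = theta - lr * m / c                             # preconditioned gradient step
    thresh = lr * lam / c                              # preconditioner-aware threshold
    state["m"], state["v"] = m, v
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

# Usage: state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta)}
```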
Brief Comparison with Previous Studies
• A brief comparison of stochastic proximal gradient methods
• Only our work covers the proximal version of Adam
• Our unified framework is the most general form of stochastic proximal gradient methods
Examples of Proximal Mappings – ℓ𝒒 regularization
• ℓ𝑞 regularization
• For 𝑞 ∈ {0, 1/2, 2/3, 1}, there exist closed-form proximal mappings
• As an example, the proximal mapping of ℓ1/2 is known in closed form (the half-thresholding operator)
• Using these examples, we can derive the proximal updates for preconditioned gradient methods
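As one closed-form instance from this family, the ℓ0 prox is plain hard-thresholding (this is what lets ProxGen handle the ℓ0-regularized experiments later, where subgradients do not exist at all); a minimal sketch:

```python
import numpy as np

def prox_l0(z, eta, lam):
    """Prox of lam * ||.||_0 with stepsize eta: keep z_i only when keeping it
    is cheaper than zeroing it, i.e. z_i**2 / (2*eta) > lam."""
    return np.where(np.abs(z) > np.sqrt(2.0 * eta * lam), z, 0.0)
```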
Examples of Proximal Mappings – Quantization
• Revisiting ProxQuant (ICLR 2019)
• For training binary neural networks, ProxQuant [ICLR 2019] proposes a W-shaped regularizer that pulls each weight toward {−1, +1}
• This regularizer has a closed-form proximal mapping
• For this regularizer, they consider a proximal update rule (Adam case)
Examples of Proximal Mappings – Quantization
• ProxQuant vs. our framework
• ProxQuant [ICLR 2019] considers an update rule (Adam case) with 𝑚𝑡 the first-order momentum and 𝑉𝑡 the second-order momentum
• Their update rule does NOT consider the preconditioner inside the proximal mapping
• Our update rule (revised version) applies the proximal mapping with the preconditioner taken into account, as in the sketch below
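A minimal sketch of the preconditioner-aware prox, assuming the per-coordinate W-shaped penalty λ·min(|θᵢ − 1|, |θᵢ + 1|) described in ProxQuant; with a diagonal preconditioner `c`, each coordinate is soft-thresholded toward its nearest point in {−1, +1} with a threshold scaled by 1/cᵢ:

```python
import numpy as np

def prox_w_shape(z, eta, lam, c):
    """Preconditioner-aware prox of the W-shaped binary regularizer: per coordinate,
        argmin_x (c_i / (2*eta)) * (x - z_i)**2 + lam * min(|x - 1|, |x + 1|).
    Each coordinate moves toward its nearest binary point by at most eta*lam/c_i."""
    target = np.sign(z)            # nearest point in {-1, +1} (ties at 0 stay at 0)
    d = z - target
    return target + np.sign(d) * np.maximum(np.abs(d) - eta * lam / c, 0.0)
```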
Examples of Proximal Mappings – Quantization
• Extending ProxQuant (ICLR 2019)
• Since we know the proximal mappings for ℓ𝑞 regularization, we propose ℓ𝑞 analogues of the W-shaped regularizer
• We evaluate these extended regularizers in the experiments section
Convergence Analysis – Main Theorem
• General Convergence
• We derive two corollaries (constant batch size & increasing batch size)

Theorem 1 (General Convergence)
Under mild conditions, with initial stepsize $\alpha_0 \le \frac{\delta}{3L}$ and non-increasing $\alpha_t$, our proximal update rule is guaranteed to converge, with a bound in which $\Delta = f(\theta_1) - f(\theta^*)$ (for optimal point $\theta^*$) and $\{Q_i\}_{i=1}^{3}$ are constants independent of $T$; the remaining terms depend on the batch size.
Experiments – Sparse Neural Networks
• We consider objective functions of the form

$$\min_{\theta} \; \mathcal{L}(\theta) + \lambda \|\theta\|_q^q, \qquad q \in \{0, \tfrac{1}{2}, \tfrac{2}{3}, 1\}$$

• Training ResNet-34 on the CIFAR-10 dataset with ℓ𝑞 regularization
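For the q = 1 case, a minimal PyTorch-style sketch of what such regularized training looks like: an ordinary gradient step on the network loss, followed by an ℓ1 prox over all parameters. For clarity this uses plain SGD (preconditioner 𝐶𝑡 = I), i.e. proximal SGD rather than the full ProxGen update, and the toy model and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256),
                      nn.ReLU(), nn.Linear(256, 10))   # stand-in for ResNet-34
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
lam = 1e-4   # illustrative l1 strength

def prox_l1_(params, step_size, lam):
    """In-place soft-thresholding: prox of step_size * lam * ||theta||_1."""
    with torch.no_grad():
        for p in params:
            p.copy_(p.sign() * (p.abs() - step_size * lam).clamp(min=0.0))

def train_step(x, y):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()                                          # gradient step on the loss only
    prox_l1_(model.parameters(), opt.param_groups[0]["lr"], lam)
    return loss.item()
```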
Sparse Neural Networks with ℓ𝟏 Regularization
• ResNet-34 on the CIFAR-10 dataset
• ProxGen has a similar learning curve to the subgradient methods, but shows better generalization at the same sparsity level
• ProxGen is superior to Prox-SGD [Yang20] in terms of both learning curve and generalization
Sparse Neural Networks with ℓ𝟐/𝟑 Regularization
• ResNet-34 on the CIFAR-10 dataset
• ProxGen shows faster convergence with better generalization over all sparsity levels
• We do not include Prox-SGD [Yang20] since it only considers convex regularizers
Sparse Neural Networks with ℓ𝟏/𝟐 Regularization
• ResNet-34 on the CIFAR-10 dataset
• ProxGen shows faster convergence with better generalization over all sparsity levels
• As 𝑞 approaches zero, the difference in convergence speed grows
Sparse Neural Networks with ℓ𝟎 Regularization
• ResNet-34 on the CIFAR-10 dataset
• ℓ0-regularized problems cannot be solved with subgradient methods
• So, we employ ℓ0-hc [Louizos18] as a baseline, which approximates the ℓ0-norm via hard-concrete distributions
• ProxGen dramatically outperforms ℓ0-hc over all sparsity levels
Binary Neural Networks
• We consider the network loss combined with a binary quantization regularizer (the W-shaped penalty and our ℓ𝑞 extensions)
• Training ResNet on the CIFAR-10 dataset
Binary Neural Networks
• Binary neural networks (only quantizing the network weight parameters)
• ProxGen shows better performance except for ResNet-20, suggesting our methods are more suitable for larger networks
• Also, our generalized ℓ𝑞 quantization-specific regularizers are more effective than ℓ1
Conclusions & Future Work
• Conclusion
• We propose a general family of stochastic proximal gradient methods.
• Through this unified framework, we provide a better understanding of proximal methods in both theory and practice.
• Our experiments show that one should consider proximal methods in regularized training.
• Future Work
• We plan to design proximal update rules for non-diagonal preconditioners
• e.g., K-FAC (ICML 2015), other Kronecker-factored curvature methods (ICML 2017), AdaBlock (our work)
• For non-diagonal preconditioners, we cannot split the proximal mapping into per-coordinate problems, so this is very challenging
Thank you!
Any questions?
