1. Adaptive Proximal Gradient Methods for
Structured Neural Networks
Jihun Yun1, Aurelie C. Lozano2, Eunho Yang1,3
1KAIST 2IBM T.J. Watson Research Center 3AITRICS
arcprime@kaist.ac.kr
Conference on Neural Information Processing Systems (NeurIPS) 2021
2. Regularized Training in Classical ML
• Regularized training is ubiquitous in machine learning problems
• Such tasks usually solve an optimization problem of the form
$$\min_{\theta}\;\; \underbrace{\mathcal{L}(\theta)}_{\text{loss function}} \;+\; \lambda\,\underbrace{\mathcal{R}(\theta)}_{\text{suitable regularizer}}$$
[Figure: examples of regularized problems: (a) Lasso, (b) Graphical Lasso, (c) Matrix completion]
3. Non-smoothness of Regularizers
• In many cases, the regularizer ℛ(⋅) could be non-smooth
• For example, ℓ𝑞 regularization is non-smooth at the origin
• In this case, we CANNOT use the gradient descent algorithm, due to non-differentiability at the origin
• One can use a subgradient instead of the gradient, but this slows down the convergence of the optimization algorithm
[Plot: the ℓ𝑞 penalty, which is non-smooth at the origin]
4. Proximal Gradient Descent (PGD)
• Bypass the non-smoothness via the proximal operator
• This operator avoids taking the gradient w.r.t. ℛ(⋅)
• Parameter update rule with proximal gradient descent:
$$\theta_{t+1} \;=\; \operatorname{prox}_{\eta\mathcal{R}}\!\big(\theta_t - \eta\,\nabla\mathcal{L}(\theta_t)\big), \qquad \operatorname{prox}_{\eta\mathcal{R}}(z) \;=\; \arg\min_{\theta}\; \tfrac{1}{2}\lVert\theta - z\rVert_2^2 + \eta\,\mathcal{R}(\theta)$$
The gradient is taken w.r.t. only the loss function, while the proximal operator corresponds to the regularizer ℛ(⋅).
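• To make the update concrete, here is a minimal NumPy sketch of PGD on a toy Lasso problem (an illustrative example, not taken from the slides); the prox of the ℓ1 regularizer is soft-thresholding, and the function names and least-squares loss are illustrative choices.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pgd_lasso(A, b, lam, n_iters=500):
    """Solve min_theta 0.5*||A @ theta - b||^2 + lam*||theta||_1 with PGD."""
    eta = 1.0 / (np.linalg.norm(A, 2) ** 2)   # stepsize 1/L for the smooth part
    theta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ theta - b)          # gradient w.r.t. the loss only
        theta = soft_threshold(theta - eta * grad, eta * lam)  # prox handles R
    return theta
```

Only the smooth loss is differentiated; the non-smooth ℓ1 term is handled entirely inside the proximal step.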
5. What about Regularized Training in Deep Learning?
• Still important in many practical applications!
• They also solve optimization problems of the form
$$\min_{\theta}\;\; \underbrace{\mathcal{L}(\theta)}_{\text{network loss function}} \;+\; \lambda\,\underbrace{\mathcal{R}(\theta)}_{\text{suitable regularizer}}$$
[Figures: example applications: network quantization, network pruning]
6. Optimization in Deep Learning
• Network loss is quite complex!
• Generally, deep models are trained via adaptive gradient methods!
• AdaGrad, RMSprop, Adam, …
7. How to Solve the Regularized Problems in Deep Learning?
• Modern deep learning libraries employ subgradient-based solvers
• However, as we mentioned, the regularized problems should be solved via PGD
How to solve the regularized problems with adaptive PGD?
8. AdaGrad: (Online) PGD with Adaptive Learning Rates
• AdaGrad (the first algorithm with coordinate-wise adaptive learning rates)
• Exploits the past gradient history
• AdaGrad provides the proximal update for the above update rule with a specific preconditioner
1) AdaGrad. J. Duchi. 2011
AdaGrad update rule (preconditioning the gradient):
$$\theta_{t+1,i} \;=\; \theta_{t,i} \;-\; \frac{\eta}{\sqrt{G_{t,i}}}\; g_{t,i}, \qquad G_{t,i} \;=\; \sum_{s=1}^{t} g_{s,i}^{2}$$
with stepsize 𝜂.
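• As an illustration, the same diagonal preconditioner can be folded into a proximal step for an ℓ1 regularizer: both the gradient step and the per-coordinate threshold are scaled by sqrt(G). This is a minimal sketch with illustrative names and an added ε for numerical stability, not code from the paper.

```python
import numpy as np

def adagrad_prox_step(theta, grad, G, eta, lam, eps=1e-8):
    """One AdaGrad-style proximal step for an l1-regularized objective.

    The diagonal preconditioner sqrt(G) scales both the gradient step and
    the per-coordinate soft-threshold of the proximal mapping.
    """
    G = G + grad ** 2                                # accumulate squared gradients
    precond = np.sqrt(G) + eps                       # diagonal preconditioner
    z = theta - eta * grad / precond                 # preconditioned gradient step
    theta = np.sign(z) * np.maximum(np.abs(z) - eta * lam / precond, 0.0)
    return theta, G
```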
However, the proximal update for the most popular optimizers, such as Adam, has not been studied so far.
9. ProxGen: A Unified Framework for Stochastic PGD
• In this paper, going beyond Adam, we propose a unified framework for arbitrary preconditioners and any (non-convex) regularizer!
• Why arbitrary preconditioner?
• There are various preconditioning methods for deep learning, such as..
• AdaGrad
• Adam
• KFAC
• Etc…
• Why non-convex regularizer?
• Many regularizers are non-convex
• In many applications, non-convex regularizers show superiority both in theory and in practice
10. ProxGen: A Unified Framework for Stochastic PGD
• We consider the following general family
11. ProxGen: Detailed Algorithms
• ProxGen Algorithm
• In terms of theory, our framework can guarantee convergence for any optimizer used in deep learning
• In terms of practice, our update rule is superior to subgradient-based methods (a schematic sketch of one step follows below)
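• Schematically, one step of the framework combines a momentum estimate, a diagonal preconditioner, and an elementwise proximal mapping. The sketch below is an illustrative reading with Adam-style moments (bias correction omitted), not the paper's exact pseudocode; all names are illustrative.

```python
import numpy as np

def proxgen_like_step(theta, grad, state, prox, eta=1e-3, lam=1e-4,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One schematic preconditioned proximal step with Adam-style moments.

    `prox(z, tau)` is the elementwise proximal mapping of the regularizer,
    evaluated with a per-coordinate effective stepsize `tau`.
    """
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    precond = np.sqrt(state["v"]) + eps                          # diagonal preconditioner
    z = theta - eta * state["m"] / precond                       # preconditioned gradient step
    return prox(z, eta * lam / precond)                          # prox with the preconditioner inside
```

With soft-thresholding as `prox`, this yields an ℓ1-regularized, Adam-like update; the closed-form mappings on the following slides plug in the same way.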
12. Brief Comparison with Previous Studies
• A simple comparison of stochastic proximal gradient methods
• Only our work covers the proximal version of Adam
• Our unified framework is the most general form of stochastic proximal gradient methods
13. Examples of Proximal Mappings – ℓ𝒒 regularization
• ℓ𝑞 regularization
• For 𝑞 ∈ {0, 1/2, 2/3, 1}, there exists a closed-form proximal mapping (the ℓ0 case is sketched below)
• As an example, the proximal mapping of ℓ1/2 admits a known closed-form solution
• Using these examples, we can derive the proximal updates for preconditioned
gradient methods
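• For a flavor of these closed forms, the ℓ0 case is especially simple: the prox is hard-thresholding, keeping a coordinate only when keeping it costs less than the penalty. A minimal sketch follows (the ℓ1/2 and ℓ2/3 formulas are more involved and omitted here); the function name is illustrative.

```python
import numpy as np

def prox_l0(z, tau):
    """Prox of tau * ||x||_0: hard-thresholding.

    Keeping coordinate i costs tau, zeroing it costs 0.5 * z_i**2, so a
    coordinate survives only when |z_i| > sqrt(2 * tau).
    """
    return np.where(np.abs(z) > np.sqrt(2.0 * tau), z, 0.0)
```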
14. Examples of Proximal Mappings – Quantization
• Revising ProxQuant (ICLR 2019)
• For training binary neural networks, ProxQuant [ICLR 2019] proposes the following W-shaped regularizer
• This regularizer has a closed-form proximal mapping (a sketch follows below)
• For this regularizer, they consider the following proximal update rule (Adam case)
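• For reference, ProxQuant's W-shaped binary regularizer is, up to notation, R(θ) = λ Σᵢ min(|θᵢ − 1|, |θᵢ + 1|), and its proximal mapping soft-thresholds each weight toward its nearest binary value ±1. The sketch below is written from that definition; treat the tie-breaking at 0 and the function name as illustrative.

```python
import numpy as np

def prox_w_shape(z, tau):
    """Prox of tau * sum_i min(|x_i - 1|, |x_i + 1|).

    Each coordinate is soft-thresholded toward its nearest binary value +1 or -1.
    """
    target = np.where(z >= 0, 1.0, -1.0)     # nearest binary value (ties at 0 go to +1)
    d = z - target
    return target + np.sign(d) * np.maximum(np.abs(d) - tau, 0.0)
```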
15. Examples of Proximal Mappings – Quantization
• ProxQuant vs. Our framework
• ProxQuant [ICLR 2019] considers the following update rule (Adam case):
• 𝑚𝑡: first-order momentum, 𝑉𝑡: second-order momentum
• This update rule does NOT consider the preconditioner in the proximal mapping
• Our update rule (revised version) incorporates the preconditioner into the proximal mapping (contrasted in the sketch below)
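• The difference can be made concrete in code: the sketch below (illustrative names, bias correction omitted) contrasts a prox applied with a plain threshold after an Adam-style step, as in the ProxQuant rule, with a prox whose threshold is rescaled by the same diagonal preconditioner, as in the revised rule.

```python
import numpy as np

def adam_prox_variants(theta, grad, m, v, prox, eta=1e-3, lam=1e-4,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """Return both variants: prox outside vs. inside the preconditioning."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    precond = np.sqrt(v) + eps
    z = theta - eta * m / precond                  # preconditioned gradient step (shared)
    theta_plain = prox(z, eta * lam)               # ProxQuant-style: plain threshold
    theta_precond = prox(z, eta * lam / precond)   # revised rule: preconditioner inside the prox
    return theta_plain, theta_precond, m, v
```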
16. Examples of Proximal Mappings – Quantization
• Extending ProxQuant (ICLR 2019)
• Since we know the proximal mappings for ℓ𝑞 regularization, we propose the following regularizers:
• We will evaluate our extended regularizers in the experiment section
17. Convergence Analysis – Main Theorem
• General Convergence
• We can derive two corollaries (constant batch size & increasing batch size).
Theorem 1 (General Convergence)
Under mild conditions, with initial stepsize $\alpha_0 \le \frac{\delta}{3L}$ and non-increasing $\alpha_t$, our proximal update rule is guaranteed to yield [convergence bound omitted], where $\Delta = f(\theta_0) - f(\theta^{*})$ with optimal point $\theta^{*}$, and $\{Q_i\}_{i=1}^{3}$ are constants independent of $T$ (some of which depend on the batch size).
18. Experiments – Sparse Neural Networks
• We consider the network loss plus an ℓ𝑞 penalty as the objective function
• Training ResNet-34 on the CIFAR-10 dataset with ℓ𝑞 regularization
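• In practice, one ProxGen-style training step only changes the parameter update: take a gradient step on the network loss, then apply the prox of the regularizer to the weights. The PyTorch sketch below assumes plain SGD (identity preconditioner) and ℓ1 regularization for simplicity; with adaptive optimizers the threshold would be rescaled per coordinate as in the earlier sketches, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def prox_l1_(param, tau):
    """In-place soft-thresholding (prox of tau * ||.||_1) on a weight tensor."""
    with torch.no_grad():
        param.copy_(param.sign() * torch.clamp(param.abs() - tau, min=0.0))

def train_step(model, batch, optimizer, lam, lr):
    """One l1-regularized step: gradient step on the loss, prox step on the penalty."""
    x, y = batch
    loss = F.cross_entropy(model(x), y)   # smooth network loss only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # gradient step on the loss
    for p in model.parameters():
        prox_l1_(p, lr * lam)             # prox step replaces a subgradient of the penalty
    return loss.item()
```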
19. Sparse Neural Networks with ℓ𝟏 Regularization
• ResNet-34 on CIFAR-10 dataset
• ProxGen has a similar learning curve to the subgradient methods, but ProxGen shows better generalization at the same sparsity level
• ProxGen is superior to Prox-SGD [Yang20] in terms of both learning curve and generalization
20. Sparse Neural Networks with ℓ𝟐/𝟑 Regularization
• ResNet-34 on CIFAR-10 dataset
• ProxGen shows faster convergence with better generalization over all sparsity levels
• We do not include Prox-SGD [Yang20] since it only considers convex regularizers
21. Sparse Neural Networks with ℓ𝟏/𝟐 Regularization
• ResNet-34 on CIFAR-10 dataset
• ProxGen shows faster convergence with better generalization over all sparsity levels
• As 𝑞 goes to zero, the difference in convergence speed becomes larger
22. Sparse Neural Networks with ℓ𝟎 Regularization
• ResNet-34 on CIFAR-10 dataset
• ℓ0-regularized problems cannot be solved with subgradient methods
• So, we employ ℓ0^hc [Louizos18] as a baseline, which approximates the ℓ0-norm via hard-concrete distributions
• ProxGen dramatically outperforms ℓ0^hc in performance over all sparsity levels
23. Binary Neural Networks
• We consider the following objective function
• Training ResNet on the CIFAR-10 dataset
24. Binary Neural Networks
• Binary Neural Networks (only quantizing network weight parameters)
• ProxGen shows better performance except for ResNet-20, which suggests that our methods would be more suitable for larger networks.
• Also, our generalized ℓ𝑞 quantization-specific regularizers are more effective than ℓ1.
25. Conclusions & Future Work
• Conclusion
• We propose a general family of stochastic proximal gradient methods.
• Through this unified framework, we provide a better understanding of proximal methods in terms of both theory and practice.
• Our experiments show that one should consider proximal methods in regularized training.
• Future Work
• We plan to design a proximal update rule for non-diagonal preconditioners
• e.g., K-FAC (ICML 2015), other Kronecker-factored curvature methods (ICML 2017), AdaBlock (our work)
• For non-diagonal preconditioners, we cannot split the proximal mapping into per-coordinate problems, so it is very challenging