1. Adaptive Proximal Gradient Methods for
Structured Neural Networks
Jihun Yun1, Aurelie C. Lozano2, Eunho Yang1,3
1KAIST 2IBM T.J. Watson Research Center 3AITRICS
arcprime@kaist.ac.kr
Conference on Neural Information Processing Systems (NeurIPS) 2021
2. Regularized Training in Classical ML
• Regularized training is ubiquitous in machine learning problems
• Such tasks usually solve an optimization problem of the form
$$\min_{\theta}\;\; \underbrace{\mathcal{L}(\theta)}_{\text{loss function}} \;+\; \lambda\,\underbrace{\mathcal{R}(\theta)}_{\text{suitable regularizer}}$$
[Figure: examples of regularized problems: (a) Lasso, (b) Graphical Lasso, (c) Matrix completion]
3. Non-smoothness of Regularizers
• In many cases, the regularizer ℛ(⋅) could be non-smooth
• For example, ℓ𝑞 regularization is non-smooth at the origin
• In this case, we CANNOT use the gradient descent algorithm, due to non-differentiability at the origin
• One can use a subgradient instead of the gradient, but this slows down the convergence of the optimization algorithm
[Plot: the ℓ𝑞 penalty, which is non-smooth at the origin]
4. Proximal Gradient Descent (PGD)
• Bypass the non-smoothness via the proximal operator
• This operator avoids taking the gradient w.r.t. ℛ(⋅)
• Parameter update rule with proximal gradient descent:
$$\theta_{t+1} \;=\; \operatorname{prox}_{\eta\mathcal{R}}\!\big(\theta_t - \eta\,\nabla\mathcal{L}(\theta_t)\big), \qquad \operatorname{prox}_{\eta\mathcal{R}}(z) \;=\; \arg\min_{\theta}\; \tfrac{1}{2}\lVert\theta - z\rVert_2^2 + \eta\,\mathcal{R}(\theta)$$
The gradient is taken w.r.t. only the loss function, while the proximal operator corresponds to the regularizer ℛ(⋅).
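• To make the update concrete, here is a minimal NumPy sketch of PGD on a toy Lasso problem (an illustrative example, not taken from the slides); the prox of the ℓ1 regularizer is soft-thresholding, and the function names and least-squares loss are illustrative choices.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pgd_lasso(A, b, lam, n_iters=500):
    """Solve min_theta 0.5*||A @ theta - b||^2 + lam*||theta||_1 with PGD."""
    eta = 1.0 / (np.linalg.norm(A, 2) ** 2)   # stepsize 1/L for the smooth part
    theta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ theta - b)          # gradient w.r.t. the loss only
        theta = soft_threshold(theta - eta * grad, eta * lam)  # prox handles R
    return theta
```

Only the smooth loss is differentiated; the non-smooth ℓ1 term is handled entirely inside the proximal step.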
5. What about Regularized Training in Deep Learning?
• Still important in many practical applications!
• They also solve optimization problems of the form
$$\min_{\theta}\;\; \underbrace{\mathcal{L}(\theta)}_{\text{network loss function}} \;+\; \lambda\,\underbrace{\mathcal{R}(\theta)}_{\text{suitable regularizer}}$$
[Figures: example applications: network quantization, network pruning]
6. Optimization in Deep Learning
• Network loss is quite complex!
• Generally, deep models are trained via adaptive gradient methods!
• AdaGrad, RMSprop, Adam, …
7. How to Solve the Regularized Problems in Deep Learning?
• Modern deep learning libraries employ subgradient-based solvers
• However, as we mentioned, the regularized problems should be solved via PGD
How to solve the regularized problems with adaptive PGD?
8. AdaGrad: (Online) PGD with Adaptive Learning Rates
• AdaGrad (the first algorithm with coordinate-wise adaptive learning rates)
• Exploits the past gradient history
• AdaGrad provides the proximal update for the above update rule with a specific preconditioner
1) AdaGrad. J. Duchi. 2011
AdaGrad update rule (preconditioning the gradient):
$$\theta_{t+1,i} \;=\; \theta_{t,i} \;-\; \frac{\eta}{\sqrt{G_{t,i}}}\; g_{t,i}, \qquad G_{t,i} \;=\; \sum_{s=1}^{t} g_{s,i}^{2}$$
with stepsize 𝜂.
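• As an illustration, the same diagonal preconditioner can be folded into a proximal step for an ℓ1 regularizer: both the gradient step and the per-coordinate threshold are scaled by sqrt(G). This is a minimal sketch with illustrative names and an added ε for numerical stability, not code from the paper.

```python
import numpy as np

def adagrad_prox_step(theta, grad, G, eta, lam, eps=1e-8):
    """One AdaGrad-style proximal step for an l1-regularized objective.

    The diagonal preconditioner sqrt(G) scales both the gradient step and
    the per-coordinate soft-threshold of the proximal mapping.
    """
    G = G + grad ** 2                                # accumulate squared gradients
    precond = np.sqrt(G) + eps                       # diagonal preconditioner
    z = theta - eta * grad / precond                 # preconditioned gradient step
    theta = np.sign(z) * np.maximum(np.abs(z) - eta * lam / precond, 0.0)
    return theta, G
```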
However, the proximal update for the most popular optimizers, such as Adam, has not been studied so far.
9. ProxGen: A Unified Framework for Stochastic PGD
• In this paper, going beyond Adam, we propose a unified framework for arbitrary preconditioners and any (non-convex) regularizer!
• Why arbitrary preconditioner?
• There are various preconditioning methods for deep learning, such as..
• AdaGrad
• Adam
• KFAC
• Etc…
• Why non-convex regularizer?
• Many regularizers are non-convex
• In many applications, non-convex regularizers show superiority both in theory and in practice
10. ProxGen: A Unified Framework for Stochastic PGD
• We consider the following general family
11. ProxGen: Detailed Algorithms
• ProxGen Algorithm
• In terms of theory, our framework can guarantee convergence for any optimizer used in deep learning
• In terms of practice, our update rule is superior to subgradient-based methods (a schematic sketch of one step follows below)
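• Schematically, one step of the framework combines a momentum estimate, a diagonal preconditioner, and an elementwise proximal mapping. The sketch below is an illustrative reading with Adam-style moments (bias correction omitted), not the paper's exact pseudocode; all names are illustrative.

```python
import numpy as np

def proxgen_like_step(theta, grad, state, prox, eta=1e-3, lam=1e-4,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One schematic preconditioned proximal step with Adam-style moments.

    `prox(z, tau)` is the elementwise proximal mapping of the regularizer,
    evaluated with a per-coordinate effective stepsize `tau`.
    """
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    precond = np.sqrt(state["v"]) + eps                          # diagonal preconditioner
    z = theta - eta * state["m"] / precond                       # preconditioned gradient step
    return prox(z, eta * lam / precond)                          # prox with the preconditioner inside
```

With soft-thresholding as `prox`, this yields an ℓ1-regularized, Adam-like update; the closed-form mappings on the following slides plug in the same way.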
12. Brief Comparison with Previous Studies
• A simple comparison of stochastic proximal gradient methods
• Only our work covers the proximal version of Adam
• Our unified framework is the most general form of stochastic proximal gradient methods
13. Examples of Proximal Mappings – ℓ𝒒 regularization
• ℓ𝑞 regularization
• For 𝑞 ∈ {0, 1/2, 2/3, 1}, there exists a closed-form proximal mapping (the ℓ0 case is sketched below)
• As an example, the proximal mapping of ℓ1/2 admits a known closed-form solution
• Using these examples, we can derive the proximal updates for preconditioned
gradient methods
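• For a flavor of these closed forms, the ℓ0 case is especially simple: the prox is hard-thresholding, keeping a coordinate only when keeping it costs less than the penalty. A minimal sketch follows (the ℓ1/2 and ℓ2/3 formulas are more involved and omitted here); the function name is illustrative.

```python
import numpy as np

def prox_l0(z, tau):
    """Prox of tau * ||x||_0: hard-thresholding.

    Keeping coordinate i costs tau, zeroing it costs 0.5 * z_i**2, so a
    coordinate survives only when |z_i| > sqrt(2 * tau).
    """
    return np.where(np.abs(z) > np.sqrt(2.0 * tau), z, 0.0)
```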
14. Examples of Proximal Mappings – Quantization
• Revising ProxQuant (ICLR 2019)
• For training binary neural networks, ProxQuant [ICLR 2019] proposes the following W-shaped regularizer
• This regularizer has a closed-form proximal mapping (a sketch follows below)
• For this regularizer, they consider the following proximal update rule (Adam case)
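• For reference, ProxQuant's W-shaped binary regularizer is, up to notation, R(θ) = λ Σᵢ min(|θᵢ − 1|, |θᵢ + 1|), and its proximal mapping soft-thresholds each weight toward its nearest binary value ±1. The sketch below is written from that definition; treat the tie-breaking at 0 and the function name as illustrative.

```python
import numpy as np

def prox_w_shape(z, tau):
    """Prox of tau * sum_i min(|x_i - 1|, |x_i + 1|).

    Each coordinate is soft-thresholded toward its nearest binary value +1 or -1.
    """
    target = np.where(z >= 0, 1.0, -1.0)     # nearest binary value (ties at 0 go to +1)
    d = z - target
    return target + np.sign(d) * np.maximum(np.abs(d) - tau, 0.0)
```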
15. Examples of Proximal Mappings – Quantization
• ProxQuant vs. Our framework
• ProxQuant [ICLR 2019] considers the following update rule (Adam case):
• 𝑚𝑡: first-order momentum, 𝑉𝑡: second-order momentum
• This update rule does NOT consider the preconditioner in the proximal mapping
• Our update rule (revised version) incorporates the preconditioner into the proximal mapping (contrasted in the sketch below)
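• The difference can be made concrete in code: the sketch below (illustrative names, bias correction omitted) contrasts a prox applied with a plain threshold after an Adam-style step, as in the ProxQuant rule, with a prox whose threshold is rescaled by the same diagonal preconditioner, as in the revised rule.

```python
import numpy as np

def adam_prox_variants(theta, grad, m, v, prox, eta=1e-3, lam=1e-4,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """Return both variants: prox outside vs. inside the preconditioning."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    precond = np.sqrt(v) + eps
    z = theta - eta * m / precond                  # preconditioned gradient step (shared)
    theta_plain = prox(z, eta * lam)               # ProxQuant-style: plain threshold
    theta_precond = prox(z, eta * lam / precond)   # revised rule: preconditioner inside the prox
    return theta_plain, theta_precond, m, v
```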
16. Examples of Proximal Mappings – Quantization
• Extending ProxQuant (ICLR 2019)
• Since we know the proximal mappings for ℓ𝑞 regularization, we propose the following regularizers:
• We will evaluate our extended regularizers in the experiment section
17. Convergence Analysis – Main Theorem
• General Convergence
• We can derive two corollaries (constant batch size & increasing batch size).
Theorem 1 (General Convergence)
Under mild conditions, with initial stepsize $\alpha_0 \le \frac{\delta}{3L}$ and non-increasing $\alpha_t$, our proximal update rule is guaranteed to yield [convergence bound omitted], where $\Delta = f(\theta_0) - f(\theta^{*})$ with optimal point $\theta^{*}$, and $\{Q_i\}_{i=1}^{3}$ are constants independent of $T$ (some of which depend on the batch size).
18. Experiments – Sparse Neural Networks
• We consider the network loss plus an ℓ𝑞 penalty as the objective function
• Training ResNet-34 on the CIFAR-10 dataset with ℓ𝑞 regularization
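• In practice, one ProxGen-style training step only changes the parameter update: take a gradient step on the network loss, then apply the prox of the regularizer to the weights. The PyTorch sketch below assumes plain SGD (identity preconditioner) and ℓ1 regularization for simplicity; with adaptive optimizers the threshold would be rescaled per coordinate as in the earlier sketches, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def prox_l1_(param, tau):
    """In-place soft-thresholding (prox of tau * ||.||_1) on a weight tensor."""
    with torch.no_grad():
        param.copy_(param.sign() * torch.clamp(param.abs() - tau, min=0.0))

def train_step(model, batch, optimizer, lam, lr):
    """One l1-regularized step: gradient step on the loss, prox step on the penalty."""
    x, y = batch
    loss = F.cross_entropy(model(x), y)   # smooth network loss only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # gradient step on the loss
    for p in model.parameters():
        prox_l1_(p, lr * lam)             # prox step replaces a subgradient of the penalty
    return loss.item()
```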
19. Sparse Neural Networks with ℓ𝟏 Regularization
• ResNet-34 on CIFAR-10 dataset
• ProxGen has a similar learning curve to the subgradient methods, but ProxGen shows better generalization at the same sparsity level
• ProxGen is superior to Prox-SGD [Yang20] in terms of both learning curve and generalization
20. Sparse Neural Networks with ℓ𝟐/𝟑 Regularization
• ResNet-34 on CIFAR-10 dataset
• ProxGen shows faster convergence with better generalization over all sparsity levels
• We do not include Prox-SGD [Yang20] since it only considers convex regularizers
21. Sparse Neural Networks with ℓ𝟏/𝟐 Regularization
• ResNet-34 on CIFAR-10 dataset
• ProxGen shows faster convergence with better generalization over all sparsity levels
• As 𝑞 goes to zero, the difference in convergence speed becomes larger
22. Sparse Neural Networks with ℓ𝟎 Regularization
• ResNet-34 on CIFAR-10 dataset
• ℓ0-regularized problems cannot be solved with subgradient methods
• So, we employ ℓ0^hc [Louizos18] as a baseline, which approximates the ℓ0-norm via hard-concrete distributions
• ProxGen dramatically outperforms ℓ0^hc in performance over all sparsity levels
23. Binary Neural Networks
• We consider the following objective function
• Training ResNet on the CIFAR-10 dataset
24. Binary Neural Networks
• Binary Neural Networks (only quantizing network weight parameters)
• ProxGen shows better performance except for ResNet-20, which suggests that our methods would be more suitable for larger networks.
• Also, our generalized ℓ𝑞 quantization-specific regularizers are more effective than ℓ1.
25. Conclusions & Future Work
• Conclusion
• We propose a general family of stochastic proximal gradient methods.
• Through this unified framework, we provide a better understanding of proximal methods in terms of both theory and practice.
• Our experiments show that one should consider proximal methods in regularized training.
• Future Work
• We plan to design a proximal update rule for non-diagonal preconditioners
• e.g., K-FAC (ICML 2015), other Kronecker-factored curvature methods (ICML 2017), AdaBlock (our work)
• For non-diagonal preconditioners, we cannot split the proximal mapping into per-coordinate problems, so it is very challenging