This document summarizes a seminar on non-convex optimization methods for machine learning. It discusses continuation methods, curriculum learning, and mollifying networks. Continuation methods construct a series of increasingly difficult loss functions to optimize. Curriculum learning begins training with simple concepts and progresses to complex concepts. Mollifying networks smooth the objective function during training by injecting noise into activations, and gradually reducing the noise level.
4. Motivation
DNNs: optimization of a highly non-convex loss function
• Plateaus, Saddle Points and Other Flat Regions
• Cliffs and Exploding Gradients
5. Previous Studies
Methods proposed to make optimization easier:
• continuation methods:
• blurring / noise injection
• RNNs with diffusion
• curriculum learning
• pre-training (omitted from this talk)
• active learning (omitted from this talk)
• transfer learning (omitted from this talk)
7. Continuation Methods
Continuation methods (Allgower+Georg, 1980)
construct a series of loss functions {J(0), ..., J(n)} whose costs are
designed to be increasingly difficult.
Recherches de Yoshua Bengio (web site)
8. Continuation Methods
How is the series of loss functions designed?
• Some non-convex functions become approximately convex
when blurred
J^(i)(θ) = E_{θ′ ∼ N(θ′; θ, σ^(i)²)}[J(θ′)]
Limits of Application
• applicable in some cases, but NP-hard problems remain NP-hard
• not applicable when the function does not become convex under blurring
• not applicable when blurring shifts the location of the minimum
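The blurring idea can be sketched numerically. Below is a minimal, self-contained illustration (not from the talk): a 1-D non-convex toy function is minimized by gradient descent on Monte Carlo estimates of the Gaussian-blurred objective J^(i), with σ annealed toward 0. The function, schedule, learning rate, and sample count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy non-convex objective (an illustrative choice, not from the talk):
    # a shallow quadratic bowl plus oscillations that create local minima.
    return 0.1 * x**2 + np.sin(3.0 * x)

def df(x):
    return 0.2 * x + 3.0 * np.cos(3.0 * x)

def smoothed_grad(x, sigma, n=2048):
    # Monte Carlo estimate of the gradient of the blurred objective
    # J^(i)(x) = E_{x' ~ N(x, sigma^2)}[f(x')], i.e. E_eps[f'(x + sigma*eps)].
    eps = rng.standard_normal(n)
    return df(x + sigma * eps).mean()

x = 4.0                              # deliberately bad starting point
for sigma in [3.0, 1.0, 0.3, 0.0]:   # annealing schedule sigma^(i) -> 0
    for _ in range(200):             # descend on the blurred objective J^(i)
        g = df(x) if sigma == 0.0 else smoothed_grad(x, sigma)
        x -= 0.05 * g
```

Starting from x = 4, plain gradient descent gets trapped in a nearby local minimum; descending the annealed sequence of blurred objectives instead guides the iterate toward the region of the global minimum near x ≈ −0.5.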
9. Where continuation methods are used on non-convex problems
Classic examples include data clustering (Gold+ 1994), graph
matching (Gold+ 1996; Zaslavskiy+ 2009; Liu+ 2012), semi-supervised
kernel machines (Sindhwani+ 2006), multiple instance learning
(Gehler+ 2007; Kim+ 2010), semi-supervised structured output
(Dhillon+ 2012), language modeling (Bengio+ 2009), robot navigation
(Pretto+ 2010), shape matching (Tirthapura+ 1998), l0 norm
minimization (Trzasko+ 2009), image deblurring (Boccuto+ 2002),
image denoising (Rangarajan+ 1990; Nikolova+ 2010), template
matching (Dufour+ 2002), pixel correspondence (Leordeanu+ 2008),
active contours (Cohen+ 1995), Hough transform (Leich+ 2004),
image matting (Price+ 2010), finding optimal parameters in computer
programs (Chaudhuri+ 2011), and seeking optimal proofs
(Chaudhuri+ 2014).
A Theoretical Analysis of Optimization by Gaussian Continuation (H. Mobahi + J. W. Fisher III, AAAI 2015)
10. Theoretical background and Proofs (2015)
Abstract
• Optimization complexity α is derived from the objective function.
• α is computable when the objective function is expressed in
suitable basis functions. e.g. Gaussian RBF
Brief Statement
Let f(x) be a non-convex function to be minimized, x̂ the solution
discovered by the continuation method, and f† the minimum of the
simplified objective function. Then
f(x̂) ≤ w₁·f† + w₂·√α
where w₁ > 0 and w₂ > 0 are independent of f and α.
11. Formulation
Definition
Objective function f : X → R (X = R^d)
Embedding of f into a family of functions g : X × T → R (a homotopy)
x(t): the minimizer of g(·, t), under the condition that g(·, t) is strictly convex
Assumptions on the curve x(t)
• lim_{t→∞} x(t) = x∞
• x(t) is continuous in t
• ∀t ≥ 0: ∇g(x(t), t) = 0
Continuation methods
x∞ is approximated by arg min_x g(x, t)
12. Formulation
Gaussian Homotopy
The Gaussian homotopy g : X × T → R is defined as the convolution of f
with the isotropic Gaussian kernel kσ:
g(x; σ) := [f ⋆ kσ](x)
The Gaussian convolution obeys the heat equation (Widder 1975)
˙g(x; σ) = σ∆g(x; σ)
Proof Procedure
• g(x(t); t) ≤ F(g(x(t₁), t₁), a(t), b(t)), where a(t)g + b(t) ≤ ĝ
• a(t) and b(t) can be bounded in terms of g
• the bound is parameterized by α
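The heat-equation property can be checked numerically. The sketch below (illustrative, not from the paper) discretizes g(x; σ) = [f ⋆ kσ](x) on a grid for an arbitrary smooth f and compares ∂g/∂σ, computed by central differences in σ, against σ·Δg:

```python
import numpy as np

def gauss_smooth(f_vals, grid, sigma):
    # g(x; sigma) = (f * k_sigma)(x), discretized on a uniform grid.
    d = grid[1] - grid[0]
    k = np.exp(-grid**2 / (2.0 * sigma**2))
    k /= k.sum() * d                        # normalize so the kernel integrates to 1
    return np.convolve(f_vals, k, mode="same") * d

x = np.linspace(-10.0, 10.0, 4001)
f = np.sin(2.0 * x) * np.exp(-0.05 * x**2)  # an arbitrary smooth test function
sigma, h = 1.0, 1e-3

# Left-hand side: dg/d(sigma) by central finite differences.
dg_dsigma = (gauss_smooth(f, x, sigma + h) - gauss_smooth(f, x, sigma - h)) / (2 * h)
# Right-hand side: sigma times the Laplacian of g.
g = gauss_smooth(f, x, sigma)
rhs = sigma * np.gradient(np.gradient(g, x), x)
```

Away from the grid boundaries the two sides agree to within discretization error, which is the content of ġ(x; σ) = σ·Δg(x; σ).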
14. Curriculum Learning
Curriculum Learning (Bengio+, ICML 2009)
The learning process begins with simple concepts and progresses to more
complex concepts.
How are curricula designed and easy examples defined?
Easy tasks and curricula are usually defined in a task-specific way.
• distance from the classification boundary (Basu+Christensen, AAAI 2013)
• short sentences (Spitkovsky+, NIPS 2009)
• Exemplar-SVM (Lapedriza+, arXiv 2013)
Stochastic curriculum (Zaremba+Sutskever, 2014)
A random mix of easy and difficult examples is presented, and the
average proportion of the more difficult examples is gradually
increased.
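A stochastic curriculum in the sense above can be sketched in a few lines; the linear schedule for the share of hard examples and the toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_curriculum_batch(easy, hard, progress, batch_size, rng):
    # Every batch is a random mix of easy and hard examples; the expected
    # share of hard ones grows with training progress in [0, 1].
    # The linear schedule p_hard = progress is an illustrative assumption.
    take_hard = rng.random(batch_size) < progress
    picks_easy = rng.integers(len(easy), size=batch_size)
    picks_hard = rng.integers(len(hard), size=batch_size)
    return [hard[j] if h else easy[i]
            for h, i, j in zip(take_hard, picks_easy, picks_hard)]

# Toy (example, difficulty-label) pairs standing in for real data.
easy = [("short sentence", 0)] * 50
hard = [("much longer sentence", 1)] * 50
early = stochastic_curriculum_batch(easy, hard, progress=0.1, batch_size=1000, rng=rng)
late = stochastic_curriculum_batch(easy, hard, progress=0.9, batch_size=1000, rng=rng)
```

Early batches are dominated by easy examples; late batches by difficult ones, with the mix always remaining random rather than switching abruptly.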
15. Experiments on shape recognition
Environment
• Task: classification into 3 classes (rectangle, ellipse, triangle)
• Curriculum: train first on BasicShapes, which contains only the
special cases (squares, circles, equilateral triangles), until the "switch epoch"
• Training: 256 epochs in total, or early stopping
Conclusion
The best generalization is obtained with the 50:50 mix
16. Experiments on language modeling
Environment
• Task: predict the next word from the preceding 5 words
• Curriculum: vocabulary increased by 5,000 words at each pass
• Validation: 10,000 windows of text drawn from the 20,000-word vocabulary
17. Self-Paced Learning
Contribution
The curriculum is defined by a measure of the easiness of samples
Model
At the age of λ, the curriculum is determined by v:
min_{w, v ∈ [0,1]^n} Σ_{i=1}^{n} v_i·L(f(x_i; w), y_i) + r(w) − λ‖v‖₁
v_i = 1 if L(f(x_i; w), y_i) ≤ λ, else v_i = 0
λ is updated to µλ with learning pace µ
Self-Paced Learning for Latent Variable Models (M. P. Kumar+, NIPS 2010)
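The alternating update has a closed form in v: given the current losses, v_i simply thresholds at λ, and λ then grows by the pace µ. A minimal sketch (the losses, λ, and µ below are made-up values):

```python
import numpy as np

def spl_weights(losses, lam):
    # Closed-form minimizer over v of the self-paced objective:
    # v_i = 1 if L(f(x_i; w), y_i) <= lambda, else 0.
    return (losses <= lam).astype(float)

# Illustrative per-sample losses; lambda (age) and mu (pace) are made up.
losses = np.array([0.1, 0.5, 2.0, 0.3])
lam, mu = 0.4, 1.5
selected = []
for _ in range(3):
    v = spl_weights(losses, lam)   # select the currently-easy samples
    selected.append(v.sum())
    # (here one would minimize sum_i v_i * L_i + r(w) over w)
    lam *= mu                      # raise the age: admit harder samples later
```

As λ grows, more (harder) samples enter the objective: the model sets its own curriculum from the loss values rather than from hand-designed difficulty labels.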
19. Mollifying Networks (Caglar Gulcehre+, 2016-Aug)
paper info
• NIPS 2016 Workshop on Non-convex Optimization: best paper
• ICLR 2017 accepted (scores: [6, 6, 7])
abstract
• proposes a smoothed (mollified) objective function for optimizing
highly non-convex neural networks
• the complexity is controlled by a single hyper-parameter
• shows the relationship between recent work on continuation
methods and mollifiers
20. Overview
Key Idea
• The non-linearity of tanh and sigmoid activations makes NNs
difficult to optimize
• inject noise into the activation function during training
• anneal the noise
Interpretation
It connects the ideas of curriculum learning and continuation
methods with skip connections and with layers that compute
near-identity transformations.
21. Framework
1. Training starts by optimizing a convex objective function, with a
high level of noise controlled by a single scalar p ∈ [0, 1] per
layer
2. As the noise level p is annealed, we move from identity
transformations to arbitrary linear transformations between
layers
3. The decreasing noise level p allows element-wise activation
functions to become non-linear
22. 1. Annealing Schedule for p
p_t^l = 1 − exp(−k·v_t·l / (t·L))
with hyper-parameter k, where L is the number of layers of the
model and v_t is a moving average of the loss of the network.
Anneal speed for each layer
The noise in the lower layers anneals faster, as defined by the
linearly decaying probability across layers.
Anneal speed for the training loss
When the training loss is high, the noise injected into the system is
large, and vice versa:
lim_{v_t→∞} p_t^l = 1, and lim_{v_t→0} p_t^l = 0
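The schedule can be written directly from the formula; k and L below are illustrative values, not the paper's settings:

```python
import math

def noise_prob(t, l, v_t, k=1.0, L=6):
    # p_t^l = 1 - exp(-k * v_t * l / (t * L)); k and L are illustrative.
    return 1.0 - math.exp(-k * v_t * l / (t * L))

p_early = noise_prob(t=1, l=6, v_t=50.0)    # high loss, start of training
p_late = noise_prob(t=1000, l=6, v_t=0.5)   # low loss, late in training
```

Early in training with a high loss, p is close to 1 (heavy noise); as t grows and the loss average v_t falls, p decays toward 0, and for a fixed t the lower layers (small l) carry less noise than the upper ones.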
24. 3. Linearizing the network
Noise control
• Adding noise to the activation function may cause random
exploration
• The injected noise is bounded by a linear approximation
ψ(x_i, ξ_i; w_i) = sgn(u*(x_i))·min(|u*(x_i)|, |f*(x_i) + sgn(u*(x_i))·|s_i||) + u(0)
s_i ∼ N(0, p·c·σ(x_i))
The noise is sampled from a normal distribution, controlled by the
hyper-parameter c and the annealing probability p.
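A loose sketch of the bounded noise injection for a tanh unit follows; the constants, the use of u(x) = x as the linearization at 0, and σ(x) = |u(x) − tanh(x)| as the noise scale are assumptions for illustration, not the paper's exact parameterization. At p = 0 the unit reduces to plain tanh, and the min(·) clamp keeps the noisy output between the saturating branch and its linear approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_bounded_tanh(x, p, c=0.5):
    # Illustrative sketch of the bounded noise injection. u(x) = x is the
    # linearization of tanh at 0 (so u(0) = 0), and sigma(x) = |u - tanh(x)|
    # measures saturation: the noise vanishes where the unit is still linear.
    u = x
    f = np.tanh(x)
    s = rng.standard_normal(x.shape) * np.sqrt(p * c) * np.abs(u - f)
    # psi = sgn(u) * min(|u|, |f + sgn(u)*|s||): noise can push the output
    # toward the linear branch but never beyond it.
    return np.sign(u) * np.minimum(np.abs(u), np.abs(f + np.sign(u) * np.abs(s)))

x = np.linspace(-3.0, 3.0, 7)
clean = noisy_bounded_tanh(x, p=0.0)   # p = 0: recovers plain tanh
noisy = noisy_bounded_tanh(x, p=1.0)   # p = 1: maximal noise, still bounded
```

The clamp is what makes the exploration safe: the noisy activation always lies between the saturated value tanh(x) and the linear value x in magnitude.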
25. Experiments
Deep Parity
• 40-dimensional parity
problem
• 6-layer MLP using sigmoid
• SGD with momentum
PTB language modeling
• Wikipedia
• a 2-layer stacked LSTM
without any regularization
26. Rationale
Why do the authors think it works?
The algorithm satisfies the definitions of a generalized mollifier and a noisy mollifier
Generalized mollifier
A generalized mollifier is an operator Tσ(f) that defines a mapping
between two functions, Tσ : f → f*, such that:
lim_{σ→0} Tσ f = f
f⁰ = lim_{σ→∞} Tσ f is an identity function
∂(Tσ f)(x)/∂x exists for all x, σ > 0
Noisy mollifier
A stochastic function ϕ(x, ξσ) is a noisy mollifier if it satisfies
(Tσ f)(x) = E[ϕ(x, ξσ)]
27. Mollifiers
Mollifiers
1. A mollifier K is an infinitely differentiable function to be
convolved with the loss function L:
L_K(θ) = ∫_{−∞}^{∞} L(θ − τ) K(τ) dτ = (L ∗ K)(θ)
2. K should converge to the Dirac delta function when appropriately
rescaled:
L(θ) = lim_{ϵ→0} ∫_{−∞}^{∞} ϵ⁻¹ L(θ − τ) K(τ/ϵ) dτ
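Both properties can be seen numerically with a Gaussian mollifier: a large σ smooths a rough loss, and as σ → 0 the kernel approaches a Dirac delta, recovering the original loss. A discretized sketch (the toy loss is an illustrative choice):

```python
import numpy as np

def mollify(L_vals, grid, sigma):
    # Discretized (L * K)(theta) with a Gaussian mollifier K:
    # infinitely differentiable and normalized so its integral is 1.
    d = grid[1] - grid[0]
    K = np.exp(-grid**2 / (2.0 * sigma**2))
    K /= K.sum() * d
    return np.convolve(L_vals, K, mode="same") * d

theta = np.linspace(-6.0, 6.0, 1201)
L_vals = np.abs(theta) + np.sin(5.0 * theta)   # rough, non-convex toy loss
smooth = mollify(L_vals, theta, sigma=0.8)     # heavily mollified: oscillations gone
sharp = mollify(L_vals, theta, sigma=0.02)     # K ~ Dirac delta: recovers L
```

With σ = 0.8 the sin(5θ) oscillations are averaged away and only the broad |θ| shape survives; with σ = 0.02 the mollified loss is numerically indistinguishable from the original away from the grid boundaries.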
28. Weak Gradients (Distributional Gradients)
We’d like to approximate the gradient of the mollified network
∇(L ∗ K)(θ) = (L ∗ ∇K)(θ)
Weak gradients
For an integrable function L ∈ L([a, b]^n), g ∈ L([a, b]^n) is an
n-dimensional weak gradient of L if it satisfies
∫_C g(τ) K(τ) dτ = −∫_C L(τ) ∇K(τ) dτ
where C ⊂ [a, b]^n and τ ∈ R^n.
29. Mollified gradient
The mollified gradients will satisfy
g(θ) = lim_{ϵ→0} ∫_{−∞}^{∞} ϵ⁻¹ g(θ − τ) K(τ/ϵ) dτ
     = −lim_{ϵ→0} ∫_{−∞}^{∞} ϵ⁻¹ L(θ − τ) ∇K(τ/ϵ) dτ
For a function L that is differentiable almost everywhere, the weak
gradient g(θ) is equal to ∇_θ L almost everywhere:
g_K(θ) = ∇_θ L_K(θ)
∫ g(θ − τ) K(τ) dτ = ∫ ∇_θ L(θ − τ) K(τ) dτ
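The weak-gradient identity can be verified numerically for a classic example: L(τ) = |τ|, which is differentiable almost everywhere with weak gradient sign(τ). Using a Gaussian as the smooth test function K (centering it at 1 is an arbitrary choice that avoids a trivial 0 = 0 check):

```python
import numpy as np

tau = np.linspace(-6.0, 6.0, 4001)
d = tau[1] - tau[0]
L = np.abs(tau)                  # differentiable a.e.; weak gradient is sign(tau)
g = np.sign(tau)
K = np.exp(-(tau - 1.0)**2 / 2.0) / np.sqrt(2.0 * np.pi)   # smooth test function
dK = -(tau - 1.0) * K            # analytic K'(tau)

lhs = np.sum(g * K) * d          # integral of g * K
rhs = -np.sum(L * dK) * d        # -integral of L * grad K
```

Both Riemann sums agree (up to discretization error) despite L having a kink at 0, which is exactly what lets the mollified gradient be computed without L being differentiable everywhere.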
31. Summary
• DNNs are difficult to optimize because of their non-convex loss
surfaces
• Continuation methods and curriculum learning are families of
approaches to non-convex optimization
• Their implementations on DNNs bridge the gap between classic
methods (simulated annealing) and recent methods (skip
connections)
32. Reference
YouTube, NIPS 2015 Workshop (Mobahi) 15602 Non-convex
Optimization for Machine Learning: Theory and Practice
Curriculum Learning survey (Japanese)
Self-Paced Learning survey
Mollifying Networks, ICLR 2017 presentation