This document summarizes a seminar on non-convex optimization methods for machine learning. It discusses continuation methods, curriculum learning, and mollifying networks. Continuation methods construct a series of increasingly difficult loss functions to optimize. Curriculum learning begins training with simple concepts and progresses to complex concepts. Mollifying networks smooth the objective function during training by injecting noise into activations, and gradually reducing the noise level.
4. Motivation
DNNs: optimization of a highly non-convex loss function
• Plateaus, Saddle Points and Other Flat Regions
• Cliffs and Exploding Gradients
5. Previous Studies
Methods proposed to make optimization easier:
• continuation methods:
• blurring / noise injection
• RNNs with diffusion
• curriculum learning
• pre-training (omitted from this talk)
• active learning (omitted from this talk)
• transfer learning (omitted from this talk)
7. Continuation Methods
Continuation methods (Allgower+Georg, 1980)
construct a series of loss functions {J(0), ..., J(n)} whose costs are
designed to be increasingly difficult.
Recherches de Yoshua Bengio (web site)
8. Continuation Methods
How is the series of loss functions designed?
• Some non-convex functions become approximately convex
when blurred
J^(i)(θ) = E_{θ′ ∼ N(θ′; θ, σ^(i)²)}[J(θ′)]
Limits of Application
• applicable in some cases, but NP-hard problems remain NP-hard
• not applicable when the function does not become convex under blurring
• not applicable when blurring shifts the location of the minimum
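The blurring idea can be sketched numerically. Below is a minimal, self-contained illustration (not from the talk): a 1-D non-convex toy function is minimized by gradient descent on Monte Carlo estimates of the Gaussian-blurred objective J^(i), with σ annealed toward 0. The function, schedule, learning rate, and sample count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy non-convex objective (an illustrative choice, not from the talk):
    # a shallow quadratic bowl plus oscillations that create local minima.
    return 0.1 * x**2 + np.sin(3.0 * x)

def df(x):
    return 0.2 * x + 3.0 * np.cos(3.0 * x)

def smoothed_grad(x, sigma, n=2048):
    # Monte Carlo estimate of the gradient of the blurred objective
    # J^(i)(x) = E_{x' ~ N(x, sigma^2)}[f(x')], i.e. E_eps[f'(x + sigma*eps)].
    eps = rng.standard_normal(n)
    return df(x + sigma * eps).mean()

x = 4.0                              # deliberately bad starting point
for sigma in [3.0, 1.0, 0.3, 0.0]:   # annealing schedule sigma^(i) -> 0
    for _ in range(200):             # descend on the blurred objective J^(i)
        g = df(x) if sigma == 0.0 else smoothed_grad(x, sigma)
        x -= 0.05 * g
```

Starting from x = 4, plain gradient descent gets trapped in a nearby local minimum; descending the annealed sequence of blurred objectives instead guides the iterate toward the region of the global minimum near x ≈ −0.5.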
9. Where continuation methods are used on non-convex problems
Classic examples include data clustering (Gold+ 1994), graph
matching (Gold+ 1996; Zaslavskiy+ 2009; Liu+ 2012), semi-supervised
kernel machines (Sindhwani+ 2006), multiple instance learning
(Gehler+ 2007; Kim+ 2010), semi-supervised structured output
(Dhillon+ 2012), language modeling (Bengio+ 2009), robot navigation
(Pretto+ 2010), shape matching (Tirthapura+ 1998), l0 norm
minimization (Trzasko+ 2009), image deblurring (Boccuto+ 2002),
image denoising (Rangarajan+ 1990; Nikolova+ 2010), template
matching (Dufour+ 2002), pixel correspondence (Leordeanu+ 2008),
active contours (Cohen+ 1995), Hough transform (Leich+ 2004),
image matting (Price+ 2010), finding optimal parameters in computer
programs (Chaudhuri+ 2011), and seeking optimal proofs
(Chaudhuri+ 2014).
A Theoretical Analysis of Optimization by Gaussian Continuation (H. Mobahi + J. W. Fisher III, AAAI 2015)
10. Theoretical background and Proofs (2015)
Abstract
• Optimization complexity α is derived from the objective function.
• α is computable when the objective function is expressed in
suitable basis functions. e.g. Gaussian RBF
Brief Statement
Let f(x) be a non-convex function to be minimized, x̂ the solution
discovered by the continuation method, and f† the minimum of the
simplified objective function. Then
f(x̂) ≤ w₁·f† + w₂·√α
where w₁ > 0 and w₂ > 0 are independent of f and α.
11. Formulation
Definition
Objective function f : X → R (X = R^d)
Embedding of f into a family of functions g : X × T → R (a homotopy)
x(t): the minimizer of g(·, t), under the condition that g(·, t) is strictly convex
Assumptions on the curve x(t)
• lim_{t→∞} x(t) = x∞
• x(t) is continuous in t
• ∀t ≥ 0: ∇g(x(t), t) = 0
Continuation methods
x∞ is approximated by arg min_x g(x, t)
12. Formulation
Gaussian Homotopy
The Gaussian homotopy g : X × T → R is defined as the convolution of f
with the isotropic Gaussian kernel kσ:
g(x; σ) := [f ⋆ kσ](x)
The Gaussian convolution obeys the heat equation (Widder 1975)
˙g(x; σ) = σ∆g(x; σ)
Proof Procedure
• g(x(t); t) ≤ F(g(x(t₁), t₁), a(t), b(t)), where a(t)g + b(t) ≤ ĝ
• a(t) and b(t) can be bounded in terms of g
• the bound is parameterized by α
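The heat-equation property can be checked numerically. The sketch below (illustrative, not from the paper) discretizes g(x; σ) = [f ⋆ kσ](x) on a grid for an arbitrary smooth f and compares ∂g/∂σ, computed by central differences in σ, against σ·Δg:

```python
import numpy as np

def gauss_smooth(f_vals, grid, sigma):
    # g(x; sigma) = (f * k_sigma)(x), discretized on a uniform grid.
    d = grid[1] - grid[0]
    k = np.exp(-grid**2 / (2.0 * sigma**2))
    k /= k.sum() * d                        # normalize so the kernel integrates to 1
    return np.convolve(f_vals, k, mode="same") * d

x = np.linspace(-10.0, 10.0, 4001)
f = np.sin(2.0 * x) * np.exp(-0.05 * x**2)  # an arbitrary smooth test function
sigma, h = 1.0, 1e-3

# Left-hand side: dg/d(sigma) by central finite differences.
dg_dsigma = (gauss_smooth(f, x, sigma + h) - gauss_smooth(f, x, sigma - h)) / (2 * h)
# Right-hand side: sigma times the Laplacian of g.
g = gauss_smooth(f, x, sigma)
rhs = sigma * np.gradient(np.gradient(g, x), x)
```

Away from the grid boundaries the two sides agree to within discretization error, which is the content of ġ(x; σ) = σ·Δg(x; σ).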
14. Curriculum Learning
Curriculum Learning (Bengio+, ICML 2009)
The learning process begins with simple concepts and progresses to more
complex concepts.
How are curricula designed and easy examples defined?
Easy tasks and curricula are usually defined in a task-specific way.
• distance from the classification boundary (Basu+Christensen, AAAI 2013)
• short sentences (Spitkovsky+, NIPS 2009)
• Exemplar-SVM (Lapedriza+, arXiv 2013)
Stochastic curriculum (Zaremba+Sutskever, 2014)
A random mix of easy and difficult examples is presented, and the
average proportion of the more difficult examples is gradually
increased.
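A stochastic curriculum in the sense above can be sketched in a few lines; the linear schedule for the share of hard examples and the toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_curriculum_batch(easy, hard, progress, batch_size, rng):
    # Every batch is a random mix of easy and hard examples; the expected
    # share of hard ones grows with training progress in [0, 1].
    # The linear schedule p_hard = progress is an illustrative assumption.
    take_hard = rng.random(batch_size) < progress
    picks_easy = rng.integers(len(easy), size=batch_size)
    picks_hard = rng.integers(len(hard), size=batch_size)
    return [hard[j] if h else easy[i]
            for h, i, j in zip(take_hard, picks_easy, picks_hard)]

# Toy (example, difficulty-label) pairs standing in for real data.
easy = [("short sentence", 0)] * 50
hard = [("much longer sentence", 1)] * 50
early = stochastic_curriculum_batch(easy, hard, progress=0.1, batch_size=1000, rng=rng)
late = stochastic_curriculum_batch(easy, hard, progress=0.9, batch_size=1000, rng=rng)
```

Early batches are dominated by easy examples; late batches by difficult ones, with the mix always remaining random rather than switching abruptly.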
15. Experiments on shape recognition
Environment
• Task: classification into 3 classes (rectangle, ellipse, triangle)
• Curriculum: train first on BasicShapes, which contains only the
special cases (squares, circles, equilateral triangles), until the "switch epoch"
• Training: 256 epochs in total, or early stopping
Conclusion
The best generalization is obtained with the 50:50 mix
16. Experiments on language modeling
Environment
• Task: predict the next word from the preceding 5 words
• Curriculum: vocabulary increased by 5,000 words at each pass
• Validation: 10,000 windows of text drawn from the 20,000-word vocabulary
17. Self-Paced Learning
Contribution
The curriculum is defined by a measure of the easiness of samples
Model
At the age of λ, the curriculum is determined by v:
min_{w, v ∈ [0,1]^n} Σ_{i=1}^{n} v_i·L(f(x_i; w), y_i) + r(w) − λ‖v‖₁
v_i = 1 if L(f(x_i; w), y_i) ≤ λ, else v_i = 0
λ is updated to µλ with learning pace µ
Self-Paced Learning for Latent Variable Models (M. P. Kumar+, NIPS 2010)
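The alternating update has a closed form in v: given the current losses, v_i simply thresholds at λ, and λ then grows by the pace µ. A minimal sketch (the losses, λ, and µ below are made-up values):

```python
import numpy as np

def spl_weights(losses, lam):
    # Closed-form minimizer over v of the self-paced objective:
    # v_i = 1 if L(f(x_i; w), y_i) <= lambda, else 0.
    return (losses <= lam).astype(float)

# Illustrative per-sample losses; lambda (age) and mu (pace) are made up.
losses = np.array([0.1, 0.5, 2.0, 0.3])
lam, mu = 0.4, 1.5
selected = []
for _ in range(3):
    v = spl_weights(losses, lam)   # select the currently-easy samples
    selected.append(v.sum())
    # (here one would minimize sum_i v_i * L_i + r(w) over w)
    lam *= mu                      # raise the age: admit harder samples later
```

As λ grows, more (harder) samples enter the objective: the model sets its own curriculum from the loss values rather than from hand-designed difficulty labels.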
19. Mollifying Networks (Caglar Gulcehre+, 2016-Aug)
paper info
• NIPS 2016 Workshop on Non-convex Optimization: best paper
• ICLR 2017 accepted (scores: [6, 6, 7])
abstract
• proposes a smoothed (mollified) objective function for optimizing
highly non-convex neural networks
• the complexity is controlled by a single hyper-parameter
• shows the relationship between recent work on continuation
methods and mollifiers
20. Overview
Key Idea
• The non-linearity of tanh and sigmoid activations makes NNs
difficult to optimize
• inject noise into the activation function during training
• anneal the noise
Interpretation
It connects the ideas of curriculum learning and continuation
methods with skip connections and with layers that compute
near-identity transformations.
21. Framework
1. Training starts by optimizing a convex objective function, with a
high level of noise controlled by a single scalar p ∈ [0, 1] per
layer
2. As the noise level p is annealed, we move from identity
transformations to arbitrary linear transformations between
layers
3. The decreasing noise level p allows element-wise activation
functions to become non-linear
22. 1. Annealing Schedule for p
p_t^l = 1 − exp(−k·v_t·l / (t·L))
with hyper-parameter k, where L is the number of layers of the
model and v_t is a moving average of the loss of the network.
Anneal speed for each layer
The noise in the lower layers anneals faster, as defined by the
linearly decaying probability across layers.
Anneal speed for the training loss
When the training loss is high, the noise injected into the system is
large, and vice versa:
lim_{v_t→∞} p_t^l = 1, and lim_{v_t→0} p_t^l = 0
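The schedule can be written directly from the formula; k and L below are illustrative values, not the paper's settings:

```python
import math

def noise_prob(t, l, v_t, k=1.0, L=6):
    # p_t^l = 1 - exp(-k * v_t * l / (t * L)); k and L are illustrative.
    return 1.0 - math.exp(-k * v_t * l / (t * L))

p_early = noise_prob(t=1, l=6, v_t=50.0)    # high loss, start of training
p_late = noise_prob(t=1000, l=6, v_t=0.5)   # low loss, late in training
```

Early in training with a high loss, p is close to 1 (heavy noise); as t grows and the loss average v_t falls, p decays toward 0, and for a fixed t the lower layers (small l) carry less noise than the upper ones.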
24. 3. Linearizing the network
Noise control
• Adding noise to the activation function may cause random
exploration
• The injected noise is bounded by a linear approximation
ψ(x_i, ξ_i; w_i) = sgn(u*(x_i))·min(|u*(x_i)|, |f*(x_i) + sgn(u*(x_i))·|s_i||) + u(0)
s_i ∼ N(0, p·c·σ(x_i))
The noise is sampled from a normal distribution, controlled by the
hyper-parameter c and the annealing probability p.
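A loose sketch of the bounded noise injection for a tanh unit follows; the constants, the use of u(x) = x as the linearization at 0, and σ(x) = |u(x) − tanh(x)| as the noise scale are assumptions for illustration, not the paper's exact parameterization. At p = 0 the unit reduces to plain tanh, and the min(·) clamp keeps the noisy output between the saturating branch and its linear approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_bounded_tanh(x, p, c=0.5):
    # Illustrative sketch of the bounded noise injection. u(x) = x is the
    # linearization of tanh at 0 (so u(0) = 0), and sigma(x) = |u - tanh(x)|
    # measures saturation: the noise vanishes where the unit is still linear.
    u = x
    f = np.tanh(x)
    s = rng.standard_normal(x.shape) * np.sqrt(p * c) * np.abs(u - f)
    # psi = sgn(u) * min(|u|, |f + sgn(u)*|s||): noise can push the output
    # toward the linear branch but never beyond it.
    return np.sign(u) * np.minimum(np.abs(u), np.abs(f + np.sign(u) * np.abs(s)))

x = np.linspace(-3.0, 3.0, 7)
clean = noisy_bounded_tanh(x, p=0.0)   # p = 0: recovers plain tanh
noisy = noisy_bounded_tanh(x, p=1.0)   # p = 1: maximal noise, still bounded
```

The clamp is what makes the exploration safe: the noisy activation always lies between the saturated value tanh(x) and the linear value x in magnitude.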
25. Experiments
Deep Parity
• 40-dimensional parity
problem
• 6-layer MLP using sigmoid
• SGD with momentum
PTB language modeling
• Wikipedia
• a 2-layer stacked LSTM
without any regularization
26. Rationale
Why do the authors think it works?
The algorithm satisfies the definitions of a generalized mollifier and a noisy mollifier
Generalized mollifier
A generalized mollifier is an operator Tσ(f) that defines a mapping
between two functions, Tσ : f → f*, such that:
lim_{σ→0} Tσ f = f
f⁰ = lim_{σ→∞} Tσ f is an identity function
∂(Tσ f)(x)/∂x exists for all x, σ > 0
Noisy mollifier
A stochastic function ϕ(x, ξσ) is a noisy mollifier if it satisfies
(Tσ f)(x) = E[ϕ(x, ξσ)]
27. Mollifiers
Mollifiers
1. A mollifier K is an infinitely differentiable function to be
convolved with the loss function L:
L_K(θ) = ∫_{−∞}^{∞} L(θ − τ) K(τ) dτ = (L ∗ K)(θ)
2. K should converge to the Dirac delta function when appropriately
rescaled:
L(θ) = lim_{ϵ→0} ∫_{−∞}^{∞} ϵ⁻¹ L(θ − τ) K(τ/ϵ) dτ
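Both properties can be seen numerically with a Gaussian mollifier: a large σ smooths a rough loss, and as σ → 0 the kernel approaches a Dirac delta, recovering the original loss. A discretized sketch (the toy loss is an illustrative choice):

```python
import numpy as np

def mollify(L_vals, grid, sigma):
    # Discretized (L * K)(theta) with a Gaussian mollifier K:
    # infinitely differentiable and normalized so its integral is 1.
    d = grid[1] - grid[0]
    K = np.exp(-grid**2 / (2.0 * sigma**2))
    K /= K.sum() * d
    return np.convolve(L_vals, K, mode="same") * d

theta = np.linspace(-6.0, 6.0, 1201)
L_vals = np.abs(theta) + np.sin(5.0 * theta)   # rough, non-convex toy loss
smooth = mollify(L_vals, theta, sigma=0.8)     # heavily mollified: oscillations gone
sharp = mollify(L_vals, theta, sigma=0.02)     # K ~ Dirac delta: recovers L
```

With σ = 0.8 the sin(5θ) oscillations are averaged away and only the broad |θ| shape survives; with σ = 0.02 the mollified loss is numerically indistinguishable from the original away from the grid boundaries.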
28. Weak Gradients (Distributional Gradients)
We’d like to approximate the gradient of the mollified network
∇(L ∗ K)(θ) = (L ∗ ∇K)(θ)
Weak gradients
For an integrable function L ∈ L([a, b]^n), g ∈ L([a, b]^n) is an
n-dimensional weak gradient of L if it satisfies
∫_C g(τ) K(τ) dτ = −∫_C L(τ) ∇K(τ) dτ
where C ⊂ [a, b]^n and τ ∈ R^n.
29. Mollified gradient
The mollified gradients will satisfy
g(θ) = lim_{ϵ→0} ∫_{−∞}^{∞} ϵ⁻¹ g(θ − τ) K(τ/ϵ) dτ
     = −lim_{ϵ→0} ∫_{−∞}^{∞} ϵ⁻¹ L(θ − τ) ∇K(τ/ϵ) dτ
For a function L that is differentiable almost everywhere, the weak
gradient g(θ) is equal to ∇_θ L almost everywhere:
g_K(θ) = ∇_θ L_K(θ)
∫ g(θ − τ) K(τ) dτ = ∫ ∇_θ L(θ − τ) K(τ) dτ
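The weak-gradient identity can be verified numerically for a classic example: L(τ) = |τ|, which is differentiable almost everywhere with weak gradient sign(τ). Using a Gaussian as the smooth test function K (centering it at 1 is an arbitrary choice that avoids a trivial 0 = 0 check):

```python
import numpy as np

tau = np.linspace(-6.0, 6.0, 4001)
d = tau[1] - tau[0]
L = np.abs(tau)                  # differentiable a.e.; weak gradient is sign(tau)
g = np.sign(tau)
K = np.exp(-(tau - 1.0)**2 / 2.0) / np.sqrt(2.0 * np.pi)   # smooth test function
dK = -(tau - 1.0) * K            # analytic K'(tau)

lhs = np.sum(g * K) * d          # integral of g * K
rhs = -np.sum(L * dK) * d        # -integral of L * grad K
```

Both Riemann sums agree (up to discretization error) despite L having a kink at 0, which is exactly what lets the mollified gradient be computed without L being differentiable everywhere.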
31. Summary
• DNNs are difficult to optimize because of their non-convex loss
surfaces
• Continuation methods and curriculum learning are families of
approaches to non-convex optimization
• Their implementations on DNNs bridge the gap between classic
methods (simulated annealing) and recent methods (skip
connections)
32. Reference
YouTube, NIPS 2015 Workshop (Mobahi) 15602 Non-convex
Optimization for Machine Learning: Theory and Practice
Curriculum Learning survey (Japanese)
Self-Paced Learning survey
Mollifying Networks, ICLR 2017 presentation