Reading Papers Seminar
Survey on Non-Convex Optimization
Hitoshi Nakanishi
January 20, 2019
UTokyo
Table of contents
1. Non-Convex Optimization for machine learning
2. Continuation methods
3. Curriculum Learning
4. Mollifying networks
5. Conclusion
1
Non-Convex Optimization for machine learning
Motivation
DNNs: training requires highly non-convex optimization of the loss function
• Plateaus, Saddle Points and Other Flat Regions
• Cliffs and Exploding Gradients
2
Previous Studies
Methods proposed to make optimization easier:
• continuation methods:
• blurring / noise injection
• RNNs with diffusion
• curriculum learning
• pre-training (omitted in this talk)
• active learning (omitted in this talk)
• transfer learning (omitted in this talk)
3
Continuation methods
Continuation Methods
Continuation methods (Allgower+Georg, 1980)
construct a series of loss functions {J^(0), ..., J^(n)} whose costs are
designed to be increasingly difficult
Figure source: Yoshua Bengio's research web site 4
Continuation Methods
Continuation methods (Allgower+Georg, 1980)
construct a series of loss functions {J^(0), ..., J^(n)} whose costs are
designed to be increasingly difficult
How is the series of loss functions designed?
• Some non-convex functions become approximately convex when blurred
(a minimal sketch follows after this slide):
J^(i)(θ) = E_{θ′∼N(θ′; θ, σ^(i)²)} J(θ′)
Limits of application
• applicable, but NP-hard problems remain NP-hard
• not applicable when the function does not become convex under blurring
• not applicable when the minimum of the blurred function differs from the original minimum
5
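To make the blurring idea concrete, here is a minimal Python sketch (not from the slides) of a Gaussian continuation loop on a 1-D toy objective. The toy function, the σ schedule, the Monte Carlo gradient estimator, and the step sizes are all illustrative assumptions.

```python
import numpy as np

def toy_loss(theta):
    """A 1-D non-convex toy objective (illustrative only)."""
    return np.sin(5 * theta) + 0.5 * theta ** 2

def blurred_grad(theta, sigma, n_samples=256, eps=1e-3):
    """Finite-difference gradient of J^(i)(theta) = E_{theta'~N(theta, sigma^2)} J(theta'),
    estimated by Monte Carlo with common random numbers."""
    noise = np.random.default_rng(0).normal(0.0, sigma, n_samples)
    j_plus = toy_loss(theta + eps + noise).mean()
    j_minus = toy_loss(theta - eps + noise).mean()
    return (j_plus - j_minus) / (2 * eps)

theta = 2.5                                   # start far from the global minimum
for sigma in [2.0, 1.0, 0.5, 0.25, 0.0]:      # J^(0) (heavily blurred) -> J^(n) = J
    for _ in range(200):                      # gradient descent on the current J^(i)
        g = blurred_grad(theta, sigma) if sigma > 0 else \
            (toy_loss(theta + 1e-3) - toy_loss(theta - 1e-3)) / 2e-3
        theta -= 0.05 * g
print("final theta:", theta, "loss:", toy_loss(theta))
```

With a large σ the smoothed objective is essentially the convex quadratic term, and the iterate is carried into the basin of the global minimum before the blur is removed.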
Where continuation methods are used on non-convex problems
Classic examples include data clustering (Gold+ 1994), graph
matching (Gold+ 1996; Zaslavskiy+ 2009; Liu+ 2012), semi-supervised
kernel machines (Sindhwani+ 2006), multiple instance learning
(Gehler+ 2007; Kim+ 2010), semi-supervised structured output
(Dhillon+ 2012), language modeling (Bengio+ 2009), robot navigation
(Pretto+ 2010), shape matching (Tirthapura+ 1998), l0 norm
minimization (Trzasko+ 2009), image deblurring (Boccuto+ 2002),
image denoising (Rangarajan+ 1990; Nikolova+ 2010), template
matching (Dufour+ 2002), pixel correspondence (Leordeanu+ 2008),
active contours (Cohen+ 1995), Hough transform (Leich+ 2004), and
image matting (Price+ 2010), finding optimal parameters in computer
programs (Chaudhuri+ 2011) and seeking the optimal proofs
(Chaudhuri+ 2014)...
A Theoretical Analysis of Optimization by Gaussian Continuation (H. Mobahi+J. W. Fisher III, AAAI 2015) 6
Theoretical background and Proofs (2015)
Abstract
• Optimization complexity α is derived from the objective function.
• α is computable when the objective function is expressed in
suitable basis functions, e.g. Gaussian RBFs
Brief Statement
Let f(x) be a non-convex function to be minimized, x̂ the solution
discovered by the continuation method, and f† the minimum of
the simplified objective function. Then
f(x̂) ≤ w1 f† + w2 √α
where w1 > 0 and w2 > 0 are independent of f and α
A Theoretical Analysis of Optimization by Gaussian Continuation (H. Mobahi+J. W. Fisher III, AAAI 2015) 7
Formulation
Definition
Objective function f : X → R (X = R^d)
Embedding of f into a family of functions g : X × T → R (homotopy)
x(t): minimizer of g(·, t) under the condition of strict convexity
Assumptions on the curve x(t)
• lim_{t→∞} x(t) = x∞
• x(t) is continuous in t
• ∀t ≥ 0: ∇g(x(t), t) = 0
Continuation methods
x∞ is approximated by arg min_x g(x, t)
A Theoretical Analysis of Optimization by Gaussian Continuation (H. Mobahi+J. W. Fisher III, AAAI 2015) 8
Formulation
Gaussian Homotopy
The Gaussian homotopy g : X × T is defined as the convolution of f with
the isotropic Gaussian kernel k:
g(x; σ) := [f ⋆ kσ](x)
The Gaussian convolution obeys the heat equation (Widder 1975)
(a numerical check follows after this slide):
∂g(x; σ)/∂σ = σ ∆g(x; σ)
Proof Procedure
• g(x(t); t) ≤ F(g(x(t1), t1), a(t), b(t)), where a(t)g + b(t) ≤ ĝ
• a(t), b(t) can be bounded by g
• the bound is parameterized by α
A Theoretical Analysis of Optimization by Gaussian Continuation (H. Mobahi+J. W. Fisher III, AAAI 2015) 9
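As a sanity check on the heat-equation property (my example, not from the paper), the sketch below smooths a 1-D function with scipy's Gaussian filter and compares ∂g/∂σ against σ∆g by finite differences; the grid, test function, and step sizes are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.arange(512, dtype=float)               # unit grid spacing
f = np.sin(x / 20.0) + 0.3 * np.cos(x / 7.0)  # arbitrary smooth test function

sigma, h = 8.0, 0.05
g       = gaussian_filter1d(f, sigma)
dg_dsig = (gaussian_filter1d(f, sigma + h) - gaussian_filter1d(f, sigma - h)) / (2 * h)
lap_g   = np.gradient(np.gradient(g))         # discrete Laplacian on the unit grid

interior = slice(50, -50)                     # stay clear of the filter's boundary handling
err = np.max(np.abs(dg_dsig[interior] - sigma * lap_g[interior]))
print("max |dg/dsigma - sigma * laplacian(g)| =", err)   # should be small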
Curriculum Learning
Curriculum Learning
Curriculum Learning (Bengio+, ICML 2009)
The learning process begins with simple concepts and progresses to more
complex ones.
How are curricula designed and easy examples defined?
Easy examples and the curriculum are often defined in a task-specific way.
• distance from the classification boundary (Basu+Christensen, AAAI
2013)
• short sentences (Spitkovsky+, NIPS 2009)
• Exemplar-SVM (Lapedriza+, arXiv 2013)
stochastic curriculum (Zaremba+Sutskever, 2014)
a random mix of easy and difficult examples is presented, and the
average proportion of the more difficult examples is gradually
increased (a minimal sketch follows after this slide)
10
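A minimal sketch of the stochastic-curriculum idea referenced above, assuming a pre-computed per-example difficulty score; the linear ramp, batch size, and median split are illustrative choices, not taken from Zaremba+Sutskever.

```python
import numpy as np

def stochastic_curriculum_batches(difficulty, n_steps, batch_size=32, seed=0):
    """Yield index batches mixing easy and hard examples at random, with the
    expected share of hard examples growing linearly over training."""
    rng = np.random.default_rng(seed)
    easy = np.where(difficulty <  np.median(difficulty))[0]
    hard = np.where(difficulty >= np.median(difficulty))[0]
    for step in range(n_steps):
        p_hard = 0.1 + 0.8 * step / max(n_steps - 1, 1)   # assumed linear ramp
        take_hard = rng.random(batch_size) < p_hard
        batch = np.where(take_hard,
                         rng.choice(hard, batch_size),
                         rng.choice(easy, batch_size))
        yield batch

# usage: difficulty could be sentence length, margin to the decision boundary, etc.
difficulty = np.random.default_rng(1).random(1000)
for batch in stochastic_curriculum_batches(difficulty, n_steps=3):
    print(batch[:5])
```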
Experiments on shape recognition
Environment
• Task: classification into 3 classes (rectangle, ellipse, triangle)
• Curriculum: train on BasicShapes, which contains only the special
cases (squares, circles, equilateral triangles), until the "switch epoch"
• Training: 256 epochs in total, or early stopping
Conclusion
The best generalization is obtained with the 50:50 mix
Curriculum Learning (Bengio+, ICML 2009) 11
Experiments on language modeling
Environment
• Task: predict the next word from the 5 preceding words
• Curriculum: vocabulary increased by 5,000 words at each stage
• Validation: 10,000 windows of text restricted to the 20,000-word vocabulary
Curriculum Learning (Bengio+, ICML 2009) 12
Self-Paced Learning
Contribution
The curriculum is defined via a measure of how easy each sample is
Model
At model "age" λ, the curriculum is determined by v:
min_{w, v∈[0,1]^n} ∑_{i=1}^{n} v_i L(f(x_i; w), y_i) + r(w) − λ∥v∥1
v_i = 1 if L(f(x_i; w), y_i) ≤ λ, else 0
λ is updated to µλ with learning pace µ (a minimal sketch follows below)
Self-Paced Learning for Latent Variable Models (M.P.Kumar+, NIPS 2010) 13
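A minimal sketch of the alternating scheme implied by the model above, assuming a generic model fitted by a black-box `fit_fn` and a squared loss; the least-squares example, the initial λ, and the pace µ are placeholders, not values from the paper.

```python
import numpy as np

def self_paced_fit(X, y, loss_fn, fit_fn, lam=0.1, mu=1.3, n_rounds=10):
    """Alternate between (1) selecting examples whose loss is below lambda and
    (2) refitting the model on the selected subset, then grow lambda by mu."""
    w = fit_fn(X, y)                               # warm start on all data
    for _ in range(n_rounds):
        losses = loss_fn(X, y, w)
        v = (losses <= lam).astype(float)          # closed-form v_i for fixed w
        if v.sum() > 0:
            w = fit_fn(X[v == 1], y[v == 1])       # refit on the "easy" subset
        lam *= mu                                  # increase the model's "age"
    return w

# usage with a least-squares model (illustrative placeholders)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
fit_fn  = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
loss_fn = lambda X, y, w: (X @ w - y) ** 2
w = self_paced_fit(X, y, loss_fn, fit_fn)
```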
Mollifying networks
Mollifying Networks (Caglar Gulcehre+, 2016-Aug)
paper info
• best paper at the NIPS 2016 workshop on non-convex optimization
• accepted at ICLR 2017 (review scores: [6, 6, 7])
abstract
• proposes a smoothed (mollified) objective function for optimizing
highly non-convex neural networks
• the complexity of the objective is controlled by a single hyper-parameter
• shows the relationship between recent work on continuation
methods and mollifiers
14
Overview
Key Idea
• The saturating non-linearity of tanh and sigmoid activations
makes NNs difficult to optimize
• inject noise into the activation function during training
• anneal the noise as training progresses
Interpretation
It connects the ideas of curriculum learning and continuation
methods with skip connections and with layers that compute
near-identity transformations
15
Framework
1. We start training by optimizing a convex objective function, with a
high level of noise controlled by a single scalar p ∈ [0, 1] per
layer
2. As the noise level p is annealed, we move from identity
transformations to arbitrary linear transformations between
layers
3. The decreasing noise level p allows the element-wise activation
functions to become non-linear
16
1. Annealing Schedule for p
p_t^l = 1 − exp(−k v_t l / (tL))
with hyper-parameter k, where L is the number of layers of the
model, l is the layer index, and v_t is a moving average of the training loss
(the schedule is evaluated in the sketch after this slide).
Annealing speed per layer
The noise in the lower layers anneals faster, following the linearly
decaying probability across layers
Annealing speed vs. training loss
When the training loss is high, the noise injected into the system is
large, and vice versa:
lim_{v_t→∞} p_t^l = 1, and lim_{v_t→0} p_t^l = 0
17
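A small sketch that simply evaluates the annealing schedule above. The form p_t^l = 1 − exp(−k·v_t·l / (t·L)) is my reading of the garbled slide, so treat the exact exponent, as well as k and the layer count, as assumptions.

```python
import numpy as np

def noise_prob(t, layer, loss_avg, k=1.0, n_layers=6):
    """p_t^l = 1 - exp(-k * v_t * l / (t * L)); more noise for higher loss and higher layers."""
    return 1.0 - np.exp(-k * loss_avg * layer / (t * n_layers))

for t in (1, 10, 100):                      # training step
    probs = [noise_prob(t, l, loss_avg=2.0) for l in range(1, 7)]
    print(t, np.round(probs, 3))            # p shrinks as training progresses
```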
2. Simplifying the Objective Function for Feedforward Networks
h̃^l = ψ(h^{l−1}, ξ; W^l)
h^l = π^l ⊙ h^{l−1} + (1 − π^l) ⊙ h̃^l
π^l ∼ Bin(p^l)
18
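A minimal numpy sketch of the per-unit mixing rule above: with probability p^l a unit copies its input (an identity/skip connection), otherwise it uses the transformed activation. The plain tanh layer standing in for ψ and the shapes are assumptions for illustration.

```python
import numpy as np

def mollified_layer(h_prev, W, p, rng):
    """h^l = pi ⊙ h^{l-1} + (1 - pi) ⊙ psi(h^{l-1}; W), with pi ~ Bernoulli(p)."""
    h_tilde = np.tanh(h_prev @ W)                 # psi: plain tanh layer (assumption)
    pi = rng.binomial(1, p, size=h_tilde.shape)   # 1 = copy the input (skip)
    return pi * h_prev + (1 - pi) * h_tilde

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8)); W = rng.normal(size=(8, 8)) * 0.1
print(mollified_layer(h, W, p=0.9, rng=rng)[0, :4])   # mostly copies the input
print(mollified_layer(h, W, p=0.0, rng=rng)[0, :4])   # fully non-linear layer
```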
3. Linearizing the network
Noise control
• Adding noise to the activation
function can cause random
exploration
• The noise injection is bounded
by a linear approximation
ψ(x_i, ξ_i; w_i) = sgn(u*(x_i)) min(|u*(x_i)|, |f*(x_i) + sgn(u*(x_i)) |s_i||) + u(0)
s_i ∼ N(0, p c σ(x_i))
The noise is sampled from a Normal distribution, controlled by
hyper-parameter c and annealing probability p.
19
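A hedged sketch of bounded noise injection on a saturating activation, in the spirit of the formula above. Taking u to be the linearization of tanh at 0, f the activation itself, and the noise scale to be the linearization gap are all illustrative assumptions, not the paper's exact definitions of u*, f*, and σ(x).

```python
import numpy as np

def noisy_bounded_activation(x, p, c=0.5, rng=np.random.default_rng(0)):
    """tanh with annealed, magnitude-bounded noise (illustrative sketch)."""
    u = x                                  # linear approximation of tanh near 0 (assumption)
    f = np.tanh(x)
    scale = p * c * np.abs(u - f)          # assumed noise scale: the linearization gap
    s = rng.normal(0.0, 1.0, size=np.shape(x)) * scale
    # Clip the noisy value so it never exceeds the linear part in magnitude.
    return np.sign(u) * np.minimum(np.abs(u), np.abs(f + np.sign(u) * np.abs(s)))

x = np.linspace(-3, 3, 7)
print(noisy_bounded_activation(x, p=1.0))   # noisy, but bounded by |x|
print(noisy_bounded_activation(x, p=0.0))   # reduces to plain tanh
```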
Experiments
Deep Parity
• 40-dimensional parity
problem
• 6-layer MLP using sigmoid
• SGD with momentum
PTB language modeling
• Wikipedia
• a 2-layer stacked LSTM
without any regularization
20
Rationale
Why do the authors think it works?
The algorithm satisfies the definitions of a generalized mollifier and a noisy mollifier
Generalized mollifier
A generalized mollifier is an operator Tσ that maps a function to
another function, Tσ : f → f*, such that:
lim_{σ→0} Tσf = f
f^0 = lim_{σ→∞} Tσf is an identity function
∂Tσf(x)/∂x exists ∀x, σ > 0
Noisy mollifier
A stochastic function ϕ(x, ξσ) is a noisy mollifier if it satisfies
(Tσf)(x) = E[ϕ(x, ξσ)]
21
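To illustrate the noisy-mollifier property (Tσf)(x) = E[ϕ(x, ξσ)] on a concrete case (my example, not the paper's): with ϕ(x, ξ) = f(x + σξ) and ξ ∼ N(0, 1), the expectation is exactly the Gaussian-blurred f, which can be checked numerically. The test function, grid, and sample count are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def f(x):
    return np.sin(3 * x) + 0.1 * x ** 2       # arbitrary smooth test function

sigma, x0 = 0.5, 1.2
rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(x0 + sigma * xi)], xi ~ N(0, 1)
mc = f(x0 + sigma * rng.normal(size=200_000)).mean()

# Explicit Gaussian smoothing of f on a grid, evaluated at x0
grid = np.linspace(-5, 5, 10_001)
dx = grid[1] - grid[0]
smoothed = gaussian_filter1d(f(grid), sigma / dx)     # sigma expressed in grid units
print(mc, np.interp(x0, grid, smoothed))              # the two values should be close
```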
Mollifiers
Mollifiers
1. A mollifier K is an infinitely differentiable function to be
convolved with the loss function L:
L_K(θ) = ∫_{−∞}^{∞} L(θ − τ) K(τ) dτ = (L ∗ K)(θ)
2. K should converge to the Dirac delta function when appropriately
rescaled:
L(θ) = lim_{ϵ→0} ∫_{−∞}^{∞} ϵ^{−1} L(θ − τ) K(τ/ϵ) dτ
22
Weak Gradients (Distributional Gradients)
We would like to approximate the gradient of the mollified objective:
∇(L ∗ K)(θ) = (L ∗ ∇K)(θ)
Weak gradients
For an integrable function L ∈ L([a, b]^n), g ∈ L([a, b]^n) is an
n-dimensional weak gradient of L if it satisfies:
∫_C g(τ) K(τ) dτ = −∫_C L(τ) ∇K(τ) dτ
where C ⊂ [a, b]^n and τ ∈ R^n
23
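A quick numerical sanity check (my example) that differentiating the smoothed loss agrees with smoothing the gradient of the loss, for a Gaussian K; the toy loss, grid, and smoothing width are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

theta = np.linspace(-4, 4, 4001)
L  = np.sin(4 * theta) + 0.5 * theta ** 2            # toy loss (assumption)
dL = 4 * np.cos(4 * theta) + theta                   # its analytic gradient
sigma_pts = 0.3 / (theta[1] - theta[0])               # smoothing width in grid units

grad_of_smoothed = np.gradient(gaussian_filter1d(L, sigma_pts), theta)
smoothed_grad    = gaussian_filter1d(dL, sigma_pts)

interior = slice(800, -800)                           # keep clear of boundary effects
print(np.max(np.abs(grad_of_smoothed[interior] - smoothed_grad[interior])))  # should be small
```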
Mollified gradient
The mollified gradients will satisfy
g(θ) = lim_{ϵ→0} ∫_{−∞}^{∞} ϵ^{−1} g(θ − τ) K(τ/ϵ) dτ
     = −lim_{ϵ→0} ∫_{−∞}^{∞} ϵ^{−1} L(θ − τ) ∇K(τ/ϵ) dτ
For a function L that is differentiable almost everywhere, the weak gradient
g(θ) equals ∇θL almost everywhere:
g_K(θ) = −∇θL_K(θ)
∫ g(θ − τ) K(τ) dτ = −∫ ∇θL(θ − τ) K(τ) dτ
24
Conclusion
Summary
• DNNs are difficult to optimize because their loss surfaces are
non-convex
• Continuation methods and curriculum learning are families of
approaches to this non-convex optimization problem
• Their implementations for DNNs bridge the gap between
classic methods (simulated annealing) and recent methods
(skip connections)
25
References
YouTube: NIPS 2015 Workshop (Mobahi), Non-convex
Optimization for Machine Learning: Theory and Practice
Curriculum Learning survey (in Japanese)
Self-Paced Learning survey
Mollifying Networks, ICLR 2017 presentation
26