1 / 18
New Insights and Perspectives on the Natural
Gradient Method
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
March 13, 2018
2 / 18
Motivation
In parameter space (µ, σ), each pair of distributions is the same
distance apart.
3 / 18
Motivation
We often talk about the parameter space of a neural network
rather than the functions that those parameters represent.
4 / 18
Gradient Descent
The gradient descent update
    \Delta\theta = -\alpha \nabla_\theta L(\theta)
is the solution to the following optimization problem:
    \arg\min_{\Delta\theta} \; \nabla_\theta L(\theta)^\top \Delta\theta
    \quad \text{s.t.} \quad \|\Delta\theta\| \le \delta.
We are optimizing a linear approximation of L within a trust region
defined by the Euclidean metric in parameter space.
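As a sanity check (not from the slides), here is a minimal JAX sketch of this update on a made-up quadratic loss; the loss, step size α, and starting point are all illustrative assumptions:

```python
# Minimal sketch: a plain gradient-descent step, Δθ = −α ∇θ L(θ), on a toy loss.
import jax
import jax.numpy as jnp

def loss(theta):
    # made-up quadratic loss with an anisotropic Hessian
    return 0.5 * theta @ jnp.diag(jnp.array([1.0, 100.0])) @ theta

alpha = 0.01                       # hypothetical step size
theta = jnp.array([1.0, 1.0])
grad = jax.grad(loss)(theta)
delta = -alpha * grad              # steepest-descent direction in the Euclidean metric
theta_new = theta + delta
print(delta, loss(theta_new) < loss(theta))
```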
5 / 18
Natural Gradient
Consider a family of density functions F : θ → p(z), where θ ∈ R^n:
    p_\theta(z) := F(\theta)(z).
We naturally obtain a notion of closeness among different values of θ:
    d(\theta_1, \theta_2) := \mathrm{KL}(p_{\theta_1}(z) \,\|\, p_{\theta_2}(z)).
6 / 18
Natural Gradient
d(\theta_1, \theta_2) := \mathrm{KL}(p_{\theta_1}(z) \,\|\, p_{\theta_2}(z)).
Let’s see how d behaves near a given θ.
    \mathrm{KL}(p_\theta(z) \,\|\, p_{\theta+\Delta\theta}(z))
        = \mathbb{E}_z[\log p_\theta] - \mathbb{E}_z[\log p_\theta]
          - \mathbb{E}_z[\nabla_\theta \log p_\theta]^\top \Delta\theta + O(\Delta\theta^2)
        = O(\Delta\theta^2),
since the expected score \mathbb{E}_z[\nabla_\theta \log p_\theta] is zero. So the first-order
approximation carries no useful information about d.
7 / 18
Natural Gradient
d(\theta_1, \theta_2) := \mathrm{KL}(p_{\theta_1}(z) \,\|\, p_{\theta_2}(z)).
Let’s see how d behaves near a given θ.
    \mathrm{KL}(p_\theta(z) \,\|\, p_{\theta+\Delta\theta}(z))
        = -\tfrac{1}{2} \Delta\theta^\top \mathbb{E}_z[\nabla_\theta^2 \log p_\theta]\, \Delta\theta + O(\Delta\theta^3)
        \approx \tfrac{1}{2} \Delta\theta^\top F_\theta\, \Delta\theta,
where
    F_\theta = \mathbb{E}_z[-\nabla_\theta^2 \log p_\theta]
             = \mathbb{E}_z[\nabla_\theta \log p_\theta\, \nabla_\theta \log p_\theta^\top].
The Fisher Information Matrix is the expected negative Hessian of log p.
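A hedged numerical sketch (not from the slides): for a univariate Gaussian p_θ(z) with θ = (µ, log σ), the two expressions for F_θ above — the score outer product and the negative expected Hessian — can be estimated by Monte Carlo and compared. The model, sample size, and parameter values below are all assumptions made for illustration:

```python
# Estimate the Fisher matrix of a univariate Gaussian p_θ(z), θ = (µ, log σ),
# both as E_z[∇log p ∇log pᵀ] and as −E_z[∇² log p], and check that they agree.
import jax
import jax.numpy as jnp

def log_p(theta, z):
    mu, log_sigma = theta
    sigma = jnp.exp(log_sigma)
    return -0.5 * ((z - mu) / sigma) ** 2 - log_sigma - 0.5 * jnp.log(2 * jnp.pi)

theta = jnp.array([0.0, 0.0])                                   # µ = 0, σ = 1 (toy values)
key = jax.random.PRNGKey(0)
z = theta[0] + jnp.exp(theta[1]) * jax.random.normal(key, (100_000,))  # z ~ p_θ

grads = jax.vmap(jax.grad(log_p), in_axes=(None, 0))(theta, z)
F_outer = (grads[:, :, None] * grads[:, None, :]).mean(axis=0)  # E[∇log p ∇log pᵀ]
F_hess = -jax.vmap(jax.hessian(log_p), in_axes=(None, 0))(theta, z).mean(axis=0)
print(F_outer)   # ≈ diag(1/σ², 2) for this parameterization
print(F_hess)    # the two estimates agree up to Monte Carlo error
```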
8 / 18
Natural Gradient
We change the constraint of
    \arg\min_{\Delta\theta} \; \nabla_\theta L(\theta)^\top \Delta\theta
    \quad \text{s.t.} \quad \|\Delta\theta\| = \text{const}
to
    \text{s.t.} \quad \mathrm{KL}(p_\theta(z) \,\|\, p_{\theta+\Delta\theta}(z)) = \text{const}.
9 / 18
Natural Gradient
We change the constraint of
    \arg\min_{\Delta\theta} \; \nabla_\theta L(\theta)^\top \Delta\theta
    \quad \text{s.t.} \quad \|\Delta\theta\| = \text{const}
to
    \text{s.t.} \quad \mathrm{KL}(p_\theta(z) \,\|\, p_{\theta+\Delta\theta}(z)) = \text{const}.
The solution, obtained by minimizing the Lagrangian
    \nabla_\theta L(\theta)^\top \Delta\theta + \tfrac{1}{2} \lambda\, \Delta\theta^\top F_\theta\, \Delta\theta,
is
    \Delta\theta \propto -F_\theta^{-1} \nabla_\theta L(\theta).
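A minimal sketch (not from the slides) of one natural-gradient step Δθ ∝ −F⁻¹∇θL for the same Gaussian family, using its closed-form Fisher; the data, step size, and initialization are made up:

```python
# One natural-gradient step for a Gaussian model with θ = (µ, log σ),
# where the loss is the negative log-likelihood of some fixed toy data.
import jax
import jax.numpy as jnp

def log_p(theta, z):
    mu, log_sigma = theta
    return -0.5 * ((z - mu) / jnp.exp(log_sigma)) ** 2 - log_sigma

def loss(theta, data):
    return -jax.vmap(log_p, in_axes=(None, 0))(theta, data).mean()

def fisher(theta):
    # closed-form Fisher of the model distribution for (µ, log σ): diag(1/σ², 2)
    sigma2 = jnp.exp(2 * theta[1])
    return jnp.diag(jnp.array([1.0 / sigma2, 2.0]))

data = jnp.array([1.5, 2.0, 2.5, 3.0])          # made-up observations
theta = jnp.array([0.0, 0.0])
alpha = 0.5                                     # hypothetical step size
g = jax.grad(loss)(theta, data)
nat_step = -alpha * jnp.linalg.solve(fisher(theta), g)   # F⁻¹ ∇θ L without forming F⁻¹
print(nat_step, -alpha * g)                     # natural vs. ordinary gradient step
```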
10 / 18
Natural Gradient
For η defined as
    \eta - \eta_0 = F^{1/2} (\theta - \theta_0),
the Fisher ball becomes the Euclidean unit ball. Thus, in the parameter space of η,
the natural gradient is the same as the ordinary gradient.
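To make this concrete, here is a short derivation (a sketch, treating F as a fixed matrix evaluated at θ₀ and using that F^{1/2} is symmetric):
    \eta - \eta_0 = F^{1/2}(\theta - \theta_0)
    \;\Longrightarrow\;
    \frac{\partial \theta}{\partial \eta} = F^{-1/2},
    \qquad
    \nabla_\eta L = F^{-1/2}\, \nabla_\theta L,
so a gradient step in η, mapped back to θ-space, is
    \Delta\theta = F^{-1/2}\, \Delta\eta = -\alpha\, F^{-1/2}\, \nabla_\eta L = -\alpha\, F^{-1}\, \nabla_\theta L,
i.e. exactly the natural gradient update.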
11 / 18
Second-order Optimization
\Delta\theta = \arg\min_{\delta} M(\delta), \qquad
M(\delta) := \tfrac{1}{2}\, \delta^\top B\, \delta + \nabla h(\theta)^\top \delta + h(\theta).
The solution is \Delta\theta = -B^{-1} \nabla h(\theta).
12 / 18
Second-order Optimization
\Delta\theta = \arg\min_{\delta} M(\delta), \qquad
M(\delta) := \tfrac{1}{2}\, \delta^\top B\, \delta + \nabla h(\theta)^\top \delta + h(\theta).
The solution is \Delta\theta = -B^{-1} \nabla h(\theta).
B = \beta I: Gradient Descent.
B = H(\theta): Newton’s Method.
Newton’s method assumes h is convex, and fails otherwise.
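A small sketch (not from the slides) of the quadratic-model update Δθ = −B⁻¹∇h(θ) for the two choices of B above; h, β, and the starting point are made-up toy values:

```python
# The quadratic-model update Δθ = −B⁻¹ ∇h(θ) with B = βI (gradient descent)
# and B = H(θ) (Newton's method), on a toy convex quadratic.
import jax
import jax.numpy as jnp

def h(theta):
    A = jnp.array([[3.0, 1.0], [1.0, 2.0]])        # made-up positive-definite curvature
    return 0.5 * theta @ A @ theta + jnp.array([1.0, -1.0]) @ theta

theta = jnp.array([0.0, 0.0])
g = jax.grad(h)(theta)

beta = 10.0
step_gd = -jnp.linalg.solve(beta * jnp.eye(2), g)           # B = βI  → scaled gradient descent
step_newton = -jnp.linalg.solve(jax.hessian(h)(theta), g)   # B = H(θ) → Newton's method
print(step_gd, step_newton, h(theta + step_newton))         # Newton hits the exact minimum here
```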
13 / 18
Second-order Optimization
The Generalized Gauss-Newton Matrix
Let our loss function be L(f(x, θ)).
    H_{ij} = \frac{\partial}{\partial \theta_j} \frac{\partial L(f(x,\theta))}{\partial \theta_i}
           = \sum_{k} \frac{\partial}{\partial \theta_j}
             \left( \frac{\partial L(f(x,\theta))}{\partial f_k(x,\theta)}
                    \frac{\partial f_k(x,\theta)}{\partial \theta_i} \right)
           = \sum_{k} \sum_{l}
             \frac{\partial^2 L(f(x,\theta))}{\partial f_k(x,\theta)\, \partial f_l(x,\theta)}
             \frac{\partial f_l(x,\theta)}{\partial \theta_j}
             \frac{\partial f_k(x,\theta)}{\partial \theta_i}
           + \sum_{k}
             \frac{\partial L(f(x,\theta))}{\partial f_k(x,\theta)}
             \frac{\partial^2 f_k(x,\theta)}{\partial \theta_j\, \partial \theta_i}
\therefore\quad H = J_f^\top H_L J_f + \sum_{k} \frac{\partial L}{\partial f_k} H_{f_k}
                  \approx J_f^\top H_L J_f = G
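A hedged sketch (not from the slides) of this matrix for a tiny one-layer tanh model with a squared-error loss; the architecture, data, and parameter values are all toy choices, not the paper's setup:

```python
# The generalized Gauss–Newton matrix G = J_fᵀ H_L J_f, compared with the full Hessian H.
import jax
import jax.numpy as jnp

x = jnp.array([1.0, -2.0])
y = jnp.array([0.5, 1.5])

def f(theta, x):
    W = theta.reshape(2, 2)
    return jnp.tanh(W @ x)                   # model output z = f(x, θ)

def L(z):
    return 0.5 * jnp.sum((z - y) ** 2)       # loss as a function of the output alone

theta = jnp.array([0.1, 0.2, -0.3, 0.4])

J = jax.jacobian(f)(theta, x)                # ∂f/∂θ, shape (2, 4)
H_L = jax.hessian(L)(f(theta, x))            # ∂²L/∂z², shape (2, 2)
G = J.T @ H_L @ J                            # generalized Gauss–Newton matrix
H = jax.hessian(lambda t: L(f(t, x)))(theta) # full Hessian of L(f(x, θ))
print(jnp.linalg.eigvalsh(G))                # G is positive semidefinite
print(jnp.linalg.eigvalsh(H))                # H need not be
```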
14 / 18
Second-order Optimization
Why G instead of H?
G is positive semidefinite; H is not.
Let’s expand H further, assuming f is a feedforward neural network:
    \cdots \xrightarrow{\;W_i\;} s_i \xrightarrow{\;\phi\;} a_i \xrightarrow{\;W_{i+1}\;} s_{i+1} \cdots
    H - G = \sum_{i=1}^{l} \sum_{j=1}^{m_i} (\nabla_{a_i} L)_j \; J_{s_i}^\top H[\phi(s_i)_j]\, J_{s_i}
The remaining term is the sum of curvature terms coming
from each intermediate activation. These curvature terms are
subject to more frequent change.
In ReLU networks, H[\phi(s_i)] = 0 almost everywhere.
15 / 18
F and G
Define the conditional distribution r by
    p(y; x, \theta) = r(y \mid z), \qquad z = f(x, \theta).
We have
    \nabla_\theta \log p(y; x, \theta) = J_f^\top \nabla_z \log r(y \mid z),
so
    F = \mathbb{E}_{x,\, y \sim p}\!\left[ \nabla_\theta \log p(y; x, \theta)\, \nabla_\theta \log p(y; x, \theta)^\top \right]
      = \mathbb{E}_x\!\left[ J_f^\top\, \mathbb{E}_{y \sim r}\!\left[ \nabla_z \log r(y \mid z)\, \nabla_z \log r(y \mid z)^\top \right] J_f \right]
      = \mathbb{E}_x\!\left[ J_f^\top F_R\, J_f \right],
    G = \mathbb{E}_x\!\left[ J_f^\top H_L\, J_f \right].
\therefore\; F = G \text{ when } F_R = H_L.
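A hedged numerical check (not from the slides): for a toy linear-softmax model and a single input, the Fisher computed from parameter-space scores matches the GGN computed through the logits. Every size and value below is an illustrative assumption:

```python
# For a linear-softmax model, F = E_{y~p(·|x,θ)}[∇θ log p ∇θ log pᵀ] (one fixed x here)
# coincides with the GGN G = J_fᵀ H_L J_f built through the logits z = f(x, θ).
import jax
import jax.numpy as jnp

x = jnp.array([1.0, -0.5])                           # one made-up input
theta = jnp.array([0.3, -0.2, 0.1, 0.4, -0.6, 0.2])  # flattened 3x2 weight matrix

def f(theta, x):
    return theta.reshape(3, 2) @ x                   # logits z = f(x, θ)

def log_p(theta, y):
    return jax.nn.log_softmax(f(theta, x))[y]        # log p(y; x, θ)

p = jax.nn.softmax(f(theta, x))

# Fisher in parameter space, via the expected score outer product.
scores = jax.vmap(jax.grad(log_p), in_axes=(None, 0))(theta, jnp.arange(3))
F = jnp.einsum('y,yi,yj->ij', p, scores, scores)

# GGN through the output: J_fᵀ H_L J_f, with H_L the logit Hessian of cross-entropy.
J = jax.jacobian(f)(theta, x)                        # shape (3, 6)
H_L = jnp.diag(p) - jnp.outer(p, p)
G = J.T @ H_L @ J

print(jnp.allclose(F, G, atol=1e-6))                 # True
```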
16 / 18
F and G
We know that F = G when F_R = H_L. When does this occur?
Let L(y, z) = -\log r(y \mid z). Then
    F_R = -\mathbb{E}_{y \sim R_{y \mid f(x,\theta)}}\!\left[ H_{\log r} \right],
    \qquad
    H_L = -\mathbb{E}_{(x,y)}\!\left[ H_{\log r} \right].
\therefore\; F_R = H_L \text{ holds when }
    \mathbb{E}_{y \sim R_{y \mid f(x,\theta)}}\!\left[ H_{\log r} \right]
    = \mathbb{E}_{(x,y)}\!\left[ H_{\log r} \right].
17 / 18
F and G
We know that F = G when F_R = H_L. When does this occur?
Let L(y, z) = -\log r(y \mid z). Then
    F_R = -\mathbb{E}_{y \sim R_{y \mid f(x,\theta)}}\!\left[ H_{\log r} \right],
    \qquad
    H_L = -\mathbb{E}_{(x,y)}\!\left[ H_{\log r} \right].
\therefore\; F_R = H_L \text{ holds when }
    \mathbb{E}_{y \sim R_{y \mid f(x,\theta)}}\!\left[ H_{\log r} \right]
    = \mathbb{E}_{(x,y)}\!\left[ H_{\log r} \right].
This holds when r(y | z) is an exponential family with natural parameters z:
    \log r(y \mid z) = z^\top T(y) - \log Z(z),
since in this case the Hessian H_{\log r} = -\nabla_z^2 \log Z(z) does not depend on y.
Most commonly used losses satisfy this property, including the
MSE loss and cross-entropy loss.
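A hedged sketch (not from the slides) of the exponential-family case: for softmax cross-entropy, the output-space Fisher F_R and the output-space Hessian H_L coincide and do not depend on y. The logits below are arbitrary toy values:

```python
# For softmax cross-entropy, F_R = E_{y~r(·|z)}[∇z log r ∇z log rᵀ] equals
# H_L = ∇²z L(y, z) = -∇²z log r(y|z), for every y — the condition F = G above.
import jax
import jax.numpy as jnp

z = jnp.array([2.0, -1.0, 0.5])                    # arbitrary logits
p = jax.nn.softmax(z)

def log_r(z, y):
    return jax.nn.log_softmax(z)[y]                # log r(y | z)

# Output-space Fisher: expectation over y ~ r(· | z) of the score outer products.
scores = jax.vmap(jax.grad(log_r), in_axes=(None, 0))(z, jnp.arange(3))
F_R = jnp.einsum('y,yi,yj->ij', p, scores, scores)

# Output-space Hessian of the loss L(y, z) = -log r(y | z); the same for every y.
H_L = -jax.hessian(log_r)(z, 0)

print(jnp.allclose(F_R, H_L, atol=1e-6))           # True: diag(p) − p pᵀ in both cases
```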
18 / 18
Summary
Roughly speaking, the natural gradient is the direction of
steepest change of loss in function space.
The natural gradient is invariant under reparameterization.
For most neural networks of interest, natural gradient descent
is identical to a second-order method (GGN).
19 / 18
References I
[1] Shun-Ichi Amari. “Natural Gradient Works Efficiently in
Learning”. In: Neural Comput. (1998).
[2] James Martens. “Deep learning via Hessian-free
optimization”. In: Proceedings of the International Conference
on Machine Learning (ICML) (2010).
[3] James Martens. New insights and perspectives on the natural
gradient method. Preprint arXiv:1412.1193. 2014.
[4] James Martens and Roger B. Grosse. “Optimizing Neural
Networks with Kronecker-factored Approximate Curvature”.
In: Proceedings of the International Conference on Machine
Learning (ICML) (2015).
[5] Hyeyoung Park, Shun-Ichi Amari, and Kenji Fukumizu.
“Adaptive natural gradient learning algorithms for various
stochastic models”. In: Neural Networks (2000).
20 / 18
References II
[6] Razvan Pascanu and Yoshua Bengio. “Revisiting Natural
Gradient for Deep Networks”. In: Proceedings of the
International Conference on Learning Representations (ICLR)
(2014).
[7] Oriol Vinyals and Daniel Povey. “Krylov Subspace Descent for
Deep Learning”. In: Proceedings of the International
Conference on Artificial Intelligence and Statistics (AISTATS)
(2012).
21 / 18
Thank You