The document discusses optimization techniques for training neural networks, covering concepts such as stochastic gradient descent, surrogate loss functions, and early stopping. It addresses the challenges posed by non-convex optimization landscapes, including local minima and saddle points, and describes practical algorithms for mitigating these issues. It also details learning-rate schedules and adaptive methods such as RMSprop and Adam, which adjust step sizes during training to improve convergence.
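As a concrete illustration of several ideas named above, the following minimal NumPy sketch (not taken from the document; all variable names, data, and hyperparameters are illustrative assumptions) trains a linear classifier with minibatch stochastic gradient descent on a surrogate loss (the logistic loss standing in for 0-1 error), uses a simple decaying learning-rate schedule, and applies early stopping based on a held-out validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
X = rng.normal(size=(1000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + 0.5 * rng.normal(size=1000) > 0).astype(float)
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

def logistic_loss(w, X, y):
    # Surrogate loss: log loss is a smooth, differentiable stand-in for 0-1 error.
    s = 2 * y - 1                      # labels in {-1, +1}
    return np.mean(np.log1p(np.exp(-s * (X @ w))))

def grad(w, X, y):
    # Gradient of the mean logistic loss with respect to w.
    s = 2 * y - 1
    z = X @ w
    return -(X.T @ (s / (1 + np.exp(s * z)))) / len(y)

w = np.zeros(20)
lr0, decay = 0.5, 0.01                 # initial learning rate and decay constant
batch_size, patience = 32, 10
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(200):
    lr = lr0 / (1 + decay * epoch)     # simple 1/t-style learning-rate schedule
    perm = rng.permutation(len(y_train))
    for start in range(0, len(y_train), batch_size):
        idx = perm[start:start + batch_size]
        w -= lr * grad(w, X_train[idx], y_train[idx])   # minibatch SGD step
    val = logistic_loss(w, X_val, y_val)
    if val < best_val - 1e-4:          # early stopping: track best validation loss
        best_val, best_w, bad_epochs = val, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # stop when validation loss stops improving
            break

w = best_w                             # keep the parameters with the best validation loss
```

Adaptive methods such as RMSprop and Adam would replace the fixed schedule above with per-parameter step sizes derived from running estimates of gradient magnitudes (and, for Adam, a running mean of the gradient), which is often helpful on the non-convex loss surfaces the document describes.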