Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)


https://telecombcn-dl.github.io/2017-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both an algorithmic and a computational perspective.


1. [course site] Optimization for neural network training. Day 4 Lecture 1. #DLUPC
Verónica Vilaplana (veronica.vilaplana@upc.edu), Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)
2. Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation
But… What type of optimization problem is this? Do local minima and saddle points cause problems? Does gradient descent perform well? How do we set the learning rate? How do we initialize the weights? How does batch size affect training?
3. Index
• Optimization for a machine learning task; differences between learning and pure optimization
• Expected and empirical risk
• Surrogate loss functions and early stopping
• Batch and mini-batch algorithms
• Challenges
• Local minima
• Saddle points and other flat regions
• Cliffs and exploding gradients
• Practical algorithms
• Stochastic Gradient Descent
• Momentum
• Nesterov momentum
• Learning rate
• Adaptive learning rates: AdaGrad, RMSProp, Adam
• Approximate second-order methods
• Parameter initialization
• Batch Normalization
4. Differences between learning and pure optimization
5. Optimization for NN training
• Goal: find the parameters that minimize the expected risk (generalization error)
J(θ) = E_(x,y)∼p_data [ L(f(x;θ), y) ]
where x is the input, f(x;θ) the predicted output, y the target output, E the expectation, p_data the true (unknown) data distribution, and L the loss function (how wrong predictions are).
• But we only have a training set of samples, so we minimize the empirical risk, the average loss on a finite dataset D:
J(θ) = E_(x,y)∼p̂_data [ L(f(x;θ), y) ] = (1/|D|) Σ_{(x⁽ⁱ⁾,y⁽ⁱ⁾)∈D} L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
where p̂_data is the empirical distribution and |D| is the number of examples in D.
6. Surrogate loss
• Often minimizing the real loss is intractable
• e.g. the 0-1 loss (0 if correctly classified, 1 if not) is intractable even for linear classifiers (Marcotte 1992)
• Minimize a surrogate loss instead
• e.g. the negative log-likelihood as a surrogate for the 0-1 loss
• Sometimes the surrogate loss may learn more
• the test 0-1 loss keeps decreasing even after the training 0-1 loss reaches zero
• by further pushing the classes apart from each other
Figures: 0-1 loss (blue) and surrogate losses (square, hinge, logistic); 0-1 loss (blue) and negative log-likelihood (red)
7. Surrogate loss functions
Binary classifier
• Probabilistic: outputs the probability of class 1, g(x) ≈ P(y=1 | x); the probability of class 0 is 1 − g(x).
Binary cross-entropy loss: L(g(x), y) = −( y log g(x) + (1−y) log(1−g(x)) )
Decision function: f(x) = I[g(x) > 0.5]
• Non-probabilistic: outputs a «score» g(x) for class 1; the score for the other class is −g(x).
Hinge loss: L(g(x), t) = max(0, 1 − t g(x)), where t = 2y − 1
Decision function: f(x) = I[g(x) > 0]
Multiclass classifier
• Probabilistic: outputs a vector of probabilities g(x) ≈ ( P(y=0|x), …, P(y=m−1|x) ).
Negative conditional log-likelihood loss: L(g(x), y) = −log g(x)_y
Decision function: f(x) = argmax(g(x))
• Non-probabilistic: outputs a vector g(x) of real-valued scores for the m classes.
Multiclass margin loss: L(g(x), y) = max(0, 1 + max_{k≠y} g(x)_k − g(x)_y)
Decision function: f(x) = argmax(g(x))
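To make the binary losses above concrete, here is a minimal NumPy sketch (not part of the slides; the function names and toy values are illustrative) that evaluates the binary cross-entropy on predicted probabilities and the hinge loss on raw scores:

```python
import numpy as np

def binary_cross_entropy(g, y, eps=1e-12):
    """L(g(x), y) = -(y log g(x) + (1 - y) log(1 - g(x))) for g(x) in (0, 1)."""
    g = np.clip(g, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def hinge_loss(g, y):
    """L(g(x), t) = max(0, 1 - t g(x)) with t = 2y - 1 in {-1, +1}."""
    t = 2 * y - 1
    return np.maximum(0.0, 1 - t * g)

# Toy check: confident correct predictions have low loss, wrong ones high loss.
y = np.array([1, 0, 1])
g_prob = np.array([0.9, 0.2, 0.4])        # probabilistic classifier outputs
g_score = np.array([2.0, -1.5, -0.3])     # raw scores for the non-probabilistic case
print(binary_cross_entropy(g_prob, y))
print(hinge_loss(g_score, y))
```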
8. Early stopping
• Training algorithms usually do not halt at a local minimum
• Early stopping:
• based on the true underlying loss (e.g. the 0-1 loss) measured on a validation set
• the number of training steps is a hyperparameter controlling the effective capacity of the model
• simple and effective, but we must keep a copy of the best parameters
• acts as a regularizer (Bishop 1995, …)
Figure: the training error decreases steadily while the validation error begins to increase; return the parameters at the point with the lowest validation error.
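A minimal sketch of the early-stopping loop described above. It assumes hypothetical `train_one_epoch` and `validation_error` callables and a `model.parameters` attribute; it keeps a copy of the best parameters and stops after `patience` epochs without improvement on the validation set:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=10):
    """Keep the parameters with the lowest validation error (early stopping)."""
    best_error = float("inf")
    best_params = copy.deepcopy(model.parameters)      # assumed attribute
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                         # one pass of (mini-batch) SGD
        error = validation_error(model)                # e.g. 0-1 loss on the validation set
        if error < best_error:
            best_error = error
            best_params = copy.deepcopy(model.parameters)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience: # validation error stopped improving
                break
    model.parameters = best_params                     # restore the best checkpoint
    return model, best_error
```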
9. Batch and mini-batch algorithms
• In most optimization methods used in ML the objective function decomposes as a sum over a training set
• Gradient descent, with examples {x⁽ⁱ⁾}_{i=1…m} from the training set and corresponding targets {y⁽ⁱ⁾}_{i=1…m}:
∇_θ J(θ) = E_(x,y)∼p̂_data [ ∇_θ L(f(x;θ), y) ] = (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• Using the complete training set can be very expensive: the gain from using more samples is less than linear (the standard error of the mean drops proportionally to sqrt(m)), and the training set may be redundant, so use a subset of the training set
• How many samples in each update step?
• Deterministic or batch gradient methods: process all training samples in one large batch
• Stochastic methods: use a single example at a time
• online methods: samples are drawn from a stream of continually created samples
• Mini-batch stochastic methods: use several (but not all) samples
10. Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimate of the gradient, but with less than linear return
• Very small batches: multicore architectures are under-utilized
• If samples are processed in parallel, memory scales with batch size
• Smaller batches provide noisier gradient estimates
• Small batches may offer a regularizing effect (they add noise)
• but they may require a small learning rate
• and may increase the number of steps needed for convergence
Mini-batches should be selected randomly (shuffle the samples) to obtain unbiased estimates of the gradient.
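A small NumPy experiment on an assumed toy least-squares problem (not from the slides) illustrating the points above: the random mini-batch gradient is an unbiased but noisy estimate of the full-batch gradient, and its error shrinks only like 1/sqrt(m):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))                       # toy dataset
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10000)
theta = np.zeros(5)

def gradient(X_s, y_s, theta):
    """Average gradient of the squared error 0.5*(x·θ − y)² over the given samples."""
    return X_s.T @ (X_s @ theta - y_s) / len(y_s)

full = gradient(X, y, theta)                          # deterministic (batch) gradient
for m in (10, 100, 1000):
    errs = []
    for _ in range(200):
        idx = rng.choice(len(y), size=m, replace=False)   # random mini-batch
        errs.append(np.linalg.norm(gradient(X[idx], y[idx], theta) - full))
    # error drops roughly like 1/sqrt(m): less than linear return for larger batches
    print(m, np.mean(errs))
```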
11. Challenges in NN optimization
12. Local minima
• Convex optimization
• any local minimum is a global minimum
• there are several (polynomial-time) optimization algorithms
• Non-convex optimization
• the objective function in deep networks is non-convex
• deep models may have several local minima
• but this is not necessarily a major problem!
13. Local minima and saddle points
• Critical points: for f: ℝⁿ → ℝ, points where ∇_x f(x) = 0
• For high-dimensional loss functions, local minima are rare compared to saddle points
• Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j); it is real and symmetric, so it has an eigenvector/eigenvalue decomposition
• Intuition from the eigenvalues of the Hessian matrix:
• local minimum/maximum: all positive / all negative eigenvalues, which becomes exponentially unlikely as n grows
• saddle point: both positive and negative eigenvalues
Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
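A short NumPy illustration of the eigenvalue intuition above: classify a critical point from the spectrum of its (symmetric) Hessian. The example Hessian is an assumption for illustration, taken from f(x, y) = x² − y², which has a saddle at the origin:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"                 # all eigenvalues positive
    if np.all(eigvals < -tol):
        return "local maximum"                 # all eigenvalues negative
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"                  # mixed signs
    return "degenerate (some eigenvalues close to 0)"

H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])             # Hessian of f(x, y) = x^2 - y^2 at (0, 0)
print(classify_critical_point(H_saddle))       # -> "saddle point"
```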
14. Local minima and saddle points
• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to the global optimum
• Finding a local minimum is good enough
• For many random functions, local minima are more likely to have low cost than high cost.
Figure: value of the local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points. As the number of parameters increases, local minima tend to cluster more tightly.
Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
15. Saddle points
• How to escape from saddle points?
• First-order methods
• are initially attracted to saddle points, but unless they hit one exactly, they are repelled when close
• hitting the critical point exactly is unlikely (the estimated gradient is noisy)
• saddle points are very unstable: the noise of stochastic gradient descent helps convergence, and the trajectory escapes quickly
• Second-order methods:
• Newton's method can jump to saddle points (where the gradient is 0)
SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it. (Slide credit: K. McGuinness)
16. Other difficulties
• Cliffs and exploding gradients
• Nets with many layers / recurrent nets can contain very steep regions (cliffs, arising from the multiplication of several parameters): gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)
• Long-term dependencies:
• the computational graph becomes very deep: vanishing and exploding gradients
17. Algorithms
18. Stochastic Gradient Descent (SGD)
• The most used algorithm for deep learning
• Do not confuse it with deterministic gradient descent: stochastic uses mini-batches
Algorithm
• Require: learning rate α, initial parameter θ
• while the stopping criterion is not met do
• sample a mini-batch of m examples {x⁽ⁱ⁾}_{i=1…m} from the training set with corresponding targets {y⁽ⁱ⁾}_{i=1…m}
• compute the gradient estimate: ĝ ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• apply the update: θ ← θ − α ĝ
• end while
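A minimal NumPy sketch of the SGD loop above, applied to an assumed toy linear least-squares problem (a fixed number of epochs stands in for the stopping criterion):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy dataset
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta, X_b, y_b):
    """Average gradient of 0.5*(x·θ − y)² over the mini-batch."""
    return X_b.T @ (X_b @ theta - y_b) / len(y_b)

theta = np.zeros(5)
alpha, batch_size = 0.05, 32
for epoch in range(20):                              # stopping criterion: fixed #epochs
    perm = rng.permutation(len(y))                   # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]         # mini-batch of m examples
        g_hat = grad(theta, X[idx], y[idx])          # gradient estimate ĝ
        theta -= alpha * g_hat                       # θ ← θ − α ĝ
print(theta)                                         # close to true_theta
```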
19. Momentum
• Designed to accelerate learning, especially for high-curvature, small-but-consistent gradients or noisy gradients
• Momentum aims to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient
Figure: contour lines of a quadratic loss with a poorly conditioned Hessian; the path (red) followed by SGD (left) and by momentum (right).
20. Momentum
• New variable v (velocity): the direction and speed at which the parameters move, an exponentially decaying average of the negative gradient
Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update rule:
• compute the gradient estimate: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• compute the velocity update: v ← λv − αg
• apply the update: θ ← θ + v
• Typical values: λ = 0.5, 0.9, 0.99 (in [0,1))
• The size of the step depends on how large and how aligned a sequence of gradients is.
• Read the physical analogy in the Deep Learning book (Goodfellow et al.)
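A minimal NumPy sketch of the momentum update above, applied (as an assumption, not taken from the slides) to a quadratic loss with a poorly conditioned Hessian like the one in the previous figure:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.005, lam=0.9):
    """One momentum update: v ← λv − αg,  θ ← θ + v."""
    v = lam * v - alpha * grad
    return theta + v, v

# Quadratic J(θ) = 0.5 θᵀ diag(1, 100) θ: high curvature in one direction, low in the other
A = np.diag([1.0, 100.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = A @ theta                         # exact gradient of the quadratic
    theta, v = momentum_step(theta, v, g)
print(theta)                              # approaches the minimum at the origin
```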
21. Nesterov accelerated gradient (NAG)
• A variant of momentum, where the gradient is evaluated after the current velocity is applied:
• approximate where the parameters will be on the next time step using the current velocity
• update the velocity using the gradient at the point where we predict the parameters will be
Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update:
• apply the interim update: θ̃ ← θ + λv
• compute the gradient (at the interim point): g ← (1/m) Σ_i ∇_θ̃ L(f(x⁽ⁱ⁾;θ̃), y⁽ⁱ⁾)
• compute the velocity update: v ← λv − αg
• apply the update: θ ← θ + v
• Interpretation: add a correction factor to momentum (a NumPy sketch follows after the next slide)
22. Nesterov accelerated gradient (NAG)
Figure: standard momentum uses the gradient at the current location, ∇L(w_t), to go from v_t to v_{t+1}; Nesterov uses the gradient at the location predicted from velocity alone, ∇L(w_t + γv_t). (Slide credit: K. McGuinness)
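A minimal NumPy sketch of the NAG update from slide 21, again on an assumed poorly conditioned quadratic; the only difference from plain momentum is where the gradient is evaluated:

```python
import numpy as np

def nag_step(theta, v, grad_fn, alpha=0.005, lam=0.9):
    """Nesterov update: evaluate the gradient at the interim point θ̃ = θ + λv."""
    theta_interim = theta + lam * v       # look ahead using the current velocity
    g = grad_fn(theta_interim)            # gradient where we predict the parameters will be
    v = lam * v - alpha * g               # v ← λv − αg
    return theta + v, v                   # θ ← θ + v

A = np.diag([1.0, 100.0])                 # same toy quadratic as in the momentum sketch
grad_fn = lambda th: A @ th
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn)
print(theta)
```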
23. SGD: learning rate
• The learning rate is a crucial parameter for SGD
• Too large: overshoots the local minimum, the loss increases
• Too small: makes very slow progress, can get stuck
• Good learning rate: makes steady progress toward the local minimum
• In practice it is necessary to gradually decrease the learning rate (t = iteration number):
• step decay (e.g. decay by half every few epochs)
• exponential decay: α = α₀ e^(−kt)
• 1/t decay: α = α₀ / (1 + kt)
• Sufficient conditions for convergence: Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
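The three decay schedules above written as small Python functions (a sketch; the constants are illustrative choices, not values from the slides):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Step decay: multiply by `drop` every `every` epochs."""
    return alpha0 * drop ** (epoch // every)

def exponential_decay(alpha0, t, k=0.01):
    """Exponential decay: α = α₀ · exp(−k t)."""
    return alpha0 * np.exp(-k * t)

def inv_t_decay(alpha0, t, k=0.01):
    """1/t decay: α = α₀ / (1 + k t)."""
    return alpha0 / (1.0 + k * t)

alpha0 = 0.1
for t in (0, 10, 100, 1000):
    print(t, step_decay(alpha0, t), exponential_decay(alpha0, t), inv_t_decay(alpha0, t))
```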
24. Adaptive learning rates
• The learning rate is one of the hyperparameters that is most difficult to set, and it has a significant impact on model performance
• The cost is often sensitive to some directions in parameter space and insensitive to others
• Momentum/Nesterov mitigate this issue but introduce another hyperparameter
• Solution: use a separate learning rate for each parameter and automatically adapt it over the course of learning
• Algorithms (mini-batch based):
• AdaGrad
• RMSProp
• Adam
• RMSProp with Nesterov momentum
25. AdaGrad
• Adapts the learning rate of each parameter based on the sizes of its previous updates:
• scales updates to be larger for parameters that are updated less
• scales updates to be smaller for parameters that are updated more
• The net effect is greater progress in the more gently sloped directions of parameter space
• Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease of the effective learning rate
• Require: learning rate α, initial parameter θ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
• compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• accumulate the squared gradient: r ← r + g ⊙ g (sum of all previous squared gradients)
• compute the update: Δθ ← −(α / (δ + √r)) ⊙ g (updates inversely proportional to the square root of the sum)
• apply the update: θ ← θ + Δθ
Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
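A minimal NumPy sketch of the AdaGrad update above, run on the same assumed toy quadratic used earlier (so the exact gradient replaces the mini-batch estimate):

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha=0.1, delta=1e-7):
    """AdaGrad: r ← r + g⊙g,  Δθ ← −α g / (δ + √r)."""
    r = r + grad * grad                            # accumulate all squared gradients
    theta = theta - alpha * grad / (delta + np.sqrt(r))
    return theta, r

A = np.diag([1.0, 100.0])                          # gently vs. steeply sloped directions
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, A @ theta)
print(theta)                                       # per-parameter step sizes shrink over time
```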
26. Root Mean Square Propagation (RMSProp)
• Modifies AdaGrad to perform better on non-convex surfaces, where its learning rates decay too aggressively
• Changes the gradient accumulation into an exponentially decaying average of the squared gradients
• Require: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
• compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• accumulate the squared gradient: r ← ρr + (1−ρ) g ⊙ g
• compute the update: Δθ ← −(α / (δ + √r)) ⊙ g
• apply the update: θ ← θ + Δθ
• It can be combined with Nesterov momentum
Geoff Hinton, unpublished
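The same sketch with the RMSProp accumulation rule; compared with the AdaGrad sketch, essentially only the update of r changes (plus a smaller step size, an illustrative choice):

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha=0.01, rho=0.9, delta=1e-7):
    """RMSProp: r ← ρr + (1−ρ) g⊙g,  Δθ ← −α g / (δ + √r)."""
    r = rho * r + (1.0 - rho) * grad * grad        # exponentially decaying average
    theta = theta - alpha * grad / (delta + np.sqrt(r))
    return theta, r

A = np.diag([1.0, 100.0])                          # assumed toy quadratic, as before
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, A @ theta)
print(theta)                                       # moves steadily toward the origin
```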
27. ADAptive Moments (Adam)
• A combination of RMSProp and momentum, but:
• it keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
• it includes bias corrections (first and second moments) to account for their initialization at the origin
• Update (typical values δ = 10⁻⁸, ρ₁ = 0.9, ρ₂ = 0.999; t = time step):
• compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x⁽ⁱ⁾;θ), y⁽ⁱ⁾)
• update the biased first moment estimate: s ← ρ₁s + (1−ρ₁)g
• update the biased second moment: r ← ρ₂r + (1−ρ₂) g ⊙ g
• correct the biases: ŝ ← s / (1−ρ₁ᵗ), r̂ ← r / (1−ρ₂ᵗ)
• compute the update: Δθ ← −α ŝ / (δ + √r̂) (operations applied elementwise)
• apply the update: θ ← θ + Δθ
Kingma et al., Adam: a Method for Stochastic Optimization. ICLR 2015
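A minimal NumPy sketch of the Adam update above (t is the time step used in the bias corrections; the toy quadratic is again an assumption for illustration):

```python
import numpy as np

def adam_step(theta, s, r, t, grad, alpha=0.01, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: decaying first/second moment estimates with bias correction."""
    s = rho1 * s + (1.0 - rho1) * grad             # first moment (momentum-like)
    r = rho2 * r + (1.0 - rho2) * grad * grad      # second moment (RMSProp-like)
    s_hat = s / (1.0 - rho1 ** t)                  # bias corrections, t = 1, 2, ...
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r

A = np.diag([1.0, 100.0])
theta, s, r = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, s, r = adam_step(theta, s, r, t, A @ theta)
print(theta)
```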
28. Example: test function (Beale's function). Image credit: Alec Radford.
29. Example: saddle point. Image credit: Alec Radford.
30. Second-order optimization
Figure: first-order vs. second-order approximation.
31. Second-order optimization
• Second-order Taylor expansion:
J(θ) ≈ J(θ₀) + (θ−θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ−θ₀)ᵀ H (θ−θ₀)
• Solving for the critical point we obtain the Newton parameter update:
θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)
• Problem: the Hessian has O(N²) elements and inverting H costs O(N³) (N = number of parameters, ≈ millions)
• Alternatives:
• Quasi-Newton methods (BFGS, Broyden–Fletcher–Goldfarb–Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
• L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian
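A small NumPy example of the Newton update above on an assumed quadratic objective, where a single step lands exactly on the minimum (the linear system is solved instead of explicitly inverting H):

```python
import numpy as np

# Quadratic J(θ) = 0.5 θᵀAθ − bᵀθ, with gradient Aθ − b and constant Hessian A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

def hessian(theta):
    return A                                    # constant for a quadratic

theta0 = np.array([5.0, 5.0])
# Newton update θ* = θ₀ − H⁻¹ ∇J(θ₀), computed by solving H x = ∇J(θ₀)
theta_star = theta0 - np.linalg.solve(hessian(theta0), grad(theta0))
print(theta_star)                               # equals the exact minimizer A⁻¹ b
print(np.linalg.solve(A, b))
```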
32. Parameter initialization
• Weights
• Can't initialize the weights to 0 (the gradients would be 0)
• Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; we need to break symmetry)
• Small random numbers, e.g. from a uniform or Gaussian distribution N(0, 10⁻²)
• if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
• Calibrating the variances with 1/sqrt(n) (Xavier initialization)
• each neuron: w = randn(n) / sqrt(n), with n inputs
• He initialization (for ReLU activations): sqrt(2/n)
• each neuron: w = randn(n) * sqrt(2.0/n), with n inputs
• Biases
• initialize all to 0 (except the output unit for skewed distributions, or 0.01 to avoid saturating ReLUs)
• Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs, or trained on an unrelated task
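The Xavier and He rules from this slide as a short NumPy sketch (the layer sizes are illustrative); the printed empirical standard deviations should match the target scales 1/sqrt(n) and sqrt(2/n):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier: standard Gaussian scaled by 1/sqrt(n_in), n_in = inputs per neuron."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    """He (for ReLU activations): standard Gaussian scaled by sqrt(2/n_in)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

W_xavier = xavier_init(512, 256)
W_he = he_init(512, 256)
biases = np.zeros(256)                          # biases initialized to 0
print(W_xavier.std(), 1.0 / np.sqrt(512))
print(W_he.std(), np.sqrt(2.0 / 512))
```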
33. Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
• Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
• Add a learnable scale and bias term to allow the network to still use the nonlinearity
Layer layout: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU
Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
34. Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch of N activations of a D-dimensional layer (an N×D matrix X):
1. compute the empirical mean and variance for each dimension
2. normalize: x̂⁽ᵏ⁾ = (x⁽ᵏ⁾ − E[x⁽ᵏ⁾]) / √var(x⁽ᵏ⁾)
• Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
• So let the network learn the identity: scale and shift, y⁽ᵏ⁾ = γ⁽ᵏ⁾ x̂⁽ᵏ⁾ + β⁽ᵏ⁾
• To recover the identity mapping the network can learn γ⁽ᵏ⁾ = √var(x⁽ᵏ⁾) and β⁽ᵏ⁾ = E[x⁽ᵏ⁾]
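A minimal NumPy sketch of the training-mode batch-norm transform above for an N×D batch of activations; keeping running averages of the batch statistics is an assumption about how the fixed test-time estimates of slide 36 are obtained:

```python
import numpy as np

def batchnorm_train(X, gamma, beta, running_mean, running_var,
                    momentum=0.9, eps=1e-5):
    """Training-mode BN for an N×D batch X: normalize per dimension, then scale and shift."""
    mean = X.mean(axis=0)                            # 1. empirical mean per dimension
    var = X.var(axis=0)                              #    and empirical variance
    X_hat = (X - mean) / np.sqrt(var + eps)          # 2. normalize
    out = gamma * X_hat + beta                       # learnable scale γ and shift β
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var

rng = np.random.default_rng(0)
N, D = 64, 8
X = 3.0 + 2.0 * rng.standard_normal((N, D))          # activations with non-zero mean/variance
gamma, beta = np.ones(D), np.zeros(D)
out, rm, rv = batchnorm_train(X, gamma, beta, np.zeros(D), np.ones(D))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per dimension
```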
35. Batch normalization
1. Improves gradient flow through the network
2. Allows higher learning rates
3. Reduces the strong dependency on initialization
4. Reduces the need for regularization
36. Batch normalization
At test time BN layers function differently:
1. The mean and std are not computed on the batch.
2. Instead, a single fixed empirical mean and std of the activations, computed during training, is used (it can be estimated with running averages).
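A companion sketch of the test-time behaviour described above: the stored running estimates replace the batch statistics (the particular running values below are assumed for illustration):

```python
import numpy as np

def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test-mode BN: use fixed running estimates instead of batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
D = 8
x = 3.0 + 2.0 * rng.standard_normal((1, D))            # a single test example
gamma, beta = np.ones(D), np.zeros(D)
running_mean = 3.0 * np.ones(D)                        # assumed estimates from training
running_var = 4.0 * np.ones(D)
print(batchnorm_test(x, gamma, beta, running_mean, running_var))
```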
37. Summary
• Optimization for NN is different from pure optimization:
• GD with mini-batches
• early stopping
• non-convex surface, local minima and saddle points
• The learning rate has a significant impact on model performance
• Several extensions to SGD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
• RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2/n)
• Batch normalization to reduce the internal covariate shift
38. Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.
