Learning Deep Learning

Learning
Deep
Learning

M1
Sonse
Shimaoka

Neural
Network
(From
the
lecture
slide
of

Nando
de
Freitas
)

Supervised
Learning
Input
output
[0,0,1,0,1,1,0,0,0,1,1]
1
[1,1,1,0,1,1,1,0,0,1,1]
0
[1,1,1,0,1,1,0,0,0,1,1]
0
[0,0,0,0,1,1,1,0,0,0,0]
1
[1,0,1,0,1,1,0,0,0,0,0]
1
[1,0,1,0,0,0,0,0,0,1,1]
0
[0,0,0,0,1,1,0,1,0,1,1]
1
Training
data
Input
output
[1,0,1,0,1,1,0,0,0,1,0]
?
[1,1,1,1,1,1,1,0,0,1,1]
?
[1,0,1,0,1,1,0,1,0,1,1]
?
Test
data
GeneralizaGon

Perceptron
∑ sign
x1
x2
x3
w1
w3
w2
b
y
y = sign wj xj
j=1
3
∑ + b
"
#
$$
%
&
''

Perceptron
∑ sign
1
3
−2
2
1.5
1
0.5
1
1*2 + 3*1
−2*1.5+ 0.5
= 2.5
y = sign wj xj
j=1
3
∑ + b
"
#
$$
%
&
''

Perceptron
(x1, x2, x3 ) = (1,3,−2)
(w1, w2, w3 ) = (2,1,1.5)
b = 0.5
= sign 1*2 + 3*1− 2*1.5+ 0.5( ) = sign(2.5) =1
y = sign wi xi
i=1
3
∑ + b
"
#
$
%
&
' = sign w1x1 + w2 x2 + w3x3 + b( )

Perceptron
x1
x2
w1x1 + w2 x2 + b = 0

Problem
with
Perceptron
x1
x2
w1x1 + w2 x2 + b = 0
What
is
the
probability
that
this

point
belongs
to
the
posiGve
class?

Perceptron
can’t
answer
this!

Problem
with
Perceptron
x1
x2
Impossible

to
separate

linearly
!!

LogisGc
Regression
∑ sigmoid
x1
x2
x3
w1
w3
w2
b
y
y = sigmoid wj xj
j=1
3
∑ + b
"
#
$$
%
&
''

LogisGc
Regression
sigmoid x( ) =
1
1+ exp(−x)

LogisGc
Regression
∑ sigmoid
y = sigmoid wj xj
j=1
3
∑ + b
"
#
$$
%
&
''
1
3
−2
2
1.5
1
0.5
1*2 + 3*1
−2*1.5+ 0.5
= 2.5
0.924
Probability!!

Feature
TransformaGon

x1
x2
New
Space
Φ
Non
Linear
TransformaGon
φ1(x1, x2 )
φ2 (x1, x2 )
Original
Space
But,
we
must

sGll
design
the

transformaGon…

Feed
Forward
Neural
Network
∑ f
A
neuron
AcGvaGon
funcGon

Feed
Forward
Neural
Network
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y1

Feed
Forward
Neural
Network
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y1
input
layer
hidden
layer
output
layer

AbstracGon
by
Layer
Linear
Linear
V
f g
W
x h yWx Vh

FFN
can
learn
representaGons!!

AcGvaGon
FuncGons
sigmoid x( ) =
1
1+ exp(−x)
=
exp(x)
exp(x)+1

AcGvaGon
FuncGons
tanh x( ) =
exp(x)− exp(−x)
exp(x)+ exp(−x)

AcGvaGon
FuncGons
rectifier(x) = max(0, x)

AcGvaGon
FuncGons
soft max(x1,..., xm )c =
exp(xc )
exp(xk )
k=1
m
∑

Loss
FuncGons
•  When
you
want
a
model
to
learn
to
do

something,
you
give
it
feedback
on
how
well
it
is

doing.

•  This
funcGon
that
computes
an
objecGve

measure
of
the
model's
performance
is
called
a

loss
func1on.

•  A
typical
loss
funcGon
takes
in
the
model's
output

and
the
ground
truth
and
computes
a
value
that

quanGﬁes
the
model's
performance.

•  The
model
then
corrects
itself
to
have
a
smaller

loss.

L2
norm
(y1,..., yn )
L =
1
n
ti − yi
2
2
i=1
n
∑
(t1,...,tn )
Output:
Target:
Loss:
Task:
Regression

Cross
Entropy
(y1,..., yn )
L =
1
n
−ti log yi −(1− ti )log(1− yi )
i=1
n
∑
(t1,...,tn )
Output:
Target:
Loss:
Task:
Binary
ClassiﬁcaGon

Class
NegaGve
Log
Likelihood
(y1,..., yn )
L = −
1
n
ti,k log yi,k
k
m
∑
i=1
n
∑
(t1,...,tn )
Output:
Target:
Loss:
Task:
MulG
Class
ClassiﬁcaGon

Output
acGvaGon
funcGons

and
Loss
funcGons
Task
Output

ac1va1on
Loss
func1on
Regression
Linear
L2
norm
Binary

ClassiﬁcaGon
Sigmoid
Cross
Entropy
MulG
Class

ClassiﬁcaGon
So]max
Class
NLL

ProbabilisGc
PerspecGve
•  We
can
assume
NNs
are
compuGng

condiGonal
probabiliGes
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
p(t1 | x1, x2, x3 )

ProbabilisGc
PerspecGve
•  When

NLL = −log p(ti | xi )
i=1
n
∏ = −log
1
2πσ
exp −
ti − yi( )
2
2σ 2
#
$
%
%
&
'
(
(
i=1
n
∏
=
1
2σ 2
ti − yi( )
2
− n 2πσ
i=1
n
∑
p(t | x) =
1
2πσ
exp −
t − y( )
2
2σ 2
"
#
$
$
%
&
'
'

L2
norm

ProbabilisGc
PerspecGve
•  When

i=1
n
∏ = −log yi
ti
(1−
i=1
n
∏ yi )1−ti
= −ti log yi − (1− ti )log(1− yi )
i=1
n
∑
p(t | x) = yt
(1− y)1−t

Cross
Entropy

ProbabilisGc
PerspecGve
•  When

i=1
n
∏ = −log y
ti,k
i,k
k=1
m
∏
i=1
n
∏
= − ti,k log yi,k
k=1
m
∑
i=1
n
∑
p(t | x) = yk
tk
k=1
m
∏

Class
NegaGve
Log
Likelihood

Gradient
Descent

•  Gradient

•  Gradient
Descent

Gradient
Descent

FuncGon
to
be
minimized

IniGal
point

Learning
rate

Update
rule

L(w)
winit
wnew
← wold
−α
∂L
∂w w=wold
α

Gradient
Descent

Big
learning
rate
Small
learning
rate

Loss
funcGon
for
LogisGc
regression
L(w,b;D) = log yti
i
i=1
n
∏ (1− yi )1−ti
= ti log yi + (1− ti )log(1− yi )
i=1
n
∑
yi =
1
1+ exp(−wT
xi − b)

Gradient
with
respect
to
w
∂L(w,b;D)
∂w
=
∂
∂w
ti log yi + (1− ti )log(1− yi )
i=1
n
∑
=
∂
∂w
ti log yi + (1− ti )log(1− yi )( )
i=1
n
∑
=
∂yi
∂w
∂
∂yi
i=1
n
∑
=
∂yi
∂w
ti
yi
−
1− ti
1− yi
$
%
&
'
(
)
i=1
n
∑ =
∂yi
∂w
ti − yi
yi (1− yi )
$
%
&
'
(
)
i=1
n
∑
= xi yi (1− yi )
ti − yi
yi (1− yi )
$
%
&
'
(
)
i=1
n
∑
= xi (ti − yi )
i=1
n
∑
∵
∂yi
∂w
=
∂
∂w
1
1+ exp(−wT
xi − b)
#
$
%
&
'
(
=
−
∂
∂w
1+ exp(−wT
xi − b)( )
1+ exp(−wT
xi − b)( )
2
=
xi exp(−wT
xi − b)
1+ exp(−wT
xi − b)( )
2
= xi yi (1− yi )

Gradient
with
respect
to
b
∂L(w,b;D)
∂b
=
∂
∂b
ti log yi + (1− ti )log(1− yi )
i=1
n
∑
=
∂
∂b
i=1
n
∑
=
∂yi
∂b
∂
∂yi
i=1
n
∑
=
∂yi
∂b
ti
yi
−
1− ti
1− yi
$
%
&
'
(
)
i=1
n
∑ =
∂yi
∂b
ti − yi
yi (1− yi )
$
%
&
'
(
)
i=1
n
∑
= yi (1− yi )
ti − yi
yi (1− yi )
$
%
&
'
(
)
i=1
n
∑
= ti − yi
i=1
n
∑
∵
∂yi
∂b
=
∂
∂b
1
1+ exp(−wT
xi − b)
#
$
%
&
'
(
=
−
∂
∂b
1+ exp(−wT
xi − b)( )
1+ exp(−wT
xi − b)( )
2
=
exp(−wT
xi − b)
1+ exp(−wT
xi − b)( )
2
= yi (1− yi )

Gradient
Descent

for
LogisGc
Regression

FuncGon
to
be
minimized

Update
rule

bnew
← bold
−α ti − yi
i=1
n
∑
L(w,b;D)= ti log yi + (1− ti )log(1− yi )
i=1
n
∑
wnew
← wold
−α xi (ti − yi )
i=1
n
∑

Exercise:
Gradient
Descent

for
Linear
Regression

L(w,b;D) = ti − yi( )
2
i=1
n
∑
yi = wT
xi + b

Answer
FuncGon
to
be
minimized

Update
rule

L(w,b;D) = ti − yi( )
2
i=1
n
∑
bnew
← bold
−α ti − yi
i=1
n
∑
wnew
← wold
−α xi (ti − yi )
i=1
n
∑

BackpropagaGon
How
do
we
compute

and

?
∂L
∂W
∂L
∂V
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
∑ g y1
u1
u2
u3
l1
l2

BackpropagaGon
Use
the
Chain
Rule!!!
∂
∂x
q s x( )( )=
∂s(x)
∂x
∂q(s(x))
∂s(x)
x
s(x)
qs q(p(x))
∂q(s(x))
∂s(x)
∂s(x)
∂x
∂q(s(x))
∂s(x)

BackpropagaGon
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
Start
from

Output
layer:

∂L
∂y1
∑ g y1
u1
u2
u3
l1
l2
∂L
∂y1

BackpropagaGon
Apply

Chain
Rule
:
∂L
∂l1
=
∂y1
∂l1
∂L
∂y1
= "g l1( )
∂L
∂y1
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
∑ g y1
u1
u2
u3
l1
l2
∂L
∂y1
∂L
∂l1

BackpropagaGon
Apply

Chain
Rule
:
∂L
∂V3,1
=
∂l1
∂V3,1
∂L
∂l1
= h3
∂L
∂l1
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
∑ g y1
u1
u2
u3
l1
l2
∂L
∂y1
∂L
∂l1
∂L
∂V3,1

BackpropagaGon
Apply

Chain
Rule
:
∂L
∂h3
=
∂l1
∂h3
∂L
∂l1
+
∂l1
∂h3
∂L
∂l1
= V3,1
∂L
∂l1
+V3,2
∂L
∂l2
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
∑ g y1
u1
u2
u3
l1
l2
∂L
∂y1
∂L
∂l1
∂L
∂V3,1
∂L
∂h3

BackpropagaGon
Apply

Chain
Rule
:
∂L
∂u3
=
∂h3
∂u3
∂L
∂h3
= "f u3( )
∂L
∂h3
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
∑ g y1
u1
u2
u3
l1
l2
∂L
∂y1
∂L
∂l1
∂L
∂V3,1
∂L
∂h3
∂L
∂u3

BackpropagaGon
Apply

Chain
Rule
:
∂L
∂W2,3
=
∂u3
∂W2,3
∂L
∂u3
= u3
∂L
∂u3
∑ f
∑ f
∑ f
∑ g
x1
x3
x2
h1
h2
h3
y2
L
W2,3
V3,1
∑ g y1
u1
u2
u3
l1
l2
∂L
∂y1
∂L
∂l1
∂L
∂V3,1
∂L
∂h3
∂L
∂u3
∂L
∂W2,3

AbstracGon
by
Layer
Linear
Linear
V
f
∂L
∂W
∂L
∂V
g
W
Lx
t
∂L
∂x
h
∂L
∂h
y
∂L
∂y
Wx Vh
∂L
∂ Wx( )
∂L
∂ Vh( )

AbstracGon
by
Layer
input output
∂loss
∂input
∂loss
∂outputLayer

AbstracGon
by
Layer
input output
Layer
Forward
ComputaGon
output = Layer. forward input( )

AbstracGon
by
Layer
∂loss
∂input
∂loss
∂outputLayer
Backward
ComputaGon
∂loss
∂input
= Layer. backward input,
∂loss
∂output
"
#
$
%
&
'
input

BackpropagaGon
①
Execute
the
forward
computaGon

Linear
Linear
V
f g
W
Lx
t
h yWx Vh

BackpropagaGon
②
Compute
the
derivaGve
of
the
loss
funcGon

with
respect
to
the
output
Linear
Linear
V
f g
W
Lx
t
h y
∂L
∂y
Wx Vh

BackpropagaGon
③
StarGng
from
the
ﬁnal
layer,
backpropagate

derivaGves
through
layers

Linear
Linear
V
f g
W
Lx
t
h y
∂L
∂y
Wx Vh
∂L
∂ Vh( )

Classifying
Digits
32×32=1024
pixels
Class:
10
digits
(0~9)

Training:
60000
examples

TesGng:
60000
examples

Classifying
Digits
x ∈ R1024
0
0
0
0
1
0
0
0
0
0
!
"
#
#
#
#
#
#
#
#
#
#
#
#
#
#
$
%
&
&
&
&
&
&
&
&
&
&
&
&
&
&
t =

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
u = Wx + b
u

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
h = Tanh(u)
hu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
l = Vh + c
h lu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
y = soft max l( )
h ylu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
h y
L = tk log yk
k=1
10
∑ = tT
log y
L
lu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
Wx + b h Vh + c y
∂L
∂y
=
∂
∂y
tT
log y = t1
y1
,...,t10
y10
"
#$
%
&'
T
L
∂L
∂y

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
h y
∂L
∂l
= y !(t − y)!
∂L
∂y
= y1 t1 − y1( ),..., y10 t10 − y10( )#$ %&
T
! t1
y1
,...,t10
y10
#
$'
%
&(
T
= t1 t1 − y1( ),...,t10 t10 − y10( )#$ %&
T
L
∂L
∂y
∂L
∂l
lu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
h y
∂L
∂h
= VT ∂L
∂l
L
∂L
∂y
∂L
∂V
=
∂L
∂l
hT ∂L
∂c
=
∂L
∂l
∂L
∂h
∂L
∂V
,
∂L
∂c
∂L
∂l
lu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
h y
∂L
∂u
= 1+ h( )! 1− h( )!
∂L
∂h
L
∂L
∂y
∂L
∂h
∂L
∂V
,
∂L
∂c
∂L
∂u
∂L
∂l
lu

Classifying
Digits
Linear
Linear
V,cW,b
x
t
So]max
Tanh
Class
NLL
h y
∂L
∂W
=
∂L
∂u
xT
L
∂L
∂y
∂L
∂h
∂L
∂V
,
∂L
∂c
∂L
∂u
∂L
∂b
=
∂L
∂u
∂L
∂x
= W T ∂L
∂u
∂L
∂W
,
∂L
∂b
∂L
∂x
∂L
∂l
lu

Classifying
Digits
bnew
← b −α
∂L
∂b
W new
← W −α
∂L
∂W
V new
← V −α
∂L
∂V
cnew
← c −α
∂L
∂c

Learning Deep Learning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to Learning Deep Learning

Similar to Learning Deep Learning (20)

Recently uploaded

Recently uploaded (20)

Learning Deep Learning