9. Problem with Perceptron
[Figure: points in the $(x_1, x_2)$ plane on either side of the decision boundary $w_1 x_1 + w_2 x_2 + b = 0$]
What is the probability that this point belongs to the positive class? The perceptron can't answer this!
14. Feature Transformation
[Figure: data in the original space $(x_1, x_2)$ mapped by a non-linear transformation $\Phi$ into a new space with coordinates $\varphi_1(x_1, x_2)$ and $\varphi_2(x_1, x_2)$]
But we must still design the transformation…
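A minimal sketch of such a hand-designed transformation (my own illustration, not from the slides): the quadratic map $\varphi(x_1, x_2) = (x_1^2, x_2^2)$ turns a circular decision boundary in the original space into a linear one in the new space.

```python
import numpy as np

# Hand-designed quadratic feature map: phi(x1, x2) = (x1^2, x2^2).
def phi(X):
    return np.stack([X[:, 0] ** 2, X[:, 1] ** 2], axis=1)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
t = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(float)  # circular boundary: not linearly separable

# In the new space the boundary x1^2 + x2^2 = 1 becomes the line z1 + z2 = 1.
Z = phi(X)
pred = (Z.sum(axis=1) > 1).astype(float)
print("accuracy of a linear rule in the new space:", (pred == t).mean())  # 1.0
```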
25. Loss Functions
• When you want a model to learn to do something, you give it feedback on how well it is doing.
• The function that computes an objective measure of the model's performance is called a loss function.
• A typical loss function takes the model's output and the ground truth and computes a value that quantifies the model's performance.
• The model then corrects itself to achieve a smaller loss.
26. L2 norm
Task: Regression
Output: $(y_1, \ldots, y_n)$
Target: $(t_1, \ldots, t_n)$
Loss: $L = \frac{1}{n} \sum_{i=1}^{n} \| t_i - y_i \|_2^2$
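A minimal NumPy sketch of this loss (the variable names are my own):

```python
import numpy as np

def l2_loss(y, t):
    """L2 norm loss: (1/n) * sum_i ||t_i - y_i||_2^2."""
    return np.mean(np.sum((t - y) ** 2, axis=-1))

y = np.array([[0.9], [2.1], [2.8]])  # model outputs
t = np.array([[1.0], [2.0], [3.0]])  # regression targets
print(l2_loss(y, t))  # 0.02
```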
27. Cross Entropy
Task: Binary Classification
Output: $(y_1, \ldots, y_n)$
Target: $(t_1, \ldots, t_n)$
Loss: $L = \frac{1}{n} \sum_{i=1}^{n} \left[ -t_i \log y_i - (1 - t_i) \log(1 - y_i) \right]$
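The same loss as a NumPy sketch (the eps clipping is my own addition to guard against log(0)):

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """Binary cross entropy, averaged over the batch."""
    y = np.clip(y, eps, 1 - eps)  # guard against log(0)
    return np.mean(-t * np.log(y) - (1 - t) * np.log(1 - y))

y = np.array([0.9, 0.2, 0.7])  # predicted probabilities of the positive class
t = np.array([1.0, 0.0, 1.0])  # binary targets
print(cross_entropy(y, t))
```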
28. Class Negative Log Likelihood
Task: Multi-Class Classification
Output: $(y_1, \ldots, y_n)$
Target: $(t_1, \ldots, t_n)$
Loss: $L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} t_{i,k} \log y_{i,k}$
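And a NumPy sketch of this one (one-hot targets assumed; the eps guard is my own addition):

```python
import numpy as np

def class_nll(y, t, eps=1e-12):
    """Class NLL: -(1/n) * sum_i sum_k t_{i,k} * log(y_{i,k})."""
    return -np.mean(np.sum(t * np.log(y + eps), axis=1))

y = np.array([[0.7, 0.2, 0.1],   # per-class probabilities (rows sum to 1)
              [0.1, 0.8, 0.1]])
t = np.array([[1.0, 0.0, 0.0],   # one-hot targets
              [0.0, 1.0, 0.0]])
print(class_nll(y, t))
```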
29. Output activation functions and Loss functions

Task                        Output activation   Loss function
Regression                  Linear              L2 norm
Binary Classification       Sigmoid             Cross Entropy
Multi-Class Classification  Softmax             Class NLL
30. Probabilistic Perspective
• We can assume NNs are computing conditional probabilities.
[Figure: a feed-forward network with inputs $x_1, x_2, x_3$, a hidden layer $h_1, h_2, h_3$ of weighted sums with activation $f$, and an output unit with activation $g$ producing $p(t_1 \mid x_1, x_2, x_3)$]
31. Probabilistic Perspective
• When $p(t \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t - y)^2}{2\sigma^2} \right)$:

$\mathrm{NLL} = -\log \prod_{i=1}^{n} p(t_i \mid x_i) = -\log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t_i - y_i)^2}{2\sigma^2} \right) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (t_i - y_i)^2 + n \log\left( \sqrt{2\pi}\,\sigma \right)$

Up to the $\frac{1}{2\sigma^2}$ scaling and the additive constant, this is the L2 norm loss.
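A quick numerical check of this identity (my own sketch; σ is fixed to 1 and the data are made up):

```python
import numpy as np

sigma = 1.0
y = np.array([0.5, 1.2, -0.3])  # network outputs (Gaussian means)
t = np.array([0.7, 1.0, 0.1])   # observed targets

# Negative log-likelihood under the Gaussian model, summed over the data.
nll = -np.sum(np.log(1 / (np.sqrt(2 * np.pi) * sigma)
                     * np.exp(-(t - y) ** 2 / (2 * sigma ** 2))))

# Closed form from the derivation above.
closed = np.sum((t - y) ** 2) / (2 * sigma ** 2) \
         + len(t) * np.log(np.sqrt(2 * np.pi) * sigma)

print(np.isclose(nll, closed))  # True
```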
32. Probabilistic Perspective
• When $p(t \mid x) = y^t (1 - y)^{1-t}$:

$\mathrm{NLL} = -\log \prod_{i=1}^{n} p(t_i \mid x_i) = -\log \prod_{i=1}^{n} y_i^{t_i} (1 - y_i)^{1-t_i} = \sum_{i=1}^{n} \left[ -t_i \log y_i - (1 - t_i) \log(1 - y_i) \right]$

This is the cross entropy loss.
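The same check for the Bernoulli case (my own sketch with made-up data):

```python
import numpy as np

y = np.array([0.9, 0.2, 0.7])  # predicted P(t = 1 | x)
t = np.array([1.0, 0.0, 1.0])

# NLL of the Bernoulli likelihood, summed over the data...
nll = -np.sum(np.log(y ** t * (1 - y) ** (1 - t)))
# ...equals the (unaveraged) cross entropy.
ce = np.sum(-t * np.log(y) - (1 - t) * np.log(1 - y))
print(np.isclose(nll, ce))  # True
```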
33. Probabilistic Perspective
• When $p(t \mid x) = \prod_{k=1}^{m} y_k^{t_k}$:

$\mathrm{NLL} = -\log \prod_{i=1}^{n} p(t_i \mid x_i) = -\log \prod_{i=1}^{n} \prod_{k=1}^{m} y_{i,k}^{t_{i,k}} = -\sum_{i=1}^{n} \sum_{k=1}^{m} t_{i,k} \log y_{i,k}$

This is the class negative log likelihood.
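And for the categorical case (again my own sketch, one-hot targets assumed):

```python
import numpy as np

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])  # per-class probabilities
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])  # one-hot targets

# NLL of the categorical likelihood, summed over the data...
nll = -np.sum(np.log(np.prod(y ** t, axis=1)))
# ...equals the (unaveraged) class NLL.
class_nll = -np.sum(t * np.log(y))
print(np.isclose(nll, class_nll))  # True
```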
37. Loss function for Logistic regression

$L(w, b; \mathcal{D}) = \log \prod_{i=1}^{n} y_i^{t_i} (1 - y_i)^{1-t_i} = \sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right]$

where $y_i = \frac{1}{1 + \exp(-w^T x_i - b)}$.

Note that $L$ as written is the log-likelihood, so the loss to be minimized is its negative.
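A direct NumPy sketch of this model and its log-likelihood (the names and the eps guard are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(w, b, X, t, eps=1e-12):
    """L(w, b; D) = sum_i [ t_i log y_i + (1 - t_i) log(1 - y_i) ]."""
    y = np.clip(sigmoid(X @ w + b), eps, 1 - eps)  # y_i = 1 / (1 + exp(-w.x_i - b))
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```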
38. Gradient with respect to w

$\frac{\partial L(w,b;\mathcal{D})}{\partial w} = \frac{\partial}{\partial w} \sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right]$
$= \sum_{i=1}^{n} \frac{\partial}{\partial w} \left( t_i \log y_i + (1 - t_i) \log(1 - y_i) \right)$
$= \sum_{i=1}^{n} \frac{\partial y_i}{\partial w} \frac{\partial}{\partial y_i} \left( t_i \log y_i + (1 - t_i) \log(1 - y_i) \right)$
$= \sum_{i=1}^{n} \frac{\partial y_i}{\partial w} \left( \frac{t_i}{y_i} - \frac{1 - t_i}{1 - y_i} \right) = \sum_{i=1}^{n} \frac{\partial y_i}{\partial w} \left( \frac{t_i - y_i}{y_i (1 - y_i)} \right)$
$= \sum_{i=1}^{n} x_i y_i (1 - y_i) \left( \frac{t_i - y_i}{y_i (1 - y_i)} \right) = \sum_{i=1}^{n} x_i (t_i - y_i)$

$\because \frac{\partial y_i}{\partial w} = \frac{\partial}{\partial w} \left( \frac{1}{1 + \exp(-w^T x_i - b)} \right) = -\frac{\frac{\partial}{\partial w} \left( 1 + \exp(-w^T x_i - b) \right)}{\left( 1 + \exp(-w^T x_i - b) \right)^2} = \frac{x_i \exp(-w^T x_i - b)}{\left( 1 + \exp(-w^T x_i - b) \right)^2} = x_i y_i (1 - y_i)$
39. Gradient with respect to b

$\frac{\partial L(w,b;\mathcal{D})}{\partial b} = \frac{\partial}{\partial b} \sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right]$
$= \sum_{i=1}^{n} \frac{\partial}{\partial b} \left( t_i \log y_i + (1 - t_i) \log(1 - y_i) \right)$
$= \sum_{i=1}^{n} \frac{\partial y_i}{\partial b} \frac{\partial}{\partial y_i} \left( t_i \log y_i + (1 - t_i) \log(1 - y_i) \right)$
$= \sum_{i=1}^{n} \frac{\partial y_i}{\partial b} \left( \frac{t_i}{y_i} - \frac{1 - t_i}{1 - y_i} \right) = \sum_{i=1}^{n} \frac{\partial y_i}{\partial b} \left( \frac{t_i - y_i}{y_i (1 - y_i)} \right)$
$= \sum_{i=1}^{n} y_i (1 - y_i) \left( \frac{t_i - y_i}{y_i (1 - y_i)} \right) = \sum_{i=1}^{n} (t_i - y_i)$

$\because \frac{\partial y_i}{\partial b} = \frac{\partial}{\partial b} \left( \frac{1}{1 + \exp(-w^T x_i - b)} \right) = -\frac{\frac{\partial}{\partial b} \left( 1 + \exp(-w^T x_i - b) \right)}{\left( 1 + \exp(-w^T x_i - b) \right)^2} = \frac{\exp(-w^T x_i - b)}{\left( 1 + \exp(-w^T x_i - b) \right)^2} = y_i (1 - y_i)$
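A finite-difference check of both derived gradients (my own sketch; the data and names are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def L(w, b, X, t):
    y = sigmoid(X @ w + b)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
t = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
w, b = rng.normal(size=3), 0.1

y = sigmoid(X @ w + b)
grad_w = X.T @ (t - y)   # sum_i x_i (t_i - y_i), from slide 38
grad_b = np.sum(t - y)   # sum_i (t_i - y_i), from slide 39

# Compare against central finite differences.
eps = 1e-6
num_w = np.array([(L(w + eps * np.eye(3)[j], b, X, t)
                   - L(w - eps * np.eye(3)[j], b, X, t)) / (2 * eps)
                  for j in range(3)])
num_b = (L(w, b + eps, X, t) - L(w, b - eps, X, t)) / (2 * eps)
print(np.allclose(grad_w, num_w), np.isclose(grad_b, num_b))  # True True
```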
40. Gradient Descent for Logistic Regression

Function to be minimized (the negative log-likelihood):

$-L(w, b; \mathcal{D}) = -\sum_{i=1}^{n} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right]$

Update rule (gradient descent on $-L$ adds $\alpha$ times the gradients derived above):

$w^{\mathrm{new}} \leftarrow w^{\mathrm{old}} + \alpha \sum_{i=1}^{n} x_i (t_i - y_i)$
$b^{\mathrm{new}} \leftarrow b^{\mathrm{old}} + \alpha \sum_{i=1}^{n} (t_i - y_i)$
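A minimal end-to-end sketch of these updates (my own illustration; the data, learning rate, and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (X @ np.array([2.0, -1.0]) + 0.5 > 0).astype(float)  # linearly separable labels

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(500):
    y = sigmoid(X @ w + b)
    w += alpha * X.T @ (t - y)   # w_new <- w_old + alpha * sum_i x_i (t_i - y_i)
    b += alpha * np.sum(t - y)   # b_new <- b_old + alpha * sum_i (t_i - y_i)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == t)
print(f"training accuracy: {accuracy:.2f}")
```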