Deep Neural Network
Cost Functions and Output Units
Jiaming Lin
jmlin@arbor.ee.ntu.edu.tw
DATALab@III
NetDBLab@NTU
January 9, 2017
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Introduction
In neural network learning...
The selection of the output unit depends on the learning problem.
– Classification: sigmoid, softmax or linear.
– Linear regression: linear.
Determine and analyse the cost function.
– Is the cost function †analytic?
– Can the learning progress well (first-order derivatives)?
Deterministic and generic models.
– Data is more complicated in many cases.
Note: †For simplicity, by analytic we mean that a function is
infinitely differentiable on its domain.
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Binary
index   x1      · · ·   xn      target
1       0       · · ·   1       Class A
2       1       · · ·   0       Class B
3       1       · · ·   1       Class A
· · ·   · · ·   · · ·   · · ·   · · ·
m       0       · · ·   0       Class B
Binary
The model predicts ŷ = S(z), where
S is the sigmoid function,
z is the input of the output layer,
    z = wᵀh + b,                                                  (1)
with w the weight, h the output of the hidden layer, and b the bias.
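As a minimal sketch of this output unit (the hidden output h, weight w, and bias b below are made-up values, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical hidden-layer output, weight, and bias.
h = np.array([0.2, -0.7, 1.5])
w = np.array([0.5, 0.1, -0.3])
b = 0.05

z = w @ h + b        # input of the output layer, eq. (1)
y_hat = sigmoid(z)   # prediction in (0, 1)
print(z, y_hat)
```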
Cost Function
A cost function can be derived in many ways; we discuss two of the most common.
Mean Squared Error
Let y^(i) denote the data label and ŷ^(i) = S(z^(i)) the prediction. We may define the cost function C_mse by
    C_mse = (1/m) Σ_{i=1}^m (ŷ^(i) − y^(i))^2                     (2)
where m is the data size, and z^(i), ŷ^(i) and y^(i) are real numbers.
Cross Entropy
Adapting the symbols above, the cost function defined by cross entropy is
    C_ce = (1/m) Σ_{i=1}^m [ y^(i) ln(ŷ^(i)) + (1 − y^(i)) ln(1 − ŷ^(i)) ]     (3)
where m is the data size, and z^(i), ŷ^(i) and y^(i) are real numbers.
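A sketch of both costs in numpy (labels and predictions below are made-up; note the sign convention follows eq. (3), so C_ce ≤ 0, and the naive logarithms inherit the instability discussed later):

```python
import numpy as np

def mse_cost(y_hat, y):
    # C_mse = (1/m) * sum (y_hat - y)^2, eq. (2)
    return np.mean((y_hat - y) ** 2)

def ce_cost(y_hat, y):
    # C_ce = (1/m) * sum [y ln(y_hat) + (1-y) ln(1-y_hat)], eq. (3);
    # naive version: undefined once y_hat reaches exactly 0 or 1.
    return np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(mse_cost(y_hat, y), ce_cost(y_hat, y))
```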
Comparison between MSE and Cross Entropy
Problem: Which one is better?
Analyticity (infinitely differentiable)
Learning ability (first-order derivatives)
Comparison between MSE and Cross Entropy
Analyticity:
    C_mse = (1/m) Σ_{i=1}^m (ŷ^(i) − y^(i))^2
    C_ce  = (1/m) Σ_{i=1}^m [ y^(i) ln(ŷ^(i)) + (1 − y^(i)) ln(1 − ŷ^(i)) ]
Computationally, the value of ŷ^(i) = S(z^(i)) could overflow to 1 or underflow to 0 when z^(i) is very positive or very negative. Therefore, given a fixed y^(i) ∈ {0, 1}:
    C_ce is undefined when ŷ^(i) is 0 or 1;
    C_mse is polynomial in ŷ^(i) and thus analytic everywhere.
Comparison between MSE and Cross Entropy
Learning Ability: compare the gradients
    ∂C_mse/∂w = [S(z) − y] [1 − S(z)] S(z) h,                     (4)
    ∂C_ce/∂w  = [y − S(z)] h,                                     (5)
respectively, where S is the sigmoid and z = wᵀh + b.
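To see the saturation numerically, a quick sketch (scalar h and made-up logits, not from the slides): for large |z| with the wrong label, the MSE gradient (4) vanishes while the cross-entropy gradient (5) stays close to ±1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h, y = 1.0, 1.0   # hypothetical hidden output (scalar) and label

for z in [-10.0, 0.0, 10.0]:
    s = sigmoid(z)
    grad_mse = (s - y) * (1 - s) * s * h   # eq. (4)
    grad_ce  = (y - s) * h                 # eq. (5)
    print(f"z={z:+5.1f}  dC_mse/dw={grad_mse:+.2e}  dC_ce/dw={grad_ce:+.2e}")
# At z = -10 (a confident wrong answer) the MSE gradient is ~ -4.5e-05,
# while the cross-entropy gradient is ~ 1: learning can still progress.
```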
Comparison between MSE and Cross Entropy
                     MSE                            Cross Entropy
gradient             [S(z) − y][1 − S(z)]S(z)h      [y − S(z)]h
y = 1 and ŷ → 1      steps → 0                      steps → 0
y = 1 and ŷ → 0      steps → 0                      steps → 1
y = 0 and ŷ → 1      steps → 0                      steps → −1
y = 0 and ŷ → 0      steps → 0                      steps → 0
In the case of Mean Squared Error, the progress gets stuck when z is very positive or very negative.
The Unstable Issue in Cross Entropy
We have mentioned the unstable issue of cross entropy. Precisely,
    ŷ = S(z) underflows to 0 when z is very negative,
    ŷ = S(z) overflows to 1 when z is very positive.
Therefore, given a fixed y ∈ {0, 1}, the function
    C = y ln ŷ + (1 − y) ln(1 − ŷ)
could be undefined when z is very positive or very negative.
The Unstable Issue in Cross Entropy
Alternatively, regarding z as the variable of cross entropy,
    C = y ln S(z) + (1 − y) ln(1 − S(z))                          (6)
      = −ζ(−z) + z(y − 1),                                        (7)
where ζ(x) = ln(1 + e^x) is the softplus and z is a real number.
We may obtain the analyticity of C from (7): it is a sum of a softplus term and a linear term, both analytic.
In the cases of the right answer,
    y = 1 and ŷ = S(z) → 1 ⇒ z → ∞,  C → 0,
    y = 0 and ŷ = S(z) → 0 ⇒ z → −∞, C → 0.
In the cases of the wrong answer,
    y = 1 and ŷ = S(z) → 0 ⇒ z → −∞, dC/dz → 1,
    y = 0 and ŷ = S(z) → 1 ⇒ z → ∞,  dC/dz → −1,
so, computed via (7), C is well defined for every finite z and the gradient stays informative.
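A numerically stable implementation works directly on the logit z through the softplus form (7). A sketch (the max/log1p identity for softplus is a standard stabilization, an assumption here, not from the slides):

```python
import numpy as np

def softplus(x):
    # zeta(x) = ln(1 + e^x), computed stably for large |x| via
    # zeta(x) = max(x, 0) + log1p(exp(-|x|)).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def stable_ce(z, y):
    # C = -zeta(-z) + z*(y - 1), eq. (7): finite for any finite z.
    return -softplus(-z) + z * (y - 1)

def naive_ce(z, y):
    s = 1.0 / (1.0 + np.exp(-z))
    return y * np.log(s) + (1 - y) * np.log(1 - s)

z, y = 40.0, 0.0        # sigmoid(40) rounds to exactly 1.0 in float64
print(naive_ce(z, y))   # -inf: ln(1 - 1) is undefined
print(stable_ce(z, y))  # -40.0, the correct value
```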
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Multinoulli: Output Unit and Cost Function
Generalize the binary case to multiple classes.
Linear output units, with #(output units) = #(classes).
Cost function evaluated by cross entropy.
Cost Function in Multinoulli Problems
Suppose the size of the dataset is m and there are K classes. Then we can obtain the cost function from cross entropy:
    C(w) = − Σ_{i=1}^m Σ_{k=1}^K 1{y^(i) = k} ln [ exp(z_k^(i)) / Σ_{j=1}^K exp(z_j^(i)) ]        (8)
where z_k^(i) = w_kᵀ h^(i) + b_k and h^(i) is the output of the hidden layer corresponding to example x^(i).
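A sketch of the cost (8) computed from the logits (made-up shapes and values; the max subtraction is the stabilization that Lemma 1 below justifies):

```python
import numpy as np

def log_softmax(Z):
    # ln softmax(z)_k = z_k - ln sum_j exp(z_j);
    # subtracting max_j z_j first avoids overflow (cf. Lemma 1).
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def multinoulli_cost(Z, y):
    # C(w) = - sum_i ln softmax(z^(i))_{y^(i)}, eq. (8)
    m = Z.shape[0]
    return -log_softmax(Z)[np.arange(m), y].sum()

Z = np.array([[2.0, 0.5, -1.0],   # hypothetical logits, m = 2, K = 3
              [0.1, 0.3,  0.2]])
y = np.array([0, 2])              # true class indices
print(multinoulli_cost(Z, y))
```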
A Lemma for Cost Function Simplification
Analyticity (infinitely differentiable)
Learning ability (first-order derivatives)
To claim the above properties, we should first show a lemma.
Lemma 1
For the output z = wᵀh + b with z = [z_1, . . . , z_K],
    ln Σ_{j=1}^K exp(z_j) ≈ max_j {z_j},                          (9)
with the gap smaller than any given ε > 0 once the largest z_j dominates the others.
A Lemma for Cost Function Simplification
Proof.
Without loss of generality, we may assume z_1 > . . . > z_K. The remaining work is to show that, for any ε > 0,
    ln [ e^{z_1} (1 + Σ_{j=2}^K e^{z_j − z_1}) ] = z_1 + ln (1 + Σ_{j=2}^K e^{z_j − z_1}) ≤ z_1 + ε,
which holds once the gaps z_j − z_1 are sufficiently negative.
Intuitively, ln Σ_{j=1}^K exp(z_j) can be well approximated by max_j {z_j}.
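A quick numerical check of the lemma (made-up logits, not from the slides): the gap between log-sum-exp and the max shrinks as one logit dominates.

```python
import numpy as np

for scale in [1.0, 5.0, 25.0]:
    z = scale * np.array([1.0, 0.2, -0.5])   # widen the gaps between logits
    lse = np.log(np.exp(z).sum())
    print(f"scale={scale:5.1f}  lse={lse:8.4f}  max={z.max():8.4f}  gap={lse - z.max():.2e}")
# The gap lse - max tends to 0 as the largest logit dominates.
```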
Analyticity
We may rewrite the cost function as
    C(w) = − Σ_{i=1}^m Σ_{k=1}^K 1{y^(i) = k} [ z_k^(i) − ln Σ_{j=1}^K exp(z_j^(i)) ].
Each summand is a difference of analytic functions and thus analytic, and the term 1{y^(i) = k} is actually a constant. The total cost is a sum of analytic functions and thus analytic.
Learning Ability
Property 2
By the rule of sums in derivatives, we may simplify (8) to
    C^(i) = Σ_{k=1}^K 1{y = k} [ z_k − ln Σ_{j=1}^K exp(z_j) ],   (10)
the cost contributed by the example x^(i) to the total cost C.
1 If the model gives the right answer, then the error is close to 0.
2 If the model gives the wrong answer, then the learning can progress well.
Learning Ability
Proof (The Right Answer).
Suppose the true label is class n. By the assumption, we know z_n is maximal. Then
    −ε ≤ Σ_{k=1}^K 1{y = k} [ z_k − ln Σ_{j=1}^K exp(z_j) ]
       = z_n − ln Σ_{j=1}^K exp(z_j)
       < z_n − max_j {z_j} = 0.
This shows that −ε ≤ C^(i) < 0 for an arbitrarily small ε.
Learning Ability
Proof (The Wrong Answer).
Suppose the true label is class n. By assumption, the prediction z_n given by the model is not maximal. On the other hand, using the fact
    z_n ≠ max_j {z_j} ⇒ softmax(z)_n ≪ 1,
there exists a sufficiently large δ > 0 such that
    | softmax(z)_n − 1 | > δ.
Learning Ability
Proof (The Wrong Answer, Cont.)
Then
    ∂C^(i)/∂z_n = ∂/∂z_n [ z_n − ln Σ_{j=1}^K e^{z_j} ] = 1 − softmax(z)_n > δ.
This shows the gradient is sufficiently large and also predictable (bounded by 1); therefore the learning can progress well.
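A finite-difference check of this gradient (made-up logits; a sketch, not from the slides):

```python
import numpy as np

def cost_i(z, n):
    # C^(i) = z_n - ln sum_j exp(z_j), eq. (10)
    return z[n] - np.log(np.exp(z).sum())

z, n, eps = np.array([0.5, 2.0, -1.0]), 0, 1e-6  # wrong answer: z_0 is not maximal
analytic = 1.0 - np.exp(z[n]) / np.exp(z).sum()  # 1 - softmax(z)_n
z_plus = z.copy(); z_plus[n] += eps
numeric = (cost_i(z_plus, n) - cost_i(z, n)) / eps
print(analytic, numeric)  # both ~0.82: a large gradient, bounded by 1
```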
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Learning Processes Overview
          Deterministic                       Generic
Step 1    Model function                      Probability distribution
          (linear, sigmoid)                   (Gaussian, Bernoulli)
Step 2    Design error evaluations            Maximum likelihood estimate
          (MSE, cross entropy)
Step 3    Learning one statistic              Learning the full distribution
          (mean, median)
To describe some complicated data, it is easier to build the model with the generic method.
Generic Modeling for Binary Classification
Step 1: Use the Bernoulli distribution as the likelihood function:
    p(y | x) = p^y (1 − p)^{1−y} = S(z)^y (1 − S(z))^{1−y}.
Step 2: Minimize the negative log-likelihood, where each term is
    ln p(y | x^(i)) = y ln S(z) + (1 − y) ln(1 − S(z)).
Step 3: We can learn the full distribution: for a new input x′,
    p(y′ | x′) = S(z′)^{y′} (1 − S(z′))^{1−y′},
where we denote z′ = wᵀx′ + b and S is the sigmoid.
Generic Modeling for Linear Regression: Step 1
Given a training feature x, use the Gaussian distribution as the likelihood function:
    p(y | x) = (1 / √(2πσ^2)) exp( −(µ − y)^2 / (2σ^2) ),
where, denoting the output of the hidden layer by h_x, the weights w = [w_1, w_2] and biases b = [b_1, b_2],
    µ = w_1ᵀ h_x + b_1,
    σ = w_2ᵀ h_x + b_2.
Intuitively, µ and σ are two linear output units; they are functions of x.
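A sketch of these two output heads in numpy (h_x, weights, and biases are made-up values; note a plain linear unit does not enforce σ > 0, which the slides leave implicit):

```python
import numpy as np

hx = np.array([0.4, -1.2, 0.8])              # hypothetical hidden-layer output
w1, b1 = np.array([0.3, 0.2, -0.1]), 0.0     # mean head
w2, b2 = np.array([0.1, -0.4, 0.2]), 1.0     # scale head

mu    = w1 @ hx + b1
sigma = w2 @ hx + b2                         # assumed positive for these values

def gaussian_likelihood(y, mu, sigma):
    # p(y | x) = exp(-(mu - y)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return np.exp(-(mu - y) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(mu, sigma, gaussian_likelihood(0.5, mu, sigma))
```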
Generic Modeling for Linear Regression: Step 2
Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is,
    (µ̂, σ̂) = arg min_{(µ,σ)} − Σ_x ln p(y | x).                  (11)
However, each summand
    C_x = −ln p(y | x) = (1/2) ln(2πσ^2) + (µ − y)^2 / (2σ^2)
has the gradient
    ∂C_x/∂σ = σ^{−1} − σ^{−3}(µ − y)^2,
so the gradients and errors become unstable when σ is close to 0.
Generic Modeling for Linear Regression: Step 2
To prevent the gradients and errors from being unstable, we may substitute v for the term 1/(2σ^2); then for each summand in the negative log-likelihood,
    C_x = (1/2)(ln π − ln v) + v(µ − y)^2,
    ∂C_x/∂µ = 2v(µ − y),
    ∂C_x/∂v = −1/(2v) + (µ − y)^2.
Note that this substitution is valid only when the variance is not too large, i.e. v is bounded away from 0.
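A sketch of the substituted cost and its gradients (in a real network v would itself be an output unit constrained positive; here it is just a made-up positive number):

```python
import numpy as np

def nll_v(mu, v, y):
    # C_x = (1/2)(ln pi - ln v) + v*(mu - y)^2, with v = 1/(2 sigma^2)
    return 0.5 * (np.log(np.pi) - np.log(v)) + v * (mu - y) ** 2

def grads_v(mu, v, y):
    dC_dmu = 2.0 * v * (mu - y)               # no sigma^(-3) term any more
    dC_dv  = -1.0 / (2.0 * v) + (mu - y) ** 2
    return dC_dmu, dC_dv

mu, v, y = 0.3, 0.8, 1.0                      # made-up values
print(nll_v(mu, v, y), grads_v(mu, v, y))
# Gradients stay finite as long as v is bounded away from 0,
# i.e. the variance is not too large.
```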
Generic Modeling for Linear Regression: Step 2
If the variance σ is fixed and chosen by the user, then by comparing the negative log-likelihood and MSE, we can see that minimizing the NLL is equivalent to minimizing the MSE:
    C_mse = (1/m) Σ_{i=1}^m ( ŷ^(i) − y^(i) )^2,
    C_nll = Σ_{i=1}^m C_x^(i) = (m/2) ln(2πσ^2) + (1/(2σ^2)) Σ_{i=1}^m ( µ_{x^(i)} − y^(i) )^2.
The first term of C_nll is constant in µ, and the second is the MSE up to the factor m/(2σ^2).
Generic Modeling for Linear Regression: Step 3
Full distribution from generic: µ and σ in this case.
Single statistic from deterministic: µ in this case.
Experiment (ref): generate random data based on the formula
    y = x + 7.0 sin(0.75x) + ε,
where ε is Gaussian noise with µ = 0, σ = 1.
FNN config: #(hidden layers) = 1, width = 20, and the hidden unit is tanh.
[Figure: fitted results, generic (left) vs. deterministic (right).]
More Complicated Cases
Complicated data distributions:
In some cases, it is almost impossible to describe the data via deterministic methods.
Generic methods might perform better in complicated cases.
Mixture Density Network
Generate random data based on the formula
    x = y + 7.0 sin(0.75y) + ε,
where ε is Gaussian noise with µ = 0, σ = 1.
Mixture Density Network
Firstly, just try using MSE to define the cost function, with one hidden layer of width = 20 and tanh hidden units. The rationale: minimizing MSE is equivalent to minimizing the negative log-likelihood for a simple Gaussian.
Mixture Density Network
The mixture density network: a Gaussian mixture with n components is defined by the conditional probability distribution
    p(y | x) = Σ_{i=1}^n p(c = i | x) N( y; µ^(i)(x), Σ^(i)(x) ).  (12)
Network configuration:
1 The number of components n needs to be tuned (trial and error).
2 3 × n output units.
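A sketch of the mixture NLL (12), splitting the 3 × n raw outputs into mixture weights, means, and scales (the softmax/exp parameterizations below are standard choices, assumed here rather than stated on the slides):

```python
import numpy as np

def mdn_nll(out, y):
    # out: (m, 3n) raw network outputs -> weights, means, stds of n Gaussians.
    n = out.shape[1] // 3
    logits, mu, s = out[:, :n], out[:, n:2*n], out[:, 2*n:]
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi = pi / pi.sum(axis=1, keepdims=True)          # p(c = i | x) via softmax
    sigma = np.exp(s)                                # keep the scales positive
    comp = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
           / np.sqrt(2 * np.pi * sigma ** 2)         # N(y; mu_i(x), sigma_i(x))
    return -np.log((pi * comp).sum(axis=1)).mean()   # negative log of eq. (12)

out = np.random.randn(4, 3 * 24)   # m = 4 examples, n = 24 components
y = np.random.randn(4)
print(mdn_nll(out, y))
```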
Mixture Density Network
Experiment (ref):
#(components) = 24,
two hidden layers with width = 24 and tanh activations,
#(output units) = 3 × 24, and they are linear.
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
In classification problems, cross entropy is naturally better at evaluating errors than other methods.
An improved cross entropy avoids numerical instability.
– See the MNIST example from TensorFlow.
To determine whether a cost function is good or not, ask:
– Is the cost function analytic?
– Can the learning progress well?
Deterministic vs. Generic
– Deterministic learns a single statistic, while generic learns the full distribution.
– When the data distribution is not normal (high kurtosis or fat tails), generic might be better.
– Generic methods are easier to apply to complicated cases.
Thank you.