Deep Neural Network
Cost Functions and Output Units
Jiaming Lin
jmlin@arbor.ee.ntu.edu.tw
DATALab@III
NetDBLab@NTU
January 9, 2017
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Introduction
In neural network learning...
The selection of the output unit depends on the learning problem.
– Classification: sigmoid, softmax or linear.
– Linear regression: linear.
Determine and analyse the cost function.
– Is the cost function †analytic?
– Can the learning progress well (first-order derivatives)?
Deterministic and generic models.
– Data is more complicated in many cases.
Note: †For simplicity, by analytic we mean that a function is
infinitely differentiable on its domain.
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Binary
index   x1      · · ·   xn      target
1       0       · · ·   1       Class A
2       1       · · ·   0       Class B
3       1       · · ·   1       Class A
· · ·   · · ·   · · ·   · · ·   · · ·
m       0       · · ·   0       Class B
Binary
The model predicts ŷ = S(z), where
S is the sigmoid function,
z is the input of the output layer,
    z = wᵀh + b,                                                  (1)
with w the weight, h the output of the hidden layer, and b the bias.
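As a minimal sketch of this output unit (the hidden output h, weight w, and bias b below are made-up values, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical hidden-layer output, weight, and bias.
h = np.array([0.2, -0.7, 1.5])
w = np.array([0.5, 0.1, -0.3])
b = 0.05

z = w @ h + b        # input of the output layer, eq. (1)
y_hat = sigmoid(z)   # prediction in (0, 1)
print(z, y_hat)
```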
Cost Function
A cost function can be derived in many ways; we discuss two of the most common.
Mean Squared Error
Let y^(i) denote the data label and ŷ^(i) = S(z^(i)) the prediction. We may define the cost function C_mse by
    C_mse = (1/m) Σ_{i=1}^m (ŷ^(i) − y^(i))^2                     (2)
where m is the data size, and z^(i), ŷ^(i) and y^(i) are real numbers.
Cross Entropy
Adapting the symbols above, the cost function defined by cross entropy is
    C_ce = (1/m) Σ_{i=1}^m [ y^(i) ln(ŷ^(i)) + (1 − y^(i)) ln(1 − ŷ^(i)) ]     (3)
where m is the data size, and z^(i), ŷ^(i) and y^(i) are real numbers.
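A sketch of both costs in numpy (labels and predictions below are made-up; note the sign convention follows eq. (3), so C_ce ≤ 0, and the naive logarithms inherit the instability discussed later):

```python
import numpy as np

def mse_cost(y_hat, y):
    # C_mse = (1/m) * sum (y_hat - y)^2, eq. (2)
    return np.mean((y_hat - y) ** 2)

def ce_cost(y_hat, y):
    # C_ce = (1/m) * sum [y ln(y_hat) + (1-y) ln(1-y_hat)], eq. (3);
    # naive version: undefined once y_hat reaches exactly 0 or 1.
    return np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(mse_cost(y_hat, y), ce_cost(y_hat, y))
```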
Comparison between MSE and Cross Entropy
Problem: Which one is better?
Analyticity (infinitely differentiable)
Learning ability (first-order derivatives)
Comparison between MSE and Cross Entropy
Analyticity:
    C_mse = (1/m) Σ_{i=1}^m (ŷ^(i) − y^(i))^2
    C_ce  = (1/m) Σ_{i=1}^m [ y^(i) ln(ŷ^(i)) + (1 − y^(i)) ln(1 − ŷ^(i)) ]
Computationally, the value of ŷ^(i) = S(z^(i)) could overflow to 1 or underflow to 0 when z^(i) is very positive or very negative. Therefore, given a fixed y^(i) ∈ {0, 1}:
    C_ce is undefined when ŷ^(i) is 0 or 1;
    C_mse is polynomial in ŷ^(i) and thus analytic everywhere.
Comparison between MSE and Cross Entropy
Learning Ability: compare the gradients
    ∂C_mse/∂w = [S(z) − y] [1 − S(z)] S(z) h,                     (4)
    ∂C_ce/∂w  = [y − S(z)] h,                                     (5)
respectively, where S is the sigmoid and z = wᵀh + b.
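To see the saturation numerically, a quick sketch (scalar h and made-up logits, not from the slides): for large |z| with the wrong label, the MSE gradient (4) vanishes while the cross-entropy gradient (5) stays close to ±1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h, y = 1.0, 1.0   # hypothetical hidden output (scalar) and label

for z in [-10.0, 0.0, 10.0]:
    s = sigmoid(z)
    grad_mse = (s - y) * (1 - s) * s * h   # eq. (4)
    grad_ce  = (y - s) * h                 # eq. (5)
    print(f"z={z:+5.1f}  dC_mse/dw={grad_mse:+.2e}  dC_ce/dw={grad_ce:+.2e}")
# At z = -10 (a confident wrong answer) the MSE gradient is ~ -4.5e-05,
# while the cross-entropy gradient is ~ 1: learning can still progress.
```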
Comparison between MSE and Cross Entropy
                     MSE                            Cross Entropy
gradient             [S(z) − y][1 − S(z)]S(z)h      [y − S(z)]h
y = 1 and ŷ → 1      steps → 0                      steps → 0
y = 1 and ŷ → 0      steps → 0                      steps → 1
y = 0 and ŷ → 1      steps → 0                      steps → −1
y = 0 and ŷ → 0      steps → 0                      steps → 0
In the case of Mean Squared Error, the progress gets stuck when z is very positive or very negative.
The Unstable Issue in Cross Entropy
We have mentioned the unstable issue of cross entropy. Precisely,
    ŷ = S(z) underflows to 0 when z is very negative,
    ŷ = S(z) overflows to 1 when z is very positive.
Therefore, given a fixed y ∈ {0, 1}, the function
    C = y ln ŷ + (1 − y) ln(1 − ŷ)
could be undefined when z is very positive or very negative.
The Unstable Issue in Cross Entropy
Alternatively, regarding z as the variable of cross entropy,
    C = y ln S(z) + (1 − y) ln(1 − S(z))                          (6)
      = −ζ(−z) + z(y − 1),                                        (7)
where ζ(x) = ln(1 + e^x) is the softplus and z is a real number.
We may obtain the analyticity of C from (7): it is a sum of a softplus term and a linear term, both analytic.
In the cases of the right answer,
    y = 1 and ŷ = S(z) → 1 ⇒ z → ∞,  C → 0,
    y = 0 and ŷ = S(z) → 0 ⇒ z → −∞, C → 0.
In the cases of the wrong answer,
    y = 1 and ŷ = S(z) → 0 ⇒ z → −∞, dC/dz → 1,
    y = 0 and ŷ = S(z) → 1 ⇒ z → ∞,  dC/dz → −1,
so, computed via (7), C is well defined for every finite z and the gradient stays informative.
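A numerically stable implementation works directly on the logit z through the softplus form (7). A sketch (the max/log1p identity for softplus is a standard stabilization, an assumption here, not from the slides):

```python
import numpy as np

def softplus(x):
    # zeta(x) = ln(1 + e^x), computed stably for large |x| via
    # zeta(x) = max(x, 0) + log1p(exp(-|x|)).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def stable_ce(z, y):
    # C = -zeta(-z) + z*(y - 1), eq. (7): finite for any finite z.
    return -softplus(-z) + z * (y - 1)

def naive_ce(z, y):
    s = 1.0 / (1.0 + np.exp(-z))
    return y * np.log(s) + (1 - y) * np.log(1 - s)

z, y = 40.0, 0.0        # sigmoid(40) rounds to exactly 1.0 in float64
print(naive_ce(z, y))   # -inf: ln(1 - 1) is undefined
print(stable_ce(z, y))  # -40.0, the correct value
```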
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Multinoulli: Output Unit and Cost Function
Generalize the binary case to multiple classes.
Linear output units, with #(output units) = #(classes).
Cost function evaluated by cross entropy.
Cost Function in Multinoulli Problems
Suppose the size of the dataset is m and there are K classes. Then we can obtain the cost function from cross entropy:
    C(w) = − Σ_{i=1}^m Σ_{k=1}^K 1{y^(i) = k} ln [ exp(z_k^(i)) / Σ_{j=1}^K exp(z_j^(i)) ]        (8)
where z_k^(i) = w_kᵀ h^(i) + b_k and h^(i) is the output of the hidden layer corresponding to example x^(i).
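A sketch of the cost (8) computed from the logits (made-up shapes and values; the max subtraction is the stabilization that Lemma 1 below justifies):

```python
import numpy as np

def log_softmax(Z):
    # ln softmax(z)_k = z_k - ln sum_j exp(z_j);
    # subtracting max_j z_j first avoids overflow (cf. Lemma 1).
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def multinoulli_cost(Z, y):
    # C(w) = - sum_i ln softmax(z^(i))_{y^(i)}, eq. (8)
    m = Z.shape[0]
    return -log_softmax(Z)[np.arange(m), y].sum()

Z = np.array([[2.0, 0.5, -1.0],   # hypothetical logits, m = 2, K = 3
              [0.1, 0.3,  0.2]])
y = np.array([0, 2])              # true class indices
print(multinoulli_cost(Z, y))
```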
A Lemma for Cost Function Simplification
Analyticity (infinitely differentiable)
Learning ability (first-order derivatives)
To claim the above properties, we should first show a lemma.
Lemma 1
For the output z = wᵀh + b with z = [z_1, . . . , z_K],
    ln Σ_{j=1}^K exp(z_j) ≈ max_j {z_j},                          (9)
with the gap smaller than any given ε > 0 once the largest z_j dominates the others.
A Lemma for Cost Function Simplification
Proof.
Without loss of generality, we may assume z_1 > . . . > z_K. The remaining work is to show that, for any ε > 0,
    ln [ e^{z_1} (1 + Σ_{j=2}^K e^{z_j − z_1}) ] = z_1 + ln (1 + Σ_{j=2}^K e^{z_j − z_1}) ≤ z_1 + ε,
which holds once the gaps z_j − z_1 are sufficiently negative.
Intuitively, ln Σ_{j=1}^K exp(z_j) can be well approximated by max_j {z_j}.
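A quick numerical check of the lemma (made-up logits, not from the slides): the gap between log-sum-exp and the max shrinks as one logit dominates.

```python
import numpy as np

for scale in [1.0, 5.0, 25.0]:
    z = scale * np.array([1.0, 0.2, -0.5])   # widen the gaps between logits
    lse = np.log(np.exp(z).sum())
    print(f"scale={scale:5.1f}  lse={lse:8.4f}  max={z.max():8.4f}  gap={lse - z.max():.2e}")
# The gap lse - max tends to 0 as the largest logit dominates.
```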
Analyticity
We may rewrite the cost function as
    C(w) = − Σ_{i=1}^m Σ_{k=1}^K 1{y^(i) = k} [ z_k^(i) − ln Σ_{j=1}^K exp(z_j^(i)) ].
Each summand is a difference of analytic functions and thus analytic, and the term 1{y^(i) = k} is actually a constant. The total cost is a sum of analytic functions and thus analytic.
Learning Ability
Property 2
By the rule of sums in derivatives, we may simplify (8) to
    C^(i) = Σ_{k=1}^K 1{y = k} [ z_k − ln Σ_{j=1}^K exp(z_j) ],   (10)
the cost contributed by the example x^(i) to the total cost C.
1 If the model gives the right answer, then the error is close to 0.
2 If the model gives the wrong answer, then the learning can progress well.
Learning Ability
Proof (The Right Answer).
Suppose the true label is class n. By the assumption, we know z_n is maximal. Then
    −ε ≤ Σ_{k=1}^K 1{y = k} [ z_k − ln Σ_{j=1}^K exp(z_j) ]
       = z_n − ln Σ_{j=1}^K exp(z_j)
       < z_n − max_j {z_j} = 0.
This shows that −ε ≤ C^(i) < 0 for an arbitrarily small ε.
Learning Ability
Proof (The Wrong Answer).
Suppose the true label is class n. By assumption, the prediction z_n given by the model is not maximal. On the other hand, using the fact
    z_n ≠ max_j {z_j} ⇒ softmax(z)_n ≪ 1,
there exists a sufficiently large δ > 0 such that
    | softmax(z)_n − 1 | > δ.
Learning Ability
Proof (The Wrong Answer, Cont.)
Then
    ∂C^(i)/∂z_n = ∂/∂z_n [ z_n − ln Σ_{j=1}^K e^{z_j} ] = 1 − softmax(z)_n > δ.
This shows the gradient is sufficiently large and also predictable (bounded by 1); therefore the learning can progress well.
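A finite-difference check of this gradient (made-up logits; a sketch, not from the slides):

```python
import numpy as np

def cost_i(z, n):
    # C^(i) = z_n - ln sum_j exp(z_j), eq. (10)
    return z[n] - np.log(np.exp(z).sum())

z, n, eps = np.array([0.5, 2.0, -1.0]), 0, 1e-6  # wrong answer: z_0 is not maximal
analytic = 1.0 - np.exp(z[n]) / np.exp(z).sum()  # 1 - softmax(z)_n
z_plus = z.copy(); z_plus[n] += eps
numeric = (cost_i(z_plus, n) - cost_i(z, n)) / eps
print(analytic, numeric)  # both ~0.82: a large gradient, bounded by 1
```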
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Learning Processes Overview
          Deterministic                       Generic
Step 1    Model function                      Probability distribution
          (linear, sigmoid)                   (Gaussian, Bernoulli)
Step 2    Design error evaluations            Maximum likelihood estimate
          (MSE, cross entropy)
Step 3    Learning one statistic              Learning the full distribution
          (mean, median)
To describe some complicated data, it is easier to build the model with the generic method.
Generic Modeling for Binary Classification
Step 1: Use the Bernoulli distribution as the likelihood function:
    p(y | x) = p^y (1 − p)^{1−y} = S(z)^y (1 − S(z))^{1−y}.
Step 2: Minimize the negative log-likelihood, where each term is
    ln p(y | x^(i)) = y ln S(z) + (1 − y) ln(1 − S(z)).
Step 3: We can learn the full distribution: for a new input x′,
    p(y′ | x′) = S(z′)^{y′} (1 − S(z′))^{1−y′},
where we denote z′ = wᵀx′ + b and S is the sigmoid.
Generic Modeling for Linear Regression: Step 1
Given a training feature x, use the Gaussian distribution as the likelihood function:
    p(y | x) = (1 / √(2πσ^2)) exp( −(µ − y)^2 / (2σ^2) ),
where, denoting the output of the hidden layer by h_x, the weights w = [w_1, w_2] and biases b = [b_1, b_2],
    µ = w_1ᵀ h_x + b_1,
    σ = w_2ᵀ h_x + b_2.
Intuitively, µ and σ are two linear output units; they are functions of x.
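A sketch of these two output heads in numpy (h_x, weights, and biases are made-up values; note a plain linear unit does not enforce σ > 0, which the slides leave implicit):

```python
import numpy as np

hx = np.array([0.4, -1.2, 0.8])              # hypothetical hidden-layer output
w1, b1 = np.array([0.3, 0.2, -0.1]), 0.0     # mean head
w2, b2 = np.array([0.1, -0.4, 0.2]), 1.0     # scale head

mu    = w1 @ hx + b1
sigma = w2 @ hx + b2                         # assumed positive for these values

def gaussian_likelihood(y, mu, sigma):
    # p(y | x) = exp(-(mu - y)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return np.exp(-(mu - y) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(mu, sigma, gaussian_likelihood(0.5, mu, sigma))
```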
Generic Modeling for Linear Regression: Step 2
Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is,
    (µ̂, σ̂) = arg min_{(µ,σ)} − Σ_x ln p(y | x).                  (11)
However, each summand
    C_x = −ln p(y | x) = (1/2) ln(2πσ^2) + (µ − y)^2 / (2σ^2)
has the gradient
    ∂C_x/∂σ = σ^{−1} − σ^{−3}(µ − y)^2,
so the gradients and errors become unstable when σ is close to 0.
Generic Modeling for Linear Regression: Step 2
To prevent the gradients and errors from being unstable, we may substitute v for the term 1/(2σ^2); then for each summand in the negative log-likelihood,
    C_x = (1/2)(ln π − ln v) + v(µ − y)^2,
    ∂C_x/∂µ = 2v(µ − y),
    ∂C_x/∂v = −1/(2v) + (µ − y)^2.
Note that this substitution is valid only when the variance is not too large, i.e. v is bounded away from 0.
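A sketch of the substituted cost and its gradients (in a real network v would itself be an output unit constrained positive; here it is just a made-up positive number):

```python
import numpy as np

def nll_v(mu, v, y):
    # C_x = (1/2)(ln pi - ln v) + v*(mu - y)^2, with v = 1/(2 sigma^2)
    return 0.5 * (np.log(np.pi) - np.log(v)) + v * (mu - y) ** 2

def grads_v(mu, v, y):
    dC_dmu = 2.0 * v * (mu - y)               # no sigma^(-3) term any more
    dC_dv  = -1.0 / (2.0 * v) + (mu - y) ** 2
    return dC_dmu, dC_dv

mu, v, y = 0.3, 0.8, 1.0                      # made-up values
print(nll_v(mu, v, y), grads_v(mu, v, y))
# Gradients stay finite as long as v is bounded away from 0,
# i.e. the variance is not too large.
```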
Generic Modeling for Linear Regression: Step 2
If the variance σ is fixed and chosen by the user, then by comparing the negative log-likelihood and MSE, we can see that minimizing the NLL is equivalent to minimizing the MSE:
    C_mse = (1/m) Σ_{i=1}^m ( ŷ^(i) − y^(i) )^2,
    C_nll = Σ_{i=1}^m C_x^(i) = (m/2) ln(2πσ^2) + (1/(2σ^2)) Σ_{i=1}^m ( µ_{x^(i)} − y^(i) )^2.
The first term of C_nll is constant in µ, and the second is the MSE up to the factor m/(2σ^2).
Generic Modeling for Linear Regression: Step 3
Full distribution from generic: µ and σ in this case.
Single statistic from deterministic: µ in this case.
Experiment (ref): generate random data based on the formula
    y = x + 7.0 sin(0.75x) + ε,
where ε is Gaussian noise with µ = 0, σ = 1.
FNN config: #(hidden layers) = 1, width = 20, and the hidden unit is tanh.
[Figure: fitted results, generic (left) vs. deterministic (right).]
More Complicated Cases
Complicated data distributions:
In some cases, it is almost impossible to describe the data via deterministic methods.
Generic methods might perform better in complicated cases.
Mixture Density Network
Generate random data based on the formula
    x = y + 7.0 sin(0.75y) + ε,
where ε is Gaussian noise with µ = 0, σ = 1.
Mixture Density Network
Firstly, just try using MSE to define the cost function, with one hidden layer of width = 20 and tanh hidden units. The rationale: minimizing MSE is equivalent to minimizing the negative log-likelihood for a simple Gaussian.
Mixture Density Network
The mixture density network: a Gaussian mixture with n components is defined by the conditional probability distribution
    p(y | x) = Σ_{i=1}^n p(c = i | x) N( y; µ^(i)(x), Σ^(i)(x) ).  (12)
Network configuration:
1 The number of components n needs to be tuned (trial and error).
2 3 × n output units.
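A sketch of the mixture NLL (12), splitting the 3 × n raw outputs into mixture weights, means, and scales (the softmax/exp parameterizations below are standard choices, assumed here rather than stated on the slides):

```python
import numpy as np

def mdn_nll(out, y):
    # out: (m, 3n) raw network outputs -> weights, means, stds of n Gaussians.
    n = out.shape[1] // 3
    logits, mu, s = out[:, :n], out[:, n:2*n], out[:, 2*n:]
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi = pi / pi.sum(axis=1, keepdims=True)          # p(c = i | x) via softmax
    sigma = np.exp(s)                                # keep the scales positive
    comp = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
           / np.sqrt(2 * np.pi * sigma ** 2)         # N(y; mu_i(x), sigma_i(x))
    return -np.log((pi * comp).sum(axis=1)).mean()   # negative log of eq. (12)

out = np.random.randn(4, 3 * 24)   # m = 4 examples, n = 24 components
y = np.random.randn(4)
print(mdn_nll(out, y))
```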
Mixture Density Network
Experiment (ref):
#(components) = 24,
two hidden layers with width = 24 and tanh activations,
#(output units) = 3 × 24, and they are linear.
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
In classification problems, cross entropy is naturally better at evaluating errors than other methods.
An improved cross entropy avoids numerical instability.
– See the MNIST example from TensorFlow.
To determine whether a cost function is good or not, ask:
– Is the cost function analytic?
– Can the learning progress well?
Deterministic vs. Generic
– Deterministic learns a single statistic, while generic learns the full distribution.
– When the data distribution is not normal (high kurtosis or fat tails), generic might be better.
– Generic methods are easier to apply to complicated cases.
Thank you.