Deep Neural Network
Cost Functions and Output Units
Jiaming Lin
jmlin@arbor.ee.ntu.edu.tw
DATALab@III
NetDBLab@NTU
January 9, 2017
Outline
1 Introduction
2 Output Units and Cost Functions
Binary
Multinoulli
3 Deterministic and Generic Model
4 Conclusions and Discussions
Introduction
In neural network learning...
The choice of output unit depends on the learning problem.
– Classification: sigmoid, softmax or linear.
– Linear regression: linear.
Determine and analyse the cost function.
– Is the cost function †analytic?
– Can the learning progress well (first-order derivatives)?
Deterministic and generic models.
– In many cases the data are too complicated for a single deterministic fit.
Note: †For simplicity, we say a function is analytic to mean it is infinitely differentiable on its domain.
Binary
index   x1    · · ·   xn    target
1       0     · · ·   1     Class A
2       1     · · ·   0     Class B
3       1     · · ·   1     Class A
· · ·   · · · · · ·   · · · · · ·
m       0     · · ·   0     Class B
Binary
The output unit predicts ŷ = S(z), where
– S is the sigmoid function,
– z is the input of the output layer,
    $z = w^\top h + b$    (1)
with weight w, hidden-layer output h and bias b.
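As a concrete illustration (mine, not from the slides), a minimal NumPy sketch of this binary output unit; the hidden width, weights and random seed are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: hidden width 4, one output unit.
rng = np.random.default_rng(0)
h = rng.standard_normal(4)   # hidden-layer output
w = rng.standard_normal(4)   # output-layer weights
b = 0.1                      # output-layer bias

z = w @ h + b                # eq. (1): z = w^T h + b
y_hat = sigmoid(z)           # prediction in (0, 1)
```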
Cost Function
The cost function can be derived in many ways; we discuss two of the most common:
– Mean Square Error
– Cross Entropy
Mean Square Error
Let y^(i) denote the data label and ŷ^(i) = S(z^(i)) the prediction. We may define the cost function C_mse by
    $C_{mse} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$    (2)
where m is the data size, and z^(i), ŷ^(i) and y^(i) are real numbers.
Cross Entropy
Adapting the symbols above, the cost function defined by cross entropy is
    $C_{ce} = \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\ln\hat{y}^{(i)} + \left(1 - y^{(i)}\right)\ln\left(1 - \hat{y}^{(i)}\right)\right]$
where m is the data size, and z^(i), ŷ^(i) and y^(i) are real numbers. (Written this way, C_ce ≤ 0 and 0 is the best attainable value; minimizing −C_ce is the equivalent, more common convention.)
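A minimal sketch of the two costs, assuming the sigmoid output unit above; the toy labels and logits are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def c_mse(y, z):
    # Eq. (2): mean squared error between S(z) and the labels.
    y_hat = sigmoid(z)
    return np.mean((y_hat - y) ** 2)

def c_ce(y, z):
    # Cross entropy as written on the slide (no leading minus),
    # so values are <= 0 and 0 is the best attainable score.
    y_hat = sigmoid(z)
    return np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy batch: labels and logits (illustrative values only).
y = np.array([1.0, 0.0, 1.0])
z = np.array([2.0, -1.0, 0.5])
print(c_mse(y, z), c_ce(y, z))
```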
Comparison between MSE and Cross Entropy
Problem: which one is better? We compare
– Analyticity (infinitely differentiable)
– Learning ability (first-order derivatives)
Analyticity:
Computationally, the value of ŷ^(i) = S(z^(i)) could overflow to 1 or underflow to 0 when z^(i) is very positive or very negative. Therefore, given a fixed y^(i) ∈ {0, 1},
– C_ce is undefined where ŷ^(i) is 0 or 1;
– C_mse is polynomial in ŷ^(i) and thus analytic everywhere.
Learning Ability: compare the gradients
    $\frac{\partial C_{mse}}{\partial w} = \left[S(z) - y\right]\left[1 - S(z)\right]S(z)\,h,$    (3)
    $\frac{\partial C_{ce}}{\partial w} = \left[y - S(z)\right]h$    (4)
respectively, where S is the sigmoid and $z = w^\top h + b$.
MSE gradient $[S(z) - y][1 - S(z)]S(z)h$:
– If y = 1 and ŷ → 1, steps → 0
– If y = 1 and ŷ → 0, steps → 0
– If y = 0 and ŷ → 1, steps → 0
– If y = 0 and ŷ → 0, steps → 0
Cross-entropy gradient $[y - S(z)]h$:
– If y = 1 and ŷ → 1, steps → 0
– If y = 1 and ŷ → 0, steps → 1
– If y = 0 and ŷ → 1, steps → −1
– If y = 0 and ŷ → 0, steps → 0
In the case of mean square error, progress gets stuck when z is very positive or very negative: the factor S(z)[1 − S(z)] vanishes there, so the steps shrink to zero even when the prediction is wrong.
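To see the saturation numerically, a small sketch (mine, not the deck's) evaluating the scalar gradient factors from (3) and (4) at a confidently wrong prediction; the common h factor is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_factor(y, z):
    # From eq. (3): [S(z) - y] [1 - S(z)] S(z), dropping h.
    s = sigmoid(z)
    return (s - y) * (1 - s) * s

def ce_grad_factor(y, z):
    # From eq. (4): y - S(z), dropping h.
    return y - sigmoid(z)

# Confidently wrong prediction: y = 1 but z very negative.
y, z = 1.0, -10.0
print(mse_grad_factor(y, z))  # ~ -4.5e-05: learning stalls
print(ce_grad_factor(y, z))   # ~ 1.0: a useful, bounded step
```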
The Unstable Issue in Cross Entropy
We have mentioned the unstable issue of cross entropy. Precisely,
– ŷ = S(z) underflows to 0 when z is very negative,
– ŷ = S(z) overflows to 1 when z is very positive.
Therefore, given a fixed y ∈ {0, 1}, the function
    $C = y\ln\hat{y} + (1 - y)\ln(1 - \hat{y})$
could be undefined when z is very positive or very negative.
The Unstable Issue in Cross Entropy
Alternatively, regarding z as the variable of the cross entropy,
    $C = y\ln S(z) + (1 - y)\ln(1 - S(z))$    (5)
    $\phantom{C} = -\zeta(-z) + z(y - 1),$    (6)
where ζ(x) = ln(1 + e^x) is the softplus and z is a real number. In form (6), C is the sum of a softplus term and a linear term, both analytic, so C is analytic everywhere.
In the cases of the right answer,
– y = 1 and ŷ = S(z) → 1 ⇒ z → ∞, C → 0,
– y = 0 and ŷ = S(z) → 0 ⇒ z → −∞, C → 0.
In the cases of the wrong answer,
– y = 1 and ŷ = S(z) → 0 ⇒ z → −∞, dC/dz → 1,
– y = 0 and ŷ = S(z) → 1 ⇒ z → ∞, dC/dz → −1,
so the gradient stays bounded and informative rather than vanishing.
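A hedged sketch of a numerically stable implementation built directly on form (6), in the spirit of TensorFlow's logits-based cross entropy; the stable softplus identity max(x, 0) + log1p(e^{−|x|}) is a standard trick added here, not something stated on the slides.

```python
import numpy as np

def softplus(x):
    # Stable ln(1 + e^x): softplus(x) = max(x, 0) + log1p(e^{-|x|}).
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def cross_entropy_from_logits(y, z):
    # Eq. (6): C = -softplus(-z) + z (y - 1); S(z) is never formed,
    # so the probability itself cannot overflow or underflow.
    return -softplus(-z) + z * (y - 1)

z = np.array([-50.0, 50.0, 0.0])   # extreme and moderate logits
y = np.array([1.0, 0.0, 1.0])
print(cross_entropy_from_logits(y, z))  # finite even for extreme z
```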
Multinoulli: Output Unit and Cost Function
– Generalize the binary case to multiple classes.
– Linear output units, with #(output units) = #(classes).
– Cost function evaluated by cross entropy.
Cost Function in Multinoulli Problems
Suppose the size of the dataset is m and there are K classes. Then we can obtain the cost function from cross entropy:
    $C(w) = -\sum_{i=1}^{m}\sum_{k=1}^{K} \mathbf{1}\{y^{(i)} = k\}\,\ln\frac{\exp\!\big(z_k^{(i)}\big)}{\sum_{j=1}^{K}\exp\!\big(z_j^{(i)}\big)}$    (7)
where $z_k^{(i)} = w_k^\top h^{(i)} + b_k$ and h^(i) is the hidden-layer output corresponding to example x_i.
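A minimal sketch of cost (7), assuming integer class labels; subtracting max(z) inside the log-softmax is the stabilization that Lemma 1 below justifies. Shapes and values are illustrative.

```python
import numpy as np

def log_softmax(z):
    # ln softmax(z)_k = z_k - ln sum_j exp(z_j); shifting by max(z)
    # keeps exp() from overflowing without changing the result.
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def multinoulli_cost(Z, y):
    # Eq. (7): C(w) = - sum_i ln softmax(z^(i))_{y_i}.
    # Z has shape (m, K); y holds the class index of each example.
    return -sum(log_softmax(Z[i])[y[i]] for i in range(len(y)))

Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 3.0]])
y = np.array([0, 2])
print(multinoulli_cost(Z, y))
```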
A Lemma for Simplifying the Cost Function
– Analyticity (infinitely differentiable)
– Learning ability (first-order derivatives)
To establish these properties, we first show a lemma.
Lemma 1
For the outputs $z_k = w_k^\top h + b_k$ and $z = [z_1, \ldots, z_K]$, for every ε > 0 we have
    $\max_j\{z_j\} \;\le\; \ln\sum_{j=1}^{K}\exp(z_j) \;\le\; \max_j\{z_j\} + \varepsilon$
whenever the maximal $z_j$ sufficiently dominates the others.
Proof.
Without loss of generality, assume $z_1 > \cdots > z_K$; the lower bound is immediate since $\ln\sum_j \exp(z_j) \ge \ln e^{z_1} = z_1$. For the upper bound, for every ε > 0,
    $\ln\!\left[e^{z_1}\Big(1 + \sum_{j=2}^{K} e^{z_j - z_1}\Big)\right] = z_1 + \ln\!\Big(1 + \sum_{j=2}^{K} e^{z_j - z_1}\Big) \le z_1 + \varepsilon$
once each $z_j - z_1$ is sufficiently negative.
Intuitively, $\ln\sum_{j=1}^{K}\exp(z_j)$ is well approximated by $\max_j\{z_j\}$.
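A quick numerical check of the lemma, using the standard max-shift evaluation of log-sum-exp (an implementation detail assumed here, not stated on the slide):

```python
import numpy as np

def log_sum_exp(z):
    # Shift by max(z): exp() then sees only non-positive arguments.
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([10.0, 2.0, 1.0, 0.0])
print(log_sum_exp(z))   # ~ 10.0005
print(np.max(z))        # 10.0 -- the bound of Lemma 1 is nearly tight
```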
Analyticity
We may rewrite the cost function as
    $C(w) = -\sum_{i=1}^{m}\sum_{k=1}^{K}\mathbf{1}\{y^{(i)} = k\}\left[z_k^{(i)} - \ln\sum_{j=1}^{K}\exp\!\big(z_j^{(i)}\big)\right].$
Each summand is a difference of analytic functions and thus analytic, and the term $\mathbf{1}\{y^{(i)} = k\}$ is actually a constant. The total cost is a sum of analytic functions and thus analytic.
Learning Ability
Property 2
By the rule of sums in derivatives, we may simplify (7) to the contribution of a single example x_i:
    $C^{(i)} = \sum_{k=1}^{K}\mathbf{1}\{y = k\}\left[z_k - \ln\sum_{j=1}^{K}\exp(z_j)\right],$
so that the total cost is $C = -\sum_i C^{(i)}$.
1. If the model gives the right answer, the error is close to 0.
2. If the model gives the wrong answer, the learning can progress well.
Proof (The Right Answer).
Suppose the true label is class n. By assumption, $z_n$ is maximal. Then, by Lemma 1, for arbitrarily small ε > 0,
    $-\varepsilon \;\le\; \sum_{k=1}^{K}\mathbf{1}\{y = k\}\left[z_k - \ln\sum_{j=1}^{K}\exp(z_j)\right] = z_n - \ln\sum_{j=1}^{K}\exp(z_j) \;<\; z_n - \max_j\{z_j\} = 0.$
This shows that $-\varepsilon \le C^{(i)} < 0$ for an arbitrarily small ε.
Proof (The Wrong Answer).
Suppose the true label is class n. By assumption, the prediction $z_n$ given by the model is not maximal. On the other hand, since softmax(z_n) ≈ 1 only when $z_n = \max_j\{z_j\}$, here softmax(z_n) is bounded away from 1. This implies there exists a sufficiently large δ > 0 such that
    $|\operatorname{softmax}(z_n) - 1| > \delta.$
Proof (The Wrong Answer, Cont.)
Then
    $\frac{\partial C^{(i)}}{\partial z_n} = \frac{\partial}{\partial z_n}\left[z_n - \ln\sum_{j=1}^{K} e^{z_j}\right] = 1 - \operatorname{softmax}(z_n) > \delta.$
This shows the gradient is sufficiently large and also predictable (bounded by 1), so the learning can progress well.
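To sanity-check the gradient formula in this proof, a small sketch comparing 1 − softmax(z)_n with a finite-difference estimate of ∂C^(i)/∂z_n; the scores, true class and epsilon are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def c_i(z, n):
    # Per-example cost above, for an example whose true class is n.
    return z[n] - np.log(np.sum(np.exp(z)))

# Wrong answer: true class 0, but z[2] carries the largest score.
z = np.array([0.0, 1.0, 4.0])
n = 0

# Analytic gradient from the proof: dC/dz_n = 1 - softmax(z)_n.
print(1.0 - softmax(z)[n])          # ~ 0.983

# Finite-difference check (hypothetical epsilon).
eps = 1e-6
print((c_i(z + eps * np.eye(3)[n], n) - c_i(z, n)) / eps)  # ~ 0.983
```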
Learning Processes Overview

         Deterministic                    Generic
Step 1   Model function                   Probability distribution
         (linear, sigmoid)                (Gaussian, Bernoulli)
Step 2   Design error evaluation          Maximum likelihood estimate
         (MSE, cross entropy)
Step 3   Learn one statistic              Learn the full distribution
         (mean, median)
To describe complicated data, it is often easier to build the model with the generic method.
Generic Modeling for Binary Classification
Step 1: Use the Bernoulli distribution as the likelihood function:
    $p(y \mid x) = p^y(1 - p)^{1-y} = S(z)^y\big(1 - S(z)\big)^{1-y}.$
Step 2: Minimize the negative log-likelihood, where
    $\ln p\big(y \mid x^{(i)}\big) = y\ln S(z) + (1 - y)\ln\big(1 - S(z)\big)$
is exactly the cross entropy from before.
Step 3: We can learn the full distribution:
    $p(y' \mid x') = S(z')^{y'}\big(1 - S(z')\big)^{1-y'},$
where $z' = w^\top x' + b$ and S is the sigmoid.
Generic Modeling for Linear Regression: Step 1
Given a training feature x, use the Gaussian distribution as the likelihood function:
    $p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(\frac{-(\mu - y)^2}{2\sigma^2}\right),$
where, denoting the hidden-layer output by $h_x$, the weights $w = [w_1, w_2]$ and biases $b = [b_1, b_2]$,
    $\mu = w_1^\top h_x + b_1, \qquad \sigma = w_2^\top h_x + b_2.$
Intuitively, µ and σ are two linear output units; they are functions of x.
Generic Modeling for Linear Regression: Step 2
Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is,
    $(\hat\mu, \hat\sigma) = \arg\min_{(\mu,\sigma)} \; -\sum_x \ln p(y \mid x).$    (8)
However, for each summand,
    $C_x = \ln p(y \mid x) = \frac{-1}{2}\left[\ln(2\pi\sigma^2) + \frac{(\mu - y)^2}{\sigma^2}\right], \qquad \frac{\partial C_x}{\partial \sigma} = -\sigma^{-1} + \sigma^{-3}(\mu - y)^2,$
so the gradients and errors become unstable when σ is close to 0.
To prevent the gradients and errors from becoming unstable, we may substitute $v = \frac{1}{2\sigma^2}$; then for each summand of the log-likelihood,
    $C_x = \frac{1}{2}\ln v - \frac{1}{2}\ln\pi - v(\mu - y)^2,$
    $\frac{\partial C_x}{\partial \mu} = -2v(\mu - y), \qquad \frac{\partial C_x}{\partial v} = \frac{1}{2v} - (\mu - y)^2.$
Note that this substitution is valid only when the variance is not too large (so that v stays bounded away from 0).
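A minimal sketch of the reparameterized objective, written as the negative log-likelihood (so the signs are flipped relative to $C_x = \ln p$ above); the sample values are illustrative.

```python
import numpy as np

def neg_log_lik(mu, v, y):
    # -ln p(y|x) with v = 1/(2 sigma^2):
    #   -ln p = 0.5*ln(pi) - 0.5*ln(v) + v*(mu - y)^2
    return 0.5 * np.log(np.pi) - 0.5 * np.log(v) + v * (mu - y) ** 2

def grads(mu, v, y):
    # d(-ln p)/d mu = 2 v (mu - y)
    # d(-ln p)/d v  = -1/(2v) + (mu - y)^2
    return 2 * v * (mu - y), -1.0 / (2 * v) + (mu - y) ** 2

mu, v, y = 0.5, 2.0, 1.0   # illustrative values
print(neg_log_lik(mu, v, y), grads(mu, v, y))
```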
If the variance σ is fixed and chosen by the user, then comparing the negative log-likelihood with MSE shows that minimizing the NLL is equivalent to minimizing the MSE:
    $C_{mse} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2,$
    $C_{nll} = -\sum_{i=1}^{m} C_{x^{(i)}} = \frac{1}{2}\left[m\ln(2\pi\sigma^2) + \sum_{i=1}^{m}\frac{\big(\mu_{x^{(i)}} - y^{(i)}\big)^2}{\sigma^2}\right],$
which differ only by constants once σ is fixed.
Generic Modeling for Linear Regression: Step 3
– Full distribution from the generic model: µ and σ in this case.
– A single statistic from the deterministic model: µ in this case.
Experiment (ref): generate random data based on the formula
    $y = x + 7.0\sin(0.75x) + \epsilon$
where ε is Gaussian noise with µ = 0, σ = 1.
FNN configuration: #(hidden layers) = 1, width = 20, tanh hidden units.
[Figure: fitted curves of the generic (left) and deterministic (right) models; image not preserved.]
More Complicated Cases
– Complicated data distributions.
– In some cases it is almost impossible to describe the data with deterministic methods.
– Generic methods may perform better in complicated cases.
Mixture Density Network
Generate random data based on the formula
    $x = y + 7.0\sin(0.75y) + \epsilon$
where ε is Gaussian noise with µ = 0, σ = 1. (Note that x and y are swapped relative to the earlier experiment, so for a given x the target y is multi-valued.)
First, simply try using MSE to define the cost function, with one hidden layer of width 20 and tanh hidden units. The rationale is that minimizing MSE is equivalent to minimizing the negative log-likelihood of a simple Gaussian; a single Gaussian, however, cannot capture a multi-valued target.
The mixture density network: the Gaussian mixture with n components is defined by the conditional probability distribution
    $p(y \mid x) = \sum_{i=1}^{n} p(c = i \mid x)\,\mathcal{N}\big(y;\ \mu^{(i)}(x),\ \Sigma^{(i)}(x)\big).$    (9)
Network configuration:
1. The number of components n needs to be tuned by trial and error.
2. 3 × n output units.
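A hedged one-dimensional sketch of the per-example negative log-likelihood of (9), computed via log-sum-exp for stability. The mixture parameters stand in for what the 3 × n output units would produce (weights via softmax, positive scales); that mapping is an assumption about the architecture, not something the slide specifies.

```python
import numpy as np

def mdn_neg_log_lik(pi, mu, sigma, y):
    # Eq. (9): p(y|x) = sum_i pi_i * N(y; mu_i, sigma_i^2), in 1-D.
    # Evaluated as -logsumexp of per-component log densities.
    log_comp = (np.log(pi)
                - 0.5 * np.log(2 * np.pi * sigma ** 2)
                - (y - mu) ** 2 / (2 * sigma ** 2))
    m = np.max(log_comp)
    return -(m + np.log(np.sum(np.exp(log_comp - m))))

# One data point under a hypothetical 3-component mixture head:
pi = np.array([0.2, 0.5, 0.3])      # mixing weights p(c = i | x)
mu = np.array([-1.0, 0.0, 2.0])     # component means mu^(i)(x)
sigma = np.array([0.5, 1.0, 0.8])   # component scales
print(mdn_neg_log_lik(pi, mu, sigma, y=0.3))
```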
Experiment (ref):
– #(components) = 24,
– two hidden layers of width 24 with tanh activations,
– #(output units) = 3 × 24, all linear.
Conclusions and Discussions
– In classification problems, cross entropy is naturally better at evaluating errors than the other methods considered here.
– An improved cross-entropy formulation avoids numerical instability.
  – See the MNIST example from TensorFlow.
– To determine whether a cost function is good, ask:
  – Is the cost function analytic?
  – Can the learning progress well?
– Deterministic vs. Generic:
  – Deterministic learns a single statistic, while generic learns the full distribution.
  – When the data distribution is not normal (high kurtosis or fat tails), generic may be better.
  – Generic methods are easier to apply to complicated cases.
Thank you.
28 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network

More Related Content

What's hot

Lecture8 multi class_svm
Lecture8 multi class_svmLecture8 multi class_svm
Lecture8 multi class_svmStéphane Canu
 
The Mathematics of RSA Encryption
The Mathematics of RSA EncryptionThe Mathematics of RSA Encryption
The Mathematics of RSA EncryptionNathan F. Dunn
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural NetworkLiwei Ren任力偉
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...MLconf
 
Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)준식 최
 
11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine LearningAndres Mendez-Vazquez
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
CP 2011 Poster
CP 2011 PosterCP 2011 Poster
CP 2011 PosterSAAM007
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learningKazuki Fujikawa
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkKazuki Fujikawa
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural NetworksMasahiro Suzuki
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward HeuristicsAndres Mendez-Vazquez
 
Lecture9 multi kernel_svm
Lecture9 multi kernel_svmLecture9 multi kernel_svm
Lecture9 multi kernel_svmStéphane Canu
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes FamilyKota Matsui
 
Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1TOMMYLINK1
 
Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]Palak Sanghani
 

What's hot (20)

Lecture8 multi class_svm
Lecture8 multi class_svmLecture8 multi class_svm
Lecture8 multi class_svm
 
The Mathematics of RSA Encryption
The Mathematics of RSA EncryptionThe Mathematics of RSA Encryption
The Mathematics of RSA Encryption
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
 
Perceptron
PerceptronPerceptron
Perceptron
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
 
Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)
 
11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning11 Machine Learning Important Issues in Machine Learning
11 Machine Learning Important Issues in Machine Learning
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
Sparse autoencoder
Sparse autoencoderSparse autoencoder
Sparse autoencoder
 
CP 2011 Poster
CP 2011 PosterCP 2011 Poster
CP 2011 Poster
 
elm
elmelm
elm
 
Functions limits and continuity
Functions limits and continuityFunctions limits and continuity
Functions limits and continuity
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learning
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman network
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics
 
Lecture9 multi kernel_svm
Lecture9 multi kernel_svmLecture9 multi kernel_svm
Lecture9 multi kernel_svm
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes Family
 
Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1
 
Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]
 

Viewers also liked

Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Predicting Thyroid Disorder with Deep Neural Networks
Predicting Thyroid Disorder with Deep Neural NetworksPredicting Thyroid Disorder with Deep Neural Networks
Predicting Thyroid Disorder with Deep Neural NetworksAnaelia Ovalle
 
P03 neural networks cvpr2012 deep learning methods for vision
P03 neural networks cvpr2012 deep learning methods for visionP03 neural networks cvpr2012 deep learning methods for vision
P03 neural networks cvpr2012 deep learning methods for visionzukun
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep LearningAsim Jalis
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...Universitat Politècnica de Catalunya
 
用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》
用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》
用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》鍾誠 陳鍾誠
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningAsim Jalis
 
Using deep neural networks for fashion applications
Using deep neural networks for fashion applicationsUsing deep neural networks for fashion applications
Using deep neural networks for fashion applicationsAhmad Qamar
 
Transfer Learning and Fine-tuning Deep Neural Networks
 Transfer Learning and Fine-tuning Deep Neural Networks Transfer Learning and Fine-tuning Deep Neural Networks
Transfer Learning and Fine-tuning Deep Neural NetworksPyData
 
Image classification with Deep Neural Networks
Image classification with Deep Neural NetworksImage classification with Deep Neural Networks
Image classification with Deep Neural NetworksYogendra Tamang
 
Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...
Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...
Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...Data Science Milan
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsNhatHai Phan
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 

Viewers also liked (17)

Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Predicting Thyroid Disorder with Deep Neural Networks
Predicting Thyroid Disorder with Deep Neural NetworksPredicting Thyroid Disorder with Deep Neural Networks
Predicting Thyroid Disorder with Deep Neural Networks
 
P03 neural networks cvpr2012 deep learning methods for vision
P03 neural networks cvpr2012 deep learning methods for visionP03 neural networks cvpr2012 deep learning methods for vision
P03 neural networks cvpr2012 deep learning methods for vision
 
DNN and RBM
DNN and RBMDNN and RBM
DNN and RBM
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep Learning
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
 
用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》
用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》
用30分鐘深入瞭解《AlphaGo圍棋程式的設計原理》
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep Learning
 
Using deep neural networks for fashion applications
Using deep neural networks for fashion applicationsUsing deep neural networks for fashion applications
Using deep neural networks for fashion applications
 
Transfer Learning and Fine-tuning Deep Neural Networks
 Transfer Learning and Fine-tuning Deep Neural Networks Transfer Learning and Fine-tuning Deep Neural Networks
Transfer Learning and Fine-tuning Deep Neural Networks
 
Image classification with Deep Neural Networks
Image classification with Deep Neural NetworksImage classification with Deep Neural Networks
Image classification with Deep Neural Networks
 
AINL 2016: Filchenkov
AINL 2016: FilchenkovAINL 2016: Filchenkov
AINL 2016: Filchenkov
 
Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...
Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...
Neural Networks and Deep Learning (Part 1 of 2): An introduction - Valentino ...
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 

Similar to Output Units and Cost Function in FNN

Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...MLconf
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Pooyan Jamshidi
 
Kakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction ProblemKakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction ProblemVarad Meru
 
20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilers20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilersComputer Science Club
 
ECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptx
ECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptxECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptx
ECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptxMdJubayerFaisalEmon
 
designanalysisalgorithm_unit-v-part2.pptx
designanalysisalgorithm_unit-v-part2.pptxdesignanalysisalgorithm_unit-v-part2.pptx
designanalysisalgorithm_unit-v-part2.pptxarifimad15
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydSri Ambati
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsDevansh16
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowOswald Campesato
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningKrzysztof Kowalczyk
 

Similar to Output Units and Cost Function in FNN (20)

Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
 
Kakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction ProblemKakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction Problem
 
ML unit-1.pptx
ML unit-1.pptxML unit-1.pptx
ML unit-1.pptx
 
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
 
20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilers20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilers
 
ECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptx
ECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptxECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptx
ECE 2103_L6 Boolean Algebra Canonical Forms [Autosaved].pptx
 
designanalysisalgorithm_unit-v-part2.pptx
designanalysisalgorithm_unit-v-part2.pptxdesignanalysisalgorithm_unit-v-part2.pptx
designanalysisalgorithm_unit-v-part2.pptx
 
Neural Networks - How do they work?
Neural Networks - How do they work?Neural Networks - How do they work?
Neural Networks - How do they work?
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Shortest Path Problem
Shortest Path ProblemShortest Path Problem
Shortest Path Problem
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
C++ and Deep Learning
C++ and Deep LearningC++ and Deep Learning
C++ and Deep Learning
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine Learning
 
Scala and Deep Learning
Scala and Deep LearningScala and Deep Learning
Scala and Deep Learning
 

Recently uploaded

9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 

Recently uploaded (20)

9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

Output Units and Cost Function in FNN

  • 1. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Deep Neural Network Cost Functions and Output Units Jiaming Lin jmlin@arbor.ee.ntu.edu.tw DATALab@III NetDBLab@NTU January 9, 2017 1 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 2. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Outline 1 Introduction 2 Output Units and Cost Functions Binary Multinoulli 3 Deterministic and Generic Model 4 Concludsions and Discussions 2 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 3. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Introduction In the neural network learning... The selection of output unit depends on the learning problems. – Classification: sigmoid, softmax or linear. – Linear Regression: linear. 3 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 4. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Introduction In the neural network learning... The selection of output unit depends on the learning problems. – Classification: sigmoid, softmax or linear. – Linear Regression: linear. Determine and analyse the cost function. – Is the cost function †analytic? – Can the learning progress well(first order derivative)? 3 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 5. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Introduction In the neural network learning... The selection of output unit depends on the learning problems. – Classification: sigmoid, softmax or linear. – Linear Regression: linear. Determine and analyse the cost function. – Is the cost function †analytic? – Can the learning progress well(first order derivative)? Deterministic and Generic Model. – Data is more complicated in many cases. Note: †For simplicity, we mean analytic to say a function is infinitely differentiable on the domain. 3 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 6. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Binary Multinoulli Outline 1 Introduction 2 Output Units and Cost Functions Binary Multinoulli 3 Deterministic and Generic Model 4 Concludsions and Discussions 4 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 7. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Binary Multinoulli Outline 1 Introduction 2 Output Units and Cost Functions Binary Multinoulli 3 Deterministic and Generic Model 4 Concludsions and Discussions 5 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 8. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Binary Multinoulli Binary index x1 · · · xn target 1 0 · · · 1 Class A 2 1 · · · 0 Class B 3 1 · · · 1 Class A · · · · · · · · · · · · · · · m 0 · · · 0 Class B 6 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 9. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Binary Multinoulli Binary where S is the sigmoid function, z is the input of output layer z = w h + b (1) with w is weight, h is output of hidden layer and b is bias. 6 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 10. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Binary Multinoulli Cost Function Cost function can be derived from many methods, we discuss two of the most common Mean Square Error Cross Entropy 7 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 11. Introduction Output Units and Cost Functions Deterministic and Generic Model Concludsions and Discussions Binary Multinoulli Cost Function Cost function can be derived from many methods, we discuss two of the most common Mean Square Error Let y(i) denotes the data label, and ˆy(i) = S(z(i) ) as the prediction. We may define the cost function Cmse by Cmse = 1 m m i=1 (ˆy(i) − y(i) )2 (2) where m is the data size, and z(i) , ˆy(i) and y(i) are real numbers. 7 / 28 Jiaming Lin jmlin@arbor.ee.ntu.edu.tw Deep Neural Network
  • 12. Cost Function: Cross Entropy — Adapting the symbols above, the cost function defined by cross entropy is
  Cce = (1/m) Σ_{i=1}^{m} [ y(i) ln ŷ(i) + (1 − y(i)) ln(1 − ŷ(i)) ]
  where m is the data size, and z(i), ŷ(i) and y(i) are real numbers. (Sign convention: as written, Cce is an average log-likelihood, so it is at most 0 and training maximizes it; the cost to minimize is −Cce.)
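A minimal NumPy sketch of the two costs, assuming 0/1 labels in y and pre-activations z (all names here are illustrative, not from the slides); the clipping in c_ce only papers over the ln(0) issue discussed below:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def c_mse(z, y):
        # Eq. (2): mean squared gap between prediction S(z) and label y
        y_hat = sigmoid(z)
        return np.mean((y_hat - y) ** 2)

    def c_ce(z, y, eps=1e-12):
        # Cross entropy in the slide's sign convention (a log-likelihood, <= 0);
        # clip to keep ln() defined when S(z) saturates to 0 or 1
        y_hat = np.clip(sigmoid(z), eps, 1.0 - eps)
        return np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

    z = np.array([2.0, -1.5, 0.3])   # z = w^T h + b for three examples
    y = np.array([1.0, 0.0, 1.0])
    print(c_mse(z, y), c_ce(z, y))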
  • 13. Comparison between MSE and Cross Entropy — Problem: which one is better?
  – Analyticity (infinitely differentiable)
  – Learning ability (first-order derivatives)
  • 14. Comparison between MSE and Cross Entropy — Analyticity:
  Cmse = (1/m) Σ_{i=1}^{m} (ŷ(i) − y(i))²
  Cce = (1/m) Σ_{i=1}^{m} [ y(i) ln ŷ(i) + (1 − y(i)) ln(1 − ŷ(i)) ]
  Computationally, the value of ŷ(i) = S(z(i)) can overflow to 1 or underflow to 0 when z(i) is very positive or very negative. Therefore, for a fixed y(i) ∈ {0, 1}, Cce is undefined when ŷ(i) is 0 or 1. Cmse is polynomial in ŷ(i) and thus analytic everywhere.
  • 15. Comparison between MSE and Cross Entropy — Learning ability: compare the gradients
  ∂Cmse/∂w = [S(z) − y][1 − S(z)] S(z) h,   (3)
  ∂Cce/∂w = [y − S(z)] h   (4)
  respectively, where S is the sigmoid and z = wᵀh + b.
  • 17. Comparison between MSE and Cross Entropy —
  MSE: [S(z) − y][1 − S(z)] S(z) h
  – If y = 1 and ŷ → 1, steps → 0
  – If y = 1 and ŷ → 0, steps → 0
  – If y = 0 and ŷ → 1, steps → 0
  – If y = 0 and ŷ → 0, steps → 0
  Cross Entropy: [y − S(z)] h
  – If y = 1 and ŷ → 1, steps → 0
  – If y = 1 and ŷ → 0, steps → 1
  – If y = 0 and ŷ → 1, steps → −1
  – If y = 0 and ŷ → 0, steps → 0
  In the case of Mean Square Error, progress gets stuck when z is very positive or very negative: the step size vanishes even when the answer is wrong.
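The vanishing-step behaviour in the table is easy to reproduce numerically; a small sketch (taking h = 1 for simplicity, names ours):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step_mse(z, y):
        # [S(z) - y][1 - S(z)]S(z), Eq. (3) with h = 1
        s = sigmoid(z)
        return (s - y) * (1.0 - s) * s

    def step_ce(z, y):
        # [y - S(z)], Eq. (4) with h = 1
        return y - sigmoid(z)

    z, y = -10.0, 1.0            # wrong answer on a saturated unit
    print(step_mse(z, y))        # ~ -4.5e-05: MSE learning stalls
    print(step_ce(z, y))         # ~  1.0:     cross entropy still learns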
  • 19. The Unstable Issue in Cross Entropy — We have mentioned the unstable issue of cross entropy. Precisely,
  ŷ = S(z) underflows to 0 when z is very negative,
  ŷ = S(z) overflows to 1 when z is very positive.
  Therefore, given a fixed y ∈ {0, 1}, the function C = y ln ŷ + (1 − y) ln(1 − ŷ) can be undefined when z is very positive or very negative.
  • 21. The Unstable Issue in Cross Entropy — Alternatively, regarding z as the variable of the cross entropy,
  C = y ln S(z) + (1 − y) ln(1 − S(z))   (5)
    = −ζ(−z) + z(y − 1),   (6)
  where ζ(t) = ln(1 + eᵗ) is the softplus and z is a real number. Since ζ is analytic, C in the form (6) is a sum and composition of analytic functions, hence analytic and numerically stable in z.
  • 22. The Unstable Issue in Cross Entropy — With C = −ζ(−z) + z(y − 1) as in (6):
  In the cases of the right answer,
  – y = 1 and ŷ = S(z) → 1 ⇒ z → ∞ and C → 0,
  – y = 0 and ŷ = S(z) → 0 ⇒ z → −∞ and C → 0.
  In the cases of the wrong answer,
  – y = 1 and ŷ = S(z) → 0 ⇒ z → −∞ and dC/dz → 1,
  – y = 0 and ŷ = S(z) → 1 ⇒ z → ∞ and dC/dz → −1,
  so the gradient stays large and learning can still progress.
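A sketch of the stable form (6) in NumPy; computing ζ via logaddexp never evaluates ln(0), which is essentially what TensorFlow's sigmoid_cross_entropy_with_logits does, up to sign:

    import numpy as np

    def softplus(t):
        # zeta(t) = ln(1 + e^t), computed stably
        return np.logaddexp(0.0, t)

    def loglik_from_logits(z, y):
        # Eq. (6): C = -zeta(-z) + z(y - 1)
        return -softplus(-z) + z * (y - 1.0)

    z = np.array([50.0, -50.0, -50.0])   # logits extreme enough to saturate S(z)
    y = np.array([1.0, 0.0, 1.0])
    print(loglik_from_logits(z, y))      # ~[0, 0, -50]: finite, no NaN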
  • 23. Outline — 1. Introduction  2. Output Units and Cost Functions (Binary, Multinoulli)  3. Deterministic and Generic Model  4. Conclusions and Discussions
  • 25. Multinoulli: Output Unit and Cost Function — Generalize the binary case to multiple classes.
  – Linear output units with #(output units) = #(classes).
  – Cost function evaluated by cross entropy.
  Cost Function in Multinoulli Problems: suppose the dataset size is m and there are K classes; then the cross-entropy cost is
  C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y(i) = k} ln[ exp(z_k(i)) / Σ_{j=1}^{K} exp(z_j(i)) ]   (7)
  where z_k(i) = w_kᵀ h(i) + b_k and h(i) is the hidden-layer output for example x(i).
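Per example, (7) is just −ln softmax(z)_k for the true class k; a minimal sketch (names ours) using the max-shift trick justified by the lemma on the next slide:

    import numpy as np

    def xent(z, k):
        # -ln softmax(z)_k = -(z_k - ln sum_j exp(z_j)), shifted by max for stability
        z = z - z.max()
        return -(z[k] - np.log(np.exp(z).sum()))

    z = np.array([2.0, -1.0, 0.5])   # K = 3 output-unit values for one example
    print(xent(z, k=0))              # small: the true class already dominates
    print(xent(z, k=1))              # large: a wrong answer is penalized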
  • 27. A Lemma to Simplify the Cost Function —
  – Analyticity (infinitely differentiable)
  – Learning ability (first-order derivatives)
  To establish these properties we first show a lemma.
  Lemma 1: for the outputs z = [z1, . . . , zK] with z_k = w_kᵀ h + b_k,
  max_j{z_j} ≤ ln Σ_{j=1}^{K} exp(z_j) ≤ max_j{z_j} + ln K,   (8)
  so ln Σ_j exp(z_j) ≈ max_j{z_j} when one z_j dominates.
  • 28. A Lemma to Simplify the Cost Function — Proof. Without loss of generality assume z1 = max_j{z_j}. Then
  ln Σ_{j=1}^{K} exp(z_j) = z1 + ln( 1 + Σ_{j=2}^{K} e^{z_j − z1} ),
  and since each e^{z_j − z1} ≤ 1, the correction term lies in [0, ln K]; it tends to 0 as the gaps z1 − z_j grow, so for every ε > 0 the sum is at most z1 + ε once z1 dominates sufficiently. Intuitively, ln Σ_j exp(z_j) is well approximated by max_j{z_j}.
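The approximation is easy to check numerically; the gap between ln Σ exp(z_j) and max_j{z_j} shrinks as one component dominates:

    import numpy as np

    for z in (np.array([5.0, 1.0, -2.0]), np.array([50.0, 1.0, -2.0])):
        lse = np.logaddexp.reduce(z)     # ln sum_j exp(z_j), computed stably
        print(lse, z.max())              # ~5.019 vs 5.0, then 50.0 vs 50.0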
  • 29. Analyticity — We may rewrite the cost function as
  C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y(i) = k} [ z_k(i) − ln Σ_{j=1}^{K} exp(z_j(i)) ].
  Each summand is a difference of analytic functions and thus analytic, and the term 1{y(i) = k} is actually a constant. The total cost is a sum of analytic functions and hence analytic.
  • 30. Learning Ability — Property 2: by the rule of sums in derivatives, we may reduce (7) to the per-example cost
  C(i) = Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ],
  the contribution of example x(i) to the total cost C.
  1. If the model gives the right answer, the error is close to 0.
  2. If the model gives the wrong answer, learning can progress well.
  • 31. Learning Ability — Proof (the right answer). Suppose the true label is class n. By assumption z_n is the maximum, so by Lemma 1, for arbitrarily small ε > 0,
  −ε ≤ Σ_{k=1}^{K} 1{y = k}[ z_k − ln Σ_{j} exp(z_j) ] = z_n − ln Σ_{j} exp(z_j) < z_n − max_j{z_j} = 0.
  This shows −ε ≤ C(i) < 0, i.e. the error is close to 0.
  • 32. Learning Ability — Proof (the wrong answer). Suppose the true label is class n. By assumption the prediction z_n given by the model is not the maximum. Using the fact
  z_n ≠ max_j{z_j} ⇒ softmax(z)_n ≪ 1,
  there exists a δ > 0 that is not small, such that | softmax(z)_n − 1 | > δ.
  • 33. Learning Ability — Proof (the wrong answer, cont.) Then
  ∂C(i)/∂z_n = ∂/∂z_n [ z_n − ln Σ_{j=1}^{K} e^{z_j} ] = 1 − softmax(z)_n > δ.
  The gradient is sufficiently large and also predictable (bounded by 1), so learning can progress well.
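A quick numerical check of this gradient (names ours): 1 − softmax(z)_n is near 0 for a right answer and near 1 for a badly wrong one:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z = np.array([3.0, -1.0, 0.2])
    print(1.0 - softmax(z)[0])   # true class 0 (right answer):  ~0.07
    print(1.0 - softmax(z)[1])   # true class 1 (wrong answer):  ~0.98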
  • 34. Outline — 1. Introduction  2. Output Units and Cost Functions (Binary, Multinoulli)  3. Deterministic and Generic Model  4. Conclusions and Discussions
  • 36. Learning Processes Overview —
  Step 1 — Deterministic: a model function (linear, sigmoid). Generic: a probability distribution (Gaussian, Bernoulli).
  Step 2 — Deterministic: design the error evaluation (MSE, cross entropy). Generic: maximum likelihood estimation.
  Step 3 — Deterministic: learn one statistic (mean, median). Generic: learn the full distribution.
  To describe complicated data, it is easier to build a model with the generic method.
  • 37. Generic Modeling for Binary Classification —
  Step 1: use the Bernoulli distribution as the likelihood function,
  p(y | x) = p^y (1 − p)^{1−y} = S(z)^y (1 − S(z))^{1−y}.
  Step 2: minimize the negative log-likelihood,
  −ln p(y | x(i)) = −[ y ln S(z) + (1 − y) ln(1 − S(z)) ].
  Step 3: we can learn the full distribution,
  p(y | x′) = S(z′)^y (1 − S(z′))^{1−y},
  where z′ = wᵀx′ + b and S is the sigmoid.
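A self-contained sketch of the three steps on synthetic data (all names and data are ours); gradient ascent on the log-likelihood uses exactly the [y − S(z)] error from Eq. (4):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([1.5, -2.0]) + 0.3 > 0).astype(float)  # synthetic labels

    w, b = np.zeros(2), 0.0
    for _ in range(500):                     # Step 2: maximize the log-likelihood
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = y - p                            # per-example error, as in Eq. (4)
        w += 0.1 * X.T @ g / len(y)
        b += 0.1 * g.mean()

    x_new = np.array([1.0, -1.0])            # Step 3: full Bernoulli p(y | x')
    p1 = 1.0 / (1.0 + np.exp(-(x_new @ w + b)))
    print({1: p1, 0: 1.0 - p1})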
  • 39. Generic Modeling for Linear Regression: Step 1 — Given a training feature x, use the Gaussian distribution as the likelihood function,
  p(y | x) = (1/√(2πσ²)) exp( −(µ − y)² / (2σ²) ),
  where, denoting the hidden-layer output by h_x, weights w = [w1, w2] and biases b = [b1, b2],
  µ = w1ᵀ h_x + b1,   σ = w2ᵀ h_x + b2.
  Intuitively, µ and σ are two linear output units; both are functions of x.
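A sketch of the two output heads for one example (names and numbers ours); note that in practice σ is often passed through a softplus to keep it positive, which the slide's plain linear unit does not guarantee:

    import numpy as np

    def gaussian_heads(h_x, w1, b1, w2, b2):
        mu = w1 @ h_x + b1        # linear unit for the mean
        sigma = w2 @ h_x + b2     # linear unit for the scale, as on the slide
        return mu, sigma

    h_x = np.array([0.3, -0.7, 0.1])             # hidden-layer output
    w1, b1 = np.array([1.0, 0.5, -0.2]), 0.1
    w2, b2 = np.array([0.2, -0.1, 0.4]), 0.5
    print(gaussian_heads(h_x, w1, b1, w2, b2))   # (mu, sigma) for this x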
  • 41. Generic Modeling for Linear Regression: Step 2 — Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is,
  (µ̂, σ̂) = argmin_{(µ,σ)} − Σ_x ln p(y | x).
  However, for each summand,
  C_x = ln p(y | x) = −(1/2)[ ln(2πσ²) + (µ − y)²/σ² ],
  ∂C_x/∂σ = −1/σ + (µ − y)²/σ³,
  so the gradients and errors become unstable when σ is close to 0.
  • 42. Generic Modeling for Linear Regression: Step 2 — To prevent the gradients and errors from being unstable, we may substitute v = 1/(2σ²); then for each summand,
  C_x = (1/2) ln v − (1/2) ln π − v(µ − y)²,
  ∂C_x/∂µ = −2v(µ − y),
  ∂C_x/∂v = 1/(2v) − (µ − y)².
  Note that this substitution behaves well only when the variance isn't too large (∂C_x/∂v blows up as v → 0).
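A sketch of the substituted log-likelihood and its gradients (our naming); unlike ∂C_x/∂σ, which blows up like 1/σ³ as σ → 0, the v-form stays finite for any v > 0:

    import numpy as np

    def loglik_v(mu, v, y):
        # C_x with v = 1/(2 sigma^2): (1/2)ln v - (1/2)ln pi - v (mu - y)^2
        return 0.5 * np.log(v) - 0.5 * np.log(np.pi) - v * (mu - y) ** 2

    def grads_v(mu, v, y):
        d_mu = -2.0 * v * (mu - y)
        d_v = 0.5 / v - (mu - y) ** 2
        return d_mu, d_v

    mu, v, y = 0.2, 4.0, 1.0       # v = 4 corresponds to sigma ~ 0.35
    print(loglik_v(mu, v, y), grads_v(mu, v, y))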
  • 43. Generic Modeling for Linear Regression: Step 2 — If the variance σ is fixed and chosen by the user, then comparing the negative log-likelihood with MSE shows that minimizing the NLL is equivalent to minimizing the MSE:
  C_mse = (1/m) Σ_{i=1}^{m} ( ŷ(i) − y(i) )²,
  C_nll = − Σ_{i=1}^{m} C_x(i) = (1/2)[ m ln(2πσ²) + Σ_{i=1}^{m} ( µ_x(i) − y(i) )²/σ² ],
  which differ only by constants and a positive scale.
  • 45. Generic Modeling for Linear Regression: Step 3 —
  – Full distribution from the generic model: µ and σ in this case.
  – A single statistic from the deterministic model: µ in this case.
  Experiment (ref): generate random data based on the formula
  y = x + 7.0 sin(0.75x) + ε,
  where ε is Gaussian noise with µ = 0, σ = 1.
  • 46. Generic Modeling for Linear Regression: Step 3 — FNN config: #(hidden layers) = 1, width = 20, tanh hidden units. (Figure: generic fit vs. deterministic fit.)
  • 47. More Complicated Cases — Complicated data distributions: in some cases it is almost impossible to describe the data with deterministic methods, while generic methods might perform better.
  • 48. Mixture Density Network — Generate random data based on the formula
  x = y + 7.0 sin(0.75y) + ε,
  where ε is Gaussian noise with µ = 0, σ = 1 (the previous dataset with x and y swapped, so a single x maps to several plausible y values).
  • 50. Mixture Density Network — First, try using MSE to define the cost function, with one hidden layer of width 20 and tanh hidden units. The fit fails on this data: minimizing MSE is equivalent to minimizing the negative log-likelihood of a single Gaussian, which cannot capture a multi-valued target.
  • 51. Mixture Density Network — The mixture density network: a Gaussian mixture with n components is defined by the conditional probability distribution
  p(y | x) = Σ_{i=1}^{n} p(c = i | x) N( y; µ(i)(x), Σ(i)(x) ).   (9)
  Network configuration:
  1. The number of components n needs to be fine-tuned (trial and error).
  2. 3 × n output units (mixing weights, means and scales).
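A minimal NumPy sketch of the negative log-likelihood of (9) for one example (layout and names are ours; we also squash the scale outputs through a softplus, a common choice the slide leaves open):

    import numpy as np

    def mdn_nll(out, y, n=3):
        # out holds the 3n linear output units: pi logits, means, raw scales
        logit_pi, mu, s = out[:n], out[n:2 * n], out[2 * n:]
        log_pi = logit_pi - np.logaddexp.reduce(logit_pi)  # ln softmax: mixing weights
        sigma = np.logaddexp(0.0, s)                       # softplus keeps sigma > 0
        log_norm = -0.5 * np.log(2 * np.pi * sigma ** 2) \
                   - (y - mu) ** 2 / (2 * sigma ** 2)
        return -np.logaddexp.reduce(log_pi + log_norm)     # -ln sum_i pi_i N(y; mu_i)

    out = np.array([0.1, -0.3, 0.2,    # component logits
                    -1.0, 0.5, 2.0,    # component means
                    0.0, 0.3, -0.2])   # pre-softplus scales
    print(mdn_nll(out, y=0.4))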
  • 52. Mixture Density Network — Experiment (ref): #(components) = 24, two hidden layers of width 24 with tanh activation, #(output units) = 3 × 24, all linear.
  • 53. Outline — 1. Introduction  2. Output Units and Cost Functions (Binary, Multinoulli)  3. Deterministic and Generic Model  4. Conclusions and Discussions
  • 57. Conclusions and Discussions —
  In classification problems, cross entropy is a more natural error measure than the alternatives.
  A reformulated cross entropy avoids numerical instability.
  – See the MNIST example from TensorFlow.
  To judge whether a cost function is good:
  – Is the cost function analytic?
  – Can the learning progress well?
  Deterministic vs. generic:
  – Deterministic learns a single statistic, while generic learns the full distribution.
  – When the data distribution is not normal (high kurtosis or fat tails), generic might be better.
  – Generic methods are easier to apply to complicated cases.
  • 58. Thank you.