1
The World of Loss Function
- Part I -
2018. 8.
김 홍 배
2
3
4
1. Classification with NN
2. Linear Classification : Support Vector Machine
* Non-linear SVM ?
3. Logistic Regression :
Binary Classifier, Cross entropy,
Information Theory
4. What is a “Maximum A Posteriori Estimator” ?
5. Kullback-Leibler(KL) divergence
6. Softmax Regression : Multi-class Classifier
7. Focal Loss
8. Discriminative Feature Learning
9. Learning by Association
Part 1
5
Classification with NN
2D Example
6
Let’s imagine a simple case
To classify the given classes,
we only need to define a straight line
Human
Dog
Classification with NN
7
$f = \sum_{i=0}^{2} w_i X_i$
+1
-1
Human
Dog
At Decision boundary
f(x)= 0
slope
offset
Single cell is enough for a simple case !
Classification with NN
8
X1 : No. of straight lines in image
X2 : Black pixels ratio (%)
100
50
Decision boundary
w1/w2 ≈ 12, w0/w2 ≈ −60
e.g., get a computer to classify an input image as Chinese or Japanese
Classification with NN
9
Human
Monkey
Let’s imagine a more complex case
Can’t classify with a straight line → needs a more complex boundary
Classification with NN
10
We need many cells and layers generally
We can create extra features that allow more
complex decision boundaries
Classification with NN
• Select a network architecture
• Randomly initialize weights
• Observe features “x” with reference “y”
• Push “x” through NN → output is “ŷ”
• Calculate error : (y − ŷ)², the least-squares error for example
• While error is too large
– Calculate errors and backpropagate error signals
– Adjust weights
• Evaluate performance using the test set
Network Training by Backpropagation
y
ŷ
11
• How should we update the weights to improve ?
→ To minimize the error or loss function, J = (y − ŷ)²,
Gradient descent algorithm is generally used
Network Training by Backpropagation
[Figure: loss curve J(w) vs. weight w, with gradient descent steps w1, w2, … from the start point down to the final point wf]
Weight update (sensitivity w.r.t. the cost function):
$w_{n+1} = w_n - \eta \, \frac{\partial J(w_n)}{\partial w_n}$
12
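A minimal Python/numpy sketch of this update rule, assuming a single linear neuron trained with the squared-error loss; the toy data and learning rate are illustrative, not from the slides:

```python
import numpy as np

# Toy data: x with a bias column, target y (illustrative values)
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])   # first column = bias input
y = np.array([0.0, 1.0, 2.0])

w = np.zeros(2)        # initialized weights
eta = 0.1              # learning rate

for step in range(200):
    y_hat = X @ w                                # forward pass: y_hat = sum_i w_i * x_i
    J = np.mean((y - y_hat) ** 2)                # squared-error loss
    grad = -2.0 / len(y) * X.T @ (y - y_hat)     # dJ/dw
    w = w - eta * grad                           # w_{n+1} = w_n - eta * dJ/dw_n

print(w, J)
```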
13
LINEAR CLASSIFIER :
SUPPORT VECTOR MACHINES
14
The main idea of the SVM may be summed up as follows:
• “Given a set of training samples, the SVM constructs a hyperplane
as a decision surface in such a way that the margin of separation
between positive and negative examples is maximized.”
Introduction
15
Linearly Separable Patterns
SVM is a binary learning machine.
• Binary classification is the task of separating classes
in feature space.
16
Which of the linear separators is optimal?
Linearly Separable Patterns
17
• The optimal decision boundary is the one that maximizes
the margin ρ
Optimal Decision Boundary
18
[Figure: hyperplane with normal vector $\vec{w}$, point $P_0$ at $\vec{x}_0$, arbitrary point P at $\vec{x}$, and the in-plane vector $\vec{x} - \vec{x}_0$]
Define vectors : $\vec{x}_0 = \vec{OP_0}$ and $\vec{x} = \vec{OP}$,
where P is an arbitrary point on the hyperplane.
A condition for P to be on the plane is that the vector $\vec{x} - \vec{x}_0$ is
perpendicular to $\vec{w}$ :
$\vec{w} \cdot (\vec{x} - \vec{x}_0) = 0$, or $\vec{w} \cdot \vec{x} + b = 0$
Equation of a Hyperplane
19
Understanding the basics
g(x) is a linear function :
• A hyperplane in the feature space
• Normal vector of the hyperplane : $\hat{n} = \frac{\vec{w}}{\|\vec{w}\|}$
20
𝑤
𝑛
Margin
Safe zone
• The linear discriminant function
(classifier) with the
maximum margin is the best
• Margin is defined as the width
that the boundary could be
increased by before hitting a
data point
• Why is it the best ?
→ Robust to outliers and thus
strong generalization ability
: denotes +1
: denotes -1
Understanding the basics
21
𝑤
𝑛
: denotes +1
: denotes -1
Understanding the basics
• Given a set of data points
( 𝑥𝑖, 𝑦𝑖) , i=1,2, ···, n
𝑓𝑜𝑟 𝑦𝑖 = +1, 𝑊𝑥𝑖 + 𝑏 > 0
𝑓𝑜𝑟 𝑦𝑖 = −1, 𝑊𝑥𝑖 + 𝑏 < 0
• With a scale transformation on
both 𝑊 𝑎𝑛𝑑 𝑏,
for $y_i = +1$, $W x_i + b \ge +1$
for $y_i = -1$, $W x_i + b \le -1$
22
𝑤
𝑛
Margin
: denotes +1
: denotes -1
Understanding the basics
Support vectors : $W x_+ + b = +1$ and $W x_- + b = -1$
• At the extreme points $x_+$, $x_-$
• The margin width is :
$M = (x_+ - x_-) \cdot \hat{n} = \frac{2}{\|w\|}$
Maximize $\frac{2}{\|w\|}$, or minimize $\frac{1}{2}\|w\|^2$,
such that
for $y_i = +1$, $W x_i + b \ge +1$; for $y_i = -1$, $W x_i + b \le -1$,
i.e. $y_i(W x_i + b) \ge 1$
23
The Optimization Problem
Introduce Lagrange multipliers αi ≥ 0.
• That is, the Lagrange function
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$
is to be minimized with respect to w and b (and maximized with respect to the αi).
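As a rough illustration of the max-margin idea (not the Lagrangian dual the slide refers to), here is a minimal numpy sketch that minimizes the equivalent soft-margin hinge-loss objective ½‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)) by sub-gradient descent; the data and hyperparameters are made up for illustration:

```python
import numpy as np

# Two linearly separable clusters (illustrative data), labels y in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

w, b, C, eta = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                       # points inside the margin
    # sub-gradient of 1/2||w||^2 + C * sum(max(0, 1 - y(wx+b)))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w, b = w - eta * grad_w, b - eta * grad_b

print("w =", w, "b =", b, "min margin =", (y * (X @ w + b)).min())
```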
• The simplest way to separate two groups of data is with a straight line (1
dimension), flat plane (2 dimensions) or an N-dimensional hyperplane.
• However, there are situations where a nonlinear region can separate the
groups more efficiently.
• The kernel function transforms the data into a higher-dimensional feature
space to make it possible to perform the linear separation.
Non-Linear SVM(Support Vector Machines)
kernel trick
To map from input space to feature space to simplify the classification task,
a non-linear SVM classifier using the RBF (radial basis function) kernel is
adopted
Non-Linear SVM(Support Vector Machines)
Inner product in feature space (a measure of similarity)
Key Idea of Kernel Methods
K(𝑥𝑖, 𝑥𝑗)
K(𝑥𝑖, 𝑥𝑗) = Φ(𝑥𝑖)· Φ(𝑥𝑗)
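A small numpy sketch of this idea with the Gaussian RBF kernel K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²), which computes the feature-space inner product Φ(xᵢ)·Φ(xⱼ) without ever forming Φ explicitly; σ and the sample points below are illustrative:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel: inner product in the implicit feature space."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

xi = np.array([1.0, 2.0])
xj = np.array([1.5, 1.0])
print(rbf_kernel(xi, xj))      # close to 1 for similar points, -> 0 for distant ones

# Full kernel (Gram) matrix for a small data set
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(K)
```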
Normal Condition :
$\exp\left\{ -\frac{(x_1 - c_1)^2 + (x_2 - c_2)^2}{2\sigma^2} \right\} \ge \text{Threshold}, \quad 0 < \text{Threshold} \ll 1$
Cluster bound :
$(x_1 - c_1)^2 + (x_2 - c_2)^2 \le r^2$, i.e. $K_1 + K_2 \le r^2$
[Figure: circle of radius r centered at $(c_1, c_2)$ in the $(x_1, x_2)$ plane, with $K_1$, $K_2$ the squared distances along each axis]
Key Idea of Kernel Methods
RBFN architecture
[Figure: input layer x1 … xn feeding (with no weights) into a hidden layer of M RBFs, whose outputs are combined with weights W1 … WM by an output-layer sum Σ to give f(x)]
Each of the n components of the input vector x feeds
forward to the M basis functions, whose outputs
are linearly combined with the weights w (i.e. the dot product
w ∙ φ(x)) into the network output f(x).
The output layer performs a simple weighted sum of the basis-function outputs.
If the RBFN is used for regression then this output is fine.
However, if pattern classification is required, then a hard-
limiter or sigmoid function could be placed on the output
neurons to give 0/1 output values
Input data set ∶ 𝑋 = { 𝑥1 𝑥2 … 𝑥 𝑁}
[Figure: RBF detector network Σ scoring queries, e.g. 0.2 for one query and 0.9 for another]
Radial Basis Function Detector
Architecture for Anomaly detection
Normal data
Unusual or anomalous data
- Anomaly or unusual event detection
[Figure: queries resembling the normal data score high (0.9); unusual queries score low (0.1)]
Radial Basis Function Detector
Architecture for Anomaly detection
[Figure: two output units Σ, one per class, giving scores for Category 1 and Category 2]
- Classification Problem
Radial Basis Function Detector
Architecture for Anomaly detection
• For Gaussian basis functions
$s(\mathbf{x}_p) = \sum_{i=0}^{M} w_i \, \phi_i(\mathbf{x}_p) = w_0 + \sum_{i=1}^{M} w_i \exp\left( -\frac{1}{2} \sum_{j=1}^{n} \frac{(x_{pj} - c_{ij})^2}{\sigma_{ij}^2} \right)$
• Assume the variance σ across each dimension is equal :
$s(\mathbf{x}_p) = w_0 + \sum_{i=1}^{M} w_i \exp\left( -\frac{1}{2\sigma_i^2} \sum_{j=1}^{n} (x_{pj} - c_{ij})^2 \right)$
Architecture for Anomaly detection
• Design decisions
• number of hidden neurons
• max no. of neurons = number of input patterns
• more neurons – more complex, smaller tolerance
• Parameters to be learnt
• centers
• radii
• A hidden neuron is more sensitive to data points near its center.
This sensitivity may be tuned by adjusting the radius.
• smaller radius → fits training data better (overfitting)
• larger radius → less sensitivity, less overfitting, network of
smaller size, faster execution
• weights between hidden and output layers
Architecture for Anomaly detection
The question now is:
How to train the RBF network?
In other words, how to find:
• The number and the parameters of the hidden units (the basis functions),
using unlabeled data (unsupervised learning)
→ K-means clustering algorithm
• The weights between the hidden layer and the output layer
→ Recursive least-squares estimation algorithm
RBFN Learning
[Figure: RBFN learning pipeline — input xp → K-means finds the centers ci → K-nearest neighbor sets the widths σi → basis functions → linear regression finds the weights w]
RBFN Learning
• Use the K-means algorithm to find the centers ci
RBFN Learning
K-means Algorithm
step 1: K initial clusters are chosen randomly from the samples
to form K groups.
step 2: Each new sample is added to the group whose mean is
the closest to this sample.
step 3: Adjust the mean of the group to take account of the new
points.
step 4: Repeat step 2 until the distance between the old means
and the new means of all clusters is smaller than a
predefined tolerance.
Outcome: There are K clusters with means representing
the centroid of each cluster.
Advantages: (1) A fast and simple algorithm.
(2) Reduces the effects of noisy samples.
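A compact numpy sketch of these steps in batch form (assign every sample to its nearest mean, recompute the means, repeat until they stop moving); K, the synthetic data and the tolerance are illustrative:

```python
import numpy as np

def kmeans(X, K, tol=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]   # step 1: random initial clusters
    for _ in range(max_iter):
        # step 2: assign each sample to the group with the closest mean
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: adjust the mean of each group
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 4: stop when the means barely move
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
centers, labels = kmeans(X, K=2)
print(centers)
```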
• Use the K-nearest-neighbor rule to find the function width σ :
$\sigma_i^2 = \frac{1}{K} \sum_{k=1}^{K} \| c_k - c_i \|^2$, where $c_k$ is the k-th nearest neighbor of $c_i$
• The objective is to cover the training points so that a
smooth fit of the training samples can be achieved
• RBF learning by gradient descent
• Let
$e_p(x) = d_p(x) - s_p(x)$ and $\phi_i(\mathbf{x}_p) = \exp\left( -\frac{1}{2} \sum_{j=1}^{n} \frac{(x_{pj} - c_{ij})^2}{\sigma_{ij}^2} \right)$
With the batch error
$E = \frac{1}{2} \sum_{p=1}^{N} e_p(x)^2$, where N : no. of samples in the batch,
we have the gradients $\frac{\partial E}{\partial w_i}$, $\frac{\partial E}{\partial c_{ij}}$ and $\frac{\partial E}{\partial \sigma_{ij}}$.
Applying gradient descent, we have the following update equations.
• RBF learning by gradient descent
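To make the whole pipeline concrete, here is a minimal numpy sketch of an RBF network: Gaussian basis functions with fixed centers and width (e.g. obtained from K-means), and the output weights fitted by least squares rather than the gradient updates above; the 1-D regression data and all sizes are illustrative:

```python
import numpy as np

def design_matrix(X, centers, sigma):
    """Phi[p, i] = exp(-||x_p - c_i||^2 / (2 sigma^2)), plus a bias column for w0."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), phi])     # bias term w0

# Illustrative 1-D regression problem
X = np.linspace(0, 1, 50)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(50)

centers = np.linspace(0, 1, 8)[:, None]               # e.g. from K-means
sigma = 0.1
Phi = design_matrix(X, centers, sigma)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # output weights by least squares

y_hat = Phi @ w
print("training MSE:", np.mean((y - y_hat) ** 2))
```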
Logistic regression
We need to classify the output (y) as either 0 or 1 → binary classification.
A linear equation may not be a good fit for a classification problem.
Logistic regression uses the sigmoid function to model the hypothesis.
[Figure: sigmoid function of z (output bounded between 0 and 1) vs. a linear function]
Want $0 \le h_\theta(x) \le 1$ for an arbitrary value of x
Logistic Regression Model : $h_\theta(x) = \sigma(\theta^T x)$
Binary Classification
42
Linear regression : $J(w) = \sum_k (y_k - \hat{y}_k)^2$
$\sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
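A short numpy check of the sigmoid and of the derivative identity σ'(z) = σ(z)(1 − σ(z)), compared against a finite-difference estimate; the test points are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
s = sigmoid(z)

analytic = s * (1.0 - s)                                 # sigma'(z) = sigma(z)(1 - sigma(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(s)                                # values squashed into (0, 1)
print(np.allclose(analytic, numeric))   # True
```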
Logistic regression
Binary Classification
Satisfy condition
Sum of probability of y, given x, parameterized by ϴ
P 𝑦 = 0 𝑥; ϴ + P 𝑦 = 1 𝑥; ϴ = 1
h_θ(x) : estimate of the probability that y = 1 for a given x
with model parameter θ
h_θ(x) = P(y = 1 | x; θ)
Loss function of Logistic Regression
$J(\theta) = -\log P(y = 1 \mid x; \theta) = -\log h_\theta(x)$
Take negative logarithm
44
Logistic regression
Binary Classification
Maximum (log) likelihood estimator (MLE)
$\theta^* = \arg\max_\theta \log P(y = 1 \mid x; \theta) = \arg\min_\theta J(\theta)$
44
[Figure: $-\log P(y = 1 \mid x; \theta)$ plotted against $P(y = 1 \mid x; \theta)$]
To make J(θ) → 0 as P(y = 1 | x; θ) → 1,
and J(θ) → ∞ as P(y = 1 | x; θ) → 0
Why taking negative logarithm ?
Likelihood function
Maximize Likelihood = Minimize Loss
Likelihood : estimate unknown parameters
based on known outcomes : L ϴ 𝑦 =P 𝑦 ϴ
• Maximum Likelihood Approach for Binary classification
Data set $\{x^{(i)}, y^{(i)}\}$, where $y^{(i)} \in \{0, 1\}$ and i = 1, …, m
Since $y^{(i)}$ is binary we can use the Bernoulli distribution
• Likelihood function associated with m observations
Generalization of Binary classification
$L(\theta) = P(y \mid x; \theta) = \prod_{i=1}^{m} P(y_i \mid x_i; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$
Logistic regression
45
For a single observation case,
P 𝑦 = 1 𝑥; ϴ =ℎ 𝜃 𝑥
P 𝑦 = 0 𝑥; ϴ =1 − ℎ 𝜃 𝑥
P 𝑦 = 0 𝑥; ϴ + P 𝑦 = 1 𝑥; ϴ = 1
Same as previous page
46
• By taking the negative logarithm we get the
Cross-entropy Error Function
$J(\theta) = -\frac{1}{m} \log P(y \mid x; \theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Generalization of Binary classification
Logistic regression
Likelihood
Maximum likelihood estimator (MLE)
𝜃∗ = argmax 𝜃 𝑙𝑜g(P 𝑦 𝑥; ϴ ) = arg𝑚𝑖𝑛 𝜃{ J(ϴ)}
Loss function is defined as
For the $y^{(i)} = 1$ case : $J(\theta) = -\log h_\theta(x^{(i)})$
For the $y^{(i)} = 0$ case : $J(\theta) = -\log\left(1 - h_\theta(x^{(i)})\right)$
[Figure: J(θ) plotted against $h_\theta(x^{(i)})$ for each case]
As $h_\theta(x^{(i)})$ approaches 1, J(θ) goes to 0 (y = 1 case); as $h_\theta(x^{(i)})$ approaches 0, J(θ) goes to 0 (y = 0 case)
47
Generalization of Binary classification
Logistic regression
Gradient, with $h_\theta(x) = \sigma(\theta^T x)$ :
$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_i \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
J(θ) is a convex (bowl-shaped) function, so gradient descent reaches the global minimum.
Alternative Cost function for Binary classification
Logistic regression
48
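A minimal numpy sketch putting the pieces together: sigmoid hypothesis, cross-entropy loss J(θ), its gradient, and the gradient-descent update; the synthetic 1-D data and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 1-D binary data with a bias column
rng = np.random.default_rng(0)
x = np.hstack([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
X = np.column_stack([np.ones(100), x])
y = np.hstack([np.zeros(50), np.ones(50)])

theta, eta, m = np.zeros(2), 0.1, len(y)
for _ in range(1000):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)       # h_theta(x) = sigma(theta^T x)
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))   # cross-entropy loss
    grad = (X.T @ (h - y)) / m                               # dJ/dtheta_j
    theta -= eta * grad                                      # gradient descent step

print("theta =", theta, "final loss =", J)
```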
Regularized Logistic Regression
Adds numerical damping to prevent overshoot or over-fitting
Fidelity term Regularization term
Logistic regression
49
$+\ \frac{\lambda}{2\sigma_w^2} \sum_i \theta_i^2$   ?
50
Regularized Logistic Regression
Logistic regression
Until now, we’ve been talking about Maximum Likelihood Estimator:
Now assume that prior distribution over parameters exists :
𝜃∗ = argmax 𝜃 𝑙𝑜g(P 𝑦 𝑥; ϴ ) = arg𝑚𝑖𝑛 𝜃{ J(ϴ)}
Then we can apply Bayes' rule :
$P(\theta \mid x, y) = \frac{P(y \mid x, \theta)\, P(\theta)}{P(y \mid x)}$
Posterior distribution over model parameters : $P(\theta \mid x, y)$
Data likelihood for specific parameters : $P(y \mid x, \theta)$
(could be modeled with a deep network!)
Prior distribution over parameters : $P(\theta)$
(describes our prior knowledge and/or our desires for the model)
Bayesian evidence : $P(y \mid x) = \int P(y \mid x, \theta)\, P(\theta)\, d\theta$
A powerful method for model selection!
As a rule this integral is intractable :(
(You can never integrate this)
51
Regularized Logistic Regression
Logistic regression
The core idea of Maximum a Posteriori Estimator:
Maximum a posteriori estimator
$J_{MAP}(\theta) = -\log P(\theta \mid x, y) = -\log P(y \mid x, \theta) - \log P(\theta) + \log P(y)$
$\phantom{J_{MAP}(\theta)} = J_{MLE}(\theta) + \frac{1}{2\sigma_w^2} \sum_i \theta_i^2 + const$
$\theta^*_{MAP} = \arg\max_\theta \left( \log P(y \mid x, \theta) + \log P(\theta) \right) = \arg\min_\theta J_{MAP}(\theta)$
Loss function of Posterior distribution over model parameters
assuming a Gaussian prior for the weights
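A small numpy sketch of this correspondence: adding the Gaussian-prior term (1/2σ_w²)·Σθᵢ² to the cross-entropy loss is exactly L2 (weight-decay) regularization, and it only adds θ/σ_w² to the gradient; this builds on the logistic-regression sketch above, with illustrative data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def j_mle(theta, X, y):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def j_map(theta, X, y, sigma_w=1.0):
    # J_MAP = J_MLE + (1 / 2 sigma_w^2) * sum_i theta_i^2   (Gaussian prior on weights)
    return j_mle(theta, X, y) + np.sum(theta ** 2) / (2.0 * sigma_w ** 2)

def grad_map(theta, X, y, sigma_w=1.0):
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y) + theta / sigma_w ** 2    # extra weight-decay term

# Tiny illustrative data set
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.5, 3.0])
print(j_mle(theta, X, y), j_map(theta, X, y, sigma_w=1.0))
```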
Kullback-Leibler (KL) divergence
• A measure of the difference between two
probability distributions P(x) and Q(x) :
$D(P \| Q) \equiv \int P(x) \log \frac{P(x)}{Q(x)} \, dx$
We can measure the
difference according
to an objective and
numerical value.
[Figure: two overlapping distributions P(x) and Q(x)]
Note: KL divergence is not a metric.
𝐷 𝑃 𝑄 ≠ 𝐷 𝑄 𝑃
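A quick numpy sketch for discrete distributions, also showing the asymmetry D(P‖Q) ≠ D(Q‖P) noted above; the two example distributions are arbitrary:

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with P(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(P, Q))   # D(P || Q)
print(kl_divergence(Q, P))   # D(Q || P) -- a different value: KL is not a metric
```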
Kullback-Leibler (KL) divergence
Kullback-Leibler (KL) divergence
KL divergence = Conditional Entropy H(P|Q)
Minimize KL divergence
• Random events are drawn from the real
distribution
true distribution
data set
Using the observed
data, we want to
estimate the true
distribution using a trial
distribution.
trial distribution
minimize
divergence
The smaller the KL divergence , the better an estimate.
Minimize KL divergence
• KL divergence between the two distributions
Constant: independent
of parameter
To minimize KL divergence, we have only
to maximize the second term with respect
to the parameter .
Likelihood and KL divergence
• The second term is approximated by the sample
mean:
data set
Log likelihood
They are the same:
• Minimizing the KL divergence
• Maximizing the likelihood
Softmax Regression
• Softmax Regression ( or multinomial logistic regression) is a
classification method that generalizes logistic regression to
multiclass problems. (i.e. with more than two possible discrete
outcomes.)
• Used to predict the probabilities of the different possible
outcomes of a categorically distributed dependent variable, given
a set of independent variables (which may be real-valued, binary-
valued, categorical-valued, etc.).
generalized logistic regression
to multiclass problems
58
• Used in classification problems in which the response variable y can take on any one of
k values : $y \in \{1, 2, \ldots, k\}$
• To derive a Generalized Linear Model for multinomial data,
– we begin by expressing the multinomial as an exponential-family distribution,
– then compute the multinomial logistic loss (−log likelihood)
$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix}$
Softmax Regression
59
60
Softmax Regression
Logits (scores) : [2.0, 1.0, −1.0, −3.0]  →  Probabilities : [0.7, 0.2, 0.05, 0.01]
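A short numpy sketch of the softmax mapping from logits to probabilities; the slide's numbers are approximate, and the code reproduces them roughly. Subtracting the maximum logit is a standard numerical-stability trick not shown on the slide:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)        # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0, -3.0])
probs = softmax(logits)
print(probs)           # approx. [0.70, 0.26, 0.035, 0.005] -- sums to 1
print(probs.sum())
```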
• Remember that for logistic regression, we had:
which can be written similarly as:
Softmax Regression
61
Cross Entropy !
Binary Classification
m events or dataset
• The softmax cost function is similar, except that we now sum
over the k different possible values of the class label.
• Gradient
.
: logistic
: softmax
Softmax Regression
62
Cross Entropy !
K Category Classification
m events or dataset
63
Probability vs. Likelihood
• Probability : the chance that y will occur, based on given parameters ϴ : P(y | ϴ)
Known parameters, ϴ : μ = 32 g, σ = 2.5
[Figure: normal distribution with μ = 32 g, σ = 2.5 over the range 24 g – 40 g]
P(y = 34 g | ϴ) = 0.15,  P(y = 32 g | ϴ) = 0.2
In P(y | ϴ), the data y is the variable and ϴ is fixed.
64
• Likelihood : of the parameters ϴ of a statistical model, based on given observed data y : L(ϴ | y)
→ Find the best-fitting model
Probability vs. Likelihood
[Figure: candidate normal distributions over the range 24 g – 40 g, with the observation y = 34 g marked]
Assumed parameters ϴ : μ = 32 g, σ = 2.5 → the probability of y = 34 g is 0.12, so L(ϴ | y = 34 g) = 0.12
Assumed parameters ϴ : μ = 34 g, σ = 2.5 → the probability of y = 34 g is 0.2, so L(ϴ | y = 34 g) = 0.2
True data distribution: μ = 34 g, σ = 2.5
65
Fitting normal distribution: ML
$L(\mu, \sigma^2 \mid x_{1 \ldots I}) = \prod_{i=1}^{I} \mathcal{N}(x_i;\, \mu, \sigma^2)$  (assumed model, given dataset $x_{1 \ldots I}$)
66
Fitting normal distribution: ML
Plotted surface of likelihoods
as a function of possible
parameter values
ML Solution is at peak
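A tiny numpy sketch of this maximum-likelihood fit: for a normal model the log-likelihood is maximized by the sample mean and the (biased) sample standard deviation, which the code checks against a grid search over μ; the observations are made up:

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    # sum_i log N(x_i; mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

x = np.array([31.0, 33.5, 34.2, 35.1, 36.2])        # illustrative observations

# Closed-form ML solution: sample mean and (biased) sample std
mu_ml, sigma_ml = x.mean(), x.std()
print("ML estimates:", mu_ml, sigma_ml)

# The likelihood surface peaks at the ML solution (grid over mu, sigma fixed at sigma_ml)
mus = np.linspace(30, 40, 201)
lls = [log_likelihood(x, m, sigma_ml) for m in mus]
print("argmax over the grid:", mus[int(np.argmax(lls))])   # ~ mu_ml
```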
67
Information Theory
Information
• It is a quantitative measure of information
• The most unexpected events give the maximum information
• Average uncertainty of a random variable
Relation between information and its probability :
• Information is inversely proportional to its probability of occurrence
• Information is a continuous function of its probability
• Total information of two or more independent messages is the sum of the
individual information
$I(x) = \log\left(\frac{1}{P(x)}\right) = -\log P(x)$
I(x) : Information, P(x) : Probability
→ Increase entropy
→ Loss
68
Let the observations be X = {x_1, …, x_m} with probabilities P(X) = {p_1, …, p_m}
• A total of N observations occur
• x_1 occurs N·p_1 times, and so on
• A single occurrence of x_1 conveys information = $-\log p_1$
• N·p_1 occurrences convey information = $-N p_1 \log p_1$
• Total information = $-N \sum_{i=1}^{m} p_i \log p_i$
• Averaged information = $-\sum_{i=1}^{m} p_i \log p_i$ → H(X)
Marginal Entropy
H(X) : Marginal Entropy
Cross Entropy Loss = $-\sum_{i=1}^{m} p_i \log \hat{p}_i$
$p_i$ : given (label), usually 1
$\hat{p}_i$ : estimated, usually less than 1
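A small numpy sketch computing the marginal entropy H(X) of a distribution and the cross-entropy loss between a given (e.g. one-hot) label distribution and an estimated one; the example vectors are illustrative:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i p_i log p_i (zero-probability terms contribute nothing)."""
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, p_hat):
    """Cross-entropy loss = -sum_i p_i log p_hat_i (p: given labels, p_hat: estimates)."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return -np.sum(p * np.log(np.clip(p_hat, 1e-12, 1.0)))

print(entropy([0.25, 0.25, 0.25, 0.25]))      # maximal for a uniform distribution
print(entropy([1.0, 0.0, 0.0, 0.0]))          # 0: no uncertainty

label = [0, 1, 0, 0]                           # p_i: given, a one-hot label
pred  = [0.1, 0.7, 0.1, 0.1]                   # p_hat_i: estimated probabilities
print(cross_entropy(label, pred))              # -log(0.7) ~= 0.357
```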
69
Joint Entropy
Venn diagram for definition of entropies
70
Joint Entropy
Let two observations be X = {x_1, …, x_m} and Y = {y_1, …, y_n}
• Can be a reference and a query for an anomaly detection problem
• Should have a complete probability scheme, i.e. the sum over all
possible combinations of joint observations of X and Y should be 1 :
$\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) = 1$
• Entropy is calculated in the same way as the marginal entropy
• Information delivered when one pair $(x_i, y_j)$ occurs once is $-\log p(x_i, y_j)$
• The number of times this can happen is $N_{ij}$ out of a total of N
• Information for the $N_{ij}$ occurrences of this particular combination is $-N_{ij} \log p(x_i, y_j)$
• Total information over all combinations of i and j is $-\sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} \log p(x_i, y_j)$
71
Joint Entropy
Averaged information : Joint Entropy H(X, Y)
$H(X, Y) = -\frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} \log p(x_i, y_j)$
With $N_{ij} = p(x_i, y_j) \cdot N$ :
$H(X, Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(x_i, y_j)$
Joint Entropy H(X, Y)
72
Conditional Entropy H(X|Y)
73
Conditional Entropy H(X|Y)
• Bayes' theorem : $p(x_i, y_j) = p(x_i) \cdot p(y_j | x_i) = p(y_j) \cdot p(x_i | y_j)$
• For a particular $y_j$ observed, it can only come from one of X = {x_1, …, x_m} :
$\sum_{i=1}^{m} p(x_i | y_j) = 1$
• Similarly
$\sum_{j=1}^{n} p(y_j | x_i) = 1$
74
Conditional Entropy H(X|Y)
• From the joint entropy $H(X, Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(x_i, y_j)$
• For a specific $y_j$ : $H(X | y_j) = -\sum_{i=1}^{m} p(x_i | y_j) \log p(x_i | y_j)$
• The average conditional entropy takes all such entropies over all $y_j$
• No. of times $H(X | y_j)$ occurs = no. of times $y_j$ occurs = $N_{y_j}$
$H(X | Y) = \frac{1}{N} \sum_{j=1}^{n} N_{y_j} H(X | y_j) = \sum_{j=1}^{n} p(y_j)\, H(X | y_j)$
$\phantom{H(X | Y)} = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(y_j)\, p(x_i | y_j) \log p(x_i | y_j) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(x_i | y_j)$
• Similarly $H(Y | X) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(y_j | x_i)$
75
Relation among Entropies
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X|Y) = H(X, Y) − H(Y)
KL Divergence
$H(X | Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(x_i | y_j) = \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log \frac{p(y_j)}{p(x_i, y_j)}$
(by Bayes' theorem)
76
Mutual Information I(X;Y)
$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
• The reduction in uncertainty of one random variable due to knowing about
another → Information Gain
• The amount of information one random variable contains about another
• Measure of independence :
$I(X; Y) = 0$ : the two variables are independent
$I(X; Y) \ge 0$, and it grows according to ...
- the degree of dependence
- the entropy of the variables
77
Mutual Information I(X;Y)
Symmetrical Uncertainty Measure for Remaining Useful Lifetime (RUL) Prediction
• Not all variables measured from the critical component are useful for
predicting the RUL.
• The assumption is that variables that have a non-random relationship
carry information about the system behaviour.
• The goal is to group the variables which have a non-random relationship.
• To do so, a method based on mutual information has been applied on
the dataset for feature selection:
$0 \le SU(X, Y) = 2\, \frac{I(X; Y)}{H(X) + H(Y)} \le 1$
78
Mutual Information I(X;Y)
Derivation of Mutual Information (i)
$I(X; Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$
$= \sum_{x} p(x) \log \frac{1}{p(x)} + \sum_{y} p(y) \log \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log p(x, y)$
$= \sum_{x, y} p(x, y) \log \frac{1}{p(x)} + \sum_{x, y} p(x, y) \log \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log p(x, y)$
$= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
79
Mutual Information I(X;Y)
Derivation of Mutual Information (ii)
• $x_i$ occurs with probability $p(x_i)$ : a priori entropy of $x_i$
• The initial uncertainty of $x_i$ is $-\log p(x_i)$
• The reduction in uncertainty of one random variable $x_i$ due to
knowing about another $y_j$ → Information Gain
• The final uncertainty of $x_i$ is $-\log p(x_i | y_j)$ : a posteriori entropy of $x_i$
• Information gain = net reduction in the uncertainties
• $I(x_i; y_j)$ = initial uncertainty of $x_i$ − final uncertainty of $x_i$
$= -\log p(x_i) + \log p(x_i | y_j) = \log \frac{p(x_i | y_j)}{p(x_i)}$
80
Mutual Information I(X;Y)
Derivation of Mutual Information (ii)
• I(X;Y) : averaging $I(x_i; y_j)$ over all values of i and j
$I(X; Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log \frac{p(x_i | y_j)}{p(x_i)}$
$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)$
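A numpy sketch tying the entropy results of Part 1 together: given a joint distribution table p(x_i, y_j), it computes H(X), H(Y), H(X,Y), the conditional entropy, the mutual information, and the symmetrical uncertainty SU(X,Y) used above for feature selection; the joint table is illustrative:

```python
import numpy as np

def H(p):
    """Entropy of a (possibly multi-dimensional) probability table."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Illustrative joint distribution p(x_i, y_j), rows = X, columns = Y (sums to 1)
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])

p_x = p_xy.sum(axis=1)                 # marginals
p_y = p_xy.sum(axis=0)

H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy)
H_x_given_y = H_xy - H_y               # H(X|Y) = H(X,Y) - H(Y)
I_xy = H_x + H_y - H_xy                # I(X;Y) = H(X) + H(Y) - H(X,Y)
SU = 2 * I_xy / (H_x + H_y)            # symmetrical uncertainty, in [0, 1]

print(H_x, H_y, H_xy)
print("I(X;Y) =", I_xy, "= H(X) - H(X|Y):", H_x - H_x_given_y)
print("SU(X,Y) =", SU)
```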