Machine Learning for Data Mining
Introduction to Bayesian Classifiers
Andres Mendez-Vazquez
August 3, 2015
Outline
1 Introduction
   Supervised Learning
   Naive Bayes
   The Naive Bayes Model
   The Multi-Class Case
   Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
   Introduction
   Gaussian Distribution
   Influence of the Covariance Σ
   Maximum Likelihood Principle
   Maximum Likelihood on a Gaussian
Classification problem
Training Data
Samples of the form (d, h(d))
d: the data objects to classify (the inputs)
h(d): the correct class label for d, with h(d) ∈ {1, . . . , K}
Classification Problem
Goal
Given dnew, provide h(dnew)
The machinery, in general, looks as follows
[Diagram: a supervised learning box maps INPUT to OUTPUT, guided by training info (the desired/target outputs)]
Naive Bayes Model
Task for two classes
Let ω1, ω2 be the two classes to which our samples belong.
There is a prior probability of belonging to each class:
P(ω1) for class 1.
P(ω2) for class 2.
The rule for classification is the following one
P(ωi|x) = P(x|ωi) P(ωi) / P(x)   (1)
Remark: Bayes to the next level.
In Informal English
We have that
posterior = (likelihood × prior information) / evidence   (2)
Basically
One: if we can observe x, then
Two: we can convert the prior information into posterior information.
We have the following terms...
Likelihood
We call p(x|ωi) the likelihood of ωi given x:
This indicates that, given a category ωi, if p(x|ωi) is “large”, then ωi is the “likely” class of x.
Prior Probability
It is the known probability of a given class.
Remark: when we lack information about this class, we tend to use the uniform distribution.
However: we can use other tricks for it.
Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
The most important term in all this
The factor
likelihood × prior information   (3)
Example
We have the likelihood of two classes [figure omitted]
Example
We have the posterior of two classes when P(ω1) = 2/3 and P(ω2) = 1/3 [figure omitted]
Naive Bayes Model
In the case of two classes
P(x) = Σ_{i=1}^{2} p(x, ωi) = Σ_{i=1}^{2} p(x|ωi) P(ωi)   (4)
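To make this concrete, here is a minimal Python sketch (not from the slides) that evaluates Eq. (4) and Eq. (1), assuming 1-D Gaussian class-conditional densities with made-up means and variances, and the priors P(ω1) = 2/3, P(ω2) = 1/3 from the example above.

```python
# Minimal sketch (assumed densities): computing the evidence and the posteriors.
import math

def gaussian_pdf(x, mu, sigma):
    # 1-D Gaussian density, used here as an assumed class-conditional p(x|w_i)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [2.0 / 3.0, 1.0 / 3.0]                      # P(w_1), P(w_2) from the example above
x = 1.2                                              # an arbitrary test point
lik = [gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 2.0, 1.0)]  # assumed p(x|w_1), p(x|w_2)

evidence = sum(l * p for l, p in zip(lik, priors))            # Eq. (4): P(x)
posteriors = [l * p / evidence for l, p in zip(lik, priors)]  # Eq. (1): P(w_i|x)
print(posteriors, sum(posteriors))                   # the posteriors sum to one
```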
Error in this rule
We have that
P(error|x) = { P(ω1|x) if we decide ω2;  P(ω2|x) if we decide ω1 }   (5)
Thus, we have that
P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error|x) p(x) dx   (6)
Classification Rule
Thus, we have the Bayes Classification Rule
1 If P(ω1|x) > P(ω2|x), x is classified to ω1.
2 If P(ω1|x) < P(ω2|x), x is classified to ω2.
What if we remove the normalization factor?
Remember
P(ω1|x) + P(ω2|x) = 1   (7)
We are able to obtain the new Bayes Classification Rule
1 If P(x|ω1) P(ω1) > P(x|ω2) P(ω2), x is classified to ω1.
2 If P(x|ω1) P(ω1) < P(x|ω2) P(ω2), x is classified to ω2.
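A minimal sketch of the unnormalized rule, again assuming 1-D Gaussian likelihoods with made-up parameters: comparing P(x|ωi) P(ωi) gives the same decision as comparing posteriors, because the evidence P(x) is a common positive factor.

```python
# Minimal sketch (assumed densities): the unnormalized two-class Bayes rule.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_decide(x, priors=(0.5, 0.5)):
    """Classify x into class 0 (w_1) or 1 (w_2) using assumed Gaussian likelihoods."""
    likelihoods = (gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 2.0, 1.0))
    scores = [l * p for l, p in zip(likelihoods, priors)]   # P(x|w_i) P(w_i)
    return 0 if scores[0] > scores[1] else 1

print(bayes_decide(0.3), bayes_decide(1.8))   # expected: 0 then 1 for these parameters
```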
We have several cases
If for some x we have P(x|ω1) = P(x|ω2)
The final decision relies entirely on the prior probabilities.
On the other hand, if P(ω1) = P(ω2), the “states” are equally probable
In this case the decision is based entirely on the likelihoods P(x|ωi).
What the Rule Looks Like
If P(ω1) = P(ω2), the rule depends only on the term p(x|ωi) [figure omitted]
The Error in the Second Case of Naive Bayes
Error in equiprobable classes
P(error) = (1/2) ∫_{−∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{∞} p(x|ω1) dx   (8)
Remark: P(ω1) = P(ω2) = 1/2
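As an illustration (not from the slides), a sketch that evaluates Eq. (8) in closed form for two assumed unit-variance Gaussian likelihoods with equal priors, using the Gaussian CDF; under these assumptions the threshold x0 is the midpoint between the two means.

```python
# Minimal sketch (assumed 1-D Gaussian likelihoods): the error of Eq. (8).
import math

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu1, mu2, sigma = 0.0, 2.0, 1.0
x0 = (mu1 + mu2) / 2.0   # equal priors, equal variances: the threshold is the midpoint

# P(error) = 1/2 * P(x < x0 | w_2) + 1/2 * P(x > x0 | w_1)
p_error = 0.5 * gaussian_cdf(x0, mu2, sigma) + 0.5 * (1.0 - gaussian_cdf(x0, mu1, sigma))
print(p_error)   # about 0.1587 for these assumed parameters
```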
What do we want to prove?
Something Notable
The Bayesian classifier is optimal with respect to minimizing the classification error probability.
Proof
Step 1
Let R1 be the region of the feature space in which we decide in favor of ω1.
Let R2 be the region of the feature space in which we decide in favor of ω2.
Step 2
Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)   (9)
Thus
Pe = P(x ∈ R2|ω1) P(ω1) + P(x ∈ R1|ω2) P(ω2)
   = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx
Proof
It is more
Pe = P(ω1) ∫_{R2} [p(ω1, x)/P(ω1)] dx + P(ω2) ∫_{R1} [p(ω2, x)/P(ω2)] dx   (10)
Finally
Pe = ∫_{R2} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx   (11)
Now, we choose the Bayes Classification Rule
R1 : P(ω1|x) > P(ω2|x)
R2 : P(ω2|x) > P(ω1|x)
Proof
Thus
P(ω1) = ∫_{R1} p(ω1|x) p(x) dx + ∫_{R2} p(ω1|x) p(x) dx   (12)
Now, we have...
P(ω1) − ∫_{R1} p(ω1|x) p(x) dx = ∫_{R2} p(ω1|x) p(x) dx   (13)
Then
Pe = P(ω1) − ∫_{R1} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx   (14)
Graphically P(ω1): Thanks Edith 2013 Class!!!
In red [figure omitted]
Thus we have ∫_{R1} p(ω1|x) p(x) dx = ∫_{R1} p(ω1, x) dx = P_{R1}(ω1)
Thus [figure omitted]
Finally
Pe = P(ω1) − ∫_{R1} [p(ω1|x) − p(ω2|x)] p(x) dx   (15)
Thus, we have
Pe = [ P(ω1) − ∫_{R1} p(ω1|x) p(x) dx ] + ∫_{R1} p(ω2|x) p(x) dx
Pe for a non-optimal rule
A great idea, Edith!!! [figure omitted]
Which decision function for minimizing the error?
A single number in this case [figure omitted]
Error is minimized by the Bayesian Naive Rule
Thus
The probability of error is minimized at the regions of space in which:
R1 : P(ω1|x) > P(ω2|x)
R2 : P(ω2|x) > P(ω1|x)
Pe for an optimal rule
A great idea, Edith!!! [figure omitted]
For M classes ω1, ω2, ..., ωM
We have that vector x is in ωi if
P(ωi|x) > P(ωj|x)  ∀ j ≠ i   (16)
Something Notable
It turns out that such a choice also minimizes the classification error probability.
Minimizing the Risk
Something Notable
The classification error probability is not always the best criterion to be adopted for minimization:
All the errors get the same importance.
However
Certain errors are more important than others.
For example
Really serious: a doctor makes a wrong decision and a malignant tumor gets classified as benign.
Not so serious: a benign tumor gets classified as malignant.
It is based on the following idea
In order to measure the predictive performance of a function f : X → Y
We use a loss function
ℓ : Y × Y → R+   (17)
A non-negative function that quantifies how bad the prediction f(x) is when the true label is y.
Thus, we can say that
ℓ(f(x), y) is the loss incurred by f on the pair (x, y).
Example
In classification
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y' = f(x):
ℓ(y', y) = 1{y' ≠ y}   (18)
Furthermore
For regression problems, some natural choices are
1 Squared loss: ℓ(y', y) = (y' − y)²
2 Absolute loss: ℓ(y', y) = |y' − y|
Thus, given the loss function, we can define the risk as
R(f) = E_{(X,Y)} [ℓ(f(X), Y)]   (19)
Although we cannot see the expected risk, we can use the sample to estimate the following
R̂(f) = (1/N) Σ_{i=1}^{N} ℓ(f(xi), yi)   (20)
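A minimal sketch (not from the slides) of the empirical risk in Eq. (20) with the 0-1 loss of Eq. (18); the toy sample and the threshold classifier below are assumed purely for illustration.

```python
# Minimal sketch (assumed toy data): empirical risk with the 0-1 loss.
def zero_one_loss(y_pred, y_true):
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(f, samples, loss=zero_one_loss):
    """R_hat(f) = (1/N) * sum of loss(f(x_i), y_i) over the sample."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# assumed toy data: label 1 when x > 0, with one noisy point
samples = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.5, 1), (-0.1, 1)]
threshold_classifier = lambda x: 1 if x > 0.0 else 0
print(empirical_risk(threshold_classifier, samples))   # 0.2: one of five points misclassified
```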
Thus
Risk Minimization
Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):
f̂_N = arg min_{f ∈ F} R̂(f)   (21)
If we knew the distribution P and did not restrict ourselves to F, the best function would be
f* = arg min_{f} R(f)   (22)
Thus
For classification with 0-1 loss, f* is called the Bayes Classifier
f*(x) = arg max_{y ∈ Y} P(Y = y|X = x)   (23)
The New Risk Function
Now, we have the following loss values for two classes
λ12 is the loss incurred if the sample is in class 1, but it is classified as class 2.
λ21 is the loss incurred if the sample is in class 2, but it is classified as class 1.
We can generate the new risk function
r = λ12 Pe(x, ω1) + λ21 Pe(x, ω2)
  = λ12 ∫_{R2} p(x, ω1) dx + λ21 ∫_{R1} p(x, ω2) dx
  = λ12 ∫_{R2} p(x|ω1) P(ω1) dx + λ21 ∫_{R1} p(x|ω2) P(ω2) dx
The New Risk Function
The new risk function to be minimized is
r = λ12 P(ω1) ∫_{R2} p(x|ω1) dx + λ21 P(ω2) ∫_{R1} p(x|ω2) dx   (24)
Where the λ terms work as a weight factor for each error
Thus, we could have something like λ12 > λ21!!!
Then
Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
Now, Consider an M-Class Problem
We have
Rj, j = 1, ..., M, the regions where the classes ωj live.
Now, think of the following error
Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified.
Now, for this error we associate the term λki (loss)
With that we have a loss matrix L, where the (k, i) location corresponds to such a loss.
Thus, we have a general risk associated with each class ωk
Definition
rk = Σ_{i=1}^{M} λki ∫_{Ri} p(x|ωk) dx   (25)
We want to minimize the global risk
r = Σ_{k=1}^{M} rk P(ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx   (26)
For this
We want to select the set of partition regions Rj.
How do we do that?
We minimize each integral
∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx   (27)
So we need to minimize
Σ_{k=1}^{M} λki p(x|ωk) P(ωk)   (28)
We can do the following
Assign x ∈ Ri if
li = Σ_{k=1}^{M} λki p(x|ωk) P(ωk) < lj = Σ_{k=1}^{M} λkj p(x|ωk) P(ωk)   (29)
for all j ≠ i.
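A minimal sketch (not from the slides) of the minimum-risk rule in Eq. (29): for each candidate class i it accumulates li = Σk λki p(x|ωk) P(ωk) and picks the smallest. The densities, priors and loss matrix below are assumed toy values.

```python
# Minimal sketch (assumed densities, priors and losses): minimum-risk decision, Eq. (29).
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [0.5, 0.5]
likelihoods = [lambda x: gaussian_pdf(x, 0.0, 1.0),
               lambda x: gaussian_pdf(x, 2.0, 1.0)]
# loss_matrix[k][i] = lambda_{ki}: loss of deciding class i when the truth is class k
loss_matrix = [[0.0, 1.0],
               [0.5, 0.0]]   # misclassifying class 1 is penalized more (lambda_12 > lambda_21)

def min_risk_decide(x):
    risks = []
    for i in range(len(priors)):                    # candidate decision i
        l_i = sum(loss_matrix[k][i] * likelihoods[k](x) * priors[k]
                  for k in range(len(priors)))
        risks.append(l_i)
    return min(range(len(risks)), key=lambda i: risks[i])

print(min_risk_decide(1.0))   # at the symmetric point x = 1.0 the cheaper error wins (class index 0)
```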
Remarks
When we use Kronecker’s delta
δki = { 0 if k ≠ i;  1 if k = i }   (30)
We can do the following
λki = 1 − δki   (31)
We finish with
Σ_{k=1, k≠i}^{M} p(x|ωk) P(ωk) < Σ_{k=1, k≠j}^{M} p(x|ωk) P(ωk)   (32)
Then
We have that
p(x|ωj) P(ωj) < p(x|ωi) P(ωi)   (33)
for all j ≠ i.
Special Case: The Two-Class Case
We get
l1 = λ11 p(x|ω1) P(ω1) + λ21 p(x|ω2) P(ω2)
l2 = λ12 p(x|ω1) P(ω1) + λ22 p(x|ω2) P(ω2)
We assign x to ω1 if l1 < l2, i.e., if
(λ21 − λ22) p(x|ω2) P(ω2) < (λ12 − λ11) p(x|ω1) P(ω1)   (34)
If we assume that λii < λij (correct decisions are penalized much less than wrong ones)
x ∈ ω1 (ω2) if l12 ≡ p(x|ω1)/p(x|ω2) > (<) [P(ω2)/P(ω1)] · [(λ21 − λ22)/(λ12 − λ11)]   (35)
Special Case: The Two-Class Case
Definition
l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.
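A minimal sketch (not from the slides) of the likelihood ratio test in Eq. (35), with assumed Gaussian likelihoods, equal priors, and a loss matrix satisfying λii < λij; the asymmetric losses shift the threshold and enlarge the ω1 region.

```python
# Minimal sketch (assumed densities, priors and losses): the likelihood ratio test, Eq. (35).
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio_test(x, priors=(0.5, 0.5),
                          lam=((0.0, 1.0),     # lam[i][j] corresponds to lambda_{(i+1)(j+1)}
                               (0.5, 0.0))):
    l12 = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0)   # p(x|w1) / p(x|w2)
    threshold = (priors[1] / priors[0]) * (lam[1][0] - lam[1][1]) / (lam[0][1] - lam[0][0])
    return 1 if l12 > threshold else 2    # class index in the slides' 1-based notation

print(likelihood_ratio_test(1.0))   # class 1: the losses above make deciding w_2 more costly
```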
Decision Surface
Because the regions R1 and R2 are contiguous
The separating surface between them is described by
P(ω1|x) − P(ω2|x) = 0   (36)
Thus, we define the decision function as
g12(x) = P(ω1|x) − P(ω2|x) = 0   (37)
Which decision function for the Naive Bayes?
A single number in this case [figure omitted]
In general
First
Instead of working with probabilities, we work with an equivalent function of them, gi(x) = f(P(ωi|x)).
Classic example: the monotonically increasing f(P(ωi|x)) = ln P(ωi|x).
The decision test is now
Classify x in ωi if gi(x) > gj(x) ∀ j ≠ i.
The decision surfaces, separating contiguous regions, are described by
gij(x) = gi(x) − gj(x) = 0,  i, j = 1, 2, ..., M,  i ≠ j
Gaussian Distribution
We can use the Gaussian distribution
p(x|ωi) = [1 / ((2π)^{l/2} |Σi|^{1/2})] exp{ −(1/2) (x − µi)^T Σi^{−1} (x − µi) }   (38)
Example [figure omitted]
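A minimal sketch (not from the slides) of evaluating the density in Eq. (38) with NumPy; the mean and covariance below are assumed toy values.

```python
# Minimal sketch (assumed parameters): the multivariate Gaussian density of Eq. (38).
import numpy as np

def gaussian_density(x, mu, sigma):
    """p(x|w_i) for x, mu of dimension l and an l x l covariance matrix sigma."""
    l = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff                 # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (l / 2.0) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])     # the axis-aligned covariance discussed on the next slides
print(gaussian_density(np.array([1.0, 1.0]), mu, sigma))
```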
Some Properties
About Σ
It is the covariance matrix of the variables.
Thus
It is positive definite.
It is symmetric.
Its inverse exists.
Influence of the Covariance Σ
Look at the following covariance
Σ = [1 0; 0 1]
It is simply the unit Gaussian with mean µ [figure omitted]
The Covariance Σ as a Rotation
Look at the following covariance
Σ = [4 0; 0 1]
Actually, it stretches the circle along the x-axis [figure omitted]
Influence of the Covariance Σ
Look at the following covariance
Σa = R Σb R^T with R = [cos θ  −sin θ; sin θ  cos θ]
It allows us to rotate the axes [figure omitted]
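A minimal sketch (not from the slides) of the rotation Σa = R Σb R^T, with an assumed angle θ; the eigenvalues stay the same, only the principal axes rotate.

```python
# Minimal sketch (assumed angle): rotating a covariance matrix with the standard 2-D rotation.
import numpy as np

theta = np.pi / 6.0                       # assumed rotation angle (30 degrees)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

sigma_b = np.array([[4.0, 0.0],           # the axis-aligned "stretched" covariance
                    [0.0, 1.0]])
sigma_a = R @ sigma_b @ R.T               # same spread, rotated principal axes
print(sigma_a)
print(np.linalg.eigvalsh(sigma_a))        # still [1, 4]: the rotation preserves the eigenvalues
```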
Now For Two Classes
Then, we use the following trick for two classes i = 1, 2
We know that the pdf of correct classification is p(x, ωi) = p(x|ωi) P(ωi)!!!
Thus
It is possible to generate the following decision function:
gi(x) = ln [p(x|ωi) P(ωi)] = ln p(x|ωi) + ln P(ωi)   (39)
Thus
gi(x) = −(1/2) (x − µi)^T Σi^{−1} (x − µi) + ln P(ωi) + ci   (40)
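A minimal sketch (not from the slides) of the discriminant in Eq. (40) for two classes with assumed means, covariances and priors; here the per-class constant ci is taken as −(1/2) ln |Σi|, which follows from the log of Eq. (38) after dropping terms common to all classes.

```python
# Minimal sketch (assumed parameters): the Gaussian discriminant g_i(x) of Eq. (40).
import numpy as np

def discriminant(x, mu, sigma, prior):
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    # -1/2 quadratic form + ln P(w_i) + c_i, with c_i = -1/2 ln|Sigma_i|
    return -0.5 * quad + np.log(prior) - 0.5 * np.log(np.linalg.det(sigma))

mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.6, 0.4]

x = np.array([1.0, 0.5])
scores = [discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
print(int(np.argmax(scores)))   # index of the winning class (0-based)
```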
Given a series of classes ω1, ω2, ..., ωM
We assume for each class ωj
The samples are drawn independently according to the probability law p(x|ωj).
We call those samples
i.i.d.: independent, identically distributed random variables.
We assume in addition
p(x|ωj) has a known parametric form with a vector θj of parameters.
Given a series of classes ω1, ω2, ..., ωM
For example
p(x|ωj) ∼ N(µj, Σj)   (41)
In our case
We will assume that there is no dependence between classes!!!
Now
Suppose that class ωj contains n samples x1, x2, ..., xn
p(x1, x2, ..., xn|θj) = Π_{k=1}^{n} p(xk|θj)   (42)
We can then see p(x1, x2, ..., xn|θj) as a function of θj, the likelihood
L(θj) = Π_{k=1}^{n} p(xk|θj)   (43)
Example
ln L(θj) = log Π_{k=1}^{n} p(xk|θj) [figure omitted]
Maximum Likelihood on a Gaussian
Then, using the log!!!
ln L(ωi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c2   (44)
We know that
d(x^T A x)/dx = A x + A^T x,    d(A x)/dx = A   (45)
Thus, we expand equation (44)
−(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} [ xj^T Σi^{−1} xj − 2 xj^T Σi^{−1} µi + µi^T Σi^{−1} µi ] + c2   (46)
Maximum Likelihood
Then
∂ ln L(ωi)/∂µi = Σ_{j=1}^{n} Σi^{−1} (xj − µi) = 0
n Σi^{−1} [ −µi + (1/n) Σ_{j=1}^{n} xj ] = 0
µ̂i = (1/n) Σ_{j=1}^{n} xj
Maximum Likelihood
Then, we derive with respect to Σi
For this we use the following tricks:
1 ∂ ln|Σ| / ∂Σ^{−1} = −(1/|Σ|) · |Σ| (Σ)^T = −Σ
2 ∂ Tr[AB]/∂A = ∂ Tr[BA]/∂A = B^T
3 Trace(of a number) = the number
4 Tr(A^T B) = Tr(B A^T)
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c1   (47)
Maximum Likelihood
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[ (xj − µi)^T Σi^{−1} (xj − µi) ] + c1   (48)
Tricks!!!
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[ Σi^{−1} (xj − µi)(xj − µi)^T ] + c1   (49)
Maximum Likelihood
Derivative with respect to Σi^{−1} (using the tricks above)
∂f(Σi)/∂Σi^{−1} = (n/2) Σi − (1/2) Σ_{j=1}^{n} [(xj − µi)(xj − µi)^T]^T   (50)
Thus, when we set it equal to zero
Σ̂i = (1/n) Σ_{j=1}^{n} (xj − µi)(xj − µi)^T   (51)
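A minimal sketch (not from the slides) checking the maximum likelihood estimates µ̂i and Σ̂i (Eq. (51)) on an assumed synthetic Gaussian sample; note the 1/n normalization of the MLE covariance rather than the unbiased 1/(n−1).

```python
# Minimal sketch (assumed synthetic data): the MLE of the mean and covariance of one class.
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=5000)   # n samples from one class

n = X.shape[0]
mu_hat = X.mean(axis=0)               # mu_hat = (1/n) sum_j x_j
diff = X - mu_hat
sigma_hat = (diff.T @ diff) / n       # Sigma_hat = (1/n) sum_j (x_j - mu_hat)(x_j - mu_hat)^T, Eq. (51)
print(mu_hat)                         # close to the true mean
print(sigma_hat)                      # close to the true covariance (the 1/n MLE)
```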
Exercises
Duda and Hart
Chapter 3: 3.1, 3.2, 3.3, 3.13
Theodoridis
Chapter 2: 2.5, 2.7, 2.10, 2.12, 2.14, 2.17
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

06 Machine Learning - Naive Bayes

  • 1. Machine Learning for Data Mining Introduction to Bayesian Classifiers Andres Mendez-Vazquez August 3, 2015 1 / 71
  • 2. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 2 / 71
  • 3. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 6. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 4 / 71
  • 7. Classification Problem Goal Given dnew, provide h(dnew) The Machinery in General looks... Supervised Learning Training Info: Desired/Trget Output INPUT OUTPUT 5 / 71
  • 8. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 6 / 71
  • 9. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 13. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 16. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because we often lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one. 9 / 71
  • 22. The most important term in all this The factor likelihood × prior-information (3) 10 / 71
  • 23. Example We have the likelihood of two classes 11 / 71
  • 24. Example We have the posterior of two classes when P (ω1) = 2/3 and P (ω2) = 1/3 12 / 71
  • 25. Naive Bayes Model In the case of two classes P (x) = Σ_{i=1}^{2} p (x, ωi) = Σ_{i=1}^{2} p (x|ωi) P (ωi) (4) 13 / 71
  • 26. Error in this rule We have that P (error|x) = P (ω1|x) if we decide ω2, and P (ω2|x) if we decide ω1 (5) Thus, we have that P (error) = ∫_{−∞}^{∞} P (error, x) dx = ∫_{−∞}^{∞} P (error|x) p (x) dx (6) 14 / 71
  • 28. Classification Rule Thus, we have the Bayes Classification Rule 1 If P (ω1|x) > P (ω2|x) x is classified to ω1 2 If P (ω1|x) < P (ω2|x) x is classified to ω2 15 / 71
  • 30. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) P (ω1) > P (x|ω2) P (ω2), x is classified to ω1 2 If P (x|ω1) P (ω1) < P (x|ω2) P (ω2), x is classified to ω2 16 / 71
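This unnormalized rule is easy to turn into code. Below is a minimal sketch, assuming two one-dimensional Gaussian class-conditional densities and the priors P (ω1) = 2/3, P (ω2) = 1/3 from the earlier example; the means and variances are made-up values, not taken from the slides.

```python
import numpy as np

# Class-conditional densities p(x | w_i) for a 1-D feature, modeled here as
# Gaussians with assumed means and variances (illustrative only).
def likelihood(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

priors = {1: 2 / 3, 2: 1 / 3}             # P(w_1), P(w_2) from the earlier example
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}   # assumed (mean, variance) per class

def classify(x):
    # Compare p(x|w_1)P(w_1) with p(x|w_2)P(w_2); the evidence P(x) cancels out.
    scores = {c: likelihood(x, *params[c]) * priors[c] for c in (1, 2)}
    return max(scores, key=scores.get)

print(classify(0.3))   # -> 1
print(classify(2.5))   # -> 2
```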
  • 33. We have several cases If for some x we have P (x|ω1) = P (x|ω2), the final decision relies entirely on the prior probabilities. On the other hand, if P (ω1) = P (ω2), i.e. the classes are equally probable, the decision is based entirely on the likelihoods P (x|ωi). 17 / 71
  • 35. What the Rule looks like If P (ω1) = P (ω2), the Rule depends only on the term p (x|ωi) 18 / 71
  • 36. The Error in the Second Case of Naive Bayes Error in equiprobable classes P (error) = (1/2) ∫_{−∞}^{x0} p (x|ω2) dx + (1/2) ∫_{x0}^{∞} p (x|ω1) dx (8) Remark: P (ω1) = P (ω2) = 1/2 19 / 71
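As a sanity check, equation (8) can be evaluated numerically. The sketch below assumes two equiprobable Gaussian likelihoods with means 0 and 2 and unit variance (made-up values) and approximates the two integrals with a simple Riemann sum; the threshold x0 is the point where the two likelihoods cross.

```python
import numpy as np

# Two equiprobable 1-D Gaussian classes; the means and variance are assumed.
m1, m2, s = 0.0, 2.0, 1.0
xs = np.linspace(-8.0, 10.0, 20001)
dx = xs[1] - xs[0]
p1 = np.exp(-0.5 * ((xs - m1) / s) ** 2) / (np.sqrt(2 * np.pi) * s)
p2 = np.exp(-0.5 * ((xs - m2) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

x0 = (m1 + m2) / 2                 # threshold where p(x|w1) = p(x|w2)
left = xs <= x0
# P(error) = 1/2 * int_{-inf}^{x0} p(x|w2) dx + 1/2 * int_{x0}^{inf} p(x|w1) dx
p_error = 0.5 * p2[left].sum() * dx + 0.5 * p1[~left].sum() * dx
print(p_error)                      # about 0.159 for these parameters
```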
  • 37. What do we want to prove? Something Notable The Bayesian classifier is optimal with respect to minimizing the classification error probability. 20 / 71
  • 38. Proof Step 1 Let R1 be the region of the feature space in which we decide in favor of ω1, and R2 be the region in which we decide in favor of ω2. Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ∫_{R2} p (x|ω1) dx + P (ω2) ∫_{R1} p (x|ω2) dx 21 / 71
  • 42. Proof It is more Pe = P (ω1) ∫_{R2} [p (ω1, x) / P (ω1)] dx + P (ω2) ∫_{R1} [p (ω2, x) / P (ω2)] dx (10) Finally Pe = ∫_{R2} p (ω1|x) p (x) dx + ∫_{R1} p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 45. Proof Thus P (ω1) = ∫_{R1} p (ω1|x) p (x) dx + ∫_{R2} p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ∫_{R1} p (ω1|x) p (x) dx = ∫_{R2} p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ∫_{R1} p (ω1|x) p (x) dx + ∫_{R1} p (ω2|x) p (x) dx (14) 23 / 71
  • 48. Graphically P (ω1): Thanks Edith 2013 Class!!! In Red 24 / 71
  • 49. Thus we have ∫_{R1} p (ω1|x) p (x) dx = ∫_{R1} p (ω1, x) dx = P_{R1} (ω1) Thus 25 / 71
  • 50. Finally Finally Pe = P (ω1) − ∫_{R1} [p (ω1|x) − p (ω2|x)] p (x) dx (15) Thus, we have Pe = [P (ω1) − ∫_{R1} p (ω1|x) p (x) dx] + ∫_{R1} p (ω2|x) p (x) dx 26 / 71
  • 52. Pe for a non optimal rule A great idea Edith!!! 27 / 71
  • 53. Which decision function for minimizing the error A single number in this case 28 / 71
  • 54. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 57. Pe for an optimal rule A great idea Edith!!! 30 / 71
  • 58. For M classes ω1, ω2, ..., ωM We have that vector x is in ωi if P (ωi|x) > P (ωj|x) ∀j ≠ i (16) Something Notable It turns out that such a choice also minimizes the classification error probability. 31 / 71
  • 60. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 32 / 71
  • 61. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization, because it gives all the errors the same importance. However Certain errors are more important than others. For example Really serious: a doctor makes a wrong decision and a malignant tumor gets classified as benign. Not so serious: a benign tumor gets classified as malignant. 33 / 71
  • 65. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function ℓ : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is when the true label is y. Thus, we can say that ℓ (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
  • 68. Example In classification In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f (x): ℓ (y′, y) = 1 if y′ ≠ y, and 0 otherwise (18) 35 / 71
  • 70. Furthermore For regression problems, some natural choices 1 Squared loss: ℓ (y′, y) = (y′ − y)² 2 Absolute loss: ℓ (y′, y) = |y′ − y| Thus, given the loss function, we can define the risk as R (f ) = E_{(X,Y )} [ℓ (f (X) , Y )] (19) Although we cannot see the expected risk, we can use the sample to estimate it R̂ (f ) = (1/N) Σ_{i=1}^{N} ℓ (f (xi) , yi) (20) 36 / 71
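A quick illustration of equation (20): a minimal sketch, assuming a toy one-dimensional data set and a fixed thresholding classifier (both made up), that computes the empirical risk under the 0-1 loss.

```python
import numpy as np

# Empirical risk (eq. 20) under the 0-1 loss for a toy data set and a fixed
# thresholding classifier; both the data and the rule are made up.
x = np.array([0.2, 1.8, 2.5, -0.7, 3.1])
y = np.array([0, 1, 1, 0, 1])

def f(x):
    return (x > 1.0).astype(int)           # candidate classifier

def zero_one_loss(y_pred, y_true):
    return (y_pred != y_true).astype(float)

emp_risk = zero_one_loss(f(x), y).mean()   # (1/N) * sum_i l(f(x_i), y_i)
print(emp_risk)                             # 0.0 here: f labels this toy set perfectly
```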
  • 74. Thus Risk Minimization Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM): f̂_N = arg min_{f ∈ F} R̂ (f ) (21) If we knew the distribution P and did not restrict ourselves to F, the best function would be f ∗ = arg min_f R (f ) (22) 37 / 71
  • 76. Thus For classification with 0-1 loss, f ∗ is called the Bayesian Classifier f ∗ (x) = arg max_{y∈Y} P (Y = y|X = x) (23) 38 / 71
  • 77. The New Risk Function Now, we have the following loss values for the two classes λ12 is the loss value if the sample is in class 1, but the function classified it as class 2. λ21 is the loss value if the sample is in class 2, but the function classified it as class 1. We can generate the new risk function r = λ12 Pe (x, ω1) + λ21 Pe (x, ω2) = λ12 ∫_{R2} p (x, ω1) dx + λ21 ∫_{R1} p (x, ω2) dx = λ12 ∫_{R2} p (x|ω1) P (ω1) dx + λ21 ∫_{R1} p (x|ω2) P (ω2) dx 39 / 71
  • 82. The New Risk Function The new risk function to be minimized r = λ12 P (ω1) ∫_{R2} p (x|ω1) dx + λ21 P (ω2) ∫_{R1} p (x|ω2) dx (24) Where the λ terms work as weight factors for each error Thus, we could have something like λ12 > λ21!!! Then Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation. 40 / 71
  • 85. Now, Consider an M class problem We have Rj, j = 1, ..., M, the regions where the classes ωj live. Now, think of the following error Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified. Now, for this error we associate the term λki (loss) With that we have a loss matrix L whose (k, i) location corresponds to such a loss. 41 / 71
  • 88. Thus, we have a general loss associated to each class ωk Definition rk = Σ_{i=1}^{M} λki ∫_{Ri} p (x|ωk) dx (25) We want to minimize the global risk r = Σ_{k=1}^{M} rk P (ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p (x|ωk) P (ωk) dx (26) For this We want to select the set of partition regions Rj. 42 / 71
  • 91. How do we do that? We minimize each integral ∫_{Ri} Σ_{k=1}^{M} λki p (x|ωk) P (ωk) dx (27) We need to minimize Σ_{k=1}^{M} λki p (x|ωk) P (ωk) (28) We can do the following If x ∈ Ri, then li = Σ_{k=1}^{M} λki p (x|ωk) P (ωk) < lj = Σ_{k=1}^{M} λkj p (x|ωk) P (ωk) (29) for all j ≠ i. 43 / 71
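The minimum-risk rule in (29) becomes a one-liner once the loss matrix, the priors, and the likelihood values at x are available. The sketch below uses a made-up 3-class loss matrix, priors, and likelihoods purely for illustration.

```python
import numpy as np

# Minimum-risk decision (eq. 29): pick the class i that minimizes
# l_i = sum_k lambda_{ki} p(x|w_k) P(w_k).  The loss matrix, priors, and the
# likelihood values below are assumed, for illustration only.
lam = np.array([[0.0, 1.0, 2.0],    # lam[k, i]: loss of deciding w_i when the truth is w_k
                [0.5, 0.0, 1.0],
                [2.0, 1.0, 0.0]])
priors = np.array([0.5, 0.3, 0.2])  # P(w_k)

def decide(likelihoods):
    # likelihoods[k] = p(x | w_k) at the observed x
    joint = likelihoods * priors        # p(x|w_k) P(w_k)
    l = lam.T @ joint                   # l[i] = sum_k lam[k, i] * joint[k]
    return int(np.argmin(l))            # 0-based index of the chosen class

print(decide(np.array([0.10, 0.40, 0.05])))   # -> 1 (class w_2) for these numbers
```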
  • 94. Remarks When we have Kronecker’s delta δki = 1 if k = i, and 0 if k ≠ i (30) We can do the following λki = 1 − δki (31) We finish with Σ_{k=1,k≠i}^{M} p (x|ωk) P (ωk) < Σ_{k=1,k≠j}^{M} p (x|ωk) P (ωk) (32) 44 / 71
  • 97. Then We have that p (x|ωj) P (ωj) < p (x|ωi) P (ωi) (33) for all j ≠ i. 45 / 71
  • 98. Special Case: The Two-Class Case We get l1 = λ11 p (x|ω1) P (ω1) + λ21 p (x|ω2) P (ω2) l2 = λ12 p (x|ω1) P (ω1) + λ22 p (x|ω2) P (ω2) We assign x to ω1 if l1 < l2, i.e. (λ21 − λ22) p (x|ω2) P (ω2) < (λ12 − λ11) p (x|ω1) P (ω1) (34) If we assume that λii < λij (correct decisions are penalized much less than wrong ones) x ∈ ω1 (ω2) if l12 ≡ p (x|ω1) / p (x|ω2) > (<) [P (ω2) / P (ω1)] · [(λ21 − λ22) / (λ12 − λ11)] (35) 46 / 71
  • 101. Special Case: The Two-Class Case Definition l12 is known as the likelihood ratio and the preceding test as the likelihood ratio test. 47 / 71
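A small sketch of the likelihood ratio test in (35); the loss values, the priors, and the two likelihood values passed to decide are assumptions chosen only to exercise the rule.

```python
# Likelihood ratio test (eq. 35).  All numeric values are assumed for the demo.
lam11, lam12, lam21, lam22 = 0.0, 1.0, 0.5, 0.0
P1, P2 = 0.6, 0.4

def decide(p_x_given_1, p_x_given_2):
    # Decide w_1 when the likelihood ratio exceeds the loss-weighted threshold.
    l12 = p_x_given_1 / p_x_given_2
    threshold = (P2 / P1) * (lam21 - lam22) / (lam12 - lam11)
    return 1 if l12 > threshold else 2

print(decide(0.30, 0.10))   # ratio 3.0 > threshold 1/3  -> class 1
```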
  • 102. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 48 / 71
  • 103. Decision Surface Because the R1 and R2 are contiguous The separating surface between both of them is described by P (ω1|x) − P (ω2|x) = 0 (36) Thus, we define the decision function as g12 (x) = P (ω1|x) − P (ω2|x) = 0 (37) 49 / 71
  • 105. Which decision function for the Naive Bayes A single number in this case 50 / 71
  • 106. In general First Instead of working with probabilities, we work with an equivalent function of them, gi (x) = f (P (ωi|x)). Classic Example the monotonically increasing f (P (ωi|x)) = ln P (ωi|x). The decision test is now classify x in ωi if gi (x) > gj (x) ∀j ≠ i. The decision surfaces, separating contiguous regions, are described by gij (x) = gi (x) − gj (x), i, j = 1, 2, ..., M, i ≠ j 51 / 71
  • 110. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 52 / 71
  • 111. Gaussian Distribution We can use the Gaussian distribution p (x|ωi) = [1 / ((2π)^{l/2} |Σi|^{1/2})] exp{−(1/2) (x − µi)^T Σi^{−1} (x − µi)} (38) Example 53 / 71
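Equation (38) maps directly to a few lines of numpy. The sketch below evaluates the density at a point; the mean vector and covariance matrix are example values, not taken from the slides.

```python
import numpy as np

# Evaluate the multivariate Gaussian density of eq. (38); mu and Sigma are
# example parameters.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

def gaussian_pdf(x, mu, Sigma):
    l = mu.shape[0]                                   # dimension of x
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```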
  • 113. Some Properties About Σ It is the covariance matrix between variables. Thus It is positive definite. Symmetric. The inverse exists. 54 / 71
  • 115. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 55 / 71
  • 116. Influence of the Covariance Σ Look at the following Covariance Σ = [1 0; 0 1] It is simply the unit Gaussian with mean µ 56 / 71
  • 118. The Covariance Σ as a Rotation Look at the following Covariance Σ = [4 0; 0 1] Actually, it stretches the unit circle along the x-axis 57 / 71
  • 120. Influence of the Covariance Σ Look at the following Covariance Σa = R Σb R^T with R = [cos θ −sin θ; sin θ cos θ] It allows us to rotate the axes 58 / 71
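A tiny numeric check of this rotation; the angle θ = 30° and the stretched covariance diag(4, 1) from the previous slide are used as example inputs.

```python
import numpy as np

# Sigma_a = R Sigma_b R^T for an assumed rotation angle of 30 degrees.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Sigma_b = np.diag([4.0, 1.0])      # the stretched covariance from the previous slide
Sigma_a = R @ Sigma_b @ R.T
print(Sigma_a)                      # same eigenvalues (4 and 1), rotated axes
```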
  • 122. Now For Two Classes Then, we use the following trick for two classes i = 1, 2 We know that the pdf of correct classification is p (x, ωi) = p (x|ωi) P (ωi)!!! Thus It is possible to generate the following decision function: gi (x) = ln [p (x|ωi) P (ωi)] = ln p (x|ωi) + ln P (ωi) (39) Thus gi (x) = −(1/2) (x − µi)^T Σi^{−1} (x − µi) + ln P (ωi) + ci (40) 59 / 71
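The discriminant in (40) translates into the short sketch below; the means, covariances, and priors are assumed values, and the class-dependent −(1/2) ln |Σi| part of ci is written out explicitly so the comparison stays valid when the covariances differ.

```python
import numpy as np

# Discriminant g_i(x) from eq. (40) with the -(1/2) ln|Sigma_i| term kept
# explicitly.  Means, covariances, and priors are assumed example values.
classes = {
    1: {"mu": np.array([0.0, 0.0]), "Sigma": np.eye(2), "prior": 0.7},
    2: {"mu": np.array([2.0, 2.0]), "Sigma": np.eye(2), "prior": 0.3},
}

def g(x, c):
    p = classes[c]
    diff = x - p["mu"]
    quad = -0.5 * diff @ np.linalg.inv(p["Sigma"]) @ diff
    return quad + np.log(p["prior"]) - 0.5 * np.log(np.linalg.det(p["Sigma"]))

x = np.array([0.8, 1.1])
print(max(classes, key=lambda c: g(x, c)))   # class with the largest discriminant
```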
  • 125. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 60 / 71
  • 126. Given a series of classes ω1, ω2, ..., ωM We assume for each class ωj The samples are drawn independently according to the probability law p (x|ωj) We call those samples i.i.d., independent identically distributed random variables. We assume in addition p (x|ωj) has a known parametric form with vector θj of parameters. 61 / 71
  • 129. Given a series of classes ω1, ω2, ..., ωM For example p (x|ωj) ∼ N (µj, Σj) (41) In our case We will assume that there is no dependence between classes!!! 62 / 71
  • 131. Now Suppose that ωj contains n samples x1, x2, ..., xn p (x1, x2, ..., xn|θj) = Π_{k=1}^{n} p (xk|θj) (42) We can then see the function p (x1, x2, ..., xn|θj) as a function of θj L (θj) = Π_{k=1}^{n} p (xk|θj) (43) 63 / 71
  • 133. Example L (θj) = log Π_{k=1}^{n} p (xk|θj) 64 / 71
  • 134. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 65 / 71
  • 135. Maximum Likelihood on a Gaussian Then, using the log!!! ln L (ωi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c2 (44) We know that d(x^T A x)/dx = Ax + A^T x, d(Ax)/dx = A (45) Thus, we expand equation (44) −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} [xj^T Σi^{−1} xj − 2 xj^T Σi^{−1} µi + µi^T Σi^{−1} µi] + c2 (46) 66 / 71
  • 138. Maximum Likelihood Then ∂ ln L (ωi) / ∂µi = Σ_{j=1}^{n} Σi^{−1} (xj − µi) = 0, so n Σi^{−1} [−µi + (1/n) Σ_{j=1}^{n} xj] = 0, and µ̂i = (1/n) Σ_{j=1}^{n} xj 67 / 71
  • 139. Maximum Likelihood Then, we derive with respect to Σi For this we use the following tricks: 1 ∂ log |Σ| / ∂Σ^{−1} = −(1/|Σ|) · |Σ| Σ^T = −Σ 2 ∂ Tr[AB] / ∂A = ∂ Tr[BA] / ∂A = B^T 3 Trace(of a number) = the number 4 Tr(A^T B) = Tr(B A^T) Thus f (Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c1 (47) 68 / 71
  • 140. Maximum Likelihood Thus f (Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[(xj − µi)^T Σi^{−1} (xj − µi)] + c1 (48) Tricks!!! f (Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[Σi^{−1} (xj − µi) (xj − µi)^T] + c1 (49) 69 / 71
  • 142. Maximum Likelihood Derivative with respect to Σi ∂f (Σi) / ∂Σi = [(n/2) Σi − (1/2) Σ_{j=1}^{n} (xj − µi) (xj − µi)^T]^T (50) Thus, when setting it equal to zero Σ̂i = (1/n) Σ_{j=1}^{n} (xj − µi) (xj − µi)^T (51) 70 / 71
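These two closed-form estimates are the sample mean and the (biased, 1/n) sample covariance. The sketch below checks them on a synthetic data matrix drawn with numpy; the true parameters of that synthetic sample are assumptions of the example.

```python
import numpy as np

# ML estimates for a Gaussian: sample mean and (biased, 1/n) sample covariance.
# X is a made-up synthetic data matrix used only to illustrate the formulas.
rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=[1.0, 0.5], size=(500, 2))

mu_hat = X.mean(axis=0)                    # (1/n) * sum_j x_j
diff = X - mu_hat
Sigma_hat = diff.T @ diff / X.shape[0]     # (1/n) * sum_j (x_j - mu)(x_j - mu)^T

print(mu_hat)      # close to [1.0, -2.0]
print(Sigma_hat)   # close to diag(1.0, 0.25)
```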
  • 144. Exercises Duda and Hart Chapter 3 3.1, 3.2, 3.3, 3.13 Theodoridis Chapter 2 2.5, 2.7, 2.10, 2.12, 2.14, 2.17 71 / 71