1. Probabilistic Classification
Probabilistic classifiers construct a model that quantifies the relationship between the
feature variables and the target (class) variable as a probability.
There are many ways in which such a modeling can be performed. Two of the most
popular models are as follows:
1. Bayes classifier (a generative classifier)
2. Logistic regression (a discriminative classifier)
1. Bayes classifier
The Bayes rule is used to model the probability of each value of the target variable for a
given set of feature variables.
It is assumed that the data points within a class are generated from a specific probability
distribution such as
• Bernoulli distribution
• Multinomial distribution.
A naive Bayes assumption of class-conditioned feature independence is often (but not
always) used to simplify the modeling.
2. Logistic regression
The target variable is assumed to be drawn from a Bernoulli distribution whose mean is
defined by a parameterized logit function on the feature variables.
Thus, the probability distribution of the class variable is a parameterized function of the
feature variables. This is in contrast to the Bayes model, which assumes a specific
generative model of the feature distribution of each class.
2. Naïve Bayes Classifier
• Naïve Bayes is a Supervised Learning Classifier.
• Naïve Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naive) independence assumptions between
the features.
A naïve Bayesian model is easy to build, with no complicated iterative
parameter estimation, which makes it useful for very large datasets.
The naïve Bayes classifier performs surprisingly well, and it is widely used because
it often outperforms more sophisticated classification methods.
It is based on frequency tables.
How does it work?
• Bayes' theorem provides a way of calculating the posterior probability P(c|x) from
P(c), P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
• The naïve Bayes classifier assumes that the effect of the value of a predictor (x) on a
given class (c) is independent of the values of the other predictors;
this assumption is called class-conditional independence.
• P(c|x): the posterior probability of the class (target) given the predictor (attribute).
• P(x|c): the likelihood, i.e. the probability of the predictor given the class.
• P(c): the prior probability of the class (before seeing any data).
• P(x): the prior probability of the predictor.
Example of Naïve Bayes Classifier
Id  Outlook   Temp  Humidity  Windy  Play Tennis
1   Rainy     Hot   High      False  No
2   Rainy     Hot   High      True   No
3   Overcast  Hot   High      False  Yes
4   Sunny     Mid   High      False  Yes
5   Sunny     Cool  Normal    False  Yes
6   Sunny     Cool  Normal    True   No
7   Overcast  Cool  Normal    True   Yes
8   Rainy     Mid   High      False  No
9   Rainy     Cool  Normal    False  Yes
10  Sunny     Mid   Normal    False  Yes
11  Rainy     Mid   Normal    True   Yes
12  Overcast  Mid   High      True   Yes
13  Overcast  Hot   Normal    False  Yes
14  Sunny     Mid   High      True   No
• Frequency Tables:

Table 1 (Outlook):   Yes  No
  Sunny              3/9  2/5
  Overcast           4/9  0/5
  Rainy              2/9  3/5

Table 2 (Temp):      Yes  No
  Hot                2/9  2/5
  Mid                4/9  2/5
  Cool               3/9  1/5

Table 3 (Humidity):  Yes  No
  High               3/9  4/5
  Normal             6/9  1/5

Table 4 (Windy):     Yes  No
  False              6/9  2/5
  True               3/9  3/5

• Class Probability (Play Tennis):
P(Yes) = 9/14
P(No) = 5/14

• Likelihood Tables (the last column is the marginal probability of each feature value):

Table 1 (Outlook):   Yes  No   P(value)
  Sunny              3/9  2/5  5/14
  Overcast           4/9  0/5  4/14
  Rainy              2/9  3/5  5/14

Table 2 (Temp):      Yes  No   P(value)
  Hot                2/9  2/5  4/14
  Mid                4/9  2/5  6/14
  Cool               3/9  1/5  4/14

Table 3 (Humidity):  Yes  No   P(value)
  High               3/9  4/5  7/14
  Normal             6/9  1/5  7/14

Table 4 (Windy):     Yes  No   P(value)
  False              6/9  2/5  8/14
  True               3/9  3/5  6/14

Say that we want to calculate the posterior probability of the class (Yes) given
(Sunny), according to the previous P(C|X) equation:
P(C|X) = P(X|C) * P(C) / P(X)
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
= (3/9) * (9/14) / (5/14)
= 0.33 * 0.64 / 0.36
= 0.60
Now let's assume the following data for a day:

id  Outlook  Temp  Humidity  Windy  Play Tennis
    Rainy    Mid   Normal    True   ?

Likelihood of Yes = P(Outlook=Rainy|Yes) * P(Temp=Mid|Yes) * P(Humidity=Normal|Yes) * P(Windy=True|Yes) * P(Yes)
= 2/9 * 4/9 * 6/9 * 3/9 * 9/14
= 0.014109347
Likelihood of No = P(Outlook=Rainy|No) * P(Temp=Mid|No) * P(Humidity=Normal|No) * P(Windy=True|No) * P(No)
= 3/5 * 2/5 * 1/5 * 3/5 * 5/14
= 0.010285714
Normalizing (dividing by the sum of the two likelihoods):
P(Yes) = 0.014109347 / (0.014109347 + 0.010285714) = 0.578368999
P(No) = 0.010285714 / (0.014109347 + 0.010285714) = 0.421631001
P(Yes) > P(No), so we predict Yes:

id  Outlook  Temp  Humidity  Windy  Play Tennis
    Rainy    Mid   Normal    True   Yes

Since the evidence P(x) is constant and scales both posteriors equally, it does not
affect classification and can be ignored.
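The worked example above can be reproduced with a few lines of Python (a minimal sketch using exact fractions; the class-conditional probabilities are read off the likelihood tables):

```python
from fractions import Fraction as F

# Class-conditional probabilities from the likelihood tables, for the day
# (Outlook=Rainy, Temp=Mid, Humidity=Normal, Windy=True)
like_yes = F(2, 9) * F(4, 9) * F(6, 9) * F(3, 9) * F(9, 14)
like_no = F(3, 5) * F(2, 5) * F(1, 5) * F(3, 5) * F(5, 14)

# Normalize over the two classes (the evidence cancels out)
p_yes = like_yes / (like_yes + like_no)

print(round(float(like_yes), 9))  # 0.014109347
print(round(float(like_no), 9))   # 0.010285714
print(round(float(p_yes), 4))     # 0.5784 -> predict Yes
```

Using `Fraction` keeps the arithmetic exact until the final conversion, which makes it easy to check the hand computation above.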
3. Logistic Regression
Logistic regression is a regression model where the dependent variable (DV) is
categorical. The output can take only two values, "0" and "1" (binary classification),
which represent outcomes such as pass/fail, win/lose, alive/dead or healthy/sick.
Idea 1: Let p(x) be a linear function.
• We are estimating a probability, which must be between 0 and 1.
• Linear functions are unbounded, so this approach doesn't work.
Better idea:
• Set the log-odds to a linear function:
log odds = logit(p) = ln( p / (1 − p) ) = β0 + β1x
Solving for p:
p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x)) = 1 / (1 + e^(−(β0 + β1x)))
• This is called the logistic function, and it takes values in [0, 1].
• β0 and β1 are estimated from the data; each coefficient can be read as the change
in log-odds for a unit change in the input feature it is associated with.
Logit Function:
Logistic regression estimates the inverse of the logit function; the logit is used to
estimate probabilities of class membership instead of constructing a squared-error
objective. The core of logistic regression is the sigmoid function:
σ(z) = 1 / (1 + e^(−z))
The sigmoid function wraps the linear function y = β0 + β1x to force the output to be
between 0 and 1. The output can, therefore, be interpreted as a probability.
[Figure: the logistic function compared with a linear function.]
To minimize misclassification rates, we predict:
• Y = 1 when p(x) ≥ 0.5 and Y = 0 when p(x) < 0.5
• So Y = 1 when β0 + β1x is non-negative, and Y = 0 otherwise
• Logistic regression gives us a linear classifier: the decision boundary
separating the two classes is the solution of β0 + β1x = 0
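The prediction rule above can be sketched in Python (the coefficient values in the demo calls are hypothetical, chosen only to illustrate the 0.5 threshold):

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict(x, b0, b1):
    """Predict Y = 1 exactly when p(x) = sigmoid(b0 + b1*x) >= 0.5,
    i.e. when b0 + b1*x is non-negative."""
    return 1 if sigmoid(b0 + b1 * x) >= 0.5 else 0

print(sigmoid(0.0))       # 0.5 -- the decision boundary
print(predict(2, -2, 1))  # 1: -2 + 1*2 = 0 is non-negative
print(predict(1, -2, 1))  # 0: -2 + 1*1 < 0
```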
3.1 Training a Logistic Regression Classifier
The maximum likelihood approach is used to estimate the best-fitting parameters of the
logistic regression model; in other words, the parameters β0 and β1 are estimated using
a technique called maximum likelihood estimation.
Logistic regression is similar to classical least-squares linear regression.
The difference is that the logit function is used to estimate probabilities of class
membership instead of constructing a squared-error objective. Consequently, instead
of the least-squares optimization of linear regression, a maximum likelihood
optimization model is used for logistic regression.
Example: Suppose that medical researchers are interested in exploring the relationship
between patient age (x) and the presence (1) or absence (0) of a particular disease (y).
The data collected from 20 patients are shown below.
Let's examine the results of the logistic regression of disease on age, shown in Table 4.2.
The coefficients, that is, the maximum likelihood estimates of the unknown parameters
β0 and β1, are given as:
β0 = −4.372
β1 = 0.06696
N Age Y
1 25 0
2 29 0
3 30 0
4 31 0
5 32 0
6 41 0
7 41 0
8 42 0
9 44 1
10 49 1
11 50 0
12 59 1
13 60 0
14 62 0
15 68 1
16 72 0
17 79 1
18 80 0
19 81 1
20 84 1
sum 1059 7
These estimates may then be used to estimate the probability that the disease
is present in a particular patient given the patient's age. For example, for a 50-
year-old patient, we have
p(50) = e^(−4.372 + 0.06696 × 50) / (1 + e^(−4.372 + 0.06696 × 50)) = e^(−1.024) / (1 + e^(−1.024)) ≈ 0.26
Thus, the estimated probability that a 50-year-old patient has the disease is
26%, and the estimated probability that the disease is not present is 100% −
26% = 74%. On the other hand, for a 72-year-old patient, we have
p(72) = e^(−4.372 + 0.06696 × 72) / (1 + e^(−4.372 + 0.06696 × 72)) = e^(0.449) / (1 + e^(0.449)) ≈ 0.61
The estimated probability that a 72-year-old patient has the disease is 61%, and
the estimated probability that the disease is not present is 39%.
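As a quick check, the two probabilities can be computed directly from the fitted coefficients:

```python
from math import exp

b0, b1 = -4.372, 0.06696  # maximum likelihood estimates from the example

def p_disease(age):
    """Estimated P(disease present | age) under the fitted logistic model."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * age)))

print(round(p_disease(50), 2))  # 0.26
print(round(p_disease(72), 2))  # 0.61
```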
4. Linear Support Vector Machine: Mathematical
Steps for Solving the LSVM
1. Find the maximum-margin line.
2. Determine the support vectors.
3. Determine (x, y) for each support vector.
4. Here we will use vectors augmented with a 1 as a bias input, and for clarity we
will differentiate these with an over-tilde (S̃).
5. Determine the class each support vector belongs to: y = +1 for the positive
group and y = −1 for the negative group.
6. Find the αi for each support vector S̃i by applying the equation
α1 (S̃1 · S̃i) + α2 (S̃2 · S̃i) + ... = yi (+1 or −1, depending on the class)
for each support vector.
7. The hyperplane that discriminates the positive class from the negative class is
given by: W̃ = Σi αi S̃i
8. Our vectors are augmented with a bias.
9. Hence we can split the entries of W̃ into the weights w of the hyperplane and an offset b.
10. Therefore the separating hyperplane equation is y = w·x + b.
Fig. 4 (LSVM)
Example for LSVM: Find the support vector machine.
Solution:
Find the maximum-margin line.
Determine the support vectors.
Here we select 3 support vectors to start with. They are S1, S2 and S3.
Determine (x1, x2) for each support vector.
S1 = (2, 1), S2 = (2, −1), S3 = (4, 0)
Here we will use vectors augmented with a 1 as a bias input, and for clarity we will
differentiate these with an over-tilde.
S̃1 = (2, 1, 1), S̃2 = (2, −1, 1), S̃3 = (4, 0, 1)
Now we need to find 3 parameters α1, α2, and α3 based on the following 3 linear
equations (S1 and S2 belong to the negative class, S3 to the positive class):
α1 (S̃1 · S̃1) + α2 (S̃2 · S̃1) + α3 (S̃3 · S̃1) = −1
α1 (S̃1 · S̃2) + α2 (S̃2 · S̃2) + α3 (S̃3 · S̃2) = −1
α1 (S̃1 · S̃3) + α2 (S̃2 · S̃3) + α3 (S̃3 · S̃3) = +1
Let's substitute the values S̃1 = (2, 1, 1), S̃2 = (2, −1, 1), and S̃3 = (4, 0, 1)
in the above equations.
After simplification we get:
6α1 + 4α2 + 9α3 = −1
4α1 + 6α2 + 9α3 = −1
9α1 + 9α2 + 17α3 = +1
Solving the above 3 simultaneous equations we get:
α1 = α2 = −3.25 and α3 = 3.5
The hyperplane that discriminates the positive class from the negative class is given by:
W̃ = Σi αi S̃i
Substituting the values we get:
W̃ = −3.25 (2, 1, 1) − 3.25 (2, −1, 1) + 3.5 (4, 0, 1) = (1, 0, −3)
Therefore the separating hyperplane equation is y = w·x + b with w = (1, 0)
and offset b = −3.
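The simultaneous equations and the final weight vector can be checked with a short NumPy sketch:

```python
import numpy as np

# Augmented support vectors (a 1 appended as the bias input)
S = np.array([[2.0, 1.0, 1.0],
              [2.0, -1.0, 1.0],
              [4.0, 0.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0])  # S1, S2: negative class; S3: positive class

G = S @ S.T                      # Gram matrix of dot products S~_i . S~_j
alpha = np.linalg.solve(G, y)    # solves sum_j alpha_j (S~_j . S~_i) = y_i
w_tilde = alpha @ S              # W~ = sum_i alpha_i S~_i

print(alpha)    # alpha1 = alpha2 = -3.25, alpha3 = 3.5
print(w_tilde)  # (1, 0, -3)  ->  w = (1, 0), b = -3
```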
Example 2:
Factory "ABC" produces very precise, high-quality chip rings whose quality is
measured in terms of curvature and diameter. The result of quality control by experts is
given in the table below.
curvature diameter Quality control result
2.947814 6.626878 Passed
2.530388 7.785050 Passed
3.566991 5.651046 Passed
3.156983 5.467077 Passed
2.582346 4.457777 Not-passed
2.155826 6.222343 Not-passed
3.273418 3.520687 Not-passed
2.8100 5.456782 ?
The new chip rings have curvature 2.8100 and diameter 5.456782. Can you solve this
problem by employing SVM?
SOLUTION:
In the above example, the training data consists of two numerical features, curvature
and diameter. For each data point, we also have a predetermined group: Passed or Not-passed
the manual quality control. We are going to create a model to classify the training data.
[Figure: scatter plot of the training data, curvature (x-axis, 0-4) vs. diameter (y-axis, 0-10),
with the Passed class labeled y = +1 and the Not-passed class labeled y = −1.]
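One way to carry the example through, sketched with scikit-learn (an assumption, since the original stops at the plot; `SVC` with a very large C approximates a hard-margin linear SVM, and its fitted boundary may differ slightly from a hand-worked solution):

```python
from sklearn.svm import SVC

# Training data: (curvature, diameter); y = +1 for Passed, -1 for Not-passed
X = [
    [2.947814, 6.626878], [2.530388, 7.785050],
    [3.566991, 5.651046], [3.156983, 5.467077],  # Passed
    [2.582346, 4.457777], [2.155826, 6.222343],
    [3.273418, 3.520687],                        # Not-passed
]
y = [1, 1, 1, 1, -1, -1, -1]

clf = SVC(kernel="linear", C=1e6)  # large C ~ hard-margin SVM
clf.fit(X, y)

# The data is linearly separable, so the classifier fits the training set exactly
print(clf.score(X, y))  # 1.0

# Classify the new chip ring
print(clf.predict([[2.8100, 5.456782]]))
```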
5. Instance-Based Learning
Most of the classifiers discussed in the previous sections are eager learners, in which
the classification model is constructed up front and then used to classify a specific test
instance.
• In instance-based learning, the work of training is delayed until a test instance
must be classified. Such classifiers are also referred to as lazy learners.
• The simplest principle to describe instance based learning is as follows:
Similar instances have similar class labels.
Different Learning Methods
• Eager Learning
– Learning = acquiring an explicit structure of a classifier on the whole
training set;
– Classification = an instance gets a classification using the explicit structure
of the classifier.
• Instance-Based Learning (Lazy Learning)
– Learning = storing all training instances;
– Classification = an instance receives the class label of the nearest stored
instances.
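The lazy-learning principle above can be sketched as a minimal 1-nearest-neighbor classifier (the toy points are invented for illustration):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def nn_classify(train, query):
    """Lazy learner: 'training' is just storing (point, label) pairs;
    classification returns the label of the nearest stored instance."""
    point, label = min(train, key=lambda pl: dist(pl[0], query))
    return label

# Stored training instances
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((5.0, 5.0), "B")]

print(nn_classify(train, (1.2, 1.1)))  # nearest instance is (1.0, 1.0) -> "A"
print(nn_classify(train, (4.8, 4.9)))  # nearest instance is (5.0, 5.0) -> "B"
```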
5.1 Design Variations of Nearest Neighbor Classifiers
Unsupervised Mahalanobis Metric
The value of A is chosen to be the inverse of the d × d covariance matrix Σ of the data
set. The (i, j)th entry of the matrix Σ is the covariance between the dimensions i and j.
Therefore, the Mahalanobis distance between points X̄ and Ȳ is defined as follows:
Dist(X̄, Ȳ) = sqrt( (X̄ − Ȳ) Σ⁻¹ (X̄ − Ȳ)ᵀ )
The Mahalanobis metric adjusts well to the different scaling of the dimensions and the
redundancies across different features. Even when the data is uncorrelated, the
Mahalanobis metric is useful because it auto-scales for the naturally different ranges of
attributes describing different physical quantities.
How does the Mahalanobis metric work?
1. We need to find the centered matrix for each group.
2. Then, we calculate the covariance matrix of each group.
3. The next step after creating the covariance matrices for group 1 and group 2 is
to calculate the pooled covariance matrix.
4. Finally, we calculate the Mahalanobis distance by taking the square root of
the product of the difference between the means of G1 and G2, the inverse of the
pooled covariance matrix, and the transpose of the mean difference.
Example:
Group 1 (x1, y1): (2, 2), (2, 5), (6, 5), (7, 3), (4, 7), (6, 4), (5, 3), (4, 6), (2, 5), (1, 3)
Group 2 (x2, y2): (6, 5), (7, 4), (8, 7), (5, 6), (5, 4)
Means: X̄1 = 3.9, Ȳ1 = 4.3; X̄2 = 6.2, Ȳ2 = 5.2
Number of data points in group 1: M = 10
Number of data points in group 2: N = 5
Total number of data points: q = 15
1. We need to find the centered matrix for each group, which can be calculated using
the following formula (subtract the group mean from each coordinate):
Centered x1 = x1 − X̄1
Centered y1 = y1 − Ȳ1
The centered groups are:
Group 1: (−1.9, −2.3), (−1.9, 0.7), (2.1, 0.7), (3.1, −1.3), (0.1, 2.7), (2.1, −0.3), (1.1, −1.3), (0.1, 1.7), (−1.9, 0.7), (−2.9, −1.3)
Group 2: (−0.2, −0.2), (0.8, −1.2), (1.8, 1.8), (−1.2, 0.8), (−1.2, −1.2)
2. Then, we calculate the covariance matrix for groups 1 and 2, which is calculated
as follows:
C = (1/n) XᵀX
where n is the number of data points in the group and X is its n × 2 centered matrix.
For group 1, n = 10 and X is the centered group 1 matrix listed above.
The result will be:
Covariance of Group 1
x1 y1
x1 3.89 0.13
y1 0.13 2.21
Covariance of Group 2
x2 y2
x2 1.36 0.56
y2 0.56 1.36
3. The next step after creating the covariance matrices for group 1 and group 2 is to
calculate the pooled covariance matrix, the size-weighted average of the two group
covariances: C_pooled = (M·C1 + N·C2) / q
Pooled Covariance Matrix
x y
x 3.05 0.27
y 0.27 1.93
4. Finally, we calculate the Mahalanobis distance by taking the square root of the
product of the difference between the means of G1 and G2, the inverse of the
pooled covariance matrix, and the transpose of the mean difference.
Inverse pooled covariance matrix:
inv[3.05 0.27; 0.27 1.93] = 1/((3.05 × 1.93) − (0.27 × 0.27)) × [1.93 −0.27; −0.27 3.05]
= [0.332 −0.047; −0.047 0.526]
Mean difference (G1 − G2):
X̄1 − X̄2 = 3.9 − 6.2 = −2.3
Ȳ1 − Ȳ2 = 4.3 − 5.2 = −0.9
Mahalanobis distance:
D = sqrt( (−2.3, −0.9) × [0.332 −0.047; −0.047 0.526] × (−2.3, −0.9)ᵀ ) = 1.41
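The whole example can be verified with a short NumPy sketch (following the text's convention of dividing the covariance by n rather than n − 1):

```python
import numpy as np

g1 = np.array([[2, 2], [2, 5], [6, 5], [7, 3], [4, 7],
               [6, 4], [5, 3], [4, 6], [2, 5], [1, 3]], dtype=float)
g2 = np.array([[6, 5], [7, 4], [8, 7], [5, 6], [5, 4]], dtype=float)

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)  # (3.9, 4.3) and (6.2, 5.2)

# Group covariances from the centered matrices, using 1/n as in the text
c1 = (g1 - m1).T @ (g1 - m1) / len(g1)
c2 = (g2 - m2).T @ (g2 - m2) / len(g2)

# Pooled covariance: size-weighted average of the group covariances
pooled = (len(g1) * c1 + len(g2) * c2) / (len(g1) + len(g2))

d = m1 - m2                                 # mean difference (-2.3, -0.9)
dist = np.sqrt(d @ np.linalg.inv(pooled) @ d)
print(round(dist, 2))  # 1.41
```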
References:
• Charu C. Aggarwal, Data Mining: The Textbook, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA.
• https://www.autonlab.org/tutorials/mbl.html
• https://en.wikipedia.org/wiki/Logistic_regression
• http://people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html
• https://www.mathsisfun.com/algebra/matrix-multiplying.html