1. Probabilistic Classification
Probabilistic classifiers construct a model that quantifies the relationship between the
feature variables and the target (class) variable as a probability.
There are many ways in which such modeling can be performed. Two of the most
popular models are as follows:
1. Bayes classifier (a generative classifier)
2. Logistic regression (a discriminative classifier)
1. Bayes classifier
The Bayes rule is used to model the probability of each value of the target variable for a
given set of feature variables.
It is assumed that the data points within a class are generated from a specific probability
distribution, such as:
• the Bernoulli distribution
• the multinomial distribution
A naive Bayes assumption of class-conditioned feature independence is often (but not
always) used to simplify the modeling.
2. Logistic regression
The target variable is assumed to be drawn from a Bernoulli distribution whose mean is
defined by a parameterized logit function on the feature variables.
Thus, the probability distribution of the class variable is a parameterized function of the
feature variables. This is in contrast to the Bayes model, which assumes a specific
generative model of the feature distribution of each class.
2. Naïve Bayes Classifier
• Naïve Bayes is a supervised learning classifier.
• Naïve Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naive) independence assumptions between
the features.
➢ A naïve Bayesian model is easy to build, with no complicated iterative
parameter estimation, which makes it useful for very large datasets.
➢ The naïve Bayes classifier performs surprisingly well, and it is widely used because
it often outperforms more sophisticated classification methods.
➢ It is based on frequency tables.
How does it work?
• Bayes' theorem provides a way of calculating the posterior probability P(c|x) from
P(c), P(x), and P(x|c).
• The naïve Bayes classifier assumes that the effect of the value of a predictor (x) on a
given class (c) is independent of the values of the other predictors;
this assumption is called class conditional independence.
• P(c|x): the posterior probability of the class (target) given the predictor (attribute).
• P(x|c): the likelihood, which is the probability of the predictor given the class.
• P(c): the prior probability of the class (before seeing any data).
• P(x): the prior probability of the predictor (the evidence).
Example of the Naïve Bayes Classifier
Id Outlook  Temp Humidity Windy Play Tennis
1  Rainy    Hot  High     False No
2  Rainy    Hot  High     True  No
3  Overcast Hot  High     False Yes
4  Sunny    Mid  High     False Yes
5  Sunny    Cool Normal   False Yes
6  Sunny    Cool Normal   True  No
7  Overcast Cool Normal   True  Yes
8  Rainy    Mid  High     False No
9  Rainy    Cool Normal   False Yes
10 Sunny    Mid  Normal   False Yes
11 Rainy    Mid  Normal   True  Yes
12 Overcast Mid  High     True  Yes
13 Overcast Hot  Normal   False Yes
14 Sunny    Mid  High     True  No
• Frequency Tables:

Table 1: Outlook (Play tennis)
         Yes  No
Sunny    3/9  2/5
Overcast 4/9  0/5
Rainy    2/9  3/5

Table 2: Temp (Play tennis)
       Yes  No
Hot    2/9  2/5
Mid    4/9  2/5
Cool   3/9  1/5

Table 3: Humidity (Play tennis)
       Yes  No
High   3/9  4/5
Normal 6/9  1/5

Table 4: Windy (Play tennis)
       Yes  No
False  6/9  2/5
True   3/9  3/5

• Class Probability: Play Tennis
P(Yes) = 9/14
P(No)  = 5/14

• Likelihood Tables (the rightmost column is the marginal probability of the predictor value):

Table 1: Outlook (Play tennis)
         Yes  No   P(x)
Sunny    3/9  2/5  5/14
Overcast 4/9  0/5  4/14
Rainy    2/9  3/5  5/14

Table 2: Temp (Play tennis)
       Yes  No   P(x)
Hot    2/9  2/5  4/14
Mid    4/9  2/5  6/14
Cool   3/9  1/5  4/14

Table 3: Humidity (Play tennis)
       Yes  No   P(x)
High   3/9  4/5  7/14
Normal 6/9  1/5  7/14

Table 4: Windy (Play tennis)
       Yes  No   P(x)
False  6/9  2/5  8/14
True   3/9  3/5  6/14

Say that we want to calculate the posterior probability of the class (Yes) given
(Sunny), according to the previous equation P(C|X):
P(C|X) = P(X|C) * P(C) / P(X)
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
= (3/9) * (9/14) / (5/14)
= 0.33 * 0.64 / 0.36
= 0.60
Now let's assume we observe the following day:
Outlook Temp Humidity Windy Play Tennis
Rainy   Mid  Normal   True  ?
Likelihood of Yes =
P(Outlook=Rainy|Yes) * P(Temp=Mid|Yes) * P(Humidity=Normal|Yes) * P(Windy=True|Yes) * P(Yes)
= 2/9 * 4/9 * 6/9 * 3/9 * 9/14
= 0.014109347
Likelihood of No =
P(Outlook=Rainy|No) * P(Temp=Mid|No) * P(Humidity=Normal|No) * P(Windy=True|No) * P(No)
= 3/5 * 2/5 * 1/5 * 3/5 * 5/14
= 0.010285714
Normalizing (dividing by the evidence):
P(Yes) = 0.014109347 / (0.014109347 + 0.010285714) = 0.578368999
P(No)  = 0.010285714 / (0.014109347 + 0.010285714) = 0.421631001
P(Yes) > P(No)
Outlook Temp Humidity Windy Play Tennis
Rainy   Mid  Normal   True  Yes
Since the evidence is constant and scales both posteriors equally, it does not
affect the classification and can be ignored.
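The full calculation above can be reproduced in a few lines of Python (a minimal sketch; the class-conditional probabilities are read directly off the frequency tables rather than estimated from raw data):

```python
from fractions import Fraction as F

# Class priors from the Play Tennis data (9 Yes, 5 No out of 14 days)
prior = {"Yes": F(9, 14), "No": F(5, 14)}

# Class-conditional probabilities read off the frequency tables
likelihood = {
    "Yes": {"Outlook=Rainy": F(2, 9), "Temp=Mid": F(4, 9),
            "Humidity=Normal": F(6, 9), "Windy=True": F(3, 9)},
    "No":  {"Outlook=Rainy": F(3, 5), "Temp=Mid": F(2, 5),
            "Humidity=Normal": F(1, 5), "Windy=True": F(3, 5)},
}

day = ["Outlook=Rainy", "Temp=Mid", "Humidity=Normal", "Windy=True"]

# Unnormalized posteriors: product of the likelihoods times the prior
score = {c: prior[c] for c in prior}
for c in score:
    for feature in day:
        score[c] *= likelihood[c][feature]

total = sum(score.values())   # the evidence P(x); cancels in the comparison
posterior = {c: float(s / total) for c, s in score.items()}
print(posterior)                           # {'Yes': 0.578..., 'No': 0.421...}
print(max(posterior, key=posterior.get))   # Yes
```

Using exact fractions avoids any rounding until the final normalization, so the result matches the hand calculation digit for digit.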
3. Logistic Regression
Logistic regression is a regression model where the dependent variable (DV) is
categorical. The output can take only two values, "0" and "1" (binary classification),
which represent outcomes such as pass/fail, win/lose, alive/dead or healthy/sick.
Idea 1: Let p(x) be a linear function.
• We are estimating a probability, which must be between 0 and 1.
• Linear functions are unbounded, so this approach doesn't work.
Better idea:
• Set the log-odds to a linear function:
log odds = logit(p) = ln(p / (1 − p)) = β0 + β1x
Solving for p:
p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))
• This is called the logistic function, and it takes values in [0, 1].
• β0 and β1 are estimated from the data; β1 is the change in the log-odds
associated with a unit change in the input feature.
➢ Logit Function:
Logistic regression estimates the parameters of a logit model. The logit function is
used to estimate probabilities of class membership instead of constructing a squared-error
objective. The core of logistic regression is the sigmoid function:
σ(z) = 1 / (1 + e^(−z))
The sigmoid function wraps the linear function y = mx + b (equivalently, y = β0 + β1x) to
force the output to be between 0 and 1. The output can, therefore, be interpreted as a
probability.
(Figure: logistic function vs. linear function)
To minimize misclassification rates, we predict:
• y = 1 when p(x) ≥ 0.5 and y = 0 when p(x) < 0.5
• So y = 1 when β0 + β1x is non-negative, and y = 0 otherwise
• Logistic regression gives us a linear classifier where the decision boundary
separating the two classes is the solution of β0 + β1x = 0
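This decision rule is easy to check numerically (a minimal sketch; the coefficients β0 = −4 and β1 = 2 are made-up values for illustration, not fitted to any data):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, b0, b1):
    """Predict class 1 iff p(x) >= 0.5, i.e. iff b0 + b1*x >= 0."""
    p = sigmoid(b0 + b1 * x)
    return 1 if p >= 0.5 else 0

b0, b1 = -4.0, 2.0      # illustrative (not fitted) coefficients
boundary = -b0 / b1     # decision boundary: b0 + b1*x = 0  ->  x = 2
print(boundary)                                     # 2.0
print(predict(1.9, b0, b1), predict(2.1, b0, b1))   # 0 1
```

Points on either side of the boundary x = 2 get opposite labels, which is exactly the linear classifier described above.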
10.5.2.1 Training a Logistic Regression Classifier
The maximum likelihood approach is used to estimate the best-fitting parameters of the
logistic regression model.
In other words, the parameters β0 and β1 are estimated using a technique called
maximum likelihood estimation.
Logistic regression is similar to classical least-squares linear regression.
The difference is that the logit function is used to estimate probabilities of class
membership instead of constructing a squared-error objective. Consequently, instead
of the least-squares optimization in linear regression, a maximum likelihood
optimization model is used for logistic regression.
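The notes do not show the optimization itself; one simple way to find the maximum likelihood estimates is gradient ascent on the log-likelihood. The sketch below uses a tiny made-up 1-D dataset (this is an illustration of the idea, not the textbook's procedure):

```python
import math

# Tiny made-up 1-D dataset: class 1 tends to have larger x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0,   0,   0,   1,   1,   1]

b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    # Gradient of the log-likelihood: sum_i (y_i - p_i) * [1, x_i]
    g0 = sum(y - 1 / (1 + math.exp(-(b0 + b1 * x))) for x, y in zip(xs, ys))
    g1 = sum((y - 1 / (1 + math.exp(-(b0 + b1 * x)))) * x for x, y in zip(xs, ys))
    b0 += lr * g0
    b1 += lr * g1

p = lambda x: 1 / (1 + math.exp(-(b0 + b1 * x)))
print(p(1.0) < 0.5, p(6.0) > 0.5)   # True True
```

Each step moves the coefficients in the direction that makes the observed labels more likely, which is exactly what maximizing the likelihood means.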
Example: Suppose that medical researchers are interested in exploring the relationship
between patient age (x) and the presence (1) or absence (0) of a particular disease (y).
The data collected from 20 patients are shown
Let's examine the results of the logistic regression of disease on age, shown in Table 4.2. The coefficients, that
is, the maximum likelihood estimates of the unknown parameters β0 and β1, are given as:
β0 = −4.372
β1 = 0.06696
N Age Y
1 25 0
2 29 0
3 30 0
4 31 0
5 32 0
6 41 0
7 41 0
8 42 0
9 44 1
10 49 1
11 50 0
12 59 1
13 60 0
14 62 0
15 68 1
16 72 0
17 79 1
18 80 0
19 81 1
20 84 1
sum 1059 7
These estimates may then be used to estimate the probability that the disease
is present in a particular patient given the patient's age. For example, for a 50-
year-old patient, we have
p(50) = e^(−4.372 + 0.06696·50) / (1 + e^(−4.372 + 0.06696·50)) = e^(−1.024) / (1 + e^(−1.024)) ≈ 0.26
Thus, the estimated probability that a 50-year-old patient has the disease is
26%, and the estimated probability that the disease is not present is 100% −
26% = 74%. On the other hand, for a 72-year-old patient, we have
p(72) = e^(−4.372 + 0.06696·72) / (1 + e^(−4.372 + 0.06696·72)) = e^(0.449) / (1 + e^(0.449)) ≈ 0.61
The estimated probability that a 72-year-old patient has the disease is 61%, and
the estimated probability that the disease is not present is 39%.
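These figures can be verified directly from the fitted coefficients (a quick check of the example's arithmetic):

```python
import math

b0, b1 = -4.372, 0.06696   # maximum likelihood estimates from the example

def p_disease(age):
    """Estimated probability that the disease is present at a given age."""
    z = b0 + b1 * age
    return math.exp(z) / (1 + math.exp(z))

print(round(p_disease(50), 2))  # 0.26
print(round(p_disease(72), 2))  # 0.61
```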
4. Linear Support Vector Machine in Mathematical Steps
Steps for solving the LSVM:
1. Find the maximum margin linear classifier.
2. Determine the support vectors.
3. Determine (x, y) for each support vector.
4. Here we will use vectors augmented with a 1 as a bias input, and for clarity we
will differentiate these with an over-tilde.
5. Determine the class y to which each support vector belongs: y = +1 for the
positive group and y = −1 for the negative group.
6. Find the αi by solving the system of equations
α1 s̃1·s̃i + α2 s̃2·s̃i + α3 s̃3·s̃i + … = yi (−1 or +1, depending on the class y)
with one equation for each support vector s̃i.
7. The hyperplane that discriminates the positive class from the negative class is
given by: w̃ = Σi αi s̃i
8. Our vectors are augmented with a bias.
9. Hence we can equate the last entry in w̃ with the hyperplane offset b.
10. Therefore the separating hyperplane equation is y = w·x + b.
Fig.4 (LSVM)
Example for LSVM: find the linear SVM separating hyperplane.
Solution:
➢ Find the maximum margin hyperplane.
➢ Determine the support vectors. Here we select 3 support vectors to start with: s1, s2, and s3.
➢ Determine (x1, x2) for each support vector.
s1 = (2, 1), s2 = (2, −1), s3 = (4, 0)
➢ Here we will use vectors augmented with a 1 as a bias input, and for clarity we will
differentiate these with an over-tilde:
s̃1 = (2, 1, 1), s̃2 = (2, −1, 1), s̃3 = (4, 0, 1)
➢ Now we need to find 3 parameters α1, α2, and α3 based on the following 3 linear
equations:
α1 s̃1·s̃1 + α2 s̃2·s̃1 + α3 s̃3·s̃1 = −1 (s1 belongs to the negative class)
α1 s̃1·s̃2 + α2 s̃2·s̃2 + α3 s̃3·s̃2 = −1 (s2 belongs to the negative class)
α1 s̃1·s̃3 + α2 s̃2·s̃3 + α3 s̃3·s̃3 = +1 (s3 belongs to the positive class)
Let's substitute the values s̃1 = (2, 1, 1), s̃2 = (2, −1, 1), and s̃3 = (4, 0, 1) into
the above equations. After simplification we get:
6α1 + 4α2 + 9α3 = −1
4α1 + 6α2 + 9α3 = −1
9α1 + 9α2 + 17α3 = +1
Solving these 3 simultaneous equations we get:
α1 = α2 = −3.25 and α3 = 3.5
The hyperplane that discriminates the positive class from the negative class is given by:
w̃ = Σi αi s̃i
Substituting the values we get:
w̃ = −3.25·(2, 1, 1) − 3.25·(2, −1, 1) + 3.5·(4, 0, 1) = (1, 0, −3)
Therefore the separating hyperplane equation is y = w·x + b, with w = (1, 0)
and offset b = −3.
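The algebra above can be verified numerically with NumPy (a sketch that solves the 3 × 3 linear system for the αi and then forms w̃):

```python
import numpy as np

# Augmented support vectors (last component is the bias input 1)
S = np.array([[2.0,  1.0, 1.0],   # s1, class y = -1
              [2.0, -1.0, 1.0],   # s2, class y = -1
              [4.0,  0.0, 1.0]])  # s3, class y = +1
y = np.array([-1.0, -1.0, 1.0])

# System of equations: sum_j alpha_j (s~_j . s~_i) = y_i
G = S @ S.T                  # Gram matrix of pairwise dot products
alpha = np.linalg.solve(G, y)
w_tilde = alpha @ S          # w~ = sum_i alpha_i s~_i

print(alpha)      # approximately [-3.25 -3.25  3.5 ]
print(w_tilde)    # approximately [ 1.  0. -3.]  ->  w = (1, 0), b = -3
```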
Example 2:
Factory "ABC" produces very precise, high-quality chip rings whose quality is
measured in terms of curvature and diameter. The result of quality control by experts is
given in the table below:
curvature diameter Quality control result
2.947814 6.626878 Passed
2.530388 7.785050 Passed
3.566991 5.651046 Passed
3.156983 5.467077 Passed
2.582346 4.457777 Not-passed
2.155826 6.222343 Not-passed
3.273418 3.520687 Not-passed
2.8100 5.456782 ?
The new chip ring has curvature 2.8100 and diameter 5.456782. Can you solve this
problem by employing SVM?
SOLUTION:
In the above example, the training data consist of two numerical features, curvature
and diameter. For each data point, we also have a predetermined group: Passed or Not-passed
the manual quality control. We are going to build a model to classify the training data.
(Figure: scatter plot of diameter vs. curvature for the training data, with the two classes labeled y = +1 and y = −1.)
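The notes do not show the remaining steps of this example. As a sketch (assuming scikit-learn is available, which the notes themselves do not use), a linear SVM can be fitted to the seven training points to classify the new chip ring:

```python
from sklearn import svm

X = [[2.947814, 6.626878], [2.530388, 7.785050],
     [3.566991, 5.651046], [3.156983, 5.467077],   # Passed
     [2.582346, 4.457777], [2.155826, 6.222343],
     [3.273418, 3.520687]]                          # Not-passed
y = ["Passed"] * 4 + ["Not-passed"] * 3

clf = svm.SVC(kernel="linear", C=1000)   # large C approximates a hard margin
clf.fit(X, y)
label = clf.predict([[2.8100, 5.456782]])[0]
print(label)
```

The new point lies close to the decision boundary, so the prediction depends on the exact maximum margin hyperplane the solver finds; working the dual by hand, as in the previous example, would confirm the result.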
5. Instance-Based Learning
Most of the classifiers discussed in the previous sections are eager learners, in which
the classification model is constructed up front and then used to classify a specific test
instance.
• In instance-based learning, the training is delayed until the last step of
classification. Such classifiers are also referred to as lazy learners.
• The simplest principle describing instance-based learning is as follows:
similar instances have similar class labels.
❑ Different Learning Methods
• Eager learning
– Learning = acquiring an explicit structure of a classifier from the whole
training set;
– Classification = an instance gets a classification using the explicit structure
of the classifier.
• Instance-based learning (lazy learning)
– Learning = storing all training instances;
– Classification = an instance gets a classification equal to the classification
of the nearest instances to it.
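The lazy-learning idea above can be captured in a few lines (a minimal 1-nearest-neighbor sketch with Euclidean distance; the toy training points are made up for illustration):

```python
import math

# "Learning" = just storing the training instances
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]

def classify(query):
    """Classification = the label of the nearest stored instance."""
    nearest = min(train, key=lambda item: math.dist(item[0], query))
    return nearest[1]

print(classify((1.2, 1.4)))  # A
print(classify((5.5, 5.0)))  # B
```

All the work happens at query time, which is exactly what distinguishes a lazy learner from an eager one.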
5.1 Design Variations of Nearest Neighbor Classifiers
Unsupervised Mahalanobis Metric
The value of A is chosen to be the inverse of the d × d covariance matrix Σ of the data
set. The (i, j)th entry of the matrix Σ is the covariance between the dimensions i and j.
Therefore, the Mahalanobis distance between points x̄ and ȳ is defined as follows:
Dist(x̄, ȳ) = sqrt( (x̄ − ȳ) Σ⁻¹ (x̄ − ȳ)ᵀ )
The Mahalanobis metric adjusts well to the different scaling of the dimensions and the
redundancies across different features. Even when the data is uncorrelated, the
Mahalanobis metric is useful because it auto-scales for the naturally different ranges of
attributes describing different physical quantities.
How does the Mahalanobis metric work?
1. We need to find the centered matrix for each group.
2. Then, we calculate the covariance matrix of each group from the centered data.
3. The next step after creating the covariance matrices for group 1 and group 2 is
to calculate the pooled covariance matrix.
4. Finally, we calculate the Mahalanobis distance by taking the square root of the
product of the mean difference between G1 and G2, the inverse of the pooled
covariance matrix, and the transpose of the mean difference.
Example:
Group 1 (x1, y1): (2, 2), (2, 5), (6, 5), (7, 3), (4, 7), (6, 4), (5, 3), (4, 6), (2, 5), (1, 3)
Group 2 (x2, y2): (6, 5), (7, 4), (8, 7), (5, 6), (5, 4)
Means: x̄1 = 3.9, ȳ1 = 4.3, x̄2 = 6.2, ȳ2 = 5.2
Total data of group 1: M = 10
Total data of group 2: N = 5
Total data: q = 15
1. We need to find the centered matrix for each group, which can be calculated by
subtracting the group mean from each value:
centered x1 = x1 − x̄1, centered y1 = y1 − ȳ1 (and likewise for group 2)
The centered groups are:
Group 1 (x1, y1): (−1.9, −2.3), (−1.9, 0.7), (2.1, 0.7), (3.1, −1.3), (0.1, 2.7), (2.1, −0.3), (1.1, −1.3), (0.1, 1.7), (−1.9, 0.7), (−2.9, −1.3)
Group 2 (x2, y2): (−0.2, −0.2), (0.8, −1.2), (1.8, 1.8), (−1.2, 0.8), (−1.2, −1.2)
2. Then, we calculate the covariance matrix for group 1 and group 2, which is
calculated as follows:
C = (1/n) XᵀX
where X is the n × 2 centered data matrix of the group and n is the number of data
points. For group 1, for example, this multiplies the transpose of the 10 × 2 centered
matrix by the centered matrix itself and divides by n = 10.
The result will be:
Covariance of Group 1
x1 y1
x1 3.89 0.13
y1 0.13 2.21
Covariance of Group 2
x2 y2
x2 1.36 0.56
y2 0.56 1.36
3. The next step after creating the covariance matrices for group 1 and group 2 is to
calculate the pooled covariance matrix, weighted by the group sizes:
C_pooled = (M·C1 + N·C2) / q = (10·C1 + 5·C2) / 15
Pooled Covariance Matrix
x y
x 3.05 0.27
y 0.27 1.93
4. Finally, we calculate the Mahalanobis distance by taking the square root of the
product of the mean difference between G1 and G2, the inverse of the pooled
covariance matrix, and the transpose of the mean difference.
Inverse pooled covariance matrix:
inv [3.05 0.27; 0.27 1.93] = 1 / ((3.05 × 1.93) − (0.27 × 0.27)) × [1.93 −0.27; −0.27 3.05]
= [0.332 −0.047; −0.047 0.526]
x Y
x 0.332 -0.047
y -0.047 0.526
Mean difference (G1 − G2):
x̄1 − x̄2 = 3.9 − 6.2 = −2.3
ȳ1 − ȳ2 = 4.3 − 5.2 = −0.9
Mahalanobis distance:
= sqrt( [−2.3 −0.9] × [0.332 −0.047; −0.047 0.526] × [−2.3 −0.9]ᵀ )
= 1.41
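All four steps of this example can be checked with NumPy (a sketch reproducing the numbers above):

```python
import numpy as np

g1 = np.array([[2, 2], [2, 5], [6, 5], [7, 3], [4, 7],
               [6, 4], [5, 3], [4, 6], [2, 5], [1, 3]], dtype=float)
g2 = np.array([[6, 5], [7, 4], [8, 7], [5, 6], [5, 4]], dtype=float)

# Steps 1-2: center each group and form (1/n) X^T X
c1 = (g1 - g1.mean(axis=0)).T @ (g1 - g1.mean(axis=0)) / len(g1)
c2 = (g2 - g2.mean(axis=0)).T @ (g2 - g2.mean(axis=0)) / len(g2)

# Step 3: pooled covariance, weighted by group sizes
pooled = (len(g1) * c1 + len(g2) * c2) / (len(g1) + len(g2))

# Step 4: Mahalanobis distance between the group means
d = g1.mean(axis=0) - g2.mean(axis=0)            # (-2.3, -0.9)
dist = float(np.sqrt(d @ np.linalg.inv(pooled) @ d))
print(round(dist, 2))   # 1.41
```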
References:
• Charu C. Aggarwal, Data Mining: The Textbook. IBM T.J. Watson Research Center, Yorktown Heights, New York, USA.
• https://www.autonlab.org/tutorials/mbl.html
• https://en.wikipedia.org/wiki/Logistic_regression
• http://people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html
• https://www.mathsisfun.com/algebra/matrix-multiplying.html