Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Supervised Modeling and AI Technologies (CO9401)
Unit –I: Supervised Learning Naïve Bayes and K-NN
Content
 Bayesian classifier, Naïve Bayes classifier cases, constraints of Naïve Bayes,
advantages of Naïve Bayes, comparison of Naïve Bayes with other
classifiers,
 K-nearest neighbor classifier, K-nearest neighbor classifier selection criteria,
Constraints of K-nearest neighbor, Advantages and Disadvantages of K-
nearest neighbor algorithms, controlling complexity of K-NN.
Supervised vs. Unsupervised Learning
 Supervised learning (classification): prediction of a class label, e.g., Yes or No
 Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
 New data is classified based on the training set.
 Unsupervised learning (clustering)
 The class labels of the training data are unknown.
 Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data.
Prediction Problems: Classification vs. Numeric Prediction
 Classification
 Predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying new
data.
 Numeric Prediction
 Models continuous-valued functions, i.e., predicts unknown or missing
values.
 Typical applications
 Credit/loan approval: Loan approved Yes or No
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test set samples that are correctly classified by the
model
 Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation (test) set
Step 1: Model Construction
Step 2: Model Usage
Issues: Data Preparation
 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data
Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Issues: Evaluating Classification Methods: Accuracy
 Accuracy measures how often the classifier predicts correctly.
 Accuracy is the ratio of the number of correct predictions to the total number of
predictions.
 For binary classification (only two class labels), Accuracy = (TP + TN) / (TP + TN + FP + FN).
Accuracy: Confusion Matrix
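For a binary problem, the confusion matrix tabulates true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), and accuracy is (TP + TN) divided by the total. A minimal sketch, with purely illustrative counts (not taken from any dataset in these notes):

```python
# Accuracy from a binary confusion matrix; the counts are illustrative only.
TP, FP = 45, 5    # predicted positive: correct / wrong
FN, TN = 10, 40   # predicted negative: wrong / correct

total = TP + FP + FN + TN
accuracy = (TP + TN) / total
print(f"Accuracy = ({TP} + {TN}) / {total} = {accuracy:.2f}")  # 0.85
```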
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
 Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
Bayes’ Theorem: Basics
 Total probability Theorem:
 Bayes’ Theorem:
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the
hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Total probability theorem: P(B) = Σi=1..M P(B|Ai) · P(Ai)
Bayes’ theorem: P(H|X) = P(X|H) · P(H) / P(X)
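A quick numeric sketch of how the posterior is assembled from the pieces above; the three input probabilities are made-up values chosen only to illustrate the formula:

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X); all numbers are illustrative.
p_h = 0.64           # prior P(H)
p_x_given_h = 0.04   # likelihood P(X|H)
p_x = 0.05           # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.3f}")   # 0.512
```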
Prediction Based on Bayes’ Theorem
 Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
P(H|X) = P(X|H) · P(H) / P(X)
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X|Ci) · P(Ci)
needs to be maximized
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
P(X|Ci) = Πk=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
 This greatly reduces the computation cost: only the class distribution needs to be counted
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by
|Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x−μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)
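A minimal sketch of both estimates: a categorical P(xk|Ci) as a within-class count ratio, and a continuous P(xk|Ci) from the Gaussian g(x, μ, σ). The tiny fruit dataset and its attributes are hypothetical, used only to make the formulas concrete:

```python
import math
from collections import Counter

# Hypothetical training tuples: (color, weight_kg, class)
data = [("red", 4.1, "apple"), ("red", 4.5, "apple"), ("green", 4.0, "apple"),
        ("green", 9.8, "melon"), ("green", 10.5, "melon")]

def categorical_prob(value, cls):
    """P(xk|Ci) for a categorical attribute: count of matching tuples in Ci / |Ci, D|."""
    in_class = [d for d in data if d[2] == cls]
    return sum(1 for d in in_class if d[0] == value) / len(in_class)

def gaussian_prob(x, cls):
    """P(xk|Ci) for a continuous attribute via g(x, mu, sigma) of the class."""
    vals = [d[1] for d in data if d[2] == cls]
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

priors = Counter(d[2] for d in data)
x = ("green", 4.2)  # new tuple (color, weight) to classify
scores = {c: (n / len(data)) * categorical_prob(x[0], c) * gaussian_prob(x[1], c)
          for c, n in priors.items()}
print(scores, "->", max(scores, key=scores.get))  # apple
```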
Naïve Bayes Classifier Example 1
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
Naïve Bayes Classifier Example 1 Solution
Age P(Y) P(N)
<=30 2/9=0.22 3/5=0.6
31…40 4/9=0.44 0/5=0
>40 3/9=0.33 2/5=0.4
 Prior Probability
 P(Buys Computer = Yes) = 9/14 = 0.642
 P(Buys Computer = No) = 5/14 = 0.357
 Posterior/ Conditional Probability
Income P(Y) P(N)
High 2/9=0.22 2/5=0.4
Medium 4/9=0.44 2/5=0.4
Low 3/9=0.33 1/5=0.2
Credit
Rating
P(Y) P(N)
Fair 6/9=0.67 2/5=0.4
Excellent 3/9=0.33 3/5=0.6
Student P(Y) P(N)
Yes 6/9=0.67 3/5=0.6
No 3/9=0.33 2/5=0.4
Naïve Bayes Classifier Example 1 Solution
Age P(Y) P(N)
<=30 2/9=0.22 3/5=0.6
31…40 4/9=0.44 0/5=0
>40 3/9=0.33 2/5=0.4
 Prior Probability
 P(Buys Computer = Yes) = 9/14 = 0.642
 P(Buys Computer = No) = 5/14 = 0.357
 Posterior/ Conditional Probability
Income P(Y) P(N)
High 2/9=0.22 2/5=0.4
Medium 4/9=0.44 2/5=0.4
Low 3/9=0.33 1/5=0.2
Credit
Rating
P(Y) P(N)
Fair 6/9=0.67 2/5=0.4
Excellent 3/9=0.33 3/5=0.6
Student P(Y) P(N)
Yes 6/9=0.67 3/5=0.6
No 3/9=0.33 2/5=0.4
 P(Yes) = P(<=30|Yes) · P(Medium|Yes) · P(Student=Yes|Yes) · P(Fair|Yes) · P(Yes)
= 0.22 × 0.44 × 0.67 × 0.67 × 0.642 = 0.028
 P(No) = P(<=30|No) · P(Medium|No) · P(Student=Yes|No) · P(Fair|No) · P(No)
= 0.6 × 0.4 × 0.6 × 0.4 × 0.357 = 0.021
As P(Yes) > P(No), X is classified as Buys Computer = Yes
Naïve Bayes Classifier Example 1 Solution
Age P(Y) P(N)
<=30 2/9=0.22 3/5=0.6
31…40 4/9=0.44 0/5=0
>40 3/9=0.33 2/5=0.4
 Data to be classified
 Age=31…40, Income= High, Student = No,
Credit Rating= Excellent, Buys Computer?
Income P(Y) P(N)
High 2/9=0.22 2/5=0.4
Medium 4/9=0.44 2/5=0.4
Low 3/9=0.33 1/5=0.2
Credit
Rating
P(Y) P(N)
Fair 6/9=0.67 2/5=0.4
Excellent 3/9=0.33 3/5=0.6
Student P(Y) P(N)
Yes 6/9=0.67 3/5=0.6
No 3/9=0.33 2/5=0.4
 P(Yes)=0.44*0.22*0.33*0.33*0.642=0.0067
 P(No)=0*0.4*0.4*0.6*0.357=0
As P(Yes) > P(No), Buys Computer = Yes. (P(31…40|No) = 0 is an instance of the zero-frequency problem discussed later; Laplace smoothing avoids such hard zeros.)
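A small sketch that re-enters the conditional-probability tables from the slides above and reproduces both classifications; it only checks the arithmetic, it does not re-estimate the tables from data:

```python
# cond[attr][value] = (P(value|Yes), P(value|No)), copied from the slide tables
cond = {
    "age":     {"<=30": (2/9, 3/5), "31..40": (4/9, 0/5), ">40": (3/9, 2/5)},
    "income":  {"high": (2/9, 2/5), "medium": (4/9, 2/5), "low": (3/9, 1/5)},
    "student": {"yes": (6/9, 3/5), "no": (3/9, 2/5)},
    "credit":  {"fair": (6/9, 2/5), "excellent": (3/9, 3/5)},
}
prior_yes, prior_no = 9/14, 5/14

def classify(x):
    p_yes, p_no = prior_yes, prior_no
    for attr, value in x.items():
        py, pn = cond[attr][value]
        p_yes, p_no = p_yes * py, p_no * pn
    return ("Yes" if p_yes > p_no else "No", round(p_yes, 4), round(p_no, 4))

# First tuple: age<=30, income=medium, student=yes, credit=fair -> Yes
print(classify({"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}))
# Second tuple: age=31..40, income=high, student=no, credit=excellent -> Yes (P(No) hits 0)
print(classify({"age": "31..40", "income": "high", "student": "no", "credit": "excellent"}))
```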
Naïve Bayes Classifier Example 2
Class:
C1: Play Tennis = ‘yes’
C2: Play Tennis = ‘no’
Data to be classified:
Outlook=Rainy,
Temp=Hot,
Humidity=High, Windy=
Strong, Play= ?
Day Outlook Temp Humidity Windy Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Naïve Bayes Classifier Example 2 Solution
Outlook P(Y) P(N)
Sunny 2/9=0.22 3/5=0.6
Overcast 4/9=0.44 0/5=0
Rainy 3/9=0.33 2/5=0.4
 Prior Probability
 P(Y) = 9/14 = 0.6428
 P(N) = 5/14 = 0.3571
 Posterior/ Conditional Probability
Temp P(Y) P(N)
Hot 2/9=0.22 2/5=0.4
Mild 4/9=0.44 2/5=0.4
Cool 3/9=0.33 1/5=0.2
Wind P(Y) P(N)
Weak 6/9=0.67 2/5=0.4
Strong 3/9=0.33 3/5=0.6
Humidity P(Y) P(N)
High 3/9=0.33 4/5=0.8
Normal 6/9=0.66 1/5=0.2
 P(Y)=0.333*0.222*0.33*0.33*0.6428=0.0052
 P(N)=0.4*0.4*0.8*0.6*0.3571=0.0274
As P(Y) < P(N), the prediction is Play = No (will not play)
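A sketch that computes the same prediction directly from the 14-day table above, deriving every prior and conditional probability by counting rather than typing the tables in by hand:

```python
from collections import Counter

# The 14 training days from the slide: (Outlook, Temp, Humidity, Windy, Play)
days = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rainy","Mild","High","Weak","Yes"),
    ("Rainy","Cool","Normal","Weak","Yes"),  ("Rainy","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rainy","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rainy","Mild","High","Strong","No"),
]

def naive_bayes(x):
    scores = {}
    class_counts = Counter(d[-1] for d in days)
    for cls, n_cls in class_counts.items():
        score = n_cls / len(days)                       # prior P(cls)
        for i, value in enumerate(x):                   # product of P(x_i | cls)
            score *= sum(1 for d in days if d[-1] == cls and d[i] == value) / n_cls
        scores[cls] = round(score, 4)
    return scores

scores = naive_bayes(("Rainy", "Hot", "High", "Strong"))
print(scores, "->", max(scores, key=scores.get))        # {'No': 0.0274, 'Yes': 0.0053} -> No
```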
Naïve Bayes Classifier Example 3
Given the training set below for a
classification problem with two
classes, “fraud” and “normal”. There
are two attributes A1 and A2, each
taking values 0 or 1. Into which class
does the Naïve Bayes classifier put
the instance (A1=1, A2=1)?
A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal
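Example 3 can be worked the same way: estimate the priors and the per-attribute conditionals by counting the ten rows above, then compare the two class scores for (A1=1, A2=1). A sketch of that calculation:

```python
# Training rows from the slide: (A1, A2, class)
rows = [(1,0,"fraud"), (1,1,"fraud"), (1,1,"fraud"), (1,0,"normal"), (1,1,"fraud"),
        (0,0,"normal"), (0,0,"normal"), (0,0,"normal"), (1,1,"normal"), (1,0,"normal")]

def score(cls, a1, a2):
    in_cls = [r for r in rows if r[2] == cls]
    prior = len(in_cls) / len(rows)
    p_a1 = sum(1 for r in in_cls if r[0] == a1) / len(in_cls)
    p_a2 = sum(1 for r in in_cls if r[1] == a2) / len(in_cls)
    return prior * p_a1 * p_a2

for cls in ("fraud", "normal"):
    print(cls, round(score(cls, 1, 1), 3))
# fraud: 0.4 * 4/4 * 3/4 = 0.3;  normal: 0.6 * 3/6 * 1/6 = 0.05  -> fraud
```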
Benefits of Naïve Bayes Classifier
 It is simple and easy to implement
 It requires relatively little training data compared with many other classifiers
 It handles both continuous and discrete data
 It is highly scalable with the number of predictors and data
points
 It is fast and can be used to make real-time predictions
 It is not sensitive to irrelevant features
Limitations of Naïve Bayes Classifier
 Naïve Bayes assumes that all predictors (features) are
independent, which rarely happens in real life. This limits the
applicability of the algorithm in real-world use cases.
 The algorithm faces the ‘zero-frequency problem’: it assigns zero
probability to a categorical value that appears in the test set but
was never seen for that class in the training set. A smoothing
technique (e.g., Laplace smoothing, sketched after this list) is used to overcome this issue.
 Its probability estimates can be poorly calibrated, so the predicted
class is more trustworthy than the probability values themselves.
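A minimal sketch of the smoothing fix for the zero-frequency problem: Laplace (add-one) smoothing adds 1 to every count and adds the number of distinct attribute values to the denominator, so no conditional probability is ever exactly zero. The counts used below are the Age = 31…40 case from Example 1:

```python
def smoothed_prob(count, class_total, n_values, alpha=1):
    """Laplace-smoothed estimate of P(value | class)."""
    return (count + alpha) / (class_total + alpha * n_values)

# Age = 31..40 in class buys_computer = No: observed 0 of 5 tuples, Age has 3 distinct values.
print(smoothed_prob(0, 5, 3))   # 0.125 instead of a hard 0
print(smoothed_prob(4, 9, 3))   # the Yes-class estimate barely moves: 0.4167 vs 4/9 = 0.4444
```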
Type of Distances used in Machine Learning algorithm
 Distance metrics are used to measure the distance between any two
data points.
 There are many distance metrics.
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
Type of Distances used in Machine Learning algorithm
Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
Let’s calculate the distance between {2, 3} and {3, 5}:
= √((3-2)² + (5-3)²)
= √(1² + 2²)
= √(1 + 4)
= √5 ≈ 2.24
Exercise: calculate the distance between {40, 20} and {20, 35}
Type of Distances used in Machine Learning algorithm
Manhattan Distance
The Manhattan distance is the sum of absolute differences: |x₁ − x₂| + |y₁ − y₂|
Let’s calculate the distance between {2, 3} and {3, 5}:
= |2 − 3| + |3 − 5|
= |-1| + |-2|
= 1 + 2
= 3
Exercise: calculate the distance between {40, 20} and {20, 35}
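A sketch of both metrics, checked against the worked {2, 3} → {3, 5} values and also evaluating the {40, 20} → {20, 35} exercise:

```python
import math

def euclidean(p, q):
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

print(euclidean((2, 3), (3, 5)), manhattan((2, 3), (3, 5)))          # 2.236..., 3
print(euclidean((40, 20), (20, 35)), manhattan((40, 20), (20, 35)))  # 25.0, 35
```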
The k-Nearest Neighbor Algorithm
 The k-nearest neighbors (KNN) algorithm is a non-parametric,
supervised learning classifier
 It uses proximity to make classifications or predictions about the grouping of an
individual data point
 It is among the most popular and simplest classification and regression methods used in
machine learning today
 It works for binary as well as multi-class classification problems
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
 Target function could be discrete- or real- valued
 For discrete-valued, k-NN returns the most common value among the k training
examples nearest to xq
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples
[Figure: a query point xq surrounded by + and − training examples; the cells around each training point form the 1-NN (Voronoi) decision surface]
The k-Nearest Neighbor Algorithm Steps
 Step #1 - Assign a value to K.
 Step #2 - Calculate the distance between the new data entry and all
other existing data entries. Arrange them in ascending order.
 Step #3 - Find the K nearest neighbors to the new entry based on the
calculated distances.
 Step #4 - Assign the new data entry to the majority class among those
nearest neighbors (a minimal sketch of these steps follows below).
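A minimal sketch of the four steps for classification, using Euclidean distance and a majority vote; the toy 2-D points and labels are hypothetical:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (feature_tuple, label); returns the majority label of the k nearest points."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))  # Step 2: sort by distance
    nearest = ranked[:k]                                                # Step 3: keep the K nearest
    return Counter(label for _, label in nearest).most_common(1)[0][0]  # Step 4: majority vote

# Hypothetical 2-D training points with two classes
train = [((1, 1), "A"), ((2, 1), "A"), ((2, 2), "A"), ((8, 8), "B"), ((9, 7), "B")]
print(knn_classify(train, (1.5, 1.2), k=3))   # Step 1: choose K=3 -> "A"
```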
 For the movie Barbie, given IMDb Rating = 7.4 and Duration = 114, predict the Genre.
Assume K=3 and use Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
The k-Nearest Neighbor Algorithm
IMDb Rating Duration Genre
8.0 ( Mission Impossible) 160 Action
6.2 (Gadar 2) 170 Action
7.2 (Rocky and Rani) 168 Comedy
8.2 ( OMG 2) 155 Comedy
 Step 1: Calculate the distances.
 Calculate the distance of new movie and each movie in dataset.
Distance to (8.0, 160) = √((7.4-8.0)² + (114-160)²) = √(0.36 + 2116) = 46.00
Distance to (6.2, 170) = √((7.4-6.2)² + (114-170)²) = √(1.44 + 3136) = 56.01
Distance to (7.2, 168) = √((7.4-7.2)² + (114-168)²) = √(0.04 + 2916) = 54.00
Distance to (8.2, 155) = √((7.4-8.2)² + (114-155)²) = √(0.64 + 1681) = 41.00
The k-Nearest Neighbor Algorithm
(Here (X₁, Y₁) is a movie from the table and (X₂, Y₂) = (7.4, 114) is the new movie.)
 Step 2: Select K Nearest Neighbor.
 For K=1, the shortest distance is 41.00 (OMG 2, Comedy)
 So, Barbie movie Genre is Comedy.
The k-Nearest Neighbor Algorithm
 Step 3: Majority Voting (Classification)
 For K=3, the three shortest distances are 41.00, 46.00, and 54.00
Comedy, Action, Comedy
 So, Barbie movie Genre is Comedy.
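The same steps applied to the movie table: a sketch that recomputes the four distances and takes the K=3 vote.

```python
import math
from collections import Counter

movies = [((8.0, 160), "Action"),   # Mission Impossible
          ((6.2, 170), "Action"),   # Gadar 2
          ((7.2, 168), "Comedy"),   # Rocky and Rani
          ((8.2, 155), "Comedy")]   # OMG 2

query = (7.4, 114)                  # Barbie: (IMDb rating, duration)
ranked = sorted(movies, key=lambda m: math.dist(m[0], query))
for point, genre in ranked:
    print(point, genre, round(math.dist(point, query), 2))   # 41.01, 46.0, 54.0, 56.01

votes = Counter(genre for _, genre in ranked[:3])
print("K=3 prediction:", votes.most_common(1)[0][0])          # Comedy
```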
 For the test tuple Brightness = 20, Saturation = 35, predict the Class.
Assume K=5 and use Euclidean distance.
BRIGHTNESS SATURATION CLASS
40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue
Euclidean distance :√(X₂-X₁)²+(Y₂-Y₁)²
The k-Nearest Neighbor Algorithm
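A sketch of this exercise with the same recipe: rank the seven points by Euclidean distance to (20, 35) and take the majority class among the 5 nearest.

```python
import math
from collections import Counter

points = [((40, 20), "Red"), ((50, 50), "Blue"), ((60, 90), "Blue"), ((10, 25), "Red"),
          ((70, 70), "Blue"), ((60, 10), "Red"), ((25, 80), "Blue")]

query = (20, 35)   # (Brightness, Saturation) of the test tuple
ranked = sorted(points, key=lambda p: math.dist(p[0], query))
nearest5 = ranked[:5]
for point, cls in nearest5:
    print(point, cls, round(math.dist(point, query), 2))
print("K=5 prediction:", Counter(cls for _, cls in nearest5).most_common(1)[0][0])
```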
Reference
 Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, ISBN: 9780123814791, 9780123814807.
 https://onlinecourses.nptel.ac.in/noc24_cs22
 https://medium.com/analytics-vidhya/type-of-distances-used-in-machine-
learning-algorithm-c873467140de
 https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-
classifiers-and-model-example/