Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Supervised Modeling and AI Technologies (CO9401)
Unit –I: Supervised Learning Naïve Bayes and K-NN
Content
 Bayesian classifier, Naïve Bayes classifier cases, constraints of Naïve Bayes,
advantages of Naïve Bayes, comparison of Naïve Bayes with other
classifiers,
 K-nearest neighbor classifier, K-nearest neighbor classifier selection criteria,
Constraints of K-nearest neighbor, Advantages and Disadvantages of K-
nearest neighbor algorithms, controlling complexity of K-NN.
Supervised vs. Unsupervised Learning
 Supervised learning (classification): prediction of a class label, e.g., Yes or No
 Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
 New data is classified based on the training set.
 Unsupervised learning (clustering)
 The class labels of the training data are unknown.
 Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data.
Prediction Problems: Classification vs. Numeric Prediction
 Classification
 Predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying new
data.
 Numeric Prediction
 Models continuous-valued functions, i.e., predicts unknown or missing
values.
 Typical applications
 Credit/loan approval: Loan approved Yes or No
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test set samples that are correctly classified by the
model
 Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation (test) set
Step 1: Model Construction
Step 2: Model Usage
Issues: Data Preparation
 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data
Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Issues: Evaluating Classification Methods: Accuracy
 Accuracy measures how often the classifier predicts correctly.
 Accuracy is the ratio of the number of correct predictions to the total number of
predictions.
 For binary classification (only two class labels), Accuracy = (TP + TN) / (TP + TN + FP + FN).
Accuracy: Confusion Matrix
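For a binary problem, the confusion matrix tabulates true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), and accuracy is (TP + TN) divided by the total. A minimal sketch, with purely illustrative counts (not taken from any dataset in these notes):

```python
# Accuracy from a binary confusion matrix; the counts are illustrative only.
TP, FP = 45, 5    # predicted positive: correct / wrong
FN, TN = 10, 40   # predicted negative: wrong / correct

total = TP + FP + FN + TN
accuracy = (TP + TN) / total
print(f"Accuracy = ({TP} + {TN}) / {total} = {accuracy:.2f}")  # 0.85
```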
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
 Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
Bayes’ Theorem: Basics
 Total probability Theorem:
 Bayes’ Theorem:
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the
hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Total probability theorem: P(B) = Σi=1..M P(B|Ai) · P(Ai)
Bayes’ theorem: P(H|X) = P(X|H) · P(H) / P(X)
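A quick numeric sketch of how the posterior is assembled from the pieces above; the three input probabilities are made-up values chosen only to illustrate the formula:

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X); all numbers are illustrative.
p_h = 0.64           # prior P(H)
p_x_given_h = 0.04   # likelihood P(X|H)
p_x = 0.05           # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.3f}")   # 0.512
```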
Prediction Based on Bayes’ Theorem
 Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
P(H|X) = P(X|H) · P(H) / P(X)
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X|Ci) · P(Ci)
needs to be maximized
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
P(X|Ci) = Πk=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
 This greatly reduces the computation cost: only the class distribution needs to be counted
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by
|Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x−μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)
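A minimal sketch of both estimates: a categorical P(xk|Ci) as a within-class count ratio, and a continuous P(xk|Ci) from the Gaussian g(x, μ, σ). The tiny fruit dataset and its attributes are hypothetical, used only to make the formulas concrete:

```python
import math
from collections import Counter

# Hypothetical training tuples: (color, weight_kg, class)
data = [("red", 4.1, "apple"), ("red", 4.5, "apple"), ("green", 4.0, "apple"),
        ("green", 9.8, "melon"), ("green", 10.5, "melon")]

def categorical_prob(value, cls):
    """P(xk|Ci) for a categorical attribute: count of matching tuples in Ci / |Ci, D|."""
    in_class = [d for d in data if d[2] == cls]
    return sum(1 for d in in_class if d[0] == value) / len(in_class)

def gaussian_prob(x, cls):
    """P(xk|Ci) for a continuous attribute via g(x, mu, sigma) of the class."""
    vals = [d[1] for d in data if d[2] == cls]
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

priors = Counter(d[2] for d in data)
x = ("green", 4.2)  # new tuple (color, weight) to classify
scores = {c: (n / len(data)) * categorical_prob(x[0], c) * gaussian_prob(x[1], c)
          for c, n in priors.items()}
print(scores, "->", max(scores, key=scores.get))  # apple
```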
Naïve Bayes Classifier Example 1
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
Naïve Bayes Classifier Example 1 Solution
Age P(Y) P(N)
<=30 2/9=0.22 3/5=0.6
31…40 4/9=0.44 0/5=0
>40 3/9=0.33 2/5=0.4
 Prior Probability
 P(Buys Computer = Yes) = 9/14 = 0.642
 P(Buys Computer = No) = 5/14 = 0.357
 Posterior/ Conditional Probability
Income P(Y) P(N)
High 2/9=0.22 2/5=0.4
Medium 4/9=0.44 2/5=0.4
Low 3/9=0.33 1/5=0.2
Credit
Rating
P(Y) P(N)
Fair 6/9=0.67 2/5=0.4
Excellent 3/9=0.33 3/5=0.6
Student P(Y) P(N)
Yes 6/9=0.67 3/5=0.6
No 3/9=0.33 2/5=0.4
Naïve Bayes Classifier Example 1 Solution
Age P(Y) P(N)
<=30 2/9=0.22 3/5=0.6
31…40 4/9=0.44 0/5=0
>40 3/9=0.33 2/5=0.4
 Prior Probability
 P(Buys Computer = Yes) = 9/14 = 0.642
 P(Buys Computer = No) = 5/14 = 0.357
 Posterior/ Conditional Probability
Income P(Y) P(N)
High 2/9=0.22 2/5=0.4
Medium 4/9=0.44 2/5=0.4
Low 3/9=0.33 1/5=0.2
Credit
Rating
P(Y) P(N)
Fair 6/9=0.67 2/5=0.4
Excellent 3/9=0.33 3/5=0.6
Student P(Y) P(N)
Yes 6/9=0.67 3/5=0.6
No 3/9=0.33 2/5=0.4
 P(Yes) = P(<=30|Yes) · P(Medium|Yes) · P(Student=Yes|Yes) · P(Fair|Yes) · P(Yes)
= 0.22 × 0.44 × 0.67 × 0.67 × 0.642 = 0.028
 P(No) = P(<=30|No) · P(Medium|No) · P(Student=Yes|No) · P(Fair|No) · P(No)
= 0.6 × 0.4 × 0.6 × 0.4 × 0.357 = 0.021
As P(Yes) > P(No), X is classified as Buys Computer = Yes
Naïve Bayes Classifier Example 1 Solution
Age P(Y) P(N)
<=30 2/9=0.22 3/5=0.6
31…40 4/9=0.44 0/5=0
>40 3/9=0.33 2/5=0.4
 Data to be classified
 Age=31…40, Income= High, Student = No,
Credit Rating= Excellent, Buys Computer?
Income P(Y) P(N)
High 2/9=0.22 2/5=0.4
Medium 4/9=0.44 2/5=0.4
Low 3/9=0.33 1/5=0.2
Credit
Rating
P(Y) P(N)
Fair 6/9=0.67 2/5=0.4
Excellent 3/9=0.33 3/5=0.6
Student P(Y) P(N)
Yes 6/9=0.67 3/5=0.6
No 3/9=0.33 2/5=0.4
 P(Yes)=0.44*0.22*0.33*0.33*0.642=0.0067
 P(No)=0*0.4*0.4*0.6*0.357=0
As P(Yes) > P(No), Buys Computer = Yes. (P(31…40|No) = 0 is an instance of the zero-frequency problem discussed later; Laplace smoothing avoids such hard zeros.)
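A small sketch that re-enters the conditional-probability tables from the slides above and reproduces both classifications; it only checks the arithmetic, it does not re-estimate the tables from data:

```python
# cond[attr][value] = (P(value|Yes), P(value|No)), copied from the slide tables
cond = {
    "age":     {"<=30": (2/9, 3/5), "31..40": (4/9, 0/5), ">40": (3/9, 2/5)},
    "income":  {"high": (2/9, 2/5), "medium": (4/9, 2/5), "low": (3/9, 1/5)},
    "student": {"yes": (6/9, 3/5), "no": (3/9, 2/5)},
    "credit":  {"fair": (6/9, 2/5), "excellent": (3/9, 3/5)},
}
prior_yes, prior_no = 9/14, 5/14

def classify(x):
    p_yes, p_no = prior_yes, prior_no
    for attr, value in x.items():
        py, pn = cond[attr][value]
        p_yes, p_no = p_yes * py, p_no * pn
    return ("Yes" if p_yes > p_no else "No", round(p_yes, 4), round(p_no, 4))

# First tuple: age<=30, income=medium, student=yes, credit=fair -> Yes
print(classify({"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}))
# Second tuple: age=31..40, income=high, student=no, credit=excellent -> Yes (P(No) hits 0)
print(classify({"age": "31..40", "income": "high", "student": "no", "credit": "excellent"}))
```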
Naïve Bayes Classifier Example 2
Class:
C1: Play Tennis = ‘yes’
C2: Play Tennis = ‘no’
Data to be classified:
Outlook=Rainy,
Temp=Hot,
Humidity=High, Windy=
Strong, Play= ?
Day Outlook Temp Humidity Windy Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Naïve Bayes Classifier Example 2 Solution
Outlook P(Y) P(N)
Sunny 2/9=0.22 3/5=0.6
Overcast 4/9=0.44 0/5=0
Rainy 3/9=0.33 2/5=0.4
 Prior Probability
 P(Y) = 9/14 = 0.6428
 P(N) = 5/14 = 0.3571
 Posterior/ Conditional Probability
Temp P(Y) P(N)
Hot 2/9=0.22 2/5=0.4
Mild 4/9=0.44 2/5=0.4
Cool 3/9=0.33 1/5=0.2
Wind P(Y) P(N)
Weak 6/9=0.67 2/5=0.4
Strong 3/9=0.33 3/5=0.6
Humidity P(Y) P(N)
High 3/9=0.33 4/5=0.8
Normal 6/9=0.66 1/5=0.2
 P(Y)=0.333*0.222*0.33*0.33*0.6428=0.0052
 P(N)=0.4*0.4*0.8*0.6*0.3571=0.0274
As P(Y) < P(N), the prediction is Play = No (will not play)
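A sketch that computes the same prediction directly from the 14-day table above, deriving every prior and conditional probability by counting rather than typing the tables in by hand:

```python
from collections import Counter

# The 14 training days from the slide: (Outlook, Temp, Humidity, Windy, Play)
days = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rainy","Mild","High","Weak","Yes"),
    ("Rainy","Cool","Normal","Weak","Yes"),  ("Rainy","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rainy","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rainy","Mild","High","Strong","No"),
]

def naive_bayes(x):
    scores = {}
    class_counts = Counter(d[-1] for d in days)
    for cls, n_cls in class_counts.items():
        score = n_cls / len(days)                       # prior P(cls)
        for i, value in enumerate(x):                   # product of P(x_i | cls)
            score *= sum(1 for d in days if d[-1] == cls and d[i] == value) / n_cls
        scores[cls] = round(score, 4)
    return scores

scores = naive_bayes(("Rainy", "Hot", "High", "Strong"))
print(scores, "->", max(scores, key=scores.get))        # {'No': 0.0274, 'Yes': 0.0053} -> No
```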
Naïve Bayes Classifier Example 3
Given the training set below for a
classification problem with two
classes, “fraud” and “normal”. There
are two attributes A1 and A2, each
taking values 0 or 1. Into which class
does the Naïve Bayes classifier put
the instance (A1=1, A2=1)?
A1 A2 Class
1 0 fraud
1 1 fraud
1 1 fraud
1 0 normal
1 1 fraud
0 0 normal
0 0 normal
0 0 normal
1 1 normal
1 0 normal
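Example 3 can be worked the same way: estimate the priors and the per-attribute conditionals by counting the ten rows above, then compare the two class scores for (A1=1, A2=1). A sketch of that calculation:

```python
# Training rows from the slide: (A1, A2, class)
rows = [(1,0,"fraud"), (1,1,"fraud"), (1,1,"fraud"), (1,0,"normal"), (1,1,"fraud"),
        (0,0,"normal"), (0,0,"normal"), (0,0,"normal"), (1,1,"normal"), (1,0,"normal")]

def score(cls, a1, a2):
    in_cls = [r for r in rows if r[2] == cls]
    prior = len(in_cls) / len(rows)
    p_a1 = sum(1 for r in in_cls if r[0] == a1) / len(in_cls)
    p_a2 = sum(1 for r in in_cls if r[1] == a2) / len(in_cls)
    return prior * p_a1 * p_a2

for cls in ("fraud", "normal"):
    print(cls, round(score(cls, 1, 1), 3))
# fraud: 0.4 * 4/4 * 3/4 = 0.3;  normal: 0.6 * 3/6 * 1/6 = 0.05  -> fraud
```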
Benefits of Naïve Bayes Classifier
 It is simple and easy to implement
 It requires relatively little training data compared with many other classifiers
 It handles both continuous and discrete data
 It is highly scalable with the number of predictors and data
points
 It is fast and can be used to make real-time predictions
 It is not sensitive to irrelevant features
Limitations of Naïve Bayes Classifier
 Naïve Bayes assumes that all predictors (features) are
independent, which rarely happens in real life. This limits the
applicability of the algorithm in real-world use cases.
 The algorithm faces the ‘zero-frequency problem’: it assigns zero
probability to a categorical value that appears in the test set but
was never seen for that class in the training set. A smoothing
technique (e.g., Laplace smoothing, sketched after this list) is used to overcome this issue.
 Its probability estimates can be poorly calibrated, so the predicted
class is more trustworthy than the probability values themselves.
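A minimal sketch of the smoothing fix for the zero-frequency problem: Laplace (add-one) smoothing adds 1 to every count and adds the number of distinct attribute values to the denominator, so no conditional probability is ever exactly zero. The counts used below are the Age = 31…40 case from Example 1:

```python
def smoothed_prob(count, class_total, n_values, alpha=1):
    """Laplace-smoothed estimate of P(value | class)."""
    return (count + alpha) / (class_total + alpha * n_values)

# Age = 31..40 in class buys_computer = No: observed 0 of 5 tuples, Age has 3 distinct values.
print(smoothed_prob(0, 5, 3))   # 0.125 instead of a hard 0
print(smoothed_prob(4, 9, 3))   # the Yes-class estimate barely moves: 0.4167 vs 4/9 = 0.4444
```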
Type of Distances used in Machine Learning algorithm
 Distance metrics are used to measure the distance between any two
data points.
 There are many distance metrics.
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
Type of Distances used in Machine Learning algorithm
Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
Let’s calculate the distance between {2, 3} and {3, 5}:
= √((3-2)² + (5-3)²)
= √(1² + 2²)
= √(1 + 4)
= √5 ≈ 2.24
Exercise: calculate the distance between {40, 20} and {20, 35}
Type of Distances used in Machine Learning algorithm
Manhattan Distance
The Manhattan distance is the sum of absolute differences: |x₁ − x₂| + |y₁ − y₂|
Let’s calculate the distance between {2, 3} and {3, 5}:
= |2 − 3| + |3 − 5|
= |-1| + |-2|
= 1 + 2
= 3
Exercise: calculate the distance between {40, 20} and {20, 35}
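A sketch of both metrics, checked against the worked {2, 3} → {3, 5} values and also evaluating the {40, 20} → {20, 35} exercise:

```python
import math

def euclidean(p, q):
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

print(euclidean((2, 3), (3, 5)), manhattan((2, 3), (3, 5)))          # 2.236..., 3
print(euclidean((40, 20), (20, 35)), manhattan((40, 20), (20, 35)))  # 25.0, 35
```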
The k-Nearest Neighbor Algorithm
 The k-nearest neighbors (KNN) algorithm is a non-parametric,
supervised learning classifier
 It uses proximity to make classifications or predictions about the grouping of an
individual data point
 It is among the most popular and simplest classification and regression methods used in
machine learning today
 It works for binary as well as multi-class classification problems
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
 Target function could be discrete- or real- valued
 For discrete-valued, k-NN returns the most common value among the k training
examples nearest to xq
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples
[Figure: a query point xq surrounded by + and − training examples; the cells around each training point form the 1-NN (Voronoi) decision surface]
The k-Nearest Neighbor Algorithm Steps
 Step #1 - Assign a value to K.
 Step #2 - Calculate the distance between the new data entry and all
other existing data entries. Arrange them in ascending order.
 Step #3 - Find the K nearest neighbors to the new entry based on the
calculated distances.
 Step #4 - Assign the new data entry to the majority class among those
nearest neighbors (a minimal sketch of these steps follows below).
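A minimal sketch of the four steps for classification, using Euclidean distance and a majority vote; the toy 2-D points and labels are hypothetical:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (feature_tuple, label); returns the majority label of the k nearest points."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))  # Step 2: sort by distance
    nearest = ranked[:k]                                                # Step 3: keep the K nearest
    return Counter(label for _, label in nearest).most_common(1)[0][0]  # Step 4: majority vote

# Hypothetical 2-D training points with two classes
train = [((1, 1), "A"), ((2, 1), "A"), ((2, 2), "A"), ((8, 8), "B"), ((9, 7), "B")]
print(knn_classify(train, (1.5, 1.2), k=3))   # Step 1: choose K=3 -> "A"
```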
 For the movie Barbie, given IMDb Rating = 7.4 and Duration = 114, predict the Genre.
Assume K=3 and use Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
The k-Nearest Neighbor Algorithm
IMDb Rating Duration Genre
8.0 ( Mission Impossible) 160 Action
6.2 (Gadar 2) 170 Action
7.2 (Rocky and Rani) 168 Comedy
8.2 ( OMG 2) 155 Comedy
 Step 1: Calculate the distances.
 Calculate the distance of new movie and each movie in dataset.
Distance to (8.0, 160) = √((7.4-8.0)² + (114-160)²) = √(0.36 + 2116) = 46.00
Distance to (6.2, 170) = √((7.4-6.2)² + (114-170)²) = √(1.44 + 3136) = 56.01
Distance to (7.2, 168) = √((7.4-7.2)² + (114-168)²) = √(0.04 + 2916) = 54.00
Distance to (8.2, 155) = √((7.4-8.2)² + (114-155)²) = √(0.64 + 1681) = 41.00
The k-Nearest Neighbor Algorithm
(Here (X₁, Y₁) is a movie from the table and (X₂, Y₂) = (7.4, 114) is the new movie.)
 Step 2: Select K Nearest Neighbor.
 For K=1, the shortest distance is 41.00 (OMG 2, Comedy)
 So, Barbie movie Genre is Comedy.
The k-Nearest Neighbor Algorithm
 Step 3: Majority Voting (Classification)
 For K=3, the three shortest distances are 41.00, 46.00, and 54.00
Comedy, Action, Comedy
 So, Barbie movie Genre is Comedy.
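The same steps applied to the movie table: a sketch that recomputes the four distances and takes the K=3 vote.

```python
import math
from collections import Counter

movies = [((8.0, 160), "Action"),   # Mission Impossible
          ((6.2, 170), "Action"),   # Gadar 2
          ((7.2, 168), "Comedy"),   # Rocky and Rani
          ((8.2, 155), "Comedy")]   # OMG 2

query = (7.4, 114)                  # Barbie: (IMDb rating, duration)
ranked = sorted(movies, key=lambda m: math.dist(m[0], query))
for point, genre in ranked:
    print(point, genre, round(math.dist(point, query), 2))   # 41.01, 46.0, 54.0, 56.01

votes = Counter(genre for _, genre in ranked[:3])
print("K=3 prediction:", votes.most_common(1)[0][0])          # Comedy
```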
 For the test tuple Brightness = 20, Saturation = 35, predict the Class.
Assume K=5 and use Euclidean distance.
BRIGHTNESS SATURATION CLASS
40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue
Euclidean distance :√(X₂-X₁)²+(Y₂-Y₁)²
The k-Nearest Neighbor Algorithm
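A sketch of this exercise with the same recipe: rank the seven points by Euclidean distance to (20, 35) and take the majority class among the 5 nearest.

```python
import math
from collections import Counter

points = [((40, 20), "Red"), ((50, 50), "Blue"), ((60, 90), "Blue"), ((10, 25), "Red"),
          ((70, 70), "Blue"), ((60, 10), "Red"), ((25, 80), "Blue")]

query = (20, 35)   # (Brightness, Saturation) of the test tuple
ranked = sorted(points, key=lambda p: math.dist(p[0], query))
nearest5 = ranked[:5]
for point, cls in nearest5:
    print(point, cls, round(math.dist(point, query), 2))
print("K=5 prediction:", Counter(cls for _, cls in nearest5).most_common(1)[0][0])
```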
Reference
 Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, ISBN: 9780123814791, 9780123814807.
 https://onlinecourses.nptel.ac.in/noc24_cs22
 https://medium.com/analytics-vidhya/type-of-distances-used-in-machine-
learning-algorithm-c873467140de
 https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-
classifiers-and-model-example/