Md Main Uddin Rony
What is Classification?
Classification is a data mining task of predicting the value of a
categorical variable (target or class)
This is done by building a model based on one or more numerical
and/or categorical variables (predictors, attributes, or features)
Considered an instance of supervised learning
Corresponding unsupervised procedure is known as clustering
Four main groups of classification
● Frequency Table
- Naive Bayesian
- Decision Tree
● Covariance Matrix
- Linear Discriminant Analysis
- Logistic Regression
● Similarity Functions
- K Nearest Neighbours
● Others
- Artificial Neural Network
- Support Vector Machine
Naive Bayes Classifier
● Works based on Bayes’ theorem
● Why is it called Naive?
- Because it assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature
● Easy to build
● Useful for very large data sets
The theorem can be stated mathematically as follows:
P(A | B) = P(B | A) · P(A) / P(B)
P(A) and P(B) are the probabilities of observing A and B without regard
to each other. Also known as the Prior Probability.
P(A | B), a conditional (Posterior) probability, is the probability of
observing event A given that B is true.
P(B | A) is the conditional probability (the likelihood) of observing event B
given that A is true.
So, how does the Naive Bayes classifier work based on this?
How does Naive Bayes work?
● Let D be a training set of tuples, where each tuple is represented by an n-dimensional
attribute vector, X = (x1, x2, ..., xn)
● Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned
on X. That is, the Naive Bayesian classifier predicts that tuple X belongs to the class Ci
if and only if
P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i
● By Bayes’ theorem,
P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
● Since P(X) is constant for all classes, only P(X | Ci) · P(Ci) needs to be maximized
How does Naive Bayes work? (Contd.)
● To reduce computation in evaluating P(X | Ci), the naive assumption of
class-conditional independence is made. This presumes that the attributes’ values are
conditionally independent of one another, given the class label of the tuple (i.e., that
there are no dependence relationships among the attributes):
P(X | Ci) = P(x1 | Ci) · P(x2 | Ci) · ... · P(xn | Ci)
Given the previous patients’ symptoms and diagnoses, does the patient
with the following symptoms have the flu?
chills runny nose headache fever flu?
Y N Mild Y N
Y Y No N Y
Y N Strong Y Y
N Y Mild Y Y
N N No N N
N Y Strong Y Y
N Y Strong N N
Y Y Mild Y Y
chills runny nose headache fever flu?
Y N Mild N ?
First, we compute all possible individual
probabilities conditioned on the target attribute
P(Flu=Y) 0.625 P(Flu=N) 0.375
P(chills=Y|flu=Y) 0.6 P(chills=Y|flu=N) 0.333
P(chills=N|flu=Y) 0.4 P(chills=N|flu=N) 0.666
P(runny nose=Y|flu=Y) 0.8 P(runny nose=Y|flu=N) 0.333
P(runny nose=N|flu=Y) 0.2 P(runny nose=N|flu=N) 0.666
P(headache=Mild|flu=Y) 0.4 P(headache=Mild|flu=N) 0.333
P(headache=No|flu=Y) 0.2 P(headache=No|flu=N) 0.333
P(headache=Strong|flu=Y) 0.4 P(headache=Strong|flu=N) 0.333
P(fever=Y|flu=Y) 0.8 P(fever=Y|flu=N) 0.333
P(fever=N|flu=Y) 0.2 P(fever=N|flu=N) 0.666
And then decide:
P(flu=Y | X) ∝ P(chills=Y | flu=Y) · P(runny nose=N | flu=Y) · P(headache=Mild | flu=Y) · P(fever=N | flu=Y) · P(flu=Y)
= 0.6 × 0.2 × 0.4 × 0.2 × 0.625 = 0.006
P(flu=N | X) ∝ P(chills=Y | flu=N) · P(runny nose=N | flu=N) · P(headache=Mild | flu=N) · P(fever=N | flu=N) · P(flu=N)
= 0.333 × 0.666 × 0.333 × 0.666 × 0.375 ≈ 0.0185
Since 0.0185 > 0.006, the Naive Bayes classifier predicts that the patient
doesn’t have the flu.
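The calculation above can be sketched in Python. The counts come directly from the symptoms table; the column order (chills, runny nose, headache, fever, flu) and the helper names are our own:

```python
# Training rows from the symptoms table: (chills, runny_nose, headache, fever, flu)
data = [
    ("Y", "N", "Mild",   "Y", "N"),
    ("Y", "Y", "No",     "N", "Y"),
    ("Y", "N", "Strong", "Y", "Y"),
    ("N", "Y", "Mild",   "Y", "Y"),
    ("N", "N", "No",     "N", "N"),
    ("N", "Y", "Strong", "Y", "Y"),
    ("N", "Y", "Strong", "N", "N"),
    ("Y", "Y", "Mild",   "Y", "Y"),
]

def cond_prob(attr_idx, value, flu_label):
    """P(attribute = value | flu = flu_label), estimated by counting."""
    rows = [r for r in data if r[-1] == flu_label]
    return sum(1 for r in rows if r[attr_idx] == value) / len(rows)

def score(query, flu_label):
    """Unnormalized posterior: prior times the product of the conditionals."""
    p = sum(1 for r in data if r[-1] == flu_label) / len(data)  # prior
    for i, v in enumerate(query):
        p *= cond_prob(i, v, flu_label)
    return p

query = ("Y", "N", "Mild", "N")  # chills=Y, runny nose=N, headache=Mild, fever=N
print(score(query, "Y"))  # ≈ 0.006
print(score(query, "N"))  # ≈ 0.0185
print("flu" if score(query, "Y") > score(query, "N") else "no flu")  # no flu
```

Since P(X) is the same for both classes, comparing the unnormalized scores is enough to pick the winner.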
Decision Tree
● Decision tree builds classification or regression models in the form of a
tree structure
● It breaks down a dataset into smaller and smaller subsets while at the
same time an associated decision tree is incrementally developed.
● The final result is a tree with decision nodes and leaf nodes.
- A decision node has two or more branches
- Leaf node represents a classification or decision
● The topmost decision node in a tree, which corresponds to the best
predictor, is called the root node
● Decision trees can handle both categorical and numerical data
The dataset we will work with:
Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No
How it works
● The core algorithm for building decision trees is called ID3,
developed by J. R. Quinlan
● ID3 uses Entropy and Information Gain to construct a
decision tree
● A decision tree is built top-down from a root node and
involves partitioning the data into subsets that contain
instances with similar values (homogeneous)
● ID3 algorithm uses entropy to calculate the homogeneity
of a sample
● If the sample is completely homogeneous the entropy is
zero, and if the sample is equally divided it has entropy
of one
● To build a decision tree, we need to calculate
two types of entropy using frequency tables:
● a) Entropy using the frequency table of one attribute
(Entropy of the Target):
E(S) = Σ −p(x) · log2 p(x)
● b) Entropy using the frequency table of two attributes:
E(T, X) = Σ P(c) · E(c), summed over the values c of attribute X
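As a sketch, the entropy of the target for the Play Golf table above (9 Yes, 5 No) can be computed like this; the function name is our own:

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, e.g. [9, 5] for 9 Yes / 5 No."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Entropy of the target: Play Golf has 9 Yes and 5 No in the table above
print(round(entropy([9, 5]), 3))  # ≈ 0.940
```

A perfectly homogeneous sample (e.g. `entropy([14])`) gives 0, and an equally divided one (e.g. `entropy([7, 7])`) gives 1, matching the slide above.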
● The information gain is based on the decrease
in entropy after a dataset is split on an attribute
● Constructing a decision tree is all about finding
the attribute that returns the highest information
gain (i.e., the most homogeneous branches)
● Step 1: Calculate entropy of the target
● Step 2: The dataset is then split on
the different attributes.
The entropy for each branch is calculated.
● Then it is added proportionally, to
get the total entropy for the split.
● The resulting entropy is subtracted
from the entropy before the split.
● The result is the Information Gain,
or decrease in entropy:
Gain(T, X) = E(T) − E(T, X)
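Putting the steps together, here is a sketch of the information gain for splitting on Outlook, with the rows transcribed from the golf table above:

```python
import math
from collections import Counter, defaultdict

# (outlook, play_golf) pairs transcribed from the golf table above
rows = [
    ("Rainy", "No"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "Yes"),
    ("Sunny", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "No"),
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Rainy", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Sunny", "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

# Step 1: entropy of the target before the split
target = [play for _, play in rows]
e_before = entropy(target)

# Step 2: split on Outlook, then add each branch's entropy weighted by its size
branches = defaultdict(list)
for outlook, play in rows:
    branches[outlook].append(play)
e_split = sum(len(b) / len(rows) * entropy(b) for b in branches.values())

# Information gain = entropy before the split minus entropy after the split
gain = e_before - e_split
print(round(e_before, 3), round(gain, 3))  # ≈ 0.940, ≈ 0.247
```

Repeating this for Temp, Humidity, and Windy and picking the largest gain is how ID3 chooses the root node.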
K Nearest Neighbors (KNN)
● K nearest neighbors is a simple algorithm that stores all available cases
and classifies new cases based on a similarity measure (e.g., distance
functions)
● KNN has been used in statistical estimation and pattern recognition
since the early 1970s
● A case is classified by a majority vote of its neighbors, with the case being
assigned to the class most common amongst its K nearest neighbors
measured by a distance function
● If K = 1, what will it do?
● Choosing the optimal value for K is best
done by first inspecting the data
● In general, a large K value is more
precise as it reduces the overall noise
but there is no guarantee
● Cross-validation is another way to
retrospectively determine a good K
value by using an independent dataset
to validate the K value
● Historically, the optimal K for most
datasets has been between 3 and 10. That
produces much better results than 1NN
● Consider the following data concerning credit default. Age and Loan are
two numerical variables (predictors), and Default is the target
● We can now use the training set to classify an
unknown case (Age=48 and Loan=$142,000)
using Euclidean distance.
● If K=1 then the nearest neighbor is the last
case in the training set with Default=Y
● D = Sqrt[(48-33)^2 + (142000-150000)^2] =
8000.01 >> Default=Y
● With K=3, there are two Default=Y and one
Default=N out of the three closest neighbors. The
prediction for the unknown case is again Default=Y
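The distance-and-vote procedure can be sketched as follows. Only the last training row (Age=33, Loan=$150,000, Default=Y) is taken from the example above; the other rows are hypothetical stand-ins since the full credit table is not reproduced here:

```python
import math
from collections import Counter

# Hypothetical training set (age, loan, default); only the last row is taken
# from the example above, the others are illustrative.
train = [
    (25, 40000, "N"),
    (35, 120000, "N"),
    (45, 100000, "Y"),
    (33, 150000, "Y"),
]

def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, k):
    """Majority vote over the k nearest training rows."""
    nearest = sorted(train, key=lambda row: euclidean(query, row[:2]))[:k]
    votes = Counter(label for *_, label in nearest)
    return votes.most_common(1)[0][0]

query = (48, 142000)
print(round(euclidean(query, (33, 150000)), 2))  # ≈ 8000.01, as above
print(knn_predict(query, k=1))  # Y -- the nearest neighbor defaulted
print(knn_predict(query, k=3))  # Y -- majority of the three nearest
```

Note how the loan amounts completely dominate the distances here; the scaling problem discussed next.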
● One major drawback in calculating distance
measures directly from the training set is in the
case where variables have different
measurement scales or there is a mixture of
numerical and categorical variables.
● For example, if one variable is based on annual
income in dollars, and the other is based on
age in years then income will have a much
higher influence on the distance calculated.
● One solution is to standardize the training set, e.g. by rescaling each
variable to [0, 1] with min-max standardization: Xs = (X − min) / (max − min)
● Using the standardized distance on the same
training set, the unknown case returned a different
neighbor, which is not a good sign of robustness.
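A minimal sketch of min-max standardization; the min/max ranges below are hypothetical, not from the original credit table:

```python
def standardize(x, lo, hi):
    """Min-max standardization: rescale a value to [0, 1] so that large-scale
    variables (loan in dollars) do not swamp small-scale ones (age in years)."""
    return (x - lo) / (hi - lo)

# Hypothetical ranges: age in [20, 60], loan in [18000, 220000]
age = standardize(48, lo=20, hi=60)
loan = standardize(142000, lo=18000, hi=220000)
print(round(age, 2), round(loan, 2))
```

After this rescaling, a 15-year age gap and an $8,000 loan gap contribute on comparable scales to the Euclidean distance.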
What will happen if K equals a multiple of
the number of classes?
What will happen if K = 1?
What will happen if we set K equal to the
dataset size?
Contents are borrowed from…
1. Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber
2. Naive Bayes Example (YouTube video), by Francisco Iacobelli
3. Predicting the Future: Classification,
presented by Dr. Noureddin Sadawi