This course is about data mining: how we obtain optimized results, the types of techniques involved, and how these techniques are used.
A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice among a number of alternatives, and each leaf node represents a decision.
Basics of Decision Tree Learning. This slide set includes the definition of a decision tree, a basic example, the basic construction of a decision tree, and a MATLAB example.
Classification of common clustering algorithms and techniques, e.g., hierarchical clustering, distance measures, K-means, squared error, SOFM, and clustering of large databases.
This presentation covers Data Mining: Classification and Prediction, neural network representation, neural network application development, benefits and limitations of neural networks, a real estate appraiser example, kinds of data mining problems, data mining techniques, learning in ANNs, elements of ANNs, neural network architectures, recurrent neural networks, and ANN software.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
This presentation briefly discusses the following topics:
Data Analytics Lifecycle
Importance of Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communication Results
Phase 6: Operationalize
Data Analytics Lifecycle Example
Date: September 25, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
Classification: Basic Concepts and Decision Trees (by sathish sak)
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Lecture 11: Naive Bayes classifier. Supervised Learning. Web Search and PageRank (ppt,pdf)
Chapter 5 from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Chapter 13 from the book "Introduction to Information Retrieval" by C. Manning, P. Raghavan, H. Schutze
2. Classification: Definition
• Given a collection of records (training set):
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
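The workflow described above can be sketched in a few lines of Python. The snippet below is a minimal illustration, not part of the original slides; it assumes scikit-learn is available and encodes the Refund / Marital Status / Taxable Income records from the later example slides as numeric features.

# Minimal sketch of the classification workflow: split labeled records into
# training and test sets, learn a model from the training set, and validate
# it on the test set. The numeric encoding is an illustrative assumption.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records: [refund (1 = Yes, 0 = No), marital status (0 = Single, 1 = Married, 2 = Divorced), taxable income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # class attribute (Cheat)

# Divide the labeled data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Induction: learn a model (here, a decision tree) from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Deduction: apply the model to previously unseen records and measure its accuracy.
print("Test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))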
3. Illustrating Classification Task
[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
4. Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
5. Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
6. Example of a Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
• Refund = Yes → NO
• Refund = No → test MarSt
  – MarSt = Married → NO
  – MarSt = Single, Divorced → test TaxInc
    • TaxInc < 80K → NO
    • TaxInc > 80K → YES
7. Another Example of Decision Tree
(Same training data as slide 6.)
Model: Decision Tree with MarSt as the first splitting attribute
• MarSt = Married → NO
• MarSt = Single, Divorced → test Refund
  – Refund = Yes → NO
  – Refund = No → test TaxInc
    • TaxInc < 80K → NO
    • TaxInc > 80K → YES
There could be more than one tree that fits the same data!
8. Decision Tree Classification Task
[Figure: the same workflow and training/test sets as slide 3; a tree induction algorithm learns a decision tree from the training set (induction), and the tree is then applied to the test set (deduction).]
9.–14. Apply Model to Test Data
Model (from slide 6): Refund = Yes → NO; Refund = No → MarSt; MarSt = Married → NO; MarSt = Single, Divorced → TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch that matches the test record at each node:
• Refund = No → follow the "No" branch to the MarSt node.
• Marital Status = Married → follow the "Married" branch, which is a leaf labeled NO.
Assign Cheat to "No".
15. Decision Tree Classification Task
[Figure: repeat of slide 8; the tree induction algorithm learns a decision tree from the training set, and the learned tree is applied to the test set.]
16. Decision Tree Induction
• Many algorithms:
  1. Hunt's Algorithm (one of the earliest)
  2. CART (Classification And Regression Tree)
  3. ID3 (Iterative Dichotomiser 3)
  4. C4.5 (successor of ID3)
  5. SLIQ (does not require loading the entire dataset into main memory)
  6. SPRINT (similar approach to SLIQ; induces decision trees relatively quickly)
  7. CHAID (CHi-squared Automatic Interaction Detector); performs multi-level splits when computing classification trees
  8. MARS (extends decision trees to handle numerical data better)
  9. Conditional Inference Trees (a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting)
17. General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
(Training data: the Refund / Marital Status / Taxable Income / Cheat table from slide 6; Dt is the subset of those records reaching the current node.)
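The recursive procedure above can be written compactly. The sketch below is an illustrative rendering, not the slide author's implementation; the record format (dicts with a "class" key), the helper names choose_split and majority_class, and the use of the majority class as the default label are all assumptions.

# Hedged sketch of Hunt's algorithm as described above. choose_split is a
# plug-in that picks an attribute test (e.g., by information gain) and returns
# (attribute, {outcome: subset_of_records}) or None if no useful split exists.
from collections import Counter

def majority_class(records):
    # Most frequent class among the given records (used as the default label yd).
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def hunt(records, default, choose_split):
    if not records:                        # Dt is empty -> leaf labeled with the default class yd
        return {"leaf": default}
    classes = {r["class"] for r in records}
    if len(classes) == 1:                  # all records in Dt belong to one class yt -> leaf yt
        return {"leaf": classes.pop()}
    split = choose_split(records)          # attribute test to split Dt into smaller subsets
    if split is None:                      # no useful attribute test left -> majority-class leaf
        return {"leaf": majority_class(records)}
    attribute, partition = split
    children = {outcome: hunt(subset, majority_class(records), choose_split)
                for outcome, subset in partition.items()}
    return {"split_on": attribute, "children": children}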
18. Hunt’s Algorithm
[Figure: the tree is grown step by step on the training data from slide 6.]
• Step 1: a single leaf labeled "Don't Cheat".
• Step 2: split on Refund; Refund = Yes → Don't Cheat, Refund = No → Don't Cheat.
• Step 3: under Refund = No, split on Marital Status; Married → Don't Cheat, Single, Divorced → Cheat.
• Step 4: under Single, Divorced, split on Taxable Income; < 80K → Don't Cheat, >= 80K → Cheat.
19. Evaluation of a Classifier
• How predictive is the model we learned?
  – Which performance measure to use?
• Natural performance measure for classification problems: error rate on a test set
  – Success: instance's class is predicted correctly
  – Error: instance's class is predicted incorrectly
  – Error rate: proportion of errors made over the whole set of instances
  – Accuracy: proportion of correctly classified instances over the whole set of instances
    accuracy = 1 - error rate
20. Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

                        PREDICTED CLASS
                        Class = Yes    Class = No
ACTUAL   Class = Yes    a              b
CLASS    Class = No     c              d

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
21. Confusion Matrix - Example
• What can we learn from this matrix?
  – There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
  – The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
  – Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
  – In reality, 105 patients in the sample have the disease, and 60 patients do not.
22. Confusion Matrix – Confusion?
• False positives are actually negatives
• False negatives are actually positives
23. Confusion Matrix - Example
• Let's now define the most basic terms, which are whole numbers (not rates):
  – true positives (TP): these are cases in which we predicted yes (they have the disease), and they do have the disease.
  – true negatives (TN): we predicted no, and they don't have the disease.
  – false positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
  – false negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
24. Confusion Matrix - Computations
• This is a list of rates that are often computed from a confusion matrix:
• Accuracy: Overall, how often is the classifier correct?
  (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP+FN)/total = (10+5)/165 = 0.09
  (equivalent to 1 minus Accuracy; also known as "Error Rate")
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP/actual yes = 100/105 = 0.95
  (also known as "Sensitivity" or "Recall")
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP/actual no = 10/60 = 0.17
25. Confusion Matrix - Computations
• This is a list of rates that are often computed from a confusion matrix:
• Specificity: When it's actually no, how often does it predict no?
  TN/actual no = 50/60 = 0.83
  (equivalent to 1 minus False Positive Rate)
• Precision: When it predicts yes, how often is it correct?
  TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes/total = 105/165 = 0.64
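The rates on slides 24 and 25 follow directly from the four counts of the example (TP = 100, TN = 50, FP = 10, FN = 5). The short script below is an illustrative check, not part of the original deck.

# Reproducing the confusion-matrix rates from the disease example above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                 # 165 predictions in total

accuracy       = (TP + TN) / total        # (100 + 50) / 165 ~ 0.91
error_rate     = (FP + FN) / total        # (10 + 5) / 165 ~ 0.09  (1 - accuracy)
recall         = TP / (TP + FN)           # true positive rate / sensitivity: 100/105 ~ 0.95
false_pos_rate = FP / (FP + TN)           # 10/60 ~ 0.17
specificity    = TN / (FP + TN)           # 50/60 ~ 0.83  (1 - false positive rate)
precision      = TP / (TP + FP)           # 100/110 ~ 0.91
prevalence     = (TP + FN) / total        # 105/165 ~ 0.64

for name, value in [("Accuracy", accuracy), ("Error rate", error_rate),
                    ("Recall", recall), ("False positive rate", false_pos_rate),
                    ("Specificity", specificity), ("Precision", precision),
                    ("Prevalence", prevalence)]:
    print(f"{name}: {value:.2f}")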
26. Confusion Matrix – Example 2
• Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).
• Of the 33 'Spam' patterns, 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'.
• On the other hand, out of the 67 'Non-Spam' patterns, 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'.
Source: http://aimotion.blogspot.com/2010/08/tools-for-machine-learning-performance.html
28. Tree Induction
• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
29. How to Specify Test Condition?
• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split
  – 2-way split
  – Multi-way split
30. Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  Example (CarType): Family / Sports / Luxury
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Example (CarType): {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
31. Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Example (Size): Small / Medium / Large
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Example (Size): {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}
• What about this split? Size: {Small, Large} vs {Medium}
32. Splitting Based on Continuous Attributes
• Different ways of handling:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A >= v)
    • consider all possible splits and find the best cut
    • can be more compute intensive
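To illustrate the binary-decision approach (consider all possible cuts and keep the best one), here is a small sketch that is not from the slides: it scores every midpoint between consecutive sorted values by weighted Gini impurity and returns the best threshold. The example data reuses the taxable-income column and Cheat labels from slide 6; note that the full tree in that slide also tests Refund and MarSt, so the best single-attribute cut found here need not equal the 80K threshold used there.

# Sketch: exhaustive search for the best binary split (A < v) vs (A >= v)
# on a continuous attribute, scored by weighted Gini impurity.
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                                     # no cut between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate cut point v
        left  = [label for value, label in pairs[:i]]    # records with A < v
        right = [label for value, label in pairs[i:]]    # records with A >= v
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))   # best threshold for this single attribute and its weighted Gini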
34. Tree Induction
• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
35. How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
[Figure: three candidate test conditions and the class distributions they produce.]
• Own Car? Yes: (C0: 6, C1: 4); No: (C0: 4, C1: 6)
• Car Type? Family: (C0: 1, C1: 3); Sports: (C0: 8, C1: 0); Luxury: (C0: 1, C1: 7)
• Student ID? c1 through c20: each value contains a single record (either C0: 1, C1: 0 or C0: 0, C1: 1)
Which test condition is the best?
36. How to determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity
37. How to Measure Impurity?
• Given a data table that contains attributes and the class of the attributes, we can measure the homogeneity (or heterogeneity) of the table based on the classes.
• We say a table is pure or homogeneous if it contains only a single class.
• If a data table contains several classes, then we say that the table is impure or heterogeneous.
Source: http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm
38. How to Measure Impurity?
• There are several indices to measure the degree of impurity quantitatively.
• The most well-known indices are:
  – Entropy = -Σj pj log2(pj)
  – Gini index = 1 - Σj pj^2
  – Misclassification error = 1 - maxj {pj}
• All of the above formulas use pj, the probability of class j.
39. How to Measure Impurity? - Example
• In our example, the classes of Transportation mode consist of three groups: Bus, Car, and Train. In this case, we have 4 buses, 3 cars, and 3 trains (in short, we write 4B, 3C, 3T). The total data is 10 rows.
40. How to Measure Impurity? - Example
• Based on the data, we can compute the probability of each class. Since probability is equal to relative frequency, we have:
  – Prob(Bus) = 4/10 = 0.4
  – Prob(Car) = 3/10 = 0.3
  – Prob(Train) = 3/10 = 0.3
• Observe that when we compute the probabilities, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative indices of impurity degree.
41. How to Measure Impurity? - Entropy
• One way to measure the impurity degree is using entropy: Entropy = -Σj pj log2(pj).
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can now compute the entropy as:
  Entropy = -0.4 log2(0.4) - 0.3 log2(0.3) - 0.3 log2(0.3) = 1.571
42. How to Measure Impurity? - Entropy
• The entropy of a pure table (consisting of a single class) is zero because the probability is 1 and log2(1) = 0.
• Entropy reaches its maximum value when all classes in the table have equal probability.
• [Figure: plot of the maximum entropy for different numbers of classes n, where each class has probability p = 1/n.]
• In this case, the maximum entropy is equal to -n * p * log2(p) = log2(n).
• Notice that the value of entropy is larger than 1 if the number of classes is more than 2.
43. How to Measure Impurity? - Gini
• Another way to measure the impurity degree is using the Gini index: Gini index = 1 - Σj pj^2.
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can now compute the Gini index as:
  Gini index = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660
44. How to Measure Impurity? - Gini
• The Gini index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 - (1)^2 = 0.
• Similar to entropy, the Gini index also reaches its maximum value when all classes in the table have equal probability.
• [Figure: plot of the maximum Gini index for different numbers of classes n, where each class has probability p = 1/n.]
• Notice that the value of the Gini index is always between 0 and 1 regardless of the number of classes.
45. How to Measure Impurity? - Misclassification Error
• Still another way to measure the impurity degree is the misclassification error: Misclassification error = 1 - maxj {pj}.
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can now compute the index as:
  Misclassification error = 1 - Max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60
46. How to Measure Impurity? - Misclassification Error
• The misclassification error index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 - Max(1) = 0.
• The value of the misclassification error index is always between 0 and 1.
• In fact, the maximum Gini index for a given number of classes is always equal to the maximum misclassification error index: for n classes with equal probability p = 1/n, the maximum Gini index is 1 - n * (1/n)^2 = 1 - 1/n, while the maximum misclassification error index is 1 - max{1/n} = 1 - 1/n.
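The three impurity measures can be checked with a short script. The sketch below is illustrative rather than part of the deck; it reproduces the worked example with 4 buses, 3 cars, and 3 trains.

# Computing entropy, Gini index, and misclassification error for the
# transportation-mode example (4 Bus, 3 Car, 3 Train).
from collections import Counter
from math import log2

def class_probabilities(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    return -sum(p * log2(p) for p in class_probabilities(labels))

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def misclassification_error(labels):
    return 1.0 - max(class_probabilities(labels))

modes = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3
print(round(entropy(modes), 3))                   # 1.571
print(round(gini(modes), 3))                      # 0.66
print(round(misclassification_error(modes), 2))   # 0.6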
47. Information Gain
• The reason for computing the impurity degree both for the data table D and for the subset tables Si is that we would like to compare the impurity degree before splitting the table (i.e., data table D) with the impurity degree after splitting it according to the values of an attribute i (i.e., subset tables Si). The measure used to compare this difference in impurity degrees is called information gain. We would like to know what our gain is if we split the data table based on some attribute's values.
48. Information Gain - Example
• For example, in the parent table below, we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short: 4B, 3C, 3T).
49. Information Gain - Example
• For example, we split using the travel cost attribute and compute the degree of impurity of each resulting subset.
50. Information Gain - Example
• Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables, where the weights are based on the number of records for each attribute value. Suppose we use entropy as the measurement of impurity degree; then we have:
  Information gain(i) = Entropy of parent table D - Σk (nk / n) * Entropy of subset table Si for value k
• The information gain of the attribute Travel cost per km is computed as 1.571 - (5/10 * 0.722 + 2/10 * 0 + 3/10 * 0) = 1.210.
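The gain computation above can be reproduced directly. The sketch below is not from the slides; the class composition of each subset is not spelled out in the text, so the counts used here are an assumption chosen to match the stated subset sizes (5, 2, 3) and subset entropies (0.722, 0, 0).

# Information gain for the "Travel cost per km" split in the example above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, subsets):
    n = len(parent)
    weighted = sum(len(subset) / n * entropy(subset) for subset in subsets)
    return entropy(parent) - weighted

parent = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3   # 4B, 3C, 3T; entropy 1.571
subsets = [
    ["Bus"] * 4 + ["Train"],   # assumed subset of 5 records, entropy ~ 0.722
    ["Train"] * 2,             # assumed subset of 2 records, entropy 0
    ["Car"] * 3,               # assumed subset of 3 records, entropy 0
]
print(round(information_gain(parent, subsets), 3))   # ~ 1.21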
51. Information Gain - Example
• You can also compute the information gain based on the Gini index or the classification error in the same way. The results are given in the table on the slide.
55. Information Gain - Example
• The table below summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree based on all three methods; you can use any one of entropy, the Gini index, or the classification error index.
• Now we find the optimum attribute, the one that produces the maximum information gain (i* = argmax {information gain of attribute i}). In our case, Travel cost per km produces the maximum information gain.
56. Information Gain - Example
• So we split using the "Travel cost per km" attribute, as this produces the maximum information gain.