Data Mining
Classification: Basic Concepts and
Techniques
Lecture Notes for Chapter 3
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
Classification: Definition
 Given a collection of records (training set)
– Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
 x: attribute, predictor, independent variable, input
 y: class, response, dependent variable, output
 Task:
– Learn a model that maps each attribute set x into one of the predefined class labels y
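As a concrete sketch of the task (assuming scikit-learn is available; the records and labels below are toy placeholders, not data from these slides), a model is fit on (x, y) pairs and then maps a new attribute set to a class label:

```python
# Minimal sketch of the classification task (toy data, illustrative only).
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 125], [0, 1, 100], [0, 0, 70]]   # attribute sets x (one row per record)
y = ["No", "No", "Yes"]                      # class labels y

model = DecisionTreeClassifier()
model.fit(X, y)                              # learn a model mapping x -> y
print(model.predict([[0, 1, 80]]))           # classify a previously unseen record
```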
Examples of Classification Task
Examples (attribute set x, class label y):
Categorizing email messages: x = features extracted from the email message header and content; y = spam or non-spam
Identifying tumor cells: x = features extracted from x-rays or MRI scans; y = malignant or benign cells
Cataloging galaxies: x = features extracted from telescope images; y = elliptical, spiral, or irregular-shaped galaxies
General Approach for Building
Classification Model
Classification Techniques
 Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets
 Ensemble Classifiers
– Boosting, Bagging, Random Forests
Example of a Decision Tree
Training Data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)
Home Owner = Yes → NO
Home Owner = No → MarSt = Married → NO
Home Owner = No → MarSt = Single, Divorced → Income < 80K → NO
Home Owner = No → MarSt = Single, Divorced → Income > 80K → YES
Apply Model to Test Data
Test Data: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
Start from the root of the tree and follow the branches that match the test record:
Home Owner = No → go to the MarSt node
MarSt = Married → reach the leaf labeled NO
Assign Defaulted to “No”
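To make the traversal concrete, here is a small hand-coded version of the example tree (a sketch written for these notes, not code from the slides; the field names home_owner, marital_status, and annual_income are illustrative):

```python
# Hand-coded version of the example decision tree (sketch).
def classify(record):
    if record["home_owner"] == "Yes":
        return "No"                          # Home Owner = Yes -> leaf NO
    if record["marital_status"] == "Married":
        return "No"                          # Married -> leaf NO
    if record["annual_income"] < 80:         # Single/Divorced branch splits on income
        return "No"
    return "Yes"

test_record = {"home_owner": "No", "marital_status": "Married", "annual_income": 80}
print(classify(test_record))                 # -> "No", matching the walkthrough above
```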
Another Example of Decision Tree
An alternative tree for the same training data:
Marital Status = Married → NO
Marital Status = Single, Divorced → Home Owner = Yes → NO
Marital Status = Single, Divorced → Home Owner = No → Income < 80K → NO
Marital Status = Single, Divorced → Home Owner = No → Income > 80K → YES
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Framework: the Training Set is fed to a Tree Induction algorithm, which learns a Model (Induction); the learned Decision Tree is then applied to the Test Set to assign class labels (Deduction).

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1  | Yes | Large  | 125K | No
2  | No  | Medium | 100K | No
3  | No  | Small  | 70K  | No
4  | Yes | Medium | 120K | No
5  | No  | Large  | 95K  | Yes
6  | No  | Medium | 60K  | No
7  | Yes | Large  | 220K | No
8  | No  | Small  | 85K  | Yes
9  | No  | Medium | 75K  | No
10 | No  | Small  | 90K  | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No  | Small  | 55K  | ?
12 | Yes | Medium | 80K  | ?
13 | Yes | Large  | 110K | ?
14 | No  | Small  | 95K  | ?
15 | No  | Large  | 67K  | ?
Decision Tree Induction
 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
 Let Dt be the set of training
records that reach a node t
 General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (a sketch follows below).
(Dt is the set of records reaching node t; the loan-borrower training data from the earlier slides is used to illustrate the algorithm.)
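The general procedure above lends itself to a short recursive sketch. The following is illustrative only: it assumes categorical attributes, picks attribute tests in a fixed order rather than by an impurity measure, and uses invented names (hunts, records, labels):

```python
# Simplified sketch of Hunt's algorithm for categorical attributes (illustrative only).
from collections import Counter

def hunts(records, labels, attributes):
    if len(set(labels)) == 1:                      # all records in the same class yt
        return labels[0]                           # -> leaf node labeled yt
    if not attributes:                             # no attribute test left to apply
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                           # naive choice of attribute test
    node = {attr: {}}
    for value in set(r[attr] for r in records):    # split into smaller subsets
        idx = [i for i, r in enumerate(records) if r[attr] == value]
        node[attr][value] = hunts([records[i] for i in idx],
                                  [labels[i] for i in idx],
                                  attributes[1:])  # recursively apply the procedure
    return node

records = [{"Home Owner": "Yes", "Marital Status": "Single"},
           {"Home Owner": "No",  "Marital Status": "Married"},
           {"Home Owner": "No",  "Marital Status": "Single"}]
labels = ["No", "No", "Yes"]
print(hunts(records, labels, ["Home Owner", "Marital Status"]))
```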
Hunt’s Algorithm
Stages of Hunt's algorithm on the loan-borrower data (class counts shown as (Defaulted = No, Defaulted = Yes)):
(a) Single leaf: Defaulted = No (7,3)
(b) Split on Home Owner: Yes → Defaulted = No (3,0); No → Defaulted = No (4,3)
(c) Split on Home Owner: Yes → Defaulted = No (3,0); No → split on Marital Status: Married → Defaulted = No (3,0); Single, Divorced → Defaulted = Yes (1,3)
(d) Split on Home Owner: Yes → Defaulted = No (3,0); No → split on Marital Status: Married → Defaulted = No (3,0); Single, Divorced → split on Annual Income: < 80K → Defaulted = No (1,0); >= 80K → Defaulted = Yes (0,3)
Design Issues of Decision Tree Induction
 How should training records be split?
– Method for expressing test condition
 depending on attribute types
– Measure for evaluating the goodness of a test
condition
 How should the splitting procedure stop?
– Stop splitting if all the records belong to the
same class or have identical attribute values
– Early termination
Methods for Expressing Test Conditions
 Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
Test Condition for Nominal Attributes
 Multi-way split:
– Use as many partitions as
distinct values.
 Binary split:
– Divides values into two subsets
Multi-way split on Marital Status: Single / Divorced / Married
Binary splits on Marital Status: {Single} vs {Married, Divorced}, OR {Married} vs {Single, Divorced}, OR {Single, Married} vs {Divorced}
Test Condition for Ordinal Attributes
 Multi-way split:
– Use as many partitions
as distinct values
 Binary split:
– Divides values into two
subsets
– Preserve order
property among
attribute values
Multi-way split on Shirt Size: Small / Medium / Large / Extra Large
Binary splits on Shirt Size that preserve the order property: {Small} vs {Medium, Large, Extra Large}, or {Small, Medium} vs {Large, Extra Large}
The grouping {Small, Large} vs {Medium, Extra Large} violates the order property.
Test Condition for Continuous Attributes
(i) Binary split: Annual Income > 80K? (Yes / No)
(ii) Multi-way split: Annual Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Splitting Based on Continuous Attributes
 Different ways of handling
– Discretization to form an ordinal categorical
attribute
Ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or
clustering.
 Static – discretize once at the beginning
 Dynamic – repeat at each node
– Binary Decision: (A < v) or (A ≥ v)
 consider all possible splits and find the best cut
 can be more compute intensive
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.
Gender split: two children with class distributions (C0: 6, C1: 4) and (C0: 4, C1: 6)
Car Type split: Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), Luxury (C0: 1, C1: 7)
Customer ID split: one child per customer (c1, ..., c10, c11, ..., c20), each containing a single record, (C0: 1, C1: 0) or (C0: 0, C1: 1)
Which test condition is the best?
How to determine the Best Split
 Greedy approach:
– Nodes with purer class distribution are
preferred
 Need a measure of node impurity:
C0: 5, C1: 5 (high degree of impurity)     C0: 9, C1: 1 (low degree of impurity)
Measures of Node Impurity
 Gini Index
 Entropy
 Misclassification error
$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$
$\text{Classification error} = 1 - \max_i[p_i(t)]$
where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes.
Finding the Best Split
1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
 Compute impurity measure of each child node
 M is the weighted impurity of child nodes
3. Choose the attribute test condition that
produces the highest gain
Gain = P - M
or equivalently, lowest impurity measure after splitting
(M)
Finding the Best Split
Before splitting, the parent node has class counts (C0: N00, C1: N01) and impurity P.
Splitting on attribute A yields nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with impurities M11 and M12, which combine into the weighted impurity M1.
Splitting on attribute B yields nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with impurities M21 and M22, which combine into M2.
Compare Gain = P - M1 vs. Gain = P - M2.
Measure of Impurity: GINI
 Gini Index for a given node t:
$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
– Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
– Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
Measure of Impurity: GINI
 Gini Index for a given node t :
– For a 2-class problem with class proportions (p, 1 - p):
 Gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p)
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
Computing Gini Index of a Single Node
C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
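The three node computations above can be reproduced with a small helper (a sketch; the function name gini is an illustrative choice):

```python
# Gini index of a single node from its class counts (sketch).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
```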
Computing Gini Index for a Collection of
Nodes
 When a node p is split into k partitions (children):
$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$
where $n_i$ = number of records at child $i$, and $n$ = number of records at parent node $p$.
Binary Attributes: Computing GINI Index
 Splits into two partitions (child nodes)
 Effect of Weighing partitions:
– Larger and purer partitions are sought
Split on attribute B into nodes N1 and N2.
Parent: C1 = 7, C2 = 5; Gini = 0.486
N1: C1 = 5, C2 = 1; Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
N2: C1 = 2, C2 = 4; Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444
Weighted Gini of N1, N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 - 0.361 = 0.125
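A short sketch reproducing these numbers (the helper names gini and weighted_gini are illustrative):

```python
# Weighted Gini of a binary split and the resulting gain (sketch).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 5]
children = [[5, 1], [2, 4]]                              # N1 and N2 class counts
print(round(gini(parent), 3))                            # 0.486
print(round(weighted_gini(children), 3))                 # 0.361
print(round(gini(parent) - weighted_gini(children), 3))  # gain = 0.125
```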
Categorical Attributes: Computing Gini Index
 For each distinct value, gather counts for each class in
the dataset
 Use the count matrix to make decisions
Multi-way split on CarType:
  Family: C1 = 1, C2 = 3; Sports: C1 = 8, C2 = 0; Luxury: C1 = 1, C2 = 7; Gini = 0.163
Two-way splits (find the best partition of values):
  {Sports, Luxury} vs {Family}: C1 = 9 / 1, C2 = 7 / 3; Gini = 0.468
  {Sports} vs {Family, Luxury}: C1 = 8 / 2, C2 = 0 / 10; Gini = 0.167
Which of these is the best?
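The same helpers as before (repeated here so the snippet is self-contained) reproduce the three values from the count matrices above; the slide rounds slightly differently in two cases:

```python
# Weighted Gini for the CarType count matrices (sketch).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(weighted_gini([[1, 3], [8, 0], [1, 7]]))  # 0.1625  (multi-way; 0.163 on the slide)
print(weighted_gini([[9, 7], [1, 3]]))          # 0.46875 ({Sports,Luxury} vs {Family}; 0.468 on the slide)
print(weighted_gini([[8, 0], [2, 10]]))         # ~0.167  ({Sports} vs {Family,Luxury})
```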
Continuous Attributes: Computing Gini Index
 Use Binary Decisions based on one
value
 Several Choices for the splitting value
– Number of possible splitting values
= Number of distinct values
 Each splitting value has a count matrix
associated with it
– Class counts in each of the
partitions, A ≤ v and A > v
 Simple method to choose best v
– For each v, scan the database to
gather count matrix and compute
its Gini index
– Computationally Inefficient!
Repetition of work.
Example (loan-borrower data, candidate split on Annual Income at v = 80):
                  Annual Income ≤ 80   Annual Income > 80
  Defaulted = Yes          0                    3
  Defaulted = No           3                    4
Continuous Attributes: Computing Gini Index...
 For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing gini index
– Choose the split position that has the least gini index

Sorted values (Annual Income): 60  70  75  85  90  95  100  120  125  220
Class (Cheat):                 No  No  No  Yes Yes Yes No   No   No   No
Split positions:             55  65  72  80  87  92  97  110  122  172  230
Counts (≤, >) at each split position:
  Yes: (0,3) (0,3) (0,3) (0,3) (1,2) (2,1) (3,0) (3,0) (3,0) (3,0) (3,0)
  No:  (0,7) (1,6) (2,5) (3,4) (3,4) (3,4) (3,4) (4,3) (5,2) (6,1) (7,0)
  Gini: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
The lowest Gini (0.300) is obtained at the split position 97.
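A sketch of this sorted-scan procedure on the Annual Income values above (illustrative only; candidate splits are taken between consecutive sorted records rather than at explicit midpoints):

```python
# Scan the sorted values once and keep the split with the lowest weighted Gini (sketch).
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def class_counts(ls):
    return [ls.count("Yes"), ls.count("No")]

pairs = sorted(zip(incomes, labels))              # sort the records on the attribute
best_gini, best_pos = None, None
for i in range(len(pairs) + 1):                   # candidate split after the first i records
    left  = [label for _, label in pairs[:i]]
    right = [label for _, label in pairs[i:]]
    w = (len(left) * gini(class_counts(left)) +
         len(right) * gini(class_counts(right))) / len(pairs)
    if best_gini is None or w < best_gini:
        best_gini, best_pos = w, i
print(round(best_gini, 3), best_pos)              # 0.3 at i = 6, i.e. between 95 and 100
```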
Measure of Impurity: Entropy
 Entropy at a given node t:
$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$
where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
 Maximum of $\log_2 c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
 Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
– Entropy-based computations are quite similar to the GINI index computations
Computing Entropy of a Single Node
C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Computing Information Gain After Splitting
 Information Gain:
$Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$
Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$
– Choose the split that achieves the most reduction (maximizes GAIN)
– Used in the ID3 and C4.5 decision tree algorithms
– Information gain is the mutual information between the class variable and the splitting variable
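Entropy and information gain can be reproduced with a short sketch (illustrative helper names; zero-probability terms are skipped, matching the convention 0 log 0 = 0; the gain example reuses the binary split from the earlier Gini slide):

```python
# Entropy of a node and information gain of a split (sketch).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

print(round(entropy([1, 5]), 2))                       # 0.65
print(round(entropy([2, 4]), 2))                       # 0.92
print(round(info_gain([7, 5], [[5, 1], [2, 4]]), 3))   # ~0.196 for that binary split
```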
Problem with large number of partitions
 Node impurity measures tend to prefer splits that
result in large number of partitions, each being
small but pure
– Customer ID has highest information gain
because entropy for all the children is zero
(Same candidate splits as before: Gender gives children (C0: 6, C1: 4) and (C0: 4, C1: 6); Car Type gives Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), Luxury (C0: 1, C1: 7); Customer ID gives one pure, single-record child per customer c1, ..., c20.)
Gain Ratio
 Gain Ratio:
$Gain\ Ratio = \dfrac{Gain_{split}}{Split\ Info}$, where $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log_2\frac{n_i}{n}$
Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$
– Adjusts Information Gain by the entropy of the partitioning (Split Info)
 Higher entropy partitioning (large number of small partitions) is penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain
Gain Ratio
 Gain Ratio:
$Gain\ Ratio = \dfrac{Gain_{split}}{Split\ Info}$, $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log_2\frac{n_i}{n}$
Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$
CarType multi-way split (Family / Sports / Luxury): C1 = 1 / 8 / 1, C2 = 3 / 0 / 7; Gini = 0.163; SplitINFO = 1.52
CarType {Sports, Luxury} vs {Family}: C1 = 9 / 1, C2 = 7 / 3; Gini = 0.468; SplitINFO = 0.72
CarType {Sports} vs {Family, Luxury}: C1 = 8 / 2, C2 = 0 / 10; Gini = 0.167; SplitINFO = 0.97
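The SplitInfo values follow from the partition sizes alone (4/8/8, 16/4, and 8/12 out of 20 records, read off the count matrices above); a small sketch verifying them:

```python
# SplitInfo (entropy of the partitioning) for each candidate split (sketch).
from math import log2

def split_info(sizes):
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

print(round(split_info([4, 8, 8]), 2))   # 1.52  (multi-way split)
print(round(split_info([16, 4]), 2))     # 0.72  ({Sports, Luxury} vs {Family})
print(round(split_info([8, 12]), 2))     # 0.97  ({Sports} vs {Family, Luxury})
```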
Measure of Impurity: Classification Error
 Classification error at a node t:
$Error(t) = 1 - \max_i\,[p_i(t)]$
– Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least interesting situation
– Minimum of 0 when all records belong to one class, implying the most interesting situation
Computing Error of a Single Node
C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 - max(0, 1) = 1 - 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
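The same node-level computation as a short sketch:

```python
# Classification error of a single node from its class counts (sketch).
def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(c / n for c in counts)

print(classification_error([0, 6]))             # 0.0
print(round(classification_error([1, 5]), 3))   # 1/6 ~ 0.167
print(round(classification_error([2, 4]), 3))   # 1/3 ~ 0.333
```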
Comparison among Impurity Measures
For a 2-class problem: (figure: Gini index, entropy, and misclassification error plotted against p, the fraction of records in one class; all three measures are 0 at p = 0 and p = 1 and reach their maximum at p = 0.5)
Misclassification Error vs Gini Index
Split on attribute A into nodes N1 and N2.
Parent: C1 = 7, C2 = 3; Gini = 0.42
N1: C1 = 3, C2 = 0; Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
N2: C1 = 4, C2 = 3; Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
Gini improves, but the misclassification error remains the same!!
Misclassification Error vs Gini Index
Split on attribute A into nodes N1 and N2. Parent: C1 = 7, C2 = 3; Gini = 0.42
Split 1: N1 = (C1: 3, C2: 0), N2 = (C1: 4, C2: 3); Gini = 0.342
Split 2: N1 = (C1: 3, C2: 1), N2 = (C1: 4, C2: 2); Gini = 0.416
Misclassification error for all three cases = 0.3!
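A sketch comparing the three cases: the weighted misclassification error stays at 0.3 while the weighted Gini differs (the exact values differ slightly from the slide, which rounds intermediate results):

```python
# Weighted misclassification error vs weighted Gini for the three cases (sketch).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

def weighted(metric, children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * metric(c) for c in children)

for name, children in [("no split", [[7, 3]]),
                       ("split 1 ", [[3, 0], [4, 3]]),
                       ("split 2 ", [[3, 1], [4, 2]])]:
    print(name, round(weighted(error, children), 3), round(weighted(gini, children), 3))
# error: 0.3 in all three cases; Gini: 0.42, 0.343, 0.417 (0.342 and 0.416 on the slide)
```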
Decision Tree Based Classification
 Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are
employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes (unless the attributes are
interacting)
 Disadvantages:
– Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
– Each decision boundary involves only a single attribute
Handling interactions
(Scatter plot over attributes X and Y; + : 1000 instances, o : 1000 instances; Entropy (X) : 0.99, Entropy (Y) : 0.99)
Handling interactions given irrelevant attributes
+ : 1000 instances, o : 1000 instances (plotted over attributes X and Y)
Adding Z as a noisy attribute generated from a uniform distribution
Entropy (X) : 0.99, Entropy (Y) : 0.99, Entropy (Z) : 0.98
Attribute Z will be chosen for splitting!
Limitations of single attribute-based decision boundaries
Both positive (+) and
negative (o) classes
generated from
skewed Gaussians
with centers at (8,8)
and (12,12)
respectively.

More Related Content

What's hot

Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
Abhinav Tyagi
 
UNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningUNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data Mining
Nandakumar P
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
varshakumar21
 
Text Classification
Text ClassificationText Classification
Text Classification
RAX Automation Suite
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
Kuppusamy P
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVM
Trilok Sharma
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
Knoldus Inc.
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
Lippo Group Digital
 
Statistical learning
Statistical learningStatistical learning
Statistical learning
Slideshare
 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)
Shweta Ghate
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
Abdelfattah Al Zaqqa
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Salah Amean
 
Social Recommender Systems
Social Recommender SystemsSocial Recommender Systems
Social Recommender Systems
guest77b0cd12
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm
Kedar Damkondwar
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
Datamining Tools
 
Representation learning on graphs
Representation learning on graphsRepresentation learning on graphs
Representation learning on graphs
Deakin University
 
Linear regression
Linear regressionLinear regression
Linear regression
MartinHogg9
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorial
Bilkent University
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
Simplilearn
 

What's hot (20)

Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
UNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningUNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data Mining
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVM
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Statistical learning
Statistical learningStatistical learning
Statistical learning
 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 
Social Recommender Systems
Social Recommender SystemsSocial Recommender Systems
Social Recommender Systems
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Representation learning on graphs
Representation learning on graphsRepresentation learning on graphs
Representation learning on graphs
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorial
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 

Similar to Basic Classification.pptx

data mining Module 4.ppt
data mining Module 4.pptdata mining Module 4.ppt
data mining Module 4.ppt
PremaJain2
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
sathish sak
 
Classification
ClassificationClassification
Classification
AuliyaRahman9
 
Data Mining Lecture_10(a).pptx
Data Mining Lecture_10(a).pptxData Mining Lecture_10(a).pptx
Data Mining Lecture_10(a).pptx
Subrata Kumer Paul
 
3830599
38305993830599
3830599
mohsen1394
 
chap4_basic_classification(2).ppt
chap4_basic_classification(2).pptchap4_basic_classification(2).ppt
chap4_basic_classification(2).ppt
ssuserfdf196
 
chap4_basic_classification.ppt
chap4_basic_classification.pptchap4_basic_classification.ppt
chap4_basic_classification.ppt
BantiParsaniya
 
Slide3.ppt
Slide3.pptSlide3.ppt
Slide3.ppt
butest
 
datamining-lect11.pptx
datamining-lect11.pptxdatamining-lect11.pptx
datamining-lect11.pptx
LazherZaidi1
 
Dr. Oner CelepcikayITS 632ITS 632Week 4Classification
Dr. Oner CelepcikayITS 632ITS 632Week 4ClassificationDr. Oner CelepcikayITS 632ITS 632Week 4Classification
Dr. Oner CelepcikayITS 632ITS 632Week 4Classification
DustiBuckner14
 
Machine Learning - Classification
Machine Learning - ClassificationMachine Learning - Classification
Machine Learning - Classification
Darío Garigliotti
 
Mayo_tutorial_July14.ppt
Mayo_tutorial_July14.pptMayo_tutorial_July14.ppt
Mayo_tutorial_July14.ppt
ImXaib
 
Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computer
tttiba
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
ImXaib
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
anjanishah774
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
ssuserb26f53
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
DEEPAK948083
 
Classification
ClassificationClassification
Classification
Anurag jain
 
chap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptxchap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptx
GautamDematti1
 
NN Classififcation Neural Network NN.pptx
NN Classififcation   Neural Network NN.pptxNN Classififcation   Neural Network NN.pptx
NN Classififcation Neural Network NN.pptx
cmpt cmpt
 

Similar to Basic Classification.pptx (20)

data mining Module 4.ppt
data mining Module 4.pptdata mining Module 4.ppt
data mining Module 4.ppt
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
 
Classification
ClassificationClassification
Classification
 
Data Mining Lecture_10(a).pptx
Data Mining Lecture_10(a).pptxData Mining Lecture_10(a).pptx
Data Mining Lecture_10(a).pptx
 
3830599
38305993830599
3830599
 
chap4_basic_classification(2).ppt
chap4_basic_classification(2).pptchap4_basic_classification(2).ppt
chap4_basic_classification(2).ppt
 
chap4_basic_classification.ppt
chap4_basic_classification.pptchap4_basic_classification.ppt
chap4_basic_classification.ppt
 
Slide3.ppt
Slide3.pptSlide3.ppt
Slide3.ppt
 
datamining-lect11.pptx
datamining-lect11.pptxdatamining-lect11.pptx
datamining-lect11.pptx
 
Dr. Oner CelepcikayITS 632ITS 632Week 4Classification
Dr. Oner CelepcikayITS 632ITS 632Week 4ClassificationDr. Oner CelepcikayITS 632ITS 632Week 4Classification
Dr. Oner CelepcikayITS 632ITS 632Week 4Classification
 
Machine Learning - Classification
Machine Learning - ClassificationMachine Learning - Classification
Machine Learning - Classification
 
Mayo_tutorial_July14.ppt
Mayo_tutorial_July14.pptMayo_tutorial_July14.ppt
Mayo_tutorial_July14.ppt
 
Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computer
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
 
Classification
ClassificationClassification
Classification
 
chap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptxchap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptx
 
NN Classififcation Neural Network NN.pptx
NN Classififcation   Neural Network NN.pptxNN Classififcation   Neural Network NN.pptx
NN Classififcation Neural Network NN.pptx
 

More from PerumalPitchandi

Agile Methodology-extreme programming-23.07.2020.ppt
Agile Methodology-extreme programming-23.07.2020.pptAgile Methodology-extreme programming-23.07.2020.ppt
Agile Methodology-extreme programming-23.07.2020.ppt
PerumalPitchandi
 
Lecture Notes on Recommender System Introduction
Lecture Notes on Recommender System IntroductionLecture Notes on Recommender System Introduction
Lecture Notes on Recommender System Introduction
PerumalPitchandi
 
22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx
PerumalPitchandi
 
biv_mult.ppt
biv_mult.pptbiv_mult.ppt
biv_mult.ppt
PerumalPitchandi
 
ppt_ids-data science.pdf
ppt_ids-data science.pdfppt_ids-data science.pdf
ppt_ids-data science.pdf
PerumalPitchandi
 
ANOVA Presentation.ppt
ANOVA Presentation.pptANOVA Presentation.ppt
ANOVA Presentation.ppt
PerumalPitchandi
 
Data Science Intro.pptx
Data Science Intro.pptxData Science Intro.pptx
Data Science Intro.pptx
PerumalPitchandi
 
Descriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.pptDescriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.ppt
PerumalPitchandi
 
SW_Cost_Estimation.ppt
SW_Cost_Estimation.pptSW_Cost_Estimation.ppt
SW_Cost_Estimation.ppt
PerumalPitchandi
 
CostEstimation-1.ppt
CostEstimation-1.pptCostEstimation-1.ppt
CostEstimation-1.ppt
PerumalPitchandi
 
20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt
PerumalPitchandi
 
20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx
PerumalPitchandi
 
Capability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptxCapability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptx
PerumalPitchandi
 
Comparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptxComparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptx
PerumalPitchandi
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
PerumalPitchandi
 
FDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.pptFDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.ppt
PerumalPitchandi
 
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.pptFDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.ppt
PerumalPitchandi
 
AgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.pptAgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.ppt
PerumalPitchandi
 
Agile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptxAgile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptx
PerumalPitchandi
 
Data_exploration.ppt
Data_exploration.pptData_exploration.ppt
Data_exploration.ppt
PerumalPitchandi
 

More from PerumalPitchandi (20)

Agile Methodology-extreme programming-23.07.2020.ppt
Agile Methodology-extreme programming-23.07.2020.pptAgile Methodology-extreme programming-23.07.2020.ppt
Agile Methodology-extreme programming-23.07.2020.ppt
 
Lecture Notes on Recommender System Introduction
Lecture Notes on Recommender System IntroductionLecture Notes on Recommender System Introduction
Lecture Notes on Recommender System Introduction
 
22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx
 
biv_mult.ppt
biv_mult.pptbiv_mult.ppt
biv_mult.ppt
 
ppt_ids-data science.pdf
ppt_ids-data science.pdfppt_ids-data science.pdf
ppt_ids-data science.pdf
 
ANOVA Presentation.ppt
ANOVA Presentation.pptANOVA Presentation.ppt
ANOVA Presentation.ppt
 
Data Science Intro.pptx
Data Science Intro.pptxData Science Intro.pptx
Data Science Intro.pptx
 
Descriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.pptDescriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.ppt
 
SW_Cost_Estimation.ppt
SW_Cost_Estimation.pptSW_Cost_Estimation.ppt
SW_Cost_Estimation.ppt
 
CostEstimation-1.ppt
CostEstimation-1.pptCostEstimation-1.ppt
CostEstimation-1.ppt
 
20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt
 
20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx
 
Capability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptxCapability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptx
 
Comparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptxComparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptx
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
FDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.pptFDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.ppt
 
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.pptFDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.ppt
 
AgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.pptAgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.ppt
 
Agile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptxAgile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptx
 
Data_exploration.ppt
Data_exploration.pptData_exploration.ppt
Data_exploration.ppt
 

Recently uploaded

4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
Gino153088
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
UReason
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
Software Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.pptSoftware Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.ppt
TaghreedAltamimi
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
An improved modulation technique suitable for a three level flying capacitor ...
An improved modulation technique suitable for a three level flying capacitor ...An improved modulation technique suitable for a three level flying capacitor ...
An improved modulation technique suitable for a three level flying capacitor ...
IJECEIAES
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 
integral complex analysis chapter 06 .pdf
integral complex analysis chapter 06 .pdfintegral complex analysis chapter 06 .pdf
integral complex analysis chapter 06 .pdf
gaafergoudaay7aga
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
shadow0702a
 
Rainfall intensity duration frequency curve statistical analysis and modeling...
Rainfall intensity duration frequency curve statistical analysis and modeling...Rainfall intensity duration frequency curve statistical analysis and modeling...
Rainfall intensity duration frequency curve statistical analysis and modeling...
bijceesjournal
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
SakkaravarthiShanmug
 

Recently uploaded (20)

4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
Software Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.pptSoftware Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.ppt
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
An improved modulation technique suitable for a three level flying capacitor ...
An improved modulation technique suitable for a three level flying capacitor ...An improved modulation technique suitable for a three level flying capacitor ...
An improved modulation technique suitable for a three level flying capacitor ...
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 
integral complex analysis chapter 06 .pdf
integral complex analysis chapter 06 .pdfintegral complex analysis chapter 06 .pdf
integral complex analysis chapter 06 .pdf
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
 
Rainfall intensity duration frequency curve statistical analysis and modeling...
Rainfall intensity duration frequency curve statistical analysis and modeling...Rainfall intensity duration frequency curve statistical analysis and modeling...
Rainfall intensity duration frequency curve statistical analysis and modeling...
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
 

Basic Classification.pptx

  • 1. Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar 2/1/2021 Introduction to Data Mining, 2nd Edition 1
  • 2. Classification: Definition l Given a collection of records (training set ) – Each record is by characterized by a tuple (x,y), where x is the attribute set and y is the class label  x: attribute, predictor, independent variable, input  y: class, response, dependent variable, output l Task: – Learn a model that maps each attribute set x into one of the predefined class labels y 2/1/2021 Introduction to Data Mining, 2nd Edition 2
  • 3. Examples of Classification Task Task Attribute set, x Class label, y Categorizing email messages Features extracted from email message header and content spam or non-spam Identifying tumor cells Features extracted from x-rays or MRI scans malignant or benign cells Cataloging galaxies Features extracted from telescope images Elliptical, spiral, or irregular-shaped galaxies 2/1/2021 Introduction to Data Mining, 2nd Edition 3
  • 4. General Approach for Building Classification Model 2/1/2021 Introduction to Data Mining, 2nd Edition 4
  • 5. Classification Techniques  Base Classifiers – Decision Tree based Methods – Rule-based Methods – Nearest-neighbor – Naïve Bayes and Bayesian Belief Networks – Support Vector Machines – Neural Networks, Deep Neural Nets  Ensemble Classifiers – Boosting, Bagging, Random Forests 2/1/2021 Introduction to Data Mining, 2nd Edition 5
  • 6. Example of a Decision Tree ID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Home Owner MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Splitting Attributes Training Data Model: Decision Tree 2/1/2021 Introduction to Data Mining, 2nd Edition 6
  • 7. Apply Model to Test Data Home Owner MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Home Owner Marital Status Annual Income Defaulted Borrower No Married 80K ? 10 Test Data Start from the root of tree. 2/1/2021 Introduction to Data Mining, 2nd Edition 7
  • 8. Apply Model to Test Data MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Home Owner Marital Status Annual Income Defaulted Borrower No Married 80K ? 10 Test Data Home Owner 2/1/2021 Introduction to Data Mining, 2nd Edition 8
  • 9. Apply Model to Test Data MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Home Owner Marital Status Annual Income Defaulted Borrower No Married 80K ? 10 Test Data Home Owner 2/1/2021 Introduction to Data Mining, 2nd Edition 9
  • 10. Apply Model to Test Data MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Home Owner Marital Status Annual Income Defaulted Borrower No Married 80K ? 10 Test Data Home Owner 2/1/2021 Introduction to Data Mining, 2nd Edition 10
  • 11. Apply Model to Test Data MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Home Owner Marital Status Annual Income Defaulted Borrower No Married 80K ? 10 Test Data Home Owner 2/1/2021 Introduction to Data Mining, 2nd Edition 11
  • 12. Apply Model to Test Data MarSt Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Home Owner Marital Status Annual Income Defaulted Borrower No Married 80K ? 10 Test Data Assign Defaulted to “No” Home Owner 2/1/2021 Introduction to Data Mining, 2nd Edition 12
  • 13. Another Example of Decision Tree MarSt Home Owner Income YES NO NO NO Yes No Married Single, Divorced < 80K > 80K There could be more than one tree that fits the same data! ID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 2/1/2021 Introduction to Data Mining, 2nd Edition 13
  • 14. Decision Tree Classification Task Apply Model Induction Deduction Learn Model Model Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes 10 Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K ? 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? 10 Test Set Tree Induction algorithm Training Set Decision Tree 2/1/2021 Introduction to Data Mining, 2nd Edition 14
  • 15. Decision Tree Induction  Many Algorithms: – Hunt’s Algorithm (one of the earliest) – CART – ID3, C4.5 – SLIQ,SPRINT 2/1/2021 Introduction to Data Mining, 2nd Edition 15
  • 16. General Structure of Hunt’s Algorithm  Let Dt be the set of training records that reach a node t  General Procedure: – If Dt contains records that belong the same class yt, then t is a leaf node labeled as yt – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. Dt ? ID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 2/1/2021 Introduction to Data Mining, 2nd Edition 16
  • 17–20. Hunt’s Algorithm (worked example on the loan data; class counts shown as (No, Yes)):
  – (a) Single leaf labeled Defaulted = No (7,3).
  – (b) Split on Home Owner: Yes → Defaulted = No (3,0); No → impure node (4,3).
  – (c) Split the Home Owner = No branch on Marital Status: Married → Defaulted = No (3,0); Single, Divorced → impure node (1,3).
  – (d) Split the Single, Divorced branch on Annual Income: < 80K → Defaulted = No (1,0); >= 80K → Defaulted = Yes (0,3).
  • 21. Design Issues of Decision Tree Induction —
  – How should training records be split? Need a method for expressing the test condition (depending on attribute type) and a measure for evaluating the goodness of a test condition.
  – How should the splitting procedure stop? Stop splitting if all the records belong to the same class or have identical attribute values; early termination is also possible.
  • 22. Methods for Expressing Test Conditions — Depends on the attribute type: binary, nominal, ordinal, or continuous.
  • 23. Test Condition for Nominal Attributes — Multi-way split: use as many partitions as distinct values (e.g., Marital Status → Single / Divorced / Married). Binary split: divide the values into two subsets, e.g., {Single} vs. {Married, Divorced}, {Married} vs. {Single, Divorced}, or {Single, Married} vs. {Divorced}.
  • 24. Test Condition for Ordinal Attributes — Multi-way split: use as many partitions as distinct values (e.g., Shirt Size → Small / Medium / Large / Extra Large). Binary split: divide the values into two subsets while preserving the order property among attribute values, e.g., {Small} vs. {Medium, Large, Extra Large} or {Small, Medium} vs. {Large, Extra Large}. A grouping such as {Small, Large} vs. {Medium, Extra Large} violates the order property.
  • 25. Test Condition for Continuous Attributes — (i) Binary split: e.g., Annual Income > 80K? (Yes/No). (ii) Multi-way split: e.g., Annual Income discretized into ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.
  • 26. Splitting Based on Continuous Attributes — Different ways of handling:
  – Discretization to form an ordinal categorical attribute. Ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering. Static: discretize once at the beginning. Dynamic: repeat at each node. (A small binning sketch follows.)
  – Binary decision: (A < v) or (A ≥ v). Consider all possible splits and find the best cut; this can be more compute intensive.
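As a side illustration of the two bucketing strategies named above (not from the slides; the bucket count and helper names are my own), here is a minimal sketch using the Annual Income values of the running example:

```python
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]    # Annual Income values (in K)

def equal_width_cuts(values, k):
    """Equal-interval bucketing: k buckets of identical width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]           # k - 1 cut points

def equal_frequency_cuts(values, k):
    """Equal-frequency bucketing: cut points taken at simple k-quantiles."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // k] for i in range(1, k)]

equal_width_cuts(incomes, 4)       # [100.0, 140.0, 180.0]
equal_frequency_cuts(incomes, 4)   # [75, 95, 120] -- roughly the 25th/50th/75th percentiles
```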
  • 27. How to determine the Best Split — Before splitting: 10 records of class C0 and 10 records of class C1. Candidate test conditions: Gender (two children with class counts C0: 6, C1: 4 and C0: 4, C1: 6), Car Type (Family → C0: 1, C1: 3; Sports → C0: 8, C1: 0; Luxury → C0: 1, C1: 7), and Customer ID (one child c1 … c20 per record). Which test condition is the best?
  • 28. How to determine the Best Split — Greedy approach: nodes with a purer class distribution are preferred, so we need a measure of node impurity. Example: a node with counts C0: 5, C1: 5 has a high degree of impurity, while a node with C0: 9, C1: 1 has a low degree of impurity.
  • 29. Measures of Node Impurity — where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes:
  – Gini index: $\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
  – Entropy: $\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$
  – Misclassification error: $\text{Classification error} = 1 - \max_i [p_i(t)]$
  (A short code illustration follows.)
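A quick way to see how the three measures behave is to compute them from a node's class counts. This is a minimal sketch with my own function name; the two example nodes reproduce values that appear in the worked examples on later slides.

```python
import math

def impurities(class_counts):
    """Gini index, entropy, and classification error for one node,
    given the per-class record counts at that node."""
    n = sum(class_counts)
    probs = [count / n for count in class_counts]
    gini = 1.0 - sum(p ** 2 for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)   # convention: 0 log 0 = 0
    error = 1.0 - max(probs)
    return gini, entropy, error

impurities([3, 3])   # (0.5, 1.0, 0.5)  -- evenly split node, maximal impurity
impurities([1, 5])   # (~0.278, ~0.650, ~0.167) -- matches the single-node examples below
```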
  • 30. Finding the Best Split — 1. Compute the impurity measure (P) of the node before splitting. 2. Compute the impurity measure (M) after splitting: compute the impurity of each child node; M is the weighted impurity of the child nodes. 3. Choose the attribute test condition that produces the highest gain, Gain = P − M, or equivalently the lowest impurity measure after splitting (M).
  • 31. Finding the Best Split (illustration) — Before splitting, the parent has class counts (N00, N01) and impurity P. Splitting on attribute A gives children N1, N2 with impurities M11, M12 and weighted impurity M1; splitting on attribute B gives children N3, N4 with impurities M21, M22 and weighted impurity M2. Compare Gain = P − M1 vs. P − M2.
  • 32. Measure of Impurity: GINI — Gini index for a given node t: $\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$, where $p_i(t)$ is the frequency of class i at node t and c is the total number of classes.
  – Maximum of 1 − 1/c when records are equally distributed among all classes, implying the least beneficial situation for classification.
  – Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification.
  – The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT.
  • 33. Measure of Impurity: GINI — For a 2-class problem with class probabilities (p, 1 − p): GINI = 1 − p² − (1 − p)² = 2p(1 − p). Examples: C1 = 0, C2 = 6 → Gini = 0.000; C1 = 1, C2 = 5 → Gini = 0.278; C1 = 2, C2 = 4 → Gini = 0.444; C1 = 3, C2 = 3 → Gini = 0.500.
  • 34. Computing Gini Index of a Single Node —
  – C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0.
  – C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278.
  – C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444.
  • 35. Computing Gini Index for a Collection of Nodes — When a node p is split into k partitions (children): $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$, where $n_i$ is the number of records at child i and n is the number of records at the parent node p. (A short sketch follows.)
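To make the weighted-impurity computation concrete, here is a small sketch with my own helper names; it reproduces the binary-split example on the next slide and, for reference, the CarType count matrices on slide 37.

```python
def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((count / n) ** 2 for count in class_counts)

def gini_split(children_counts):
    """Weighted Gini of a split; children_counts lists the per-child class counts."""
    n = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / n * gini(counts) for counts in children_counts)

parent = [7, 5]                          # C1 = 7, C2 = 5 -> Gini ~ 0.486
m = gini_split([[5, 1], [2, 4]])         # N1 = (5, 1), N2 = (2, 4) -> ~ 0.361
gain = gini(parent) - m                  # ~ 0.125, as on the next slide

# CarType count matrices from slide 37:
gini_split([[1, 3], [8, 0], [1, 7]])     # multi-way -> ~ 0.163
gini_split([[9, 7], [1, 3]])             # {Sports, Luxury} vs {Family} -> ~ 0.469 (0.468 on the slide)
gini_split([[8, 0], [2, 10]])            # {Sports} vs {Family, Luxury} -> ~ 0.167
```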
  • 36. Binary Attributes: Computing GINI Index — Attribute B splits the records into two partitions (child nodes); weighting the partitions by size means larger and purer partitions are sought. Parent: C1 = 7, C2 = 5, Gini = 0.486. Children: N1 = (C1: 5, C2: 1), N2 = (C1: 2, C2: 4). Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278; Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444. Weighted Gini of N1 and N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361. Gain = 0.486 − 0.361 = 0.125.
  • 37. Categorical Attributes: Computing Gini Index — For each distinct value, gather the counts for each class in the dataset and use the count matrix to make decisions.
  – Multi-way split on CarType (Family / Sports / Luxury): C1 = (1, 8, 1), C2 = (3, 0, 7); Gini = 0.163.
  – Two-way split {Sports, Luxury} vs. {Family}: C1 = (9, 1), C2 = (7, 3); Gini = 0.468.
  – Two-way split {Sports} vs. {Family, Luxury}: C1 = (8, 2), C2 = (0, 10); Gini = 0.167.
  For a two-way split, find the best partition of values. Which of these is the best?
  • 38. Continuous Attributes: Computing Gini Index — Use binary decisions based on one value; the number of possible splitting values equals the number of distinct values. Each splitting value v has a count matrix associated with it: the class counts in each of the partitions, A ≤ v and A > v. A simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient because it repeats work. Example (Annual Income, split at 80): ≤ 80 → Defaulted Yes = 0, Defaulted No = 3; > 80 → Defaulted Yes = 3, Defaulted No = 4.
  • 39–43. Continuous Attributes: Computing Gini Index (efficient computation) — For each attribute: sort the attribute on its values; linearly scan these values, each time updating the count matrix and computing the Gini index; choose the split position that has the least Gini index. (A code sketch of this scan follows.)
  Sorted values (Defaulted):    No   No   No  Yes  Yes  Yes   No   No   No   No
  Annual Income (sorted):       60   70   75   85   90   95  100  120  125  220
  Candidate split positions:    55   65   72   80   87   92   97  110  122  172  230
  Yes (≤ | >):                 0|3  0|3  0|3  0|3  1|2  2|1  3|0  3|0  3|0  3|0  3|0
  No  (≤ | >):                 0|7  1|6  2|5  3|4  3|4  3|4  3|4  4|3  5|2  6|1  7|0
  Gini:                       .420 .400 .375 .343 .417 .400 .300 .343 .375 .400 .420
  The minimum Gini (0.300) occurs at split position 97, so Annual Income ≤ 97 vs. > 97 is the best cut for this attribute.
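The one-pass scan described above can be sketched as follows. This is an illustrative implementation with my own names, using midpoints between consecutive sorted values as candidate cuts (the slide also lists cuts below the minimum and above the maximum value, which just reproduce the unsplit node and cannot win here).

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels, classes=("Yes", "No")):
    """One pass over the sorted values: update the count matrix incrementally and
    keep the candidate cut (midpoint between consecutive values) with the lowest
    weighted Gini. Records with value <= cut go to the left partition."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}
    best_cut, best_gini = None, float("inf")
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1                          # move one record to the left side
        cut = (value + pairs[i + 1][0]) / 2       # candidate split position
        right = {c: total[c] - left[c] for c in classes}
        weighted = sum(sum(side.values()) / n * gini(list(side.values()))
                       for side in (left, right))
        if weighted < best_gini:
            best_cut, best_gini = cut, weighted
    return best_cut, best_gini

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]          # in thousands
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
best_split(incomes, defaulted)   # cut ~ 97.5, weighted Gini ~ 0.30 (the slide uses split value 97)
```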
  • 44. Measure of Impurity: Entropy — Entropy at a given node t: $\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$, where $p_i(t)$ is the frequency of class i at node t and c is the total number of classes.
  – Maximum of log₂ c when records are equally distributed among all classes, implying the least beneficial situation for classification.
  – Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification.
  – Entropy-based computations are quite similar to the GINI index computations.
  • 45. Computing Entropy of a Single Node —
  – C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = −0 log₂ 0 − 1 log₂ 1 = −0 − 0 = 0 (taking 0 log 0 = 0).
  – C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = −(1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65.
  – C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = −(2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92.
  • 46. Computing Information Gain After Splitting — Information gain when a parent node p is split into k partitions (children): $Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$, where $n_i$ is the number of records in child node i.
  – Choose the split that achieves the most reduction (maximizes the gain).
  – Used in the ID3 and C4.5 decision tree algorithms.
  – Information gain is the mutual information between the class variable and the splitting variable.
  • 47. Problem with a large number of partitions — Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure. In the example from slide 27, Customer ID has the highest information gain because the entropy of every child node is zero, even though such a split does not generalize to new records.
  • 48. Gain Ratio — When a parent node p is split into k partitions (children), with $n_i$ records in child i: $Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}$, where $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log_2 \frac{n_i}{n}$.
  – Adjusts information gain by the entropy of the partitioning (Split Info); higher-entropy partitioning (a large number of small partitions) is penalized.
  – Used in the C4.5 algorithm; designed to overcome the disadvantage of information gain.
  • 49. Gain Ratio (CarType example) — Using the same count matrices as slide 37:
  – Two-way split {Sports, Luxury} vs. {Family} (16 vs. 4 records): Gini = 0.468, Split Info = 0.72.
  – Two-way split {Sports} vs. {Family, Luxury} (8 vs. 12 records): Gini = 0.167, Split Info = 0.97.
  – Multi-way split Family / Sports / Luxury (4 / 8 / 8 records): Gini = 0.163, Split Info = 1.52.
  The multi-way split has the largest Split Info, so its gain is divided by the largest factor. (A short sketch follows.)
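To tie the Split Info numbers above back to the formula, here is a small self-contained sketch (function names are mine). It reproduces the three Split Info values and computes one gain ratio for the multi-way split, assuming the parent node has 10 records of each class, consistent with the count matrices.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_info(children_counts):
    """Entropy of the partitioning itself: depends only on the child sizes."""
    n = sum(sum(c) for c in children_counts)
    return -sum((sum(c) / n) * math.log2(sum(c) / n)
                for c in children_counts if sum(c) > 0)

def gain_ratio(parent_counts, children_counts):
    n = sum(parent_counts)
    gain = entropy(parent_counts) - sum(sum(c) / n * entropy(c) for c in children_counts)
    return gain / split_info(children_counts)

split_info([[9, 7], [1, 3]])            # {Sports, Luxury} vs {Family}: ~ 0.72
split_info([[8, 0], [2, 10]])           # {Sports} vs {Family, Luxury}: ~ 0.97
split_info([[1, 3], [8, 0], [1, 7]])    # Family / Sports / Luxury:     ~ 1.52
gain_ratio([10, 10], [[1, 3], [8, 0], [1, 7]])   # ~ 0.41 for the multi-way split
```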
  • 50. Measure of Impurity: Classification Error — Classification error at a node t: $Error(t) = 1 - \max_i [p_i(t)]$.
  – Maximum of 1 − 1/c when records are equally distributed among all classes, implying the least interesting situation.
  – Minimum of 0 when all records belong to one class, implying the most interesting situation.
  • 51. Computing Error of a Single Node —
  – C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1; Error = 1 − max(0, 1) = 1 − 1 = 0.
  – C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6.
  – C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3.
  • 52. Comparison among Impurity Measures — For a 2-class problem, the entropy, Gini index, and misclassification error are plotted as a function of the fraction p of records in one class (figure).
  • 53. Misclassification Error vs Gini Index — Splitting on attribute A: the parent has C1 = 7, C2 = 3, Gini = 0.42. Children: N1 = (C1: 3, C2: 0), N2 = (C1: 4, C2: 3). Gini(N1) = 1 − (3/3)² − (0/3)² = 0; Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489; Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342. The Gini index improves, but the misclassification error remains the same!
  • 54. Misclassification Error vs Gini Index (continued) — An alternative split of the same parent gives N1 = (C1: 3, C2: 1), N2 = (C1: 4, C2: 2) with weighted Gini = 0.416. The misclassification error is 0.3 in all three cases (the parent and both splits), even though the Gini index distinguishes them. (See the short check below.)
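A short numeric check of this claim (helper names are mine): the weighted misclassification error stays at 0.3 for the parent and for both candidate splits, while the weighted Gini does separate them.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

def weighted(measure, children_counts):
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * measure(c) for c in children_counts)

parent = [7, 3]
split_a = [[3, 0], [4, 3]]   # slide 53
split_b = [[3, 1], [4, 2]]   # slide 54
gini(parent), error(parent)                          # 0.42, 0.30
weighted(gini, split_a), weighted(error, split_a)    # ~ 0.342, 0.30
weighted(gini, split_b), weighted(error, split_b)    # ~ 0.417, 0.30
```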
  • 55. Decision Tree Based Classification —
  Advantages:
  – Relatively inexpensive to construct.
  – Extremely fast at classifying unknown records.
  – Easy to interpret for small-sized trees.
  – Robust to noise (especially when methods to avoid overfitting are employed).
  – Can easily handle redundant attributes.
  – Can easily handle irrelevant attributes (unless the attributes are interacting).
  Disadvantages:
  – Due to the greedy nature of the splitting criterion, interacting attributes (which can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
  – Each decision boundary involves only a single attribute.
  • 56. Handling interactions — X and Y are interacting attributes (+ : 1000 instances, o : 1000 instances): individually, Entropy(X) = 0.99 and Entropy(Y) = 0.99, so neither looks informative on its own, but together they can distinguish the two classes (scatter plot omitted).
  • 57. Handling interactions (continued) — (figure only).
  • 58. Handling interactions given irrelevant attributes — + : 1000 instances, o : 1000 instances. Adding Z as a noisy attribute generated from a uniform distribution gives Entropy(X) = 0.99, Entropy(Y) = 0.99, Entropy(Z) = 0.98. Attribute Z will be chosen for splitting!
  • 59. Limitations of single attribute-based decision boundaries — Both the positive (+) and negative (o) classes are generated from skewed Gaussians with centers at (8,8) and (12,12), respectively; decision boundaries that test one attribute at a time cannot separate such classes well.