2. Classification: Definition
Classification is the task of assigning objects to
one of several predefined categories.
3. Classification is the task of learning a target function f
that maps each attribute set x to one of the predefined
class labels y.
The target function is also known informally as a
classification model.
A classification model is useful for the following
purposes:
Descriptive Modeling
Predictive Modeling
5. Given a collection of records (training set)
Each record is characterized by a tuple (x, y), where x
is the attribute set and y is the class label
x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output
6. General Approach for Building Classification Model

[Figure: the Training Set is fed to a Learning algorithm, which learns a Model (Induction); the Model is then applied to the Test Set to predict class labels (Deduction).]

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
7. Evaluation of the Performance of a Classification Model
9. Decision Tree Induction
Consider a simpler version of the vertebrate classification problem described in the previous section.
Instead of classifying the vertebrates into five distinct groups of species, we assign them to two categories: mammals and non-mammals.
11. Decision Tree Induction
A decision tree has three types of nodes:
A root node, which has no incoming edges and zero or more outgoing edges.
Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges. In a decision tree, each leaf node is assigned a class label.
14. Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT
15. Hunt's Algorithm
Let Dt be the set of training records that are associated with node t and y = {y1, y2, . . . , yc} be the class labels.
The following is a recursive definition of Hunt's algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
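A minimal sketch of this recursion in Python, assuming records are given as (attribute-dict, label) pairs and a hypothetical choose_test helper selects the attribute test condition; it is an illustration of the idea, not the textbook's code.

```python
# A minimal sketch of Hunt's algorithm. The data layout (list of
# (attributes, label) pairs) and the choose_test() helper are assumptions
# made for illustration only.
from collections import Counter

def hunt(records, choose_test):
    labels = [y for _, y in records]
    # Step 1: all records share one class -> create a leaf with that label.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Step 2: pick an attribute test condition and partition the records.
    test = choose_test(records)            # e.g. a test on "Home Owner"
    partitions = {}
    for x, y in records:
        outcome = test(x)                  # one child per distinct outcome
        partitions.setdefault(outcome, []).append((x, y))
    if len(partitions) == 1:               # test does not separate the data:
        return {"leaf": Counter(labels).most_common(1)[0][0]}  # majority leaf
    # Recursively apply the algorithm to every child node.
    return {"test": test,
            "children": {o: hunt(part, choose_test)
                         for o, part in partitions.items()}}
```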
16. Hunt's Algorithm

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

[Figure: the trees produced by Hunt's algorithm on this data, with (Defaulted = No, Defaulted = Yes) counts shown at each leaf:
(a) a single leaf labeled Defaulted = No (7,3);
(b) split on Home Owner: Yes -> Defaulted = No (3,0); No -> Defaulted = No (4,3);
(c) the Home Owner = No branch is further split on Marital Status: Married -> Defaulted = No (3,0); Single, Divorced -> Defaulted = Yes (1,3);
(d) the Single/Divorced branch is further split on Annual Income: < 80K -> Defaulted = No (1,0); >= 80K -> Defaulted = Yes (0,3).]
20. Apply Model to Test Data

[Decision tree: Home Owner -- Yes -> NO; No -> Marital Status (MarSt) -- Married -> NO; Single, Divorced -> Annual Income -- < 80K -> NO; >= 80K -> YES]

Test Data
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree.
25. Apply Model to Test Data

Following the Home Owner = No branch from the root and then the Marital Status = Married branch, the test record reaches a leaf node.
Assign Defaulted to "No".
26. Design Issues of Decision Tree Induction
How should training records be split?
Method for specifying test condition
depending on attribute types
Measure for evaluating the goodness of a test condition
How should the splitting procedure stop?
Stop splitting if all the records belong to the same class
or have identical attribute values
Early termination
27. Methods for Expressing Test Conditions
Depends on attribute types
Binary
Nominal
Ordinal
Continuous
29. Test Condition for Nominal Attributes
Multi-way split:
Use as many partitions as distinct values.
  Marital Status -> {Single}, {Divorced}, {Married}
Binary split:
Divides values into two subsets.
  Marital Status -> {Single} vs {Married, Divorced}
  OR  Marital Status -> {Married} vs {Single, Divorced}
  OR  Marital Status -> {Single, Married} vs {Divorced}
30. Test Condition for Ordinal Attributes
Multi-way split:
Use as many partitions as distinct values.
  Shirt Size -> {Small}, {Medium}, {Large}, {Extra Large}
Binary split:
Divides values into two subsets.
Preserve the order property among attribute values.
  Shirt Size -> {Small} vs {Medium, Large, Extra Large}
  Shirt Size -> {Small, Medium} vs {Large, Extra Large}
  Shirt Size -> {Small, Large} vs {Medium, Extra Large}  (this grouping violates the order property)
31. Test Condition for Continuous Attributes
(i) Binary split:  Annual Income > 80K?  ->  Yes / No
(ii) Multi-way split:  Annual Income?  ->  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
32. How to Determine the Best Split
Greedy approach:
Nodes with purer class distribution are preferred.
Need a measure of node impurity:
  C0: 5, C1: 5  --  high degree of impurity
  C0: 9, C1: 1  --  low degree of impurity
33. Measures for Selecting the Best Split
Gini Index:
  GINI(t) = 1 - sum_j [ p(j | t) ]^2
Entropy:
  Entropy(t) = - sum_j p(j | t) log2 p(j | t)
Misclassification error:
  Error(t) = 1 - max_i P(i | t)
where p(j | t) is the relative frequency of class j at node t.
34. Examples of Computing the Different Impurity Measures
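As an illustrative sketch (not from the original slides), the three impurity measures defined above can be computed from a node's per-class record counts:

```python
import math

def impurity(counts):
    """Gini, entropy, and classification error for a node with the
    given per-class record counts, e.g. counts = [3, 3]."""
    n = sum(counts)
    probs = [c / n for c in counts]
    gini = 1 - sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    error = 1 - max(probs)
    return gini, entropy, error

# e.g. impurity([0, 6]) -> (0.0, 0.0, 0.0)   (a pure node)
#      impurity([3, 3]) -> (0.5, 1.0, 0.5)   (maximally impure 2-class node)
```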
35. Comparison among Impurity Measures
For a 2-class problem:
36. Finding the Best Split for Binary Attributes
37. If attribute A is chosen to split the data, the Gini index for node N1 is 0.4898 and for node N2 it is 0.480. The weighted average of the Gini index for the descendant nodes is (7/12) x 0.4898 + (5/12) x 0.480 = 0.486. Similarly, we can show that the weighted average of the Gini index for attribute B is 0.375.
38. Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Two-way split (find best partition of values):

CarType   {Sports, Luxury}   {Family}
C1        9                  1
C2        7                  3
Gini      0.468

CarType   {Sports}   {Family, Luxury}
C1        8          2
C2        0          10
Gini      0.167

Multi-way split:

CarType   Family   Sports   Luxury
C1        1        8        1
C2        3        0        7
Gini      0.163

Which of these is the best?
39. Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted Values (Annual Income):  60  70  75  85  90  95  100  120  125  220
Cheat:                          No  No  No  Yes Yes Yes No   No   No   No

Split Position:   55     65     72     80     87     92     97     110    122    172    230
Yes  (<= | >):    0 | 3  0 | 3  0 | 3  0 | 3  1 | 2  2 | 1  3 | 0  3 | 0  3 | 0  3 | 0  3 | 0
No   (<= | >):    0 | 7  1 | 6  2 | 5  3 | 4  3 | 4  3 | 4  3 | 4  4 | 3  5 | 2  6 | 1  7 | 0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
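A hedged sketch of this sort-and-scan procedure in Python; the function name and data layout are assumptions made for illustration.

```python
# Sort the attribute values once, then sweep candidate split positions while
# updating the class counts on each side of the split.
def best_gini_split(values, labels):
    classes = sorted(set(labels))
    pairs = sorted(zip(values, labels))           # sort on attribute values
    right = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}
    n, best = len(pairs), (float("inf"), None)

    def gini(counts, total):
        return 1 - sum((v / total) ** 2 for v in counts.values()) if total else 0

    for i in range(n - 1):
        c = pairs[i][1]
        left[c] += 1                              # move one record to the left side
        right[c] -= 1
        split = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint split position
        w = ((i + 1) / n) * gini(left, i + 1) + ((n - i - 1) / n) * gini(right, n - i - 1)
        best = min(best, (w, split))
    return best                                   # (weighted Gini, split value)

# e.g. best_gini_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
#                      ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"])
# finds the split near 97 with weighted Gini 0.300 (cf. the table above).
```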
40. Consider the training examples shown in Table for a binary
classification problem.
41. (a) Compute the Gini index for the overall collection of
training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using
multiway split.
(e) Compute the Gini index for the Shirt Size attribute using
multiway split.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
42. Algorithm for Decision Tree Induction
43. Characteristics of Decision Tree Induction
Decision tree induction is a nonparametric
approach for building classification models.
Finding an optimal decision tree is an NP-complete
problem.
Techniques developed for constructing decision trees
are computationally inexpensive
Decision trees, especially smaller-sized trees, are
relatively easy to interpret.
44. Decision trees provide an expressive representation for learning discrete-valued functions.
Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
The presence of redundant attributes does not adversely affect the accuracy of decision trees.
45. Characteristics of Decision Tree Induction
Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of records becomes smaller as we traverse down the tree.
A subtree can be replicated multiple times in a decision tree.
46. Consider the training examples shown in Table 4.2 for a binary classification
problem.
(a) What is the entropy of this collection of training examples with respect to the positive
class?
(b) What are the information gains of a1 and a2 relative to these training examples?
(c) For a3, which is a continuous attribute, compute the information gain for every possible
split.
(d)What is the best split (among a1, a2, and a3) according to the information gain?
(e) What is the best split (between a1 and a2) according to the classification error rate?
(f) What is the best split (between a1 and a2) according to the Gini index?
47. Rule-Based Classifier
Classify records by using a collection of "if…then…" rules
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
48. Rule Representation
The left-hand side of the rule is called the rule antecedent or precondition.
It contains a conjunction of attribute tests:
Condition_i = (A1 op v1) ∧ (A2 op v2) ∧ . . . ∧ (Ak op vk)
where (Aj, vj) is an attribute-value pair and op is a logical operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (Aj op vj) is known as a conjunct.
The right-hand side of the rule is called the rule consequent, which contains the predicted class yi.
51. Rule Coverage and Accuracy
Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Rule: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
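As a small illustration, the coverage and accuracy of the rule (Status = Single) → No can be checked directly against the table above; the data layout below is my own.

```python
# Each record is (Marital Status, Class), taken from the table above.
records = [("Single", "No"), ("Married", "No"), ("Single", "No"),
           ("Married", "No"), ("Divorced", "Yes"), ("Married", "No"),
           ("Divorced", "No"), ("Single", "Yes"), ("Married", "No"),
           ("Single", "Yes")]

covered = [(s, c) for s, c in records if s == "Single"]      # antecedent holds
coverage = len(covered) / len(records)                        # 4/10 = 40%
accuracy = sum(c == "No" for _, c in covered) / len(covered)  # 2/4 = 50%
```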
53. Characteristics of Rule-Based Classifier
Mutually exclusive rules
Classifier contains mutually exclusive rules if the rules
are independent of each other
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
Each record is covered by at least one rule
54. Effect of Rule Simplification
Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
55. Ordered Rule Set
Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has
triggered
If none of the rules fired, it is assigned to the default class
56. Rule Ordering Schemes
Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
58. How to Build a Rule-Based Classifier
Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
e.g.: C4.5rules
59. Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met (see the sketch below)
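A conceptual sketch of this loop; learn_one_rule and the rule/record representations are hypothetical placeholders, not part of the original slides.

```python
# Sequential covering: repeatedly grow one rule and remove the records it covers.
def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
    rules = []
    remaining = list(records)
    while remaining:
        rule = learn_one_rule(remaining, target_class)       # step 2: grow a rule
        covered = [r for r in remaining if rule.matches(r)]
        if len(covered) < min_coverage:                       # stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining if not rule.matches(r)]  # step 3
    return rules
```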
64. Rule Evaluation
An evaluation metric is needed to determine which conjunct should be added (or removed) during the rule-growing process.
Accuracy is an obvious choice because it explicitly measures the fraction of training examples classified correctly by the rule.
A potential limitation of accuracy is that it does not take into account the rule's coverage.
65. For example, consider a training set that contains 60 positive
examples and 100 negative examples. Suppose we are given the
following two candidate rules:
Rule r1: covers 50 positive examples and 5 negative examples,
Rule r2: covers 2 positive examples and no negative examples.
The accuracies for r1 and r2 are 90.9% and 100%, respectively. However, r1
is the better rule despite its lower accuracy. The high accuracy for r2 is
potentially spurious because the coverage of the rule is too low.
66. Approaches to Handle This Problem
1. Likelihood ratio statistic
2. Laplace measure
3. FOIL's information gain
67. Likelihood Ratio Statistic
R = 2 × sum_{i=1..k} f_i log2( f_i / e_i )
where k is the number of classes, f_i is the observed frequency of class i examples that are covered by the rule, and e_i is the expected frequency for a rule that makes random predictions.
68. The likelihood ratio for r1 is
R(r1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9.
(r1 covers 55 examples, so the expected frequencies under random prediction are e+ = 55 × 60/160 = 20.625 and e− = 55 × 100/160 = 34.375.)
The likelihood ratio statistic for r2 is
R(r2) = 2 × [2 × log2(2/0.75) + 0 × log2(0/1.25)] = 5.66.
This statistic therefore suggests that r1 is a better rule than r2.
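A small sketch that reproduces this calculation; it assumes base-2 logarithms and that terms with f_i = 0 contribute nothing.

```python
import math

def likelihood_ratio(f, e):
    """R = 2 * sum_i f_i * log2(f_i / e_i); terms with f_i = 0 contribute 0."""
    return 2 * sum(fi * math.log2(fi / ei) for fi, ei in zip(f, e) if fi > 0)

# likelihood_ratio([50, 5], [20.625, 34.375])  -> about 99.9  (rule r1)
# likelihood_ratio([2, 0], [0.75, 1.25])       -> about 5.66  (rule r2)
```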
69. Laplace Measure
Laplace = (f+ + 1) / (n + k)
where n is the number of examples covered by the rule, f+ is the number of positive examples covered by the rule, and k is the total number of classes. The related m-estimate also uses p+, the prior probability of the positive class: m-estimate = (f+ + k × p+) / (n + k).
70. The Laplace measure for r1 is 51/57 = 89.47%, which
is quite close to its accuracy. Conversely, the Laplace
measure for r2 (75%) is significantly lower than its
accuracy because r2 has a much lower coverage.
72. FOIL's Information Gain
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t × [ log2(p1/(p1+n1)) – log2(p0/(p0+n0)) ]
where t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
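A minimal sketch of this formula; base-2 logarithms are assumed, and t defaults to p1, which holds when R1 is a refinement of R0.

```python
import math

def foil_gain(p0, n0, p1, n1, t=None):
    """FOIL's information gain for extending rule R0 (covering p0 pos / n0 neg)
    to R1 (covering p1 pos / n1 neg); t defaults to p1 when R1 refines R0."""
    if t is None:
        t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```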
73. (a) Rule accuracy
(b) FOIL’s information gain.
(c) likelihood ratio
(d) The Laplace measure.
(e) m-estimate measure (with k = 2 and p+ = 0.2).
74. Direct Method: RIPPER
For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
Learn rules for the positive class
The negative class will be the default class
For a multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for the smallest class first, treating the rest as the negative class
Repeat with the next smallest class as the positive class
75. Direct Method: RIPPER
Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error
pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in
the validation set
n: number of negative examples covered by the rule in
the validation set
Pruning method: delete any final sequence of conditions that
maximizes v
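A hedged sketch of this pruning step; the rule is assumed to be a list of conjuncts and covers() is a hypothetical helper, both introduced only for illustration.

```python
# Incremental reduced-error pruning with the metric v = (p - n) / (p + n),
# evaluated on a validation set of {"class": "+"/"-", ...} records.
def prune_rule(conjuncts, validation, covers):
    def metric(cs):
        covered = [r for r in validation if covers(cs, r)]
        p = sum(1 for r in covered if r["class"] == "+")
        n = len(covered) - p
        return (p - n) / (p + n) if covered else float("-inf")

    best = list(conjuncts)
    # Try deleting final sequences of conditions, keeping the prefix with the best v.
    for k in range(len(conjuncts) - 1, 0, -1):
        candidate = conjuncts[:k]
        if metric(candidate) >= metric(best):
            best = candidate
    return best
```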
90. Characteristics of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
91. Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: compute the distance from the Test Record to the Training Records and choose the k "nearest" records.]
92. Nearest-Neighbor Classifiers
A nearest-neighbour classifier measures the similarity between a new case and the available cases, and assigns the new case to the category it is most similar to.
93. Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
94. Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories does the point lie? To solve this type of problem we need the K-NN algorithm. With the help of K-NN we can easily identify the category or class of a particular data point. Consider the diagram below:
95. How does K-NN work?
The working of K-NN can be explained by the following steps (see the sketch after this list):
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to all other data points.
Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the largest number of neighbors.
Step-6: Our model is ready.
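A minimal sketch of these steps in Python, assuming the training data is a list of (feature_vector, label) pairs; majority voting is used in Step 5.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """train: list of (feature_vector, label); query: feature vector.
    Steps 2-5 above: compute Euclidean distances, take the k nearest,
    and return the majority class among them."""
    dists = sorted((math.dist(x, query), y) for x, y in train)
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# e.g. knn_predict([((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B")], (1, 2), k=3) -> "A"
```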
96. Suppose we have a new data point and we need to
put it in the required category. Consider the
below image:
97. First, we choose the number of neighbors, say k = 5.
Next, we calculate the Euclidean distance between the data points. The Euclidean distance between two points (x1, y1) and (x2, y2), familiar from geometry, is:
d = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
98. By calculating the Euclidean distances we find the nearest neighbors: three nearest neighbors in Category A and two in Category B. Consider the image below:
• Since the majority of the nearest neighbors (3 of the 5) are from Category A, the new data point is assigned to Category A.
99. Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
100. Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
101. Nearest Neighbor Classification
Compute the distance between two points:
  Euclidean distance: d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
Determine the class from the nearest-neighbor list:
  take the majority vote of class labels among the k nearest neighbors
  or weigh the vote according to distance, with weight factor w = 1/d^2
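A short sketch of the distance-weighted vote with w = 1/d^2; the small epsilon guarding against a zero distance is my own addition.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) for the k nearest records.
    Each vote is weighted by w = 1 / d**2, as described above."""
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d ** 2 + 1e-12)   # epsilon guards against d = 0
    return max(scores, key=scores.get)
```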
103. Characteristics of Nearest-Neighbor Classifiers
Use specific training instances to make predictions without having to maintain an abstraction (or model) derived from data.
Lazy learners such as nearest-neighbor classifiers can be quite expensive at classification time, since each test record must be compared against the training records.
Make their predictions based on local information.
Produce arbitrarily shaped decision boundaries.
Produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken.
104. Bayesian Classifiers
Bayesian classifiers are based on Bayes theorem, a statistical principle for combining prior knowledge of the classes with new evidence gathered from data.
Two implementations of Bayesian classifiers:
Naive Bayes
Bayesian belief network
105. Bayes Theorem
Let X and Y be a pair of random variables. Their joint
probability, P(X=x, Y=y), refers to the probability that
variable X will take on the value x and variable Y will take
on the value y.
A conditional probability is the probability that a
random variable will take on a particular value given that
the outcome for another random variable is known.
For example, the conditional probability P(Y=y | X=x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x.
106. The joint and conditional probabilities for X and Y are related in the following way:
P(X,Y) = P(Y|X) × P(X) = P(X|Y) × P(Y)
Bayes theorem:
P(Y|X) = P(X|Y) × P(Y) / P(X)
107. Consider a football game between two rival teams: Team 0 and Team 1.
Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches.
Among the games won by Team 0, only 30% of them come from playing on Team 1's football field.
On the other hand, 75% of the victories for Team 1 are obtained while playing at home.
If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner?
108. Let X indicate whether Team 1 hosts the match and Y the winning team. From the problem statement, P(Y=1) = 0.35, P(Y=0) = 0.65, P(X=1 | Y=1) = 0.75, and P(X=1 | Y=0) = 0.3. By Bayes theorem,
P(Y=1 | X=1) = P(X=1 | Y=1) P(Y=1) / [P(X=1 | Y=1) P(Y=1) + P(X=1 | Y=0) P(Y=0)]
             = (0.75 × 0.35) / (0.75 × 0.35 + 0.3 × 0.65) ≈ 0.5738,
so P(Y=0 | X=1) = 1 − P(Y=1 | X=1) = 0.4262.
Since P(Y=1 | X=1) > P(Y=0 | X=1), Team 1 has a better chance than Team 0 of winning the next match.
109. Using the Bayes Theorem for Classification
Let X denote the attribute set and Y denote the class variable. If the class variable has a non-deterministic relationship with the attributes, then we can treat X and Y as random variables and capture their relationship probabilistically using P(Y|X).
This conditional probability is also known as the posterior probability for Y, as opposed to its prior probability, P(Y).
110. The Bayes theorem is useful because it allows us to express the posterior probability in terms of the prior probability P(Y), the class-conditional probability P(X|Y), and the evidence P(X):
P(Y|X) = P(X|Y) × P(Y) / P(X)
To estimate the class-conditional probabilities P(X|Y), we present two implementations of Bayesian classification methods:
the naive Bayes classifier
the Bayesian belief network
111. Naive Bayes Classifier
A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label Y.
The conditional independence assumption can be formally stated as follows:
P(X | Y = y) = prod_{i=1..d} P(X_i | Y = y)
where the attribute set X = {X1, X2, ..., Xd} consists of d attributes.
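A minimal sketch of this computation for categorical attributes, assuming records are (attribute-dict, label) pairs and estimating each probability as a simple relative frequency (no smoothing):

```python
from collections import Counter

def naive_bayes_predict(records, query):
    """records: list of (attribute_dict, label); query: attribute_dict.
    Returns the class y maximizing P(Y=y) * prod_i P(X_i = query[i] | Y=y)."""
    labels = [y for _, y in records]
    prior = Counter(labels)
    best_label, best_score = None, -1.0
    for y, count_y in prior.items():
        score = count_y / len(records)            # P(Y = y)
        for attr, value in query.items():         # conditional independence:
            match = sum(1 for x, lab in records   # multiply the per-attribute
                        if lab == y and x.get(attr) == value)
            score *= match / count_y              # estimates of P(Xi | Y = y)
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```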
117. Bayesian Belief Networks
A Bayesian belief network (BBN), or simply, Bayesian
network, provides a graphical representation of the
probabilistic relationships among a set of random
variables.
There are two key elements of a Bayesian network:
1. A directed acyclic graph (dag) encoding the dependence
relationships among a set of variables.
2. A probability table associating each node to its
immediate parent nodes.
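As an illustration of these two elements, a network could be encoded as parent lists plus a conditional probability table (CPT) per node; the variables and numbers below are invented for illustration only.

```python
# A DAG given as parent lists, and a CPT per node indexed by the values of its parents.
network = {
    "Exercise":     {"parents": [], "cpt": {(): {"yes": 0.7, "no": 0.3}}},
    "HeartDisease": {"parents": ["Exercise"],
                     "cpt": {("yes",): {"yes": 0.2, "no": 0.8},
                             ("no",):  {"yes": 0.5, "no": 0.5}}},
}

def prob(node, value, parent_values):
    """Look up P(node = value | parents = parent_values) in the CPT."""
    return network[node]["cpt"][tuple(parent_values)][value]
```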
121. Model Building
Model building in Bayesian networks involves two steps:
(1) creating the structure of the network, and
(2) estimating the probability values in the tables associated with each node.