2. Classification: Definition
Classification is the task of assigning objects to
one of several predefined categories.
3. Classification is the task of learning a target function f
that maps each attribute set x to one of the predefined
class labels y.
The target function is also known informally as a
classification model.
A classification model is useful for the following
purposes:
Descriptive Modeling
Predictive Modeling
5. Given a collection of records (training set)
Each record is characterized by a tuple (x, y), where x
is the attribute set and y is the class label
x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output
6. General Approach for Building Classification Model

[Figure: the Training Set is fed to a Learning algorithm, which learns a Model (Induction); the Model is then applied to the Test Set to predict class labels (Deduction).]

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
7. Evaluation of the Performance of a Classification Model
9. Decision Tree Induction
Consider a simpler version of the vertebrate classification problem described in the previous section.
Instead of classifying the vertebrates into five distinct groups of species, we assign them to two categories: mammals and non-mammals.
11. Decision Tree Induction
A decision tree has three types of nodes:
A root node, which has no incoming edges and zero or more outgoing edges.
Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges. In a decision tree, each leaf node is assigned a class label.
14. Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT
15. Hunt's Algorithm
Let Dt be the set of training records that are associated with node t and y = {y1, y2, . . . , yc} be the class labels.
The following is a recursive definition of Hunt's algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
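A minimal sketch of this recursion in Python, assuming records are given as (attribute-dict, label) pairs and a hypothetical choose_test helper selects the attribute test condition; it is an illustration of the idea, not the textbook's code.

```python
# A minimal sketch of Hunt's algorithm. The data layout (list of
# (attributes, label) pairs) and the choose_test() helper are assumptions
# made for illustration only.
from collections import Counter

def hunt(records, choose_test):
    labels = [y for _, y in records]
    # Step 1: all records share one class -> create a leaf with that label.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Step 2: pick an attribute test condition and partition the records.
    test = choose_test(records)            # e.g. a test on "Home Owner"
    partitions = {}
    for x, y in records:
        outcome = test(x)                  # one child per distinct outcome
        partitions.setdefault(outcome, []).append((x, y))
    if len(partitions) == 1:               # test does not separate the data:
        return {"leaf": Counter(labels).most_common(1)[0][0]}  # majority leaf
    # Recursively apply the algorithm to every child node.
    return {"test": test,
            "children": {o: hunt(part, choose_test)
                         for o, part in partitions.items()}}
```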
16. Hunt's Algorithm

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

[Figure: the trees produced by Hunt's algorithm on this data, with (Defaulted = No, Defaulted = Yes) counts shown at each leaf:
(a) a single leaf labeled Defaulted = No (7,3);
(b) split on Home Owner: Yes -> Defaulted = No (3,0); No -> Defaulted = No (4,3);
(c) the Home Owner = No branch is further split on Marital Status: Married -> Defaulted = No (3,0); Single, Divorced -> Defaulted = Yes (1,3);
(d) the Single/Divorced branch is further split on Annual Income: < 80K -> Defaulted = No (1,0); >= 80K -> Defaulted = Yes (0,3).]
20. Apply Model to Test Data

[Decision tree: Home Owner -- Yes -> NO; No -> Marital Status (MarSt) -- Married -> NO; Single, Divorced -> Annual Income -- < 80K -> NO; >= 80K -> YES]

Test Data
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree.
25. Apply Model to Test Data

Following the Home Owner = No branch from the root and then the Marital Status = Married branch, the test record reaches a leaf node.
Assign Defaulted to "No".
26. Design Issues of Decision Tree Induction
How should training records be split?
Method for specifying test condition
depending on attribute types
Measure for evaluating the goodness of a test condition
How should the splitting procedure stop?
Stop splitting if all the records belong to the same class
or have identical attribute values
Early termination
27. Methods for Expressing Test Conditions
Depends on attribute types
Binary
Nominal
Ordinal
Continuous
29. Test Condition for Nominal Attributes
Multi-way split:
Use as many partitions as distinct values.
  Marital Status -> {Single}, {Divorced}, {Married}
Binary split:
Divides values into two subsets.
  Marital Status -> {Single} vs {Married, Divorced}
  OR  Marital Status -> {Married} vs {Single, Divorced}
  OR  Marital Status -> {Single, Married} vs {Divorced}
30. Test Condition for Ordinal Attributes
Multi-way split:
Use as many partitions as distinct values.
  Shirt Size -> {Small}, {Medium}, {Large}, {Extra Large}
Binary split:
Divides values into two subsets.
Preserve the order property among attribute values.
  Shirt Size -> {Small} vs {Medium, Large, Extra Large}
  Shirt Size -> {Small, Medium} vs {Large, Extra Large}
  Shirt Size -> {Small, Large} vs {Medium, Extra Large}  (this grouping violates the order property)
31. Test Condition for Continuous Attributes
(i) Binary split:  Annual Income > 80K?  ->  Yes / No
(ii) Multi-way split:  Annual Income?  ->  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
32. How to Determine the Best Split
Greedy approach:
Nodes with purer class distribution are preferred.
Need a measure of node impurity:
  C0: 5, C1: 5  --  high degree of impurity
  C0: 9, C1: 1  --  low degree of impurity
33. Measures for Selecting the Best Split
Gini Index:
  GINI(t) = 1 - sum_j [ p(j | t) ]^2
Entropy:
  Entropy(t) = - sum_j p(j | t) log2 p(j | t)
Misclassification error:
  Error(t) = 1 - max_i P(i | t)
where p(j | t) is the relative frequency of class j at node t.
34. Examples of Computing the Different Impurity Measures
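As an illustrative sketch (not from the original slides), the three impurity measures defined above can be computed from a node's per-class record counts:

```python
import math

def impurity(counts):
    """Gini, entropy, and classification error for a node with the
    given per-class record counts, e.g. counts = [3, 3]."""
    n = sum(counts)
    probs = [c / n for c in counts]
    gini = 1 - sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    error = 1 - max(probs)
    return gini, entropy, error

# e.g. impurity([0, 6]) -> (0.0, 0.0, 0.0)   (a pure node)
#      impurity([3, 3]) -> (0.5, 1.0, 0.5)   (maximally impure 2-class node)
```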
35. Comparison among Impurity Measures
For a 2-class problem:
36. Finding the Best Split for Binary Attributes
37. If attribute A is chosen to split the data, the Gini index for node N1 is 0.4898 and for node N2 it is 0.480. The weighted average of the Gini index for the descendant nodes is (7/12) x 0.4898 + (5/12) x 0.480 = 0.486. Similarly, we can show that the weighted average of the Gini index for attribute B is 0.375.
38. Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Two-way split (find best partition of values):

CarType   {Sports, Luxury}   {Family}
C1        9                  1
C2        7                  3
Gini      0.468

CarType   {Sports}   {Family, Luxury}
C1        8          2
C2        0          10
Gini      0.167

Multi-way split:

CarType   Family   Sports   Luxury
C1        1        8        1
C2        3        0        7
Gini      0.163

Which of these is the best?
39. Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted Values (Annual Income):  60  70  75  85  90  95  100  120  125  220
Cheat:                          No  No  No  Yes Yes Yes No   No   No   No

Split Position:   55     65     72     80     87     92     97     110    122    172    230
Yes  (<= | >):    0 | 3  0 | 3  0 | 3  0 | 3  1 | 2  2 | 1  3 | 0  3 | 0  3 | 0  3 | 0  3 | 0
No   (<= | >):    0 | 7  1 | 6  2 | 5  3 | 4  3 | 4  3 | 4  3 | 4  4 | 3  5 | 2  6 | 1  7 | 0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
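A hedged sketch of this sort-and-scan procedure in Python; the function name and data layout are assumptions made for illustration.

```python
# Sort the attribute values once, then sweep candidate split positions while
# updating the class counts on each side of the split.
def best_gini_split(values, labels):
    classes = sorted(set(labels))
    pairs = sorted(zip(values, labels))           # sort on attribute values
    right = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}
    n, best = len(pairs), (float("inf"), None)

    def gini(counts, total):
        return 1 - sum((v / total) ** 2 for v in counts.values()) if total else 0

    for i in range(n - 1):
        c = pairs[i][1]
        left[c] += 1                              # move one record to the left side
        right[c] -= 1
        split = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint split position
        w = ((i + 1) / n) * gini(left, i + 1) + ((n - i - 1) / n) * gini(right, n - i - 1)
        best = min(best, (w, split))
    return best                                   # (weighted Gini, split value)

# e.g. best_gini_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
#                      ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"])
# finds the split near 97 with weighted Gini 0.300 (cf. the table above).
```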
40. Consider the training examples shown in Table for a binary
classification problem.
41. (a) Compute the Gini index for the overall collection of
training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using
multiway split.
(e) Compute the Gini index for the Shirt Size attribute using
multiway split.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
42. Algorithm for Decision Tree Induction
43. Characteristics of Decision Tree Induction
Decision tree induction is a nonparametric
approach for building classification models.
Finding an optimal decision tree is an NP-complete
problem.
Techniques developed for constructing decision trees
are computationally inexpensive
Decision trees, especially smaller-sized trees, are
relatively easy to interpret.
44. Decision trees provide an expressive representation for learning discrete-valued functions.
Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
The presence of redundant attributes does not adversely affect the accuracy of decision trees.
45. Characteristics of Decision Tree Induction
Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of records becomes smaller as we traverse down the tree.
A subtree can be replicated multiple times in a decision tree.
46. Consider the training examples shown in Table 4.2 for a binary classification
problem.
(a) What is the entropy of this collection of training examples with respect to the positive
class?
(b) What are the information gains of a1 and a2 relative to these training examples?
(c) For a3, which is a continuous attribute, compute the information gain for every possible
split.
(d)What is the best split (among a1, a2, and a3) according to the information gain?
(e) What is the best split (between a1 and a2) according to the classification error rate?
(f) What is the best split (between a1 and a2) according to the Gini index?
47. Rule-Based Classifier
Classify records by using a collection of "if…then…" rules
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
48. Rule Representation
The left-hand side of the rule is called the rule antecedent or precondition.
It contains a conjunction of attribute tests:
Condition_i = (A1 op v1) ∧ (A2 op v2) ∧ . . . ∧ (Ak op vk)
where (Aj, vj) is an attribute-value pair and op is a logical operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (Aj op vj) is known as a conjunct.
The right-hand side of the rule is called the rule consequent, which contains the predicted class yi.
51. Rule Coverage and Accuracy
Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Rule: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
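As a small illustration, the coverage and accuracy of the rule (Status = Single) → No can be checked directly against the table above; the data layout below is my own.

```python
# Each record is (Marital Status, Class), taken from the table above.
records = [("Single", "No"), ("Married", "No"), ("Single", "No"),
           ("Married", "No"), ("Divorced", "Yes"), ("Married", "No"),
           ("Divorced", "No"), ("Single", "Yes"), ("Married", "No"),
           ("Single", "Yes")]

covered = [(s, c) for s, c in records if s == "Single"]      # antecedent holds
coverage = len(covered) / len(records)                        # 4/10 = 40%
accuracy = sum(c == "No" for _, c in covered) / len(covered)  # 2/4 = 50%
```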
53. Characteristics of Rule-Based Classifier
Mutually exclusive rules
Classifier contains mutually exclusive rules if the rules
are independent of each other
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
Each record is covered by at least one rule
54. Effect of Rule Simplification
Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
55. Ordered Rule Set
Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has
triggered
If none of the rules fired, it is assigned to the default class
56. Rule Ordering Schemes
Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
58. How to Build a Rule-Based Classifier
Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
e.g.: C4.5rules
59. Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met (see the sketch below)
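A conceptual sketch of this loop; learn_one_rule and the rule/record representations are hypothetical placeholders, not part of the original slides.

```python
# Sequential covering: repeatedly grow one rule and remove the records it covers.
def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
    rules = []
    remaining = list(records)
    while remaining:
        rule = learn_one_rule(remaining, target_class)       # step 2: grow a rule
        covered = [r for r in remaining if rule.matches(r)]
        if len(covered) < min_coverage:                       # stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining if not rule.matches(r)]  # step 3
    return rules
```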
64. Rule Evaluation
An evaluation metric is needed to determine which conjunct should be added (or removed) during the rule-growing process.
Accuracy is an obvious choice because it explicitly measures the fraction of training examples classified correctly by the rule.
A potential limitation of accuracy is that it does not take into account the rule's coverage.
65. For example, consider a training set that contains 60 positive
examples and 100 negative examples. Suppose we are given the
following two candidate rules:
Rule r1: covers 50 positive examples and 5 negative examples,
Rule r2: covers 2 positive examples and no negative examples.
The accuracies for r1 and r2 are 90.9% and 100%, respectively. However, r1
is the better rule despite its lower accuracy. The high accuracy for r2 is
potentially spurious because the coverage of the rule is too low.
66. Approaches to Handle This Problem
1. Likelihood ratio statistic
2. Laplace measure
3. FOIL's information gain
67. Likelihood Ratio Statistic
R = 2 × sum_{i=1..k} f_i log2( f_i / e_i )
where k is the number of classes, f_i is the observed frequency of class i examples that are covered by the rule, and e_i is the expected frequency for a rule that makes random predictions.
68. The likelihood ratio for r1 is
R(r1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9.
(r1 covers 55 examples, so the expected frequencies under random prediction are e+ = 55 × 60/160 = 20.625 and e− = 55 × 100/160 = 34.375.)
The likelihood ratio statistic for r2 is
R(r2) = 2 × [2 × log2(2/0.75) + 0 × log2(0/1.25)] = 5.66.
This statistic therefore suggests that r1 is a better rule than r2.
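A small sketch that reproduces this calculation; it assumes base-2 logarithms and that terms with f_i = 0 contribute nothing.

```python
import math

def likelihood_ratio(f, e):
    """R = 2 * sum_i f_i * log2(f_i / e_i); terms with f_i = 0 contribute 0."""
    return 2 * sum(fi * math.log2(fi / ei) for fi, ei in zip(f, e) if fi > 0)

# likelihood_ratio([50, 5], [20.625, 34.375])  -> about 99.9  (rule r1)
# likelihood_ratio([2, 0], [0.75, 1.25])       -> about 5.66  (rule r2)
```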
69. Laplace Measure
Laplace = (f+ + 1) / (n + k)
where n is the number of examples covered by the rule, f+ is the number of positive examples covered by the rule, and k is the total number of classes. The related m-estimate also uses p+, the prior probability of the positive class: m-estimate = (f+ + k × p+) / (n + k).
70. The Laplace measure for r1 is 51/57 = 89.47%, which
is quite close to its accuracy. Conversely, the Laplace
measure for r2 (75%) is significantly lower than its
accuracy because r2 has a much lower coverage.
72. FOIL's Information Gain
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t × [ log2(p1/(p1+n1)) – log2(p0/(p0+n0)) ]
where t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
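A minimal sketch of this formula; base-2 logarithms are assumed, and t defaults to p1, which holds when R1 is a refinement of R0.

```python
import math

def foil_gain(p0, n0, p1, n1, t=None):
    """FOIL's information gain for extending rule R0 (covering p0 pos / n0 neg)
    to R1 (covering p1 pos / n1 neg); t defaults to p1 when R1 refines R0."""
    if t is None:
        t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```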
73. (a) Rule accuracy
(b) FOIL’s information gain.
(c) likelihood ratio
(d) The Laplace measure.
(e) m-estimate measure (with k = 2 and p+ = 0.2).
74. Direct Method: RIPPER
For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
Learn rules for the positive class
The negative class will be the default class
For a multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for the smallest class first, treating the rest as the negative class
Repeat with the next smallest class as the positive class
75. Direct Method: RIPPER
Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error
pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in
the validation set
n: number of negative examples covered by the rule in
the validation set
Pruning method: delete any final sequence of conditions that
maximizes v
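A hedged sketch of this pruning step; the rule is assumed to be a list of conjuncts and covers() is a hypothetical helper, both introduced only for illustration.

```python
# Incremental reduced-error pruning with the metric v = (p - n) / (p + n),
# evaluated on a validation set of {"class": "+"/"-", ...} records.
def prune_rule(conjuncts, validation, covers):
    def metric(cs):
        covered = [r for r in validation if covers(cs, r)]
        p = sum(1 for r in covered if r["class"] == "+")
        n = len(covered) - p
        return (p - n) / (p + n) if covered else float("-inf")

    best = list(conjuncts)
    # Try deleting final sequences of conditions, keeping the prefix with the best v.
    for k in range(len(conjuncts) - 1, 0, -1):
        candidate = conjuncts[:k]
        if metric(candidate) >= metric(best):
            best = candidate
    return best
```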
90. Characteristics of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
91. Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: compute the distance from the Test Record to the Training Records and choose the k "nearest" records.]
92. Nearest-Neighbor Classifiers
A nearest-neighbour classifier measures the similarity between a new case and the available cases, and assigns the new case to the category it is most similar to.
93. Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
94. Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories does the point lie? To solve this type of problem we need the K-NN algorithm. With the help of K-NN we can easily identify the category or class of a particular data point. Consider the diagram below:
95. How does K-NN work?
The working of K-NN can be explained by the following steps (see the sketch after this list):
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to all other data points.
Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the largest number of neighbors.
Step-6: Our model is ready.
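A minimal sketch of these steps in Python, assuming the training data is a list of (feature_vector, label) pairs; majority voting is used in Step 5.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """train: list of (feature_vector, label); query: feature vector.
    Steps 2-5 above: compute Euclidean distances, take the k nearest,
    and return the majority class among them."""
    dists = sorted((math.dist(x, query), y) for x, y in train)
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# e.g. knn_predict([((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B")], (1, 2), k=3) -> "A"
```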
96. Suppose we have a new data point and we need to
put it in the required category. Consider the
below image:
97. First, we choose the number of neighbors, say k = 5.
Next, we calculate the Euclidean distance between the data points. The Euclidean distance between two points (x1, y1) and (x2, y2), familiar from geometry, is:
d = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
98. By calculating the Euclidean distances we find the nearest neighbors: three nearest neighbors in Category A and two in Category B. Consider the image below:
• Since the majority of the nearest neighbors (3 of the 5) are from Category A, the new data point is assigned to Category A.
99. Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
100. Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
101. Nearest Neighbor Classification
Compute the distance between two points:
  Euclidean distance: d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
Determine the class from the nearest-neighbor list:
  take the majority vote of class labels among the k nearest neighbors
  or weigh the vote according to distance, with weight factor w = 1/d^2
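A short sketch of the distance-weighted vote with w = 1/d^2; the small epsilon guarding against a zero distance is my own addition.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) for the k nearest records.
    Each vote is weighted by w = 1 / d**2, as described above."""
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d ** 2 + 1e-12)   # epsilon guards against d = 0
    return max(scores, key=scores.get)
```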
103. Characteristics of Nearest-Neighbor Classifiers
Use specific training instances to make predictions without having to maintain an abstraction (or model) derived from data.
Lazy learners such as nearest-neighbor classifiers can be quite expensive at classification time, since each test record must be compared against the training records.
Make their predictions based on local information.
Produce arbitrarily shaped decision boundaries.
Produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken.
104. Bayesian Classifiers
Bayesian classifiers are based on Bayes theorem, a statistical principle for combining prior knowledge of the classes with new evidence gathered from data.
Two implementations of Bayesian classifiers:
Naive Bayes
Bayesian belief network
105. Bayes Theorem
Let X and Y be a pair of random variables. Their joint
probability, P(X=x, Y=y), refers to the probability that
variable X will take on the value x and variable Y will take
on the value y.
A conditional probability is the probability that a
random variable will take on a particular value given that
the outcome for another random variable is known.
For example, the conditional probability P(Y=y | X=x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x.
106. The joint and conditional probabilities for X and Y are related in the following way:
P(X,Y) = P(Y|X) × P(X) = P(X|Y) × P(Y)
Bayes theorem:
P(Y|X) = P(X|Y) × P(Y) / P(X)
107. Consider a football game between two rival teams: Team 0 and Team 1.
Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches.
Among the games won by Team 0, only 30% of them come from playing on Team 1's football field.
On the other hand, 75% of the victories for Team 1 are obtained while playing at home.
If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner?
108. Let X indicate whether Team 1 hosts the match and Y the winning team. From the problem statement, P(Y=1) = 0.35, P(Y=0) = 0.65, P(X=1 | Y=1) = 0.75, and P(X=1 | Y=0) = 0.3. By Bayes theorem,
P(Y=1 | X=1) = P(X=1 | Y=1) P(Y=1) / [P(X=1 | Y=1) P(Y=1) + P(X=1 | Y=0) P(Y=0)]
             = (0.75 × 0.35) / (0.75 × 0.35 + 0.3 × 0.65) ≈ 0.5738,
so P(Y=0 | X=1) = 1 − P(Y=1 | X=1) = 0.4262.
Since P(Y=1 | X=1) > P(Y=0 | X=1), Team 1 has a better chance than Team 0 of winning the next match.
109. Using the Bayes Theorem for Classification
Let X denote the attribute set and Y denote the class variable. If the class variable has a non-deterministic relationship with the attributes, then we can treat X and Y as random variables and capture their relationship probabilistically using P(Y|X).
This conditional probability is also known as the posterior probability for Y, as opposed to its prior probability, P(Y).
110. The Bayes theorem is useful because it allows us to express the posterior probability in terms of the prior probability P(Y), the class-conditional probability P(X|Y), and the evidence P(X):
P(Y|X) = P(X|Y) × P(Y) / P(X)
To estimate the class-conditional probabilities P(X|Y), we present two implementations of Bayesian classification methods:
the naive Bayes classifier
the Bayesian belief network
111. Naive Bayes Classifier
A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label Y.
The conditional independence assumption can be formally stated as follows:
P(X | Y = y) = prod_{i=1..d} P(X_i | Y = y)
where the attribute set X = {X1, X2, ..., Xd} consists of d attributes.
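A minimal sketch of this computation for categorical attributes, assuming records are (attribute-dict, label) pairs and estimating each probability as a simple relative frequency (no smoothing):

```python
from collections import Counter

def naive_bayes_predict(records, query):
    """records: list of (attribute_dict, label); query: attribute_dict.
    Returns the class y maximizing P(Y=y) * prod_i P(X_i = query[i] | Y=y)."""
    labels = [y for _, y in records]
    prior = Counter(labels)
    best_label, best_score = None, -1.0
    for y, count_y in prior.items():
        score = count_y / len(records)            # P(Y = y)
        for attr, value in query.items():         # conditional independence:
            match = sum(1 for x, lab in records   # multiply the per-attribute
                        if lab == y and x.get(attr) == value)
            score *= match / count_y              # estimates of P(Xi | Y = y)
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```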
117. Bayesian Belief Networks
A Bayesian belief network (BBN), or simply, Bayesian
network, provides a graphical representation of the
probabilistic relationships among a set of random
variables.
There are two key elements of a Bayesian network:
1. A directed acyclic graph (dag) encoding the dependence
relationships among a set of variables.
2. A probability table associating each node to its
immediate parent nodes.
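As an illustration of these two elements, a network could be encoded as parent lists plus a conditional probability table (CPT) per node; the variables and numbers below are invented for illustration only.

```python
# A DAG given as parent lists, and a CPT per node indexed by the values of its parents.
network = {
    "Exercise":     {"parents": [], "cpt": {(): {"yes": 0.7, "no": 0.3}}},
    "HeartDisease": {"parents": ["Exercise"],
                     "cpt": {("yes",): {"yes": 0.2, "no": 0.8},
                             ("no",):  {"yes": 0.5, "no": 0.5}}},
}

def prob(node, value, parent_values):
    """Look up P(node = value | parents = parent_values) in the CPT."""
    return network[node]["cpt"][tuple(parent_values)][value]
```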
121. Model Building
Model building in Bayesian networks involves two steps:
(1) creating the structure of the network, and
(2) estimating the probability values in the tables associated with each node.