Classification and Prediction
What is Classification?
• In Classification, a model or classifier is constructed to predict categorical labels
• Data classification is a two-step process
1. Learning 2. Classification
[Figure: The two-step classification process — a classification algorithm learns classification rules from the training data; the rules are then applied to test data and to new data (unknown class label) to assign a class label.]
Data Mining: Classification and Prediction 3
What is Classification?
• Learning step:
• A classification algorithm builds the classifier by analyzing or “learning from” a training
set made up of database tuples and their associated class labels.
• The individual tuples making up the training set are referred to as training tuples.
• Data tuples can be referred to as samples, examples, instances, data points, or objects
• This is the supervised learning step
• The class label of each training tuple is provided
• This process can be viewed as the learning of a mapping or function y = f(X)
• that predicts the associated class label y of a given tuple X
• This mapping is represented in the form of classification rules, decision trees, or
mathematical formulae
Data Mining: Classification and Prediction 4
What is Classification?
• Classification Step
• The model is used for classification.
• A test set is used, made up of test tuples and their associated class labels.
• Randomly selected tuples from the general data set
• The accuracy of a classifier on a given test set is the percentage of test set tuples that
are correctly classified by the classifier.
• The associated class label of each test tuple is compared with the learned classifier’s class
prediction for that tuple.
• If the accuracy of the classifier is considered acceptable, the classifier can be used to
classify future data tuples for which the class label is not known.
Data Mining: Classification and Prediction 5
What is Prediction?
• Data prediction is a two-step process, similar to data classification
• There is no class attribute, because the attribute values to be predicted are continuous-valued (ordered) rather than
categorical (discrete-valued)
• The attribute to be predicted is simply called the predicted attribute
• Prediction can also be viewed as a mapping or function y = f(X)
Data Mining: Classification and Prediction 7
How are classification and prediction
different?
• Data classification predicts categorical (discrete, unordered) class labels
• Data prediction predicts continuous-valued attribute values
• Testing data is used to assess accuracy of a classifier
• The accuracy of a predictor is estimated by computing an error based on the
difference between the predicted value and the actual known value of y for each
of the test tuples, X
Data Mining: Classification and Prediction 8
Decision Tree Induction
• It is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where
• Each internal node (non-leaf node) denotes a test on an attribute,
• Each branch represents an outcome of the test, and
• Each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node
• Example: Is a person fit?
• Binary decision tree
• Non-binary decision tree
Data Mining: Classification and Prediction 9
Fig. Decision tree for the concept "being fit":
Age < 30?
  Yes → Eats lots of pizza? (Yes → Unfit!, No → Fit)
  No  → Exercises daily? (Yes → Fit, No → Unfit!)
Decision Tree Induction
• How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown, the attribute values of
the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that
tuple.
• Advantages of decision trees:
• Does not require any domain knowledge or parameter setting
• Can handle high dimensional data.
• The learning and classification steps of decision tree induction are simple and fast
• Have good accuracy
Data Mining: Classification and Prediction 10
Decision Tree Induction
• Attribute selection measures
• Used to select the attribute that best partitions the tuples into distinct
classes.
• Information gain, Gain Ratio, Gini Index
• A well-known decision tree algorithm is ID3 (Iterative Dichotomiser 3).
• The C4.5 algorithm (a successor of ID3) is often used as a benchmark against which newer supervised
learning algorithms are compared
• Classification and Regression Trees (CART)
• Adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner
Data Mining: Classification and Prediction 11
Decision Tree Induction
• Information Gain
• ID3 uses information gain as its attribute selection measure
• The expected information needed to classify a tuple in 𝐷 is given by
Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
• where p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated as |C_{i,D}| / |D|
• Info(D) is also known as the entropy of D.
• The expected information required to classify a tuple from 𝐷 based on the partitioning by
attribute 𝐴.
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Data Mining: Classification and Prediction 12
Decision Tree Induction
• Information Gain
• Information gain is defined as the difference between the original information
requirement and the new requirement
Gain(A) = Info(D) − Info_A(D)
• The attribute 𝐴 with the highest information gain, 𝑮𝒂𝒊𝒏(𝑨), is chosen as the splitting
attribute at node 𝑁.
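To make the computation concrete, here is a small Python sketch (not part of the original slides; the helper names info and gain are illustrative) that evaluates Info(D), Info_A(D) and Gain(A) from class counts. The example counts are the ones used in the drug example on the following slides.

from math import log2

def info(counts):
    """Entropy Info(D) of a data partition, given its per-class tuple counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A): class_counts are the per-class counts of D; partitions holds one
    per-class count list per outcome of attribute A."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * info(p) for p in partitions)
    return info(class_counts) - info_a

# Drug data from the following example: 5 Drug A vs. 9 Drug B; splitting on Age
# gives partitions <=30 (3 A, 2 B), 31-50 (0 A, 4 B), >50 (2 A, 3 B).
print(round(info([5, 9]), 4))                            # ~0.9403
print(round(gain([5, 9], [[3, 2], [0, 4], [2, 3]]), 4))  # ~0.2467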
Data Mining: Classification and Prediction 13
Decision Tree Induction
• Decision Tree Generation Algorithm
• Input:
• Data partition, D,
which is a set of training tuples and their associated class labels;
• Attribute_list,
the set of candidate attributes;
• Attribute_selection_method,
a procedure to determine the splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting attribute and, possibly, either a split point
or splitting subset.
Data Mining: Classification and Prediction 14
Decision Tree Induction
• Generate_decision_tree Algorithm
• Method
1. create a node N;
2. if tuples in D are all of the same class, C then
3. return N as a leaf node labeled with the class C;
4. if Attribute_list is empty then
5. return N as a leaf node labeled with the majority class in D;
6. apply Attribute_selection_method(D, Attribute_list) to find the “best” splitting
criterion;
7. label node N with splitting criterion;
Data Mining: Classification and Prediction 15
Decision Tree Induction
• Decision Tree Generation Algorithm
• Method
8. if splitting_attribute is discrete-valued and multiway splits allowed then
9. Attribute_list ← Attribute_list − Splitting_attribute; // remove splitting attribute
10. for each outcome j of splitting criterion
11. let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. if Dj is empty then
13. attach a leaf labeled with the majority class in D to node N;
14. else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
15. End for
16. return N;
Data Mining: Classification and Prediction 16
Example
Data Mining: Classification and Prediction 17
Patient ID Age Sex BP Cholesterol Class: Drug
P1 <=30 F High Normal Drug A
P2 <=30 F High High Drug A
P3 31…50 F High Normal Drug B
P4 >50 F Normal Normal Drug B
P5 >50 M Low Normal Drug B
P6 >50 M Low High Drug A
P7 31…50 M Low High Drug B
P8 <=30 F Normal Normal Drug A
P9 <=30 M Low Normal Drug B
P10 >50 M Normal Normal Drug B
P11 <=30 M Normal High Drug B
P12 31…50 F Normal High Drug B
P13 31…50 M High Normal Drug B
P14 >50 F Normal High Drug A
P15 31…50 F Low Normal ?
Example
• Reduced training data (Patient ID dropped)
• Establish the target classification: which drug to advise?
• 5/14 tuples → Drug A
• 9/14 tuples → Drug B
Data Mining: Classification and Prediction 18
Age Gender BP Cholesterol Class: Drug
<=30 F High Normal Drug A
<=30 F High High Drug A
31…50 F High Normal Drug B
>50 F Normal Normal Drug B
>50 M Low Normal Drug B
>50 M Low High Drug A
31…50 M Low High Drug B
<=30 F Normal Normal Drug A
<=30 M Low Normal Drug B
>50 M Normal Normal Drug B
<=30 M Normal High Drug B
31…50 F Normal High Drug B
31…50 M High Normal Drug B
>50 F Normal High Drug A
Example
• Calculate the expected information (entropy) of D with respect to the class attribute Drug
Info(D) = − (5/14) log2(5/14) − (9/14) log2(9/14) = 0.9403
• Calculate information gain of remaining attributes to determine the root node
Data Mining: Classification and Prediction 19
Example
• Attribute: Age
• <=30 →5, 31-50 →4, >50 →5
• Attribute Age has 3 distinct values, so we need three entropy calculations
Gain(Age) = Info(D) − [ (5/14) × Info(≤30) + (4/14) × Info(31−50) + (5/14) × Info(>50) ] = 0.2467
Data Mining: Classification and Prediction 21
<=30: 3 Drug A, 2 Drug B → Info(≤30) = − (3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.9710
31–50: 0 Drug A, 4 Drug B → Info(31−50) = 0
>50: 2 Drug A, 3 Drug B → Info(>50) = − (2/5) log2(2/5) − (3/5) log2(3/5) ≈ 0.9710
Example
• Attribute: Gender
• M→7, F→ 7
• Attribute Gender has 2 distinct values, so we need two entropy calculations
Gain(Gender) = Info(D) − [ (7/14) × Info(M) + (7/14) × Info(F) ] = 0.9403 − 0.7885 = 0.1519
Data Mining: Classification and Prediction 23
F: 4 Drug A, 3 Drug B → Info(F) = − (4/7) log2(4/7) − (3/7) log2(3/7) ≈ 0.9852
M: 1 Drug A, 6 Drug B → Info(M) = − (1/7) log2(1/7) − (6/7) log2(6/7) ≈ 0.5917
Example
• Attribute: BP
• High→ 4 , Normal→ 6 , Low→ 4
• Attribute BP has 3 distinct values, so we need three entropy calculations
Gain(BP) = Info(D) − [ (4/14) × Info(High) + (6/14) × Info(Normal) + (4/14) × Info(Low) ] = 0.9403 − 0.9111 = 0.0292
Data Mining: Classification and Prediction 25
High: 2 Drug A, 2 Drug B → Info(High) = − (2/4) log2(2/4) − (2/4) log2(2/4) = 1.00
Normal: 2 Drug A, 4 Drug B → Info(Normal) = − (2/6) log2(2/6) − (4/6) log2(4/6) ≈ 0.9183
Low: 1 Drug A, 3 Drug B → Info(Low) = − (1/4) log2(1/4) − (3/4) log2(3/4) ≈ 0.8113
Data Mining: Classification and Prediction 26
Partition Cholesterol = High (6 tuples): Drug A, Drug A, Drug B, Drug B, Drug B, Drug A
Partition Cholesterol = Normal (8 tuples): Drug A, Drug B, Drug B, Drug B, Drug A, Drug B, Drug B, Drug B
Example
• Attribute: Cholesterol
• High→ 6 , Normal →8
• Attribute Cholesterol has 2 distinct values, so we need two entropy calculations
Gain(Cholesterol) = Info(D) − [ (6/14) × Info(High) + (8/14) × Info(Normal) ] = 0.9403 − 0.8922 = 0.0481
Data Mining: Classification and Prediction 27
High: 3 Drug A, 3 Drug B → Info(High) = − (3/6) log2(3/6) − (3/6) log2(3/6) = 1.00
Normal: 2 Drug A, 6 Drug B → Info(Normal) = − (2/8) log2(2/8) − (6/8) log2(6/8) ≈ 0.8113
Example
• Recap
• We choose Age as the root node.
Data Mining: Classification and Prediction 29
Gain(Age) = 0.2467  ← highest
Gain(Gender) = 0.1519
Gain(BP) = 0.0292
Gain(Cholesterol) = 0.0481
Partial decision tree: root Age — ≤30 → ?, 31–50 → Drug B, >50 → ?  (repeat the steps for the remaining branches)
Example
Data Mining: Classification and Prediction 30
[Final decision tree]
Age
  ≤30 → Gender: Male → Drug B, Female → Drug A
  31–50 → Drug B
  >50 → Cholesterol: Normal → Drug B, High → Drug A
Decision Tree Induction
• What if the splitting attribute 𝐴 is continuous-valued?
• The test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, respectively
• where split_point is the split-point returned by Attribute_selection_method as part of the
splitting criterion.
• When A is discrete-valued and a binary tree must be produced
• The test at node N is of the form “𝐴 ∈ 𝑆𝐴”,
• where 𝑆𝐴 is the splitting subset for 𝐴 returned by Attribute selection method as part of the
splitting criterion.
Data Mining: Classification and Prediction 31
Attribute Selection Measures
• Gain Ratio
• The information gain measure is biased toward tests with many outcomes.
• C4.5, a successor of ID3, uses an extension to information gain known as gain ratio
• Applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
• This represents the potential information generated by splitting the training data set, 𝐷, into 𝑣
partitions, corresponding to the 𝑣 outcomes of a test on attribute 𝐴.
• The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute.
Data Mining: Classification and Prediction 33
Attribute Selection Measures
• Gain Ratio
• Computation of gain ratio for the attribute weight.
• Attribute Weight has three values, Heavy, Average and Light, containing 5, 6 and 4 tuples respectively (15 tuples in total).
SplitInfo_Weight(D) = − (5/15) log2(5/15) − (6/15) log2(6/15) − (4/15) log2(4/15) = 1.5655
Gain(Weight) = 0.0622
GainRatio(Weight) = 0.0622 / 1.5655 = 0.040
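As a quick check, the split information and gain ratio for this Weight example can be recomputed in a few lines of Python (an illustrative sketch; Gain(Weight) = 0.0622 is taken from the slide as given).

from math import log2

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes)

si = split_info([5, 6, 4])                   # ~1.5655
print(round(si, 4), round(0.0622 / si, 3))   # gain ratio ~0.040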
Data Mining: Classification and Prediction 34
Attribute Selection Measures
• Gini Index
• The Gini index is used in CART
• The Gini index measures the impurity of 𝐷, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{m} p_i²
• where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|.
• The sum is computed over 𝑚 classes.
• The Gini index considers a binary split for each attribute
• If a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning
is
Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)
Data Mining: Classification and Prediction 35
Attribute Selection Measures
• Gini Index
• For a discrete-valued attribute, the subset that gives the minimum gini index for
that attribute is selected as its splitting subset.
• For continuous-valued attributes, each possible split-point must be considered.
• The reduction in impurity that would be incurred by a binary split on a discrete-
or continuous-valued attribute A is
ΔGini(A) = Gini(D) − Gini_A(D)
• The attribute that maximizes the reduction in impurity is selected as the splitting attribute.
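A minimal Python sketch of the Gini computations, assuming the per-class counts of each partition are available (the binary split at the end is purely hypothetical, only to show ΔGini).

def gini(counts):
    """Gini(D) from per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    """Gini_A(D) for a binary split of D into D1 and D2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    total = n1 + n2
    return n1 / total * gini(d1_counts) + n2 / total * gini(d2_counts)

# D with 5 tuples of one class and 9 of the other (as in the drug example),
# and a hypothetical binary split of D into (3, 2) and (2, 7):
gini_d = gini([5, 9])
delta = gini_d - gini_split([3, 2], [2, 7])   # reduction in impurity, ΔGini
print(round(gini_d, 4), round(delta, 4))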
Data Mining: Classification and Prediction 36
Bayesian Classification
• Bayesian classifiers are statistical classifiers
• Predicts class membership probabilities.
• based on Bayes’ theorem .
• Exhibits high accuracy and speed when applied to large databases.
• A simple Bayesian classifier is known as the naïve Bayesian classifier
• Assumes that the effect of an attribute value on a given class is independent of the values of
the other attributes: class conditional independence.
• Bayesian belief networks are graphical models, allow the representation of
dependencies among subsets of attributes
Data Mining: Classification and Prediction 39
Bayesian Classification
• Bayes’ Theorem
• Let 𝑿 be a data tuple (X is considered “evidence”).
• Let 𝑯 be some hypothesis, such as that the data tuple 𝑿 belongs to a specified class 𝑪.
• Determine 𝑷 𝑯|𝑿 , the probability that the hypothesis 𝑯 holds given the “evidence” or
observed data tuple X.
• 𝑷 𝑯|𝑿 is the posterior probability of 𝑯 conditioned on X.
• 𝑷 𝑯 is the prior probability of 𝑯.
• 𝑷 𝑿|𝑯 is the posterior probability of 𝑿 conditioned on H.
• 𝑷 𝑿 is the prior probability of 𝑿.
• “How are these probabilities estimated?”
Data Mining: Classification and Prediction 40
P(H|X) = P(X|H) P(H) / P(X)
…Bayes’ Theorem
Naïve Bayesian Classification
• A simple Bayesian classifier is known as the naïve Bayesian classifier
• Assumes that the effect of an attribute value on a given class is independent of
the values of the other attributes: class conditional independence.
• This assumption is made to simplify the computations involved and, in this sense, is
considered "naïve."
Data Mining: Classification and Prediction 41
Naïve Bayesian Classification
• Let 𝐷 be a training set of tuples and their associated class labels.
• Suppose that there are m classes, 𝐶1, 𝐶2, … 𝐶𝑚.
• Given a tuple, 𝑋 = (𝑥1, 𝑥2, … , 𝑥𝑛) depicting 𝑛 measurements made on the tuple from 𝑛
attributes, the classifier will predict that 𝑋 belongs to the class having the highest
posterior probability, conditioned on 𝑋.
• The naïve Bayesian classifier predicts that tuple X belongs to the class 𝐶𝑖 if and only if
𝑃 𝐶𝑖|𝑋 > 𝑃 𝐶𝑗|𝑋 𝑓𝑜𝑟 1 ≤ 𝑗 ≤ 𝑚, 𝑗 ≠ 𝑖
• We maximize 𝑃 𝐶𝑖|𝑋
• The class 𝐶𝑖 for which 𝑃 𝐶𝑖|𝑋 is maximized is called the maximum posteriori
hypothesis.
Data Mining: Classification and Prediction 42
Naïve Bayesian Classification
• By Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• As 𝑃 𝑋 is constant for all classes, only 𝑃 𝑋|𝐶𝑖 𝑃 𝐶𝑖 need be maximized.
• The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
𝑃 𝐶𝑖|𝑋 > 𝑃 𝐶𝑗|𝑋 𝑓𝑜𝑟 1 ≤ 𝑗 ≤ 𝑚, 𝑗 ≠ 𝑖
• The class 𝐶𝑖 for which 𝑃 𝐶𝑖|𝑋 is maximized is called the maximum posteriori
hypothesis.
• The class prior probabilities may be estimated by
P(Ci) = |C_{i,D}| / |D|, where |C_{i,D}| is the number of training tuples of class Ci in D
Data Mining: Classification and Prediction 43
Naïve Bayesian Classification
• In order to reduce computation in evaluating 𝑃 𝑋|𝐶𝑖 , the naive assumption of class
conditional independence is made.
P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci) = P(x1|Ci) × P(x2|Ci) × ⋯ × P(xn|Ci)
• In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers (this assumes the true probabilities are known, which rarely holds in practice).
Data Mining: Classification and Prediction 44
Naïve Bayesian Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 45
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 46
• Let C1 be the class Buys_computer = yes and C2 be the class Buys_computer = no
• The tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit_rating = fair)
• We need to maximize P(X|Ci) P(Ci), for i = 1, 2
• Calculate P(Ci), for i = 1, 2
• P(Buys_computer = yes) = 9/14 = 0.643
• P(Buys_computer = no) = 5/14 = 0.357
• Calculate P(X|Ci), for i = 1, 2
Naïve Bayesian Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 47
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 48
X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Calculate P(X|Ci), for i = 1, 2
• P(x1|C1) = P(age = youth | Buys_computer = yes) = 2/9 = 0.222
• P(x1|C2) = P(age = youth | Buys_computer = no) = 3/5 = 0.600
• P(x2|C1) = P(income = medium | Buys_computer = yes) = 4/9 = 0.444
• P(x2|C2) = P(income = medium | Buys_computer = no) = 2/5 = 0.400
• P(x3|C1) = P(student = yes | Buys_computer = yes) = 6/9 = 0.667
• P(x3|C2) = P(student = yes | Buys_computer = no) = 1/5 = 0.200
• P(x4|C1) = P(credit_rating = fair | Buys_computer = yes) = 6/9 = 0.667
• P(x4|C2) = P(credit_rating = fair | Buys_computer = no) = 2/5 = 0.400
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 49
• Now we calculate, from the above probabilities,
P(X | Buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly
P(X | Buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• To find the class, Ci, that maximizes P(X|Ci) P(Ci), we compute
P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | Buys_computer = no) P(Buys_computer = no) = 0.019 × 0.357 = 0.007
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
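The same computation can be reproduced programmatically. The sketch below (not from the slides) builds the conditional probabilities directly from the buys_computer training table shown above.

from collections import Counter

data = [  # (age, income, student, credit_rating, Buys_computer)
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
x = ("youth", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    likelihood = 1.0
    for k, value in enumerate(x):
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        likelihood *= n_match / n_c             # P(x_k | C_i)
    scores[c] = (n_c / len(data)) * likelihood  # P(X | C_i) P(C_i)

print(scores)                          # yes: ~0.028, no: ~0.007
print(max(scores, key=scores.get))     # predicted class: yes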
Naïve Bayesian Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent yes
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 50
Calculate P(Ci), for i = 1, 2 (note: in this modified table, RID 6 now has class yes)
P(Buys_computer = yes) = 10/14 = 0.714
P(Buys_computer = no) = 4/14 = 0.286
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 51
• Calculate P(X|Ci), for i = 1, 2
• P(x1|C1) = P(age = youth | Buys_computer = yes) = 2/10 = 0.200
• P(x1|C2) = P(age = youth | Buys_computer = no) = 3/4 = 0.750
• P(x2|C1) = P(income = medium | Buys_computer = yes) = 4/10 = 0.400
• P(x2|C2) = P(income = medium | Buys_computer = no) = 2/4 = 0.500
• P(x3|C1) = P(student = yes | Buys_computer = yes) = 7/10 = 0.700
• P(x3|C2) = P(student = yes | Buys_computer = no) = 0/4 = 0
• P(x4|C1) = P(credit_rating = fair | Buys_computer = yes) = 6/10 = 0.600
• P(x4|C2) = P(credit_rating = fair | Buys_computer = no) = 2/4 = 0.500
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 52
• Now we calculate, from the above probabilities,
P(X | Buys_computer = yes) = 0.200 × 0.400 × 0.700 × 0.600 = 0.034
Similarly
P(X | Buys_computer = no) = 0.750 × 0.500 × 0 × 0.500 = 0
• To find the class, Ci, that maximizes P(X|Ci) P(Ci), we compute
P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.034 × 0.714 = 0.024
P(X | Buys_computer = no) P(Buys_computer = no) = 0 × 0.286 = 0
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
Is this a correct classification? (Note that the single zero count, P(student = yes | Buys_computer = no) = 0, forced P(X | Buys_computer = no) to 0.)
Naïve Bayesian Classification
Data Mining: Classification and Prediction 53
• A zero probability cancels the effects of all of the other (posteriori) probabilities
(on 𝐶𝑖) involved in the product.
• To avoid the effect of zero probability value, Laplacian correction or Laplace
estimator is used.
• We add one to each count.
Naïve Bayesian Classification
Data Mining: Classification and Prediction 54
• E.g. If we have a training database D having 1500 tuples.
• Out of which, 1000 tuples are of class Buys_computer = yes.
• For income attribute we have
• 0 tuples for income = low,
• 960 tuple for income = medium,
• 40 tuples for income = high.
• Using the Laplacian correction for the three quantities, we pretend that we have 1 extra tuple for
each income-value pair:
1/1003 = 0.001,  961/1003 = 0.958,  41/1003 = 0.041
• The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero
probability value is avoided.
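A tiny Python sketch of the correction (the income counts are the ones from the example above):

counts = {"low": 0, "medium": 960, "high": 40}        # income counts for Buys_computer = yes
corrected_total = sum(counts.values()) + len(counts)  # 1000 + 3 = 1003
for value, n in counts.items():
    print(value, round((n + 1) / corrected_total, 3)) # low 0.001, medium 0.958, high 0.041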
Rule-Based Classification
• The learned model is represented as a set of IF-THEN rules.
• An IF-THEN rule is an expression of the form
𝑰𝑭 𝑐𝑜𝑛𝑑𝑖𝑡𝑖𝑜𝑛 𝑻𝑯𝑬𝑵 𝑐𝑜𝑛𝑐𝑙𝑢𝑠𝑖𝑜𝑛
• Example: R1: IF age = youth AND student = yes THEN Buys_computer = yes
• R1 can also be written as
𝑅1: (𝑎𝑔𝑒 = 𝑦𝑜𝑢𝑡ℎ) ∧ (𝑠𝑡𝑢𝑑𝑒𝑛𝑡 = 𝑦𝑒𝑠) ⇒ (𝐵𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = 𝑦𝑒𝑠)
Data Mining: Classification and Prediction 55
The IF part (the attribute tests) is the rule antecedent or precondition; the THEN part (the class prediction) is the rule consequent.
Rule-Based Classification
• If the condition in a rule antecedent holds true for a given tuple, we say that the rule antecedent is
satisfied and that the rule covers the tuple.
• Evaluation of a rule R:
• Let n_covers be the number of tuples covered by R,
• n_correct be the number of tuples correctly classified by R, and
• |D| be the number of tuples in D. Then
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
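An illustrative sketch of coverage(R) and accuracy(R) for rule R1 on the 14-tuple buys_computer data; `data` is the training list defined in the earlier naïve Bayes sketch, and evaluate_rule is an illustrative helper name.

def evaluate_rule(data, antecedent, consequent):
    """antecedent: list of (attribute position, required value) pairs."""
    covered = [row for row in data if all(row[i] == v for i, v in antecedent)]
    correct = [row for row in covered if row[-1] == consequent]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN Buys_computer = yes
# (attribute positions in each row: 0 = age, 2 = student)
print(evaluate_rule(data, [(0, "youth"), (2, "yes")], "yes"))   # (~0.143, 1.0)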
Data Mining: Classification and Prediction 56
Rule-Based Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
8 youth medium no fair no
9 youth low yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 57
Rule-Based Classification
• If a rule is satisfied by X, the rule is said to be triggered.
X= (age = youth, income = medium, student = yes, credit rating = fair)
• X satisfies the rule R1, which triggers the rule.
• If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.
• If more than one rule is triggered, we need a conflict resolution strategy.
• Size ordering: assigns the highest priority to the triggering rule that has the “toughest”
requirements
• Rule ordering: prioritizes the rules beforehand. The ordering may be class-based or rule-based.
• Class-based ordering: the classes are sorted in order of decreasing "importance"
• Rule-based ordering: the rules are organized into one long priority list
Data Mining: Classification and Prediction 58
Rule-Based Classification
• Extracting rules from a decision tree
• One rule is created for each path from the root to a leaf node.
• Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part).
• The leaf node holds the class prediction, forming the rule consequent (“THEN”
part).
Data Mining: Classification and Prediction 59
Rule-Based Classification
• Extracting rules from a decision tree
Data Mining: Classification and Prediction 60
age?
  youth → student? (no → no, yes → yes)
  middle_aged → yes
  senior → credit_rating? (fair → yes, excellent → no)
Rule-Based Classification
• The rules extracted from this decision tree are
R1: IF age = senior AND credit_rating = excellent THEN Buys_computer = no
R2: IF age = senior AND credit_rating = fair THEN Buys_computer = yes
R3: IF age = middle_aged THEN Buys_computer = yes
R4: IF age = youth AND student = yes THEN Buys_computer = yes
R5: IF age = youth AND student = no THEN Buys_computer = no
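A minimal sketch of this extraction in Python, assuming the tree is stored as nested dictionaries (the structure below mirrors the buys_computer tree above):

tree = {"age": {
    "youth": {"student": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):               # leaf node: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        print(f"IF {antecedent} THEN Buys_computer = {node}")
        return
    (attribute, branches), = node.items()        # internal node: one attribute test
    for value, child in branches.items():
        extract_rules(child, conditions + ((attribute, value),))

extract_rules(tree)   # prints one IF-THEN rule per root-to-leaf path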
Data Mining: Classification and Prediction 61
Data Mining: Classification and Prediction 62
Exercise: Extract the classification rules from the given decision tree
Data Mining: Classification and Prediction 63
X=(Color=Yellow, Type = SUV, Origin = Imported)
Prediction
• Numeric prediction is the task of predicting continuous (or ordered) values for
given input.
• Widely used approach for numeric prediction is regression.
• Regression is used to model the relationship between one or more independent
or predictor variables and a dependent or response variable.
• The predictor variables are the attributes of interest describing the tuple.
• The response variable is what we want to predict.
Data Mining: Classification and Prediction 64
Example: X = (age = youth, income = medium, student = yes, credit_rating = fair) are the predictor variables; Buys_computer = ? is the response variable.
Prediction: Linear Regression
• Straight-line regression analysis involves a response variable, 𝑦, and a single
predictor variable, 𝑥.
• Simplest regression technique which models 𝑦 as a linear function of 𝑥.
𝑦 = 𝑏 + 𝑤𝑥
• 𝑏 and 𝑤 are regression coefficients specifying the Y-intercept and slope of the line.
• Coefficients can also be thought as weights
𝑦 = 𝑤0 + 𝑤1𝑥
• These coefficients can be solved for by the method of least squares, which estimates
the best-fitting straight line as the one that minimizes the error between the actual
data and the estimate of the line.
Data Mining: Classification and Prediction 65
Prediction: Linear Regression
• The regression coefficients can be estimated
w1 = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)²
w0 = ȳ − w1 x̄
Data Mining: Classification and Prediction 66
Prediction: Linear Regression — Example
Age (x) | Avg. amount spent on medical expenses per month in Rs. (y)
15 100
20 135
25 135
37 150
40 250
45 270
48 290
50 360
55 375
61 400
64 500
67 1000
70 1500
Data Mining: Classification and Prediction 67
x̄ = 45.92, ȳ ≈ 420.38
The regression coefficients are
w1 = 16.89
w0 = −355.32
The equation of the least-squares (best-fitting) line is
y = −355.32 + 16.89x
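A short Python sketch (not from the slides) that reproduces the least-squares estimates from the tabulated age/expense data:

ages  = [15, 20, 25, 37, 40, 45, 48, 50, 55, 61, 64, 67, 70]
spend = [100, 135, 135, 150, 250, 270, 290, 360, 375, 400, 500, 1000, 1500]

x_bar = sum(ages) / len(ages)
y_bar = sum(spend) / len(spend)
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, spend))
den = sum((x - x_bar) ** 2 for x in ages)
w1 = num / den                      # slope, roughly 16.89
w0 = y_bar - w1 * x_bar             # intercept, roughly -355
print(round(w1, 3), round(w0, 2))
print(round(w0 + w1 * 35, 2))       # predicted monthly expense at age 35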
[Figure: Scatter plot of age (x) vs. average monthly medical expenses (y) for the same 13 tuples, with the fitted least-squares trend line y = 16.891x − 355.32]
Classifier Accuracy Measures
Data Mining: Classification and Prediction 69
• Confusion Matrix:
• Given m classes, a confusion matrix is a table of at least size m by m,
• where the entry in row i and column j shows the number of tuples of class i that were
labeled by the classifier as class j.
Actual \ Predicted | Class – Low | Class – Medium | Class – High
Class – Low        |     250     |       10       |      0
Class – Medium     |      10     |      440       |     10
Class – High       |       0     |       10       |    270
Data Mining: Classification and Prediction 70
(1000 tuples in total)
Classifier Accuracy Measures
Data Mining: Classification and Prediction 71
• Classifier Accuracy
• The percentage of test set tuples that are correctly classified by the classifier.
• Also referred to as the overall recognition rate of the classifier.
• Error Measure
• An error rate or misclassification rate of a classifier M, which is simply
1 − 𝐴𝑐𝑐 𝑀
where 𝐴𝑐𝑐(𝑀) is the accuracy of M.
Classifier Accuracy Measures
Data Mining: Classification and Prediction 72
• Confusion Matrix: Given 2 classes
• Positive tuples:
• tuples of the main class of interest
• Negative tuples: all other tuples (tuples not of the main class of interest)
• True Positive:
• The positive tuples that were correctly labeled by the classifier
• True negatives
• The negative tuples that were correctly labeled by the classifier
• False positives
• The negative tuples that were incorrectly labeled
• False negatives
• The positive tuples that were incorrectly labeled
Classifier Accuracy Measures
Data Mining: Classification and Prediction 73
• We would like to be able to assess how well the classifier can recognize the
positive tuples and how well it can recognize the negative tuples.
• Sensitivity (true positive (recognition) rate)
• The proportion of positive tuples that are correctly identified.
sensitivity = t_pos / pos
• Specificity (true negative rate)
• The proportion of negative tuples that are correctly identified.
specificity = t_neg / neg
• Precision
precision = t_pos / (t_pos + f_pos)
Classifier Accuracy Measures
Data Mining: Classification and Prediction 74
• It can be shown that accuracy is a function of sensitivity and specificity.
accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg)
accuracy = (t_pos + t_neg) / (total number of tuples)
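The two-class measures above can be computed directly from the four counts; the sketch below uses hypothetical counts for illustration.

def classifier_measures(t_pos, f_neg, t_neg, f_pos):
    pos, neg = t_pos + f_neg, t_neg + f_pos
    return {
        "accuracy":    (t_pos + t_neg) / (pos + neg),
        "sensitivity": t_pos / pos,              # true positive rate
        "specificity": t_neg / neg,              # true negative rate
        "precision":   t_pos / (t_pos + f_pos),
    }

# hypothetical counts: 90 of 100 positives and 210 of 230 negatives are correct
print(classifier_measures(t_pos=90, f_neg=10, t_neg=210, f_pos=20))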
Predictor Accuracy Measures
Data Mining: Classification and Prediction 75
• Instead of focusing on whether the predicted value 𝑦′𝑖 is an “exact” match with actual
value 𝑦𝑖 , we check how far off the predicted value is from the actual known value.
• Loss functions measures the error between the actual value 𝑦𝑖 and the predicted value
𝑦′𝑖.
Absolute error: |y_i − y'_i|
Squared error: (y_i − y'_i)²
• The test error (rate), or generalization error, is the average loss over the test set.
Mean absolute error = ( Σ_{i=1}^{d} |y_i − y'_i| ) / d
Mean squared error = ( Σ_{i=1}^{d} (y_i − y'_i)² ) / d
• If we were to take the square root of the mean squared error, the resulting error measure
is called the root mean squared error.
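A small Python sketch of these error measures on hypothetical predicted/actual values:

from math import sqrt

actual    = [3.0, 5.0, 2.5, 7.0]     # hypothetical actual values y_i
predicted = [2.5, 5.0, 4.0, 8.0]     # hypothetical predicted values y'_i

d = len(actual)
mae  = sum(abs(y - yp) for y, yp in zip(actual, predicted)) / d
mse  = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / d
rmse = sqrt(mse)
print(mae, mse, round(rmse, 3))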
Predictor Accuracy Measures
Data Mining: Classification and Prediction 76
• Relative measures of error include
Relative absolute error = Σ_{i=1}^{d} |y_i − y'_i| / Σ_{i=1}^{d} |y_i − ȳ|
Relative squared error = Σ_{i=1}^{d} (y_i − y'_i)² / Σ_{i=1}^{d} (y_i − ȳ)²
• We can take the root of the relative squared error to obtain the root relative squared
error so that the resulting error is of the same magnitude as the quantity predicted.
Accuracy Measures
Data Mining: Classification and Prediction 77
• Evaluating the Accuracy of a Classifier or Predictor
Holdout
Random Subsampling
Cross Validation
Bootstrap
Accuracy Measures
Data Mining: Classification and Prediction 78
• Holdout
• The given data are randomly partitioned into two independent sets, a training set and a
test set.
• Two-thirds of the data are allocated to the training set, and the remaining one-third is
allocated to the test set.
[Figure: Holdout — the data are split into a training set (used to derive the model) and a test set (used to estimate accuracy)]
Accuracy Measures
Data Mining: Classification and Prediction 79
• Random Subsampling
• A variation of the holdout method in which the holdout method is repeated 𝒌 times.
• The overall accuracy estimate is taken as the average of the accuracies obtained from
each iteration
[Figure: Random subsampling — in each of the k iterations, the data are randomly split into a training set (derive model) and a test set (estimate accuracy)]
Accuracy Measures
Data Mining: Classification and Prediction 80
• Cross Validation
[Figure: k-fold cross-validation — the data are divided into k mutually exclusive folds D1, D2, …, Dk; in iteration i, fold Di is the test set and the remaining folds form the training set]
Accuracy Measures
Data Mining: Classification and Prediction 81
• Cross Validation
• Each sample is used the same number of times for training and once for testing.
• For Classification, the accuracy estimate is the overall number of correct classifications
from the k iterations, divided by the total number of tuples in the initial data.
• For Prediction, the error estimate can be computed as the total loss from the k iterations,
divided by the total number of initial tuples.
• Leave-one-out
• 𝑘 is set to the number of initial tuples. So, only one sample is “left out” at a time for the test set.
• Stratified cross-validation
• The folds are stratified so that the class distribution of the tuples in each fold is approximately
the same as that in the initial data
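A minimal sketch of the k-fold bookkeeping (no particular learner is assumed; train_and_count_correct is a placeholder supplied by the caller that returns the number of correctly classified test tuples):

import random

def k_fold_accuracy(tuples, k, train_and_count_correct):
    data = list(tuples)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal, mutually exclusive folds
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct += train_and_count_correct(train, test)
    return correct / len(data)   # overall correct classifications / total tuples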
Accuracy Measures
Data Mining: Classification and Prediction 82
• Bootstrap
• The bootstrap method samples the given training tuples uniformly with replacement.
• i.e. each time a tuple is selected, it is equally likely to be selected again and readded to the
training set.
• .632 Bootstrap
• On an average, 63.2% of the original data tuples will end up in the bootstrap, and the remaining
36.8% will form the test set
• Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d).
• We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d.
• If d is large, this probability approaches e⁻¹ ≈ 0.368.
• Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the
remaining 63.2% will form the training set.
Accuracy Measures
Data Mining: Classification and Prediction 83
• Bootstrap
• .632 Bootstrap
• Repeat the sampling procedure 𝑘 times, where in each iteration, we use the current test set to
obtain an accuracy estimate of the model obtained from the current bootstrap sample.
• The overall accuracy of the model is
Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set )
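A hedged sketch of one bootstrap split in Python (sample |D| tuples with replacement for training; the never-drawn tuples form the test set):

import random

def bootstrap_split(data):
    d = len(data)
    chosen = [random.randrange(d) for _ in range(d)]   # sample d indices with replacement
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]
    test = [row for i, row in enumerate(data) if i not in chosen_set]
    return train, test   # on average ~63.2% / ~36.8% of the distinct tuples

# Overall estimate over k iterations (per-iteration accuracies assumed available):
# acc_m = sum(0.632 * acc_test[i] + 0.368 * acc_train[i] for i in range(k))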

More Related Content

What's hot

multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
moni sindhu
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
Acad
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
lavanya marichamy
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Salah Amean
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
Krish_ver2
 
Data Preprocessing
Data PreprocessingData Preprocessing
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
maha797959
 
K means clustering
K means clusteringK means clustering
K means clustering
keshav goyal
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
DataminingTools Inc
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
Khwaja Aamer
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
Valerii Klymchuk
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Salah Amean
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Gajanand Sharma
 
Data preparation
Data preparationData preparation
Data preparation
Tony Nguyen
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 

What's hot (20)

multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Kmeans
KmeansKmeans
Kmeans
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preparation
Data preparationData preparation
Data preparation
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 

Similar to Classification in Data Mining

classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
321106410027
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
malathieswaran29
 
unit classification.pptx
unit  classification.pptxunit  classification.pptx
unit classification.pptx
ssuser908de6
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
Rvishnupriya2
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
Rvishnupriya2
 
Dataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptxDataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptx
HimanshuSharma997566
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
AsrithaKorupolu
 
Unit 3classification
Unit 3classificationUnit 3classification
Unit 3classification
Kalpna Saharan
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Salah Amean
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
engrasi
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
ritumysterious1
 
Data Mining
Data MiningData Mining
Data Mining
IIIT ALLAHABAD
 
Chapter 4.pdf
Chapter 4.pdfChapter 4.pdf
Chapter 4.pdf
DrGnaneswariG
 
08ClassBasic VT.ppt
08ClassBasic VT.ppt08ClassBasic VT.ppt
08ClassBasic VT.ppt
GaneshaAdhik
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
GauravWani20
 
Chapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.pptChapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.ppt
Subrata Kumer Paul
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
harsh708944
 
Basics of Classification.ppt
Basics of Classification.pptBasics of Classification.ppt
Basics of Classification.ppt
NBACriteria2SICET
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
Krish_ver2
 

Similar to Classification in Data Mining (20)

classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
 
unit classification.pptx
unit  classification.pptxunit  classification.pptx
unit classification.pptx
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
 
Dataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptxDataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptx
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
 
Unit 3classification
Unit 3classificationUnit 3classification
Unit 3classification
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
 
Data Mining
Data MiningData Mining
Data Mining
 
Chapter 4.pdf
Chapter 4.pdfChapter 4.pdf
Chapter 4.pdf
 
08ClassBasic VT.ppt
08ClassBasic VT.ppt08ClassBasic VT.ppt
08ClassBasic VT.ppt
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
 
Chapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.pptChapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.ppt
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
 
Basics of Classification.ppt
Basics of Classification.pptBasics of Classification.ppt
Basics of Classification.ppt
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
 

More from Rashmi Bhat

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
Rashmi Bhat
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
Rashmi Bhat
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
Rashmi Bhat
 
Process Scheduling in OS
Process Scheduling in OSProcess Scheduling in OS
Process Scheduling in OS
Rashmi Bhat
 
Introduction to Operating System
Introduction to Operating SystemIntroduction to Operating System
Introduction to Operating System
Rashmi Bhat
 
The Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfThe Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdf
Rashmi Bhat
 
Module 1 VR.pdf
Module 1 VR.pdfModule 1 VR.pdf
Module 1 VR.pdf
Rashmi Bhat
 
OLAP
OLAPOLAP
Spatial Data Mining
Spatial Data MiningSpatial Data Mining
Spatial Data Mining
Rashmi Bhat
 
Web mining
Web miningWeb mining
Web mining
Rashmi Bhat
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
Rashmi Bhat
 
Clustering
ClusteringClustering
Clustering
Rashmi Bhat
 
ETL Process
ETL ProcessETL Process
ETL Process
Rashmi Bhat
 
Data Warehouse Fundamentals
Data Warehouse FundamentalsData Warehouse Fundamentals
Data Warehouse Fundamentals
Rashmi Bhat
 
Virtual Reality
Virtual Reality Virtual Reality
Virtual Reality
Rashmi Bhat
 
Introduction To Virtual Reality
Introduction To Virtual RealityIntroduction To Virtual Reality
Introduction To Virtual Reality
Rashmi Bhat
 
Graph Theory
Graph TheoryGraph Theory
Graph Theory
Rashmi Bhat
 

More from Rashmi Bhat (17)

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Process Scheduling in OS
Process Scheduling in OSProcess Scheduling in OS
Process Scheduling in OS
 
Introduction to Operating System
Introduction to Operating SystemIntroduction to Operating System
Introduction to Operating System
 
The Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfThe Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdf
 
Module 1 VR.pdf
Module 1 VR.pdfModule 1 VR.pdf
Module 1 VR.pdf
 
OLAP
OLAPOLAP
OLAP
 
Spatial Data Mining
Spatial Data MiningSpatial Data Mining
Spatial Data Mining
 
Web mining
Web miningWeb mining
Web mining
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
 
Clustering
ClusteringClustering
Clustering
 
ETL Process
ETL ProcessETL Process
ETL Process
 
Data Warehouse Fundamentals
Data Warehouse FundamentalsData Warehouse Fundamentals
Data Warehouse Fundamentals
 
Virtual Reality
Virtual Reality Virtual Reality
Virtual Reality
 
Introduction To Virtual Reality
Introduction To Virtual RealityIntroduction To Virtual Reality
Introduction To Virtual Reality
 
Graph Theory
Graph TheoryGraph Theory
Graph Theory
 

Recently uploaded

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Classification in Data Mining

Decision Tree Induction
• How are decision trees used for classification?
  • Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
  • A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
• Advantages of decision trees:
  • Do not require any domain knowledge or parameter setting
  • Can handle high-dimensional data
  • The learning and classification steps of decision tree induction are simple and fast
  • Have good accuracy
Data Mining: Classification and Prediction 10
Decision Tree Induction
• Attribute selection measures
  • Used to select the attribute that best partitions the tuples into distinct classes.
  • Common measures: Information Gain, Gain Ratio, Gini Index
• ID3 (Iterative Dichotomiser) is a well-known decision tree algorithm.
• C4.5, the successor of ID3, is often used as a benchmark against which newer supervised learning algorithms are compared.
• Classification and Regression Trees (CART) is another decision tree algorithm.
• These algorithms adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
Data Mining: Classification and Prediction 11
Decision Tree Induction
• Information Gain
  • ID3 uses information gain as its attribute selection measure.
  • The expected information needed to classify a tuple in D is given by
        Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
    where p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated as |C_{i,D}| / |D|.
  • Info(D) is also known as the entropy of D.
  • The expected information required to classify a tuple from D based on the partitioning by attribute A is
        Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Data Mining: Classification and Prediction 12
Decision Tree Induction
• Information Gain
  • Information gain is defined as the difference between the original information requirement and the new requirement:
        Gain(A) = Info(D) - Info_A(D)
  • The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
Data Mining: Classification and Prediction 13
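Both quantities are straightforward to compute by counting. The following is a minimal Python sketch (not part of the original slides); the function and variable names are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at position attr_index."""
    total = len(labels)
    # Partition the class labels by the attribute's distinct values
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum((len(part) / total) * entropy(part)
                 for part in partitions.values())
    return entropy(labels) - info_a
```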
Decision Tree Induction
• Decision Tree Generation Algorithm
• Input:
  • Data partition, D, which is a set of training tuples and their associated class labels;
  • Attribute_list, the set of candidate attributes;
  • Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or a splitting subset.
Data Mining: Classification and Prediction 14
Decision Tree Induction
• Generate_decision_tree Algorithm
• Method:
  1. create a node N;
  2. if the tuples in D are all of the same class, C, then
  3.   return N as a leaf node labeled with the class C;
  4. if Attribute_list is empty then
  5.   return N as a leaf node labeled with the majority class in D;
  6. apply Attribute_selection_method(D, Attribute_list) to find the "best" splitting criterion;
  7. label node N with the splitting criterion;
Data Mining: Classification and Prediction 15
Decision Tree Induction
• Generate_decision_tree Algorithm
• Method (continued):
  8. if the splitting attribute is discrete-valued and multiway splits are allowed then
  9.   Attribute_list ← Attribute_list − Splitting_attribute;   // remove the splitting attribute
  10. for each outcome j of the splitting criterion
  11.   let Dj be the set of data tuples in D satisfying outcome j;   // a partition
  12.   if Dj is empty then
  13.     attach a leaf labeled with the majority class in D to node N;
  14.   else attach the node returned by Generate_decision_tree(Dj, Attribute_list) to node N;
  15. end for
  16. return N;
Data Mining: Classification and Prediction 16
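For illustration, the pseudocode above can be turned into a compact recursive Python sketch. It reuses the entropy/info_gain helpers from the earlier sketch and returns nested dictionaries instead of explicit node objects; this is an informal sketch, not the algorithm's reference implementation.

```python
from collections import Counter

def generate_decision_tree(rows, labels, attribute_list):
    """Top-down, recursive, divide-and-conquer tree construction (ID3-style).

    rows           : list of tuples of attribute values
    labels         : class label of each row
    attribute_list : indices of the candidate attributes
    Returns a class label (leaf) or {"attribute": index, "branches": {value: subtree}}.
    """
    if len(set(labels)) == 1:                  # steps 2-3: pure partition -> leaf
        return labels[0]
    if not attribute_list:                     # steps 4-5: no attributes left
        return Counter(labels).most_common(1)[0][0]
    # Step 6: choose the attribute with the highest information gain
    best = max(attribute_list, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attribute_list if a != best]        # step 9
    node = {"attribute": best, "branches": {}}
    for value in set(row[best] for row in rows):   # steps 10-14: one branch per outcome
        subset = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [c for _, c in subset]
        node["branches"][value] = generate_decision_tree(sub_rows, sub_labels, remaining)
    return node
```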
Example

Patient ID | Age   | Sex | BP     | Cholesterol | Class: Drug
P1         | <=30  | F   | High   | Normal      | Drug A
P2         | <=30  | F   | High   | High        | Drug A
P3         | 31…50 | F   | High   | Normal      | Drug B
P4         | >50   | F   | Normal | Normal      | Drug B
P5         | >50   | M   | Low    | Normal      | Drug B
P6         | >50   | M   | Low    | High        | Drug A
P7         | 31…50 | M   | Low    | High        | Drug B
P8         | <=30  | F   | Normal | Normal      | Drug A
P9         | <=30  | M   | Low    | Normal      | Drug B
P10        | >50   | M   | Normal | Normal      | Drug B
P11        | <=30  | M   | Normal | High        | Drug B
P12        | 31…50 | F   | Normal | High        | Drug B
P13        | 31…50 | M   | High   | Normal      | Drug B
P14        | >50   | F   | Normal | High        | Drug A
P15        | 31…50 | F   | Low    | Normal      | ?
Data Mining: Classification and Prediction 17
Example
• Reduced Training Data
• Establish the target classification: which drug to advise?
  • 5/14 tuples → Drug A
  • 9/14 tuples → Drug B

Age   | Gender | BP     | Cholesterol | Class: Drug
<=30  | F      | High   | Normal      | Drug A
<=30  | F      | High   | High        | Drug A
31…50 | F      | High   | Normal      | Drug B
>50   | F      | Normal | Normal      | Drug B
>50   | M      | Low    | Normal      | Drug B
>50   | M      | Low    | High        | Drug A
31…50 | M      | Low    | High        | Drug B
<=30  | F      | Normal | Normal      | Drug A
<=30  | M      | Low    | Normal      | Drug B
>50   | M      | Normal | Normal      | Drug B
<=30  | M      | Normal | High        | Drug B
31…50 | F      | Normal | High        | Drug B
31…50 | M      | High   | Normal      | Drug B
>50   | F      | Normal | High        | Drug A
Data Mining: Classification and Prediction 18
Example
• Calculate the expected information (entropy) of the class attribute Drug:
      Info(D) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14} = 0.9403
• Calculate the information gain of the remaining attributes to determine the root node.
Data Mining: Classification and Prediction 19
Example
• Attribute: Age
  • <=30 → 5 tuples, 31…50 → 4 tuples, >50 → 5 tuples
  • Age has 3 distinct values, so we need 3 entropy calculations (taking 0·log2 0 = 0):
      <=30  (3 Drug A, 2 Drug B):  Info(<=30)  = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.9710
      31…50 (0 Drug A, 4 Drug B):  Info(31…50) = -\frac{0}{4}\log_2\frac{0}{4} - \frac{4}{4}\log_2\frac{4}{4} = 0
      >50   (2 Drug A, 3 Drug B):  Info(>50)   = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.9710
  • Gain(Age) = Info(D) - [\frac{5}{14} Info(<=30) + \frac{4}{14} Info(31…50) + \frac{5}{14} Info(>50)] = 0.9403 - 0.6936 = 0.2467
Data Mining: Classification and Prediction 21
Example
• Attribute: Gender
  • M → 7 tuples, F → 7 tuples
  • Gender has 2 distinct values, so we need 2 entropy calculations:
      F (4 Drug A, 3 Drug B):  Info(F) = -\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7} \approx 0.9852
      M (1 Drug A, 6 Drug B):  Info(M) = -\frac{1}{7}\log_2\frac{1}{7} - \frac{6}{7}\log_2\frac{6}{7} \approx 0.5917
  • Gain(Gender) = Info(D) - [\frac{7}{14} Info(F) + \frac{7}{14} Info(M)] = 0.9403 - 0.7885 = 0.1518
Data Mining: Classification and Prediction 23
Example
• Attribute: BP
  • High → 4 tuples, Normal → 6 tuples, Low → 4 tuples
  • BP has 3 distinct values, so we need 3 entropy calculations:
      High   (2 Drug A, 2 Drug B):  Info(High)   = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1.0000
      Normal (2 Drug A, 4 Drug B):  Info(Normal) = -\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6} \approx 0.9183
      Low    (1 Drug A, 3 Drug B):  Info(Low)    = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.8113
  • Gain(BP) = Info(D) - [\frac{4}{14} Info(High) + \frac{6}{14} Info(Normal) + \frac{4}{14} Info(Low)] = 0.9403 - 0.9111 = 0.0292
Data Mining: Classification and Prediction 25
Partition of the training data on attribute Cholesterol:

Cholesterol | Class: Drug
High        | Drug A
High        | Drug A
High        | Drug B
High        | Drug B
High        | Drug B
High        | Drug A

Cholesterol | Class: Drug
Normal      | Drug A
Normal      | Drug B
Normal      | Drug B
Normal      | Drug B
Normal      | Drug A
Normal      | Drug B
Normal      | Drug B
Normal      | Drug B
Data Mining: Classification and Prediction 26
Example
• Attribute: Cholesterol
  • High → 6 tuples, Normal → 8 tuples
  • Cholesterol has 2 distinct values, so we need 2 entropy calculations:
      High   (3 Drug A, 3 Drug B):  Info(High)   = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1.0000
      Normal (2 Drug A, 6 Drug B):  Info(Normal) = -\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8} \approx 0.8113
  • Gain(Cholesterol) = Info(D) - [\frac{6}{14} Info(High) + \frac{8}{14} Info(Normal)] = 0.9403 - 0.8922 = 0.0481
Data Mining: Classification and Prediction 27
Example
• Recap of the information gains:
      Gain(Age)         = 0.2467
      Gain(Gender)      = 0.1518
      Gain(BP)          = 0.0292
      Gain(Cholesterol) = 0.0481
• Age has the highest information gain, so we choose Age as the root node.
• Branches of the root: <=30 → ?, 31…50 → Drug B, >50 → ?
• Repeat the steps for the two "?" branches.
Data Mining: Classification and Prediction 29
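As a check, the helper functions sketched earlier reproduce these numbers on the reduced drug training data. The variable names below are illustrative, and the entropy/info_gain functions are assumed to be the ones defined in the first sketch.

```python
# Reduced drug training data: (Age, Gender, BP, Cholesterol) -> Class: Drug
drug_rows = [
    ("<=30",  "F", "High",   "Normal"), ("<=30",  "F", "High",   "High"),
    ("31-50", "F", "High",   "Normal"), (">50",   "F", "Normal", "Normal"),
    (">50",   "M", "Low",    "Normal"), (">50",   "M", "Low",    "High"),
    ("31-50", "M", "Low",    "High"),   ("<=30",  "F", "Normal", "Normal"),
    ("<=30",  "M", "Low",    "Normal"), (">50",   "M", "Normal", "Normal"),
    ("<=30",  "M", "Normal", "High"),   ("31-50", "F", "Normal", "High"),
    ("31-50", "M", "High",   "Normal"), (">50",   "F", "Normal", "High"),
]
drug_labels = ["Drug A", "Drug A", "Drug B", "Drug B", "Drug B", "Drug A", "Drug B",
               "Drug A", "Drug B", "Drug B", "Drug B", "Drug B", "Drug B", "Drug A"]

print(round(entropy(drug_labels), 4))                          # 0.9403
for i, name in enumerate(["Age", "Gender", "BP", "Cholesterol"]):
    print(name, round(info_gain(drug_rows, drug_labels, i), 4))
# Age 0.2467, Gender 0.1518, BP 0.0292, Cholesterol 0.0481
```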
Example
• Expanding the "?" branches gives the final decision tree:
      Age?
        <=30  → Gender?       (Male → Drug B, Female → Drug A)
        31…50 → Drug B
        >50   → Cholesterol?  (Normal → Drug B, High → Drug A)
Data Mining: Classification and Prediction 30
Decision Tree Induction
• What if the splitting attribute A is continuous-valued?
  • The test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, respectively,
  • where split_point is the split-point returned by the attribute selection method as part of the splitting criterion.
• What if A is discrete-valued and a binary tree must be produced?
  • The test at node N is of the form "A ∈ S_A",
  • where S_A is the splitting subset for A returned by the attribute selection method as part of the splitting criterion.
Data Mining: Classification and Prediction 31
Decision Tree Induction
(Figure: three partitioning scenarios for a splitting attribute, labeled 1–3.)
Data Mining: Classification and Prediction 32
Attribute Selection Measures
• Gain Ratio
  • The information gain measure is biased toward tests with many outcomes.
  • C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio.
  • It applies a kind of normalization to information gain using a "split information" value, defined analogously to Info(D) as
        SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
  • This represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
  • The gain ratio is defined as
        GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
  • The attribute with the maximum gain ratio is selected as the splitting attribute.
Data Mining: Classification and Prediction 33
Attribute Selection Measures
• Gain Ratio
  • Computation of the gain ratio for the attribute Weight.
  • The attribute Weight has three values, Heavy, Average, and Light, containing 5, 6, and 4 tuples respectively (15 tuples in total).
        SplitInfo_Weight(D) = -\frac{5}{15}\log_2\frac{5}{15} - \frac{6}{15}\log_2\frac{6}{15} - \frac{4}{15}\log_2\frac{4}{15} \approx 1.5656
        Gain(Weight) = 0.0622
        GainRatio(Weight) = \frac{0.0622}{1.5656} \approx 0.040
Data Mining: Classification and Prediction 34
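A small sketch of the same computation (illustrative names; the Gain(Weight) value is taken from the slide):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split of D into partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

# Weight: Heavy -> 5 tuples, Average -> 6, Light -> 4; Gain(Weight) = 0.0622
print(round(split_info([5, 6, 4]), 4))          # ≈ 1.5656
print(round(gain_ratio(0.0622, [5, 6, 4]), 3))  # ≈ 0.040
```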
Attribute Selection Measures
• Gini Index
  • The Gini index is used in CART.
  • The Gini index measures the impurity of D, a data partition or set of training tuples, as
        Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
    where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|. The sum is computed over the m classes.
  • The Gini index considers a binary split for each attribute.
  • If a binary split on A partitions D into D_1 and D_2, the Gini index of D given that partitioning is
        Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
Data Mining: Classification and Prediction 35
Attribute Selection Measures
• Gini Index
  • For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.
  • For continuous-valued attributes, each possible split-point must be considered.
  • The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
        \Delta Gini(A) = Gini(D) - Gini_A(D)
  • The attribute that maximizes the reduction in impurity is selected as the splitting attribute.
Data Mining: Classification and Prediction 36
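The impurity calculations translate directly into code. The sketch below is illustrative; the 9-yes/5-no labels mirror the Buys_computer data used later, and the particular binary partition is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2: impurity of a set of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    """Gini_A(D) for a binary split of D into partitions D1 and D2."""
    total = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / total) * gini(d1_labels) \
         + (len(d2_labels) / total) * gini(d2_labels)

labels = ["yes"] * 9 + ["no"] * 5          # class distribution of D
d1, d2 = labels[:10], labels[10:]          # a hypothetical binary partition of D
delta_gini = gini(labels) - gini_split(d1, d2)
print(round(gini(labels), 3), round(delta_gini, 3))   # 0.459, 0.331
```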
Bayesian Classification
• Bayesian classifiers are statistical classifiers.
  • They predict class membership probabilities.
  • They are based on Bayes' theorem.
  • They exhibit high accuracy and speed when applied to large databases.
• A simple Bayesian classifier is known as the naïve Bayesian classifier.
  • It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes: class conditional independence.
• Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.
Data Mining: Classification and Prediction 39
Bayesian Classification
• Bayes' Theorem
  • Let X be a data tuple (X is considered "evidence").
  • Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
  • We want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X:
        P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}    … Bayes' theorem
  • P(H|X) is the posterior probability of H conditioned on X.
  • P(H) is the prior probability of H.
  • P(X|H) is the posterior probability of X conditioned on H.
  • P(X) is the prior probability of X.
  • "How are these probabilities estimated?"
Data Mining: Classification and Prediction 40
Naïve Bayesian Classification
• A simple Bayesian classifier is known as the naïve Bayesian classifier.
• It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes: class conditional independence.
• This assumption is made to simplify the computations involved and, in this sense, is considered "naïve."
Data Mining: Classification and Prediction 41
Naïve Bayesian Classification
• Let D be a training set of tuples and their associated class labels.
• Suppose that there are m classes, C_1, C_2, …, C_m.
• Given a tuple X = (x_1, x_2, …, x_n), depicting n measurements made on the tuple from n attributes, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.
• The naïve Bayesian classifier predicts that tuple X belongs to the class C_i if and only if
      P(C_i|X) > P(C_j|X)   for 1 <= j <= m, j ≠ i
• We therefore maximize P(C_i|X).
• The class C_i for which P(C_i|X) is maximized is called the maximum posteriori hypothesis.
Data Mining: Classification and Prediction 42
Naïve Bayesian Classification
• By Bayes' theorem,
      P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}
• As P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized.
• The naïve Bayesian classifier predicts that tuple X belongs to the class C_i if and only if
      P(C_i|X) > P(C_j|X)   for 1 <= j <= m, j ≠ i
• The class C_i for which P(C_i|X) is maximized is called the maximum posteriori hypothesis.
• The class prior probabilities may be estimated by
      P(C_i) = \frac{|C_{i,D}|}{|D|}
  where |C_{i,D}| is the number of training tuples of class C_i in D.
Data Mining: Classification and Prediction 43
Naïve Bayesian Classification
• In order to reduce the computation involved in evaluating P(X|C_i), the naïve assumption of class conditional independence is made:
      P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
• In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers; in practice this is not always the case, owing to inaccuracies in the assumptions made and in the estimated probabilities.
Data Mining: Classification and Prediction 44
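Because every factor is just a ratio of counts, a categorical naïve Bayesian classifier can be sketched in a few lines of Python. This is an illustrative sketch only; names such as nb_fit and nb_score are not from the slides.

```python
from collections import Counter, defaultdict

def nb_fit(rows, labels):
    """Estimate P(Ci) and the counts behind P(xk|Ci) from categorical training data."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond, Counter(labels)

def nb_score(x, c, priors, cond, class_counts):
    """P(X|Ci) * P(Ci) under the class-conditional independence assumption."""
    p = priors[c]
    for k, value in enumerate(x):
        p *= cond[(c, k)][value] / class_counts[c]
    return p

def nb_predict(x, priors, cond, class_counts):
    return max(priors, key=lambda c: nb_score(x, c, priors, cond, class_counts))
```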
Naïve Bayesian Classification: Example

RID | age         | income | student | credit_rating | class: Buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
3   | middle_aged | high   | no      | fair          | yes
4   | senior      | medium | no      | fair          | yes
5   | senior      | low    | yes     | fair          | yes
6   | senior      | low    | yes     | excellent     | no
7   | middle_aged | low    | yes     | excellent     | yes
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
10  | senior      | medium | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no
Data Mining: Classification and Prediction 45
Naïve Bayesian Classification: Example
• Let C_1 be the class Buys_computer = yes and C_2 be the class Buys_computer = no.
• The tuple we wish to classify is
      X = (age = youth, income = medium, student = yes, credit_rating = fair)
• We need to maximize P(X|C_i) P(C_i), for i = 1, 2.
• Calculate P(C_i), for i = 1, 2:
  • P(Buys_computer = yes) = 9/14 = 0.643
  • P(Buys_computer = no)  = 5/14 = 0.357
• Next, calculate P(X|C_i), for i = 1, 2.
Data Mining: Classification and Prediction 46
Naïve Bayesian Classification: Example
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Calculate P(X|C_i), for i = 1, 2:
  • P(age = youth | Buys_computer = yes)          = 2/9 = 0.222
  • P(age = youth | Buys_computer = no)           = 3/5 = 0.600
  • P(income = medium | Buys_computer = yes)      = 4/9 = 0.444
  • P(income = medium | Buys_computer = no)       = 2/5 = 0.400
  • P(student = yes | Buys_computer = yes)        = 6/9 = 0.667
  • P(student = yes | Buys_computer = no)         = 1/5 = 0.200
  • P(credit_rating = fair | Buys_computer = yes) = 6/9 = 0.667
  • P(credit_rating = fair | Buys_computer = no)  = 2/5 = 0.400
Data Mining: Classification and Prediction 48
Naïve Bayesian Classification: Example
• From the above probabilities we calculate
      P(X | Buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
      P(X | Buys_computer = no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• To find the class C_i that maximizes P(X|C_i) P(C_i), we compute
      P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.044 × 0.643 = 0.028
      P(X | Buys_computer = no)  P(Buys_computer = no)  = 0.019 × 0.357 = 0.007
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
Data Mining: Classification and Prediction 49
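Applying the nb_fit/nb_score sketch from above to this training set reproduces the same numbers (the variable names are illustrative):

```python
# AllElectronics training data: (age, income, student, credit_rating) -> Buys_computer
rows = [
    ("youth", "high", "no", "fair"),            ("youth", "high", "no", "excellent"),
    ("middle_aged", "high", "no", "fair"),      ("senior", "medium", "no", "fair"),
    ("senior", "low", "yes", "fair"),           ("senior", "low", "yes", "excellent"),
    ("middle_aged", "low", "yes", "excellent"), ("youth", "medium", "no", "fair"),
    ("youth", "low", "yes", "fair"),            ("senior", "medium", "yes", "fair"),
    ("youth", "medium", "yes", "excellent"),    ("middle_aged", "medium", "no", "excellent"),
    ("middle_aged", "high", "yes", "fair"),     ("senior", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

priors, cond, counts = nb_fit(rows, labels)
x = ("youth", "medium", "yes", "fair")
print(round(nb_score(x, "yes", priors, cond, counts), 3))   # ≈ 0.028
print(round(nb_score(x, "no",  priors, cond, counts), 3))   # ≈ 0.007
print(nb_predict(x, priors, cond, counts))                  # yes
```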
Naïve Bayesian Classification: Example
• Note: this training set differs from the earlier one only in RID 6, whose class is now yes.

RID | age         | income | student | credit_rating | class: Buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
3   | middle_aged | high   | no      | fair          | yes
4   | senior      | medium | no      | fair          | yes
5   | senior      | low    | yes     | fair          | yes
6   | senior      | low    | yes     | excellent     | yes
7   | middle_aged | low    | yes     | excellent     | yes
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
10  | senior      | medium | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no

• Calculate P(C_i), for i = 1, 2:
  • P(Buys_computer = yes) = 10/14 = 0.714
  • P(Buys_computer = no)  = 4/14 = 0.286
Data Mining: Classification and Prediction 50
Naïve Bayesian Classification: Example
• Calculate P(X|C_i), for i = 1, 2:
  • P(age = youth | Buys_computer = yes)          = 2/10 = 0.200
  • P(age = youth | Buys_computer = no)           = 3/4 = 0.750
  • P(income = medium | Buys_computer = yes)      = 4/10 = 0.400
  • P(income = medium | Buys_computer = no)       = 2/4 = 0.500
  • P(student = yes | Buys_computer = yes)        = 7/10 = 0.700
  • P(student = yes | Buys_computer = no)         = 0/4 = 0
  • P(credit_rating = fair | Buys_computer = yes) = 6/10 = 0.600
  • P(credit_rating = fair | Buys_computer = no)  = 2/4 = 0.500
Data Mining: Classification and Prediction 51
Naïve Bayesian Classification: Example
• From the above probabilities we calculate
      P(X | Buys_computer = yes) = 0.200 × 0.400 × 0.700 × 0.600 = 0.034
      P(X | Buys_computer = no)  = 0.750 × 0.500 × 0 × 0.500 = 0
• To find the class C_i that maximizes P(X|C_i) P(C_i), we compute
      P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.034 × 0.714 = 0.024
      P(X | Buys_computer = no)  P(Buys_computer = no)  = 0 × 0.286 = 0
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
• Is this a correct classification?
Data Mining: Classification and Prediction 52
Naïve Bayesian Classification
• A zero probability cancels the effects of all of the other (posterior) probabilities (on C_i) involved in the product.
• To avoid the effect of a zero probability value, the Laplacian correction (Laplace estimator) is used.
• We add one to each count.
Data Mining: Classification and Prediction 53
Naïve Bayesian Classification
• Example: suppose a training database D has 1,500 tuples, of which 1,000 tuples are of class Buys_computer = yes.
• For the income attribute within this class we have
  • 0 tuples with income = low,
  • 960 tuples with income = medium,
  • 40 tuples with income = high.
• Using the Laplacian correction for the three quantities, we pretend that we have 1 extra tuple for each income-value pair:
      \frac{1}{1003} = 0.001,   \frac{961}{1003} = 0.958,   \frac{41}{1003} = 0.040
• The "corrected" probability estimates are close to their "uncorrected" counterparts, yet the zero probability value is avoided.
Data Mining: Classification and Prediction 54
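The same correction in code (a sketch with illustrative names):

```python
def laplace_corrected(counts):
    """Laplacian correction: add one to each count before estimating P(value | class)."""
    total = sum(counts.values()) + len(counts)    # pretend one extra tuple per value
    return {value: (n + 1) / total for value, n in counts.items()}

# income counts among the 1,000 Buys_computer = yes tuples
print(laplace_corrected({"low": 0, "medium": 960, "high": 40}))
# {'low': 0.000997..., 'medium': 0.958..., 'high': 0.0408...}
```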
Rule-Based Classification
• The learned model is represented as a set of IF-THEN rules.
• An IF-THEN rule is an expression of the form
      IF condition THEN conclusion
  • The "IF" part is the rule antecedent or precondition (the attribute tests); the "THEN" part is the rule consequent (the class prediction).
• Example:
      R1: IF age = youth AND student = yes THEN Buys_computer = yes
• R1 can also be written as
      R1: (age = youth) ∧ (student = yes) ⇒ (Buys_computer = yes)
Data Mining: Classification and Prediction 55
Rule-Based Classification
• If the condition in a rule antecedent holds true for a given tuple, the rule antecedent is satisfied and the rule covers the tuple.
• Evaluation of a rule R:
      coverage(R) = \frac{n_{covers}}{|D|}
      accuracy(R) = \frac{n_{correct}}{n_{covers}}
  • where n_covers is the number of tuples covered by R,
  • n_correct is the number of tuples correctly classified by R, and
  • |D| is the number of tuples in D.
Data Mining: Classification and Prediction 56
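These two measures are simple to compute; below is an illustrative sketch that reuses the rows/labels (and their attribute order) defined in the naïve Bayes example above.

```python
def rule_metrics(rows, labels, antecedent, consequent):
    """coverage(R) = n_covers / |D|,  accuracy(R) = n_correct / n_covers."""
    covered = [c for r, c in zip(rows, labels) if antecedent(r)]
    n_covers = len(covered)
    n_correct = sum(1 for c in covered if c == consequent)
    return n_covers / len(rows), (n_correct / n_covers if n_covers else 0.0)

# R1: IF age = youth AND student = yes THEN Buys_computer = yes
r1 = lambda row: row[0] == "youth" and row[2] == "yes"
coverage, accuracy = rule_metrics(rows, labels, r1, "yes")
print(round(coverage, 3), accuracy)    # 0.143 (2 of 14 tuples covered), 1.0
```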
Rule-Based Classification: Example

RID | age         | income | student | credit_rating | class: Buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no
Data Mining: Classification and Prediction 57
Rule-Based Classification
• If a rule is satisfied by X, the rule is said to be triggered.
      X = (age = youth, income = medium, student = yes, credit_rating = fair)
• X satisfies the rule R1, which triggers the rule.
• If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.
• If more than one rule is triggered, we need a conflict resolution strategy:
  • Size ordering: assigns the highest priority to the triggering rule that has the "toughest" requirements.
  • Rule ordering: prioritizes the rules beforehand. The ordering may be class-based or rule-based.
    • Class-based ordering: the classes are sorted in order of decreasing "importance".
    • Rule-based ordering: the rules are organized into one long priority list.
Data Mining: Classification and Prediction 58
Rule-Based Classification
• Extracting rules from a decision tree:
  • One rule is created for each path from the root to a leaf node.
  • Each splitting criterion along a given path is logically ANDed to form the rule antecedent (the "IF" part).
  • The leaf node holds the class prediction, forming the rule consequent (the "THEN" part).
Data Mining: Classification and Prediction 59
Rule-Based Classification
• Extracting rules from a decision tree:
      age?
        youth       → student?        (no → no,  yes → yes)
        middle_aged → yes
        senior      → credit_rating?  (fair → yes,  excellent → no)
Data Mining: Classification and Prediction 60
Rule-Based Classification
• The rules extracted from the decision tree are
      R1: IF age = senior AND credit_rating = excellent THEN Buys_computer = no
      R2: IF age = senior AND credit_rating = fair THEN Buys_computer = yes
      R3: IF age = middle_aged THEN Buys_computer = yes
      R4: IF age = youth AND student = yes THEN Buys_computer = yes
      R5: IF age = youth AND student = no THEN Buys_computer = no
Data Mining: Classification and Prediction 61
Exercise
• Extract the classification rules from the given decision tree (figure on slide).
Data Mining: Classification and Prediction 62
Exercise (continued)
• X = (Color = Yellow, Type = SUV, Origin = Imported)
Data Mining: Classification and Prediction 63
Prediction
• Numeric prediction is the task of predicting continuous (or ordered) values for a given input.
• A widely used approach for numeric prediction is regression.
• Regression is used to model the relationship between one or more independent or predictor variables and a dependent or response variable.
  • The predictor variables are the attributes of interest describing the tuple.
  • The response variable is what we want to predict.
  • For example, in X = {age = youth, income = medium, student = yes, credit_rating = fair, Buys_computer = ?}, the first four attributes are the predictor variables and Buys_computer is the response variable.
Data Mining: Classification and Prediction 64
Prediction: Linear Regression
• Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.
• It is the simplest regression technique, modeling y as a linear function of x:
      y = b + wx
• b and w are regression coefficients specifying the y-intercept and slope of the line.
• The coefficients can also be thought of as weights:
      y = w_0 + w_1 x
• These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Data Mining: Classification and Prediction 65
Prediction: Linear Regression
• The regression coefficients can be estimated as
      w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}
      w_0 = \bar{y} - w_1 \bar{x}
Data Mining: Classification and Prediction 66
Prediction: Linear Regression

Age (x) | Avg. amount spent on medical expenses (per month, in Rs.) (y)
15      | 100
20      | 135
25      | 135
37      | 150
40      | 250
45      | 270
48      | 290
50      | 360
55      | 375
61      | 400
64      | 500
67      | 1000
70      | 1500

• \bar{x} = 45.92, \bar{y} = 420.38
• The regression coefficients are w_1 = 16.89 and w_0 = -355.32.
• The equation of the least-squares (best-fitting) line is
      y = -355.32 + 16.89x
Data Mining: Classification and Prediction 67
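A short sketch of the least-squares estimate on this data (illustrative code, not part of the slides):

```python
ages  = [15, 20, 25, 37, 40, 45, 48, 50, 55, 61, 64, 67, 70]
spend = [100, 135, 135, 150, 250, 270, 290, 360, 375, 400, 500, 1000, 1500]

def least_squares(x, y):
    """Estimate w1 (slope) and w0 (intercept) by the method of least squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
       / sum((xi - x_bar) ** 2 for xi in x)
    w0 = y_bar - w1 * x_bar
    return w0, w1

w0, w1 = least_squares(ages, spend)
print(round(w0, 2), round(w1, 2))   # ≈ -355.32, 16.89
print(round(w0 + w1 * 52, 2))       # predicted monthly expenses for a 52-year-old
```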
Prediction: Linear Regression
(Figure: scatter plot of age versus average monthly medical expenses with the fitted least-squares line y = 16.891x - 355.32.)
Data Mining: Classification and Prediction 68
Classifier Accuracy Measures
• Confusion Matrix:
  • Given m classes, a confusion matrix is a table of at least size m by m,
  • where the entry in row i and column j shows the number of tuples of class i that were labeled by the classifier as class j.
Data Mining: Classification and Prediction 69
Confusion matrix for a classifier over 1,000 tuples:

                 Predicted: Low | Predicted: Medium | Predicted: High
Actual: Low             250    |        10         |        0
Actual: Medium           10    |       440         |       10
Actual: High              0    |        10         |      270
Data Mining: Classification and Prediction 70
Classifier Accuracy Measures
• Classifier Accuracy
  • The percentage of test set tuples that are correctly classified by the classifier.
  • Also referred to as the overall recognition rate of the classifier.
• Error Measure
  • The error rate or misclassification rate of a classifier M is simply 1 - Acc(M), where Acc(M) is the accuracy of M.
Data Mining: Classification and Prediction 71
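For the 3-class confusion matrix above, the accuracy and error rate come directly from the diagonal (an illustrative sketch):

```python
confusion = [
    [250,  10,   0],   # actual Low
    [ 10, 440,  10],   # actual Medium
    [  0,  10, 270],   # actual High
]
total   = sum(sum(row) for row in confusion)                    # 1000 tuples
correct = sum(confusion[i][i] for i in range(len(confusion)))   # 960 on the diagonal
accuracy = correct / total
print(accuracy, round(1 - accuracy, 2))   # 0.96 accuracy, 0.04 error (misclassification) rate
```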
Classifier Accuracy Measures
• Confusion Matrix: given 2 classes
  • Positive tuples: tuples of the main class of interest.
  • Negative tuples: all of the other tuples.
  • True positives: the positive tuples that were correctly labeled by the classifier.
  • True negatives: the negative tuples that were correctly labeled by the classifier.
  • False positives: the negative tuples that were incorrectly labeled (as positive).
  • False negatives: the positive tuples that were incorrectly labeled (as negative).
Data Mining: Classification and Prediction 72
Classifier Accuracy Measures
• We would like to be able to assess how well the classifier can recognize the positive tuples and how well it can recognize the negative tuples.
• Sensitivity (true positive (recognition) rate)
  • The proportion of positive tuples that are correctly identified:
        sensitivity = \frac{t\_pos}{pos}
• Specificity (true negative rate)
  • The proportion of negative tuples that are correctly identified:
        specificity = \frac{t\_neg}{neg}
• Precision
        precision = \frac{t\_pos}{t\_pos + f\_pos}
Data Mining: Classification and Prediction 73
Classifier Accuracy Measures
• It can be shown that accuracy is a function of sensitivity and specificity:
      accuracy = sensitivity \times \frac{pos}{pos + neg} + specificity \times \frac{neg}{pos + neg}
      accuracy = \frac{t\_pos + t\_neg}{\text{total no. of tuples}}
Data Mining: Classification and Prediction 74
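A small sketch that computes all four measures from the cells of a binary confusion matrix (the counts used below are hypothetical):

```python
def binary_metrics(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision, and accuracy from a 2-class confusion matrix."""
    pos, neg = t_pos + f_neg, t_neg + f_pos
    sensitivity = t_pos / pos                     # true positive rate
    specificity = t_neg / neg                     # true negative rate
    precision   = t_pos / (t_pos + f_pos)
    accuracy    = (sensitivity * pos / (pos + neg)
                   + specificity * neg / (pos + neg))
    return sensitivity, specificity, precision, accuracy

# hypothetical counts: 90 true positives, 10 false negatives,
#                      40 false positives, 860 true negatives
print(binary_metrics(90, 10, 40, 860))   # approximately (0.9, 0.956, 0.692, 0.95)
```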
Predictor Accuracy Measures
• Instead of focusing on whether the predicted value y'_i is an "exact" match with the actual value y_i, we check how far off the predicted value is from the actual known value.
• Loss functions measure the error between the actual value y_i and the predicted value y'_i:
      Absolute error:  |y_i - y'_i|
      Squared error:   (y_i - y'_i)^2
• The test error (rate), or generalization error, is the average loss over the test set:
      Mean absolute error = \frac{\sum_{i=1}^{d} |y_i - y'_i|}{d}
      Mean squared error  = \frac{\sum_{i=1}^{d} (y_i - y'_i)^2}{d}
• If we take the square root of the mean squared error, the resulting error measure is called the root mean squared error.
Data Mining: Classification and Prediction 75
Predictor Accuracy Measures
• Relative measures of error include
      Relative absolute error = \frac{\sum_{i=1}^{d} |y_i - y'_i|}{\sum_{i=1}^{d} |y_i - \bar{y}|}
      Relative squared error  = \frac{\sum_{i=1}^{d} (y_i - y'_i)^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
• We can take the root of the relative squared error to obtain the root relative squared error, so that the resulting error is of the same magnitude as the quantity predicted.
Data Mining: Classification and Prediction 76
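The loss-based measures above in code (a sketch; the held-out actual/predicted values are made up for illustration):

```python
import math

def error_measures(actual, predicted):
    """Mean absolute error, RMSE, and relative absolute/squared error over a test set."""
    d = len(actual)
    y_bar = sum(actual) / d
    abs_err = [abs(y - yp) for y, yp in zip(actual, predicted)]
    sq_err  = [(y - yp) ** 2 for y, yp in zip(actual, predicted)]
    mae  = sum(abs_err) / d
    rmse = math.sqrt(sum(sq_err) / d)
    rel_abs = sum(abs_err) / sum(abs(y - y_bar) for y in actual)
    rel_sq  = sum(sq_err) / sum((y - y_bar) ** 2 for y in actual)
    return mae, rmse, rel_abs, rel_sq

print(error_measures([150, 400, 500], [180, 420, 470]))
```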
Accuracy Measures
• Evaluating the accuracy of a classifier or predictor:
  • Holdout
  • Random Subsampling
  • Cross Validation
  • Bootstrap
Data Mining: Classification and Prediction 77
Accuracy Measures
• Holdout
  • The given data are randomly partitioned into two independent sets, a training set and a test set.
  • Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
  • The training set is used to derive the model, whose accuracy is then estimated on the test set.
Data Mining: Classification and Prediction 78
Accuracy Measures
• Random Subsampling
  • A variation of the holdout method in which the holdout method is repeated k times.
  • In each iteration, a model is derived from that iteration's training set and its accuracy is estimated on the corresponding test set.
  • The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.
Data Mining: Classification and Prediction 79
Accuracy Measures
• Cross Validation
  • The data are partitioned into k mutually exclusive folds (data partitions) D_1, D_2, …, D_k.
  • In iteration i, fold D_i is used as the test set and the remaining folds together form the training set.
Data Mining: Classification and Prediction 80
Accuracy Measures
• Cross Validation
  • Each sample is used the same number of times for training and once for testing.
  • For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
  • For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.
• Leave-one-out
  • k is set to the number of initial tuples, so only one sample is "left out" at a time for the test set.
• Stratified cross-validation
  • The folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
Data Mining: Classification and Prediction 81
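A minimal sketch of how the k folds can be formed and cycled through (illustrative only; real implementations usually shuffle the data first):

```python
def k_fold_indices(n_tuples, k):
    """Split tuple indices into k mutually exclusive folds and yield (train, test) pairs."""
    folds = [list(range(i, n_tuples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# each tuple index appears k-1 times in training sets and exactly once in a test set
for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx)
```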
Accuracy Measures
• Bootstrap
  • The bootstrap method samples the given training tuples uniformly with replacement.
  • That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• .632 Bootstrap
  • On average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
  • Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 - 1/d).
  • We select d times, so the probability that a tuple is not chosen during this whole time is (1 - 1/d)^d.
  • If d is large, this probability approaches e^{-1} = 0.368.
  • Thus, 36.8% of the tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
Data Mining: Classification and Prediction 82
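A quick numerical check of these two fractions (an illustrative sketch):

```python
import math, random

d = 1000
print((1 - 1 / d) ** d, math.exp(-1))     # both ≈ 0.368 for large d

# one bootstrap sample: draw d tuples uniformly with replacement
sample = [random.randrange(d) for _ in range(d)]
distinct = len(set(sample))
print(distinct / d)                       # ≈ 0.632 of the original tuples appear
```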
Accuracy Measures
• Bootstrap
  • .632 Bootstrap
    • Repeat the sampling procedure k times, where in each iteration we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample.
    • The overall accuracy of the model is
          Acc(M) = \sum_{i=1}^{k} \left( 0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set} \right)
Data Mining: Classification and Prediction 83