Classification and Prediction
What is Classification?
• In Classification, a model or classifier is constructed to predict categorical labels
• Data classification is a two-step process
1. Learning 2. Classification
[Figure: The two-step classification process — a classification algorithm learns classification rules from the training data; the rules are then applied to test data and to new data (unknown class label) to assign a class label.]
Data Mining: Classification and Prediction 3
What is Classification?
• Learning step:
• A classification algorithm builds the classifier by analyzing or “learning from” a training
set made up of database tuples and their associated class labels.
• The individual tuples making up the training set are referred to as training tuples.
• Data tuples can be referred to as samples, examples, instances, data points, or objects
• This is the supervised learning step
• The class label of each training tuple is provided
• This process can be viewed as the learning of a mapping or function y = f(X)
• that predicts the associated class label y of a given tuple X
• This mapping is represented in the form of classification rules, decision trees, or
mathematical formulae
Data Mining: Classification and Prediction 4
What is Classification?
• Classification Step
• The model is used for classification.
• A test set is used, made up of test tuples and their associated class labels.
• Randomly selected tuples from the general data set
• The accuracy of a classifier on a given test set is the percentage of test set tuples that
are correctly classified by the classifier.
• The associated class label of each test tuple is compared with the learned classifier’s class
prediction for that tuple.
• If the accuracy of the classifier is considered acceptable, the classifier can be used to
classify future data tuples for which the class label is not known.
Data Mining: Classification and Prediction 5
What is Prediction?
• Data prediction is a two-step process, similar to data classification
• There is no class attribute, because the attribute values to be predicted are continuous-valued (ordered) rather than
categorical (discrete-valued)
• The attribute to be predicted is simply called the predicted attribute
• Prediction can also be viewed as a mapping or function y = f(X)
Data Mining: Classification and Prediction 7
How are classification and prediction
different?
• Data classification predicts categorical (discrete, unordered) class labels
• Data prediction predicts continuous-valued attribute values
• Testing data is used to assess accuracy of a classifier
• The accuracy of a predictor is estimated by computing an error based on the
difference between the predicted value and the actual known value of y for each
of the test tuples, X
Data Mining: Classification and Prediction 8
Decision Tree Induction
• It is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where
• Each internal node (non-leaf node) denotes a test on an attribute,
• Each branch represents an outcome of the test, and
• Each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node
• Example: Is a person fit?
• Binary decision tree
• Non-binary decision tree
Data Mining: Classification and Prediction 9
Fig. Decision tree for the concept "being fit":
Age < 30?
  Yes → Eats lots of pizza? (Yes → Unfit!, No → Fit)
  No  → Exercises daily? (Yes → Fit, No → Unfit!)
Decision Tree Induction
• How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown, the attribute values of
the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that
tuple.
• Advantages of decision trees:
• Does not require any domain knowledge or parameter setting
• Can handle high dimensional data.
• The learning and classification steps of decision tree induction are simple and fast
• Have good accuracy
Data Mining: Classification and Prediction 10
Decision Tree Induction
• Attribute selection measures
• Used to select the attribute that best partitions the tuples into distinct
classes.
• Information gain, Gain Ratio, Gini Index
• A well-known decision tree algorithm is ID3 (Iterative Dichotomiser 3).
• The C4.5 algorithm (a successor of ID3) is often used as a benchmark against which newer supervised
learning algorithms are compared
• Classification and Regression Trees (CART)
• Adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner
Data Mining: Classification and Prediction 11
Decision Tree Induction
• Information Gain
• ID3 uses information gain as its attribute selection measure
• The expected information needed to classify a tuple in 𝐷 is given by
Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
• where p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated as |C_{i,D}| / |D|
• Info(D) is also known as the entropy of D.
• The expected information required to classify a tuple from 𝐷 based on the partitioning by
attribute 𝐴.
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Data Mining: Classification and Prediction 12
Decision Tree Induction
• Information Gain
• Information gain is defined as the difference between the original information
requirement and the new requirement
Gain(A) = Info(D) − Info_A(D)
• The attribute 𝐴 with the highest information gain, 𝑮𝒂𝒊𝒏(𝑨), is chosen as the splitting
attribute at node 𝑁.
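To make the computation concrete, here is a small Python sketch (not part of the original slides; the helper names info and gain are illustrative) that evaluates Info(D), Info_A(D) and Gain(A) from class counts. The example counts are the ones used in the drug example on the following slides.

from math import log2

def info(counts):
    """Entropy Info(D) of a data partition, given its per-class tuple counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A): class_counts are the per-class counts of D; partitions holds one
    per-class count list per outcome of attribute A."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * info(p) for p in partitions)
    return info(class_counts) - info_a

# Drug data from the following example: 5 Drug A vs. 9 Drug B; splitting on Age
# gives partitions <=30 (3 A, 2 B), 31-50 (0 A, 4 B), >50 (2 A, 3 B).
print(round(info([5, 9]), 4))                            # ~0.9403
print(round(gain([5, 9], [[3, 2], [0, 4], [2, 3]]), 4))  # ~0.2467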
Data Mining: Classification and Prediction 13
Decision Tree Induction
• Decision Tree Generation Algorithm
• Input:
• Data partition, D,
which is a set of training tuples and their associated class labels;
• Attribute_list,
the set of candidate attributes;
• Attribute_selection_method,
a procedure to determine the splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting attribute and, possibly, either a split point
or splitting subset.
Data Mining: Classification and Prediction 14
Decision Tree Induction
• Generate_decision_tree Algorithm
• Method
1. create a node N;
2. if tuples in D are all of the same class, C then
3. return N as a leaf node labeled with the class C;
4. if Attribute_list is empty then
5. return N as a leaf node labeled with the majority class in D;
6. apply Attribute_selection_method(D, Attribute_list) to find the “best” splitting
criterion;
7. label node N with splitting criterion;
Data Mining: Classification and Prediction 15
Decision Tree Induction
• Decision Tree Generation Algorithm
• Method
8. if splitting_attribute is discrete-valued and multiway splits allowed then
9. Attribute_list ← Attribute_list − Splitting_attribute; // remove splitting attribute
10. for each outcome j of splitting criterion
11. let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. if Dj is empty then
13. attach a leaf labeled with the majority class in D to node N;
14. else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
15. End for
16. return N;
Data Mining: Classification and Prediction 16
Example
Data Mining: Classification and Prediction 17
Patient ID Age Sex BP Cholesterol Class: Drug
P1 <=30 F High Normal Drug A
P2 <=30 F High High Drug A
P3 31…50 F High Normal Drug B
P4 >50 F Normal Normal Drug B
P5 >50 M Low Normal Drug B
P6 >50 M Low High Drug A
P7 31…50 M Low High Drug B
P8 <=30 F Normal Normal Drug A
P9 <=30 M Low Normal Drug B
P10 >50 M Normal Normal Drug B
P11 <=30 M Normal High Drug B
P12 31…50 F Normal High Drug B
P13 31…50 M High Normal Drug B
P14 >50 F Normal High Drug A
P15 31…50 F Low Normal ?
Example
• Reduced training data (Patient ID dropped)
• Establish the target classification: which drug to advise?
• 5/14 tuples → Drug A
• 9/14 tuples → Drug B
Data Mining: Classification and Prediction 18
Age Gender BP Cholesterol Class: Drug
<=30 F High Normal Drug A
<=30 F High High Drug A
31…50 F High Normal Drug B
>50 F Normal Normal Drug B
>50 M Low Normal Drug B
>50 M Low High Drug A
31…50 M Low High Drug B
<=30 F Normal Normal Drug A
<=30 M Low Normal Drug B
>50 M Normal Normal Drug B
<=30 M Normal High Drug B
31…50 F Normal High Drug B
31…50 M High Normal Drug B
>50 F Normal High Drug A
Example
• Calculate the expected information (entropy) of D with respect to the class attribute Drug
Info(D) = − (5/14) log2(5/14) − (9/14) log2(9/14) = 0.9403
• Calculate information gain of remaining attributes to determine the root node
Data Mining: Classification and Prediction 19
Example
• Attribute: Age
• <=30 →5, 31-50 →4, >50 →5
• Attribute Age has 3 distinct values, so we need three entropy calculations
Gain(Age) = Info(D) − [ (5/14) × Info(≤30) + (4/14) × Info(31−50) + (5/14) × Info(>50) ] = 0.2467
Data Mining: Classification and Prediction 21
<=30: 3 Drug A, 2 Drug B → Info(≤30) = − (3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.9710
31–50: 0 Drug A, 4 Drug B → Info(31−50) = 0
>50: 2 Drug A, 3 Drug B → Info(>50) = − (2/5) log2(2/5) − (3/5) log2(3/5) ≈ 0.9710
Example
• Attribute: Gender
• M→7, F→ 7
• Attribute Gender has 2 distinct values, so we need two entropy calculations
Gain(Gender) = Info(D) − [ (7/14) × Info(M) + (7/14) × Info(F) ] = 0.9403 − 0.7885 = 0.1519
Data Mining: Classification and Prediction 23
F: 4 Drug A, 3 Drug B → Info(F) = − (4/7) log2(4/7) − (3/7) log2(3/7) ≈ 0.9852
M: 1 Drug A, 6 Drug B → Info(M) = − (1/7) log2(1/7) − (6/7) log2(6/7) ≈ 0.5917
Example
• Attribute: BP
• High→ 4 , Normal→ 6 , Low→ 4
• Attribute BP has 3 distinct values, so we need three entropy calculations
Gain(BP) = Info(D) − [ (4/14) × Info(High) + (6/14) × Info(Normal) + (4/14) × Info(Low) ] = 0.9403 − 0.9111 = 0.0292
Data Mining: Classification and Prediction 25
High: 2 Drug A, 2 Drug B → Info(High) = − (2/4) log2(2/4) − (2/4) log2(2/4) = 1.00
Normal: 2 Drug A, 4 Drug B → Info(Normal) = − (2/6) log2(2/6) − (4/6) log2(4/6) ≈ 0.9183
Low: 1 Drug A, 3 Drug B → Info(Low) = − (1/4) log2(1/4) − (3/4) log2(3/4) ≈ 0.8113
Data Mining: Classification and Prediction 26
Partition Cholesterol = High (6 tuples): Drug A, Drug A, Drug B, Drug B, Drug B, Drug A
Partition Cholesterol = Normal (8 tuples): Drug A, Drug B, Drug B, Drug B, Drug A, Drug B, Drug B, Drug B
Example
• Attribute: Cholesterol
• High→ 6 , Normal →8
• Attribute Cholesterol has 2 distinct values, so we need two entropy calculations
Gain(Cholesterol) = Info(D) − [ (6/14) × Info(High) + (8/14) × Info(Normal) ] = 0.9403 − 0.8922 = 0.0481
Data Mining: Classification and Prediction 27
High: 3 Drug A, 3 Drug B → Info(High) = − (3/6) log2(3/6) − (3/6) log2(3/6) = 1.00
Normal: 2 Drug A, 6 Drug B → Info(Normal) = − (2/8) log2(2/8) − (6/8) log2(6/8) ≈ 0.8113
Example
• Recap
• We choose Age as the root node.
Data Mining: Classification and Prediction 29
Gain(Age) = 0.2467  ← highest
Gain(Gender) = 0.1519
Gain(BP) = 0.0292
Gain(Cholesterol) = 0.0481
Partial decision tree: root Age — ≤30 → ?, 31–50 → Drug B, >50 → ?  (repeat the steps for the remaining branches)
Example
Data Mining: Classification and Prediction 30
[Final decision tree]
Age
  ≤30 → Gender: Male → Drug B, Female → Drug A
  31–50 → Drug B
  >50 → Cholesterol: Normal → Drug B, High → Drug A
Decision Tree Induction
• What if the splitting attribute 𝐴 is continuous-valued?
• The test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, respectively
• where split_point is the split-point returned by Attribute_selection_method as part of the
splitting criterion.
• When A is discrete-valued and a binary tree must be produced
• The test at node N is of the form “𝐴 ∈ 𝑆𝐴”,
• where 𝑆𝐴 is the splitting subset for 𝐴 returned by Attribute selection method as part of the
splitting criterion.
Data Mining: Classification and Prediction 31
Attribute Selection Measures
• Gain Ratio
• The information gain measure is biased toward tests with many outcomes.
• C4.5, a successor of ID3, uses an extension to information gain known as gain ratio
• Applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
• This represents the potential information generated by splitting the training data set, 𝐷, into 𝑣
partitions, corresponding to the 𝑣 outcomes of a test on attribute 𝐴.
• The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute.
Data Mining: Classification and Prediction 33
Attribute Selection Measures
• Gain Ratio
• Computation of gain ratio for the attribute weight.
• Attribute Weight has three values, Heavy, Average and Light, containing 5, 6 and 4 tuples respectively (15 tuples in total).
SplitInfo_Weight(D) = − (5/15) log2(5/15) − (6/15) log2(6/15) − (4/15) log2(4/15) = 1.5655
Gain(Weight) = 0.0622
GainRatio(Weight) = 0.0622 / 1.5655 = 0.040
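As a quick check, the split information and gain ratio for this Weight example can be recomputed in a few lines of Python (an illustrative sketch; Gain(Weight) = 0.0622 is taken from the slide as given).

from math import log2

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes)

si = split_info([5, 6, 4])                   # ~1.5655
print(round(si, 4), round(0.0622 / si, 3))   # gain ratio ~0.040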
Data Mining: Classification and Prediction 34
Attribute Selection Measures
• Gini Index
• The Gini index is used in CART
• The Gini index measures the impurity of 𝐷, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{m} p_i²
• where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|.
• The sum is computed over 𝑚 classes.
• The Gini index considers a binary split for each attribute
• If a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning
is
Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)
Data Mining: Classification and Prediction 35
Attribute Selection Measures
• Gini Index
• For a discrete-valued attribute, the subset that gives the minimum gini index for
that attribute is selected as its splitting subset.
• For continuous-valued attributes, each possible split-point must be considered.
• The reduction in impurity that would be incurred by a binary split on a discrete-
or continuous-valued attribute A is
ΔGini(A) = Gini(D) − Gini_A(D)
• The attribute that maximizes the reduction in impurity is selected as the splitting attribute.
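A minimal Python sketch of the Gini computations, assuming the per-class counts of each partition are available (the binary split at the end is purely hypothetical, only to show ΔGini).

def gini(counts):
    """Gini(D) from per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    """Gini_A(D) for a binary split of D into D1 and D2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    total = n1 + n2
    return n1 / total * gini(d1_counts) + n2 / total * gini(d2_counts)

# D with 5 tuples of one class and 9 of the other (as in the drug example),
# and a hypothetical binary split of D into (3, 2) and (2, 7):
gini_d = gini([5, 9])
delta = gini_d - gini_split([3, 2], [2, 7])   # reduction in impurity, ΔGini
print(round(gini_d, 4), round(delta, 4))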
Data Mining: Classification and Prediction 36
Bayesian Classification
• Bayesian classifiers are statistical classifiers
• Predicts class membership probabilities.
• based on Bayes’ theorem .
• Exhibits high accuracy and speed when applied to large databases.
• A simple Bayesian classifier is known as the naïve Bayesian classifier
• Assumes that the effect of an attribute value on a given class is independent of the values of
the other attributes: class conditional independence.
• Bayesian belief networks are graphical models, allow the representation of
dependencies among subsets of attributes
Data Mining: Classification and Prediction 39
Bayesian Classification
• Bayes’ Theorem
• Let 𝑿 be a data tuple (X is considered “evidence”).
• Let 𝑯 be some hypothesis, such as that the data tuple 𝑿 belongs to a specified class 𝑪.
• Determine 𝑷 𝑯|𝑿 , the probability that the hypothesis 𝑯 holds given the “evidence” or
observed data tuple X.
• 𝑷 𝑯|𝑿 is the posterior probability of 𝑯 conditioned on X.
• 𝑷 𝑯 is the prior probability of 𝑯.
• 𝑷 𝑿|𝑯 is the posterior probability of 𝑿 conditioned on H.
• 𝑷 𝑿 is the prior probability of 𝑿.
• “How are these probabilities estimated?”
Data Mining: Classification and Prediction 40
P(H|X) = P(X|H) P(H) / P(X)
…Bayes’ Theorem
Naïve Bayesian Classification
• A simple Bayesian classifier is known as the naïve Bayesian classifier
• Assumes that the effect of an attribute value on a given class is independent of
the values of the other attributes: class conditional independence.
• This assumption is made to simplify the computations involved and, in this sense, is
considered "naïve."
Data Mining: Classification and Prediction 41
Naïve Bayesian Classification
• Let 𝐷 be a training set of tuples and their associated class labels.
• Suppose that there are m classes, 𝐶1, 𝐶2, … 𝐶𝑚.
• Given a tuple, 𝑋 = (𝑥1, 𝑥2, … , 𝑥𝑛) depicting 𝑛 measurements made on the tuple from 𝑛
attributes, the classifier will predict that 𝑋 belongs to the class having the highest
posterior probability, conditioned on 𝑋.
• The naïve Bayesian classifier predicts that tuple X belongs to the class 𝐶𝑖 if and only if
𝑃 𝐶𝑖|𝑋 > 𝑃 𝐶𝑗|𝑋 𝑓𝑜𝑟 1 ≤ 𝑗 ≤ 𝑚, 𝑗 ≠ 𝑖
• We maximize 𝑃 𝐶𝑖|𝑋
• The class 𝐶𝑖 for which 𝑃 𝐶𝑖|𝑋 is maximized is called the maximum posteriori
hypothesis.
Data Mining: Classification and Prediction 42
Naïve Bayesian Classification
• By Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• As 𝑃 𝑋 is constant for all classes, only 𝑃 𝑋|𝐶𝑖 𝑃 𝐶𝑖 need be maximized.
• The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
𝑃 𝐶𝑖|𝑋 > 𝑃 𝐶𝑗|𝑋 𝑓𝑜𝑟 1 ≤ 𝑗 ≤ 𝑚, 𝑗 ≠ 𝑖
• The class 𝐶𝑖 for which 𝑃 𝐶𝑖|𝑋 is maximized is called the maximum posteriori
hypothesis.
• The class prior probabilities may be estimated by
P(Ci) = |C_{i,D}| / |D|, where |C_{i,D}| is the number of training tuples of class Ci in D
Data Mining: Classification and Prediction 43
Naïve Bayesian Classification
• In order to reduce computation in evaluating 𝑃 𝑋|𝐶𝑖 , the naive assumption of class
conditional independence is made.
P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci) = P(x1|Ci) × P(x2|Ci) × ⋯ × P(xn|Ci)
• In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers (this assumes the true probabilities are known, which rarely holds in practice).
Data Mining: Classification and Prediction 44
Naïve Bayesian Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 45
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 46
• Let C1 be the class Buys_computer = yes and C2 be the class Buys_computer = no
• The tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit_rating = fair)
• We need to maximize P(X|Ci) P(Ci), for i = 1, 2
• Calculate P(Ci), for i = 1, 2
• P(Buys_computer = yes) = 9/14 = 0.643
• P(Buys_computer = no) = 5/14 = 0.357
• Calculate P(X|Ci), for i = 1, 2
Naïve Bayesian Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 47
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 48
X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Calculate P(X|Ci), for i = 1, 2
• P(x1|C1) = P(age = youth | Buys_computer = yes) = 2/9 = 0.222
• P(x1|C2) = P(age = youth | Buys_computer = no) = 3/5 = 0.600
• P(x2|C1) = P(income = medium | Buys_computer = yes) = 4/9 = 0.444
• P(x2|C2) = P(income = medium | Buys_computer = no) = 2/5 = 0.400
• P(x3|C1) = P(student = yes | Buys_computer = yes) = 6/9 = 0.667
• P(x3|C2) = P(student = yes | Buys_computer = no) = 1/5 = 0.200
• P(x4|C1) = P(credit_rating = fair | Buys_computer = yes) = 6/9 = 0.667
• P(x4|C2) = P(credit_rating = fair | Buys_computer = no) = 2/5 = 0.400
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 49
• Now we calculate, from the above probabilities,
P(X | Buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly
P(X | Buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• To find the class, Ci, that maximizes P(X|Ci) P(Ci), we compute
P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | Buys_computer = no) P(Buys_computer = no) = 0.019 × 0.357 = 0.007
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
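The same computation can be reproduced programmatically. The sketch below (not from the slides) builds the conditional probabilities directly from the buys_computer training table shown above.

from collections import Counter

data = [  # (age, income, student, credit_rating, Buys_computer)
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
x = ("youth", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    likelihood = 1.0
    for k, value in enumerate(x):
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        likelihood *= n_match / n_c             # P(x_k | C_i)
    scores[c] = (n_c / len(data)) * likelihood  # P(X | C_i) P(C_i)

print(scores)                          # yes: ~0.028, no: ~0.007
print(max(scores, key=scores.get))     # predicted class: yes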
Naïve Bayesian Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent yes
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 50
Calculate P(Ci), for i = 1, 2 (note: in this modified table, RID 6 now has class yes)
P(Buys_computer = yes) = 10/14 = 0.714
P(Buys_computer = no) = 4/14 = 0.286
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 51
• Calculate P(X|Ci), for i = 1, 2
• P(x1|C1) = P(age = youth | Buys_computer = yes) = 2/10 = 0.200
• P(x1|C2) = P(age = youth | Buys_computer = no) = 3/4 = 0.750
• P(x2|C1) = P(income = medium | Buys_computer = yes) = 4/10 = 0.400
• P(x2|C2) = P(income = medium | Buys_computer = no) = 2/4 = 0.500
• P(x3|C1) = P(student = yes | Buys_computer = yes) = 7/10 = 0.700
• P(x3|C2) = P(student = yes | Buys_computer = no) = 0/4 = 0
• P(x4|C1) = P(credit_rating = fair | Buys_computer = yes) = 6/10 = 0.600
• P(x4|C2) = P(credit_rating = fair | Buys_computer = no) = 2/4 = 0.500
Naïve Bayesian Classification: Example
Data Mining: Classification and Prediction 52
• Now we calculate, from the above probabilities,
P(X | Buys_computer = yes) = 0.200 × 0.400 × 0.700 × 0.600 = 0.034
Similarly
P(X | Buys_computer = no) = 0.750 × 0.500 × 0 × 0.500 = 0
• To find the class, Ci, that maximizes P(X|Ci) P(Ci), we compute
P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.034 × 0.714 = 0.024
P(X | Buys_computer = no) P(Buys_computer = no) = 0 × 0.286 = 0
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
Is this a correct classification? (Note that the single zero count, P(student = yes | Buys_computer = no) = 0, forced P(X | Buys_computer = no) to 0.)
Naïve Bayesian Classification
Data Mining: Classification and Prediction 53
• A zero probability cancels the effects of all of the other (posteriori) probabilities
(on 𝐶𝑖) involved in the product.
• To avoid the effect of zero probability value, Laplacian correction or Laplace
estimator is used.
• We add one to each count.
Naïve Bayesian Classification
Data Mining: Classification and Prediction 54
• E.g. If we have a training database D having 1500 tuples.
• Out of which, 1000 tuples are of class Buys_computer = yes.
• For income attribute we have
• 0 tuples for income = low,
• 960 tuple for income = medium,
• 40 tuples for income = high.
• Using the Laplacian correction for the three quantities, we pretend that we have 1 extra tuple for
each income-value pair:
1/1003 = 0.001,  961/1003 = 0.958,  41/1003 = 0.041
• The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero
probability value is avoided.
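A tiny Python sketch of the correction (the income counts are the ones from the example above):

counts = {"low": 0, "medium": 960, "high": 40}        # income counts for Buys_computer = yes
corrected_total = sum(counts.values()) + len(counts)  # 1000 + 3 = 1003
for value, n in counts.items():
    print(value, round((n + 1) / corrected_total, 3)) # low 0.001, medium 0.958, high 0.041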
Rule-Based Classification
• The learned model is represented as a set of IF-THEN rules.
• An IF-THEN rule is an expression of the form
𝑰𝑭 𝑐𝑜𝑛𝑑𝑖𝑡𝑖𝑜𝑛 𝑻𝑯𝑬𝑵 𝑐𝑜𝑛𝑐𝑙𝑢𝑠𝑖𝑜𝑛
• Example: R1: IF age = youth AND student = yes THEN Buys_computer = yes
• R1 can also be written as
𝑅1: (𝑎𝑔𝑒 = 𝑦𝑜𝑢𝑡ℎ) ∧ (𝑠𝑡𝑢𝑑𝑒𝑛𝑡 = 𝑦𝑒𝑠) ⇒ (𝐵𝑢𝑦𝑠_𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟 = 𝑦𝑒𝑠)
Data Mining: Classification and Prediction 55
The IF part (the attribute tests) is the rule antecedent or precondition; the THEN part (the class prediction) is the rule consequent.
Rule-Based Classification
• If the condition in a rule antecedent holds true for a given tuple, we say that the rule antecedent is
satisfied and that the rule covers the tuple.
• Evaluation of a rule R:
• Let n_covers be the number of tuples covered by R,
• n_correct be the number of tuples correctly classified by R, and
• |D| be the number of tuples in D. Then
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
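An illustrative sketch of coverage(R) and accuracy(R) for rule R1 on the 14-tuple buys_computer data; `data` is the training list defined in the earlier naïve Bayes sketch, and evaluate_rule is an illustrative helper name.

def evaluate_rule(data, antecedent, consequent):
    """antecedent: list of (attribute position, required value) pairs."""
    covered = [row for row in data if all(row[i] == v for i, v in antecedent)]
    correct = [row for row in covered if row[-1] == consequent]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN Buys_computer = yes
# (attribute positions in each row: 0 = age, 2 = student)
print(evaluate_rule(data, [(0, "youth"), (2, "yes")], "yes"))   # (~0.143, 1.0)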
Data Mining: Classification and Prediction 56
Rule-Based Classification: Example
RID age income student credit_rating class: Buys_computer
1 youth high no fair no
2 youth high no excellent no
8 youth medium no fair no
9 youth low yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Data Mining: Classification and Prediction 57
Rule-Based Classification
• If a rule is satisfied by X, the rule is said to be triggered.
X= (age = youth, income = medium, student = yes, credit rating = fair)
• X satisfies the rule R1, which triggers the rule.
• If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.
• If more than one rule is triggered, we need a conflict resolution strategy.
• Size ordering: assigns the highest priority to the triggering rule that has the “toughest”
requirements
• Rule ordering: prioritizes the rules beforehand. The ordering may be class-based or rule-based.
• Class-based ordering: the classes are sorted in order of decreasing "importance"
• Rule-based ordering: the rules are organized into one long priority list
Data Mining: Classification and Prediction 58
Rule-Based Classification
• Extracting rules from a decision tree
• One rule is created for each path from the root to a leaf node.
• Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part).
• The leaf node holds the class prediction, forming the rule consequent (“THEN”
part).
Data Mining: Classification and Prediction 59
Rule-Based Classification
• Extracting rules from a decision tree
Data Mining: Classification and Prediction 60
age?
  youth → student? (no → no, yes → yes)
  middle_aged → yes
  senior → credit_rating? (fair → yes, excellent → no)
Rule-Based Classification
• The rules extracted from this decision tree are
R1: IF age = senior AND credit_rating = excellent THEN Buys_computer = no
R2: IF age = senior AND credit_rating = fair THEN Buys_computer = yes
R3: IF age = middle_aged THEN Buys_computer = yes
R4: IF age = youth AND student = yes THEN Buys_computer = yes
R5: IF age = youth AND student = no THEN Buys_computer = no
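A minimal sketch of this extraction in Python, assuming the tree is stored as nested dictionaries (the structure below mirrors the buys_computer tree above):

tree = {"age": {
    "youth": {"student": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):               # leaf node: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        print(f"IF {antecedent} THEN Buys_computer = {node}")
        return
    (attribute, branches), = node.items()        # internal node: one attribute test
    for value, child in branches.items():
        extract_rules(child, conditions + ((attribute, value),))

extract_rules(tree)   # prints one IF-THEN rule per root-to-leaf path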
Data Mining: Classification and Prediction 61
Data Mining: Classification and Prediction 62
Exercise: Extract the classification rules from the given decision tree
Data Mining: Classification and Prediction 63
X=(Color=Yellow, Type = SUV, Origin = Imported)
Prediction
• Numeric prediction is the task of predicting continuous (or ordered) values for
given input.
• Widely used approach for numeric prediction is regression.
• Regression is used to model the relationship between one or more independent
or predictor variables and a dependent or response variable.
• The predictor variables are the attributes of interest describing the tuple.
• The response variable is what we want to predict.
Data Mining: Classification and Prediction 64
Example: X = (age = youth, income = medium, student = yes, credit_rating = fair) are the predictor variables; Buys_computer = ? is the response variable.
Prediction: Linear Regression
• Straight-line regression analysis involves a response variable, 𝑦, and a single
predictor variable, 𝑥.
• Simplest regression technique which models 𝑦 as a linear function of 𝑥.
𝑦 = 𝑏 + 𝑤𝑥
• 𝑏 and 𝑤 are regression coefficients specifying the Y-intercept and slope of the line.
• Coefficients can also be thought as weights
𝑦 = 𝑤0 + 𝑤1𝑥
• These coefficients can be solved for by the method of least squares, which estimates
the best-fitting straight line as the one that minimizes the error between the actual
data and the estimate of the line.
Data Mining: Classification and Prediction 65
Prediction: Linear Regression
• The regression coefficients can be estimated
w1 = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)²
w0 = ȳ − w1 x̄
Data Mining: Classification and Prediction 66
Prediction: Linear Regression — Example
Age (x) | Avg. amount spent on medical expenses per month in Rs. (y)
15 100
20 135
25 135
37 150
40 250
45 270
48 290
50 360
55 375
61 400
64 500
67 1000
70 1500
Data Mining: Classification and Prediction 67
x̄ = 45.92, ȳ ≈ 420.38
The regression coefficients are
w1 = 16.89
w0 = −355.32
The equation of the least-squares (best-fitting) line is
y = −355.32 + 16.89x
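A short Python sketch (not from the slides) that reproduces the least-squares estimates from the tabulated age/expense data:

ages  = [15, 20, 25, 37, 40, 45, 48, 50, 55, 61, 64, 67, 70]
spend = [100, 135, 135, 150, 250, 270, 290, 360, 375, 400, 500, 1000, 1500]

x_bar = sum(ages) / len(ages)
y_bar = sum(spend) / len(spend)
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, spend))
den = sum((x - x_bar) ** 2 for x in ages)
w1 = num / den                      # slope, roughly 16.89
w0 = y_bar - w1 * x_bar             # intercept, roughly -355
print(round(w1, 3), round(w0, 2))
print(round(w0 + w1 * 35, 2))       # predicted monthly expense at age 35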
[Figure: Scatter plot of age (x) vs. average monthly medical expenses (y) for the same 13 tuples, with the fitted least-squares trend line y = 16.891x − 355.32]
Classifier Accuracy Measures
Data Mining: Classification and Prediction 69
• Confusion Matrix:
• Given m classes, a confusion matrix is a table of at least size m by m,
• where the entry in row i and column j shows the number of tuples of class i that were
labeled by the classifier as class j.
Actual \ Predicted | Class – Low | Class – Medium | Class – High
Class – Low        |     250     |       10       |      0
Class – Medium     |      10     |      440       |     10
Class – High       |       0     |       10       |    270
Data Mining: Classification and Prediction 70
(1000 tuples in total)
Classifier Accuracy Measures
Data Mining: Classification and Prediction 71
• Classifier Accuracy
• The percentage of test set tuples that are correctly classified by the classifier.
• Also referred to as the overall recognition rate of the classifier.
• Error Measure
• An error rate or misclassification rate of a classifier M, which is simply
1 − 𝐴𝑐𝑐 𝑀
where 𝐴𝑐𝑐(𝑀) is the accuracy of M.
Classifier Accuracy Measures
Data Mining: Classification and Prediction 72
• Confusion Matrix: Given 2 classes
• Positive tuples:
• tuples of the main class of interest
• Negative tuples: all other tuples (tuples not of the main class of interest)
• True Positive:
• The positive tuples that were correctly labeled by the classifier
• True negatives
• The negative tuples that were correctly labeled by the classifier
• False positives
• The negative tuples that were incorrectly labeled
• False negatives
• The positive tuples that were incorrectly labeled
Classifier Accuracy Measures
Data Mining: Classification and Prediction 73
• We would like to be able to assess how well the classifier can recognize the
positive tuples and how well it can recognize the negative tuples.
• Sensitivity (true positive (recognition) rate)
• The proportion of positive tuples that are correctly identified.
sensitivity = t_pos / pos
• Specificity (true negative rate)
• The proportion of negative tuples that are correctly identified.
specificity = t_neg / neg
• Precision
precision = t_pos / (t_pos + f_pos)
Classifier Accuracy Measures
Data Mining: Classification and Prediction 74
• It can be shown that accuracy is a function of sensitivity and specificity.
accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg)
accuracy = (t_pos + t_neg) / (total number of tuples)
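The two-class measures above can be computed directly from the four counts; the sketch below uses hypothetical counts for illustration.

def classifier_measures(t_pos, f_neg, t_neg, f_pos):
    pos, neg = t_pos + f_neg, t_neg + f_pos
    return {
        "accuracy":    (t_pos + t_neg) / (pos + neg),
        "sensitivity": t_pos / pos,              # true positive rate
        "specificity": t_neg / neg,              # true negative rate
        "precision":   t_pos / (t_pos + f_pos),
    }

# hypothetical counts: 90 of 100 positives and 210 of 230 negatives are correct
print(classifier_measures(t_pos=90, f_neg=10, t_neg=210, f_pos=20))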
Predictor Accuracy Measures
Data Mining: Classification and Prediction 75
• Instead of focusing on whether the predicted value 𝑦′𝑖 is an “exact” match with actual
value 𝑦𝑖 , we check how far off the predicted value is from the actual known value.
• Loss functions measures the error between the actual value 𝑦𝑖 and the predicted value
𝑦′𝑖.
Absolute error: |y_i − y'_i|
Squared error: (y_i − y'_i)²
• The test error (rate), or generalization error, is the average loss over the test set.
Mean absolute error = ( Σ_{i=1}^{d} |y_i − y'_i| ) / d
Mean squared error = ( Σ_{i=1}^{d} (y_i − y'_i)² ) / d
• If we were to take the square root of the mean squared error, the resulting error measure
is called the root mean squared error.
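A small Python sketch of these error measures on hypothetical predicted/actual values:

from math import sqrt

actual    = [3.0, 5.0, 2.5, 7.0]     # hypothetical actual values y_i
predicted = [2.5, 5.0, 4.0, 8.0]     # hypothetical predicted values y'_i

d = len(actual)
mae  = sum(abs(y - yp) for y, yp in zip(actual, predicted)) / d
mse  = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / d
rmse = sqrt(mse)
print(mae, mse, round(rmse, 3))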
Predictor Accuracy Measures
Data Mining: Classification and Prediction 76
• Relative measures of error include
Relative absolute error = Σ_{i=1}^{d} |y_i − y'_i| / Σ_{i=1}^{d} |y_i − ȳ|
Relative squared error = Σ_{i=1}^{d} (y_i − y'_i)² / Σ_{i=1}^{d} (y_i − ȳ)²
• We can take the root of the relative squared error to obtain the root relative squared
error so that the resulting error is of the same magnitude as the quantity predicted.
Accuracy Measures
Data Mining: Classification and Prediction 77
• Evaluating the Accuracy of a Classifier or Predictor
Holdout
Random Subsampling
Cross Validation
Bootstrap
Accuracy Measures
Data Mining: Classification and Prediction 78
• Holdout
• The given data are randomly partitioned into two independent sets, a training set and a
test set.
• Two-thirds of the data are allocated to the training set, and the remaining one-third is
allocated to the test set.
[Figure: Holdout — the data are split into a training set (used to derive the model) and a test set (used to estimate accuracy)]
Accuracy Measures
Data Mining: Classification and Prediction 79
• Random Subsampling
• A variation of the holdout method in which the holdout method is repeated 𝒌 times.
• The overall accuracy estimate is taken as the average of the accuracies obtained from
each iteration
[Figure: Random subsampling — in each of the k iterations, the data are randomly split into a training set (derive model) and a test set (estimate accuracy)]
Accuracy Measures
Data Mining: Classification and Prediction 80
• Cross Validation
[Figure: k-fold cross-validation — the data are divided into k mutually exclusive folds D1, D2, …, Dk; in iteration i, fold Di is the test set and the remaining folds form the training set]
Accuracy Measures
Data Mining: Classification and Prediction 81
• Cross Validation
• Each sample is used the same number of times for training and once for testing.
• For Classification, the accuracy estimate is the overall number of correct classifications
from the k iterations, divided by the total number of tuples in the initial data.
• For Prediction, the error estimate can be computed as the total loss from the k iterations,
divided by the total number of initial tuples.
• Leave-one-out
• 𝑘 is set to the number of initial tuples. So, only one sample is “left out” at a time for the test set.
• Stratified cross-validation
• The folds are stratified so that the class distribution of the tuples in each fold is approximately
the same as that in the initial data
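A minimal sketch of the k-fold bookkeeping (no particular learner is assumed; train_and_count_correct is a placeholder supplied by the caller that returns the number of correctly classified test tuples):

import random

def k_fold_accuracy(tuples, k, train_and_count_correct):
    data = list(tuples)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal, mutually exclusive folds
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct += train_and_count_correct(train, test)
    return correct / len(data)   # overall correct classifications / total tuples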
Accuracy Measures
Data Mining: Classification and Prediction 82
• Bootstrap
• The bootstrap method samples the given training tuples uniformly with replacement.
• i.e. each time a tuple is selected, it is equally likely to be selected again and readded to the
training set.
• .632 Bootstrap
• On an average, 63.2% of the original data tuples will end up in the bootstrap, and the remaining
36.8% will form the test set
• Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d).
• We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d.
• If d is large, this probability approaches e⁻¹ ≈ 0.368.
• Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the
remaining 63.2% will form the training set.
Accuracy Measures
Data Mining: Classification and Prediction 83
• Bootstrap
• .632 Bootstrap
• Repeat the sampling procedure 𝑘 times, where in each iteration, we use the current test set to
obtain an accuracy estimate of the model obtained from the current bootstrap sample.
• The overall accuracy of the model is
Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set )
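A hedged sketch of one bootstrap split in Python (sample |D| tuples with replacement for training; the never-drawn tuples form the test set):

import random

def bootstrap_split(data):
    d = len(data)
    chosen = [random.randrange(d) for _ in range(d)]   # sample d indices with replacement
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]
    test = [row for i, row in enumerate(data) if i not in chosen_set]
    return train, test   # on average ~63.2% / ~36.8% of the distinct tuples

# Overall estimate over k iterations (per-iteration accuracies assumed available):
# acc_m = sum(0.632 * acc_test[i] + 0.368 * acc_train[i] for i in range(k))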

More Related Content

What's hot

multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
moni sindhu
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
Acad
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
lavanya marichamy
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Salah Amean
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
Krish_ver2
 
Data Preprocessing
Data PreprocessingData Preprocessing
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
maha797959
 
K means clustering
K means clusteringK means clustering
K means clustering
keshav goyal
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
DataminingTools Inc
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
Khwaja Aamer
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
Valerii Klymchuk
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Salah Amean
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Gajanand Sharma
 
Data preparation
Data preparationData preparation
Data preparation
Tony Nguyen
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 

What's hot (20)

multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Kmeans
KmeansKmeans
Kmeans
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preparation
Data preparationData preparation
Data preparation
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 

Similar to Classification in Data Mining

classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
321106410027
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
malathieswaran29
 
unit classification.pptx
unit  classification.pptxunit  classification.pptx
unit classification.pptx
ssuser908de6
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
Rvishnupriya2
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
Rvishnupriya2
 
Dataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptxDataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptx
HimanshuSharma997566
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
AsrithaKorupolu
 
Unit 3classification
Unit 3classificationUnit 3classification
Unit 3classification
Kalpna Saharan
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Salah Amean
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
engrasi
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
ritumysterious1
 
Data Mining
Data MiningData Mining
Data Mining
IIIT ALLAHABAD
 
Chapter 4.pdf
Chapter 4.pdfChapter 4.pdf
Chapter 4.pdf
DrGnaneswariG
 
08ClassBasic VT.ppt
08ClassBasic VT.ppt08ClassBasic VT.ppt
08ClassBasic VT.ppt
GaneshaAdhik
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
GauravWani20
 
Chapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.pptChapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.ppt
Subrata Kumer Paul
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
harsh708944
 
Basics of Classification.ppt
Basics of Classification.pptBasics of Classification.ppt
Basics of Classification.ppt
NBACriteria2SICET
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
Krish_ver2
 

Similar to Classification in Data Mining (20)

classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
 
unit classification.pptx
unit  classification.pptxunit  classification.pptx
unit classification.pptx
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
 
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.pptData Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
 
Dataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptxDataming-chapter-7-Classification-Basic.pptx
Dataming-chapter-7-Classification-Basic.pptx
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
 
Unit 3classification
Unit 3classificationUnit 3classification
Unit 3classification
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
 
08 classbasic
08 classbasic08 classbasic
08 classbasic
 
Data Mining
Data MiningData Mining
Data Mining
 
Chapter 4.pdf
Chapter 4.pdfChapter 4.pdf
Chapter 4.pdf
 
08ClassBasic VT.ppt
08ClassBasic VT.ppt08ClassBasic VT.ppt
08ClassBasic VT.ppt
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
 
Chapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.pptChapter 8. Classification Basic Concepts.ppt
Chapter 8. Classification Basic Concepts.ppt
 
08ClassBasic.ppt
08ClassBasic.ppt08ClassBasic.ppt
08ClassBasic.ppt
 
Basics of Classification.ppt
Basics of Classification.pptBasics of Classification.ppt
Basics of Classification.ppt
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
 

More from Rashmi Bhat

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
Rashmi Bhat
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
Rashmi Bhat
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
Rashmi Bhat
 
Process Scheduling in OS
Process Scheduling in OSProcess Scheduling in OS
Process Scheduling in OS
Rashmi Bhat
 
Introduction to Operating System
Introduction to Operating SystemIntroduction to Operating System
Introduction to Operating System
Rashmi Bhat
 
The Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfThe Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdf
Rashmi Bhat
 
Module 1 VR.pdf
Module 1 VR.pdfModule 1 VR.pdf
Module 1 VR.pdf
Rashmi Bhat
 
OLAP
OLAPOLAP
Spatial Data Mining
Spatial Data MiningSpatial Data Mining
Spatial Data Mining
Rashmi Bhat
 
Web mining
Web miningWeb mining
Web mining
Rashmi Bhat
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
Rashmi Bhat
 
Clustering
ClusteringClustering
Clustering
Rashmi Bhat
 
ETL Process
ETL ProcessETL Process
ETL Process
Rashmi Bhat
 
Data Warehouse Fundamentals
Data Warehouse FundamentalsData Warehouse Fundamentals
Data Warehouse Fundamentals
Rashmi Bhat
 
Virtual Reality
Virtual Reality Virtual Reality
Virtual Reality
Rashmi Bhat
 
Introduction To Virtual Reality
Introduction To Virtual RealityIntroduction To Virtual Reality
Introduction To Virtual Reality
Rashmi Bhat
 
Graph Theory
Graph TheoryGraph Theory
Graph Theory
Rashmi Bhat
 

More from Rashmi Bhat (17)

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Process Scheduling in OS
Process Scheduling in OSProcess Scheduling in OS
Process Scheduling in OS
 
Introduction to Operating System
Introduction to Operating SystemIntroduction to Operating System
Introduction to Operating System
 
The Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfThe Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdf
 
Module 1 VR.pdf
Module 1 VR.pdfModule 1 VR.pdf
Module 1 VR.pdf
 
OLAP
OLAPOLAP
OLAP
 
Spatial Data Mining
Spatial Data MiningSpatial Data Mining
Spatial Data Mining
 
Web mining
Web miningWeb mining
Web mining
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
 
Clustering
ClusteringClustering
Clustering
 
ETL Process
ETL ProcessETL Process
ETL Process
 
Data Warehouse Fundamentals
Data Warehouse FundamentalsData Warehouse Fundamentals
Data Warehouse Fundamentals
 
Virtual Reality
Virtual Reality Virtual Reality
Virtual Reality
 
Introduction To Virtual Reality
Introduction To Virtual RealityIntroduction To Virtual Reality
Introduction To Virtual Reality
 
Graph Theory
Graph TheoryGraph Theory
Graph Theory
 

Recently uploaded

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Classification in Data Mining

Decision Tree Induction
• How are decision trees used for classification?
  • Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
  • A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
• Advantages of decision trees:
  • Do not require any domain knowledge or parameter setting
  • Can handle high-dimensional data
  • The learning and classification steps of decision tree induction are simple and fast
  • Have good accuracy
Data Mining: Classification and Prediction 10
Decision Tree Induction
• Attribute selection measures
  • Used to select the attribute that best partitions the tuples into distinct classes.
  • Common measures: Information Gain, Gain Ratio, Gini Index
• ID3 (Iterative Dichotomiser) is a well-known decision tree algorithm.
• C4.5, the successor of ID3, is often used as a benchmark against which newer supervised learning algorithms are compared.
• Classification and Regression Trees (CART) is another decision tree algorithm.
• These algorithms adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
Data Mining: Classification and Prediction 11
Decision Tree Induction
• Information Gain
  • ID3 uses information gain as its attribute selection measure.
  • The expected information needed to classify a tuple in D is given by
        Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
    where p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated as |C_{i,D}| / |D|.
  • Info(D) is also known as the entropy of D.
  • The expected information required to classify a tuple from D based on the partitioning by attribute A is
        Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Data Mining: Classification and Prediction 12
Decision Tree Induction
• Information Gain
  • Information gain is defined as the difference between the original information requirement and the new requirement:
        Gain(A) = Info(D) - Info_A(D)
  • The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
Data Mining: Classification and Prediction 13
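Both quantities are straightforward to compute by counting. The following is a minimal Python sketch (not part of the original slides); the function and variable names are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at position attr_index."""
    total = len(labels)
    # Partition the class labels by the attribute's distinct values
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum((len(part) / total) * entropy(part)
                 for part in partitions.values())
    return entropy(labels) - info_a
```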
Decision Tree Induction
• Decision Tree Generation Algorithm
• Input:
  • Data partition, D, which is a set of training tuples and their associated class labels;
  • Attribute_list, the set of candidate attributes;
  • Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or a splitting subset.
Data Mining: Classification and Prediction 14
Decision Tree Induction
• Generate_decision_tree Algorithm
• Method:
  1. create a node N;
  2. if the tuples in D are all of the same class, C, then
  3.   return N as a leaf node labeled with the class C;
  4. if Attribute_list is empty then
  5.   return N as a leaf node labeled with the majority class in D;
  6. apply Attribute_selection_method(D, Attribute_list) to find the "best" splitting criterion;
  7. label node N with the splitting criterion;
Data Mining: Classification and Prediction 15
Decision Tree Induction
• Generate_decision_tree Algorithm
• Method (continued):
  8. if the splitting attribute is discrete-valued and multiway splits are allowed then
  9.   Attribute_list ← Attribute_list − Splitting_attribute;   // remove the splitting attribute
  10. for each outcome j of the splitting criterion
  11.   let Dj be the set of data tuples in D satisfying outcome j;   // a partition
  12.   if Dj is empty then
  13.     attach a leaf labeled with the majority class in D to node N;
  14.   else attach the node returned by Generate_decision_tree(Dj, Attribute_list) to node N;
  15. end for
  16. return N;
Data Mining: Classification and Prediction 16
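For illustration, the pseudocode above can be turned into a compact recursive Python sketch. It reuses the entropy/info_gain helpers from the earlier sketch and returns nested dictionaries instead of explicit node objects; this is an informal sketch, not the algorithm's reference implementation.

```python
from collections import Counter

def generate_decision_tree(rows, labels, attribute_list):
    """Top-down, recursive, divide-and-conquer tree construction (ID3-style).

    rows           : list of tuples of attribute values
    labels         : class label of each row
    attribute_list : indices of the candidate attributes
    Returns a class label (leaf) or {"attribute": index, "branches": {value: subtree}}.
    """
    if len(set(labels)) == 1:                  # steps 2-3: pure partition -> leaf
        return labels[0]
    if not attribute_list:                     # steps 4-5: no attributes left
        return Counter(labels).most_common(1)[0][0]
    # Step 6: choose the attribute with the highest information gain
    best = max(attribute_list, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attribute_list if a != best]        # step 9
    node = {"attribute": best, "branches": {}}
    for value in set(row[best] for row in rows):   # steps 10-14: one branch per outcome
        subset = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [c for _, c in subset]
        node["branches"][value] = generate_decision_tree(sub_rows, sub_labels, remaining)
    return node
```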
Example

Patient ID | Age   | Sex | BP     | Cholesterol | Class: Drug
P1         | <=30  | F   | High   | Normal      | Drug A
P2         | <=30  | F   | High   | High        | Drug A
P3         | 31…50 | F   | High   | Normal      | Drug B
P4         | >50   | F   | Normal | Normal      | Drug B
P5         | >50   | M   | Low    | Normal      | Drug B
P6         | >50   | M   | Low    | High        | Drug A
P7         | 31…50 | M   | Low    | High        | Drug B
P8         | <=30  | F   | Normal | Normal      | Drug A
P9         | <=30  | M   | Low    | Normal      | Drug B
P10        | >50   | M   | Normal | Normal      | Drug B
P11        | <=30  | M   | Normal | High        | Drug B
P12        | 31…50 | F   | Normal | High        | Drug B
P13        | 31…50 | M   | High   | Normal      | Drug B
P14        | >50   | F   | Normal | High        | Drug A
P15        | 31…50 | F   | Low    | Normal      | ?
Data Mining: Classification and Prediction 17
Example
• Reduced Training Data
• Establish the target classification: which drug to advise?
  • 5/14 tuples → Drug A
  • 9/14 tuples → Drug B

Age   | Gender | BP     | Cholesterol | Class: Drug
<=30  | F      | High   | Normal      | Drug A
<=30  | F      | High   | High        | Drug A
31…50 | F      | High   | Normal      | Drug B
>50   | F      | Normal | Normal      | Drug B
>50   | M      | Low    | Normal      | Drug B
>50   | M      | Low    | High        | Drug A
31…50 | M      | Low    | High        | Drug B
<=30  | F      | Normal | Normal      | Drug A
<=30  | M      | Low    | Normal      | Drug B
>50   | M      | Normal | Normal      | Drug B
<=30  | M      | Normal | High        | Drug B
31…50 | F      | Normal | High        | Drug B
31…50 | M      | High   | Normal      | Drug B
>50   | F      | Normal | High        | Drug A
Data Mining: Classification and Prediction 18
Example
• Calculate the expected information (entropy) of the class attribute Drug:
      Info(D) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14} = 0.9403
• Calculate the information gain of the remaining attributes to determine the root node.
Data Mining: Classification and Prediction 19
Example
• Attribute: Age
  • <=30 → 5 tuples, 31…50 → 4 tuples, >50 → 5 tuples
  • Age has 3 distinct values, so we need 3 entropy calculations (taking 0·log2 0 = 0):
      <=30  (3 Drug A, 2 Drug B):  Info(<=30)  = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.9710
      31…50 (0 Drug A, 4 Drug B):  Info(31…50) = -\frac{0}{4}\log_2\frac{0}{4} - \frac{4}{4}\log_2\frac{4}{4} = 0
      >50   (2 Drug A, 3 Drug B):  Info(>50)   = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.9710
  • Gain(Age) = Info(D) - [\frac{5}{14} Info(<=30) + \frac{4}{14} Info(31…50) + \frac{5}{14} Info(>50)] = 0.9403 - 0.6936 = 0.2467
Data Mining: Classification and Prediction 21
Example
• Attribute: Gender
  • M → 7 tuples, F → 7 tuples
  • Gender has 2 distinct values, so we need 2 entropy calculations:
      F (4 Drug A, 3 Drug B):  Info(F) = -\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7} \approx 0.9852
      M (1 Drug A, 6 Drug B):  Info(M) = -\frac{1}{7}\log_2\frac{1}{7} - \frac{6}{7}\log_2\frac{6}{7} \approx 0.5917
  • Gain(Gender) = Info(D) - [\frac{7}{14} Info(F) + \frac{7}{14} Info(M)] = 0.9403 - 0.7885 = 0.1518
Data Mining: Classification and Prediction 23
Example
• Attribute: BP
  • High → 4 tuples, Normal → 6 tuples, Low → 4 tuples
  • BP has 3 distinct values, so we need 3 entropy calculations:
      High   (2 Drug A, 2 Drug B):  Info(High)   = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1.0000
      Normal (2 Drug A, 4 Drug B):  Info(Normal) = -\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6} \approx 0.9183
      Low    (1 Drug A, 3 Drug B):  Info(Low)    = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.8113
  • Gain(BP) = Info(D) - [\frac{4}{14} Info(High) + \frac{6}{14} Info(Normal) + \frac{4}{14} Info(Low)] = 0.9403 - 0.9111 = 0.0292
Data Mining: Classification and Prediction 25
Partition of the training data on attribute Cholesterol:

Cholesterol | Class: Drug
High        | Drug A
High        | Drug A
High        | Drug B
High        | Drug B
High        | Drug B
High        | Drug A

Cholesterol | Class: Drug
Normal      | Drug A
Normal      | Drug B
Normal      | Drug B
Normal      | Drug B
Normal      | Drug A
Normal      | Drug B
Normal      | Drug B
Normal      | Drug B
Data Mining: Classification and Prediction 26
Example
• Attribute: Cholesterol
  • High → 6 tuples, Normal → 8 tuples
  • Cholesterol has 2 distinct values, so we need 2 entropy calculations:
      High   (3 Drug A, 3 Drug B):  Info(High)   = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1.0000
      Normal (2 Drug A, 6 Drug B):  Info(Normal) = -\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8} \approx 0.8113
  • Gain(Cholesterol) = Info(D) - [\frac{6}{14} Info(High) + \frac{8}{14} Info(Normal)] = 0.9403 - 0.8922 = 0.0481
Data Mining: Classification and Prediction 27
Example
• Recap of the information gains:
      Gain(Age)         = 0.2467
      Gain(Gender)      = 0.1518
      Gain(BP)          = 0.0292
      Gain(Cholesterol) = 0.0481
• Age has the highest information gain, so we choose Age as the root node.
• Branches of the root: <=30 → ?, 31…50 → Drug B, >50 → ?
• Repeat the steps for the two "?" branches.
Data Mining: Classification and Prediction 29
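As a check, the helper functions sketched earlier reproduce these numbers on the reduced drug training data. The variable names below are illustrative, and the entropy/info_gain functions are assumed to be the ones defined in the first sketch.

```python
# Reduced drug training data: (Age, Gender, BP, Cholesterol) -> Class: Drug
drug_rows = [
    ("<=30",  "F", "High",   "Normal"), ("<=30",  "F", "High",   "High"),
    ("31-50", "F", "High",   "Normal"), (">50",   "F", "Normal", "Normal"),
    (">50",   "M", "Low",    "Normal"), (">50",   "M", "Low",    "High"),
    ("31-50", "M", "Low",    "High"),   ("<=30",  "F", "Normal", "Normal"),
    ("<=30",  "M", "Low",    "Normal"), (">50",   "M", "Normal", "Normal"),
    ("<=30",  "M", "Normal", "High"),   ("31-50", "F", "Normal", "High"),
    ("31-50", "M", "High",   "Normal"), (">50",   "F", "Normal", "High"),
]
drug_labels = ["Drug A", "Drug A", "Drug B", "Drug B", "Drug B", "Drug A", "Drug B",
               "Drug A", "Drug B", "Drug B", "Drug B", "Drug B", "Drug B", "Drug A"]

print(round(entropy(drug_labels), 4))                          # 0.9403
for i, name in enumerate(["Age", "Gender", "BP", "Cholesterol"]):
    print(name, round(info_gain(drug_rows, drug_labels, i), 4))
# Age 0.2467, Gender 0.1518, BP 0.0292, Cholesterol 0.0481
```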
Example
• Expanding the "?" branches gives the final decision tree:
      Age?
        <=30  → Gender?       (Male → Drug B, Female → Drug A)
        31…50 → Drug B
        >50   → Cholesterol?  (Normal → Drug B, High → Drug A)
Data Mining: Classification and Prediction 30
Decision Tree Induction
• What if the splitting attribute A is continuous-valued?
  • The test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, respectively,
  • where split_point is the split-point returned by the attribute selection method as part of the splitting criterion.
• What if A is discrete-valued and a binary tree must be produced?
  • The test at node N is of the form "A ∈ S_A",
  • where S_A is the splitting subset for A returned by the attribute selection method as part of the splitting criterion.
Data Mining: Classification and Prediction 31
Decision Tree Induction
(Figure: three partitioning scenarios for a splitting attribute, labeled 1–3.)
Data Mining: Classification and Prediction 32
Attribute Selection Measures
• Gain Ratio
  • The information gain measure is biased toward tests with many outcomes.
  • C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio.
  • It applies a kind of normalization to information gain using a "split information" value, defined analogously to Info(D) as
        SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
  • This represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
  • The gain ratio is defined as
        GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
  • The attribute with the maximum gain ratio is selected as the splitting attribute.
Data Mining: Classification and Prediction 33
Attribute Selection Measures
• Gain Ratio
  • Computation of the gain ratio for the attribute Weight.
  • The attribute Weight has three values, Heavy, Average, and Light, containing 5, 6, and 4 tuples respectively (15 tuples in total).
        SplitInfo_Weight(D) = -\frac{5}{15}\log_2\frac{5}{15} - \frac{6}{15}\log_2\frac{6}{15} - \frac{4}{15}\log_2\frac{4}{15} \approx 1.5656
        Gain(Weight) = 0.0622
        GainRatio(Weight) = \frac{0.0622}{1.5656} \approx 0.040
Data Mining: Classification and Prediction 34
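A small sketch of the same computation (illustrative names; the Gain(Weight) value is taken from the slide):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split of D into partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

# Weight: Heavy -> 5 tuples, Average -> 6, Light -> 4; Gain(Weight) = 0.0622
print(round(split_info([5, 6, 4]), 4))          # ≈ 1.5656
print(round(gain_ratio(0.0622, [5, 6, 4]), 3))  # ≈ 0.040
```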
Attribute Selection Measures
• Gini Index
  • The Gini index is used in CART.
  • The Gini index measures the impurity of D, a data partition or set of training tuples, as
        Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
    where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|. The sum is computed over the m classes.
  • The Gini index considers a binary split for each attribute.
  • If a binary split on A partitions D into D_1 and D_2, the Gini index of D given that partitioning is
        Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
Data Mining: Classification and Prediction 35
Attribute Selection Measures
• Gini Index
  • For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.
  • For continuous-valued attributes, each possible split-point must be considered.
  • The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
        \Delta Gini(A) = Gini(D) - Gini_A(D)
  • The attribute that maximizes the reduction in impurity is selected as the splitting attribute.
Data Mining: Classification and Prediction 36
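The impurity calculations translate directly into code. The sketch below is illustrative; the 9-yes/5-no labels mirror the Buys_computer data used later, and the particular binary partition is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2: impurity of a set of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    """Gini_A(D) for a binary split of D into partitions D1 and D2."""
    total = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / total) * gini(d1_labels) \
         + (len(d2_labels) / total) * gini(d2_labels)

labels = ["yes"] * 9 + ["no"] * 5          # class distribution of D
d1, d2 = labels[:10], labels[10:]          # a hypothetical binary partition of D
delta_gini = gini(labels) - gini_split(d1, d2)
print(round(gini(labels), 3), round(delta_gini, 3))   # 0.459, 0.331
```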
Bayesian Classification
• Bayesian classifiers are statistical classifiers.
  • They predict class membership probabilities.
  • They are based on Bayes' theorem.
  • They exhibit high accuracy and speed when applied to large databases.
• A simple Bayesian classifier is known as the naïve Bayesian classifier.
  • It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes: class conditional independence.
• Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.
Data Mining: Classification and Prediction 39
Bayesian Classification
• Bayes' Theorem
  • Let X be a data tuple (X is considered "evidence").
  • Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
  • We want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X:
        P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}    … Bayes' theorem
  • P(H|X) is the posterior probability of H conditioned on X.
  • P(H) is the prior probability of H.
  • P(X|H) is the posterior probability of X conditioned on H.
  • P(X) is the prior probability of X.
  • "How are these probabilities estimated?"
Data Mining: Classification and Prediction 40
Naïve Bayesian Classification
• A simple Bayesian classifier is known as the naïve Bayesian classifier.
• It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes: class conditional independence.
• This assumption is made to simplify the computations involved and, in this sense, is considered "naïve."
Data Mining: Classification and Prediction 41
Naïve Bayesian Classification
• Let D be a training set of tuples and their associated class labels.
• Suppose that there are m classes, C_1, C_2, …, C_m.
• Given a tuple X = (x_1, x_2, …, x_n), depicting n measurements made on the tuple from n attributes, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.
• The naïve Bayesian classifier predicts that tuple X belongs to the class C_i if and only if
      P(C_i|X) > P(C_j|X)   for 1 <= j <= m, j ≠ i
• We therefore maximize P(C_i|X).
• The class C_i for which P(C_i|X) is maximized is called the maximum posteriori hypothesis.
Data Mining: Classification and Prediction 42
Naïve Bayesian Classification
• By Bayes' theorem,
      P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}
• As P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized.
• The naïve Bayesian classifier predicts that tuple X belongs to the class C_i if and only if
      P(C_i|X) > P(C_j|X)   for 1 <= j <= m, j ≠ i
• The class C_i for which P(C_i|X) is maximized is called the maximum posteriori hypothesis.
• The class prior probabilities may be estimated by
      P(C_i) = \frac{|C_{i,D}|}{|D|}
  where |C_{i,D}| is the number of training tuples of class C_i in D.
Data Mining: Classification and Prediction 43
Naïve Bayesian Classification
• In order to reduce the computation involved in evaluating P(X|C_i), the naïve assumption of class conditional independence is made:
      P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
• In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers; in practice this is not always the case, owing to inaccuracies in the assumptions made and in the estimated probabilities.
Data Mining: Classification and Prediction 44
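Because every factor is just a ratio of counts, a categorical naïve Bayesian classifier can be sketched in a few lines of Python. This is an illustrative sketch only; names such as nb_fit and nb_score are not from the slides.

```python
from collections import Counter, defaultdict

def nb_fit(rows, labels):
    """Estimate P(Ci) and the counts behind P(xk|Ci) from categorical training data."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond, Counter(labels)

def nb_score(x, c, priors, cond, class_counts):
    """P(X|Ci) * P(Ci) under the class-conditional independence assumption."""
    p = priors[c]
    for k, value in enumerate(x):
        p *= cond[(c, k)][value] / class_counts[c]
    return p

def nb_predict(x, priors, cond, class_counts):
    return max(priors, key=lambda c: nb_score(x, c, priors, cond, class_counts))
```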
Naïve Bayesian Classification: Example

RID | age         | income | student | credit_rating | class: Buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
3   | middle_aged | high   | no      | fair          | yes
4   | senior      | medium | no      | fair          | yes
5   | senior      | low    | yes     | fair          | yes
6   | senior      | low    | yes     | excellent     | no
7   | middle_aged | low    | yes     | excellent     | yes
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
10  | senior      | medium | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no
Data Mining: Classification and Prediction 45
Naïve Bayesian Classification: Example
• Let C_1 be the class Buys_computer = yes and C_2 be the class Buys_computer = no.
• The tuple we wish to classify is
      X = (age = youth, income = medium, student = yes, credit_rating = fair)
• We need to maximize P(X|C_i) P(C_i), for i = 1, 2.
• Calculate P(C_i), for i = 1, 2:
  • P(Buys_computer = yes) = 9/14 = 0.643
  • P(Buys_computer = no)  = 5/14 = 0.357
• Next, calculate P(X|C_i), for i = 1, 2.
Data Mining: Classification and Prediction 46
Naïve Bayesian Classification: Example
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Calculate P(X|C_i), for i = 1, 2:
  • P(age = youth | Buys_computer = yes)          = 2/9 = 0.222
  • P(age = youth | Buys_computer = no)           = 3/5 = 0.600
  • P(income = medium | Buys_computer = yes)      = 4/9 = 0.444
  • P(income = medium | Buys_computer = no)       = 2/5 = 0.400
  • P(student = yes | Buys_computer = yes)        = 6/9 = 0.667
  • P(student = yes | Buys_computer = no)         = 1/5 = 0.200
  • P(credit_rating = fair | Buys_computer = yes) = 6/9 = 0.667
  • P(credit_rating = fair | Buys_computer = no)  = 2/5 = 0.400
Data Mining: Classification and Prediction 48
Naïve Bayesian Classification: Example
• From the above probabilities we calculate
      P(X | Buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
      P(X | Buys_computer = no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• To find the class C_i that maximizes P(X|C_i) P(C_i), we compute
      P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.044 × 0.643 = 0.028
      P(X | Buys_computer = no)  P(Buys_computer = no)  = 0.019 × 0.357 = 0.007
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
Data Mining: Classification and Prediction 49
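Applying the nb_fit/nb_score sketch from above to this training set reproduces the same numbers (the variable names are illustrative):

```python
# AllElectronics training data: (age, income, student, credit_rating) -> Buys_computer
rows = [
    ("youth", "high", "no", "fair"),            ("youth", "high", "no", "excellent"),
    ("middle_aged", "high", "no", "fair"),      ("senior", "medium", "no", "fair"),
    ("senior", "low", "yes", "fair"),           ("senior", "low", "yes", "excellent"),
    ("middle_aged", "low", "yes", "excellent"), ("youth", "medium", "no", "fair"),
    ("youth", "low", "yes", "fair"),            ("senior", "medium", "yes", "fair"),
    ("youth", "medium", "yes", "excellent"),    ("middle_aged", "medium", "no", "excellent"),
    ("middle_aged", "high", "yes", "fair"),     ("senior", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

priors, cond, counts = nb_fit(rows, labels)
x = ("youth", "medium", "yes", "fair")
print(round(nb_score(x, "yes", priors, cond, counts), 3))   # ≈ 0.028
print(round(nb_score(x, "no",  priors, cond, counts), 3))   # ≈ 0.007
print(nb_predict(x, priors, cond, counts))                  # yes
```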
Naïve Bayesian Classification: Example
• Note: this training set differs from the earlier one only in RID 6, whose class is now yes.

RID | age         | income | student | credit_rating | class: Buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
3   | middle_aged | high   | no      | fair          | yes
4   | senior      | medium | no      | fair          | yes
5   | senior      | low    | yes     | fair          | yes
6   | senior      | low    | yes     | excellent     | yes
7   | middle_aged | low    | yes     | excellent     | yes
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
10  | senior      | medium | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no

• Calculate P(C_i), for i = 1, 2:
  • P(Buys_computer = yes) = 10/14 = 0.714
  • P(Buys_computer = no)  = 4/14 = 0.286
Data Mining: Classification and Prediction 50
Naïve Bayesian Classification: Example
• Calculate P(X|C_i), for i = 1, 2:
  • P(age = youth | Buys_computer = yes)          = 2/10 = 0.200
  • P(age = youth | Buys_computer = no)           = 3/4 = 0.750
  • P(income = medium | Buys_computer = yes)      = 4/10 = 0.400
  • P(income = medium | Buys_computer = no)       = 2/4 = 0.500
  • P(student = yes | Buys_computer = yes)        = 7/10 = 0.700
  • P(student = yes | Buys_computer = no)         = 0/4 = 0
  • P(credit_rating = fair | Buys_computer = yes) = 6/10 = 0.600
  • P(credit_rating = fair | Buys_computer = no)  = 2/4 = 0.500
Data Mining: Classification and Prediction 51
Naïve Bayesian Classification: Example
• From the above probabilities we calculate
      P(X | Buys_computer = yes) = 0.200 × 0.400 × 0.700 × 0.600 = 0.034
      P(X | Buys_computer = no)  = 0.750 × 0.500 × 0 × 0.500 = 0
• To find the class C_i that maximizes P(X|C_i) P(C_i), we compute
      P(X | Buys_computer = yes) P(Buys_computer = yes) = 0.034 × 0.714 = 0.024
      P(X | Buys_computer = no)  P(Buys_computer = no)  = 0 × 0.286 = 0
• Therefore, the naïve Bayesian classifier predicts Buys_computer = yes for tuple X.
• Is this a correct classification?
Data Mining: Classification and Prediction 52
Naïve Bayesian Classification
• A zero probability cancels the effects of all of the other (posterior) probabilities (on C_i) involved in the product.
• To avoid the effect of a zero probability value, the Laplacian correction (Laplace estimator) is used.
• We add one to each count.
Data Mining: Classification and Prediction 53
Naïve Bayesian Classification
• Example: suppose a training database D has 1,500 tuples, of which 1,000 tuples are of class Buys_computer = yes.
• For the income attribute within this class we have
  • 0 tuples with income = low,
  • 960 tuples with income = medium,
  • 40 tuples with income = high.
• Using the Laplacian correction for the three quantities, we pretend that we have 1 extra tuple for each income-value pair:
      \frac{1}{1003} = 0.001,   \frac{961}{1003} = 0.958,   \frac{41}{1003} = 0.040
• The "corrected" probability estimates are close to their "uncorrected" counterparts, yet the zero probability value is avoided.
Data Mining: Classification and Prediction 54
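The same correction in code (a sketch with illustrative names):

```python
def laplace_corrected(counts):
    """Laplacian correction: add one to each count before estimating P(value | class)."""
    total = sum(counts.values()) + len(counts)    # pretend one extra tuple per value
    return {value: (n + 1) / total for value, n in counts.items()}

# income counts among the 1,000 Buys_computer = yes tuples
print(laplace_corrected({"low": 0, "medium": 960, "high": 40}))
# {'low': 0.000997..., 'medium': 0.958..., 'high': 0.0408...}
```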
Rule-Based Classification
• The learned model is represented as a set of IF-THEN rules.
• An IF-THEN rule is an expression of the form
      IF condition THEN conclusion
  • The "IF" part is the rule antecedent or precondition (the attribute tests); the "THEN" part is the rule consequent (the class prediction).
• Example:
      R1: IF age = youth AND student = yes THEN Buys_computer = yes
• R1 can also be written as
      R1: (age = youth) ∧ (student = yes) ⇒ (Buys_computer = yes)
Data Mining: Classification and Prediction 55
Rule-Based Classification
• If the condition in a rule antecedent holds true for a given tuple, the rule antecedent is satisfied and the rule covers the tuple.
• Evaluation of a rule R:
      coverage(R) = \frac{n_{covers}}{|D|}
      accuracy(R) = \frac{n_{correct}}{n_{covers}}
  • where n_covers is the number of tuples covered by R,
  • n_correct is the number of tuples correctly classified by R, and
  • |D| is the number of tuples in D.
Data Mining: Classification and Prediction 56
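These two measures are simple to compute; below is an illustrative sketch that reuses the rows/labels (and their attribute order) defined in the naïve Bayes example above.

```python
def rule_metrics(rows, labels, antecedent, consequent):
    """coverage(R) = n_covers / |D|,  accuracy(R) = n_correct / n_covers."""
    covered = [c for r, c in zip(rows, labels) if antecedent(r)]
    n_covers = len(covered)
    n_correct = sum(1 for c in covered if c == consequent)
    return n_covers / len(rows), (n_correct / n_covers if n_covers else 0.0)

# R1: IF age = youth AND student = yes THEN Buys_computer = yes
r1 = lambda row: row[0] == "youth" and row[2] == "yes"
coverage, accuracy = rule_metrics(rows, labels, r1, "yes")
print(round(coverage, 3), accuracy)    # 0.143 (2 of 14 tuples covered), 1.0
```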
Rule-Based Classification: Example

RID | age         | income | student | credit_rating | class: Buys_computer
1   | youth       | high   | no      | fair          | no
2   | youth       | high   | no      | excellent     | no
8   | youth       | medium | no      | fair          | no
9   | youth       | low    | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | medium | no      | excellent     | no
Data Mining: Classification and Prediction 57
Rule-Based Classification
• If a rule is satisfied by X, the rule is said to be triggered.
      X = (age = youth, income = medium, student = yes, credit_rating = fair)
• X satisfies the rule R1, which triggers the rule.
• If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.
• If more than one rule is triggered, we need a conflict resolution strategy:
  • Size ordering: assigns the highest priority to the triggering rule that has the "toughest" requirements.
  • Rule ordering: prioritizes the rules beforehand. The ordering may be class-based or rule-based.
    • Class-based ordering: the classes are sorted in order of decreasing "importance".
    • Rule-based ordering: the rules are organized into one long priority list.
Data Mining: Classification and Prediction 58
Rule-Based Classification
• Extracting rules from a decision tree:
  • One rule is created for each path from the root to a leaf node.
  • Each splitting criterion along a given path is logically ANDed to form the rule antecedent (the "IF" part).
  • The leaf node holds the class prediction, forming the rule consequent (the "THEN" part).
Data Mining: Classification and Prediction 59
Rule-Based Classification
• Extracting rules from a decision tree:
      age?
        youth       → student?        (no → no,  yes → yes)
        middle_aged → yes
        senior      → credit_rating?  (fair → yes,  excellent → no)
Data Mining: Classification and Prediction 60
Rule-Based Classification
• The rules extracted from the decision tree are
      R1: IF age = senior AND credit_rating = excellent THEN Buys_computer = no
      R2: IF age = senior AND credit_rating = fair THEN Buys_computer = yes
      R3: IF age = middle_aged THEN Buys_computer = yes
      R4: IF age = youth AND student = yes THEN Buys_computer = yes
      R5: IF age = youth AND student = no THEN Buys_computer = no
Data Mining: Classification and Prediction 61
Exercise
• Extract the classification rules from the given decision tree (figure on slide).
Data Mining: Classification and Prediction 62
Exercise (continued)
• X = (Color = Yellow, Type = SUV, Origin = Imported)
Data Mining: Classification and Prediction 63
Prediction
• Numeric prediction is the task of predicting continuous (or ordered) values for a given input.
• A widely used approach for numeric prediction is regression.
• Regression is used to model the relationship between one or more independent or predictor variables and a dependent or response variable.
  • The predictor variables are the attributes of interest describing the tuple.
  • The response variable is what we want to predict.
  • For example, in X = {age = youth, income = medium, student = yes, credit_rating = fair, Buys_computer = ?}, the first four attributes are the predictor variables and Buys_computer is the response variable.
Data Mining: Classification and Prediction 64
Prediction: Linear Regression
• Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.
• It is the simplest regression technique, modeling y as a linear function of x:
      y = b + wx
• b and w are regression coefficients specifying the y-intercept and slope of the line.
• The coefficients can also be thought of as weights:
      y = w_0 + w_1 x
• These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Data Mining: Classification and Prediction 65
Prediction: Linear Regression
• The regression coefficients can be estimated as
      w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}
      w_0 = \bar{y} - w_1 \bar{x}
Data Mining: Classification and Prediction 66
Prediction: Linear Regression

Age (x) | Avg. amount spent on medical expenses (per month, in Rs.) (y)
15      | 100
20      | 135
25      | 135
37      | 150
40      | 250
45      | 270
48      | 290
50      | 360
55      | 375
61      | 400
64      | 500
67      | 1000
70      | 1500

• \bar{x} = 45.92, \bar{y} = 420.38
• The regression coefficients are w_1 = 16.89 and w_0 = -355.32.
• The equation of the least-squares (best-fitting) line is
      y = -355.32 + 16.89x
Data Mining: Classification and Prediction 67
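A short sketch of the least-squares estimate on this data (illustrative code, not part of the slides):

```python
ages  = [15, 20, 25, 37, 40, 45, 48, 50, 55, 61, 64, 67, 70]
spend = [100, 135, 135, 150, 250, 270, 290, 360, 375, 400, 500, 1000, 1500]

def least_squares(x, y):
    """Estimate w1 (slope) and w0 (intercept) by the method of least squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
       / sum((xi - x_bar) ** 2 for xi in x)
    w0 = y_bar - w1 * x_bar
    return w0, w1

w0, w1 = least_squares(ages, spend)
print(round(w0, 2), round(w1, 2))   # ≈ -355.32, 16.89
print(round(w0 + w1 * 52, 2))       # predicted monthly expenses for a 52-year-old
```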
Prediction: Linear Regression
(Figure: scatter plot of age versus average monthly medical expenses with the fitted least-squares line y = 16.891x - 355.32.)
Data Mining: Classification and Prediction 68
Classifier Accuracy Measures
• Confusion Matrix:
  • Given m classes, a confusion matrix is a table of at least size m by m,
  • where the entry in row i and column j shows the number of tuples of class i that were labeled by the classifier as class j.
Data Mining: Classification and Prediction 69
Confusion matrix for a classifier over 1,000 tuples:

                 Predicted: Low | Predicted: Medium | Predicted: High
Actual: Low             250    |        10         |        0
Actual: Medium           10    |       440         |       10
Actual: High              0    |        10         |      270
Data Mining: Classification and Prediction 70
Classifier Accuracy Measures
• Classifier Accuracy
  • The percentage of test set tuples that are correctly classified by the classifier.
  • Also referred to as the overall recognition rate of the classifier.
• Error Measure
  • The error rate or misclassification rate of a classifier M is simply 1 - Acc(M), where Acc(M) is the accuracy of M.
Data Mining: Classification and Prediction 71
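For the 3-class confusion matrix above, the accuracy and error rate come directly from the diagonal (an illustrative sketch):

```python
confusion = [
    [250,  10,   0],   # actual Low
    [ 10, 440,  10],   # actual Medium
    [  0,  10, 270],   # actual High
]
total   = sum(sum(row) for row in confusion)                    # 1000 tuples
correct = sum(confusion[i][i] for i in range(len(confusion)))   # 960 on the diagonal
accuracy = correct / total
print(accuracy, round(1 - accuracy, 2))   # 0.96 accuracy, 0.04 error (misclassification) rate
```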
Classifier Accuracy Measures
• Confusion Matrix: given 2 classes
  • Positive tuples: tuples of the main class of interest.
  • Negative tuples: all of the other tuples.
  • True positives: the positive tuples that were correctly labeled by the classifier.
  • True negatives: the negative tuples that were correctly labeled by the classifier.
  • False positives: the negative tuples that were incorrectly labeled (as positive).
  • False negatives: the positive tuples that were incorrectly labeled (as negative).
Data Mining: Classification and Prediction 72
Classifier Accuracy Measures
• We would like to be able to assess how well the classifier can recognize the positive tuples and how well it can recognize the negative tuples.
• Sensitivity (true positive (recognition) rate)
  • The proportion of positive tuples that are correctly identified:
        sensitivity = \frac{t\_pos}{pos}
• Specificity (true negative rate)
  • The proportion of negative tuples that are correctly identified:
        specificity = \frac{t\_neg}{neg}
• Precision
        precision = \frac{t\_pos}{t\_pos + f\_pos}
Data Mining: Classification and Prediction 73
Classifier Accuracy Measures
• It can be shown that accuracy is a function of sensitivity and specificity:
      accuracy = sensitivity \times \frac{pos}{pos + neg} + specificity \times \frac{neg}{pos + neg}
      accuracy = \frac{t\_pos + t\_neg}{\text{total no. of tuples}}
Data Mining: Classification and Prediction 74
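A small sketch that computes all four measures from the cells of a binary confusion matrix (the counts used below are hypothetical):

```python
def binary_metrics(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision, and accuracy from a 2-class confusion matrix."""
    pos, neg = t_pos + f_neg, t_neg + f_pos
    sensitivity = t_pos / pos                     # true positive rate
    specificity = t_neg / neg                     # true negative rate
    precision   = t_pos / (t_pos + f_pos)
    accuracy    = (sensitivity * pos / (pos + neg)
                   + specificity * neg / (pos + neg))
    return sensitivity, specificity, precision, accuracy

# hypothetical counts: 90 true positives, 10 false negatives,
#                      40 false positives, 860 true negatives
print(binary_metrics(90, 10, 40, 860))   # approximately (0.9, 0.956, 0.692, 0.95)
```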
Predictor Accuracy Measures
• Instead of focusing on whether the predicted value y'_i is an "exact" match with the actual value y_i, we check how far off the predicted value is from the actual known value.
• Loss functions measure the error between the actual value y_i and the predicted value y'_i:
      Absolute error:  |y_i - y'_i|
      Squared error:   (y_i - y'_i)^2
• The test error (rate), or generalization error, is the average loss over the test set:
      Mean absolute error = \frac{\sum_{i=1}^{d} |y_i - y'_i|}{d}
      Mean squared error  = \frac{\sum_{i=1}^{d} (y_i - y'_i)^2}{d}
• If we take the square root of the mean squared error, the resulting error measure is called the root mean squared error.
Data Mining: Classification and Prediction 75
Predictor Accuracy Measures
• Relative measures of error include
      Relative absolute error = \frac{\sum_{i=1}^{d} |y_i - y'_i|}{\sum_{i=1}^{d} |y_i - \bar{y}|}
      Relative squared error  = \frac{\sum_{i=1}^{d} (y_i - y'_i)^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
• We can take the root of the relative squared error to obtain the root relative squared error, so that the resulting error is of the same magnitude as the quantity predicted.
Data Mining: Classification and Prediction 76
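The loss-based measures above in code (a sketch; the held-out actual/predicted values are made up for illustration):

```python
import math

def error_measures(actual, predicted):
    """Mean absolute error, RMSE, and relative absolute/squared error over a test set."""
    d = len(actual)
    y_bar = sum(actual) / d
    abs_err = [abs(y - yp) for y, yp in zip(actual, predicted)]
    sq_err  = [(y - yp) ** 2 for y, yp in zip(actual, predicted)]
    mae  = sum(abs_err) / d
    rmse = math.sqrt(sum(sq_err) / d)
    rel_abs = sum(abs_err) / sum(abs(y - y_bar) for y in actual)
    rel_sq  = sum(sq_err) / sum((y - y_bar) ** 2 for y in actual)
    return mae, rmse, rel_abs, rel_sq

print(error_measures([150, 400, 500], [180, 420, 470]))
```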
Accuracy Measures
• Evaluating the accuracy of a classifier or predictor:
  • Holdout
  • Random Subsampling
  • Cross Validation
  • Bootstrap
Data Mining: Classification and Prediction 77
Accuracy Measures
• Holdout
  • The given data are randomly partitioned into two independent sets, a training set and a test set.
  • Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
  • The training set is used to derive the model, whose accuracy is then estimated on the test set.
Data Mining: Classification and Prediction 78
Accuracy Measures
• Random Subsampling
  • A variation of the holdout method in which the holdout method is repeated k times.
  • In each iteration, a model is derived from that iteration's training set and its accuracy is estimated on the corresponding test set.
  • The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.
Data Mining: Classification and Prediction 79
Accuracy Measures
• Cross Validation
  • The data are partitioned into k mutually exclusive folds (data partitions) D_1, D_2, …, D_k.
  • In iteration i, fold D_i is used as the test set and the remaining folds together form the training set.
Data Mining: Classification and Prediction 80
Accuracy Measures
• Cross Validation
  • Each sample is used the same number of times for training and once for testing.
  • For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
  • For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.
• Leave-one-out
  • k is set to the number of initial tuples, so only one sample is "left out" at a time for the test set.
• Stratified cross-validation
  • The folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
Data Mining: Classification and Prediction 81
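A minimal sketch of how the k folds can be formed and cycled through (illustrative only; real implementations usually shuffle the data first):

```python
def k_fold_indices(n_tuples, k):
    """Split tuple indices into k mutually exclusive folds and yield (train, test) pairs."""
    folds = [list(range(i, n_tuples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# each tuple index appears k-1 times in training sets and exactly once in a test set
for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx)
```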
Accuracy Measures
• Bootstrap
  • The bootstrap method samples the given training tuples uniformly with replacement.
  • That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• .632 Bootstrap
  • On average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
  • Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 - 1/d).
  • We select d times, so the probability that a tuple is not chosen during this whole time is (1 - 1/d)^d.
  • If d is large, this probability approaches e^{-1} = 0.368.
  • Thus, 36.8% of the tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
Data Mining: Classification and Prediction 82
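A quick numerical check of these two fractions (an illustrative sketch):

```python
import math, random

d = 1000
print((1 - 1 / d) ** d, math.exp(-1))     # both ≈ 0.368 for large d

# one bootstrap sample: draw d tuples uniformly with replacement
sample = [random.randrange(d) for _ in range(d)]
distinct = len(set(sample))
print(distinct / d)                       # ≈ 0.632 of the original tuples appear
```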
Accuracy Measures
• Bootstrap
  • .632 Bootstrap
    • Repeat the sampling procedure k times, where in each iteration we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample.
    • The overall accuracy of the model is
          Acc(M) = \sum_{i=1}^{k} \left( 0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set} \right)
Data Mining: Classification and Prediction 83