Unit-4
CLASSIFICATION
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data are unknown
• Given a set of measurements, observations, etc. with the aim of establishing
the existence of classes or clusters in the data
2
Classification vs. Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical applications
• Credit/loan approval:
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
3
Introduction to classification
• Classification is a supervised learning method.
• It is a data mining function that assigns items in a collection to target
categories or classes.
• The goal of classification is to accurately predict the target class for
each case in the data.
• For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.
• In supervised learning, the learner (a computer program) is provided with
two sets of data: a training data set and a test data set.
• The idea is for the learner to “learn” from a set of labeled examples in the
training set so that it can identify unlabeled examples in the test set
with the highest possible accuracy.
4
Cont.…
• Suppose a database D is given as D = {t1, t2, …, tn} and a set of desired
classes C = {C1, …, Cm}.
• The classification problem is to define a mapping m : D → C that determines
which tuple of database D belongs to which class of C.
• In effect, classification divides D into equivalence classes.
• Prediction is similar, but it can be viewed as having an infinite number of
classes.
5
Classification Example
• Teachers classify students’ grades as A, B, C, D or E.
• Identify individuals with credit risks (high, low, medium or
unknown).
• In cricket (batsman, bowler, all-rounder)
• Websites (educational, sports, music)
6
[Decision tree on marks x: x >= 90 → A; else x >= 80 → B; else x >= 70 → C;
else x >= 60 → D; else E]
Cont.…
• How do teachers assign grades to students based on their obtained marks?
• If x >= 90 then A grade.
• If 80 <= x < 90 then B grade.
• If 70 <= x < 80 then C grade.
• If 60 <= x < 70 then D grade.
• If x < 60 then E grade.
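These grading rules translate directly into a small function (a minimal Python sketch; the function name is ours):

```python
def grade(x):
    """Assign a letter grade to a mark x using the thresholds above."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "E"

print(grade(85))  # B
```

This is exactly the cascade of tests a decision tree on marks performs, read top-down.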
Classification : a two step process
1) Model Construction
Name Rank Years Tenured
Mike Asst. Prof. 3 No
Mary Asst. Prof. 7 Yes
Bill Prof. 2 Yes
Jim Asso. Prof. 7 Yes
Dave Asst. Prof. 6 No
Anne Asso. Prof. 3 No
[Figure: Training Data → Classification Algorithms → Classifier (Model)]
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
8
Cont..
2) Model Usage
Name Rank Years Tenured
Tom Asst. Prof. 2 No
Merlisa Asso. Prof. 7 No
George Prof. 5 Yes
Joseph Asst. Prof. 7 Yes
[Figure: Testing Data → Classifier; Unseen Data (Jeff, Professor, 4) → Tenured? → Yes]
9
Cont..
1) Model Construction
• Describing a set of predetermined classes :
oEach tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute.
oThe set of tuples used for model construction is called the
training set.
oThe model is represented as classification rules, decision trees,
or mathematical formulae.
2) Model Usage
• For classifying future or unknown objects
oEstimate accuracy of the model
oThe known label of test sample is compared with the classified
result from the model.
oAccuracy rate is the percentage of test set samples that are
correctly classified by the model.
10
Classification & prediction issues
Data Preparation
 Data cleaning
• Pre-process data in order to reduce noise and handle missing
values.
 Relevance analysis (Feature selection)
• Remove the irrelevant or redundant attributes.
 Data transformation
• Generalize the data to higher level concepts using concept
hierarchies and/or normalize data which involves scaling the
values.
11
Cont..
Evaluating Classification Methods
 Predictive accuracy
• This refers to the ability of the model to correctly predict the class label
of new or previously unseen data.
 Speed and scalability
• Time to construct model
• Time to use the model
 Robustness
• Handling noise and missing values
 Interpretability
• Understanding and insight provided by model
 Goodness of rules
• Decision tree size
• Strongest rule or not
12
Classification & prediction
methods
Classification
• Decision Tree
• Bayesian Classification
• Rule Based Classification
• Neural Network
Prediction
• Linear Regression
• Non Linear Regression
• Logistic Regression
13
Decision tree
 One of the most common tasks is to build models for the
prediction of the class of an object on the basis of its attributes.
• The objects can be seen as a customer, patient, transaction, e-mail
message or even a single character.
• Attributes of patient object can be heart rate, blood pressure,
weight and gender etc.
• The class of the patient object would most commonly be
positive/negative for a certain disease.
14
Decision tree (Cont..)
• In a decision tree, instances are represented by a fixed set of attributes (e.g.
gender) and their values (e.g. male, female), described as attribute-value pairs.
• If each attribute has a small number of disjoint possible values (e.g. high,
medium, low) or there are only two possible classes (e.g. true, false), then
decision tree learning is easy.
• Extensions to the decision tree algorithm also handle real-valued
attributes (e.g. salary).
• The tree gives a class label to each instance of the dataset.
15
Decision tree (Cont..)
• Decision tree is a classifier in the form of a tree structure
• Decision node: Specifies a test on a single attribute
• Leaf node: Indicates the value of the target attribute
• Arc/edge: Split of one attribute
• Path: A conjunction of tests that leads to the final decision
16
Decision tree representation-
example
[Figure: a decision tree with a root node, branches (each branch a set of
possible answers), and leaf nodes]
17
Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan’s ID3 (Playing Tennis)
Resulting tree:

age?
├─ <=30 → student? (no → no, yes → yes)
├─ 31..40 → yes
└─ >40 → credit rating? (excellent → no, fair → yes)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Key requirements for classification
• Sufficient data:
• Enough training cases should be provided to learn the model.
• Attribute-value description:
• Object or case must be expressible in terms of a fixed collection of
properties or attributes (e.g., hot, mild, cold).
• Predefined classes (target values):
• The target function has discrete output values (boolean or multiclass)
19
Important terms for decision tree
• Entropy
• Attribute Selection Measure
• Information Gain
• Gain Ratio
• Gini Index
20
Entropy (E)
•Entropy measures the uncertainty of a decision:
• 0 if completely certain,
• 1 if completely uncertain (two equally likely classes),
• In general entropy lies between 0 and 1; it is a
probability-based measure used to calculate the
amount of uncertainty.
21
Entropy (E) (Cont..)
• It measures how much information we do not know (how
uncertain we are about the data).
• It can also be used to measure how much information we gain
from an attribute when the target attribute is revealed to us.
• Which attribute is best?
• The attribute with the largest expected reduction in entropy is the
'best' attribute to use next.
• A large expected reduction means that splitting on that attribute removes
a lot of uncertainty, so the resulting subsets are much more certain.
22
Entropy (E) (Cont..)
 A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with
similar values (homogenous).
 ID3 algorithm uses entropy to calculate the homogeneity of a
sample.
 If the sample is completely homogeneous the entropy is zero, and if
the sample is equally divided it has entropy of one.
23
Attribute Selection Measures
• It is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of a class-labelled training
tuples into individual classes.
• Attribute selection measures are also known as splitting rules.
24
Information Gain
• Information gain can also be used for continuous-valued (numeric)
attributes.
• The attribute which has the highest information gain is selected
for split.
• Assume, that there are two classes P(positive) & N(negative).
• Suppose we have S samples, out of these p samples belongs to
class P and n samples belongs to class N.
• The amount of information needed to decide whether a sample in S belongs
to P or N is defined as

I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
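This two-class expected information is easy to compute directly (a minimal Python sketch; the function name is ours, and 0·log2(0) is taken as 0):

```python
import math

def info(p, n):
    """I(p, n): expected information (entropy) for p positive
    and n negative samples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log2(0) is treated as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result

print(round(info(9, 5), 3))  # 0.94
```

`info(1, 1)` gives 1.0 (an equal split) and `info(4, 0)` gives 0.0 (a homogeneous sample), matching the entropy bounds described above.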
25
26
Information Gain (ID3/C4.5)
 Expected information (entropy) needed to classify a
tuple in D:

Info(D) = − Σ_{i=1..m} p_i · log2(p_i)

 Information needed (after using A to split D into v
partitions) to classify D:

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

 Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples)
Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9, 5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940

age      pi  ni  I(pi, ni)
<=30     2   3   0.971
31…40    4   0   0
>40      3   2   0.971

The term (5/14)·I(2, 3) means “age <=30” has 5 out of 14 samples, with 2
yes’es and 3 no’s. Hence,

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

age   income  student credit_rating buys_computer
<=30  high    no      fair          no
<=30  high    no      excellent     no
31…40 high    no      fair          yes
>40   medium  no      fair          yes
>40   low     yes     fair          yes
>40   low     yes     excellent     no
31…40 low     yes     excellent     yes
<=30  medium  no      fair          no
<=30  low     yes     fair          yes
>40   medium  yes     fair          yes
<=30  medium  yes     excellent     yes
31…40 medium  no      excellent     yes
31…40 high    yes     fair          yes
>40   medium  no      excellent     no
Gain Ratio (C4.5)
• Information gain measure is biased towards attributes with a large
number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)
• GainRatio(A) = Gain(A)/SplitInfo(A)
• Ex.
• gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting
attribute
28
SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
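SplitInfo and the gain ratio can be checked numerically (a sketch; the partition sizes for income are read off the Buys_computer table, and Gain(income) = 0.029 is the value computed earlier):

```python
import math

def split_info(sizes):
    """SplitInfo_A(D): entropy of the partition sizes produced by attribute A."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

# income splits the 14 tuples into 4 high, 6 medium and 4 low:
si_income = split_info([4, 6, 4])       # ≈ 1.557
gain_ratio_income = 0.029 / si_income   # ≈ 0.019, as in the example
```

Unlike plain information gain, the denominator penalizes attributes that split the data into many small partitions.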
Gini Index
• An alternative method to information gain is called the Gini Index.
• Gini is used in CART (Classification and Regression Trees).
• Assume there exist several possible split values for each attribute.
• We may need other tools, such as clustering, to get the possible
split values.
• It can be modified for categorical attributes.
• If a dataset T Contains examples from n classes, gini index, gini(T) is
defined as
Gini(T) = 1 − Σ_{j=1..n} pj²
• n: the number of classes
• pj: the probability that a tuple in T belongs to class Cj
29
Gini Index (Cont..)
• After splitting T into two subsets T1 and T2 with sizes N1 and N2,
the gini index of the split data is defined as

Gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
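Both formulas are a few lines in Python (a minimal sketch; class counts are passed as plain lists):

```python
def gini(counts):
    """Gini(T) = 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(t1_counts, t2_counts):
    """Weighted Gini index after splitting T into subsets T1 and T2."""
    n1, n2 = sum(t1_counts), sum(t2_counts)
    n = n1 + n2
    return n1 / n * gini(t1_counts) + n2 / n * gini(t2_counts)

print(round(gini([9, 5]), 3))  # Gini of the 9-yes / 5-no Buys_computer data
```

Like entropy, Gini is 0 for a pure node; for two classes its maximum is 0.5 at an equal split (`gini([5, 5])`).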
30
Example – ID3
Age    Income  Student Credit_Rating Class: buys_computer
<=30   High    No      Fair          No
<=30   High    No      Excellent     No
31..40 High    No      Fair          Yes
>40    Medium  No      Fair          Yes
>40    Low     Yes     Fair          Yes
>40    Low     Yes     Excellent     No
31..40 Low     Yes     Excellent     Yes
<=30   Medium  No      Fair          No
<=30   Low     Yes     Fair          Yes
>40    Medium  Yes     Fair          Yes
<=30   Medium  Yes     Excellent     Yes
31..40 Medium  No      Excellent     Yes
31..40 High    Yes     Fair          Yes
>40    Medium  No      Excellent     No
31
Solution – ID3
• Class P : buys_computer = “Yes” (9 records)
• Class N : buys_computer = “No” (5 records)
• Total number of Records 14.
• Now compute the expected information I(p, n):

I(9, 5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14)

I(9, 5) = 0.940
32
Solution – ID3 (Age <=30,31..40,>40)
Age Income Student Credit_Rating Buys_Computer
<=30 High No Fair No
<=30 High No Excellent No
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
<=30 Medium Yes Excellent Yes
Age    Income  Student Credit_Rating Buys_Computer
31..40 High    No      Fair          Yes
31..40 Low     Yes     Excellent     Yes
31..40 Medium  No      Excellent     Yes
31..40 High    Yes     Fair          Yes

Age    Income  Student Credit_Rating Buys_Computer
>40    Medium  No      Fair          Yes
>40    Low     Yes     Fair          Yes
>40    Low     Yes     Excellent     No
>40    Medium  Yes     Fair          Yes
>40    Medium  No      Excellent     No
33
Solution – ID3
 So the expected information needed to classify a given sample if the
samples are partitioned according to age is,
 Calculate entropy using the values from the Table and the formula given
below:
Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2)

where Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
34
Solution – ID3 (Age <=30)
• Compute the information gain & Entropy For Age <=30,
• Pi =Yes class = 2
• Ni = No class = 3
So the expected information is I(p, n):

I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))

I(2, 3) = −(2/5)·log2(2/5) − (3/5)·log2(3/5)

I(2, 3) = 0.971
35
Solution – ID3
Age Pi Ni I (Pi , Ni)
<=30 2 3 0.971
31..40 4 0 0
>40 3 2 0.971
 So the expected information needed to classify a given sample if the samples
are partitioned according to age is,
 Calculate entropy using the values from the Table and the formula given
below:
E(A) = Σ_{i=1..v} ((Pi + Ni) / (p + n)) · I(Pi, Ni)

E(Age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2)

E(Age) = 0.694
36
Solution – ID3
Gain (Age) = I (p, n) – E (Age)
= 0.940 – 0.694
= 0.246
Similarly, Gain value
Gain (age) 0.246
Gain (income) 0.029
Gain (student) 0.151
Gain (credit_rating) 0.048
So, here we start decision tree with root node Age.
37
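The whole Gain(age) computation can be reproduced in a few lines (a sketch; the (yes, no) counts per age value come from the partition tables above):

```python
import math

def info(p, n):
    """Expected information I(p, n) for p positive and n negative samples."""
    e = 0.0
    for c in (p, n):
        if c:
            f = c / (p + n)
            e -= f * math.log2(f)
    return e

# (yes, no) counts for age <=30, 31..40 and >40
partitions = [(2, 3), (4, 0), (3, 2)]
total = sum(p + n for p, n in partitions)          # 14 tuples in all

e_age = sum((p + n) / total * info(p, n) for p, n in partitions)
gain_age = info(9, 5) - e_age                      # ≈ 0.246, as on the slide
```

Repeating the same computation with the partition counts for income, student and credit_rating reproduces the other three gains in the table.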
Solution – ID3
• Age has the highest information gain among all the attributes, so
select age as the test attribute, create a node labelled age, and add a
branch for each of its possible values for further splitting.
Age
<=30 31..40 >40
38
Solution – ID3 (Age <= 30)
Age Income Student Credit_Rating Buys_Computer
<=30 High No Fair No
<=30 High No Excellent No
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
<=30 Medium Yes Excellent Yes
39
Solution – ID3 (Age <=30)
• Compute Information gain & Entropy for Age with sample S<=30.
• For age <=30,
• Pi =Yes = 2
• Ni = No = 3
I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))

I(2, 3) = −(2/5)·log2(2/5) − (3/5)·log2(3/5)

I(2, 3) = 0.971
40
Solution – ID3 (Age <= 30, Income)
Income Pi Ni I (Pi , Ni)
High 0 2 0
Medium 1 1 1
Low 1 0 0
E(A) = Σ_{i=1..v} ((Pi + Ni) / (p + n)) · I(Pi, Ni)

E(Income) = (2/5)·I(0,2) + (2/5)·I(1,1) + (1/5)·I(1,0)

E(Income) = 0.4

In the above table, High (0, 2) is homogeneous, so I(0,2) = 0; Medium is
an equal split, so I(1,1) = 1; and Low gives I(1,0) = 0.

Gain(S<=30, Income) = I(p, n) − E(Income)
= 0.971 − 0.4
= 0.571
41
Solution – ID3 (Age <= 30,
Student)
student Pi Ni I (Pi , Ni)
No 0 3 0
Yes 2 0 0
In the above table, I(0,3) = 0 and I(2,0) = 0, so E(Student) is 0.
E(Student) = 0
Gain(S<=30, Student) = I(p, n) − E(Student)
= 0.971 − 0
= 0.971
42
Solution – ID3 (Age <= 30,
credit_rating)
credit_rating Pi Ni I (Pi , Ni)
Fair 1 2 0.918
Excellent 1 1 1
E(A) = Σ_{i=1..v} ((Pi + Ni) / (p + n)) · I(Pi, Ni)

E(credit_rating) = (3/5)·I(1,2) + (2/5)·I(1,1)

E(credit_rating) = 0.951
Gain(S<=30, credit_rating) = I(p, n) − E(credit_rating)
= 0.971 − 0.951
= 0.020
43
Solution – ID3 (Age <=30)
Gain (Age <= 30) value
Income 0.571
Student 0.971
Credit_rating 0.020
As shown in the table, we get the maximum
gain for student, so select student as the
next decision node under the age <=30 branch.
44
[Tree so far: Age → <=30: Student? (branches no / yes); 31..40: ?; >40: ?]
Solution – ID3 (Age 31..40)
Age Income Student Credit_Rating Buys_Computer
31..40 High No Fair Yes
31..40 Low Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
[Tree so far: Age → <=30: Student? (no / yes); 31..40: Yes; >40: ?]
45
Solution – ID3 (Age > 40)
Age Income Student Credit_Rating Buys_Computer
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low No Excellent No
>40 Medium Yes Fair Yes
>40 Medium No Excellent No
46
Solution – ID3 (Age > 40)
• Compute Information gain for Age with sample S>40.
• For age > 40,
• Pi =Yes = 3
• Ni = No = 2
I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))

I(3, 2) = −(3/5)·log2(3/5) − (2/5)·log2(2/5)

I(3, 2) = 0.971
47
Solution – ID3 (Age > 40, Income)
Income Pi Ni I (Pi , Ni)
High 0 0 0
Medium 2 1 0.918
Low 1 1 1
Gain (S>40,Income)
= I (p, n) – E (Income)
= 0.971 – 0.951
= 0.020
48
E(A) = Σ_{i=1..v} ((Pi + Ni) / (p + n)) · I(Pi, Ni)

E(Income) = (0/5)·I(0,0) + (3/5)·I(2,1) + (2/5)·I(1,1)

E(Income) = 0.951
Solution – ID3 (Age > 40,
credit_rating)
Credit_rating Pi Ni I (Pi , Ni)
Fair 3 0 0
Excellent 0 2 0
E(A) = Σ_{i=1..v} ((Pi + Ni) / (p + n)) · I(Pi, Ni)

E(credit_rating) = (3/5)·I(3,0) + (2/5)·I(0,2)

E(credit_rating) = 0
Gain(S>40, credit_rating) = I(p, n) − E(credit_rating)
= 0.971 − 0
= 0.971
49
Solution – ID3 (Age > 40)
Gain (Age > 40) value
Income 0.020
Credit_rating 0.971
As shown in the table, we get the maximum
gain for credit_rating, so select credit_rating
as the next decision node under the age > 40 branch.

[Tree: Age → <=30: Student? (no / yes); 31..40: Yes; >40: Credit_rating?]
50
Classification rules from decision
tree
• IF age = “<=30”AND student = “no”THEN buys_computer = “no”
• IF age = “<=30”AND student = “yes”THEN buys_computer = “yes”
• IF age = “31..40”THEN buys_computer = “yes”
• IF age = “>40”AND credit_rating = “excellent”THEN
buys_computer = “no”
• IF age = “>40”AND credit_rating = “fair”THEN buys_computer =
“yes”
51
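The five rules extracted from the tree translate directly into code (a sketch; the function name is ours):

```python
def buys_computer(age, student, credit_rating):
    """The learned decision tree, expressed as nested IF-THEN rules."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age == ">40"
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer(">40", "no", "fair"))  # yes
```

Each call follows exactly one root-to-leaf path, which is why the extracted rules are mutually exclusive.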
Bayesian Classification
•It is named after Thomas Bayes, who proposed the Bayes
Theorem.
•It is statistical method & supervised learning method
for classification.
•It can solve problems involving both categorical and
continuous valued attributes.
•Bayesian classification is used to find conditional
probabilities.
52
The BayesTheorem
•The Bayes Theorem:

P(H|X) = P(X|H) · P(H) / P(X)
•P(H|X): probability that the customer will buy a computer given
that we know his age, credit rating and income (posterior
probability of H)
•P(H): probability that the customer will buy a computer
regardless of age, credit rating, income (prior probability of H)
•P(X|H): probability that the customer is 35 years old, has a fair
credit rating and earns $40,000, given that he has bought a
computer (posterior probability of X conditioned on H)
•P(X): probability that a person from our set of customers is 35
years old, has a fair credit rating and earns $40,000 (prior
probability of X)
53
Naïve Bayes classifier - Example
Age Income Student Credit_Rating
Class :
buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium No Excellent No
54
Naïve Bayes classifier - Solution
Age
P (<=30 | Yes) = 2/9 P (<=30 | No) = 3/5
P (Yes) = 9/14
P (No) = 5/14
P (31..40 | Yes) = 4/9 P (31..40 | No) = 0/5
P (> 40 | Yes) = 3/9 P (> 40 | No) = 2/5
Income
P (High | Yes) = 2/9 P (High | No) = 2/5
P (Medium | Yes) = 4/9 P (Medium | No) = 2/5
P (Low | Yes) = 3/9 P (Low | No) = 1/5
Student
P (No | Yes) = 3/9 P (No | No) = 4/5
P (Yes | Yes) = 6/9 P (Yes | No) = 1/5
Credit_rating
P (Fair | Yes) = 6/9 P (Fair | No) = 2/5
P (Excellent | Yes) = 3/9 P (Excellent | No) = 3/5
55
Naïve Bayes classifier - Solution
• An unseen sample Y = (<=30, Low, Yes, Excellent)
• P(Y|Yes).P(Yes) = P(<=30|Yes). P(Low|Yes). P(Yes|Yes).
P(Excellent|Yes) . P(Yes)
= 2/9 * 3/9 * 6/9 * 3/9 * 9/14
= 0.010582
 P(Y|No).P(No) = P(<=30|No). P(Low|No). P(Yes|No).
P(Excellent|No) . P(No)
= 3/5 * 1/5 * 1/5 * 3/5 * 5/14
= 0.005142
 Choose the class so that it maximizes this probability, this means
that new instance will be classified as Yes (Buys_computer)
56
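The two products are easy to verify with exact fractions (a sketch; the conditional probabilities are the ones tabulated above):

```python
from fractions import Fraction as F

# P(Y|Yes)·P(Yes) for Y = (<=30, Low, Yes, Excellent)
p_yes = F(2, 9) * F(3, 9) * F(6, 9) * F(3, 9) * F(9, 14)
# P(Y|No)·P(No)
p_no = F(3, 5) * F(1, 5) * F(1, 5) * F(3, 5) * F(5, 14)

print(round(float(p_yes), 6))           # 0.010582
print(round(float(p_no), 6))            # 0.005143
print("yes" if p_yes > p_no else "no")  # yes
```

Using `Fraction` avoids any floating-point rounding when comparing the two class scores.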
Try yourself (Bayesian Classification)
Car No Color Type Origin Stolen?
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
Unseen Data:
Y = <Red, Domestic, SUV>
→ P(Y|Yes)·P(Yes) = 0.024, P(Y|No)·P(No) = 0.072 (classified No)

Actual Data:
Y = <Red, Sports, Domestic>
→ P(Y|Yes)·P(Yes) = 0.096, P(Y|No)·P(No) = 0.048 (classified Yes)
57
Rule Based Classification
• A rule-based classifier is characterized by building rules from an object’s
attribute values.
• Rule-based classifier makes use of a set of IF-THEN rules for classification.
• We can express a rule in the following form:
• IF condition THEN conclusion
• Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent (conclusion).
• In the antecedent (IF) part, the condition consists of one or more attribute
tests, and these tests are logically ANDed.
• The consequent (THEN) part consists of the class prediction.
58
Rule Based Classification (Cont..)
• We can also write rule R1 as follows:
R1: ((age = youth) ^ (student = yes)) => (buys_computer = yes)
• If the condition (that is, all of the attribute tests) in a rule antecedent holds
true for a given tuple, we say that the rule antecedent is satisfied and that
the rule covers the tuple.
• A rule R can be assessed by its coverage and accuracy.
• Given a class-labeled data set D, let n_covers be the number of tuples
covered by R, n_correct the number of tuples correctly classified by R,
and |D| the number of tuples in D.
• We can define the coverage and accuracy of R as

Coverage(R) = n_covers / |D|

Accuracy(R) = n_correct / n_covers
59
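Coverage and accuracy can be sketched as follows (an illustrative helper, not a standard API: a rule is a predicate plus a predicted class, and each record is an (attributes, label) pair):

```python
def rule_quality(antecedent, predicted_class, data):
    """Return (coverage, accuracy) of an IF-THEN rule over labelled data."""
    covered = [(x, y) for x, y in data if antecedent(x)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == predicted_class)
    coverage = n_covers / len(data)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# R1: IF age=youth AND student=yes THEN buys_computer=yes, on a toy data set
data = [({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "youth", "student": "yes"}, "no"),
        ({"age": "senior", "student": "no"}, "no"),
        ({"age": "youth", "student": "no"}, "no")]
cov, acc = rule_quality(
    lambda x: x["age"] == "youth" and x["student"] == "yes", "yes", data)
print(cov, acc)  # 0.5 0.5
```

Here R1 covers 2 of the 4 tuples (coverage 0.5) and classifies 1 of those 2 correctly (accuracy 0.5).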
Characteristics of Rule-Based
Classifier
•Mutually exclusive rules
• Classifier contains mutually exclusive rules if no two rules are
triggered by the same record.
• Every record is covered by at most one rule
• If more than one rule is triggered for a record then conflict
resolution strategies are applied
•Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule
60
Conflict Resolution Strategies
• Size ordering
• Assigns the highest priority to the rule having the “toughest”
requirement, e.g. the largest rule antecedent size
• Rule ordering
• Prioritize the rules beforehand using rule-based (based on rule
quality like accuracy, size, coverage etc.) or class-based
ordering (decreasing order sorting of class prevalence)
61
Rule-based Ordering
• (Refund=Yes) ==> No
• (Refund=No, Marital
Status={Single,Divorced},Taxable
Income<80K) ==> No
• (Refund=No, Marital
Status={Single,Divorced},Taxable
Income>80K) ==> Yes
• (Refund=No, Marital
Status={Married}) ==> No
Class-based Ordering
• (Refund=Yes) ==> No
• (Refund=No, Marital
Status={Single,Divorced},Taxable
Income<80K) ==> No
• (Refund=No, Marital
Status={Married}) ==> No
• (Refund=No, Marital
Status={Single,Divorced},Taxable
Income>80K) ==> Yes
Two Methods to generate rules
• Direct method
• Extract rules directly from data
• e.g.: RIPPER, CN2, Holte’s 1R
• Indirect method
• Extract rules from other classification models (e.g. decision
trees, neural networks, etc).
• e.g: C4.5rules
62
63
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
• Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
64
65
Model Selection:
ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection
theory
• Shows the trade-off between the true
positive rate and the false positive rate
• The area under the ROC curve is a
measure of the accuracy of the model
• Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
• The closer to the diagonal line (i.e., the
closer the area is to 0.5), the less
accurate is the model
 Vertical axis represents the
true positive rate
 Horizontal axis rep. the
false positive rate
 The plot also shows a
diagonal line
66
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
67
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix (actual class down the rows, predicted class across the columns):

Actual \ Predicted    C1                     ¬C1
C1                    True Positives (TP)    False Negatives (FN)
¬C1                   False Positives (FP)   True Negatives (TN)

• Given m classes, an entry CMi,j in a confusion matrix indicates the number of
tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

Example of Confusion Matrix:

Actual \ Predicted    buy_computer = yes   buy_computer = no   Total
buy_computer = yes    6954                 46                  7000
buy_computer = no     412                  2588                3000
Total                 7366                 2634                10000
68
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
• Classifier Accuracy, or
Recognition Rate: percentage
of test set tuples that are
correctly classified
Accuracy = (TP +TN)/(P+N)
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/(P+N)
 Class Imbalance Problem:
 One class may be rare,
e.g. fraud, or HIV-
positive
 Significant majority of
the negative class and
minority of the positive
class
 Sensitivity: True Positive
recognition rate
 Sensitivity = TP/P
 Specificity: True Negative
recognition rate
 Specificity = TN/N
Actual \ Predicted   C    ¬C   Total
C                    TP   FN   P
¬C                   FP   TN   N
Total                P’   N’   All
69
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples
that the classifier labeled as positive are
actually positive

Precision = TP / (TP + FP)

• Recall: completeness – what % of positive
tuples did the classifier label as positive?

Recall = TP / (TP + FN) = TP / P

• Perfect score is 1.0
• Inverse relationship between precision &
recall
• F measure (F1 or F-score): harmonic mean
of precision and recall

F1 = 2 × Precision × Recall / (Precision + Recall)

• Fß: weighted measure of precision and
recall; assigns ß times as much weight to recall
as to precision

Fß = ((1 + ß²) × Precision × Recall) / (ß² × Precision + Recall)
70
Classifier Evaluation Metrics: Example
• Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%

Actual \ Predicted  cancer = yes  cancer = no  Total   Recognition (%)
cancer = yes        90            210          300     30.00 (sensitivity)
cancer = no         140           9560         9700    98.56 (specificity)
Total               230           9770         10000   96.50 (accuracy)
71
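All of these metrics follow directly from the four confusion-matrix cells of the cancer example (a minimal sketch):

```python
# Confusion-matrix cells from the cancer example
TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN            # 300 positives, 9700 negatives

sensitivity = TP / P               # recall of the positive class
specificity = TN / N
precision = TP / (TP + FP)
recall = TP / P
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (P + N)

print(round(precision, 4), round(recall, 2))  # 0.3913 0.3
```

Note how the class imbalance shows up: accuracy is high even though only 30% of the actual cancer cases are caught.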
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
•Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets D1, …, Dk,
each of approximately equal size
• At the i-th iteration, use Di as the test set and the others as the training set
• Leave-one-out: k folds where k = # of tuples, for small sized data
•*Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
72
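The k-fold partitioning step is easy to sketch by hand (an illustrative helper working on plain index lists; real projects would typically use a library utility):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition n record indices into k mutually exclusive,
    roughly equal-sized folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(14, 7)
# At iteration i, folds[i] is the test set; the remaining folds form
# the training set, so every tuple is tested exactly once.
```

With k equal to the number of tuples this degenerates into leave-one-out; a stratified variant would additionally balance the class distribution within each fold.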
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with the aim of
creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
73
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d
tuples is sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with
the most votes to X
• Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, and more robust
• Proven improved accuracy in prediction
74
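The two phases of bagging, bootstrap sampling and majority voting, can be sketched as follows (the `models` below are toy stand-ins for the k learned classifiers M1, …, Mk):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample |data| tuples from data with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    """Majority vote of the learned models on an unseen sample x."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# toy stand-ins for three learned classifiers
models = [lambda x: "yes", lambda x: "no", lambda x: "yes"]
print(bagged_predict(models, None))  # yes
```

For numeric prediction, the vote in `bagged_predict` would be replaced by the average of the models' outputs.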
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training
tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier, where
the weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
75
Regression
• Regression is a data mining function that predicts a number or
value.
• Age, weight, distance, temperature, income, or sales attributes can
be predicted using regression techniques.
• For example, a regression model could be used to predict children's
height, given their age, weight and other factors.
• A regression task begins with a data set in which the target values
are known.
• For example, a regression model that predicts children's height
could be developed based on observed data for many children over
a period of time.
• The data might track age, height, weight, developmental
milestones, family history etc.
76
Regression (Cont..)
• Height would be the target, the other attributes would be the
predictors, and the data for each child would constitute a case.
• Regression models are tested by computing various statistics
that measure the difference between the predicted values and
the expected values.
• It is required to understand the mathematics used in regression
analysis to develop quality regression models for data mining.
• The goal of regression analysis is to determine the values of
parameters for a function that cause the function to fit best in a set
of data observations that we provide.
77
Linear Regression
• The simplest form of regression to visualize is linear regression with
a single predictor.
• A linear regression technique can be used if the relationship
between x and y can be approximated with a straight line.
• Linear regression with a single predictor, one dependent and one
independent variable, is expressed by the formula

Y = b1·X + b0 + u

• Where Y = dependent variable which we are trying to predict, b1 = the
slope, X = independent variable, b0 = the intercept, u = random
error/residual
78
Linear Regression (Cont..)
[Figure: scatter plot of data points with the fitted regression line]
79
Example
Sr. No. X Y
1 0 2
2 1 3
3 2 5
4 3 4
5 4 6
80
Linear Regression Equation is,
Y= aX+b
Find value of a and b using Least square method
a = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)   and   b = (1/n)·(Σy − a·Σx)
Continued…
Sr. No.  X   Y   XY  X²
1        0   2   0   0
2        1   3   3   1
3        2   5   10  4
4        3   4   12  9
5        4   6   24  16
∑        10  20  49  30

a = 0.9
b = 2.2
81
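The least-squares formulas reproduce a and b for the table above (a direct Python transcription):

```python
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))   # 49
sxx = sum(x * x for x in xs)               # 30

a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope, 0.9
b = (sy - a * sx) / n                          # intercept, 2.2

print(a, round(b, 1))
```

So the fitted line is Y = 0.9·X + 2.2, and any new x can be plugged in to predict y.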
Multiple Linear Regression
• Multiple Linear regression is an extension of simple linear
regression analysis.
• It uses two or more independent variables to predict the outcome of a
single continuous dependent variable:

Y = b0 + b1X1 + b2X2 + b3X3 + … + bnXn + u

• Where Y = dependent variable or response variable,
X1, X2, …, Xn are the independent variables or predictors,
u = random error, and
b0, b1, b2, …, bn are the regression coefficients
82
83
Example: Predict the final exam
grade of a student who
received an 86 and an 89
on the midterm exams
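A multiple linear regression of this kind can be fitted by ordinary least squares; a minimal sketch with NumPy follows. The midterm and final scores below are invented illustration data, not the slide's table, so the fitted coefficients and prediction are illustrative only.

```python
import numpy as np

# Fit Y = b0 + b1*X1 + b2*X2 by ordinary least squares.
# The midterm/final scores here are made-up illustration data.
X = np.array([[65, 70], [72, 74], [80, 85], [86, 89], [90, 92]], dtype=float)
y = np.array([68, 75, 83, 88, 94], dtype=float)

# Prepend a column of ones so the intercept b0 is estimated as well.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef

# Predict the final exam grade for midterm scores of 86 and 89.
pred = b0 + b1 * 86 + b2 * 89
print(round(pred, 1))
```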
Nonlinear Regression
• Often the relationship between x and y cannot be approximated
with a straight line; in that case, a nonlinear regression technique
may be used.
• Alternatively, the data could be preprocessed to make the
relationship linear.
84
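The preprocessing idea can be illustrated with one common case: if y grows exponentially in x, taking logarithms makes the relationship linear, so the simple least-squares formulas from the earlier slide apply. The exponential data below is generated for illustration.

```python
import math

# If y = c * e**(k*x), then ln(y) = ln(c) + k*x: a straight line in x.
# Fit that line with least squares to recover k and c.
xs = [0, 1, 2, 3, 4]
c, k = 2.0, 0.5                      # illustrative true parameters
ys = [c * math.exp(k * x) for x in xs]

lys = [math.log(y) for y in ys]      # linearizing transform
n = len(xs)
sx, sy = sum(xs), sum(lys)
sxy = sum(x * ly for x, ly in zip(xs, lys))
sx2 = sum(x * x for x in xs)

slope = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)   # estimate of k
intercept = (sy - slope * sx) / n                   # estimate of ln(c)

print(round(slope, 3), round(math.exp(intercept), 3))  # 0.5 2.0
```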
Logistic Regression
• A linear regression is not appropriate for predicting the value of a
binary variable for two reasons:
• A linear regression will predict values outside the acceptable range (e.g.
predicting probabilities outside the range 0 to 1).
• Since the target can take only one of two possible values for each
observation, the residuals (random errors) will not be normally
distributed about the predicted line.
• A logistic regression produces a logistic curve, which is limited
to values between 0 and 1.
• Logistic regression is similar to linear regression, but the curve is
constructed using the natural logarithm of the odds of the target
variable, rather than the probability itself.
85
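The two points above can be demonstrated with a short sketch: the logistic (sigmoid) function maps the linear combination of predictors into (0, 1), and the model is linear in the log-odds. The coefficients b0 and b1 below are illustrative values, not fitted ones.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression models the log-odds as a linear function:
#   ln(p / (1 - p)) = b0 + b1*x   =>   p = logistic(b0 + b1*x)
b0, b1 = -4.0, 0.1   # illustrative coefficients, not fitted
for x in [0, 40, 80]:
    p = logistic(b0 + b1 * x)
    assert 0.0 < p < 1.0      # predictions never leave the (0, 1) range
    print(x, round(p, 3))
```

Unlike the straight line of a linear regression, this curve flattens out near 0 and 1, so predicted probabilities always stay in the acceptable range.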
Logistic Regression (Cont..)
86

Unit 4 Classification of data and more info on it

  • 1.
  • 2.
    Supervised vs. UnsupervisedLearning • Supervised learning (classification) • Supervision:The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of training data is unknown • Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data 2
  • 3.
    Classification vs. NumericPrediction • Classification • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Numeric Prediction • models continuous-valued functions, i.e., predicts unknown or missing values • Typical applications • Credit/loan approval: • Medical diagnosis: if a tumor is cancerous or benign • Fraud detection: if a transaction is fraudulent • Web page categorization: which category it is 3
  • 4.
    Introduction to classification •Classification is a supervised learning method. • It is a data mining function that assigns items in a collection to target categories or classes. • The goal of classification is to accurately predict the target class for each case in the data. • For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. • In supervised learning, the learner(computer program) is provided with two sets of data, training data set and test data set. • The idea is for the learner to “learn” from a set of labeled examples in the training set so that it can identify unlabeled examples in the test set with the highest possible accuracy. 4
  • 5.
    Cont.… • Suppose aDatabase D is given as D = {t1,t2,..tn} and a set of desired classes are C={C1,…,Cm}. • The classification problem is to define the mapping m in such a way that which tuple of database D belongs to which class of C. • Actually we divides D into equivalence classes. • Prediction is similar, but it viewed as having infinite number of classes. 5
  • 6.
    Classification Example • Teachersclassify students grades as A,B,C,D or E. • Identify individuals with credit risks (high, low, medium or unknown). • In cricket (batsman, bowler, all-rounder) • Websites (educational, sports, music) 6
  • 7.
    x <90 >=90 x A <80>=80 x B <70 >=70 x C <60 >=60 E D Cont.… • How teachers give grades to students based on their obtained marks? • If x >= 90 then A grade. • If 80 <= x < 90 then B grade. • If 70 <= x < 80 then C grade. • If 60 <= x < 70 then D grade. • If x < 60 then E grade.
  • 8.
    Classification : atwo step process 1) Model Construction Name Rank Years Tenured Mike Asst. Prof. 3 No Mary Asst. Prof. 7 Yes Bill Prof. 2 Yes Jim Asso. Prof. 7 Yes Dave Asst. Prof. 6 No Anne Asso. Prof. 3 No Training Data Classifier (Model) If Rank = ‘professor’ OR year > 6 THEN tenured = ‘yes’ Classification Algorithms 8
  • 9.
    Cont.. 2) Model Usage NameRank Years Tenured Tom Asst. Prof. 2 No Merlisa Asso. Prof. 7 No George Prof. 5 Yes Joseph Asst. Prof. 7 Yes Testing Data Classifier Unseen Data (Jeff, Professor, 4) Tenured? Yes 9
  • 10.
    Cont.. 1) Model Construction •Describing a set of predetermined classes : oEach tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. oThe set of tuples used for model construction is called as training set. oThe model is represented as classification rules, decision trees, or mathematical formulae. 2) Model Usage • For classifying future or unknown objects oEstimate accuracy of the model oThe known label of test sample is compared with the classified result from the model. oAccuracy rate is the percentage of test set samples that are correctly classified by the model. 10
  • 11.
    Classification & predictionissues Data Preparation  Data cleaning • Pre-process data in order to reduce noise and handle missing values.  Relevance analysis (Feature selection) • Remove the irrelevant or redundant attributes.  Data transformation • Generalize the data to higher level concepts using concept hierarchies and/or normalize data which involves scaling the values. 11
  • 12.
    Cont.. Evaluating Classification Methods Predict accuracy • This refers the ability of the model to correctly predict the class label of new or previously unseen data.  Speed and scalability • Time to construct model • Time to use the model  Robustness • Handling noise and missing values  Interpretability • Understanding and insight provided by model  Goodness of rules • Decision tree size • Strongest rule or not 12
  • 13.
    Classification & prediction methods Classification DecisionTree Bayesian Classification RuleBased Classification Neural Network Prediction Linear Regression Non Linear Regressions Logistic Regression 13
  • 14.
    Decision tree  Oneof the most common tasks is to build models for the prediction of the class of an object on the basis of its attributes. • The objects can be seen as a customer, patient, transaction, e-mail message or even a single character. • Attributes of patient object can be heart rate, blood pressure, weight and gender etc. • The class of the patient object would most commonly be positive/negative for a certain disease. 14
  • 15.
    Decision tree (Cont..) •In decision tree are represented by a fixed set of attributes (e.g. gender) and their values (e.g. male, female) described as attribute- value pairs. • If the attribute has small number of disjoint possible values (e.g. high, medium, low) or there are only two possible classes (e.g. true, false) then decision tree learning is easy. • Extension to decision tree algorithm also handles real value attributes (e.g. salary). • It gives a class label to each instance of dataset. 15
  • 16.
    Decision tree (Cont..) •Decision tree is a classifier in the form of a tree structure • Decision node: Specifies a test on a single attribute • Leaf node: Indicates the value of the target attribute • Arc/edge: Split of one attribute • Path: A disjunction of test to make the final decision 16
  • 17.
  • 18.
    DecisionTree Induction: AnExample age? student? credit rating? <=30 >40 no yes yes yes 31..40 fair excellent yes no Training data set: Buys_computer The data set follows an example of Quinlan’s ID3 (Playing Tennis) Resulting tree: age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
  • 19.
    Key requirements forclassification • Sufficient data: • Enough training cases should be provided to learn the model. • Attribute-value description: • Object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold). • Predefined classes (target values): • The target function has discrete output values (boolean or multiclass) 19
  • 20.
    Important terms fordecision tree • Entropy • Attribute Selection Measure • Information Gain • Gain Ratio • Gini Index 20
  • 21.
    Entropy (E) •It definesthe certainty of a decision • 1 if completely certain, • 0 if completely uncertain, • Normally data remains between 0 to 1 as entropy, a probability-based measure used to calculate the amount of uncertainty. 21
  • 22.
    Entropy (E) (Cont..) •It measures that of how much information we don't know (how uncertain we are about the data). • It can be also used to measure how much information we gain from an attribute when the target attribute is revealed to us. • Which attribute is best? • The attribute with the largest expected reduction in entropy is the 'best' attribute to use next. • Because if we have a large expected reduction it means taking away that attribute has a big effect, meaning it must be very certain. 22
  • 23.
    Entropy (E) (Cont..) A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogenous).  ID3 algorithm uses entropy to calculate the homogeneity of a sample.  If the sample is completely homogeneous the entropy is zero and if the sample is an equally divided it has entropy of one. 23
  • 24.
    Attribute Selection Measures •It is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of a class-labelled training tuples into individual classes. • Attribute selection measures are also known as splitting rules. 24
  • 25.
    Information Gain • Informationgain can be used for continues-valued (numeric) attributes. • The attribute which has the highest information gain is selected for split. • Assume, that there are two classes P(positive) & N(negative). • Suppose we have S samples, out of these p samples belongs to class P and n samples belongs to class N. • The amount of information, needed to decide split in S belongs to P or N & that is defined as 𝐼 𝑝, 𝑛 = − 𝑝 𝑝+𝑛 𝑙𝑜𝑔2 𝑝 𝑝+𝑛 - 𝑛 𝑝+𝑛 𝑙𝑜𝑔2 𝑛 𝑝+𝑛 25
  • 26.
    26 Information Gain (ID3/C4.5) Expected information (entropy) needed to classify a tuple in D:  Information needed (after using A to split D into v partitions) to classify D:  Information gained by branching on attribute A ) ( log ) ( 2 1 i m i i p p D Info     ) ( | | | | ) ( 1 j v j j A D Info D D D Info     (D) Info Info(D) Gain(A) A  
  • 27.
    Attribute Selection: InformationGain Class P: buys_computer = “yes” Class N: buys_computer = “no” means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence, Similarly, age pi ni I(pi, ni) <=30 2 3 0.971 31…40 4 0 0 >40 3 2 0.971 694 . 0 ) 2 , 3 ( 14 5 ) 0 , 4 ( 14 4 ) 3 , 2 ( 14 5 ) (     I I I D Infoage 048 . 0 ) _ ( 151 . 0 ) ( 029 . 0 ) (    rating credit Gain student Gain income Gain 246 . 0 ) ( ) ( ) (    D Info D Info age Gain age age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no ) 3 , 2 ( 14 5 I 940 . 0 ) 14 5 ( log 14 5 ) 14 9 ( log 14 9 ) 5 , 9 ( ) ( 2 2      I D Info
  • 28.
    Gain Ratio (C4.5) •Information gain measure is biased towards attributes with a large number of values • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain) • GainRatio(A) = Gain(A)/SplitInfo(A) • Ex. • gain_ratio(income) = 0.029/1.557 = 0.019 • The attribute with the maximum gain ratio is selected as the splitting attribute 28 ) | | | | ( log | | | | ) ( 2 1 D D D D D SplitInfo j v j j A     
  • 29.
    Gini Index • Assumethere exist several possible split values for each attribute. • We may need other tools, such as clustering, to get the possible split values. • It can be modified for categorical attributes. • An alternative method to information gain is called the Gini Index. • Gini is used in CART (Classification and RegressionTrees). • If a dataset T Contains examples from n classes, gini index, gini(T) is defined as Gini (T) = 1 - 𝑗=1 𝑛 𝑝𝑗 2 • n: the number of classes • pj: the probability that a tuple in D belongs to class Ci 29
  • 30.
    Gini Index (Cont..) •After splittingT into two subsetsT1 andT2 with sizes N1 and N2, the gini index of the split data is defined as Ginisplit (T) = 𝑁1 𝑁 gini (T1) + 𝑁2 𝑁 gini (T2) 30
  • 31.
    Example – ID3 AgeIncome Student Credit_Rati ng Class : buys_computer <=30 High No Fair No <=30 High No Excellent No 31..40 High No Fair Yes >40 Medium No Fair Yes >40 Low Yes Fair Yes >40 Low Yes Excellent No 31..40 Low Yes Excellent Yes <=30 Medium No Fair No <=30 Low Yes Fair Yes >40 Medium Yes Fair Yes <=30 Medium Yes Excellent Yes 31..40 Medium No Excellent Yes 31..40 High Yes Fair Yes >40 Medium No Excellent No 31
  • 32.
    Solution – ID3 •Class P : buys_computer = “Yes” (9 records) • Class N : buys_computer = “No” (5 records) • Total number of Records 14. • Now, Information Gain = I (p,n) I 9,5 = − 9 14 𝑙𝑜𝑔2 9 14 - 5 14 𝑙𝑜𝑔2 5 14 I(9,5) = 0.940 32
  • 33.
    Solution – ID3(Age <=30,31..40,>40) Age Income Student Credit_Rating Buys_Computer <=30 High No Fair No <=30 High No Excellent No <=30 Medium No Fair No <=30 Low Yes Fair Yes <=30 Medium Yes Excellent Yes Age Incom e Stu. Cr_Rati ng Buy s 31..40 High No Fair Yes 31..40 Low Yes Excelle nt Yes 31..40 Mediu m No Excelle nt Yes 31..40 High Yes Fair Yes Age Income Stu . Cr_Ratin g Buys >40 Medium No Fair Yes >40 Low Yes Fair Yes >40 Low No Excellent No >40 Medium Yes Fair Yes >40 Medium No Excellent No 33
  • 34.
    Solution – ID3 So the expected information needed to classify a given sample if the samples are partitioned according to age is,  Calculate entropy using the values from the Table and the formula given below: 𝐼𝑛𝑓𝑜𝐴 𝐷 = 5 14 I (2,3) + 4 14 I (4,0) + 5 14 I (3,2) 34 ) ( | | | | ) ( 1 j v j j A D Info D D D Info    
  • 35.
    Solution – ID3(Age <=30) • Compute the information gain & Entropy For Age <=30, • Pi =Yes class = 2 • Ni = No class = 3 So, Information Gain = I (p,n) I 𝑝, 𝑛 = − 𝑝 𝑝+𝑛 𝑙𝑜𝑔2 𝑝 𝑝+𝑛 - 𝑛 𝑝+𝑛 𝑙𝑜𝑔2 𝑛 𝑝+𝑛 I 2,3 = − 2 5 𝑙𝑜𝑔2 2 5 - 3 5 𝑙𝑜𝑔2 3 5 I(2,3) = 0.971 35
  • 36.
    Solution – ID3 AgePi Ni I (Pi , Ni) <=30 2 3 0.971 31..40 4 0 0 >40 3 2 0.971  So the expected information needed to classify a given sample if the samples are partitioned according to age is,  Calculate entropy using the values from the Table and the formula given below: E (A) = i=1 𝑣 Pi +Ni 𝑝+𝑛 I (Pi , Ni) E Age = 5 14 I (2,3) + 4 14 I (4,0) + 5 14 I (3,2) E(Age) = 0.694 36
  • 37.
    Solution – ID3 Gain(Age) = I (p, n) – E (Age) = 0.940 – 0.694 = 0.246 Similarly, Gain value Gain (age) 0.246 Gain (income) 0.029 Gain (student) 0.151 Gain (credit_rating) 0.048 So, here we start decision tree with root node Age. 37
  • 38.
    Solution – ID3 •Now the age has highest information gain among all the attributes, so select age as test attribute and create the node as age and show all possible values of age for further splitting. Age <=30 31..40 >40 38
  • 39.
    Solution – ID3(Age <= 30) Age Income Student Credit_Rating Buys_Computer <=30 High No Fair No <=30 High No Excellent No <=30 Medium No Fair No <=30 Low Yes Fair Yes <=30 Medium Yes Excellent Yes 39
  • 40.
    Solution – ID3(Age <=30) • Compute Information gain & Entropy for Age with sample S<=30. • For age <=30, • Pi =Yes = 2 • Ni = No = 3 I 𝑝, 𝑛 = − 𝑝 𝑝+𝑛 𝑙𝑜𝑔2 𝑝 𝑝+𝑛 - 𝑛 𝑝+𝑛 𝑙𝑜𝑔2 𝑛 𝑝+𝑛 I 2,3 = − 2 5 𝑙𝑜𝑔2 2 5 - 3 5 𝑙𝑜𝑔2 3 5 I(3,2) = 0.971 40
  • 41.
    Solution – ID3(Age <= 30, Income) Income Pi Ni I (Pi , Ni) High 0 2 0 Medium 1 1 1 Low 1 0 0 E (A) = i=1 𝑣 Pi +Ni 𝑝+𝑛 I (Pi , Ni) E Income = 2 5 I (0,2) + 2 5 I (1,1) + 1 5 I (1,0) E(Income) = 0.4 In above table high (0,2) homogeneous so I(0,2) = 0, Medium equal portion so I(1,1) = 1 & Low I(1,0) = 0. Gain (S<=30,Income) = I (p, n) – E (Income) = 0.971 – 0.4 = 0.571 41
  • 42.
    Solution – ID3(Age <= 30, Student) student Pi Ni I (Pi , Ni) No 0 3 0 Yes 2 0 0 In above table I (0,3) = 0 & I (2,0) = 0 So E(Student) is 0. E(Student) = 0 Gain (S<=30,Student) = I (p,n) – E(Student) = 0.971 – 0 = 0.971 42
  • 43.
    Solution – ID3(Age <= 30, credit_rating) credit_rating Pi Ni I (Pi , Ni) Fair 1 2 0.918 Excellent 1 1 1 E (A) = i=1 𝑣 Pi +Ni 𝑝+𝑛 I (Pi , Ni) E credit_rating = 3 5 I (1,2) + 2 5 I (1,1) E(credit_rating) = 0.951 Gain (S<=30,credit_rating) = I (p, n) – E (credit_rating) = 0.971 – 0.951 = 0.020 43
  • 44.
    Solution – ID3(Age <=30) Gain (Age <= 30) value Income 0.571 Student 0.971 Credit_rating 0.020 As shown in table we get maximum gain for student so, select student as leaf node for age <=30 44 <=30 31..40 >40 Student No Yes Age
  • 45.
    Solution – ID3(Age 31..40) Age Income Student Credit_Rating Buys_Computer 31..40 High No Fair Yes 31..40 Low Yes Excellent Yes 31..40 Medium No Excellent Yes 31..40 High Yes Fair Yes Age <=30 31..40 >40 Student No Yes Yes 45
  • 46.
    Solution – ID3(Age > 40) Age Income Student Credit_Rating Buys_Computer >40 Medium No Fair Yes >40 Low Yes Fair Yes >40 Low No Excellent No >40 Medium Yes Fair Yes >40 Medium No Excellent No 46
  • 47.
    Solution – ID3(Age > 40) • Compute Information gain for Age with sample S>40. • For age > 40, • Pi =Yes = 3 • Ni = No = 2 I 𝑝, 𝑛 = − 𝑝 𝑝+𝑛 𝑙𝑜𝑔2 𝑝 𝑝+𝑛 - 𝑛 𝑝+𝑛 𝑙𝑜𝑔2 𝑛 𝑝+𝑛 I 3,2 = − 3 5 𝑙𝑜𝑔2 3 5 - 2 5 𝑙𝑜𝑔2 2 5 I(3,2) = 0.971 47
  • 48.
    Solution – ID3(Age > 40, Income) Income Pi Ni I (Pi , Ni) High 0 0 0 Medium 2 1 0.918 Low 1 1 1 Gain (S>40,Income) = I (p, n) – E (Income) = 0.971 – 0.951 = 0.020 48 E (A) = i=1 𝑣 Pi +Ni 𝑝+𝑛 I (Pi , Ni) E Income = 0 5 I (0,0) + 3 5 I (2,1) + 2 5 I (1,1) E(Income) = 0.951
  • 49.
    Solution – ID3(Age > 40, credit_rating) Credit_rating Pi Ni I (Pi , Ni) Fair 3 0 0 Excellent 0 2 0 E (A) = i=1 𝑣 Pi +Ni 𝑝+𝑛 I (Pi , Ni) E credit_rating = 3 5 I (3,0) + 2 5 I (0,2) E(credit_rating) = 0 Gain (S>40,credit_rating) = I (p, n) – E (credit_rating) = 0.971 – 0 = 0.971 49
  • 50.
    Solution – ID3(Age > 40) Gain (Age > 40) value Income 0.020 Credit_rating 0.971 As shown in table we get maximum gain for credit_rating so, select credit_rating as leaf node for age > 40 Age <=30 31..40 >40 Student No yes Yes Credit_rating 50
  • 51.
    Classification rules fromdecision tree • IF age = “<=30”AND student = “no”THEN buys_computer = “no” • IF age = “<=30”AND student = “yes”THEN buys_computer = “yes” • IF age = “31..40”THEN buys_computer = “yes” • IF age = “>40”AND credit_rating = “excellent”THEN buys_computer = “no” • IF age = “>40”AND credit_rating = “fair”THEN buys_computer = “yes” 51
  • 52.
    Bayesian Classification •Thomas Bayes,who proposed the Bayes Theorem so, it named Bayesian theorem. •It is statistical method & supervised learning method for classification. •It can solve problems involving both categorical and continuous valued attributes. •Bayesian classification is used to find conditional probabilities. 52
  • 53.
    The BayesTheorem •The BayesTheorem: •P(H|X)= 𝐏(𝐗|𝐇) 𝐏(𝐇) 𝐏(𝐗) •P(H|X) : Probability that the customer will buy a computer given that we know his age, credit rating and income. (Posterior Probability of H) •P(H) : Probability that the customer will buy a computer regardless of age, credit rating, income (Prior Probability of H) •P(X|H) : Probability that the customer is 35 years old, have fair credit rating and earns $40,000, given that he has bought computer (Posterior Probability of X) •P(X) : Probability that a person from our set of customers is 35 years old, have fair credit rating and earns $40,000. (Prior Probability of X) 53
  • 54.
    Naïve Bayes classifier- Example Age Income Student Credit_Rating Class : buys_computer <=30 High No Fair No <=30 High No Excellent No 31..40 High No Fair Yes >40 Medium No Fair Yes >40 Low Yes Fair Yes >40 Low Yes Excellent No 31..40 Low Yes Excellent Yes <=30 Medium No Fair No <=30 Low Yes Fair Yes >40 Medium Yes Fair Yes <=30 Medium Yes Excellent Yes 31..40 Medium No Excellent Yes 31..40 High Yes Fair Yes >40 Medium No Excellent No 54
  • 55.
    Naïve Bayes classifier- Solution Age P (<=30 | Yes) = 2/9 P (<=30 | No) = 3/5 P (Yes) = 9/14 P (No) = 5/14 P (31..40 | Yes) = 4/9 P (31..40 | No) = 0/5 P (> 40 | Yes) = 3/9 P (> 40 | No) = 2/5 Income P (High | Yes) = 2/9 P (High | No) = 2/5 P (Medium | Yes) = 4/9 P (Medium | No) = 2/5 P (Low | Yes) = 3/9 P (Low | No) = 1/5 Student P (No | Yes) = 3/9 P (No | No) = 4/5 P (Yes | Yes) = 6/9 P (Yes | No) = 1/5 Credit_rating P (Fair | Yes) = 6/9 P (Fair | No) = 2/5 P (Excellent | Yes) = 3/9 P (Excellent | No) = 3/5 55
  • 56.
    Naïve Bayes classifier- Solution • An unseen sampleY = (<=30, Low,Yes, Excellent) • P(Y|Yes).P(Yes) = P(<=30|Yes). P(Low|Yes). P(Yes|Yes). P(Excellent|Yes) . P(Yes) = 2/9 * 3/9 * 6/9 * 3/9 * 9/14 = 0.010582  P(Y|No).P(No) = P(<=30|No). P(Low|No). P(Yes|No). P(Excellent|No) . P(No) = 3/5 * 1/5 * 1/5 * 3/5 * 5/14 = 0.005142  Choose the class so that it maximizes this probability, this means that new instance will be classified as Yes (Buys_computer) 56
  • 57.
    Try yourself (BayesianClassification) Car No Color Type Origin Stolen? 1 Red Sports Domestic Yes 2 Red Sports Domestic No 3 Red Sports Domestic Yes 4 Yellow Sports Domestic No 5 Yellow Sports Imported Yes 6 Yellow SUV Imported No 7 Yellow SUV Imported Yes 8 Yellow SUV Domestic No 9 Red SUV Imported No 10 Red Sports Imported Yes Unseen Data Y = <Red, Domestic, SUV> Actual Data Y = <Red, Sports, Domestic> 57 0.024,0.072 (Unseen) 0.192, 0.048 (Actual)
  • 58.
    Rule Based Classification •It is featured by building rules based on an object attributes. • Rule-based classifier makes use of a set of IF-THEN rules for classification. • We can express a rule in the following from • IF conditionTHEN conclusion • Let us consider a rule R1, R1: IF age=youth AND student=yesTHEN buy_computer=yes • The IF part of the rule is called rule antecedent or precondition. • TheTHEN part of the rule is called rule consequent (conclusion). • The antecedent (IF) part the condition consist of one or more attribute tests and these tests are logicallyANDed. • The consequent (THEN) part consists of class prediction. 58
  • 59.
    Rule Based Classification(Cont..) • We can also write rule R1 as follows: R1: ((age = youth) ^ (student = yes)) => (buys_computer = yes) • If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied and that the rule covers the tuple. • A rule R can be assessed by its coverage and accuracy. • Given a tuple X, from a class labeled data set D, let it covers the number of tuples by R; the number of tuples correctly classified by R; and |D| be the number of tuples in D. • We can define the coverage and accuracy of R as Coverage (R) = 𝑛𝑐𝑜𝑣𝑒𝑟𝑠 |𝐷| Accuracy (R) = 𝑛𝑐𝑜𝑟𝑟𝑒𝑐𝑡 𝑛𝑐𝑜𝑣𝑒𝑟𝑠 59
  • 60.
    Characteristics of Rule-Based Classifier •Mutuallyexclusive rules • Classifier contains mutually exclusive rules if no two rules are triggered by the same record. • Every record is covered by at most one rule • If more than two rules are triggered for any record then conflict resolution strategies are applied •Exhaustive rules • Classifier has exhaustive coverage if it accounts for every possible combination of attribute values • Each record is covered by at least one rule 60
  • 61.
    Conflict Resolution Strategies •Size ordering • Assigns the highest priority to the rule having “toughest” requirement like ule antecedent size • Rule ordering • Prioritize the rules beforehand using rule-based (based on rule quality like accuracy, size, coverage etc.) or class-based ordering (decreasing order sorting of class prevalence) 61 Rule-based Ordering • (Refund=Yes) ==> No • (Refund=No, Marital Status={Single,Divorced},Taxable Income<80K) ==> No • (Refund=No, Marital Status={Single,Divorced},Taxable Income>80K) ==> Yes • (Refund=No, Marital Status={Married}) ==> No Class-based Ordering • (Refund=Yes) ==> No • (Refund=No, Marital Status={Single,Divorced},Taxable Income<80K) ==> No • (Refund=No, Marital Status={Married}) ==> No • (Refund=No, Marital Status={Single,Divorced},Taxable Income>80K) ==> Yes
  • 62.
    Two Methods togenerate rules • Direct method • Extract rules directly from data • e.g.: RIPPER, CN2, Holte’s 1R • Indirect method • Extract rules from other classification models (e.g. decision trees, neural networks, etc). • e.g: C4.5rules 62
  • 63.
  • 64.
    Model Evaluation andSelection • Evaluation metrics: How can we measure accuracy? Other metrics to consider? • Use validation test set of class-labeled tuples instead of training set when assessing accuracy • Methods for estimating a classifier’s accuracy: • Holdout method, random subsampling • Cross-validation • Bootstrap • Comparing classifiers: • Confidence intervals • Cost-benefit analysis and ROC Curves 64
  • 65.
  • 66.
    Model Selection: ROC Curves •ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models • Originated from signal detection theory • Shows the trade-off between the true positive rate and the false positive rate • The area under the ROC curve is a measure of the accuracy of the model • Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list • The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model  Vertical axis represents the true positive rate  Horizontal axis rep. the false positive rate  The plot also shows a diagonal line 66 66
  • 67.
    Issues Affecting ModelSelection • Accuracy • classifier accuracy: predicting class label • Speed • time to construct the model (training time) • time to use the model (classification/prediction time) • Robustness: handling noise and missing values • Scalability: efficiency in disk-resident databases • Interpretability • understanding and insight provided by the model • Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 67 67
  • 68.
    Classifier Evaluation Metrics:Confusion Matrix Actual class Predicted class buy_computer = yes buy_computer = no Total buy_computer = yes 6954 46 7000 buy_computer = no 412 2588 3000 Total 7366 2634 10000 • Given m classes, an entry, CMi,j in a confusion matrix indicates # of tuples in class i that were labeled by the classifier as class j • May have extra rows/columns to provide totals Confusion Matrix: Actual classPredicted class C1 ¬ C1 C1 True Positives (TP) False Negatives (FN) ¬ C1 False Positives (FP) True Negatives (TN) Example of Confusion Matrix: 68
  • 69.
    Classifier Evaluation Metrics:Accuracy, Error Rate, Sensitivity and Specificity • Classifier Accuracy, or Recognition Rate: percentage of test set tuples that are correctly classified Accuracy = (TP +TN)/(P+N) • Error rate: 1 – accuracy, or Error rate = (FP + FN)/(P+N)  Class Imbalance Problem:  One class may be rare, e.g. fraud, or HIV- positive  Significant majority of the negative class and minority of the positive class  Sensitivity: True Positive recognition rate  Sensitivity = TP/P  Specificity: True Negative recognition rate  Specificity = TN/N AP C ¬C C TP FN P ¬C FP TN N P’ N’ All 69 69
  • 70.
    Classifier Evaluation Metrics: Precisionand Recall, and F-measures • Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive • Recall: completeness – what % of positive tuples did the classifier label as positive? • Perfect score is 1.0 • Inverse relationship between precision & recall • F measure (F1 or F-score): harmonic mean of precision and recall, • Fß: weighted measure of precision and recall • assigns ß times as much weight to recall as to precision 70
  • 71.
    Classifier Evaluation Metrics:Example • Precision = 90/230 = 39.13% Recall = 90/300 = 30.00% Actual Class Predicted class cancer = yes cancer = no Total Recognition(%) cancer = yes 90 210 300 30.00 (sensitivity) cancer = no 140 9560 9700 98.56 (specificity) Total 230 9770 10000 96.40 (accuracy) 71
  • 72.
    Evaluating Classifier Accuracy: Holdout& Cross-Validation Methods • Holdout method • Given data is randomly partitioned into two independent sets • Training set (e.g., 2/3) for model construction • Test set (e.g., 1/3) for accuracy estimation •Random sampling: a variation of holdout • Repeat holdout k times, accuracy = avg. of the accuracies obtained • Cross-validation (k-fold, where k = 10 is most popular) • Randomly partition the data into k mutually exclusive subsets, each approximately equal size • At i-th iteration, use Di as test set and others as training set • Leave-one-out: k folds where k = # of tuples, for small sized data •*Stratified cross-validation*: folds are stratified so that class dist. in each fold is approx. the same as that in the initial data 72
  • 73.
    Ensemble Methods: Increasingthe Accuracy • Ensemble methods • Use a combination of models to increase accuracy • Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M* • Popular ensemble methods • Bagging: averaging the prediction over a collection of classifiers • Boosting: weighted vote with a collection of classifiers • Ensemble: combining a set of heterogeneous classifiers 73
  • 74.
    Bagging: Boostrap Aggregation •Analogy: Diagnosis based on multiple doctors’ majority vote • Training • Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap) • A classifier model Mi is learned for each training set Di • Classification: classify an unknown sample X • Each classifier Mi returns its class prediction • The bagged classifier M* counts the votes and assigns the class with the most votes to X • Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple • Accuracy • Often significantly better than a single classifier derived from D • For noise data: not considerably worse, more robust • Proved improved accuracy in prediction 74
  • 75.
    Boosting • Analogy: Consultseveral doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy • How boosting works? • Weights are assigned to each training tuple • A series of k classifiers is iteratively learned • After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi • The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy • Boosting algorithm can be extended for numeric prediction • Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data 75 75
  • 76.
    Regression • Regression isa data mining function that predicts a number or value. • Age, weight, distance, temperature, income, or sales attributes can be predicted using regression techniques. • For example, a regression model could be used to predict children's height, given their age, weight and other factors. • A regression task begins with a data set in which the target values are known. • For example, a regression model that predicts children's height could be developed based on observed data for many children over a period of time. • The data might track age, height, weight, developmental milestones, family history etc. 76
Regression (Cont..)
• Height would be the target, the other attributes would be the predictors, and the data for each child would constitute a case.
• Regression models are tested by computing various statistics that measure the difference between the predicted values and the expected values.
• An understanding of the mathematics used in regression analysis is required to develop quality regression models for data mining.
• The goal of regression analysis is to determine the values of parameters for a function that make the function best fit a given set of data observations.
77
Linear Regression
• The simplest form of regression to visualize is linear regression with a single predictor.
• A linear regression technique can be used if the relationship between X and Y can be approximated with a straight line.
• Linear regression with one dependent and one independent variable is defined by the formula:
Y = b1X + b0 + u
• Where Y = the dependent variable we are trying to predict, X = the independent variable, b1 = the slope, b0 = the intercept, and u = the random error (residual)
78
Example
Sr. No. | X | Y
1 | 0 | 2
2 | 1 | 3
3 | 2 | 5
4 | 3 | 4
5 | 4 | 6
The linear regression equation is Y = aX + b.
Find the values of a and b using the least squares method:
a = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)  and  b = (1/n)(ΣY − aΣX)
80
Continued…
Sr. No. | X | Y | XY | X²
1 | 0 | 2 | 0 | 0
2 | 1 | 3 | 3 | 1
3 | 2 | 5 | 10 | 4
4 | 3 | 4 | 12 | 9
5 | 4 | 6 | 24 | 16
∑ | 10 | 20 | 49 | 30
a = (5×49 − 10×20) / (5×30 − 10²) = 45/50 = 0.9
b = (1/5)(20 − 0.9×10) = 2.2
81
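The least squares computation from the worked example can be verified directly in pure Python. The function name `least_squares` is illustrative; the formulas are exactly those from the slide.

```python
def least_squares(xs, ys):
    """a = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²),  b = (1/n)(ΣY − aΣX)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# the five (X, Y) pairs from the example table
a, b = least_squares([0, 1, 2, 3, 4], [2, 3, 5, 4, 6])
print(a, b)   # 0.9 2.2
```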
Multiple Linear Regression
• Multiple linear regression is an extension of simple linear regression analysis.
• It uses two or more independent variables to predict the outcome of a single continuous dependent variable:
Y = b0 + b1X1 + b2X2 + b3X3 + … + bnXn + u
• Where Y = the dependent (response) variable, X1, X2, …, Xn are the independent variables (predictors), b0, b1, b2, …, bn are the regression coefficients, and u = the random error
82
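The coefficients b0, b1, …, bn can be estimated by least squares via the normal equations. The sketch below, with invented toy data and the hypothetical helper name `fit_multiple`, solves the normal equations with Gaussian elimination so that no external library is needed.

```python
def fit_multiple(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn
    via the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + list(x) for x in X]     # prepend intercept column
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    v = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    # back substitution
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (v[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b   # [b0, b1, ..., bn]

# toy data generated from Y = 1 + 2*X1 + 3*X2 (no noise)
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coeffs = fit_multiple(X, y)
print(coeffs)
```

Because the toy data is noise-free, the fit recovers the generating coefficients exactly (up to floating point).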
Exercise: Predict the final exam grade of a student who received an 86 and an 89 on the midterm exams.
83
Nonlinear Regression
• Often the relationship between x and y cannot be approximated with a straight line; in such cases, a nonlinear regression technique may be used.
• Alternatively, the data could be preprocessed to make the relationship linear.
84
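One common way to "preprocess the data to make the relationship linear", as the slide suggests, is a log transform: an exponential relationship y = A·e^(Bx) becomes the straight line ln(y) = ln(A) + Bx. The example below is a sketch under that assumption, with invented data and the hypothetical name `fit_exponential`.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = A * exp(B*x) by linearising: ln(y) = ln(A) + B*x,
    then applying simple least squares to (x, ln y)."""
    ly = [math.log(y) for y in ys]
    n = len(xs)
    sx, sy = sum(xs), sum(ly)
    sxy = sum(x * y for x, y in zip(xs, ly))
    sxx = sum(x * x for x in xs)
    B = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    A = math.exp((sy - B * sx) / n)
    return A, B

# toy data generated from y = 2 * exp(0.5 * x)
xs = [0, 1, 2, 3]
ys = [2 * math.exp(0.5 * x) for x in xs]
A, B = fit_exponential(xs, ys)
print(round(A, 3), round(B, 3))   # 2.0 0.5
```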
Logistic Regression
• A linear regression is not appropriate for predicting the value of a binary variable, for two reasons:
• A linear regression will predict values outside the acceptable range (e.g., predicting probabilities outside the range 0 to 1).
• Since each experiment can have only one of two possible values, the residuals (random errors) will not be normally distributed about the predicted line.
• A logistic regression produces a logistic curve, which is limited to values between 0 and 1.
• Logistic regression is similar to linear regression, but the curve is constructed using the natural logarithm of the "odds" of the target variable, rather than the probability.
85
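The logistic curve bounded between 0 and 1 can be sketched with a simple single-predictor model P(y=1|x) = sigmoid(b0 + b1·x), fitted here by gradient ascent on the log-likelihood. The toy data, learning rate, and function names are illustrative assumptions, not part of the slides.

```python
import math

def sigmoid(z):
    """Logistic function: always returns a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by stochastic gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            b0 += lr * (y - p)
            b1 += lr * (y - p) * x
    return b0, b1

# toy binary data: class 0 for small x, class 1 for large x
xs = [1, 2, 3, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
probs = [sigmoid(b0 + b1 * x) for x in xs]
print([round(p, 2) for p in probs])
```

Unlike a straight line, every predicted value stays inside (0, 1), which is exactly the property the slide motivates.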