Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree.
The attribute with the highest information gain is chosen as the test attribute for the node.
Algorithm
Let S be a set consisting of s data samples.
The class label attribute has m distinct values, defining m distinct classes Ci (for i = 1, 2, …, m).
Let si be the number of samples of S in class Ci.
1) The expected information needed to classify a given sample is given by
I(s1, s2, …, sm) = −Σ pi log2(pi), summed over i = 1, …, m
where pi is the probability that an arbitrary sample belongs to class Ci, estimated by si/s.
2) The entropy, or expected information based on partitioning S into subsets by attribute A, is given by
E(A) = Σ ((s1j + … + smj) / s) · I(s1j, …, smj), summed over j = 1, …, v
where attribute A has v distinct values that partition S into subsets S1, …, Sv, and sij is the number of samples of class Ci in subset Sj.
3) Calculate the gain value:
Gain(A) = I(s1, s2, …, sm) − E(A)
The algorithm computes the information gain of each attribute. The attribute with the highest
information gain is chosen as the test attribute for the given set S.
A node is created and labeled with that attribute, branches are created for each value of the attribute,
and the samples are partitioned accordingly.
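The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original slides; the function names and the assumption that samples are a list of dicts keyed by attribute name are my own.

```python
import math
from collections import Counter

def expected_info(class_counts):
    """I(s1, ..., sm): expected information needed to classify a sample,
    given the number of samples in each class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def entropy_of_attribute(samples, attribute, class_label):
    """E(A): expected information after partitioning `samples`
    (a list of dicts) on the values of `attribute`."""
    total = len(samples)
    e = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [s for s in samples if s[attribute] == value]
        counts = list(Counter(s[class_label] for s in subset).values())
        e += (len(subset) / total) * expected_info(counts)
    return e

def gain(samples, attribute, class_label):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    counts = list(Counter(s[class_label] for s in samples).values())
    return expected_info(counts) - entropy_of_attribute(samples, attribute, class_label)
```

Selecting the test attribute then amounts to evaluating `gain` for every candidate attribute and picking the maximum.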
Induction of a decision tree
Training data tuples from the customer database:

S.No  Age    Income  Marital status  Employed  Class: diagnosed
 1    31-40  High    Unmarried       No        No
 2    31-40  High    Unmarried       Yes       No
 3    41-50  High    Unmarried       No        Yes
 4    51-60  Medium  Unmarried       No        Yes
 5    51-60  Low     Married         No        Yes
 6    51-60  Low     Married         Yes       No
 7    41-50  Low     Married         Yes       Yes
 8    31-40  Medium  Unmarried       No        No
 9    31-40  Low     Married         No        Yes
10    51-60  Medium  Married         No        Yes
11    31-40  Medium  Married         Yes       Yes
12    41-50  Medium  Unmarried       Yes       Yes
13    41-50  High    Married         No        Yes
14    51-60  Medium  Unmarried       Yes       No
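To reproduce the numbers worked out below, the table can be encoded as plain Python data compatible with the sketch above. The field names (`age`, `marital_status`, `diagnosed`, …) are labels I chose for the columns, not names from the slides.

```python
# The 14 training tuples above, encoded as a list of dicts.
training_samples = [
    {"age": "31-40", "income": "High",   "marital_status": "Unmarried", "employed": "No",  "diagnosed": "No"},
    {"age": "31-40", "income": "High",   "marital_status": "Unmarried", "employed": "Yes", "diagnosed": "No"},
    {"age": "41-50", "income": "High",   "marital_status": "Unmarried", "employed": "No",  "diagnosed": "Yes"},
    {"age": "51-60", "income": "Medium", "marital_status": "Unmarried", "employed": "No",  "diagnosed": "Yes"},
    {"age": "51-60", "income": "Low",    "marital_status": "Married",   "employed": "No",  "diagnosed": "Yes"},
    {"age": "51-60", "income": "Low",    "marital_status": "Married",   "employed": "Yes", "diagnosed": "No"},
    {"age": "41-50", "income": "Low",    "marital_status": "Married",   "employed": "Yes", "diagnosed": "Yes"},
    {"age": "31-40", "income": "Medium", "marital_status": "Unmarried", "employed": "No",  "diagnosed": "No"},
    {"age": "31-40", "income": "Low",    "marital_status": "Married",   "employed": "No",  "diagnosed": "Yes"},
    {"age": "51-60", "income": "Medium", "marital_status": "Married",   "employed": "No",  "diagnosed": "Yes"},
    {"age": "31-40", "income": "Medium", "marital_status": "Married",   "employed": "Yes", "diagnosed": "Yes"},
    {"age": "41-50", "income": "Medium", "marital_status": "Unmarried", "employed": "Yes", "diagnosed": "Yes"},
    {"age": "41-50", "income": "High",   "marital_status": "Married",   "employed": "No",  "diagnosed": "Yes"},
    {"age": "51-60", "income": "Medium", "marital_status": "Unmarried", "employed": "Yes", "diagnosed": "No"},
]
```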
Training data tuples from the Palayam, Kanyakumari district database (30 samples)
This data set was collected at Palayam, Kanyakumari district, during a medical camp on 07/04/2019.
Thirty samples were collected for research purposes, with 64 attributes recorded per sample; 5 of these attributes were selected for testing.
Attributes collected: 64
Attributes selected: 05
The class label attribute, diagnosis, has 2 distinct values ({yes, no}), therefore m = 2.
C1 corresponds to 'yes' and C2 corresponds to 'no'.
There are 9 samples of class 'yes' and 5 samples of class 'no'.
1) First compute the expected information needed to classify a given sample:
I(s1, s2) = I(9, 5) = −[(9/14) log2(9/14) + (5/14) log2(5/14)] = 0.940
Next, compute the entropy of each attribute, starting with the attribute 'age'.
For age = '31-40': s11 = 2, s21 = 3, so I(s11, s21) = 0.971
For age = '41-50': s12 = 4, s22 = 0, so I(s12, s22) = 0
For age = '51-60': s13 = 3, s23 = 2, so I(s13, s23) = 0.971
2) Then the entropy is calculated:
E(age) = ((2+3)/14)·I(s11, s21) + ((4+0)/14)·I(s12, s22) + ((3+2)/14)·I(s13, s23)
       = (5/14)(0.971) + 0 + (5/14)(0.971)
       = 0.694
3) Gain(age) = I(s1, s2) − E(age) = 0.940 − 0.694 = 0.246
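As a sanity check, the same numbers fall out of the functions and `training_samples` sketched earlier (assuming those illustrative definitions):

```python
# Reproducing the worked values; results match the text up to rounding.
i_total = expected_info([9, 5])                                        # ≈ 0.940
e_age   = entropy_of_attribute(training_samples, "age", "diagnosed")   # ≈ 0.694
g_age   = gain(training_samples, "age", "diagnosed")                   # ≈ 0.247 (0.246 in the text due to rounding)
print(round(i_total, 3), round(e_age, 3), round(g_age, 3))
```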
4) Decision tree
Since age has the highest information gain, it becomes the test attribute at the root node of
the decision tree.
5) Rule-based classifier
Generating classification rules from a decision tree
The rules extracted from the above decision tree are:
IF age = '31-40' AND marital status = 'unmarried' THEN diagnosis = 'no'
IF age = '31-40' AND marital status = 'married' THEN diagnosis = 'yes'
IF age = '41-50' THEN diagnosis = 'yes'
IF age = '51-60' AND employed = 'yes' THEN diagnosis = 'no'
IF age = '51-60' AND employed = 'no' THEN diagnosis = 'yes'
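The five rules map directly onto conditional logic. A minimal sketch, reusing the dict keys assumed for `training_samples` (again illustrative names, not from the slides):

```python
def classify(sample):
    """Apply the five extracted rules to a sample given as a dict."""
    if sample["age"] == "31-40":
        return "No" if sample["marital_status"] == "Unmarried" else "Yes"
    if sample["age"] == "41-50":
        return "Yes"
    if sample["age"] == "51-60":
        return "No" if sample["employed"] == "Yes" else "Yes"
    return None  # no rule fires (cannot happen for the three age ranges above)

# Example: an unmarried customer aged 31-40 is classified as 'No'.
print(classify({"age": "31-40", "marital_status": "Unmarried", "employed": "No"}))
```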
The J48 algorithm uses the training samples to estimate the accuracy of each rule. A rule can be pruned
by removing any condition in its antecedent that does not improve the estimated accuracy of the
rule.
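J48 (the Weka implementation of C4.5) actually bases this on a pessimistic error estimate; the sketch below only illustrates the simpler idea described above, greedily dropping any antecedent condition whose removal does not lower the rule's estimated accuracy on the training samples. The helper names and the (attribute, value) rule representation are hypothetical.

```python
def rule_accuracy(conditions, predicted_class, samples, class_label="diagnosed"):
    """Accuracy of a rule (a list of (attribute, value) conditions) on the
    samples its antecedent covers; 0.0 if it covers nothing."""
    covered = [s for s in samples if all(s[a] == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(s[class_label] == predicted_class for s in covered) / len(covered)

def prune_rule(conditions, predicted_class, samples):
    """Greedily remove any condition whose removal does not reduce the
    estimated accuracy of the rule."""
    conditions = list(conditions)
    improved = True
    while improved and len(conditions) > 1:
        improved = False
        base = rule_accuracy(conditions, predicted_class, samples)
        for cond in list(conditions):
            trial = [c for c in conditions if c != cond]
            if rule_accuracy(trial, predicted_class, samples) >= base:
                conditions = trial
                improved = True
                break
    return conditions
```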
