Decision Tree Induction: Using Frequency Tables for Attribute Selection
Nguyễn Dương Trung Dũng
Content
1. Calculating Entropy in Practice
2. Gini Index of Diversity
3. Inductive Bias
4. Using Gain Ratio for Attribute Selection
Calculating Entropy in Practice

[Table: Training Set 1 (age = 1) for lens24]
The frequency table for attribute age for lens24 is shown below. The cells of this table show the number of occurrences of each combination of class and attribute value in the training set (values reconstructed from the worked calculations that follow):

class        age=1   age=2   age=3
hard           2       1       1
soft           2       2       1
no             4       5       6
column sum     8       8       8
The value of $E_{new}$ can be calculated from the frequency table as follows:
- For every non-zero value $V$ in the main body of the table, subtract $V \log_2 V$.
- For every non-zero value $S$ in the column sum row, add $S \log_2 S$.
- Divide the total by $N$, the number of instances.

$E_{new} = [(-2\log_2 2 - 1\log_2 1 - 1\log_2 1 - 2\log_2 2 - 2\log_2 2 - 1\log_2 1 - 4\log_2 4 - 5\log_2 5 - 6\log_2 6) + (8\log_2 8 + 8\log_2 8 + 8\log_2 8)] / 24 = 1.2867$

This agrees with the value calculated previously.
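As a quick sanity check, here is a minimal Python sketch of this frequency-table shortcut, using the reconstructed age table above (the function name `e_new` and the table layout are illustrative assumptions, not from the slides):

```python
import math

# Frequency table for attribute age on lens24 (reconstructed above):
# rows = classes (hard, soft, no), columns = age values 1, 2, 3.
freq_age = [
    [2, 1, 1],  # hard
    [2, 2, 1],  # soft
    [4, 5, 6],  # no
]

def e_new(table):
    """E_new via the frequency-table shortcut: subtract V*log2(V) for
    every non-zero cell V, add S*log2(S) for every non-zero column
    sum S, then divide by N (the number of instances)."""
    n = sum(sum(row) for row in table)
    total = 0.0
    for row in table:
        for v in row:
            if v:  # log2(0) is undefined, so empty cells are skipped
                total -= v * math.log2(v)
    for col in zip(*table):
        s = sum(col)
        if s:
            total += s * math.log2(s)
    return total / n

print(round(e_new(freq_age), 4))  # 1.2867, matching the slide
```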
Gini Index of Diversity

[Table: Training Set 1 (age = 1) for lens24]
If there are $K$ classes, with the probability of the $i$th class being $p_i$, the Gini Index is defined as:

$G_{start} = 1 - \sum_{i=1}^{K} p_i^2$

Lens24 dataset: 24 instances, 3 classes (hard: 4, soft: 5, no: 15).

$p_1 = 4/24$; $p_2 = 5/24$; $p_3 = 15/24$

$G_{start} = 0.5382$
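In Python the same calculation is a one-liner over the class counts (a sketch; the variable names are mine):

```python
counts = [4, 5, 15]          # hard, soft, no
n = sum(counts)              # 24 instances
g_start = 1 - sum((c / n) ** 2 for c in counts)
print(round(g_start, 4))     # 0.5382
```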
We can now calculate the new value of the Gini Index as follows (a code sketch follows the worked example below):
- For each non-empty column, form the sum of the squares of the values in the body of the table and divide by the column sum.
- Add the values obtained for all the columns and divide by $N$ (the number of instances).
- Subtract the total from 1.
Using the frequency table for attribute age for lens24:

age=1: $(2^2 + 2^2 + 4^2)/8 = 3$
age=2: $(1^2 + 2^2 + 5^2)/8 = 3.75$
age=3: $(1^2 + 1^2 + 6^2)/8 = 4.75$

$G_{new} = 1 - (3 + 3.75 + 4.75)/24 = 0.5208$
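A minimal sketch of this column-wise rule, reusing the age frequency table from the entropy example (the function name `g_new` is my own):

```python
def g_new(table):
    """New Gini value after a split: per non-empty column, sum of
    squared cells over the column sum; total over all columns,
    divided by N, subtracted from 1."""
    n = sum(sum(row) for row in table)
    total = 0.0
    for col in zip(*table):
        s = sum(col)
        if s:  # skip empty columns
            total += sum(v * v for v in col) / s
    return 1 - total / n

freq_age = [[2, 1, 1], [2, 2, 1], [4, 5, 6]]  # as before
print(round(g_new(freq_age), 4))  # 0.5208
```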
Repeating the calculation for the other attributes:

specRx: $G_{new} = 0.5278$, so the reduction is $0.5382 - 0.5278 = 0.0104$
astig: $G_{new} = 0.4653$, so the reduction is $0.5382 - 0.4653 = 0.0729$
tears: $G_{new} = 0.3264$, so the reduction is $0.5382 - 0.3264 = 0.2118$

The largest reduction in the Gini Index comes from splitting on tears. This is the same attribute that was selected using entropy.
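Attribute selection then simply picks the attribute with the largest reduction, e.g. (using the $G_{new}$ values quoted above; the age reduction is my own arithmetic):

```python
g_start = 0.5382
g_new_by_attr = {"age": 0.5208, "specRx": 0.5278,
                 "astig": 0.4653, "tears": 0.3264}
reductions = {a: round(g_start - g, 4) for a, g in g_new_by_attr.items()}
print(max(reductions, key=reductions.get), reductions)
# tears {'age': 0.0174, 'specRx': 0.0104, 'astig': 0.0729, 'tears': 0.2118}
```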
Inductive Bias

Find the next term in the sequence 1, 4, 9, 16, ?

Most readers will probably have chosen the answer 25, but this is misguided. The correct answer is 20, since the $n$th term is given by:

$(-5n^4 + 50n^3 - 151n^2 + 250n - 120)/24$
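The formula is easy to verify numerically (all five numerators are divisible by 24, so integer division is exact):

```python
def nth_term(n):
    # Quartic that fits 1, 4, 9, 16 but continues with 20, not 25
    return (-5 * n**4 + 50 * n**3 - 151 * n**2 + 250 * n - 120) // 24

print([nth_term(n) for n in range(1, 6)])  # [1, 4, 9, 16, 20]
```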
Inductive bias: a preference for one choice rather than another, determined not by the data itself but by external factors such as our preferences for simplicity or familiarity.

Any formula we use for attribute selection introduces an inductive bias.
Using Gain Ratio for Attribute SelectionGain Ratio is used to reduce the effect of the bias resulting from the use of information gain. Information Gain = 𝐸𝑠𝑡𝑎𝑟𝑡-𝐸𝑛𝑒𝑤Gain Ratio = Information Gain/Split Information Split Information is a value based on the column sums, each non-zero column sum s contributes – (s/N)𝑙𝑜𝑔2(s/N) to the Split Information 14