
Chapter 5: Decision Tree Induction Using Frequency Tables for Attribute Selection

  1. Decision Tree Induction: Using Frequency Tables for Attribute Selection
     Nguyễn Dương Trung Dũng
  2. Content
     1. Calculating Entropy in Practice
     2. Gini Index of Diversity
     3. Inductive Bias
     4. Using Gain Ratio for Attribute Selection
  3. Calculating Entropy in Practice
     Training Set 1 (age=1) for lens24
  4. Calculating Entropy in Practice
     Frequency Table for Attribute age for lens24. The cells of this table show the number of occurrences of each combination of class and attribute value in the training set:

                age=1   age=2   age=3
       hard       2       1       1
       soft       2       2       1
       no         4       5       6
       sum        8       8       8
  5. Calculating Entropy in Practice
     The value of E_new can be calculated from the frequency table as follows:
     - For every non-zero value V in the main body of the table, subtract V log₂ V
     - For every non-zero value S in the column sum row, add S log₂ S
     - Divide the result by N, the total number of instances
     E_new = [−(2 log₂ 2 + 1 log₂ 1 + 1 log₂ 1 + 2 log₂ 2 + 2 log₂ 2 + 1 log₂ 1 + 4 log₂ 4 + 5 log₂ 5 + 6 log₂ 6) + (8 log₂ 8 + 8 log₂ 8 + 8 log₂ 8)] / 24 = 1.2867
     This agrees with the value calculated previously.
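As a check on this recipe, here is a minimal Python sketch that computes E_new directly from a frequency table. The table values and N = 24 come from the slides; the function name e_new is my own:

```python
import math

def e_new(freq):
    """Average entropy after a split, computed straight from a frequency
    table whose rows are classes and whose columns are attribute values."""
    col_sums = [sum(col) for col in zip(*freq)]   # one sum per column
    n = sum(col_sums)                             # total instances N
    body = sum(v * math.log2(v) for row in freq for v in row if v > 0)
    sums = sum(s * math.log2(s) for s in col_sums if s > 0)
    return (sums - body) / n

# Frequency table for attribute age (columns age=1,2,3; rows hard, soft, no)
age = [[2, 1, 1],
       [2, 2, 1],
       [4, 5, 6]]
print(round(e_new(age), 4))  # 1.2867, agreeing with the slide
```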
  6. Gini Index of Diversity
  7. Gini Index of Diversity
     Training Set 1 (age=1) for lens24
  8. Gini Index of Diversity
     If there are K classes, with the probability of the ith class being p_i, the Gini Index is defined as:
     G_start = 1 − Σ p_i²  (sum over i = 1, …, K)
     Lens24 dataset: 24 instances, 3 classes (hard: 4, soft: 5, no: 15)
     p₁ = 4/24; p₂ = 5/24; p₃ = 15/24
     G_start = 1 − ((4/24)² + (5/24)² + (15/24)²) = 0.5382
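A minimal sketch of this starting value in Python, assuming only the class counts given on the slide (the helper name gini is mine):

```python
def gini(counts):
    """Gini Index 1 - sum(p_i^2) for a list of per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([4, 5, 15]), 4))  # 0.5382 for lens24 (hard, soft, no)
```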
  9. Gini Index of Diversity
     We can now calculate the new value of the Gini Index as follows:
     - For each non-empty column, form the sum of the squares of the values in the body of the table and divide by the column sum
     - Add the values obtained for all the columns and divide by N (the number of instances)
     - Subtract the total from 1
  10. Gini Index of Diversity
      Frequency Table for Attribute age for lens24:
      Age=1: (2² + 2² + 4²)/8 = 3
      Age=2: (1² + 2² + 5²)/8 = 3.75
      Age=3: (1² + 1² + 6²)/8 = 4.75
      G_new = 1 − (3 + 3.75 + 4.75)/24 = 0.5208
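The same three-step recipe in Python, reusing the age frequency table from the entropy sketch above (g_new is an illustrative name, not from the slides):

```python
def g_new(freq):
    """New Gini value after a split, from a frequency table
    (rows = classes, columns = attribute values)."""
    n = sum(sum(row) for row in freq)
    total = 0.0
    for col in zip(*freq):            # one column per attribute value
        s = sum(col)
        if s > 0:                     # skip empty columns
            total += sum(v * v for v in col) / s
    return 1 - total / n

age = [[2, 1, 1],   # hard
       [2, 2, 1],   # soft
       [4, 5, 6]]   # no
print(round(g_new(age), 4))  # 0.5208, as on the slide
```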
  11. Gini Index of Diversity
      specRx: G_new = 0.5278, so the reduction is 0.5382 − 0.5278 = 0.0104
      astig: G_new = 0.4653, so the reduction is 0.5382 − 0.4653 = 0.0729
      tears: G_new = 0.3264, so the reduction is 0.5382 − 0.3264 = 0.2118
      The largest reduction in the Gini Index is for attribute tears. This is the same attribute that was selected using entropy.
  12. Inductive Bias
      Find the next term in the sequence:
      1, 4, 9, 16, ?
      Most readers will probably have chosen the answer 25, but this is misguided. The correct answer is 20, given by the formula:
      nth term = (−5n⁴ + 50n³ − 151n² + 250n − 120)/24
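The claim is easy to verify; evaluating that polynomial for n = 1 … 5 (a throwaway check of my own, not from the slides) does give 1, 4, 9, 16 followed by 20:

```python
def nth_term(n):
    # The polynomial from the slide; every value is an exact multiple of 24
    return (-5 * n**4 + 50 * n**3 - 151 * n**2 + 250 * n - 120) // 24

print([nth_term(n) for n in range(1, 6)])  # [1, 4, 9, 16, 20]
```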
  13. Inductive Bias
      Inductive bias is:
      - A preference for one choice rather than another
      - Determined by external factors such as our preferences, simplicity, and familiarity
      Any formula we use for attribute selection introduces an inductive bias.
  14. Using Gain Ratio for Attribute Selection
      Gain Ratio is used to reduce the effect of the bias resulting from the use of information gain.
      Information Gain = E_start − E_new
      Gain Ratio = Information Gain / Split Information
      Split Information is a value based on the column sums: each non-zero column sum s contributes −(s/N) log₂(s/N) to the Split Information.
  15. Using Gain Ratio for Attribute Selection
      Frequency Table for Attribute age for lens24:
      Split Information = −(8/24) log₂(8/24) − (8/24) log₂(8/24) − (8/24) log₂(8/24) = 1.5850
      Gain Ratio = 0.0394/1.5850 = 0.0249
      The Gain Ratio values for splitting on attributes specRx, astig, and tears are 0.0395, 0.3770, and 0.5488.
      The largest value is for attribute tears, so in this case Gain Ratio selects the same attribute as entropy.
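Sketching this last step in Python, with the column sums for age and the information gain of 0.0394 taken from the slides (split_info is my own name for the helper):

```python
import math

def split_info(col_sums, n):
    """Split Information: -(s/n) * log2(s/n) summed over non-zero column sums."""
    return -sum((s / n) * math.log2(s / n) for s in col_sums if s > 0)

si = split_info([8, 8, 8], 24)   # attribute age splits 24 instances 8/8/8
print(f"{si:.4f}")               # 1.5850
print(f"{0.0394 / si:.4f}")      # 0.0249, the Gain Ratio for age
```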
  16. The end
