Decision tree.10.11

  1. 1. Constructing Decision Trees
  2. 2. A Decision Tree Example • The weather data example.
     ID code  Outlook   Temperature  Humidity  Windy  Play
     a        Sunny     Hot          High      False  No
     b        Sunny     Hot          High      True   No
     c        Overcast  Hot          High      False  Yes
     d        Rainy     Mild         High      False  Yes
     e        Rainy     Cool         Normal    False  Yes
     f        Rainy     Cool         Normal    True   No
     g        Overcast  Cool         Normal    True   Yes
     h        Sunny     Mild         High      False  No
     i        Sunny     Cool         Normal    False  Yes
     j        Rainy     Mild         Normal    False  Yes
     k        Sunny     Mild         Normal    True   Yes
     l        Overcast  Mild         High      True   Yes
     m        Overcast  Hot          Normal    False  Yes
     n        Rainy     Mild         High      True   No
  3. 3. ~continues • Decision tree for the weather data: the root tests Outlook; the sunny branch tests Humidity (high → no, normal → yes), the overcast branch predicts yes, and the rainy branch tests Windy (false → yes, true → no).
  4. 4. The Process of Constructing a Decision Tree • Select an attribute to place at the root of the decision tree and make one branch for every possible value. • Repeat the process recursively for each branch.
  5. 5. Which Attribute Should Be Placed at a Certain Node • One common approach is based on the information gained by placing a certain attribute at this node.
  6. 6. Information Gained by Knowing the Result of a Decision • In the weather data example, there are 9 instances for which the decision to play is “yes” and 5 instances for which the decision is “no”. The information gained by knowing the result of the decision is therefore
     $-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$ bits.
  7. 7. The General Form for Calculating the Information Gain • Entropy of a decision $= -P_1\log_2 P_1 - P_2\log_2 P_2 - \cdots - P_n\log_2 P_n$, where $P_1, P_2, \ldots, P_n$ are the probabilities of the $n$ possible outcomes.
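As an illustration of this entropy formula, here is a minimal Python sketch; the helper name `entropy` and the use of raw class counts are illustrative choices, not from the slides.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a discrete distribution given by class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 9 "yes" and 5 "no" instances in the weather data (slide 6)
print(round(entropy([9, 5]), 3))  # -> 0.94
```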
  8. 8. Information Further Required If “Outlook” Is Placed at the Root • With Outlook at the root, the sunny branch holds 2 “yes” and 3 “no” instances, the overcast branch 4 “yes”, and the rainy branch 3 “yes” and 2 “no”.
     Information further required $= \frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 + \frac{5}{14}\times 0.971 = 0.693$ bits.
  9. 9. Information Gained by Placing Each of the 4 Attributes • Gain(outlook) = 0.940 bits – 0.693 bits = 0.247 bits. • Gain(temperature) = 0.029 bits. • Gain(humidity) = 0.152 bits. • Gain(windy) = 0.048 bits.
  10. 10. The Strategy for Selecting an Attribute to Place at a Node • Select the attribute that gives us the largest information gain. • In this example, it is the attribute “Outlook”: its sunny branch holds 2 “yes” and 3 “no”, its overcast branch 4 “yes”, and its rainy branch 3 “yes” and 2 “no”.
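To make the selection rule concrete, here is a self-contained Python sketch that recomputes the information gain of all four attributes on the weather data from slide 2; the helper names `entropy` and `gain` and the tuple representation of the data are illustrative choices.

```python
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

# Weather data from slide 2: (Outlook, Temperature, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", "False", "No"),       ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),   ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),   ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),   ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Rainy", "Mild", "High", "True", "No"),
]

def entropy(rows):
    """Entropy (bits) of the Play decision over the given rows."""
    total = len(rows)
    result = 0.0
    for label in set(r[-1] for r in rows):
        p = sum(1 for r in rows if r[-1] == label) / total
        result -= p * log2(p)
    return result

def gain(rows, attr_index):
    """Information gain of splitting the rows on the attribute at attr_index."""
    total = len(rows)
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r for r in rows if r[attr_index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(data, i):.3f} bits")
# Outlook has the largest gain (about 0.247 bits), so it is placed at the root.
```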
  11. 11. The Recursive Procedure for Constructing a Decision Tree • The operation discussed above is applied to each branch recursively to construct the decision tree. • For example, for the branch “Outlook = Sunny”, we evaluate the information gained by applying each of the remaining 3 attributes. • Gain(Outlook=sunny;Temperature) = 0.971 – 0.4 = 0.571 • Gain(Outlook=sunny;Humidity) = 0.971 – 0 = 0.971 • Gain(Outlook=sunny;Windy) = 0.971 – 0.951 = 0.02
  12. 12. • Similarly, we also evaluate the information gained by applying each of the remaining 3 attributes for the branch “Outlook = rainy”. • Gain(Outlook=rainy;Temperature) = 0.971 – 0.951 = 0.02 • Gain(Outlook=rainy;Humidity) = 0.971 – 0.951 = 0.02 • Gain(Outlook=rainy;Windy) =0.971 – 0 = 0.971
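The recursion described on slides 11 and 12 can be sketched as follows; it reuses `entropy`, `gain`, `ATTRS`, and `data` from the previous snippet, and the `build_tree` helper with its nested-dict tree representation is an illustrative choice, not part of the original slides.

```python
def build_tree(rows, attr_indices):
    """Recursively build a decision tree as nested dicts; leaves are class labels."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:           # pure node: stop splitting
        return labels[0]
    if not attr_indices:                # no attributes left: majority class
        return max(set(labels), key=labels.count)
    best = max(attr_indices, key=lambda i: gain(rows, i))
    remaining = [i for i in attr_indices if i != best]
    tree = {"attribute": ATTRS[best], "branches": {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        tree["branches"][value] = build_tree(subset, remaining)
    return tree

# Reproduces the tree of slide 3: Outlook at the root, Humidity under "Sunny",
# Windy under "Rainy", and a "Yes" leaf under "Overcast".
print(build_tree(data, list(range(len(ATTRS)))))
```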
  13. 13. The Over-fitting Issue • Over-fitting is caused by creating decision rules that work accurately on the training set but are based on an insufficient number of samples. • As a result, these decision rules may not work well in more general cases.
  14. 14. Example of the Over-fitting Problem in Decision Tree Construction • The subroot holds 11 “Yes” and 9 “No” samples (prediction = “Yes”). Splitting on attribute A_i gives one child with 3 “Yes” and 0 “No” samples (A_i = 0, prediction = “Yes”) and one child with 8 “Yes” and 9 “No” samples (A_i = 1, prediction = “No”).
     Entropy at the subroot $= -\frac{11}{20}\log_2\frac{11}{20} - \frac{9}{20}\log_2\frac{9}{20} = 0.993$ bits.
     Average entropy at the children $= \frac{17}{20}\left(-\frac{8}{17}\log_2\frac{8}{17} - \frac{9}{17}\log_2\frac{9}{17}\right) = 0.848$ bits.
  15. 15. • Hence, with the binary split, we gain more information. • However, if we look at the pessimistic error rate, i.e. the upper bound of the confidence interval of the error rate, we may reach a different conclusion. • The formula for the pessimistic error rate is
     $e = \dfrac{r + \frac{z^2}{2n} + z\sqrt{\frac{r}{n} - \frac{r^2}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$,
     where $r$ is the observed error rate, $n$ is the number of samples, $z = \Phi^{-1}(c)$, and $c$ is the confidence level specified by the user. • Note that the pessimistic error rate is a function of $r$, $n$, and $c$.
  16. 16. • The pessimistic error rates under 95% confidence ($z = 1.645$) are
     $e_{9/20} = \dfrac{0.45 + \frac{1.645^2}{40} + 1.645\sqrt{\frac{0.45}{20} - \frac{0.45^2}{20} + \frac{2.706}{1600}}}{1 + \frac{1.645^2}{20}} = 0.6278$
     $e_{0/3} = \dfrac{\frac{1.645^2}{6} + 1.645\sqrt{\frac{1.645^2}{36}}}{1 + \frac{1.645^2}{3}} = 0.4742$
     $e_{8/17} = \dfrac{\frac{8}{17} + \frac{2.706}{34} + 1.645\sqrt{\frac{8/17}{17} - \frac{(8/17)^2}{17} + \frac{2.706}{1156}}}{1 + \frac{1.645^2}{17}} = 0.6598$
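For reproducibility, here is a small Python sketch of the pessimistic-error computation under the 95% confidence setting used above; the function name `pessimistic_error` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(r, n, confidence=0.95):
    """Upper confidence bound on the error rate; r = observed error rate, n = number of samples."""
    z = NormalDist().inv_cdf(confidence)  # about 1.645 for 95% confidence
    numerator = r + z**2 / (2 * n) + z * sqrt(r / n - r**2 / n + z**2 / (4 * n**2))
    return numerator / (1 + z**2 / n)

e_subroot = pessimistic_error(9 / 20, 20)  # 9 errors out of 20 at the subroot
e_child0 = pessimistic_error(0 / 3, 3)     # 0 errors out of 3 (A_i = 0 child)
e_child1 = pessimistic_error(8 / 17, 17)   # 8 errors out of 17 (A_i = 1 child)
print(round(e_subroot, 3), round(e_child0, 3), round(e_child1, 3))
# -> approximately 0.628 0.474 0.66, in line with the slide's 0.6278, 0.4742, 0.6598
```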
  17. 17. • Therefore, the average pessimistic error rate at the children is $\frac{3}{20}\times 0.4742 + \frac{17}{20}\times 0.6598 = 0.632 > 0.6278$. • Since the pessimistic error rate increases with the split, we do not want to keep the children. This practice is called “tree pruning”.
  18. 18. Tree Pruning Based on the χ² Test of Independence • For the subroot with 11 “Yes” and 9 “No” samples, whose children under A_i hold 3 “Yes”/0 “No” (A_i = 0) and 8 “Yes”/9 “No” (A_i = 1), we construct the corresponding contingency table:
              A_i = 0   A_i = 1
     Yes          3         8      11
     No           0         9       9
                  3        17      20
     The χ² statistic $= \dfrac{\left(3-\frac{3\times 11}{20}\right)^2}{\frac{3\times 11}{20}} + \dfrac{\left(8-\frac{17\times 11}{20}\right)^2}{\frac{17\times 11}{20}} + \dfrac{\left(0-\frac{3\times 9}{20}\right)^2}{\frac{3\times 9}{20}} + \dfrac{\left(9-\frac{17\times 9}{20}\right)^2}{\frac{17\times 9}{20}} \approx 2.89$.
  19. 19. • Therefore, we should not split the subroot node if we require that the χ² statistic be larger than $χ^2_{k,0.05}$, where k is the number of degrees of freedom of the corresponding contingency table (here k = 1 and $χ^2_{1,0.05} = 3.841$).
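Here is a minimal sketch of the χ² computation for this 2×2 table; the `chi_square` helper is an illustrative name, and a library routine such as scipy.stats.chi2_contingency (without continuity correction) would give the same Pearson statistic.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: Yes / No; columns: A_i = 0 / A_i = 1 (slide 18)
print(round(chi_square([[3, 8], [0, 9]]), 2))  # below 3.841, so do not split
```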
  20. 20. Constructing Decision Trees Based on the χ² Test of Independence • Using the following example, we can construct a contingency table accordingly. The subroot has 75 “Yes” out of 100 samples (prediction = “Yes”); its children have 45 “Yes” out of 50 samples (A_i = 0), 20 “Yes” out of 25 samples (A_i = 1), and 10 “Yes” out of 25 samples (A_i = 2).
              A_i = 0   A_i = 1   A_i = 2
     Yes         45        20        10      75
     No           5         5        15      25
                 50        25        25     100
  21. 21. • The χ² statistic
     $= \dfrac{\left(45-100\cdot\frac{1}{2}\cdot\frac{3}{4}\right)^2}{100\cdot\frac{1}{2}\cdot\frac{3}{4}} + \dfrac{\left(20-100\cdot\frac{1}{4}\cdot\frac{3}{4}\right)^2}{100\cdot\frac{1}{4}\cdot\frac{3}{4}} + \dfrac{\left(10-100\cdot\frac{1}{4}\cdot\frac{3}{4}\right)^2}{100\cdot\frac{1}{4}\cdot\frac{3}{4}} + \dfrac{\left(5-100\cdot\frac{1}{2}\cdot\frac{1}{4}\right)^2}{100\cdot\frac{1}{2}\cdot\frac{1}{4}} + \dfrac{\left(5-100\cdot\frac{1}{4}\cdot\frac{1}{4}\right)^2}{100\cdot\frac{1}{4}\cdot\frac{1}{4}} + \dfrac{\left(15-100\cdot\frac{1}{4}\cdot\frac{1}{4}\right)^2}{100\cdot\frac{1}{4}\cdot\frac{1}{4}} = 22.67 > 5.991 = χ^2_{2,0.05}$.
     • Therefore, we may say that the split is statistically robust.
  22. 22. Assume that we have another attribute A_j to consider. At the subroot there are 75 “Yes” out of 100 samples; the A_j = 0 child has 25 “Yes” out of 25 samples and the A_j = 1 child has 50 “Yes” out of 75 samples.
              A_j = 0   A_j = 1
     Yes         25        50      75
     No           0        25      25
                 25        75     100
     The χ² statistic $= \dfrac{\left(25-\frac{25\times 75}{100}\right)^2}{\frac{25\times 75}{100}} + \dfrac{\left(50-\frac{75\times 75}{100}\right)^2}{\frac{75\times 75}{100}} + \dfrac{\left(0-\frac{25\times 25}{100}\right)^2}{\frac{25\times 25}{100}} + \dfrac{\left(25-\frac{75\times 25}{100}\right)^2}{\frac{75\times 25}{100}} = 11.11 \ge 3.841 = χ^2_{1,0.05}$.
  23. 23. • Now, both A_i and A_j pass our criterion. How should we make our selection? • We can make our selection based on the significance levels of the two contingency tables.
     $χ^2_{1,α'} = 11.11 \;\Rightarrow\; α' = 1 - F_{χ^2_1}(11.11) = \mathrm{Prob}\!\left(N(0,1)^2 \ge 11.11\right)$
     $\Rightarrow\; α' = \mathrm{Prob}\!\left(N(0,1) \ge 3.33\right) + \mathrm{Prob}\!\left(N(0,1) \le -3.33\right) = 2\,\mathrm{Prob}\!\left(N(0,1) \ge 3.33\right) = 0.0008 = 8\times 10^{-4}$.
  24. 24. • $χ^2_{2,α''} = 22.67 \;\Rightarrow\; α'' = 1 - F_{χ^2_2}(22.67) = e^{-\frac{22.67}{2}} = 1.19\times 10^{-5}$. • Therefore, since $α'' < α'$, A_i is preferred over A_j.
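As a check on these two significance levels, here is a small sketch using the closed-form tail probabilities of the χ² distribution for one and two degrees of freedom; the `chi2_pvalue` helper is an illustrative name.

```python
from math import exp, sqrt
from statistics import NormalDist

def chi2_pvalue(stat, df):
    """Right-tail probability of a chi-square statistic, for df = 1 or 2 only."""
    if df == 1:
        # A chi-square variable with 1 d.o.f. is the square of a standard normal.
        return 2 * NormalDist().cdf(-sqrt(stat))
    if df == 2:
        # A chi-square variable with 2 d.o.f. has survival function exp(-x/2).
        return exp(-stat / 2)
    raise ValueError("only df = 1 or 2 are handled in this sketch")

print(chi2_pvalue(11.11, 1))  # about 8.6e-4 (the slide rounds to 8e-4) for A_j
print(chi2_pvalue(22.67, 2))  # about 1.19e-5 for A_i, the smaller value, so A_i wins
```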
  25. 25. Termination of Split due to Low Significance Level • Consider a subtree whose root holds 15 “Yes” out of 20 samples and whose children hold 9 “Yes” out of 10, 4 “Yes” out of 5, and 2 “Yes” out of 5 samples. Here χ² = 4.543 < 5.991 = $χ^2_{2,0.05}$. • In this case, we do not want to carry out the split.
  26. 26. A More Realistic Example and Some Remarks • In the following example, a bank wants to derive a credit evaluation tree for future use based on the records of existing customers. • As the data set shows, it is highly likely that the training data set contains inconsistencies. • Furthermore, some values may be missing. • Therefore, for most cases, it is impossible to derive perfect decision trees, i.e. decision trees with 100% accuracy.
  27. 27. ~continues
     Attributes                                                        Class
     Education    Annual Income  Age     Own House  Sex     Credit ranking
     College      High           Old     Yes        Male    Good
     High school  -----          Middle  Yes        Male    Good
     High school  Middle         Young   No         Female  Good
     College      High           Old     Yes        Male    Poor
     College      High           Old     Yes        Male    Good
     College      Middle         Young   No         Female  Good
     High school  High           Old     Yes        Male    Poor
     College      Middle         Middle  -----      Female  Good
     High school  Middle         Young   No         Male    Poor
  28. 28. ~continues • A quality measure of decision trees can be based on accuracy; there are alternative measures depending on the nature of the application. • Overfitting is a problem caused by making the derived decision tree fit the training set too closely. As a result, the decision tree may work less accurately in the real world.
  29. 29. ~continues • There are two situations in which overfitting may occur: • insufficient number of samples at the subroot. • some attributes are highly branched. • A conventional practice for handling missing values is to treat them as possible attribute values. That is, each attribute has one additional attribute value corresponding to the missing value.
  30. 30. Alternative Measures of Quality of Decision Trees • The recall rate and precision are two widely used measures:
     $\text{Recall rate} = \dfrac{|C \cap C'|}{|C|}, \qquad \text{Precision} = \dfrac{|C \cap C'|}{|C'|}$
     where C is the set of samples in the class and C′ is the set of samples which the decision tree puts into the class.
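A minimal sketch of these two measures over hypothetical sample-ID sets; the names `C` and `C_prime` mirror the slide's notation, and the example IDs are invented for illustration.

```python
def recall_and_precision(actual, predicted):
    """Recall and precision for one class, given the sets of actual and predicted members."""
    true_positives = len(actual & predicted)
    return true_positives / len(actual), true_positives / len(predicted)

C = {1, 2, 3, 4, 5, 6}      # samples that truly belong to the class
C_prime = {4, 5, 6, 7, 8}   # samples the decision tree puts into the class
print(recall_and_precision(C, C_prime))  # -> (0.5, 0.6)
```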
  31. 31. ~continues • A situation in which the recall rate is the main concern: • “A bank wants to find all the potential credit card customers”. • A situation in which precision is the main concern: • “A bank wants to find a decision tree for credit approval.”
