ID3 Algorithm & ROC Analysis

An introduction to decision trees, the ID3 algorithm and ROC analysis


Transcript

  • 1. ID3 Algorithm & ROC Analysis
       Talha KABAKUŞ, talha.kabakus@ibu.edu.tr
  • 2. Agenda
       ● Where are we now?
       ● Decision Trees
       ● What is ID3?
       ● Entropy
       ● Information Gain
       ● Pros and Cons of ID3
       ● An Example - The Simpsons
       ● What is ROC Analysis?
       ● ROC Space
       ● ROC Space Example over predictions
  • 3. Where are we now?
  • 4. Decision Trees
       ● One of the most widely used classification approaches because of its clear model and presentation
       ● Classification is done using data attributes
       ● The aim is to estimate the value of a destination (target) field from the source fields
       ● Tree induction:
         ○ Create the tree
         ○ Apply data to the tree to classify it
       ● Each branch node represents a choice between a number of alternatives
       ● Each leaf node represents a classification or decision
       ● Leaf count = rule count
  • 5. Decision Trees (Cont.)
       ● Leaves are inserted from top to bottom
       (Figure: a sample tree with nodes labeled A through G)
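The node structure described above can be made concrete with a small sketch. This is not from the slides; it is a minimal illustrative Python representation in which each branch node stores the question it asks and one child per answer, while each leaf stores a class label, so leaf count equals rule count.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """Minimal decision-tree node: a branch asks a question, a leaf holds a class."""
    question: Optional[str] = None                              # asked at branch nodes
    children: Dict[str, "Node"] = field(default_factory=dict)   # one subtree per answer
    label: Optional[str] = None                                 # class/decision at leaf nodes

    def is_leaf(self) -> bool:
        return self.label is not None

# A two-level tree with three leaves, i.e. three rules.
tree = Node(question="attribute A?",
            children={"yes": Node(label="class 1"),
                      "no": Node(question="attribute B?",
                                 children={"yes": Node(label="class 2"),
                                           "no": Node(label="class 1")})})
```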
  • 6. Sample Decision Tree
  • 7. Creating Tree Model by Training Data
  • 8. Decision Tree Classification Task
  • 9. Apply Model to Test Data
  • 10. Apply Model to Test Data (Cont.)
  • 11. Apply Model to Test Data (Cont.)
  • 12. Apply Model to Test Data (Cont.)
  • 13. Apply Model to Test Data (Cont.)
  • 14. Apply Model to Test Data (Cont.)
  • 15. Decision Tree Algorithms
       ● Classification and Regression Algorithms
         ○ Twoing
         ○ Gini
       ● Entropy-based Algorithms
         ○ ID3
         ○ C4.5
       ● Memory-based (Sample-based) Classification Algorithms
  • 16. Decision Trees by Variable Type
       ● Single-variable decision trees
         ○ Classification is done by asking questions over only one variable
       ● Hybrid decision trees
         ○ Classification is done by asking questions over both single and multiple variables
       ● Multiple-variable decision trees
         ○ Classification is done by asking questions over multiple variables
  • 17. ID3 Algorithm
       ● Iterative Dichotomizer 3
       ● Developed by J. Ross Quinlan in 1979
       ● Based on entropy
       ● Only works with discrete (categorical) data
       ● Cannot work with defective data
       ● Its advantage over Hunt's algorithm is choosing the most informative attribute at each split (Hunt's algorithm chooses arbitrarily)
  • 18. Entropy
       ● A formula that measures the homogeneity of a sample; it gives an idea of how much information gain each split provides
       ● A completely homogeneous sample has an entropy of 0
       ● An equally divided sample has an entropy of 1
       ● Formula: E(S) = -Σ p(I) log2 p(I), summed over the classes I; for two classes, E(S) = -p(+) log2 p(+) - p(-) log2 p(-)
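As a quick illustration of the points above, here is a small Python sketch (not from the slides) that computes entropy from class counts; the 4F/5M value reappears in the Simpsons example later in the deck.

```python
import math

def entropy(class_counts):
    """Entropy of a sample given how many records fall in each class (log base 2)."""
    total = sum(class_counts)
    # c/total * log2(total/c) is the same as -p * log2(p), written to avoid log of zero
    return sum(c / total * math.log2(total / c) for c in class_counts if c > 0)

print(entropy([9, 0]))   # completely homogeneous sample -> 0.0
print(entropy([5, 5]))   # equally divided sample        -> 1.0
print(entropy([4, 5]))   # the 4F/5M group used later    -> ~0.9911
```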
  • 19. Information Gain (IG)
       ● Information gain calculates the effective change in entropy after making a decision based on the value of an attribute.
       ● Which attribute creates the most homogeneous branches?
       ● First, the entropy of the total dataset is calculated.
       ● The dataset is then split on the different attributes.
  • 20. Information Gain (Cont.)
       ● The entropy for each branch is calculated. It is then added proportionally to get the total entropy for the split.
       ● The resulting entropy is subtracted from the entropy before the split.
       ● The result is the information gain, or decrease in entropy.
       ● The attribute that yields the largest IG is chosen for the decision node.
  • 21. Information Gain (Cont.)
       ● A branch set with an entropy of 0 is a leaf node.
       ● Otherwise, the branch needs further splitting to classify its dataset.
       ● The ID3 algorithm is run recursively on the non-leaf branches until all data is classified.
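The procedure described on the last three slides maps directly onto a few lines of Python. This is an illustrative sketch, not from the slides; the split used at the bottom reproduces the Weight <= 160 split from the Simpsons example that follows (gain of about 0.59).

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c)   # = -Σ p·log2(p)

def information_gain(parent_counts, branch_counts):
    """Entropy before the split minus the size-weighted entropy of each branch."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# A 4F/5M parent split into branches of 4F/1M and 0F/4M (the Weight <= 160 test used later):
print(information_gain([4, 5], [[4, 1], [0, 4]]))   # ~0.59
```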
  • 22. ID3 Algorithm Steps

       function ID3 (R: a set of non-categorical attributes,
                     C: the categorical attribute,
                     S: a training set) returns a decision tree;
       begin
         If S is empty, return a single node with value Failure;
         If S consists of records all with the same value for the categorical attribute,
           return a single node with that value;
         If R is empty, then return a single node with as value the most frequent of the
           values of the categorical attribute that are found in records of S
           [note that then there will be errors, that is, records that will be improperly classified];
         Let D be the attribute with largest Gain(D, S) among attributes in R;
         Let {dj | j = 1, 2, .., m} be the values of attribute D;
         Let {Sj | j = 1, 2, .., m} be the subsets of S consisting respectively of records
           with value dj for attribute D;
         Return a tree with root labeled D and arcs labeled d1, d2, .., dm going respectively
           to the trees ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), .., ID3(R-{D}, C, Sm);
       end ID3;
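For readers who prefer running code over pseudocode, the following is a compact Python rendering of the steps above. It is a sketch under simple assumptions (records are dicts, all attributes are discrete, and the categorical class attribute C is passed by name as `target`); the helper names are illustrative, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(n / total * math.log2(total / n) for n in Counter(labels).values())

def id3(records, attributes, target):
    """Recursive ID3 as in the pseudocode above; returns a class label or a nested dict."""
    if not records:                                   # S is empty
        return "Failure"
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:                         # all records share one class value
        return labels[0]
    if not attributes:                                # R is empty: return most frequent class
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):                                   # Gain(attr, S)
        remainder = sum(
            n / len(records) * entropy([r[target] for r in records if r[attr] == v])
            for v, n in Counter(r[attr] for r in records).items())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                  # attribute D with the largest gain
    return {best: {value: id3([r for r in records if r[best] == value],
                              [a for a in attributes if a != best], target)
                   for value in set(r[best] for r in records)}}
```

The returned structure nests one dictionary level per split, mirroring the arcs labeled d1, .., dm in the pseudocode.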
  • 23. Pros of ID3 Algorithm
       ● Builds the decision tree in a minimal number of steps
         ○ The most important point in tree induction is collecting enough reliable data associated with the relevant properties.
         ○ Asking the right questions determines the quality of tree induction.
       ● Each level benefits from the choices made at previous levels
       ● The whole dataset is scanned to create the tree
  • 24. Cons of ID3 Algorithm
       ● The tree cannot be updated when new data is classified incorrectly; instead, a new tree must be generated.
       ● Only one attribute at a time is tested for making a decision.
       ● Cannot work with defective data
       ● Cannot work with numerical (continuous) attributes
  • 25. An Example - The Simpsons

       Person   Hair Length   Weight   Age   Class
       Homer         0          250     36     M
       Marge        10          150     34     F
       Bart          2           90     10     M
       Lisa          6           78      8     F
       Maggie        4           20      1     F
       Abe           1          170     70     M
       Selma         8          160     41     F
       Otto         10          180     38     M
       Krusty        6          200     45     M
  • 26. Information Gain over Hair Length

       E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911  ==> entropy of the whole set

       Split: Hair Length <= 5?
         Yes: E(1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
         No:  E(3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

       Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
  • 27. Information Gain over Weight

       E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911  ==> entropy of the whole set

       Split: Weight <= 160?
         Yes: E(4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
         No:  E(0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

       Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900
  • 28. Information Gain over Age

       E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911  ==> entropy of the whole set

       Split: Age <= 40?
         Yes: E(3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
         No:  E(1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

       Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
  • 29. Results

       Attribute            Information Gain (IG)
       Hair Length <= 5     0.0911
       Weight <= 160        0.5900
       Age <= 40            0.0183

       ● As seen in the results, weight is the best attribute to classify this group.
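These three gains can be checked with a short script. This is an illustrative sketch, not part of the slides; it assumes the class labels implied by the entropy calculations above (Lisa = F, Abe = M).

```python
import math

people = [  # (name, hair length, weight, age, class) from the table on slide 25
    ("Homer", 0, 250, 36, "M"), ("Marge", 10, 150, 34, "F"), ("Bart", 2, 90, 10, "M"),
    ("Lisa", 6, 78, 8, "F"), ("Maggie", 4, 20, 1, "F"), ("Abe", 1, 170, 70, "M"),
    ("Selma", 8, 160, 41, "F"), ("Otto", 10, 180, 38, "M"), ("Krusty", 6, 200, 45, "M"),
]

def entropy(labels):
    n = len(labels)
    return sum(labels.count(c) / n * math.log2(n / labels.count(c)) for c in set(labels))

def gain(test):
    """Information gain of a yes/no test over the whole group."""
    yes = [p[4] for p in people if test(p)]
    no = [p[4] for p in people if not test(p)]
    before = entropy(yes + no)                        # entropy of the full 4F/5M set
    return before - (len(yes) / len(people) * entropy(yes) +
                     len(no) / len(people) * entropy(no))

print(round(gain(lambda p: p[1] <= 5), 4))    # Hair Length <= 5 -> 0.0911
print(round(gain(lambda p: p[2] <= 160), 4))  # Weight <= 160    -> 0.59
print(round(gain(lambda p: p[3] <= 40), 4))   # Age <= 40        -> 0.0183
```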
  • 30. Constructed Decision Tree

       Weight <= 160?
         No  -> Male
         Yes -> Hair Length <= 5?
                  Yes -> Female
                  No  -> Male
  • 31. Entropy over Nominal Values
       ● If an attribute has nominal values:
         ○ First calculate the entropy of each attribute value
         ○ Then calculate the attribute's information gain
  • 32. Example II
       (The Example II dataset, shown as an image in the slides, describes 15 cars by Engine, SC/Turbo, Weight and Fuel Eco, with class fast / not fast.)

       Entropy of the whole set (5 fast, 10 not fast):
       IE(S) = -(5/15)log2(5/15) - (10/15)log2(10/15) = ~0.918
  • 33. Example II (Cont.) - Information Gain over Engine
       ● Engine: 6 small, 5 medium, 4 large
       ● 3 values for attribute Engine, so we need 3 entropy calculations
       ● small: 5 no, 1 yes
         ○ IGsmall = -(5/6)log2(5/6) - (1/6)log2(1/6) = ~0.65
       ● medium: 3 no, 2 yes
         ○ IGmedium = -(3/5)log2(3/5) - (2/5)log2(2/5) = ~0.97
       ● large: 2 no, 2 yes
         ○ IGlarge = 1 (evenly distributed subset)
       => IGEngine = IE(S) - [(6/15)*IGsmall + (5/15)*IGmedium + (4/15)*IGlarge]
          IGEngine = 0.918 - 0.85 = 0.068
  • 34. Example II (Cont.) - Information Gain over SC/Turbo
       ● SC/Turbo: 4 yes, 11 no
       ● 2 values for attribute SC/Turbo, so we need 2 entropy calculations
       ● yes: 2 yes, 2 no
         ○ IGturbo = 1 (evenly distributed subset)
       ● no: 3 yes, 8 no
         ○ IGnoturbo = -(3/11)log2(3/11) - (8/11)log2(8/11) = ~0.845
       => IGSC/Turbo = IE(S) - [(4/15)*IGturbo + (11/15)*IGnoturbo]
          IGSC/Turbo = 0.918 - 0.886 = 0.032
  • 35. Example II (Cont.) - Information Gain over Weight
       ● Weight: 6 average, 4 light, 5 heavy
       ● 3 values for attribute Weight, so we need 3 entropy calculations
       ● average: 3 no, 3 yes
         ○ IGaverage = 1 (evenly distributed subset)
       ● light: 3 no, 1 yes
         ○ IGlight = -(3/4)log2(3/4) - (1/4)log2(1/4) = ~0.81
       ● heavy: 4 no, 1 yes
         ○ IGheavy = -(4/5)log2(4/5) - (1/5)log2(1/5) = ~0.72
       => IGWeight = IE(S) - [(6/15)*IGaverage + (4/15)*IGlight + (5/15)*IGheavy]
          IGWeight = 0.918 - 0.856 = 0.062
  • 36. Example II (Cont.) - Information Gain over Fuel Eco
       ● Fuel Eco: 2 good, 3 average, 10 bad
       ● 3 values for attribute Fuel Eco, so we need 3 entropy calculations
       ● good: 0 yes, 2 no
         ○ IGgood = 0 (no variability)
       ● average: 0 yes, 3 no
         ○ IGaverage = 0 (no variability)
       ● bad: 5 yes, 5 no
         ○ IGbad = 1 (evenly distributed subset)
       We can omit the calculations for good and average since those cars always end up not fast.
       => IGFuelEco = IE(S) - [(10/15)*IGbad]
          IGFuelEco = 0.918 - 0.667 = 0.251
  • 37. Example II (Cont.)
       ● Results:

       IGEngine     0.068
       IGSC/Turbo   0.032
       IGWeight     0.062
       IGFuelEco    0.251   <== highest gain: root of the tree
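Since the Example II table itself only appears as an image, these gains can still be reproduced purely from the per-value class counts quoted on slides 33 to 36. This is an illustrative sketch, not from the slides; the last two values differ from the slides by 0.001 because the slides round the intermediate entropies.

```python
import math

def entropy(fast, slow):
    total = fast + slow
    return sum(c / total * math.log2(total / c) for c in (fast, slow) if c)

def gain(parent, branches):
    """parent and every branch are (fast, not fast) counts."""
    total = sum(parent)
    weighted = sum((f + s) / total * entropy(f, s) for f, s in branches)
    return entropy(*parent) - weighted

whole_set = (5, 10)                                            # 5 fast, 10 not fast
print(round(gain(whole_set, [(1, 5), (2, 3), (2, 2)]), 3))     # Engine   -> 0.068
print(round(gain(whole_set, [(2, 2), (3, 8)]), 3))             # SC/Turbo -> 0.032
print(round(gain(whole_set, [(3, 3), (1, 3), (1, 4)]), 3))     # Weight   -> 0.061 (slide: 0.062)
print(round(gain(whole_set, [(0, 2), (0, 3), (5, 5)]), 3))     # Fuel Eco -> 0.252 (slide: 0.251)
```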
  • 38. Example II (Cont.)
       ● Since we selected the Fuel Eco attribute for our root node, it is removed from the table for the remaining calculations.
       ● The remaining calculations are carried out on the Fuel Eco = bad branch: IE(SFuelEco) = 1 (evenly distributed subset, 5 fast / 5 not fast)
  • 39. Example II (Cont.) - Information Gain over Engine
       ● Engine: 1 small, 5 medium, 4 large
       ● 3 values for attribute Engine, so we need 3 entropy calculations
       ● small: 1 yes, 0 no
         ○ IGsmall = 0 (no variability)
       ● medium: 2 yes, 3 no
         ○ IGmedium = -(2/5)log2(2/5) - (3/5)log2(3/5) = ~0.97
       ● large: 2 no, 2 yes
         ○ IGlarge = 1 (evenly distributed subset)
       => IGEngine = IE(SFuelEco) - [(1/10)*IGsmall + (5/10)*IGmedium + (4/10)*IGlarge]
          IGEngine = 1 - 0.885 = 0.115
  • 40. Example II (Cont.) - Information Gain over SC/Turbo
       ● SC/Turbo: 3 yes, 7 no
       ● 2 values for attribute SC/Turbo, so we need 2 entropy calculations
       ● yes: 2 yes, 1 no
         ○ IGturbo = -(2/3)log2(2/3) - (1/3)log2(1/3) = ~0.92
       ● no: 3 yes, 4 no
         ○ IGnoturbo = -(3/7)log2(3/7) - (4/7)log2(4/7) = ~0.99
       => IGSC/Turbo = IE(SFuelEco) - [(3/10)*IGturbo + (7/10)*IGnoturbo]
          IGSC/Turbo = 1 - 0.965 = 0.035
  • 41. Example II (Cont.) - Information Gain over Weight
       ● Weight: 3 average, 5 heavy, 2 light
       ● 3 values for attribute Weight, so we need 3 entropy calculations
       ● average: 3 yes, 0 no
         ○ IGaverage = 0 (no variability)
       ● heavy: 1 yes, 4 no
         ○ IGheavy = -(1/5)log2(1/5) - (4/5)log2(4/5) = ~0.72
       ● light: 1 yes, 1 no
         ○ IGlight = 1 (evenly distributed subset)
       => IGWeight = IE(SFuelEco) - [(3/10)*IGaverage + (5/10)*IGheavy + (2/10)*IGlight]
          IGWeight = 1 - 0.561 = 0.439
  • 42. Example II (Cont.)
       ● Results:

       IGEngine     0.115
       IGSC/Turbo   0.035
       IGWeight     0.439

       Weight has the highest gain, and is thus the best choice.
  • 43. Example II (Cont.)
       ● Since there are only two items for SC/Turbo where Weight = Light, and the result is consistent, we can simplify the Weight = Light path.
  • 44. Example II (Cont.)
       ● Updated table: (Weight = Heavy)
       ● All cars with large engines in this table are not fast.
       ● Due to inconsistent patterns in the data, there is no way to proceed, since medium-size engines may lead to either fast or not fast.
  • 45. ROC Analysis
       ● Receiver Operating Characteristic
       ● The limitations of diagnostic "accuracy" as a measure of decision performance require the introduction of the concepts of the "sensitivity" and "specificity" of a diagnostic test. These measures, and the related indices "true positive rate" and "false positive rate", are more meaningful than "accuracy".
       ● The ROC curve is a complete description of this decision threshold effect, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions.
  • 46. ROC Analysis (Cont.)
       ● Combinations of correct & incorrect decisions:

       Actual value   Prediction outcome   Description
       p              p                    True Positive (TP)
       p              n                    False Negative (FN)
       n              p                    False Positive (FP)
       n              n                    True Negative (TN)

       ● TPR (true positive rate) is equivalent to sensitivity.
       ● FPR (false positive rate) is equivalent to 1 - specificity.
       ● The best possible prediction would be 100% sensitivity and 100% specificity (which means FPR = 0%).
  • 47. ROC Space
       ● A ROC space is defined by FPR and TPR as the x and y axes respectively; it depicts the relative trade-off between true positives (benefits) and false positives (costs).
       ● Since TPR is equivalent to sensitivity and FPR is equal to 1 - specificity, the ROC graph is sometimes called the sensitivity vs. (1 - specificity) plot.
       ● Each prediction result represents one point in the ROC space.
  • 48. Calculations
       ● Sensitivity (true positive rate)
         ○ TPR = TP / P = TP / (TP + FN)
       ● False positive rate (1 - specificity)
         ○ FPR = FP / N = FP / (FP + TN)
       ● Accuracy
         ○ ACC = (TP + TN) / (P + N)
  • 49. A ROC Space Example
       ● Let A, B, C, and D be predictions over 100 positive and 100 negative instances:

       Prediction   TP   FP   FN   TN   TPR    FPR    ACC
       A            63   28   37   72   0.63   0.28   0.68
       B            77   77   23   23   0.77   0.77   0.50
       C            24   88   76   12   0.24   0.88   0.18
       D            76   12   24   88   0.76   0.12   0.82
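The table above can be reproduced directly from the formulas on the previous slide. A small illustrative sketch, not from the slides:

```python
def roc_point(tp, fp, fn, tn):
    """Sensitivity (TPR), 1 - specificity (FPR) and accuracy from one confusion matrix."""
    tpr = tp / (tp + fn)                       # sensitivity
    fpr = fp / (fp + tn)                       # 1 - specificity
    acc = (tp + tn) / (tp + fp + fn + tn)
    return tpr, fpr, acc

# TP, FP, FN, TN for predictions A-D over 100 positive and 100 negative instances:
predictions = {"A": (63, 28, 37, 72), "B": (77, 77, 23, 23),
               "C": (24, 88, 76, 12), "D": (76, 12, 24, 88)}

for name, counts in predictions.items():
    tpr, fpr, acc = roc_point(*counts)
    print(f"{name}: TPR={tpr:.2f}  FPR={fpr:.2f}  ACC={acc:.2f}")
# D lies closest to the ideal top-left corner of the ROC space (TPR = 1, FPR = 0).
```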
  • 50. A ROC Space Example (Cont.)
  • 51. References
       1. Data Mining Course Lectures, Assoc. Prof. Nilüfer Yurtay
       2. Quinlan, J. R., "Induction of Decision Trees", Machine Learning, 1, 81-106, 1986
       3. http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
       4. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2011
       5. http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
       6. C. E. Metz, "Basic Principles of ROC Analysis", Seminars in Nuclear Medicine, Volume 8, Issue 4, pp. 283-298, 1978