1. CC282 Decision Trees
Lecture 2 slides for CC282 Machine Learning, R. Palaniappan, 2008
2. Lecture 2 - Outline
- More ML principles:
  - Concept learning
  - Hypothesis space
  - Generalisation and overfitting
  - Model (hypothesis) evaluation
- Inductive learning:
  - Inductive bias
  - Decision trees
  - ID3 algorithm (entropy, information gain)
3. Concept learning
- A concept, c, is the problem to be learned.
  - Example: a classification problem faced by an optician.
    - Concept: whether or not to fit contact lenses, based on the user's budget, eye condition, environment, etc.
    - Inputs, x: user's budget, user's eye condition, user's environment
    - Output, y: to fit or not to fit
- A learning model is needed to learn a concept.
- The learning model should ideally:
  - Capture the training data, <x, y> -> descriptive ability
  - Generalise to unseen test data, <x_new, ?> -> predictive ability
  - Provide a plausible explanation of the learned concept, c -> explanatory ability
  - In practice, descriptive and predictive abilities are generally considered sufficient.
4. Learning a concept
- Concept learning:
  - Given many examples <input, output> of what c does, find a function h that approximates c.
  - The number of examples is usually a small subset of all possible <input, output> pairs.
  - h is known as a hypothesis (i.e. the learning model).
  - There may be many candidate hypotheses; we select h from a hypothesis space H.
  - If the hypothesis matches the behaviour of the target concept for all training data, it is a consistent hypothesis.
- Occam's razor:
  - A simpler hypothesis that fits c is preferred.
  - Simpler h means a shorter, smaller h.
  - A simpler h is less likely to fit the data by coincidence.
- Learning = searching H for an appropriate h
  - Realisable task: H contains an h that fits the concept.
  - Unrealisable task: H does not contain an h that fits the concept.
5. More terms - Generalisation, overfitting, induction, deduction
- Generalisation
  - The ability of the trained model to perform well on test data.
- Overfitting
  - The model learns the training data well but performs poorly on the test data.
- Inductive learning (induction)
  - Learning a hypothesis from examples: the system tries to induce a general rule/model from a set of observed instances/samples.
- Inductive bias
  - Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called bias.
  - Not every hypothesis consistent with the training examples will generalise well to unseen examples - the trick is to find the right bias.
- An unbiased learner
  - Can never generalise, so it is not practically useful.
- Deduction
  - The ML system gives an output (prediction, classification, etc.) based on previously acquired learning.
6. Generalisation and overfitting example
- Assume we have inputs x and corresponding outputs y, and we wish to find a concept c that maps x to y.
- Examples of hypotheses (figure on slide):
  - h1 will give good generalisation.
  - h2 is overfitted.
7. Model (hypothesis) evaluation
- We need a performance measure to estimate how well the model h approximates c, i.e. how good is h?
- Possible evaluation methods:
  - Explanatory: gives a qualitative evaluation.
  - Descriptive: gives a quantitative (numerical) evaluation.
- Explanatory evaluation
  - Does the model provide a plausible description of the learned concept?
  - Classification: does it base its classification on plausible rules?
  - Association: does it discover plausible relationships in the data?
  - Clustering: does it come up with plausible clusters?
  - What counts as "plausible" has to be defined by a human expert.
  - Hence, not popular in ML.
8. Descriptive evaluation
- Example: bowel cancer classification problem
  - True positives (TP): diseased patients identified as having cancer
  - True negatives (TN): healthy subjects identified as healthy
  - False negatives (FN): the test identifies a cancer patient as healthy
  - False positives (FP): the test identifies a healthy subject as having cancer
- Precision = TP / (TP + FP)
- Sensitivity (Recall) = TP / (TP + FN)
- F measure (balanced F score) = 2 * Precision * Recall / (Precision + Recall)
- Simple classification accuracy = (TP + TN) / (TP + TN + FP + FN)

Confusion matrix (source: Wikipedia):

                          Patients with bowel cancer
                          True        False
  Blood test   Positive   TP = 2      FP = 18
               Negative   FN = 1      TN = 182
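As a quick check of these definitions, here is a minimal Python sketch (not part of the slides) that plugs the counts from the confusion matrix above into the formulas:

```python
# A minimal sketch plugging the confusion-matrix counts above
# into the standard formulas.
TP, FP, FN, TN = 2, 18, 1, 182

precision = TP / (TP + FP)                                 # 2/20  = 0.10
recall = TP / (TP + FN)                                    # 2/3  ~= 0.67 (sensitivity)
f_measure = 2 * precision * recall / (precision + recall)  # ~= 0.17
accuracy = (TP + TN) / (TP + TN + FP + FN)                 # 184/203 ~= 0.91

print(precision, recall, f_measure, accuracy)
```

Note how the accuracy looks high (about 0.91) even though precision is only 0.10, because healthy subjects dominate the data set.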
9. Descriptive evaluation (contd)
- For prediction problems, the mean square error (MSE) is used:
  - MSE = (1/n) * sum over i of (d_i - a_i)^2
  - where
    - d_i is the desired output in the data set
    - a_i is the actual output from the model
    - n is the number of instances in the data set
  - If n = 2, d_1 = 1.0, a_1 = 0.5, d_2 = 0, a_2 = 1.0:
    - MSE = ((1.0 - 0.5)^2 + (0 - 1.0)^2) / 2 = 1.25 / 2 = 0.625
- Sometimes the root mean square error is used instead: RMSE = sqrt(MSE)
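A small sketch of the same calculation, assuming the standard definition above (average of the squared errors over the n instances):

```python
# MSE/RMSE for the worked example on this slide.
desired = [1.0, 0.0]
actual = [0.5, 1.0]

n = len(desired)
mse = sum((d - a) ** 2 for d, a in zip(desired, actual)) / n  # (0.25 + 1.0) / 2 = 0.625
rmse = mse ** 0.5                                             # ~= 0.79

print(mse, rmse)
```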
10. Decision trees (DT)
- A simple form of inductive learning, yet a successful learning algorithm.
- Consider an example of playing tennis.
- Attributes (features): Outlook, Temp, Humidity, Wind
- Values: the descriptions a feature can take, e.g. Outlook values are sunny, cloudy, rainy.
- Target: Play - represents the output of the model.
- Instances: examples D1 to D14 of the data set.
- Concept: learn to decide whether to play tennis, i.e. find h from the given data set.

Play-tennis data set (adapted from Mitchell, 1997):

  Day  Outlook  Temp  Humidity  Wind    Play
  D1   Sunny    Hot   High      Weak    No
  D2   Sunny    Hot   High      Strong  No
  D3   Cloudy   Hot   High      Weak    Yes
  D4   Rainy    Mild  High      Weak    Yes
  D5   Rainy    Cool  Normal    Weak    Yes
  D6   Rainy    Cool  Normal    Strong  No
  D7   Cloudy   Cool  Normal    Strong  Yes
  D8   Sunny    Mild  High      Weak    No
  D9   Sunny    Cool  Normal    Weak    Yes
  D10  Rainy    Mild  Normal    Weak    Yes
  D11  Sunny    Mild  Normal    Strong  Yes
  D12  Cloudy   Mild  High      Strong  Yes
  D13  Cloudy   Hot   Normal    Weak    Yes
  D14  Rainy    Mild  High      Strong  No
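For readers who want to follow the later calculations in code, here is the same table as a Python list (a convenience, not part of the slides; the tuple layout is an assumption):

```python
# The play-tennis data set, one tuple per day, class label last.
DATA = [
    # (Outlook,  Temp,   Humidity, Wind,     Play)
    ("Sunny",  "Hot",  "High",   "Weak",   "No"),
    ("Sunny",  "Hot",  "High",   "Strong", "No"),
    ("Cloudy", "Hot",  "High",   "Weak",   "Yes"),
    ("Rainy",  "Mild", "High",   "Weak",   "Yes"),
    ("Rainy",  "Cool", "Normal", "Weak",   "Yes"),
    ("Rainy",  "Cool", "Normal", "Strong", "No"),
    ("Cloudy", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",  "Mild", "High",   "Weak",   "No"),
    ("Sunny",  "Cool", "Normal", "Weak",   "Yes"),
    ("Rainy",  "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",  "Mild", "Normal", "Strong", "Yes"),
    ("Cloudy", "Mild", "High",   "Strong", "Yes"),
    ("Cloudy", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rainy",  "Mild", "High",   "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Wind"]  # columns 0-3; Play is column 4
```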
11. Decision trees (DT)
- A decision tree takes a set of properties as input and provides a decision as output.
  - Each row of the table corresponds to a path in the tree.
  - A decision tree may form a more compact representation, especially if many attributes are irrelevant.
- DT is a suitable learning method when:
  - Instances are describable by attribute-value pairs.
  - The target function is discrete valued (e.g. YES, NO).
  - The training data may be noisy.
- It is not suitable (needs further adaptation) when:
  - Attribute values and/or the target are numerical,
    e.g. attribute values Temp = 22 °C, Windy = 25 mph, or a target of 70% / 30%.
  - Some functions require an exponentially large decision tree, e.g. the parity function.
12. Forming rules from DT
- Example of a concept: 'Should I play tennis today'
  - Takes inputs (a set of attributes)
  - Outputs a decision (say YES/NO)
- Each non-leaf node is an attribute; the first non-leaf node is the root node.
- Each leaf node is either Yes or No.
- Each link (branch) is labelled with the possible values of the associated attribute.
- Rule formation
  - A decision tree can be expressed as a disjunction of conjunctions:
    PLAY tennis IF (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy ∧ Wind = Weak)
  - ∨ is the disjunction operator (OR)
  - ∧ is the conjunction operator (AND)

Tree on the slide: Outlook at the root (branches Sunny, Cloudy, Rainy); under Sunny, Humidity (High -> No, Normal -> Yes); Cloudy -> Yes; under Rainy, Wind (Strong -> No, Weak -> Yes).
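A sketch of this tree written as nested if-statements makes the link between the tree and the rule above explicit (the function name is hypothetical, not from the slides):

```python
# Each branch of the tree becomes one path through the if-statements.
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Cloudy":
        return "Yes"
    if outlook == "Rainy":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook value: {outlook}")

print(play_tennis("Sunny", "Normal", "Strong"))  # Yes
print(play_tennis("Rainy", "High", "Strong"))    # No
```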
13. Another DT example
- Another example (from Lecture 1).
- Reading the tree on the right:
  - If parents visiting = yes, then go to the cinema; or
  - If parents visiting = no and weather = sunny, then play tennis; or
  - If parents visiting = no and weather = windy and money = rich, then go shopping; or
  - If parents visiting = no and weather = windy and money = poor, then go to the cinema; or
  - If parents visiting = no and weather = rainy, then stay in.
- Source: http://wwwhomes.doc.ic.ac.uk/~sgc/teaching/v231/lecture10.html
14. Obtaining a DT through top-down induction
- How can we obtain a DT?
  - Perform a top-down search through the space of possible decision trees.
  - Determine the attribute that best classifies the training data.
  - Use this attribute as the root of the tree.
  - Repeat this process for each branch, from left to right.
  - Proceed to the next level and determine the next best feature.
  - Repeat until a leaf is reached.
- How to choose the best attribute?
  - Choose the attribute that yields the most information, i.e. the attribute with the highest information gain.
15. Information gain
- Information gain = a reduction in Entropy, E.
- But what is Entropy?
  - In thermodynamics, the amount of energy that cannot be used to do work.
  - Here, a measure of disorder in a system (high entropy = disorder), measured in bits:
    E(S) = - Σ (i = 1 to c) p_i log2(p_i)
  - where:
    - S is the training data set
    - c is the number of target classes
    - p_i is the proportion of examples in S belonging to target class i
  - Note: if your calculator doesn't do log2, use log2(x) = 1.443 ln(x) or 3.322 log10(x). For better accuracy, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2).
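A minimal sketch of this formula in Python, taking the class proportions p_i directly (the 0 * log2(0) term is treated as 0, as usual):

```python
import math

def entropy(*proportions: float) -> float:
    """E(S) = -sum_i p_i log2(p_i) over the class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy(0.5, 0.5))    # 1.0 bit (fair coin, next slide)
print(entropy(9/14, 5/14))  # ~0.94  (play-tennis set, slide 18)
```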
16. Entropy example
- A coin is flipped.
  - If the coin is fair -> 50% chance of heads.
  - Now rig the coin so that heads comes up 99% of the time.
- Looking at this in terms of entropy:
  - Two outcomes (head, tail) with probabilities p_head and p_tail.
  - E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  - E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit
- If the probability of heads = 1, then entropy = 0:
  - E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit
17. Information Gain
- Information gain, G, is defined as:
  Gain(S, A) = E(S) - Σ (v in Values(A)) (|S_v| / |S|) E(S_v)
- where:
  - Values(A) is the set of all possible values of attribute A
  - S_v is the subset of S for which A has value v
  - |S| is the size of S and |S_v| is the size of S_v
- The information gain is the expected reduction in entropy caused by knowing the value of attribute A.
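A sketch of Gain(S, A) for rows stored as tuples with the class label last, like the DATA list shown with the play-tennis table earlier (the helper names are assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_i p_i log2(p_i) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Gain(S, A): split `rows` on column `attr_index` and subtract weighted entropies."""
    labels = [row[-1] for row in rows]
    gain = entropy(labels)
    for value in set(row[attr_index] for row in rows):
        subset = [row[-1] for row in rows if row[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain
```

With the DATA list from slide 10, information_gain(DATA, 3) returns roughly 0.048 for wind, matching the hand calculation on the following slides.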
18. Example - entropy calculation
- Compute the entropy of the play-tennis example.
- We have two classes, YES and NO (i.e. number of classes c = 2).
- We have 14 instances, with 9 classified as YES and 5 as NO.
- E_YES = -(9/14) log2(9/14) = 0.41
- E_NO = -(5/14) log2(5/14) = 0.53
- E(S) = E_YES + E_NO = 0.94
19. Example - information gain calculation
- Compute the information gain for the attribute wind in the play-tennis data set:
  - |S| = 14
  - Attribute wind has two values: weak and strong.
  - |S_weak| = 8
  - |S_strong| = 6
20. Example - information gain calculation
- Now determine E(S_weak):
  - Instances = 8, YES = 6, NO = 2, i.e. [6+, 2-]
  - E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81
21. Example - information gain calculation
- Now determine E(S_strong):
  - Instances = 6, YES = 3, NO = 3, i.e. [3+, 3-]
  - E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
- Note: do not waste time on the calculation if p_YES = p_NO; the entropy is 1.
22. Example - information gain calculation
- Going back to the information gain computation for the attribute wind:
  - Gain(S, Wind) = E(S) - (8/14) E(S_weak) - (6/14) E(S_strong)
  - = 0.94 - (8/14)(0.81) - (6/14)(1.00)
  - = 0.048
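A quick re-check of this slide's arithmetic from the class counts (S has 9 YES / 5 NO, S_weak has 6/2, S_strong has 3/3); a sketch, not from the slides:

```python
import math

def e(pos, neg):
    """Entropy of a two-class subset given its YES/NO counts."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

gain_wind = e(9, 5) - (8 / 14) * e(6, 2) - (6 / 14) * e(3, 3)
print(round(gain_wind, 3))  # ~0.048
```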
23. Example - information gain calculation
- Now compute the information gain for the attribute humidity in the play-tennis data set:
  - |S| = 14
  - Attribute humidity has two values: high and normal.
  - |S_high| = 7
  - |S_normal| = 7
  - For value high -> [3+, 4-]
  - For value normal -> [6+, 1-]
24. Example - information gain calculation
- Continuing the humidity calculation:
  - E(S_high) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
  - E(S_normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59
  - Gain(S, Humidity) = 0.94 - (7/14)(0.98) - (7/14)(0.59) = 0.15
- So, humidity provides GREATER information gain than wind.
25. Example - information gain calculation
- Now compute the information gain for the attributes outlook and temperature in the play-tennis data set (working shown on the slide).
- Summary of the gains:
  - Gain(S, Outlook) = 0.25
  - Gain(S, Temp) = 0.03
  - Gain(S, Humidity) = 0.15
  - Gain(S, Wind) = 0.048
- The attribute with the highest information gain is OUTLOOK, so use outlook as the root node.
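All four gains can be recomputed from the per-value class counts read off the play-tennis table; a sketch (each value maps to a (YES count, NO count) pair):

```python
import math

SPLITS = {
    "Outlook":  {"Sunny": (2, 3), "Cloudy": (4, 0), "Rainy": (3, 2)},
    "Temp":     {"Hot": (2, 2), "Mild": (4, 2), "Cool": (3, 1)},
    "Humidity": {"High": (3, 4), "Normal": (6, 1)},
    "Wind":     {"Weak": (6, 2), "Strong": (3, 3)},
}

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

e_s = entropy(9, 5)  # 0.94
for attr, values in SPLITS.items():
    gain = e_s - sum((p + n) / 14 * entropy(p, n) for p, n in values.values())
    print(attr, round(gain, 3))
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048 -> Outlook becomes the root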
26. DT - next level
- After determining OUTLOOK as the root node, we need to expand the tree.
- For the sunny branch (days D1, D2, D8, D9, D11: 2 YES, 3 NO):
  - E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
27. DT - next level
- Gain(S_sunny, Humidity) = 0.97 - (3/5)(0.0) - (2/5)(0.0) = 0.97
- Gain(S_sunny, Wind) = 0.97 - (3/5)(0.918) - (2/5)(1.0) = 0.019
- Gain(S_sunny, Temperature) = 0.97 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57
- The highest information gain is for humidity, so use this attribute here.
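The same check can be done on the sunny subset from the per-value class counts; a sketch using the counts read off the table (days D1, D2, D8, D9, D11):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

e_sunny = entropy(2, 3)  # ~0.97
splits = {
    "Humidity": {"High": (0, 3), "Normal": (2, 0)},
    "Wind":     {"Weak": (1, 2), "Strong": (1, 1)},
    "Temp":     {"Hot": (0, 2), "Mild": (1, 1), "Cool": (1, 0)},
}
for attr, values in splits.items():
    gain = e_sunny - sum((p + n) / 5 * entropy(p, n) for p, n in values.values())
    print(attr, round(gain, 3))
# Humidity ~0.971, Wind ~0.019, Temp ~0.571 -> Humidity is chosen
```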
28. Continue ... and final DT
- Continue until all the examples are classified:
  - Compare Gain(S_rainy, Wind), Gain(S_rainy, Humidity) and Gain(S_rainy, Temp).
  - Gain(S_rainy, Wind) is the highest, so wind is used under the rainy branch.
- All leaf nodes are associated with training examples from the same class (entropy = 0).
- The attribute temperature is not used.
- Final tree: Outlook at the root; Sunny -> Humidity (High -> No, Normal -> Yes); Cloudy -> Yes; Rainy -> Wind (Strong -> No, Weak -> Yes).
29. ID3 algorithm - pseudocode
- (Pseudocode shown on the slide.) Sufficient for the exam.
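The pseudocode image from the slide is not reproduced here. As a substitute, the following is a minimal Python sketch of ID3, assumed from the description on the surrounding slides (and Mitchell, 1997), not the exact pseudocode from the slide:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Information gain of splitting `rows` (label in last column) on column `attr`."""
    labels = [row[-1] for row in rows]
    g = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [row[-1] for row in rows if row[attr] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

def id3(rows, attrs):
    """Return a nested-dict tree: {attribute_index: {value: subtree_or_label}}."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:            # all examples in one class -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best])
    return tree
```

Running id3(DATA, [0, 1, 2, 3]) on the list from slide 10 puts attribute 0 (Outlook) at the root, with Humidity under Sunny and Wind under Rainy, matching the tree built by hand above.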
30. ID3 algorithm - pseudocode (Mitchell)
- (Pseudocode from Mitchell, 1997, shown on the slide.) Not important for the exam.
31. Search strategy in ID3
- Complete hypothesis space: any finite discrete-valued function can be expressed.
- Incomplete search: the search through the hypothesis space stops as soon as the tree is consistent with the data, so the space is not searched exhaustively.
- Single hypothesis: only one current hypothesis (the simplest one) is maintained.
- No backtracking: once an attribute is selected, this choice cannot be changed. Problem: the result might not be the (globally) optimal tree.
- Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental.
32. Lecture 2 summary
- From this lecture, you should be able to:
  - Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning and inductive bias, realisable and unrealisable tasks, and Occam's razor in the context of ML.
  - Differentiate between generalisation and overfitting.
  - Define entropy and information gain, and calculate them for a given data set.
  - Explain the ID3 algorithm, how it works, and describe it in pseudocode.
  - Apply the ID3 algorithm to a given data set.
