1. Machine Learning: Concept Learning & Decision-Tree Learning
Yuval Shahar, M.D., Ph.D., Medical Decision Support Systems

2. Machine Learning
- Learning: improving (a program's) performance at some task with experience
- Multiple application domains, such as:
  - Game playing (e.g., TD-Gammon)
  - Speech recognition (e.g., Sphinx)
  - Data mining (e.g., marketing)
  - Driving autonomous vehicles (e.g., ALVINN)
  - Classification of ER and ICU patients
  - Prediction of financial and other fraud
  - Prediction of pneumonia patients' recovery rates

3. Concept Learning
- Inference of a boolean-valued function (a concept) from its I/O training examples
- The concept c is defined over a set of instances X:
  c : X → {0, 1}
- The learner is presented with a set of positive/negative training examples <x, c(x)> taken from X
- There is a set H of possible hypotheses that the learner might consider regarding the concept
- Goal: find a hypothesis h such that ∀x ∈ X, h(x) = c(x)
(A small illustration of this setup, in code, follows.)

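To make the setup concrete, here is a tiny illustration in Python (not from the slides; the attribute layout follows the EnjoySport example on the next slide, and the target concept c is a stand-in used only to show the types involved):

```python
# A minimal illustration of the concept-learning setup (hypothetical layout;
# attribute order follows the EnjoySport example on the next slide).

# An instance x is a tuple of attribute values.
x1 = ("Sun",  "Warm", "Normal", "Strong", "Warm", "Same")
x3 = ("Rain", "Cold", "High",   "Strong", "Warm", "Change")

# The (unknown) target concept c maps each instance to 0 or 1.
# Here it is faked with a simple rule just to show the types involved.
def c(x):
    return 1 if x[0] == "Sun" and x[1] == "Warm" else 0   # illustrative only

# Training examples are pairs <x, c(x)>; the learner never sees c itself.
training_examples = [(x1, c(x1)), (x3, c(x3))]
print(training_examples)
```
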
4. A Concept-Learning Example

#  Sky   Air temp  Humid   Wind    Water  Forecast  Enjoy?
1  Sun   Warm      Normal  Strong  Warm   Same      Yes
2  Sun   Warm      High    Strong  Warm   Same      Yes
3  Rain  Cold      High    Strong  Warm   Change    No
4  Sun   Warm      High    Strong  Cool   Change    Yes

5. The Inductive Learning Hypothesis
- Any hypothesis approximating the target function well over a large set of training examples will also approximate that target function well over other, unobserved examples.

6. Concept Learning as Search
- Learning is searching through a large space of hypotheses
  - The space is implicitly defined by the hypothesis representation
- General-to-specific ordering of hypotheses:
  - h1 is more-general-than-or-equal-to h2 if any instance that satisfies h2 also satisfies h1
  - Example: <Sun, ?, ?, ?, ?, ?> ≥g <Sun, ?, ?, Strong, ?, ?> (a check of this relation is sketched below)

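A minimal sketch of the ≥g test for conjunctive hypotheses, assuming the usual encoding in which each constraint is '?' (any value), a specific value, or None (no acceptable value); the function name is illustrative:

```python
# Sketch of the more-general-or-equal relation (>=g) for conjunctive
# hypotheses: '?' matches any value, a string matches only itself,
# and None matches nothing.

def more_general_or_equal(h1, h2):
    """True if every instance satisfying h2 also satisfies h1."""
    if any(c is None for c in h2):        # h2 covers no instances at all
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))

h_general  = ("Sun", "?", "?", "?", "?", "?")
h_specific = ("Sun", "?", "?", "Strong", "?", "?")
print(more_general_or_equal(h_general, h_specific))   # True
print(more_general_or_equal(h_specific, h_general))   # False
```
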
7. The Find-S Algorithm
- Start with the most specific hypothesis h in H:
  h ← <∅, ∅, ∅, ∅, ∅, ∅>
- Whenever h fails to correctly classify a positive training example, minimally generalize the relevant attribute constraints so that it does (see the sketch below)
- On the example data, this finally yields h = <Sun, Warm, ?, Strong, ?, ?>
- Finds only one (the most specific) consistent hypothesis
- Cannot detect inconsistencies
  - Ignores negative examples!
- Assumes no noise and no errors in the input

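A minimal Find-S sketch under the same conjunctive-hypothesis encoding (illustrative names; the example data are the four instances from slide 4):

```python
# A minimal Find-S sketch for conjunctive hypotheses.
# None stands for the "no value" constraint, '?' for "any value".

def matches(h, x):
    """True if hypothesis h covers instance x."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def find_s(examples, n_attributes):
    h = [None] * n_attributes                 # most specific hypothesis
    for x, label in examples:
        if label and not matches(h, x):       # only positive examples are used
            h = [xi if hi is None else (hi if hi == xi else "?")
                 for hi, xi in zip(h, x)]
        # negative examples are ignored entirely
    return tuple(h)

# The four EnjoySport examples from slide 4:
examples = [
    (("Sun",  "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sun",  "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rain", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sun",  "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(find_s(examples, 6))   # ('Sun', 'Warm', '?', 'Strong', '?', '?')
```
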
8. The Candidate-Elimination (CE) Algorithm (Mitchell, 1977, 1979)
- Version space: the subset of hypotheses of H consistent with the set D of training examples
- A version space can be represented by:
  - Its general (maximally general) boundary set G of hypotheses consistent with D, initialized to G0 ← {<?, ?, ..., ?>}
  - Its specific (minimally general) boundary set S of hypotheses consistent with D, initialized to S0 ← {<∅, ∅, ..., ∅>}
- The CE algorithm updates the general and specific boundaries given each positive and negative example (see the sketch below)
- The resulting version space contains all and only the hypotheses consistent with the training set

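A minimal, self-contained sketch of Candidate-Elimination for conjunctive hypotheses; the function names, the None/'?' encoding, and the explicit attribute domains are assumptions rather than anything specified on the slides:

```python
# A minimal Candidate-Elimination sketch for conjunctive hypotheses.
# Encoding: each constraint is '?', a specific value, or None ("no value").

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    if any(c is None for c in h2):            # h2 covers nothing
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def min_generalization(s, x):
    """Minimally generalize s so that it covers the positive example x."""
    return tuple(xi if si is None else (si if si == xi else "?")
                 for si, xi in zip(s, x))

def min_specializations(g, x, domains):
    """Minimal specializations of g that exclude the negative example x."""
    specs = []
    for i, gi in enumerate(g):
        if gi == "?":
            for v in domains[i]:
                if v != x[i]:
                    specs.append(g[:i] + (v,) + g[i + 1:])
    return specs

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]                   # S0: most specific boundary
    G = [tuple(["?"] * n)]                    # G0: most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]
            new_S = [s if matches(s, x) else min_generalization(s, x) for s in S]
            # keep only members below some g in G, then drop any member
            # that is strictly more general than another member of S
            new_S = [s for s in new_S
                     if any(more_general_or_equal(g, s) for g in G)]
            S = [s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not matches(s, x)]
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                else:
                    new_G += [g2 for g2 in min_specializations(g, x, domains)
                              if any(more_general_or_equal(g2, s) for s in S)]
            # drop any member that is strictly less general than another member of G
            G = [g for g in new_G
                 if not any(g != h and more_general_or_equal(h, g) for h in new_G)]
    return S, G

domains = [("Sun", "Rain", "Cloudy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
examples = [
    (("Sun",  "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sun",  "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rain", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sun",  "Warm", "High",   "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples, domains)
print("S:", S)   # [('Sun', 'Warm', '?', 'Strong', '?', '?')]
print("G:", G)   # [('Sun', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```
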
9. Properties of the CE Algorithm
- Converges to the "correct" hypothesis if:
  - There are no errors in the training set
    - Otherwise, the correct target concept is always eliminated!
  - Such a hypothesis in fact exists in H
- The next best query (the new training example to ask for) is the one that maximally separates the hypotheses in the version space (ideally, into two halves)
- Partially learned concepts might suffice to classify a new instance with certainty, or at least with some confidence

10. Inductive Biases
- Every learning method is implicitly biased toward a certain hypothesis space H
  - The conjunctive hypothesis space (only one value per attribute) can represent only 973 of the 2^96 possible subsets of instances (target concepts) in our example domain (3x2x2x2x2x2 = 96 possible instances, for attributes with 3, 2, 2, 2, 2, 2 values respectively)
- Without an inductive bias (no a priori assumptions regarding the target concept) there is no way to classify new, unseen instances!
  - The S boundary would always be just the disjunction of the positive example instances; the G boundary would be the negated disjunction of the negative example instances
  - Convergence would be possible only when all of X has been seen!
- Strongly biased methods make more inductive leaps
- Inductive bias of CE: the target concept c is contained in H!

11. Decision-Tree Learning
- Decision trees: a method for representing classification functions
  - Can be represented as a set of if-then rules
  - Each internal node represents a test of some attribute
  - An instance is classified by starting at the root, testing the attribute at each node, and moving along the branch corresponding to that attribute's value

12. Example Decision Tree

Outlook?
  = Sun:      Humidity?
                = High:   No
                = Normal: Yes
  = Overcast: Yes
  = Rain:     Wind?
                = Strong: No
                = Weak:   Yes

13. When Should Decision Trees Be Used?
- When instances are <attribute, value> pairs
  - Values are typically discrete, but can be continuous
- When the target function has discrete output values
- When disjunctive descriptions might be needed
  - A natural representation of disjunctions of rules
- When the training data might contain errors
  - Robust to errors in both classifications and attribute values
- When the training data might contain missing values
  - Several methods exist for completing unknown values

14. The Basic Decision-Tree Learning Algorithm: ID3 (Quinlan, 1986)
- A top-down greedy search through the hypothesis space of possible decision trees
- Originally intended for boolean-valued functions
- Extensions are incorporated in C4.5 (Quinlan, 1993)
- At each step, the "best" attribute to test is selected by some measure, the node is split along that attribute's values, and the process continues recursively
- Growth ends when all attributes have been used, or when all examples at the node are either positive or negative
(A compact sketch of the algorithm appears below; the entropy and information-gain measures it uses are defined on the next slides.)

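A compact, self-contained ID3 sketch (illustrative names; instances are dicts of attribute values, and the entropy and information-gain helpers it inlines are the measures defined on the next slides):

```python
# A minimal ID3 sketch; not Quinlan's implementation, just the core recursion.
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attribute):
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for v in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def id3(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # all positive or all negative
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == v]
        tree[v] = id3(subset, [a for a in attributes if a != best])
    return (best, tree)

# Usage on a tiny hypothetical data set:
examples = [
    ({"Outlook": "Sun",  "Wind": "Weak"},   "No"),
    ({"Outlook": "Sun",  "Wind": "Strong"}, "No"),
    ({"Outlook": "Rain", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain", "Wind": "Strong"}, "No"),
]
print(id3(examples, ["Outlook", "Wind"]))
```
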
15. Which Attribute Is Best to Test?
- The central choice in the ID3 algorithm and similar approaches
- Here, an information-gain measure is used, which measures how well each attribute separates the training examples according to their target classification

16. Entropy
- Entropy: an information-theoretic measure that characterizes the (im)purity of an example set S using the proportions of positive (⊕) and negative (⊖) instances
- Informally: the number of bits needed to encode the classification of an arbitrary member of S
- Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖
- Entropy(S) is in [0, 1]
- Entropy(S) is 0 if all members are positive or all are negative
- Entropy is maximal (1) when p⊕ = p⊖ = 0.5 (a uniform distribution of positive and negative cases)
- If the target concept takes c different values, Entropy(S) = Σ i=1..c -pi log2 pi, where pi is the proportion of class i
(A small sketch of this computation follows.)

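A minimal sketch of this computation for a two-class set, assuming the usual convention that 0 · log2(0) counts as 0 (function name is illustrative):

```python
# A minimal entropy sketch for a boolean classification.
from math import log2

def entropy(pos, neg):
    """Entropy of a set S with pos positive and neg negative members."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # skip empty classes: 0 * log2(0) -> 0
            p = count / total
            result -= p * log2(p)
    return result

print(entropy(9, 5))    # ~0.940, the set S used on slide 20
print(entropy(7, 7))    # 1.0: maximal impurity, p+ = p- = 0.5
print(entropy(14, 0))   # 0.0: a pure set
```
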
17. Entropy Function for a Boolean Classification

[Figure: Entropy(S) plotted against p⊕; the curve rises from 0 at p⊕ = 0.0 to a maximum of 1.0 at p⊕ = 0.5 and falls back to 0 at p⊕ = 1.0]

18. Entropy and Surprise
- Entropy can also be viewed as the mean surprise on seeing the outcome (the actual class)
- -log2 p is also called the surprisal [Tribus, 1961]
- It is the only nonnegative function consistent with the principle that the surprise at the occurrence of two independent events with probabilities p1 and p2 equals the surprise at the occurrence of a single event with probability p1 × p2

19. Information Gain of an Attribute
- Sometimes termed the Mutual Information (MI) gained about a class (e.g., a disease) given an attribute (e.g., a test), since the measure is symmetric
- The expected reduction in entropy E(S) caused by partitioning the examples in S according to attribute A and all of its corresponding values:
  Gain(S, A) ≡ E(S) - Σ v∈Values(A) (|Sv| / |S|) E(Sv)
- The attribute with maximal information gain is chosen by ID3 for splitting the node (see the sketch below)
- The measure follows from intuitive axioms [Benish, in press], e.g., not caring how the test result is revealed

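A minimal sketch of Gain(S, A) computed directly from per-value class counts (illustrative names; the entropy helper mirrors the previous slide):

```python
# A minimal sketch of Gain(S, A) from class counts.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n)

def information_gain(parent, partitions):
    """parent: (pos, neg) counts of S; partitions: a list of (pos, neg)
    counts of the subsets S_v, one per value v of the attribute A."""
    total = sum(parent)
    return entropy(*parent) - sum(
        ((p + n) / total) * entropy(p, n) for p, n in partitions)
```
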
20. Information Gain Example

S: {9+, 5-}, E = 0.940

Split on Humidity?                Split on Wind?
  High:   {3+, 4-}, E = 0.985       Weak:   {6+, 2-}, E = 0.811
  Normal: {6+, 1-}, E = 0.592       Strong: {3+, 3-}, E = 1.0

Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind)     = 0.940 - (8/14)·0.811 - (6/14)·1.0   = 0.048
(The small script below reproduces these numbers.)

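A small self-contained script (helpers inlined, counts taken from this slide) that reproduces these gains:

```python
# Reproducing the gains on this slide from the class counts shown above.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n)

def information_gain(parent, partitions):
    total = sum(parent)
    return entropy(*parent) - sum(
        ((p + n) / total) * entropy(p, n) for p, n in partitions)

S = (9, 5)                                       # E(S) ~ 0.940
print(information_gain(S, [(3, 4), (6, 1)]))     # Humidity: ~0.1518 (slide: 0.151, after rounding)
print(information_gain(S, [(6, 2), (3, 3)]))     # Wind:     ~0.0481 (slide: 0.048)
```
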
21. Properties of ID3
- Searches the hypothesis space of decision trees
- This is a complete space of all finite discrete-valued functions (unlike the conjunctive-hypothesis space)
- Maintains only a single current hypothesis (unlike CE)
- Performs no backtracking; thus, it might get stuck in a local optimum
- Uses all training examples at every step to refine the current hypothesis (unlike Find-S or CE)
- (Approximate) inductive bias: prefers shorter trees over larger trees (Occam's razor), and trees that place attributes with high information gain close to the root over those that do not

22. The Data Over-Fitting Problem
- Occurs due to noise in the data or too few examples
- Handling the over-fitting problem:
  - Stop growing the tree earlier, or
  - Prune the final tree retrospectively (post-pruning)
- In either case, the correct final tree size is determined by:
  - A separate validation set of examples, or
  - Using all examples and deciding whether expanding a node is likely to help, or
  - Using an explicit measure of the cost of encoding the training examples and the tree, and stopping when that measure is minimized

23. Other Improvements to ID3
- Handling continuous-valued attributes
  - Pick a threshold that maximizes information gain
- Avoiding the selection of many-valued attributes (such as Date) by using more sophisticated measures, such as the gain ratio: the gain of S relative to A and the target concept, divided by the entropy of S with respect to the values of A (see the sketch below)
- Handling missing values (e.g., using the most common value or the value distribution)
- Handling the costs of measuring attributes (e.g., laboratory tests) by including cost in the attribute-selection process

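A minimal sketch of the gain-ratio idea using the same count-based helpers as before (illustrative names; the "split information" denominator is the entropy of S with respect to the values of A):

```python
# A minimal gain-ratio sketch.
from math import log2

def entropy_from_counts(counts):
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n)

def gain_ratio(parent, partitions):
    """parent: (pos, neg) counts of S; partitions: per-value (pos, neg) counts."""
    total = sum(parent)
    gain = entropy_from_counts(parent) - sum(
        ((p + n) / total) * entropy_from_counts((p, n)) for p, n in partitions)
    # split information: entropy of S with respect to the values of A
    split_info = entropy_from_counts([p + n for p, n in partitions])
    return gain / split_info if split_info else 0.0

S = (9, 5)
print(gain_ratio(S, [(3, 4), (6, 1)]))   # Humidity
print(gain_ratio(S, [(6, 2), (3, 3)]))   # Wind
```
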
24. Summary: Concept and Decision-Tree Learning
- Concept learning is a search through a hypothesis space
- The Candidate-Elimination algorithm uses the general-to-specific ordering of hypotheses to compute the version space
- Inductive learning algorithms can classify unseen examples only because of their implicit inductive bias
- ID3 searches through the space of decision trees
- ID3 searches a complete hypothesis space and can handle noise and missing values in the training set
- Over-fitting the training data is a common problem and requires handling by methods such as post-pruning
