1. Learning From Observations
   • Chapter 18.1-18.3
   • Quotes & Concept
   • Components of the Learning Agent
   • Supervised Concept Learning by Induction
   • Inductive Bias
   • Inductive Concept Learning Framework and Approaches
   • Decision Tree Algorithm

2. Learning?
   • "Learning is making useful changes in our minds."
   • Marvin Minsky

3. Learning?
   • "Learning is constructing or modifying representations of what is being experienced."
   • Ryszard Michalski

4. Learning?
   • "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time."
   • Herbert Simon (1916-2001)

5. Learning?
   • What are different paradigms of learning?
     - Rote Learning
     - Induction
     - Clustering
     - Analogy
     - Discovery
     - Genetic Algorithms
     - Reinforcement

6. Learning?
   • Why do machine learning?
     - Understand and improve the efficiency of human learning
     - Discover new things or structures that are unknown to humans
     - Fill in skeletal or incomplete specifications about a domain
7. Components of a Learning Agent
   [Diagram: an agent in an environment, connected through Sensors and Effectors. Inside the agent: Reasoning & Decision Making, a Model of the World (being updated), a List of Possible Actions, Prior Knowledge about the World, and Goals/Utility.]

8. Components of a Learning Agent
   [Diagram: the same agent, with the components abbreviated (World Model, Actions, Prior Knowledge, Goals/Utility).]

9. Components of a Learning Agent
   [Diagram: the agent collapsed into a single Performance Element between Sensors and Effectors.]

10. Components of a Learning Agent
    [Diagram: the Performance Element plus a Learning Element.]
    • PE provides knowledge to LE
    • LE changes PE based on how it is doing

11. Components of a Learning Agent
    [Diagram: adds a Critic.]
    • C provides feedback to LE on how PE is doing
    • C compares PE with a standard of performance that it is told (via sensors)

12. Components of a Learning Agent
    [Diagram: adds a Problem Generator, completing the Learning Agent.]
    • PG suggests problems or actions to PE that will generate new examples or experiences, which will aid in achieving the goals from the LE

13. Components of a Learning Agent
    [Diagram: the complete Learning Agent: Performance Element, Critic, Learning Element, and Problem Generator, connected to the environment through Sensors and Effectors.]
14. Supervised Concept Learning by Induction
    • What is inductive learning (IL)?
      - It is learning from examples.
      - IL extrapolates from a given set of examples so that accurate predictions can be made about future examples.
      - Learn an unknown function f(x) = y
        · x: an input example
        · y: the desired output
        · h (the hypothesis) is learned, which approximates f

15. Supervised Concept Learning by Induction
    • Why is inductive learning an inherently conjectural process?
      - any knowledge created by generalization from specific facts cannot be proven true
      - it can only be proven false
    • Inductive inference is falsity preserving, not truth preserving.
      - An induction produces a hypothesis that must be validated by relating it to new facts or experiments.
      - It is only an opinion and not necessarily true.

16. Supervised Concept Learning by Induction
    • What is supervised vs. unsupervised learning?
      - supervised: a "teacher" gives a set of both the input examples and the desired outputs, i.e. (x, y) pairs
      - unsupervised: only the input examples are given, i.e. the xs
    • In either case, the goal is to determine a hypothesis h that estimates f.

17. Supervised Concept Learning by Induction
    • What is concept learning (CL)?
      - CL determines, from a given set of examples, whether a given example is or isn't an instance of the concept/class/category.
        · If it is, call it a positive example (+).
        · If not, call it a negative example (-).

18. Supervised Concept Learning by Induction
    • What is supervised CL by induction?
      - Given a training set of positive and negative examples of a concept:
        {(x1, y1), (x2, y2), ..., (xn, yn)}
        where each xi is an example and each yi is its classification, either + or -.
      - Construct a description that accurately classifies whether future examples are positive or negative:
        h(x_(n+1)) = y_(n+1)
        where y_(n+1) is the prediction, either + or -.

19. Supervised Concept Learning by Induction
    • How might the performance of this learning algorithm be evaluated?
      - predictive accuracy of the classifier (most common)
      - speed of the learner
      - speed of the classifier
      - space requirements
20. Inductive Bias
    • Learning can be viewed as searching the Hypothesis Space H of possible h functions.
    • Inductive Bias:
      - is the basis on which one h is chosen over another
      - is needed to generalize beyond the specific training examples
    • A completely unbiased inductive algorithm:
      - only memorizes the training examples (rote learning)
      - can't make predictions about unseen examples

21. Inductive Bias
    • Biases commonly used in machine learning:
      - Restricted Hypothesis Space Bias: allow only certain types of h's, not arbitrary ones
      - Preference Bias: define a metric for comparing h's so as to determine whether one is better than another

22. Inductive Concept Learning Framework
    • Preprocess raw sensor data:
      - extract a feature vector, x, that describes the attributes relevant for classifying examples
    • Each x is a list of (attribute, value) pairs, e.g.
      x = ((Rank, queen), (Suit, hearts), (Size, big))
      - the number of attributes is fixed: Rank, Suit, Size
      - the number of possible values for each attribute is fixed:
        Rank: 2..10, jack, queen, king, ace
        Suit: diamonds, hearts, clubs, spades
        Size: big, small

23. Inductive Concept Learning Framework
    • Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of attributes.
    [Diagram: a 2-D feature space with Suit (diamonds, hearts, clubs, spades) on one axis and Rank (2 through K) on the other.]

24. Inductive Concept Learning Approach: Nearest-Neighbor Classification
    • Nearest Neighbor, a simple approach (a code sketch follows):
      - save each training example as a point in Feature Space
      - classify a new example by giving it the same classification (+ or -) as its nearest neighbor in Feature Space
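A minimal Python sketch of this nearest-neighbor rule. The card attributes are categorical, so it assumes a simple mismatch-count (Hamming) distance between feature vectors; the training data, the "big queens" target concept (borrowed from the speaker notes), and all function names are illustrative, not from the slides.

    def hamming_distance(x1, x2):
        # Count attributes on which two feature vectors disagree
        # (assumes both vectors use the same fixed set of attributes).
        return sum(1 for a in x1 if x1[a] != x2[a])

    def nearest_neighbor_classify(train, x):
        # Return the class (+ or -) of the saved training example closest to x.
        # train: list of (feature_dict, label) pairs stored verbatim.
        nearest = min(train, key=lambda ex: hamming_distance(ex[0], x))
        return nearest[1]

    # Hypothetical card examples for the concept "big queens".
    train = [
        ({"Rank": "queen", "Suit": "hearts", "Size": "big"},   "+"),
        ({"Rank": "queen", "Suit": "clubs",  "Size": "big"},   "+"),
        ({"Rank": "9",     "Suit": "hearts", "Size": "big"},   "-"),
        ({"Rank": "king",  "Suit": "spades", "Size": "small"}, "-"),
    ]
    print(nearest_neighbor_classify(train, {"Rank": "queen", "Suit": "spades", "Size": "big"}))  # -> '+'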
25. Inductive Concept Learning Approach: Nearest-Neighbor Classification
    • Doesn't generalize well if the + and - examples are not "clustered".
    [Diagram: the Suit (diamonds, hearts, clubs, spades) by Rank (2 through K) feature space.]

26. Inductive Concept Learning Approach: Learning Decision Trees
    • Goal: build a decision tree for classifying examples as positive or negative instances of a concept
      - a form of supervised concept learning by induction
      - uses both a preference bias and a restricted hypothesis space bias
      - has two phases:
        · learning: uses batch processing of training examples to produce a classifier
        · classifying: uses the classifier to make predictions about unseen examples

27. Inductive Concept Learning Approach: Learning Decision Trees
    • A decision tree is a tree in which:
      - each non-leaf node has associated with it an attribute (feature)
      - each leaf node has associated with it a classification (+ or -)
      - each arc has associated with it one of the possible values of its parent node's attribute

28. Inductive Concept Learning Approach: Learning Decision Trees
    [Diagram: an example decision tree over the card attributes. Interior nodes test a feature (Suit, Rank, Size), each arc is labeled with a value (clubs, hearts, spades, diamonds; 9, 10, jack, ...; large, small), and each leaf node holds a classification (+ or -).]

29. Inductive Concept Learning Approach: Learning Decision Trees
    • Preference Bias: Ockham's Razor
      - The simplest hypothesis that is consistent with all observations is most likely.
      - The smallest decision tree that correctly classifies all of the training examples is best.
    • Finding the provably smallest decision tree is an NP-hard problem, so instead construct one that is pretty small.
30. Learning From Observations
    • Chapter 18.1-18.3
    • Decision Tree Algorithm
    • Information Gain
    • Selecting the Best Attribute
    • Case Studies & Extensions
    • Pruning

31. Decision Tree Algorithm
    • ID3 (and its commercial successor C5.0), first developed by Quinlan (1987)
    • Top-down construction of the decision tree using a greedy algorithm:
      - Select the "best attribute" to use for the new node at the current level in the tree.
      - For each possible value of the selected attribute:
        · Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node.
        · Recursively generate each child node until (ideally) all examples for a node are either all + or all -.

32. Decision Tree Algorithm
    • Node decisionTreeLearning(exs, atts, defaultClass)
      - exs: list of training examples
      - atts: list of candidate attributes for the current node
      - defaultClass: default class for when no examples are left; initially set to the majority class of all the training examples, with ties split in favor of -

33. Decision Tree Algorithm

    Node decisionTreeLearning(exs, atts, defaultClass) {
      // three base cases, each creating a leaf node
      if (empty(exs))     return new node defaultClass;
      if (sameClass(exs)) return new node class(exs);
      if (empty(atts))    return new node majorityClass(exs);

      // recursive case creates a sub-tree
      best = chooseBestAttribute(exs, atts);                           // 1.
      tree = new node with best;
      majority = majorityClass(exs);
      for (each value v of best) {                                     // 2.
        v_exs = subset of exs with best == v;                          //   a.
        subtree = decisionTreeLearning(v_exs, atts - best, majority);  //   b.
        link subtree to tree and label arc v;
      }
      return tree;
    }
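A minimal runnable Python sketch of the decisionTreeLearning pseudocode above, assuming examples are (feature-dictionary, label) pairs with '+'/'-' labels. The attribute chooser is passed in as a function so the recursion stays in view; a Max-Gain chooser is sketched after the "Selecting the Best Attribute Using Information Gain" slides. All helper names are mine, not from the slides.

    from collections import Counter

    def majority_class(exs, tie_breaker="-"):
        # Most common label among (features, label) pairs; ties go to tie_breaker.
        counts = Counter(label for _, label in exs)
        top = counts.most_common()
        if len(top) > 1 and top[0][1] == top[1][1]:
            return tie_breaker
        return top[0][0]

    def decision_tree_learning(exs, atts, default_class, choose_best_attribute):
        # Base cases: leaves are represented as plain label strings.
        if not exs:
            return default_class
        labels = {label for _, label in exs}
        if len(labels) == 1:
            return labels.pop()
        if not atts:
            return majority_class(exs)

        # Recursive case: pick an attribute and split the examples on its values.
        # (For simplicity this iterates only over values seen in exs; the slides
        # iterate over all possible values, giving default-class leaves to unseen ones.)
        best = choose_best_attribute(exs, atts)
        majority = majority_class(exs)
        tree = {"attribute": best, "branches": {}}
        for v in sorted({x[best] for x, _ in exs}):
            v_exs = [(x, y) for x, y in exs if x[best] == v]
            tree["branches"][v] = decision_tree_learning(
                v_exs, [a for a in atts if a != best], majority, choose_best_attribute)
        return tree

    # Usage with the hypothetical card data and a trivial "first attribute" chooser:
    # tree = decision_tree_learning(train, ["Rank", "Suit", "Size"], "-",
    #                               lambda exs, atts: atts[0])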
34. Decision Tree Algorithm
    • How might the best attribute be chosen?
      - Random: choose any attribute at random
      - Least-Values: choose the attribute with the smallest number of possible values
      - Most-Values: choose the attribute with the largest number of possible values
      - Max-Gain: choose the attribute that has the largest expected information gain. C5.0 uses this.

35. Information Gain
    • For a training set S = P ∪ N, where P (positive) and N (negative) are two disjoint subsets, the information content (entropy) of S is:
      I(P, N) = -(P log2 P) - (N log2 N)
      where, overloading the names, P = |P|/|S| and N = |N|/|S| (note: |x| is the size of set x).
    • Information gain is then a reduction in entropy, i.e. in the amount of disorder in the system.

36. Information Gain
    • Perfect Homogeneity in S: when the examples are either all + or all -
      - given P = 1 (i.e. 100%) and N = 0 (i.e. 0%):
        I(P, N) = -(P log2 P) - (N log2 N)
        I(1, 0) = -1 log2 1 - 0 log2 0
                = -1(0) - 0        (taking 0 log2 0 = 0, its limiting value)
                = 0
        no disorder, no information content

37. Information Gain
    • Perfect Balance in S: when the examples are evenly divided between + and -
      - given P = N = ½ (i.e. 50%):
        I(P, N) = -(P log2 P) - (N log2 N)
        I(½, ½) = -½ log2 ½ - ½ log2 ½
                = -½ (log2 1 - log2 2) - ½ (log2 1 - log2 2)
                = -½ (0 - 1) - ½ (0 - 1)
                = ½ + ½
                = 1
        maximum disorder, maximum information content

38. Information Gain
    • I measures the information content, in bits, associated with a set S of examples consisting of the disjoint subsets P of positive examples and N of negative examples.
      - 0 <= I(P, N) <= 1, where 0 is no information and 1 is maximum information
      - it's a continuous value, not binary
      - for more see: http://www-2.cs.cmu.edu/~dst/Tutorials/Info-Theory/
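A small Python helper for I(P, N) as defined above, using the convention 0 · log2 0 = 0 noted in the speaker notes; the function name is mine.

    from math import log2

    def information_content(p, n):
        # I(P, N) = -(P log2 P) - (N log2 N), where p and n are the class
        # proportions (p + n == 1) and 0 * log2(0) is taken to be 0.
        return sum(0.0 if q == 0 else -q * log2(q) for q in (p, n))

    print(information_content(1.0, 0.0))  # perfect homogeneity -> 0.0
    print(information_content(0.5, 0.5))  # perfect balance     -> 1.0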
39. Selecting the Best Attribute Using Information Gain
    • Goal: construct a small decision tree that correctly classifies the training examples.
    • How? Select the attribute that will result in the smallest child sub-trees.
    • Why would having low information content in the child nodes be desirable?
      - it means most, if not all, of the examples in each child node are of the same class
      - the decision tree is then likely to be small!

40. Selecting the Best Attribute Using Information Gain
    • How is the information gain determined for a selected attribute A?
      - partition a parent node's examples by the possible values of the selected attribute
      - assign these subsets of examples to child nodes
      - then take the difference between the information content of the parent node and the remaining information content in the children
    • Gain(A) = I(P, N) - Remainder(A)

41. Selecting the Best Attribute Using Information Gain
    • Given a set of examples: S = P ∪ N
    • Given a selected attribute A having m possible values
    • Define Remainder(A):
      - the weighted sum of the information content at each child node resulting from selecting attribute A
      - measures the total "disorder" or "inhomogeneity" of the child nodes
      - 0 <= Remainder(A) <= 1

42. Selecting the Best Attribute Using Information Gain
    • Remainder(A) = Σ (i = 1..m) q_i I(P_i, N_i)
      - q_i = |S_i|/|S| = proportion of the examples on branch i, where S_i = the subset of S with value i, for i = 1, ..., m
      - P_i = |P_i|/|S_i| = proportion of + examples on branch i, where P_i = the subset of S_i that are +
      - N_i = |N_i|/|S_i| = proportion of - examples on branch i, where N_i = the subset of S_i that are -

43. Selecting the Best Attribute Using Information Gain
    • The best attribute of those available at a node is the attribute A with:
      - maximum Gain(A)
      - or equivalently, minimum Remainder(A), since I(P, N) is constant for a given node
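A sketch of Remainder(A), Gain(A), and a Max-Gain chooser in Python, over the same (feature-dictionary, label) examples as the earlier decision-tree sketch; the function names are mine.

    from math import log2

    def info(exs):
        # I(P, N) for a set of (features, label) examples with labels '+' / '-'.
        if not exs:
            return 0.0
        p = sum(1 for _, y in exs if y == "+") / len(exs)
        n = 1.0 - p
        return sum(0.0 if q == 0 else -q * log2(q) for q in (p, n))

    def remainder(exs, attribute):
        # Weighted sum of the information content of the child nodes for a split on attribute.
        total = 0.0
        for v in {x[attribute] for x, _ in exs}:
            branch = [(x, y) for x, y in exs if x[attribute] == v]
            total += len(branch) / len(exs) * info(branch)
        return total

    def gain(exs, attribute):
        return info(exs) - remainder(exs, attribute)

    def choose_best_attribute(exs, atts):
        # Max-Gain selection, the rule slide 34 attributes to C5.0.
        return max(atts, key=lambda a: gain(exs, a))

On the hypothetical card examples from the nearest-neighbor sketch, choose_best_attribute(train, ["Rank", "Suit", "Size"]) picks Rank (gain 1.0), since splitting on Rank cleanly separates the queens from the rest. This chooser can be plugged directly into the decision_tree_learning sketch above.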
44. Case Studies
    • Decision trees have been shown to be at least as accurate as human experts.
    • Diagnosing breast cancer:
      - humans correct 65% of the time
      - the decision tree classified 72% correctly
    • BP designed a decision tree for gas-oil separation for offshore oil platforms.
    • Cessna designed a flight controller using 90,000 examples and 20 attributes per example.

45. Extensions to Decision Tree Learning Algorithm
    • Evaluating performance accuracy: use test examples to estimate accuracy
      - randomly partition the training examples into:
        · TRAIN set (~80% of all training examples)
        · TEST set (~20% of all training examples)
      - generate the decision tree using the TRAIN set
      - use the TEST set to evaluate accuracy
        · accuracy = # of correct tests / total # of tests

46. Extensions to Decision Tree Learning Algorithm
    • Noisy data could be in the examples:
      - examples have the same attribute values but different classifications (the rare base case: if (empty(atts)))
      - a classification is wrong
      - attribute values are incorrect because of errors in getting or preprocessing the data
      - irrelevant attributes are used in the decision-making process

47. Extensions to Decision Tree Learning Algorithm
    • Real-valued data:
      - pick thresholds that sort the training values for an attribute into equal-sized bins
      - preprocess values during learning to find the most informative thresholds (see the sketch below)
    • Missing data:
      - while learning: replace with the most likely value
      - while learning: use NotKnown as a value
      - while classifying: follow the arcs for all values and weight each by the frequency of examples crossing that arc
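One common way to realize the "most informative thresholds" idea, sketched under my own assumptions (the slides do not spell this out): take candidate thresholds at the midpoints between adjacent distinct sorted values of a numeric attribute, treat each candidate t as a binary test value <= t, and keep the candidate whose split has the highest gain (computable with the gain sketch above).

    def candidate_thresholds(values):
        # Midpoints between adjacent distinct sorted values of a numeric attribute.
        vs = sorted(set(values))
        return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

    # e.g. candidate_thresholds([2, 9, 9, 10, 13]) -> [5.5, 9.5, 11.5]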
48. Extensions to Decision Tree Learning Algorithm
    • Generation of rules: one rule for each path from the root to a leaf
      - the rule's antecedent is the attribute tests along the path
      - the consequent is the classification at the leaf node
      - e.g. if (Size == small && Suit == hearts) class = '+';
      - Constructing these rules yields an interpretation of the tree's meaning.
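A sketch of rule generation over the dictionary-based trees used in the earlier sketches: one rule per root-to-leaf path, with the attribute tests as antecedent and the leaf class as consequent. The tree format and the name tree_to_rules are mine.

    def tree_to_rules(tree, conditions=()):
        # Yield (conditions, classification) pairs, one per root-to-leaf path.
        if isinstance(tree, str):              # leaf node: a class label
            yield list(conditions), tree
            return
        for value, subtree in tree["branches"].items():
            yield from tree_to_rules(subtree, conditions + ((tree["attribute"], value),))

    # Each yielded rule reads like: if (Size == small && Suit == hearts) class = '+';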
49. Extensions to Decision Tree Learning Algorithm
    • Setting parameters:
      - some learning algorithms require setting learning parameters
      - parameters must be set without looking at the test data
      - one approach: use a tuning set

50. Extensions to Decision Tree Learning Algorithm
    • Setting parameters using a TUNE set:
      - partition the training examples into TRAIN, TUNE, and TEST sets
      - for each candidate parameter value, generate a decision tree using the TRAIN set
      - use the TUNE set to evaluate error rates and determine which parameter value is best
      - compute a new decision tree using the selected parameter values and both the TRAIN and TUNE sets
      - use the TEST set to compute performance accuracy
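A skeleton of this TUNE-set procedure, assuming a learner(train_exs, param) that returns a classifier and an accuracy(classifier, exs) scorer; both are placeholders rather than functions defined by the slides.

    def select_parameter(learner, accuracy, train, tune, test, candidate_params):
        # Pick the parameter value with the best TUNE accuracy,
        # retrain on TRAIN + TUNE, and report TEST accuracy.
        best_param = max(candidate_params,
                         key=lambda p: accuracy(learner(train, p), tune))
        final_classifier = learner(train + tune, best_param)
        return best_param, accuracy(final_classifier, test)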
51. Extensions to Decision Tree Learning Algorithm
    • Cross-validation for experimental validation of performance:
      - divide all examples into N disjoint subsets E = E_1, E_2, ..., E_N
      - for each i = 1, ..., N:
        · let TEST set = E_i and TRAIN set = E - E_i
        · compute a decision tree using the TRAIN set
        · determine the performance accuracy PA_i using the TEST set
      - compute the N-fold cross-validation estimate of performance = (PA_1 + PA_2 + ... + PA_N) / N
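A minimal N-fold cross-validation sketch following the slide above, again assuming learner and accuracy callables (here learner(train_exs) takes no parameter); the names are mine.

    def cross_validate(learner, accuracy, examples, n_folds):
        # Average accuracy over N folds: each fold serves as the TEST set once,
        # with the remaining folds combined as the TRAIN set.
        folds = [examples[i::n_folds] for i in range(n_folds)]
        scores = []
        for i in range(n_folds):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            scores.append(accuracy(learner(train), test))
        return sum(scores) / n_folds

Setting n_folds equal to the number of examples gives Leave-1-Out, which the speaker notes recommend when fewer than about 100 examples are available.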
52. Extensions to Decision Tree Learning Algorithm
    • Overfitting: meaningless regularity is found in the data
      - irrelevant attributes confound the true, important, distinguishing features
      - fix by pruning lower nodes in the decision tree
      - if the gain of the best attribute is below a threshold, make this node a leaf rather than generating child nodes

53. Pruning Decision Trees using a Greedy Algorithm

    randomly partition the training examples into:
      TRAIN set (~80% of training examples)
      TUNE set  (~10% of training examples)
      TEST set  (~10% of training examples)
    build the decision tree as usual using the TRAIN set;
    bestTree = decision tree produced on the TRAIN set;
    bestAccuracy = accuracy of bestTree on the TUNE set;
    progressMade = true;
    while (progressMade) {   // while accuracy on TUNE improves
      find better tree;
      // starting at the root, consider various pruned versions
      // of the current tree and see if any are better than
      // the best tree found so far
    }
    return bestTree;
    use TEST to determine performance accuracy;

54. Pruning Decision Trees using a Greedy Algorithm

    // find better tree
    progressMade = false;
    currentTree = bestTree;
    for (each interiorNode N in currentTree) {   // start at root
      prunedTree = pruned copy of currentTree;
      newAccuracy = accuracy of prunedTree on the TUNE set;
      if (newAccuracy >= bestAccuracy) {
        bestAccuracy = newAccuracy;
        bestTree = prunedTree;
        progressMade = true;
      }
    }

55. Pruning Decision Trees using a Greedy Algorithm

    // pruned copy of currentTree
    replace interiorNode N in currentTree by a leaf node;
    label the leaf node with the majorityClass among the TRAIN set examples that reached node N;
    break ties in favor of '-';
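A Python sketch of the greedy pruning loop from the three slides above (essentially reduced-error pruning against the TUNE set), over the same dictionary-based trees as the earlier sketches; all names are mine, and majority_class is redefined here so the sketch stands alone.

    from collections import Counter

    def majority_class(exs, tie_breaker="-"):
        # Most common label among (features, label) pairs; ties go to tie_breaker.
        counts = Counter(y for _, y in exs)
        winners = [c for c, n in counts.items() if n == max(counts.values())]
        return tie_breaker if len(winners) > 1 else winners[0]

    def classify(tree, x, default="-"):
        # Follow arcs (attribute values) until a leaf, i.e. a class-label string.
        while not isinstance(tree, str):
            tree = tree["branches"].get(x[tree["attribute"]], default)
        return tree

    def accuracy(tree, exs):
        return sum(classify(tree, x) == y for x, y in exs) / len(exs)

    def prunings(tree, train_exs, path=()):
        # Yield (path, leaf_label) for every interior node: pruning that node
        # replaces its whole subtree with the majority class of the TRAIN
        # examples that reach it (ties broken in favor of '-').
        if isinstance(tree, str):
            return
        reaching = [(x, y) for x, y in train_exs if all(x[a] == v for a, v in path)]
        yield path, (majority_class(reaching) if reaching else "-")
        for value, subtree in tree["branches"].items():
            yield from prunings(subtree, train_exs, path + ((tree["attribute"], value),))

    def replace_at(tree, path, leaf):
        # Copy of tree with the subtree reached by path replaced by leaf.
        if not path:
            return leaf
        branches = dict(tree["branches"])
        _, value = path[0]
        branches[value] = replace_at(branches[value], path[1:], leaf)
        return {"attribute": tree["attribute"], "branches": branches}

    def prune(tree, train_exs, tune_exs):
        # Greedy loop: keep pruning single nodes while TUNE accuracy does not
        # decrease (the >= test prefers smaller trees on ties).
        best_tree, best_acc = tree, accuracy(tree, tune_exs)
        progress_made = True
        while progress_made:
            progress_made = False
            current = best_tree
            for path, leaf in list(prunings(current, train_exs)):
                pruned = replace_at(current, path, leaf)
                new_acc = accuracy(pruned, tune_exs)
                if new_acc >= best_acc:
                    best_acc, best_tree, progress_made = new_acc, pruned, True
        return best_tree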
56. Summary
    • One of the most widely used learning methods in practice
    • Can out-perform human experts on many problems

57. Summary
    • Strengths
      - fast
      - simple to implement
      - well founded in information theory
      - can convert the result to a set of easily interpretable rules
      - empirically validated in many commercial products
      - handles noisy data
      - scales well

58. Summary
    • Weaknesses
      - univariate: partitions using only one attribute at a time, which limits the types of possible trees
      - large decision trees may be hard to understand
      - requires fixed-length feature vectors
      - non-incremental (i.e., a batch method)
Speaker notes

• Marvin Minsky: Professor of EE and CS, MIT Media Lab, MIT AI Lab, Cambridge, MA. Marvin Minsky has made many contributions to AI, cognitive psychology, mathematics, computational linguistics, robotics, and optics. In recent years he has worked chiefly on imparting to machines the human capacity for commonsense reasoning. His conception of human intellectual structure and function is presented in The Society of Mind.

• Ryszard Michalski: Professor of Computational Sciences, Director of the Machine Learning and Inference Laboratory, George Mason University, Fairfax, VA. Dr. Michalski is a cofounder of the field of machine learning, and the originator of several research subareas, such as conceptual clustering, constructive induction, variable-valued logic, natural induction, variable-precision logic (with Patrick Winston, MIT), computational theory of human plausible reasoning (with Alan Collins), two-tiered representation of imprecise concepts, multistrategy task-adaptive learning, inferential theory of learning, learnable evolution model, and, most recently, inductive databases and knowledge scouts.

• Herbert Simon (1916-2001): Professor of Psychology, CMU, Pittsburgh, PA. Herbert Simon's main interests in computer science were in artificial intelligence, human-computer interaction, principles of the organization of humans and machines in information processing systems, the use of computers to study (by modeling) philosophical problems of the nature of intelligence and of epistemology, and the social implications of computer technology.

• Learning paradigms: Rote Learning is a one-to-one mapping from inputs to a stored representation ("learning by memorization"), with association-based storage and retrieval. Induction uses specific examples to reach general conclusions. Analogy determines the correspondence between two different representations. Discovery is unsupervised; a specific goal is not given. Reinforcement gives only feedback (a positive or negative reward) at the end of a sequence of steps; it requires assigning reward to individual steps by solving the credit assignment problem: which steps should receive credit or blame for a final result?

• Why do machine learning? Efficiency: use it to improve methods for teaching and tutoring people, as done in CAI (computer-aided instruction). Discover: data mining. Fill in: large, complex AI systems cannot be completely derived by hand and require dynamic updating to incorporate new information; learning new characteristics expands the domain of expertise and lessens the "brittleness" of the system.

• e.g. feedback of success or failure

• We'll concentrate on the LE (Learning Element).

• Card example: train on large non-face cards, then test on a small non-face card: + or -? Can the answer be proven true? Deductive inference is truth preserving: a conclusion is a logical consequence of the information and has the same truth-status as the original information.

• Ockham's Razor is the principle proposed by William of Ockham (England) in the fourteenth century: "Pluralitas non est ponenda sine neccesitate", which translates as "entities should not be multiplied unnecessarily". A related rule, which can be used to slice open conspiracy theories, is Hanlon's Razor: "Never attribute to malice that which can be adequately explained by stupidity".

• Do the simple concept "all big queens".

• The limit as x -> 0 of x log2 x is 0, since x goes to 0 faster than log2 x goes to negative infinity (l'Hopital's Rule for indeterminate forms; Guillaume Francois Antoine de l'Hopital, 1661-1704).

• Rare base case: if (empty(atts)) return majorityClass(exs); Why? It means the same set of feature values produced different classifications, which probably implies noise in the data or that more features are needed.

• When the number of examples available is small (less than about 100), use Leave-1-Out, where N-fold cross-validation uses N = the number of examples.