Supervised Learning
Introduction to Artificial Intelligence, COS302
Michael L. Littman, Fall 2001
Administration Exams graded!  http://www.cs.princeton.edu/courses/archive/fall01/cs302/whats-new.html Project groups.
Supervised Learning
Most studied in machine learning.
http://www1.ics.uci.edu/~mlearn/MLRepository.html
Set of examples (usually numeric vectors). Split into:
Training: allowed to see it
Test: want to minimize error here
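For concreteness, here is a minimal Python sketch of the train/test split just described; the function name, the 70/30 proportion, and the random shuffle are illustrative assumptions, not part of the course materials.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Randomly split labeled examples into a training set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = examples[:]                       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)
```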
Another Significant App
Name            A B C D E F G  Class
1. Jeffrey B.   1 0 1 0 1 0 1   -
2. Paul S.      0 1 1 0 0 0 1   -
3. Daniel C.    0 0 1 0 0 0 0   -
4. Gregory P.   1 0 1 0 1 0 0   -
5. Michael N.   0 0 1 1 0 0 0   -
6. Corinne N.   1 1 1 0 1 0 1   +
7. Mariyam M.   0 1 0 1 0 0 1   +
8. Stephany D.  1 1 1 1 1 1 1   +
9. Mary D.      1 1 1 1 1 1 1   +
10. Jamie F.    1 1 1 0 0 1 1   +
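The table can be encoded as (name, feature vector, label) triples; this particular Python representation is an assumption, reused by the later sketches.

```python
# Each example: (name, [A, B, C, D, E, F, G], label), copied from the table above.
examples = [
    ("Jeffrey B.",  [1, 0, 1, 0, 1, 0, 1], "-"),
    ("Paul S.",     [0, 1, 1, 0, 0, 0, 1], "-"),
    ("Daniel C.",   [0, 0, 1, 0, 0, 0, 0], "-"),
    ("Gregory P.",  [1, 0, 1, 0, 1, 0, 0], "-"),
    ("Michael N.",  [0, 0, 1, 1, 0, 0, 0], "-"),
    ("Corinne N.",  [1, 1, 1, 0, 1, 0, 1], "+"),
    ("Mariyam M.",  [0, 1, 0, 1, 0, 0, 1], "+"),
    ("Stephany D.", [1, 1, 1, 1, 1, 1, 1], "+"),
    ("Mary D.",     [1, 1, 1, 1, 1, 1, 1], "+"),
    ("Jamie F.",    [1, 1, 1, 0, 0, 1, 1], "+"),
]
```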
Features
A: First name ends in a vowel?
B: Neat handwriting? (Lisa test.)
C: Middle name listed?
D: Senior?
E: Got extra-extra credit?
F: Google brings up home page?
G: Google brings up reference?
Decision Tree
Internal nodes: features
Leaves: classification
[Tree diagram: internal nodes test features F, A, D, A; leaves group examples 8,9 / 2,3,7 / 1,4,5,6 / 10. Error: 30%]
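A minimal sketch of the representation this slide describes: internal nodes test one feature, leaves carry a class label. The Leaf/Node classes and the classify routine are illustrative assumptions, not the course's code; the feature vector is the A..G list from the encoding above.

```python
class Leaf:
    def __init__(self, label):
        self.label = label            # "+" or "-"

class Node:
    def __init__(self, feature, left, right):
        self.feature = feature        # index of the tested feature (0 = A, ..., 6 = G)
        self.left = left              # subtree followed when the feature is 0
        self.right = right            # subtree followed when the feature is 1

def classify(tree, features):
    """Walk from the root to a leaf, branching on the tested feature at each internal node."""
    while isinstance(tree, Node):
        tree = tree.right if features[tree.feature] else tree.left
    return tree.label
```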
Search Given a set of training data, pick a decision tree: search problem! Challenges: Scoring function? Large space of trees.
Scoring Function
What's a good tree?
Low error on training data
Small
A small tree by itself is obviously not enough; why isn't low training error enough on its own?
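A hedged sketch of the two criteria just listed, reusing the Leaf/Node/classify definitions and the (name, features, label) example format from the sketches above; the helper names are assumptions.

```python
def error_rate(tree, examples):
    """Fraction of labeled examples the tree misclassifies (the 'low error' criterion
    when applied to training data; the same function can score held-out data later)."""
    wrong = sum(1 for _, feats, label in examples if classify(tree, feats) != label)
    return wrong / len(examples)

def tree_size(tree):
    """Number of nodes in the tree; the 'small' criterion prefers lower values."""
    if isinstance(tree, Leaf):
        return 1
    return 1 + tree_size(tree.left) + tree_size(tree.right)
```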
Low Error Not Enough
[Tree diagram: internal nodes test C (middle name?), E (EEC?), B (neat?), and F (Google?), fitting the training set exactly]
Training set error: 0% (can always do this?)
Memorizing the Data
[Tree diagram: a much larger tree, repeatedly testing D, E, F, B, A, and C, that fits every training example]
“Learning Curve”
[Plot: error vs. tree size]
What’s the Problem? Memorization w/o generalization Want a tree big enough to be correct, but not so big that it gets distracted by particulars. But, how can we know? (Weak) theoretical bounds exist.
Cross-validation
Simple, effective hack method.
[Diagram: Data is split into Train and Test; Train is split again into Train’ and a held-out C-V set]
Concrete Idea: Pruning Use Train’ to find tree w/ no error. Use C-V to score prunings of tree. Return pruned tree w/ max score.
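One plausible reading of this recipe, as a sketch: grow a zero-error tree on Train’, then prune bottom-up, replacing a subtree with a majority leaf whenever the C-V examples that reach it are classified at least as well by that leaf. The greedy bottom-up strategy, labeling the leaf by the C-V majority, and the helper names are all assumptions, not necessarily what was shown in lecture.

```python
def majority_label(examples):
    """Most common label among a set of (name, features, label) examples."""
    labels = [label for _, _, label in examples]
    return max(set(labels), key=labels.count)

def prune(tree, cv_reaching):
    """Bottom-up pruning scored on the C-V set: collapse a subtree to a leaf when the
    C-V examples reaching it are not classified any worse by that leaf."""
    if isinstance(tree, Leaf) or not cv_reaching:
        return tree
    # Route the C-V examples the way this node does, and prune the children first.
    zeros = [ex for ex in cv_reaching if ex[1][tree.feature] == 0]
    ones = [ex for ex in cv_reaching if ex[1][tree.feature] == 1]
    tree.left = prune(tree.left, zeros)
    tree.right = prune(tree.right, ones)
    candidate = Leaf(majority_label(cv_reaching))
    if error_rate(candidate, cv_reaching) <= error_rate(tree, cv_reaching):
        return candidate  # the simpler leaf scores at least as well on the C-V data
    return tree
```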
How to Find the Tree? Lots to choose from. Could use local search. Greedy search…
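A hedged sketch of one greedy learner, using the classes above and the majority_label helper from the pruning sketch: at each node, split on the feature whose two branches make the fewest mistakes when each branch predicts its own majority label, then recurse. The error-based split criterion is an assumption; information gain is a common alternative.

```python
def grow_tree(examples, features=range(7)):
    """Greedy top-down construction: stop at a pure leaf, otherwise split on the usable
    feature whose branches, each labeled by its majority, misclassify the fewest examples."""
    labels = {label for _, _, label in examples}
    if len(labels) == 1:
        return Leaf(labels.pop())

    # A feature is usable here only if it actually separates these examples.
    usable = [f for f in features
              if any(ex[1][f] == 0 for ex in examples) and any(ex[1][f] == 1 for ex in examples)]
    if not usable:
        return Leaf(majority_label(examples))   # identical vectors, conflicting labels

    def split_cost(f):
        cost = 0
        for bit in (0, 1):
            side = [ex for ex in examples if ex[1][f] == bit]
            best = majority_label(side)
            cost += sum(1 for _, _, label in side if label != best)
        return cost

    best_f = min(usable, key=split_cost)
    zeros = [ex for ex in examples if ex[1][best_f] == 0]
    ones = [ex for ex in examples if ex[1][best_f] == 1]
    return Node(best_f, grow_tree(zeros, features), grow_tree(ones, features))
```

On data with no contradictory examples, this drives training error to zero, producing the kind of tree the pruning sketch above then trims back using the C-V set.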
Why Might This Fail?
No target function, just noise.
Target function too complex (2^(2^n) possibilities; e.g., parity).
Training data doesn't match target function (PAC bounds).
Theory: PAC Learning
Probably Approximately Correct.
Training/testing from the same distribution.
With probability 1-δ, the learned rule will have error smaller than ε.
Bounds on the size of the training set in terms of ε, δ, and the “dimensionality” of the target concept.
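This is the flavor of bound being referred to: for a finite hypothesis class H and a learner that fits the training data exactly, a standard result gives the sample size below. Reading the slide's “dimensionality” loosely as ln|H| is an assumption, and the worked numbers that follow are an illustration, not from the course.

```latex
% With probability at least 1-\delta, a hypothesis consistent with m training
% examples has true error below \epsilon, provided
m \;\ge\; \frac{1}{\epsilon}\left(\ln\lvert H\rvert + \ln\frac{1}{\delta}\right)
```

For instance, taking H to be all Boolean functions of the 7 features above gives |H| = 2^(2^7), so ln|H| = 2^7 ln 2 ≈ 88.7; with ε = 0.1 and δ = 0.05 the bound asks for roughly 10 × (88.7 + 3.0) ≈ 917 training examples.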
Classification Naïve Bayes classifier Differentiation vs. modeling More on this later.
What to Learn Decision tree representation Memorization problem: causes and cures (cross-validation, pruning) Greedy heuristic for finding small trees with low error
Homework 9 (due 12/5)
Write a program that decides whether a pair of words are synonyms using WordNet. I’ll send you the list, you send me the answers.
Draw a decision tree that represents (a) f1 + f2 + … + fn (or), (b) f1 f2 … fn (and), (c) parity (an odd number of features “on”).
More soon.
