Introduction to Machine Learning
Reminders: HW6 posted. All HW solutions posted.
A Review: You will be asked to learn the parameters of a probabilistic model of C given A, B, using some data. The parameters model entries of a conditional probability distribution as coin flips:
P(C|A,B):  ¬A,¬B: θ4   ¬A,B: θ3   A,¬B: θ2   A,B: θ1
A Review: Example data and ML estimates: θ4 = P(C|¬A,¬B) = 6/11, θ3 = P(C|¬A,B) = 4/11, θ2 = P(C|A,¬B) = 1/11, θ1 = P(C|A,B) = 3/8.
A B C  #obs
0 0 0   5
0 0 1   6
0 1 0   7
0 1 1   4
1 0 0  10
1 0 1   1
1 1 0   5
1 1 1   3
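To make the count-based estimate concrete, here is a minimal Python sketch (added for this write-up, not part of the original slides; the data encoding is illustrative) that reproduces the ML estimates above as observed frequencies:

```python
# Hypothetical sketch: ML estimates of P(C=1 | A, B) are observed frequencies.
from collections import defaultdict

# (A, B, C, count) rows from the example data table above
data = [(0, 0, 0, 5), (0, 0, 1, 6), (0, 1, 0, 7), (0, 1, 1, 4),
        (1, 0, 0, 10), (1, 0, 1, 1), (1, 1, 0, 5), (1, 1, 1, 3)]

pos = defaultdict(int)   # counts with C = 1, keyed by (A, B)
tot = defaultdict(int)   # total counts, keyed by (A, B)
for a, b, c, n in data:
    tot[(a, b)] += n
    if c == 1:
        pos[(a, b)] += n

for ab in sorted(tot):
    print(ab, f"{pos[ab]}/{tot[ab]}")   # (0,0): 6/11, (0,1): 4/11, (1,0): 1/11, (1,1): 3/8
```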
Maximum Likelihood for BNs: For any BN, the ML parameters of any CPT are obtained as the fraction of matching observed values in the data. Example (Burglar, Earthquake, Alarm network) with N = 1000 observations: E occurs 500 times and B occurs 200 times, so P(E) = 0.5 and P(B) = 0.2. For the Alarm CPT: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|¬E,¬B: 1/380, giving
E B  P(A|E,B)
T T  0.95
T F  0.34
F T  0.95
F F  0.003
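As a hedged illustration (not from the slides; the sample format and function name are assumptions), the same frequency-counting idea extends to every CPT in the network:

```python
# Assumed sketch: estimate P(E), P(B), and P(A|E,B) from boolean samples (e, b, a).
from collections import Counter

def estimate_cpts(samples):
    n = len(samples)
    p_e = sum(e for e, _, _ in samples) / n          # P(E) = #E / N
    p_b = sum(b for _, b, _ in samples) / n          # P(B) = #B / N
    a_counts, totals = Counter(), Counter()
    for e, b, a in samples:
        totals[(e, b)] += 1
        a_counts[(e, b)] += a
    p_a = {eb: a_counts[eb] / totals[eb] for eb in totals}   # P(A=1 | E, B)
    return p_e, p_b, p_a
```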
Maximum A Posteriori: Same example data as above; MAP estimates assuming a Beta prior with a = b = 3 (virtual counts 2H, 2T): θ4 = 8/15, θ3 = 6/15, θ2 = 3/15, θ1 = 5/12.
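A minimal sketch (added, not from the slides) of the MAP rule used here: a Beta(a, b) prior contributes a-1 virtual heads and b-1 virtual tails to the counts.

```python
def map_estimate(heads, tails, a=3, b=3):
    # MAP estimate (mode of the Beta posterior): add a-1 and b-1 virtual counts
    return (heads + a - 1) / (heads + tails + (a - 1) + (b - 1))

print(map_estimate(6, 5))   # theta_4 = 8/15
print(map_estimate(3, 5))   # theta_1 = 5/12
```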
Discussion Where are we now? How does it relate to the future of AI? How will you be tested on it?
Motivation: The agent has made observations (data) and must now make sense of them (hypotheses). Hypotheses alone may be important (e.g., in basic science); they support inference (e.g., forecasting); and they let us take sensible actions (decision making). Learning is a basic component of economics, the social and hard sciences, engineering, …
Topics in Machine Learning.
Applications: document retrieval, document classification, data mining, computer vision, scientific discovery, robotics, …
Tasks & settings: classification, ranking, clustering, regression, decision-making; supervised, unsupervised, semi-supervised, active, reinforcement learning.
Techniques: Bayesian learning, decision trees, neural networks, support vector machines, boosting, case-based reasoning, dimensionality reduction, …
What is Learning? Mostly generalization from experience: "Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future" (M.R. Genesereth and N.J. Nilsson, Logical Foundations of AI, 1987). Concepts, heuristics, policies. Supervised vs. unsupervised learning.
Inductive Learning. Basic form: learn a function from examples. f is the unknown target function; an example is a pair (x, f(x)). Problem: find a hypothesis h such that h ≈ f, given a training set D of examples. This is an instance of supervised learning. Classification task: f takes values in {0, 1, …, C}. Regression task: f takes real values.
Inductive learning method: Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples), e.g., by curve fitting. Note that h = D, i.e., simply caching the data, is a trivial but perhaps uninteresting consistent solution.
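A small illustrative sketch (assumed data, not from the slides): fitting polynomials of increasing degree to the same training set. With one coefficient per data point the fit interpolates the data exactly, the "caching" solution mentioned above.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])          # noisy samples of an underlying f

for degree in (1, 2, len(x) - 1):
    coeffs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    max_err = np.abs(np.polyval(coeffs, x) - y).max()
    print(degree, max_err)                        # degree 4 drives training error to ~0
```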
Classification Task: The concept f(x) takes on values True and False. An example is positive if f is True, else it is negative. The set X of all examples is the example set. The training set is a subset of X (a small one!).
Logic-Based Inductive Learning: Here, examples (x, f(x)) take on discrete values. Note that the training set does not say whether an observable predicate is pertinent or not.
Rewarded Card Example: Deck of cards, with each card designated by [r,s], its rank and suit, and some cards "rewarded".
Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
((s=S) ∨ (s=C)) ⇒ BLACK(s)
((s=D) ∨ (s=H)) ⇒ RED(s)
Training set D: REWARD([4,C]), REWARD([7,C]), REWARD([2,S]), ¬REWARD([5,H]), ¬REWARD([J,S])
Possible inductive hypothesis: h ≡ (NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s])). There are several possible inductive hypotheses.
Learning a Logical Predicate (Concept Classifier): Set E of objects (e.g., cards). Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD). Example: CONCEPT may describe the precondition of an action, e.g., Unstack(C,A); then E is the set of states and CONCEPT(x) ⇔ HANDEMPTY ∈ x ∧ BLOCK(C) ∈ x ∧ BLOCK(A) ∈ x ∧ CLEAR(C) ∈ x ∧ ON(C,A) ∈ x. Learning CONCEPT is a step toward learning an action description.
Learning a Logical Predicate (Concept Classifier): Set E of objects (e.g., cards). Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD). Observable predicates A(x), B(x), … (e.g., NUM, RED). Training set: values of CONCEPT for some combinations of values of the observable predicates. Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates, e.g., CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)).
Hypothesis Space: A hypothesis is any sentence of the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built using the observable predicates. The set of all hypotheses is called the hypothesis space H. A hypothesis h agrees with an example if it gives the correct value of CONCEPT.
Inductive Learning Scheme (diagram): example set X = {[A, B, …, CONCEPT]} → training set D → hypothesis space H = {CONCEPT(x) ⇔ S(A,B,…)} → inductive hypothesis h.
Size of Hypothesis Space: With n observable predicates, the truth table defining CONCEPT has 2^n entries, and each entry can be filled with True or False. In the absence of any restriction (bias), there are therefore 2^(2^n) hypotheses to choose from. For n = 6, that is about 2×10^19 hypotheses!
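Quick arithmetic check (added, not in the slides):

```python
n = 6
print(2 ** (2 ** n))   # 18446744073709551616, roughly 2 x 10**19
```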
Multiple Inductive Hypotheses:
h1 ≡ NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s])
h2 ≡ BLACK(s) ∧ ¬(r=J) ⇔ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇔ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇔ REWARD([r,s])
All of these agree with all the examples in the training set (see the sketch below), so we need a system of preferences, called an inductive bias, to compare possible hypotheses.
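A minimal sketch (hypothetical encoding, not from the slides) that checks each of the four hypotheses against the five training examples:

```python
# Encode the observable predicates and the four candidate hypotheses as predicates on (rank, suit).
NUM = lambda r: r in range(1, 11)
BLACK = lambda s: s in ('S', 'C')

h1 = lambda r, s: NUM(r) and BLACK(s)
h2 = lambda r, s: BLACK(s) and r != 'J'
h3 = lambda r, s: (r, s) in [(4, 'C'), (7, 'C'), (2, 'S')]
h4 = lambda r, s: (r, s) not in [(5, 'H'), ('J', 'S')]

# Training set D: ((rank, suit), REWARD)
D = [((4, 'C'), True), ((7, 'C'), True), ((2, 'S'), True),
     ((5, 'H'), False), (('J', 'S'), False)]

for name, h in [('h1', h1), ('h2', h2), ('h3', h3), ('h4', h4)]:
    print(name, all(h(r, s) == reward for (r, s), reward in D))   # all True
```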
Notion of Capacity It refers to the ability of a machine to learn any training set without error A machine with too much capacity is like a botanist with photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything he has seen before A machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree Good generalization can only be achieved when the right balance is struck between the accuracy attained on the training set and the capacity of the machine
Keep-It-Simple (KIS) Bias. Examples: use far fewer observable predicates than are available in the training set; constrain the learnt predicate, e.g., to use only "high-level" observable predicates such as NUM, FACE, BLACK, and RED, and/or to have simple syntax. Motivation: if a hypothesis is too complex it is not worth learning it (data caching does the job as well), and there are far fewer simple hypotheses than complex ones, hence the hypothesis space is smaller. Einstein: "A theory must be as simple as possible, but not simpler than this." If the bias allows only sentences S that are conjunctions of k << n predicates picked from the n observable predicates, then the size of H is O(n^k), as the sketch below illustrates.
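A tiny check of that claim (added, not from the slides; as an assumption it counts only positive conjunctions of distinct predicates):

```python
from math import comb

n, k = 6, 2
print(comb(n, k))   # 15 hypotheses under the KIS bias, versus 2**(2**6) ~ 2e19 unrestricted
```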
Relation to Bayesian Learning: Recall MAP learning: θ_MAP = argmax_θ P(D|θ) P(θ), where P(D|θ) is the likelihood (~ accuracy on the training set) and P(θ) is the prior (~ inductive bias). Equivalently, θ_MAP = argmin_θ [ -log P(D|θ) - log P(θ) ]: the first term measures accuracy on the training set, and the second is the minimum-length encoding of θ. MAP learning is thus a form of KIS bias.
Supervised Learning Flow Chart (diagram): the target concept (the unknown concept we want to approximate) generates datapoints; the observations we have seen form the training set; the learner, equipped with a hypothesis space (the choice of learning algorithm), produces an inductive hypothesis; its predictions on a test set (observations we will see in the future) give better quantities for assessing performance.
Capacity is Not the Only Criterion: Accuracy on the training set isn't the best measure of performance (diagram: learn a hypothesis from H on the training set D, then test on the example set X).
Generalization Error: A hypothesis h is said to generalize well if it achieves low error on all examples in X (diagram: learn on the training set, test on the example set X).
Assessing Performance of a Learning Algorithm: Samples from all of X are typically unavailable, so take out some of the training set, train on the remaining examples, and test on the excluded instances: cross-validation.
Cross-Validation (diagrams): split the original set of examples D into a training set and a testing set; train a hypothesis from H on the training set; evaluate the hypothesis on the testing set; and compare the true concept against the prediction (e.g., 9/13 correct).
Common Splitting Strategies: k-fold cross-validation and leave-one-out (diagram: the dataset is repeatedly partitioned into train and test portions), as sketched below.
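A minimal sketch of k-fold cross-validation (added, not from the slides; `train` and `test_error` are assumed user-supplied callables). Leave-one-out is the special case k = len(data).

```python
import random

def k_fold_cv(data, k, train, test_error):
    """data: list of examples; train(train_set) -> hypothesis;
    test_error(hypothesis, test_set) -> error rate."""
    data = list(data)
    random.shuffle(data)                      # random partition into k folds
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h = train(train_set)
        errors.append(test_error(h, test_set))
    return sum(errors) / k                    # average held-out error
```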
Cardinal sins of machine learning. Thou shalt not: train on examples in the testing set; nor form assumptions by "peeking" at the testing set, then formulating inductive bias.
Tennis Example: Evaluate a learning algorithm for PlayTennis = S(Temperature, Wind).
Tennis Example (split 1): trained hypothesis PlayTennis = (T = Mild or Cool) ∧ (W = Weak); training errors = 3/10, testing errors = 4/4.
Tennis Example (split 2): trained hypothesis PlayTennis = (T = Mild or Cool); training errors = 3/10, testing errors = 1/4.
Tennis Example (split 3): trained hypothesis PlayTennis = (T = Mild or Cool); training errors = 3/10, testing errors = 2/4.
Tennis Example: Sum of all testing errors across the three splits: 7/12.
How to construct a better learner? Ideas?
Next Time Decision tree learning
