Computational Learning Theory



  1. CS 9633 Machine Learning: Computational Learning Theory (adapted from notes by Tom Mitchell)
  2. Theoretical Characterization of Learning Problems
    - Under what conditions is successful learning possible, and when is it impossible?
    - Under what conditions is a particular learning algorithm assured of learning successfully?
  3. Two Frameworks
    - PAC (Probably Approximately Correct) learning framework: identify classes of hypotheses that can and cannot be learned from a polynomial number of training examples
      - Defines a natural measure of complexity for hypothesis spaces that allows bounding the number of training examples needed
    - Mistake bound framework
  4. Theoretical Questions of Interest
    - Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm?
    - Can one characterize the number of training examples necessary or sufficient to assure successful learning?
    - How is the number of examples needed affected
      - if the learner observes a random sample of training data?
      - if the learner is allowed to pose queries to the trainer?
    - Can one characterize the number of mistakes that a learner will make before learning the target function?
    - Can one characterize the inherent computational complexity of classes of learning problems?
  5. Computational Learning Theory
    - Relatively recent field
    - Area of intense research
    - For some of the questions above, the partial answer is yes
    - We will generally focus on certain types of learning problems
  6. Inductive Learning of a Target Function
    - What we are given
      - Hypothesis space
      - Training examples
    - What we want to know
      - How many training examples are sufficient to successfully learn the target function?
      - How many mistakes will the learner make before succeeding?
  7. Questions for Broad Classes of Learning Algorithms
    - Sample complexity: how many training examples do we need to converge, with high probability, to a successful hypothesis?
    - Computational complexity: how much computational effort is needed to converge, with high probability, to a successful hypothesis?
    - Mistake bound: how many training examples will the learner misclassify before converging to a successful hypothesis?
  8. PAC Learning
    - Probably Approximately Correct learning model
    - We restrict the discussion to learning boolean-valued concepts from noise-free data
  9. Problem Setting: Instances and Concepts
    - X is the set of all possible instances over which target functions may be defined
    - C is the set of target concepts the learner is to learn
      - Each target concept c in C is a subset of X
      - Equivalently, each target concept c in C is a boolean function c: X → {0, 1}
        - c(x) = 1 if x is a positive example of the concept
        - c(x) = 0 otherwise
  10. Problem Setting: Distribution
    - Instances are generated at random according to some probability distribution D
      - D may be any distribution
      - D is generally not known to the learner
      - D is required to be stationary (it does not change over time)
    - Training examples x are drawn at random from X according to D and presented to the learner together with their target value c(x)
  11. Problem Setting: Hypotheses
    - Learner L considers a set of hypotheses H
    - After observing a sequence of training examples of the target concept c, L must output some hypothesis h from H, its estimate of c
  12. Example Problem (Classifying Executables)
    - Three classes (Malicious, Boring, Funny)
    - Features
      - a1: GUI present (yes/no)
      - a2: deletes files (yes/no)
      - a3: allocates memory (yes/no)
      - a4: creates a new thread (yes/no)
    - Distribution?
    - Hypotheses?
  13. Training examples

      Instance  a1   a2   a3   a4   Class
      1         Yes  No   No   Yes  B
      2         Yes  No   No   No   B
      3         No   Yes  Yes  No   F
      4         No   No   Yes  Yes  M
      5         Yes  No   No   Yes  B
      6         Yes  No   No   No   F
      7         Yes  Yes  Yes  No   M
      8         Yes  Yes  No   Yes  M
      9         No   No   No   Yes  B
      10        No   No   Yes  No   M
  14. True Error
    - Definition: the true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
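Since error_D(h) is a probability under D, it can be approximated by Monte Carlo sampling. A minimal sketch, assuming (purely for illustration) a uniform distribution D on [0, 1) and simple threshold concepts:

```python
import random

def estimate_true_error(h, c, sample, n=100_000):
    """Estimate error_D(h) = Pr_{x ~ D}[h(x) != c(x)] by drawing n
    instances from D (represented here by the sampler `sample`)."""
    return sum(h(x) != c(x) for x in (sample() for _ in range(n))) / n

random.seed(0)
c = lambda x: x < 0.5   # target concept (illustrative)
h = lambda x: x < 0.6   # hypothesis; disagrees with c exactly on [0.5, 0.6)
est = estimate_true_error(h, c, random.random)
# Under the uniform D, the disagreement region has probability mass 0.1,
# so the estimate should be close to 0.1.
```

The learner never sees this quantity directly, which is exactly the point of the next slide.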
  15. [Figure: error of h with respect to c — instance space X, with the regions where h and c disagree marked + and −]
  16. Key Points
    - True error is defined over the entire instance space, not just the training data
    - The error depends strongly on the unknown probability distribution D
    - The error of h with respect to c is not directly observable to the learner L, which can only observe performance with respect to the training data (training error)
    - Question: how probable is it that the observed training error for h gives a misleading estimate of the true error?
  17. PAC Learnability
    - Goal: characterize classes of target concepts that can be reliably learned
      - from a reasonable number of randomly drawn training examples, and
      - using a reasonable amount of computation
    - It is unreasonable to expect perfect learning, where error_D(h) = 0
      - We would need to provide training examples corresponding to every possible instance
      - With a random sample of training examples, there is always a non-zero probability that the training examples will be misleading
  18. Weakening the Demand on the Learner
    - Hypothesis error ("approximately")
      - We do not require a zero-error hypothesis
      - We require only that the error be bounded by some constant ε that can be made arbitrarily small
      - ε is the error parameter
    - Failure probability ("probably")
      - We do not require that the learner succeed on every sequence of randomly drawn training examples
      - We require only that its probability of failure be bounded by a constant δ that can be made arbitrarily small
      - δ is the confidence parameter
  19. Definition of PAC-Learnability
    - Definition: Consider a concept class C defined over a set of instances X of length n, and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
  20. Requirements of the Definition
    - L must, with arbitrarily high probability (1 − δ), output a hypothesis having arbitrarily low error (ε)
    - L's learning must be efficient: its running time grows at most polynomially in
      - the strength of the demands on the output hypothesis (1/ε, 1/δ)
      - the inherent complexity of the instance space (n) and of the concept class C (size(c))
  21. Block Diagram of the PAC Learning Model
    - Inputs: a training sample and the control parameters ε, δ
    - Learning algorithm L
    - Output: hypothesis h
  22. Examples of the Second Requirement
    - Consider the executables problem, where instances are conjunctions of boolean features:
      - a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no
    - Concepts are conjunctions of a subset of the features:
      - a1=yes ∧ a3=yes ∧ a4=yes
  23. Using the Concept of PAC Learning in Practice
    - We often want to know how many training instances we need in order to achieve a certain level of accuracy with a specified probability.
    - If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.
  24. Sample Complexity
    - The sample complexity of a learning problem is the growth in the number of required training examples with problem size.
    - We will determine the sample complexity for consistent learners.
      - A learner is consistent if it outputs hypotheses that perfectly fit the training data whenever possible.
      - All of the algorithms in Chapter 2 are consistent learners.
  25. Recall the Definition of Version Space
    - The version space, denoted VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.
  26. Version Spaces and PAC Learning by Consistent Learners
    - Every consistent learner outputs a hypothesis belonging to the version space, regardless of the instance space X, hypothesis space H, or training data D.
    - To bound the number of examples needed by any consistent learner, we need only bound the number of examples needed to assure that the version space contains no unacceptable hypotheses.
  27. ε-Exhausted Version Spaces
    - Definition: Consider a hypothesis space H, target concept c, instance distribution D, and set of training examples D of c. The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than ε with respect to c and D.
  28. [Figure: exhausting the version space — hypothesis space H, with hypotheses annotated by true error and training error r, e.g. (error = 0.1, r = 0.2), (error = 0.3, r = 0.2), (error = 0.2, r = 0), (error = 0.1, r = 0), (error = 0.3, r = 0.4), (error = 0.2, r = 0.3)]
  29. Exhausting the Version Space
    - Only an observer who knows the identity of the target concept can determine with certainty whether the version space is ε-exhausted.
    - But we can bound the probability that the version space will be ε-exhausted after a given number of training examples
      - without knowing the identity of the target concept, and
      - without knowing the distribution from which the training examples were drawn.
  30. Theorem 7.1
    - Theorem 7.1 (ε-exhausting the version space). If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent, randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted (with respect to c) is less than or equal to |H| e^(−εm).
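The bound of Theorem 7.1 is easy to evaluate numerically. A small sketch; the numbers |H| = 973 and ε = 0.1 are illustrative, chosen to match the EnjoySport example a few slides below:

```python
import math

def prob_not_exhausted(H_size, epsilon, m):
    """Theorem 7.1 upper bound on the probability that the version space
    is NOT epsilon-exhausted after m i.i.d. examples: |H| * e^(-eps * m)."""
    return H_size * math.exp(-epsilon * m)

# The bound decays exponentially in m (it is vacuous, i.e. > 1, for small m):
bounds = {m: prob_not_exhausted(973, 0.1, m) for m in (25, 50, 99)}
```

With m = 99 the bound drops below 0.05, which is exactly the calculation behind Eq. 7.2.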
  31. Proof of Theorem 7.1
    - See the text.
  32. Number of Training Examples (Eq. 7.2)
    - Requiring |H| e^(−εm) ≤ δ and solving for m gives

      m ≥ (1/ε)(ln|H| + ln(1/δ))
  33. Summary of the Result
    - The inequality on the previous slide provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H, for any desired values of ε and δ.
    - This number m of training examples is sufficient to assure that any consistent hypothesis will be
      - probably (with probability 1 − δ)
      - approximately (within error ε) correct.
    - The value of m grows
      - linearly with 1/ε
      - logarithmically with 1/δ
      - logarithmically with |H|
    - The bound can be a substantial overestimate.
  34. Problem
    - Suppose we have the instance space described for the EnjoySport problem:
      - Sky (Sunny, Cloudy, Rainy)
      - AirTemp (Warm, Cold)
      - Humidity (Normal, High)
      - Wind (Strong, Weak)
      - Water (Warm, Cold)
      - Forecast (Same, Change)
    - Hypotheses are as before, e.g.
      - (?, Warm, Normal, ?, ?, Same), (0, 0, 0, 0, 0, 0)
    - How many training examples do we need to have an error rate of less than 10% with a probability of 95%?
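The question above can be answered with Eq. 7.2, m ≥ (1/ε)(ln|H| + ln(1/δ)). A sketch of the arithmetic, counting hypotheses the Chapter 2 way (each attribute may be "?" or one of its values, plus the single everything-negative hypothesis):

```python
import math

# Values per attribute: Sky, AirTemp, Humidity, Wind, Water, Forecast
values = [3, 2, 2, 2, 2, 2]
# Each attribute may also be "?"; add the one hypothesis that rejects everything.
H_size = math.prod(v + 1 for v in values) + 1   # 4*3*3*3*3*3 + 1 = 973

epsilon, delta = 0.10, 0.05   # error < 10% with probability 95%
m = math.ceil((1 / epsilon) * (math.log(H_size) + math.log(1 / delta)))
# m examples suffice for ANY consistent learner on this hypothesis space.
```

The result is m = 99, so roughly a hundred examples suffice despite the nearly thousand-hypothesis space, reflecting the logarithmic dependence on |H|.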
  35. Limits of Equation 7.2
    - Equation 7.2 tells us how many training examples suffice to ensure, with probability (1 − δ), that every hypothesis with zero training error has a true error of at most ε.
    - Problem: there may be no hypothesis consistent with the training data if the target concept is not in H. In that case, we want the minimum-error hypothesis.
  36. Agnostic Learning and Inconsistent Hypotheses
    - An agnostic learner does not assume that the target concept is contained in the hypothesis space.
    - We may then want the hypothesis with the minimum training error.
    - A bound similar to the previous one can be derived (based on Hoeffding bounds):

      m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
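A sketch evaluating the Hoeffding-based agnostic bound, m ≥ (1/(2ε²))(ln|H| + ln(1/δ)); note how the 1/ε² dependence makes it much larger than the consistent-learner bound for the same ε, δ, and |H| (numbers are illustrative):

```python
import math

def m_agnostic(H_size, epsilon, delta):
    """Hoeffding-based agnostic bound: m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))
    ensures, with probability 1 - delta, that every h in H has training error
    within epsilon of its true error."""
    return math.ceil((1 / (2 * epsilon ** 2)) *
                     (math.log(H_size) + math.log(1 / delta)))

m = m_agnostic(973, 0.1, 0.05)   # compare with 99 in the consistent case
```

For the EnjoySport numbers this gives m = 494: the price of dropping the assumption that c ∈ H is roughly a factor of 1/(2ε).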
  37. Concepts That Are PAC-Learnable
    - Proofs that a class of concepts is PAC-learnable usually consist of two steps:
      - Show that each target concept in C can be learned from a polynomial number of training examples.
      - Show that the processing time per training example is also polynomially bounded.
  38. PAC-Learnability of Conjunctions of Boolean Literals
    - Class C of target concepts described by conjunctions of boolean literals, e.g.:
      - GUI_Present ∧ ¬Opens_files
    - Is C PAC-learnable? Yes.
    - We will prove this by
      - showing that a polynomial number of training examples suffices to learn each concept, and
      - demonstrating an algorithm that uses polynomial time per training example.
  39. Examples Needed to Learn Each Concept
    - Consider a consistent learner that uses hypothesis space H = C.
    - Compute the number m of random training examples sufficient to ensure that L will, with probability (1 − δ), output a hypothesis with error at most ε.
    - We will use m ≥ (1/ε)(ln|H| + ln(1/δ)).
    - What is the size of the hypothesis space?
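The question has a clean closed form: each of the n boolean variables can appear as a positive literal, as a negated literal, or not at all, so |H| = 3^n and ln|H| = n ln 3. A sketch:

```python
import math

def m_conjunctions(n, epsilon, delta):
    """Eq. 7.2 specialized to conjunctions of boolean literals over n
    variables, where |H| = 3^n (each variable: positive, negated, absent)."""
    return math.ceil((1 / epsilon) * (n * math.log(3) + math.log(1 / delta)))

# Sample complexity grows only linearly in n:
m10 = m_conjunctions(10, 0.1, 0.05)
m20 = m_conjunctions(20, 0.1, 0.05)
```

Doubling n from 10 to 20 takes m from 140 to 250: linear in n, as Theorem 7.2 requires.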
  40. Complexity per Example
    - We need only show that, for some algorithm, a polynomial amount of time per training example suffices.
    - One way to do this is to exhibit such an algorithm.
    - Here we can use FIND-S as the learning algorithm.
    - FIND-S incrementally computes the most specific hypothesis consistent with the training examples seen so far, e.g. on the sequence:
      - Old ∧ Tired: +
      - ¬Old ∧ Happy: +
      - Tired: +
      - Old ∧ ¬Tired: −
      - Rich ∧ Happy: +
    - What is a bound on the time per example?
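A minimal sketch of FIND-S for boolean conjunctions (the attribute names are hypothetical, echoing the slide's literals). The hypothesis is kept as a set of required literals, and each example is processed in time linear in the number of variables, which answers the slide's question:

```python
def find_s(examples, variables):
    """FIND-S for conjunctions of boolean literals: start with the most
    specific hypothesis (every literal and its negation) and drop any
    literal violated by a positive example. Negatives are ignored."""
    h = {(v, b) for v in variables for b in (True, False)}
    for x, positive in examples:
        if positive:
            h = {(v, b) for (v, b) in h if x[v] == b}   # O(n) per example
    return h

def predict(h, x):
    """The conjunction h accepts x iff every required literal holds."""
    return all(x[v] == b for (v, b) in h)

# Hypothetical training data over two attributes
examples = [
    ({"old": True, "tired": True}, True),
    ({"old": True, "tired": False}, True),
    ({"old": False, "tired": True}, False),
]
h = find_s(examples, ["old", "tired"])   # only the literal old=True survives
```

Each positive example filters at most 2n literals, so the per-example cost is linear in n and independent of 1/ε, 1/δ, and size(c), as the proof of Theorem 7.2 requires.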
  41. Theorem 7.2
    - Theorem 7.2 (PAC-learnability of boolean conjunctions). The class C of conjunctions of boolean literals is PAC-learnable by the FIND-S algorithm using H = C.
  42. Proof of Theorem 7.2
    - Equation 7.4 shows that the sample complexity for this concept class is polynomial in n, 1/ε, and 1/δ, and independent of size(c). To incrementally process each training example, the FIND-S algorithm requires effort linear in n and independent of 1/ε, 1/δ, and size(c). Therefore, this concept class is PAC-learnable by the FIND-S algorithm.
  43. Interesting Results
    - Unbiased learners are not PAC-learnable because they require an exponential number of examples.
    - k-term DNF (disjunctive normal form) is not PAC-learnable: its sample complexity is polynomial, but finding a consistent hypothesis is computationally intractable.
    - k-CNF (conjunctive normal form) can express a superset of k-term DNF, yet it is PAC-learnable.
  44. Sample Complexity with Infinite Hypothesis Spaces
    - Two drawbacks of the previous result:
      - It often does not give a very tight bound on the sample complexity.
      - It applies only to finite hypothesis spaces.
    - The Vapnik-Chervonenkis dimension of H (VC dimension)
      - gives tighter bounds, and
      - applies to many infinite hypothesis spaces.
  45. Shattering a Set of Instances
    - Consider a subset of instances S from the instance space X.
    - Every hypothesis h imposes a dichotomy on S:
      - {x ∈ S | h(x) = 1}
      - {x ∈ S | h(x) = 0}
    - Given an instance set S, there are 2^|S| possible dichotomies.
    - The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over those instances.
  46. Shattering: Definition
    - Definition: A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with that dichotomy.
  47. Vapnik-Chervonenkis Dimension
    - The ability to shatter a set of instances is closely related to the inductive bias of the hypothesis space.
    - An unbiased hypothesis space is one that shatters the instance space X.
    - Sometimes X cannot be shattered, but a large subset of it can.
  48. Vapnik-Chervonenkis Dimension
    - Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
  49. [Figure: a shattered instance space]
  50. Example 1 of VC Dimension
    - The instance space X is the set of real numbers, X = R.
    - H is the set of intervals on the real number line; hypotheses have the form a < x < b.
    - What is VC(H)?
  51. [Figure: shattering the real number line — sample points −1.2, 3.4, 6.7, and the two-point subset −1.2, 3.4. What is VC(H)? What is |H|?]
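The figure's question can be checked exhaustively. A sketch, assuming hypotheses of the form a < x < b: an interval is consistent with a labeling exactly when no negative point lies strictly between the outermost positive points, so any 2 points can be shattered but no 3 points can (the +/−/+ labeling always fails), giving VC(H) = 2 even though |H| is infinite:

```python
from itertools import product

def interval_can_realize(points, labels):
    """Is some interval a < x < b consistent with this 0/1 labeling?"""
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:
        return True   # an interval containing none of the points labels all 0
    lo, hi = min(pos), max(pos)
    # Consistent iff no negative point falls between the extreme positives.
    return not any(lo < p < hi for p, lab in zip(points, labels) if not lab)

def shattered(points):
    """True iff every one of the 2^|S| dichotomies is realized by some interval."""
    return all(interval_can_realize(points, labels)
               for labels in product([0, 1], repeat=len(points)))

two_ok = shattered([-1.2, 3.4])          # True: 2 points shattered
three_ok = shattered([-1.2, 3.4, 6.7])   # False: +/-/+ cannot be realized
```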
  52. Example 2 of VC Dimension
    - The set X of instances consists of points on the x-y plane.
    - H is the set of all linear decision surfaces.
    - What is VC(H)?
  53. [Figure: shattering the x-y plane with 2 instances and with 3 instances. VC(H) = ? |H| = ?]
  54. Proving Limits on VC Dimension
    - If we find any set of instances of size d that can be shattered, then VC(H) ≥ d.
    - To show that VC(H) < d, we must show that no set of size d can be shattered.
  55. General Result for r-Dimensional Space
    - The VC dimension of linear decision surfaces in an r-dimensional space is r + 1.
  56. Example 3 of VC Dimension
    - The set X of instances consists of conjunctions of exactly three boolean literals, e.g.:
      - young ∧ happy ∧ single
    - H is the set of hypotheses described by a conjunction of up to 3 boolean literals.
    - What is VC(H)?
  57. Shattering Conjunctions of Literals
    - Approach: construct a set of instances of size 3 that can be shattered. Let instance i have positive literal l_i and all other literals negative. Representing instances that are conjunctions of literals l_1, l_2, and l_3 as bit strings:
      - Instance 1: 100
      - Instance 2: 010
      - Instance 3: 001
    - Construction of a dichotomy: to exclude an instance i, add the literal ¬l_i to the hypothesis.
    - The argument extends to n literals.
    - Can VC(H) be greater than n (the number of literals)?
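The construction above can be verified mechanically. A sketch: the instances are the bit strings 100, 010, 001, and the hypothesis for a dichotomy is the conjunction of ¬l_i for each instance i to be excluded (represented here simply as the set of negated indices):

```python
from itertools import product

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

def build_hypothesis(labels):
    """Slide's construction: negate l_i for every instance i labeled 0."""
    return {i for i, lab in enumerate(labels) if not lab}

def satisfies(x, negated):
    """x satisfies the conjunction of the literals ¬l_i, i in `negated`,
    iff x[i] == 0 at every negated position."""
    return all(x[i] == 0 for i in negated)

# Every one of the 2^3 dichotomies is realized by its constructed hypothesis,
# so the 3 instances are shattered and VC(H) >= 3.
all_realized = all(
    satisfies(x, build_hypothesis(labels)) == bool(lab)
    for labels in product([0, 1], repeat=3)
    for x, lab in zip(instances, labels)
)
```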
  58. Sample Complexity and the VC Dimension
    - We can derive a new bound on the number of randomly drawn training examples that suffice to probably approximately learn a target concept (how many examples do we need to ε-exhaust the version space with probability (1 − δ)?):

      m ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
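A sketch evaluating the VC-based sufficient-sample bound, m ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε)); the example takes VC(H) = 3 for linear decision surfaces in the plane, per the r + 1 result above:

```python
import math

def m_vc(vc, epsilon, delta):
    """VC-based sufficient sample size:
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((1 / epsilon) *
                     (4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / epsilon)))

# Linear decision surfaces in the plane: VC(H) = r + 1 = 3
m = m_vc(3, 0.1, 0.05)
```

The bound is usable even though H here is infinite, which the |H|-based bound of Eq. 7.2 cannot handle at all.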
  59. Comparing the Bounds
    - Based on |H| (Eq. 7.2): m ≥ (1/ε)(ln|H| + ln(1/δ))
    - Based on VC(H): m ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
  60. Lower Bound on Sample Complexity
    - Theorem 7.3 (lower bound on sample complexity). Consider any concept class C such that VC(C) ≥ 2, any learner L, any 0 < ε < 1/8, and any 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that if L observes fewer examples than

      max[(1/ε) log₂(1/δ), (VC(C) − 1)/(32ε)]

    - then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε.
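The theorem's threshold can be computed directly. A sketch (the values ε = 0.1 and δ = 0.005 are illustrative and chosen to satisfy the theorem's conditions ε < 1/8, δ < 1/100):

```python
import math

def m_lower_bound(vc, epsilon, delta):
    """Theorem 7.3 threshold: max[(1/eps)*log2(1/delta), (VC(C)-1)/(32*eps)].
    Any learner seeing fewer examples than this fails (error > eps) with
    probability >= delta for some distribution and target concept in C."""
    return max((1 / epsilon) * math.log2(1 / delta),
               (vc - 1) / (32 * epsilon))

lo = m_lower_bound(3, 0.1, 0.005)
# For large VC(C), the second term dominates, so the number of examples any
# learner needs grows at least linearly in the VC dimension.
```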