ML Study 
PAC Learning 
2014.09.11 
Sanghyuk Chun
Overview 
• ML intro & Decision tree 
• Bayesian Methods 
• Regression 
• Graphical Model 1 
• Graphical Model 2 (EM) 
• PAC learning 
• Hidden Markov Models 
• Learning Representations 
• Neural Network 
• Support Vector Machine 
• Reinforcement Learning 
Basic Concepts 
Model and Algorithms 
PAC: Theory for ML algorithms
Computational Learning Theory 
• Computational learning theory is a mathematical and 
theoretical field concerned with the analysis of machine 
learning algorithms 
• We seek theory to relate: 
• Probability of successful learning 
• Number of training examples 
• Complexity of the hypothesis space 
• Accuracy to which the target function is approximated 
• Manner in which training examples are presented
Prototypical Concept Learning Task 
• Given: 
• Instances X (set of instances or objects in the world) 
• Target concept c (a subset of the instance space) 
• Hypotheses H (a collection of concepts over X) 
• Training data D (examples from the instance space) 
• Determine: 
• A hypothesis h in H such that h(x) = c(x) for all x in D? → Training error 
• A hypothesis h in H such that h(x) = c(x) for all x in X? → True error
Function Approximation: Overview 
• c is the target function (or concept): the function we want to 
approximate using the hypothesis space H 
• h is the hypothesis in H that is “best” on the 
training data D 
• There is no free lunch!! 
Generalization beyond the training data is impossible without 
more assumptions 
(for example, the regularization term in logistic regression)
True error and Training error 
• True error of hypothesis h with respect to c 
• How often h(x) ≠ c(x) over random instances drawn from the underlying distribution: 
errorD(h) = Pr over x [ h(x) ≠ c(x) ] 
• Training error of hypothesis h with respect to c 
• How often h(x) ≠ c(x) over the m training instances: 
errortrain(h) = (1/m) Σ over training xi of 1[ h(xi) ≠ c(xi) ]
True error and Training error 
• We use the “Empirical Risk Minimization” (ERM) 
method, which selects the hypothesis h that minimizes 
the training error 
• Problem: errortrain(h) is a biased approximation of 
errorD(h) 
• Since h is selected using the training data (i.e., 
errortrain(h) depends on h), errortrain(h) is a 
biased approximation of errorD(h) 
• For the selected h, it is likely to be an underestimate!
True error and Test error 
• Question: Since it is impossible to know the 
exact true error errorD(h), is there an unbiased 
approximation of errorD(h)? How can we 
measure the ‘true’ performance of hypothesis h? 
• Answer: The test error is an unbiased approximation 
of errorD(h) 
• because the test set consists of i.i.d. samples drawn from the 
true distribution, independently of h
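To make the bias concrete, here is a minimal Python sketch (the 1-D threshold concept and the 101-hypothesis threshold space are invented purely for illustration). It repeatedly runs ERM on a small training set and shows that, on average, the training error of the selected hypothesis underestimates its true error, while the error on an independent test set does not.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_THRESHOLD = 0.5                   # hypothetical target concept c(x) = 1[x > 0.5]
CANDIDATES = np.linspace(0, 1, 101)    # finite hypothesis space: 101 threshold classifiers

def labels(x, t):
    return (x > t).astype(int)

def error(x, y, t):
    return np.mean(labels(x, t) != y)

m, n_test, n_true, trials = 20, 20, 100_000, 200
gaps_train, gaps_test = [], []
for _ in range(trials):
    x_train = rng.random(m);      y_train = labels(x_train, TRUE_THRESHOLD)
    x_test  = rng.random(n_test); y_test  = labels(x_test,  TRUE_THRESHOLD)
    x_big   = rng.random(n_true); y_big   = labels(x_big,   TRUE_THRESHOLD)

    # Empirical Risk Minimization: pick the threshold with the lowest training error
    h = min(CANDIDATES, key=lambda t: error(x_train, y_train, t))

    err_train = error(x_train, y_train, h)
    err_test  = error(x_test,  y_test,  h)
    err_true  = error(x_big,   y_big,   h)   # Monte Carlo estimate of errorD(h)
    gaps_train.append(err_train - err_true)
    gaps_test.append(err_test - err_true)

print("mean(errortrain - errorD):", np.mean(gaps_train))   # systematically negative (biased)
print("mean(errortest  - errorD):", np.mean(gaps_test))    # approximately zero (unbiased)
```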
Overfitting 
• Hypothesis h in H overfits the training data if there is an 
alternative hypothesis h’ in H such that 
• errortrain(h) < errortrain(h’) and 
• errorD(h) > errorD(h’) 
• A more “complex” model suffers more from overfitting (Occam’s razor) 
• What if the training set size goes to infinity? 
Or, how many training samples do we need?
PAC learning 
• PAC learning, or Probably Approximately Correct 
learning, is a framework for the mathematical analysis 
of machine learning 
• Goal of PAC: with high probability (“Probably”), 
the selected hypothesis will have low error 
(“Approximately Correct”) 
• Assume there is no error (noise) in the data
PAC learning: 
finite hypothesis space 
• As we saw before, the training error underestimates 
the true error 
• In PAC learning, we seek theory to relate: 
• The number of training samples: m 
• The gap between training and true errors: ε 
• errorD(h) ≤ errortrain(h) + ε 
• Complexity of the hypothesis space: |H| 
• Confidence of the relation: at least (1-δ)
Special case: errortrain(h)=0 
• Assume errortrain(h)=0, i.e. the target concept c is in the 
hypothesis space H 
• errorD(h) ≤ errortrain(h) + ε 
• errorD(h) ≤ ε 
• What is the probability that there exists a consistent 
hypothesis with true error > ε? 
• i.e. express δ using m, ε, |H| 
• Result (proof: see appendix): 
Pr[∃h in H: errortrain(h) = 0 and errorD(h) > ε] ≤ |H|e^(-εm)
Bounds for 
finite hypothesis space 
• Pr[∃h in H: errortrain(h) = 0 and errorD(h) > ε] ≤ |H|e^(-εm) 
• Suppose we want this probability to be at most δ 
(so the confidence of the relation is 1-δ), i.e. |H|e^(-εm) ≤ δ 
• How many training examples suffice? 
• m ≥ (1/ε)(ln|H| + ln(1/δ)) 
• If errortrain(h)=0, then with probability at least 1-δ 
• errorD(h) ≤ (1/m)(ln|H| + ln(1/δ))
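A small Python helper, as a sketch, that plugs numbers into these two bounds; the conjunction-style hypothesis space used in the example (|H| = 3^10) is only an illustrative choice.

```python
import math

def pac_sample_size(H_size, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (ln|H| + ln(1/delta)),
    the realizable-case bound for a finite hypothesis space."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

def pac_error_bound(H_size, m, delta):
    """epsilon such that errorD(h) <= epsilon holds with prob. >= 1 - delta
    for any consistent h (errortrain(h) = 0)."""
    return (math.log(H_size) + math.log(1.0 / delta)) / m

# Example: conjunctions over n=10 boolean literals, |H| = 3^10 (illustrative only)
H = 3 ** 10
print(pac_sample_size(H, epsilon=0.1, delta=0.05))   # 140 examples suffice
print(pac_error_bound(H, m=1000, delta=0.05))        # ~0.014
```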
Agnostic learning 
• Assume errortrain(h) ≠ 0, i.e. the target function c is not in the 
hypothesis space H 
• Again, in PAC learning, we seek theory to relate: 
• The number of training samples: m 
• The gap between training and true errors: ε 
• errorD(h) ≤ errortrain(h) + ε 
• Complexity of the hypothesis space: |H| 
• Confidence of the relation: at least (1-δ) 
• The bound on δ is derived from Hoeffding bounds
Bounds for 
finite hypothesis space 
• The bound on δ: 
Pr[∃h in H: errorD(h) > errortrain(h) + ε] ≤ |H|e^(-2mε²) ≤ δ 
• We get a new answer! 
• errorD(h) ≤ errortrain(h) + sqrt((ln|H| + ln(1/δ)) / 2m) 
(true error ≤ training error + degree of overfitting)
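The agnostic-case counterpart, again as a sketch with the same illustrative hypothesis space: it evaluates the overfitting gap for a given m, and the sample size needed to make that gap at most ε.

```python
import math

def agnostic_overfitting_gap(H_size, m, delta):
    """Hoeffding/union-bound gap: with prob. >= 1 - delta,
    errorD(h) <= errortrain(h) + sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

def agnostic_sample_size(H_size, epsilon, delta):
    """Smallest m making the gap above at most epsilon:
    m >= (1/(2 epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / (2 * epsilon ** 2))

H = 3 ** 10                                               # same illustrative |H| as before
print(agnostic_overfitting_gap(H, m=1000, delta=0.05))    # ~0.084
print(agnostic_sample_size(H, epsilon=0.1, delta=0.05))   # 700 examples
```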
Intuition 
• The bound on the number of training samples is 
m ≥ (1/(2ε²))(ln|H| + ln(1/δ)) 
• The bound on the true error is 
errorD(h) ≤ errortrain(h) + sqrt((ln|H| + ln(1/δ)) / 2m) 
• We can improve the performance of the algorithm by 
• Decreasing the training error errortrain(h) 
• Increasing the number of training samples m 
• Choosing an H which is “simple” (Occam’s Razor)
PAC learnable 
• A concept class C is PAC learnable by learner L using hypothesis space H 
if, for any c in C, any distribution over X, and any ε, δ > 0, 
L outputs (with probability at least 1-δ) a hypothesis h with errorD(h) ≤ ε, 
using a number of samples and an amount of computation 
polynomial in 1/ε, 1/δ, and the size of the problem
PAC learning: 
infinite hypothesis space 
• Bound for a finite hypothesis space: 
errorD(h) ≤ errortrain(h) + sqrt((ln|H| + ln(1/δ)) / 2m) 
• What if the hypothesis space is infinite, i.e. |H| → ∞?
VC dimension 
• The VC (Vapnik–Chervonenkis) dimension, VC(H), is a 
measure of the capacity of a classification algorithm 
• It is defined as the cardinality (size) of the largest set of 
points that the algorithm can shatter 
• Shatter: for every possible labeling of the points, the algorithm 
can classify all of them correctly
VC dimension example 
• Linear classifier in 2-D: 
a line can shatter 3 points in general position, but not 4 
• For d dimensions, a linear classifier can shatter 
d+1 points, so VC(H) = d+1 
• 1-Nearest neighbor method? 
VC(H) = ∞, since 1-NN perfectly fits any labeling of any 
set of distinct points 
• Decision tree with k boolean variables 
• VC(H) = 2^k, because we can shatter 2^k examples using a tree with 2^k leaf 
nodes, and we cannot shatter 2^k + 1 examples 
http://www.cs.cmu.edu/~guestrin/Class/15781/slides/learningtheory-bns-annotated.pdf
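A small Python sketch of the 2-D linear-classifier example: it checks linear separability of every labeling with a feasibility LP (scipy.optimize.linprog), confirming that 3 points in general position can be shattered while the 4-point XOR configuration cannot. The specific point sets are chosen just for illustration.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility check: does there exist (w, b) with y_i * (w.x_i + b) >= 1 for all i?"""
    A_ub, b_ub = [], []
    for x, y in zip(points, labels):
        # y*(w.x + b) >= 1  <=>  -y*x1*w1 - y*x2*w2 - y*b <= -1
        A_ub.append([-y * x[0], -y * x[1], -y])
        b_ub.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0   # 0 = feasible solution found, 2 = infeasible

def shattered(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]          # 3 points in general position
four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # includes the XOR configuration
print(shattered(three))   # True  -> VC(linear classifiers in 2-D) >= 3
print(shattered(four))    # False -> this 4-point set cannot be shattered
```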
VC(H) and |H| 
• Is there any relation between VC(H) and |H|? 
• Let VC(H) = k (k a finite value) 
• H can shatter some set of k examples 
• There are 2^k possible labelings of those examples 
• Since H must contain a hypothesis realizing each of the 2^k 
labelings, we get |H| ≥ 2^k 
• i.e. k ≤ log2|H| (this is a very loose bound)
Bounds for infinite hypothesis space 
• In PAC learning, we seek theory to relate: 
• The number of training samples: m 
• errorD(h) ≤ errortrain(h) + ε 
• VC dimension: VC(H) (|H| and VC(H) are related) 
• Confidence of the relation: at least (1-δ) 
• The bound on m: 
m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε)) 
• The bound on the error: 
errorD(h) ≤ errortrain(h) + sqrt((VC(H)·(ln(2m/VC(H)) + 1) + ln(4/δ)) / m) 
• The added term is an increasing function of VC(H) for VC(H) < 2m: 
for sufficiently large training data, 
Occam’s razor still works here
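As a rough sketch, the two VC-based bounds as Python functions; the exact constants differ between textbooks, so the numbers below are only indicative.

```python
import math

def vc_sample_size(vc, epsilon, delta):
    """m >= (1/eps) * (4 log2(2/delta) + 8 VC(H) log2(13/eps)),
    the classic realizable-case bound (constants vary by source)."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / epsilon)) / epsilon)

def vc_error_gap(vc, m, delta):
    """sqrt((VC(H)*(ln(2m/VC(H)) + 1) + ln(4/delta)) / m): gap added to errortrain(h)."""
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

# Linear classifiers in 10 dimensions: VC(H) = d + 1 = 11 (illustrative numbers)
print(vc_sample_size(vc=11, epsilon=0.1, delta=0.05))   # ~6,400 examples
print(vc_error_gap(vc=11, m=10_000, delta=0.05))        # ~0.10
```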
Structural Risk Minimization 
• Question: Is there any better criterion for choosing a 
hypothesis than empirical risk minimization? 
• Answer: choose h (and H) to minimize a bound on the 
expected true error (Structural Risk Minimization) 
• h* = argmin over h of [ errortrain(h) + sqrt((VC(H)·(ln(2m/VC(H)) + 1) + ln(4/δ)) / m) ] 
• i.e. pick the hypothesis that minimizes the structural risk
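A minimal sketch of SRM over a few nested hypothesis classes: given (training error, VC dimension) pairs, it picks the class minimizing training error plus the VC penalty. The candidate numbers are hypothetical.

```python
import math

def vc_penalty(vc, m, delta):
    """Complexity term from the VC bound (same expression as above)."""
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

def structural_risk_minimization(candidates, m, delta):
    """candidates: list of (train_error, vc_dim) pairs, one per nested class H1 ⊂ H2 ⊂ ...
    Returns the index of the class minimizing training error + VC penalty, and all risks."""
    risks = [err + vc_penalty(vc, m, delta) for err, vc in candidates]
    return min(range(len(candidates)), key=lambda i: risks[i]), risks

# Hypothetical results for three nested model classes on m = 50,000 examples:
# richer classes fit the training data better but pay a larger complexity penalty.
candidates = [(0.15, 3), (0.06, 25), (0.05, 400)]
best, risks = structural_risk_minimization(candidates, m=50_000, delta=0.05)
print(best, [round(r, 3) for r in risks])   # SRM picks the middle class (index 1) here
```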
Appendix: bound for 
finite hypothesis space 
• Call a hypothesis h’ “bad” if errorD(h’) > ε, and let B ⊆ H be the set of 
bad hypotheses 
• For a single bad h’, the probability that it is consistent with one random 
training example is at most 1-ε, so the probability that it is consistent 
with all m (independent) training examples is at most (1-ε)^m 
• If the selected h has errorD(h) > ε and errortrain(h) = 0, then some 
h’ in B is consistent with the m training data points 
• h itself is one example, and there could be many more 
• Taking the union bound over B, this implies that 
Pr[∃h in H: errortrain(h) = 0 and errorD(h) > ε] ≤ |B|(1-ε)^m ≤ |H|(1-ε)^m ≤ |H|e^(-εm)
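As a sanity check, a small Monte Carlo sketch (using a toy threshold hypothesis space invented for illustration) that empirically estimates the probability that an ε-bad consistent hypothesis exists and compares it with the |H|e^(-εm) bound; the bound holds, though loosely.

```python
import numpy as np

rng = np.random.default_rng(1)

THRESHOLDS = np.linspace(0, 1, 101)        # finite hypothesis space, |H| = 101
TARGET = 0.5                               # target concept c(x) = 1[x > 0.5]
true_errors = np.abs(THRESHOLDS - TARGET)  # errorD(h) under a uniform distribution on [0,1]

def bad_consistent_exists(m, epsilon):
    """One trial: draw m examples; is some h with errorD(h) > epsilon consistent with all of them?"""
    x = rng.random(m)
    y = (x > TARGET).astype(int)
    consistent = np.array([np.all((x > t).astype(int) == y) for t in THRESHOLDS])
    return np.any(consistent & (true_errors > epsilon))

m, epsilon, trials = 50, 0.1, 5_000
empirical = np.mean([bad_consistent_exists(m, epsilon) for _ in range(trials)])
bound = len(THRESHOLDS) * np.exp(-epsilon * m)
print(f"empirical probability: {empirical:.4f}   bound |H|e^(-eps*m): {bound:.4f}")
```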
