ML Study 
PAC Learning 
2014.09.11 
Sanghyuk Chun
Overview 
• ML intro & Decision tree 
• Bayesian Methods 
• Regression 
• Graphical Model 1 
• Graphical Model 2 (EM) 
• PAC learning 
• Hidden Markov Models 
• Learning Representations 
• Neural Network 
• Support Vector Machine 
• Reinforcement Learning 
Basic Concepts 
Model and Algorithms 
PAC: Theory for ML algorithms
Computational Learning Theory 
• Computational learning theory is a mathematical and 
theoretical field concerned with the analysis of machine 
learning algorithms 
• We seek theory to relate: 
• Probability of successful learning 
• Number of training examples 
• Complexity of the hypothesis space 
• Accuracy to which the target function is approximated 
• Manner in which training examples are presented
Prototypical Concept Learning Task 
• Given: 
• Instances X (set of instances or objects in the world) 
• Target concept c (a subset of the instance space) 
• Hypotheses H (a collection of concepts over X) 
• Training data D (examples from the instance space) 
• Determine: 
• A hypothesis h in H such that h(x) = c(x) for all x in D? → Training error 
• A hypothesis h in H such that h(x) = c(x) for all x in X? → True error
Function Approximation: Overview 
• c is the target function (or concept): the function we want to 
approximate using the hypothesis space H 
• h is the hypothesis in H that is “best” on the 
training data D 
• There is no free lunch!! 
Generalization beyond the training data is impossible without 
more assumptions 
(for example, the regularization term in logistic regression)
True error and Training error 
• True error of hypothesis h with respect to c 
• How often h(x) ≠ c(x) over random instances drawn from the underlying distribution: 
errorD(h) = Pr over x [ h(x) ≠ c(x) ] 
• Training error of hypothesis h with respect to c 
• How often h(x) ≠ c(x) over the m training instances: 
errortrain(h) = (1/m) Σ over training xi of 1[ h(xi) ≠ c(xi) ]
True error and Training error 
• We use the “Empirical Risk Minimization” (ERM) 
method, which selects the hypothesis h that minimizes 
the training error 
• Problem: errortrain(h) is a biased approximation of 
errorD(h) 
• Since h is selected using the training data (i.e., 
errortrain(h) depends on h), errortrain(h) is a 
biased approximation of errorD(h) 
• For the selected h, it is likely to be an underestimate!
True error and Test error 
• Question: Since it is impossible to know the 
exact true error errorD(h), is there an unbiased 
approximation of errorD(h)? How can we 
measure the ‘true’ performance of hypothesis h? 
• Answer: The test error is an unbiased approximation 
of errorD(h) 
• because the test set consists of i.i.d. samples drawn from the 
true distribution, independently of h
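To make the bias concrete, here is a minimal Python sketch (the 1-D threshold concept and the 101-hypothesis threshold space are invented purely for illustration). It repeatedly runs ERM on a small training set and shows that, on average, the training error of the selected hypothesis underestimates its true error, while the error on an independent test set does not.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_THRESHOLD = 0.5                   # hypothetical target concept c(x) = 1[x > 0.5]
CANDIDATES = np.linspace(0, 1, 101)    # finite hypothesis space: 101 threshold classifiers

def labels(x, t):
    return (x > t).astype(int)

def error(x, y, t):
    return np.mean(labels(x, t) != y)

m, n_test, n_true, trials = 20, 20, 100_000, 200
gaps_train, gaps_test = [], []
for _ in range(trials):
    x_train = rng.random(m);      y_train = labels(x_train, TRUE_THRESHOLD)
    x_test  = rng.random(n_test); y_test  = labels(x_test,  TRUE_THRESHOLD)
    x_big   = rng.random(n_true); y_big   = labels(x_big,   TRUE_THRESHOLD)

    # Empirical Risk Minimization: pick the threshold with the lowest training error
    h = min(CANDIDATES, key=lambda t: error(x_train, y_train, t))

    err_train = error(x_train, y_train, h)
    err_test  = error(x_test,  y_test,  h)
    err_true  = error(x_big,   y_big,   h)   # Monte Carlo estimate of errorD(h)
    gaps_train.append(err_train - err_true)
    gaps_test.append(err_test - err_true)

print("mean(errortrain - errorD):", np.mean(gaps_train))   # systematically negative (biased)
print("mean(errortest  - errorD):", np.mean(gaps_test))    # approximately zero (unbiased)
```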
Overfitting 
• Hypothesis h in H overfits the training data if there is an 
alternative hypothesis h’ in H such that 
• errortrain(h) < errortrain(h’) and 
• errorD(h) > errorD(h’) 
• A more “complex” model suffers more from overfitting (Occam’s razor) 
• What if the training set size goes to infinity? 
Or, how many training samples do we need?
PAC learning 
• PAC learning, or Probably Approximately Correct 
learning, is a framework for the mathematical analysis 
of machine learning 
• Goal of PAC: with high probability (“Probably”), 
the selected hypothesis will have low error 
(“Approximately Correct”) 
• Assume there is no error (noise) in the data
PAC learning: 
finite hypothesis space 
• As we saw before, the training error underestimates 
the true error 
• In PAC learning, we seek theory to relate: 
• The number of training samples: m 
• The gap between training and true errors: ε 
• errorD(h) ≤ errortrain(h) + ε 
• Complexity of the hypothesis space: |H| 
• Confidence of the relation: at least (1-δ)
Special case: errortrain(h)=0 
• Assume errortrain(h)=0, i.e. the target concept c is in the 
hypothesis space H 
• errorD(h) ≤ errortrain(h) + ε 
• errorD(h) ≤ ε 
• What is the probability that there exists a consistent 
hypothesis with true error > ε? 
• i.e. express δ using m, ε, |H| 
• Result (proof: see appendix): 
Pr[∃h in H: errortrain(h) = 0 and errorD(h) > ε] ≤ |H|e^(-εm)
Bounds for 
finite hypothesis space 
• Pr[∃h in H: errortrain(h) = 0 and errorD(h) > ε] ≤ |H|e^(-εm) 
• Suppose we want this probability to be at most δ 
(so the confidence of the relation is 1-δ), i.e. |H|e^(-εm) ≤ δ 
• How many training examples suffice? 
• m ≥ (1/ε)(ln|H| + ln(1/δ)) 
• If errortrain(h)=0, then with probability at least 1-δ 
• errorD(h) ≤ (1/m)(ln|H| + ln(1/δ))
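A small Python helper, as a sketch, that plugs numbers into these two bounds; the conjunction-style hypothesis space used in the example (|H| = 3^10) is only an illustrative choice.

```python
import math

def pac_sample_size(H_size, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (ln|H| + ln(1/delta)),
    the realizable-case bound for a finite hypothesis space."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

def pac_error_bound(H_size, m, delta):
    """epsilon such that errorD(h) <= epsilon holds with prob. >= 1 - delta
    for any consistent h (errortrain(h) = 0)."""
    return (math.log(H_size) + math.log(1.0 / delta)) / m

# Example: conjunctions over n=10 boolean literals, |H| = 3^10 (illustrative only)
H = 3 ** 10
print(pac_sample_size(H, epsilon=0.1, delta=0.05))   # 140 examples suffice
print(pac_error_bound(H, m=1000, delta=0.05))        # ~0.014
```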
Agnostic learning 
• Assume errortrain(h) ≠ 0, i.e. the target function c is not in the 
hypothesis space H 
• Again, in PAC learning, we seek theory to relate: 
• The number of training samples: m 
• The gap between training and true errors: ε 
• errorD(h) ≤ errortrain(h) + ε 
• Complexity of the hypothesis space: |H| 
• Confidence of the relation: at least (1-δ) 
• The bound on δ is derived from Hoeffding bounds
Bounds for 
finite hypothesis space 
• The bound on δ: 
Pr[∃h in H: errorD(h) > errortrain(h) + ε] ≤ |H|e^(-2mε²) ≤ δ 
• We get a new answer! 
• errorD(h) ≤ errortrain(h) + sqrt((ln|H| + ln(1/δ)) / 2m) 
(true error ≤ training error + degree of overfitting)
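The agnostic-case counterpart, again as a sketch with the same illustrative hypothesis space: it evaluates the overfitting gap for a given m, and the sample size needed to make that gap at most ε.

```python
import math

def agnostic_overfitting_gap(H_size, m, delta):
    """Hoeffding/union-bound gap: with prob. >= 1 - delta,
    errorD(h) <= errortrain(h) + sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

def agnostic_sample_size(H_size, epsilon, delta):
    """Smallest m making the gap above at most epsilon:
    m >= (1/(2 epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / (2 * epsilon ** 2))

H = 3 ** 10                                               # same illustrative |H| as before
print(agnostic_overfitting_gap(H, m=1000, delta=0.05))    # ~0.084
print(agnostic_sample_size(H, epsilon=0.1, delta=0.05))   # 700 examples
```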
Intuition 
• The bound on the number of training samples is 
m ≥ (1/(2ε²))(ln|H| + ln(1/δ)) 
• The bound on the true error is 
errorD(h) ≤ errortrain(h) + sqrt((ln|H| + ln(1/δ)) / 2m) 
• We can improve the performance of the algorithm by 
• Decreasing the training error errortrain(h) 
• Increasing the number of training samples m 
• Choosing an H which is “simple” (Occam’s Razor)
PAC learnable 
• A concept class C is PAC learnable by learner L using hypothesis space H 
if, for any c in C, any distribution over X, and any ε, δ > 0, 
L outputs (with probability at least 1-δ) a hypothesis h with errorD(h) ≤ ε, 
using a number of samples and an amount of computation 
polynomial in 1/ε, 1/δ, and the size of the problem
PAC learning: 
infinite hypothesis space 
• Bound for a finite hypothesis space: 
errorD(h) ≤ errortrain(h) + sqrt((ln|H| + ln(1/δ)) / 2m) 
• What if the hypothesis space is infinite, i.e. |H| → ∞?
VC dimension 
• The VC (Vapnik–Chervonenkis) dimension, VC(H), is a 
measure of the capacity of a classification algorithm 
• It is defined as the cardinality (size) of the largest set of 
points that the algorithm can shatter 
• Shatter: for every possible labeling of the points, the algorithm 
can classify all of them correctly
VC dimension example 
• Linear classifier in 2-D: 
a line can shatter 3 points in general position, but not 4 
• For d dimensions, a linear classifier can shatter 
d+1 points, so VC(H) = d+1 
• 1-Nearest neighbor method? 
VC(H) = ∞, since 1-NN perfectly fits any labeling of any 
set of distinct points 
• Decision tree with k boolean variables 
• VC(H) = 2^k, because we can shatter 2^k examples using a tree with 2^k leaf 
nodes, and we cannot shatter 2^k + 1 examples 
http://www.cs.cmu.edu/~guestrin/Class/15781/slides/learningtheory-bns-annotated.pdf
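A small Python sketch of the 2-D linear-classifier example: it checks linear separability of every labeling with a feasibility LP (scipy.optimize.linprog), confirming that 3 points in general position can be shattered while the 4-point XOR configuration cannot. The specific point sets are chosen just for illustration.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility check: does there exist (w, b) with y_i * (w.x_i + b) >= 1 for all i?"""
    A_ub, b_ub = [], []
    for x, y in zip(points, labels):
        # y*(w.x + b) >= 1  <=>  -y*x1*w1 - y*x2*w2 - y*b <= -1
        A_ub.append([-y * x[0], -y * x[1], -y])
        b_ub.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0   # 0 = feasible solution found, 2 = infeasible

def shattered(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]          # 3 points in general position
four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # includes the XOR configuration
print(shattered(three))   # True  -> VC(linear classifiers in 2-D) >= 3
print(shattered(four))    # False -> this 4-point set cannot be shattered
```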
VC(H) and |H| 
• Is there any relation between VC(H) and |H|? 
• Let VC(H) = k (k a finite value) 
• H can shatter some set of k examples 
• There are 2^k possible labelings of those examples 
• Since H must contain a hypothesis realizing each of the 2^k 
labelings, we get |H| ≥ 2^k 
• i.e. k ≤ log2|H| (this is a very loose bound)
Bounds for infinite hypothesis space 
• In PAC learning, we seek theory to relate: 
• The number of training samples: m 
• errorD(h) ≤ errortrain(h) + ε 
• VC dimension: VC(H) (|H| and VC(H) are related) 
• Confidence of the relation: at least (1-δ) 
• The bound on m: 
m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε)) 
• The bound on the error: 
errorD(h) ≤ errortrain(h) + sqrt((VC(H)·(ln(2m/VC(H)) + 1) + ln(4/δ)) / m) 
• The added term is an increasing function of VC(H) for VC(H) < 2m: 
for sufficiently large training data, 
Occam’s razor still works here
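As a rough sketch, the two VC-based bounds as Python functions; the exact constants differ between textbooks, so the numbers below are only indicative.

```python
import math

def vc_sample_size(vc, epsilon, delta):
    """m >= (1/eps) * (4 log2(2/delta) + 8 VC(H) log2(13/eps)),
    the classic realizable-case bound (constants vary by source)."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / epsilon)) / epsilon)

def vc_error_gap(vc, m, delta):
    """sqrt((VC(H)*(ln(2m/VC(H)) + 1) + ln(4/delta)) / m): gap added to errortrain(h)."""
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

# Linear classifiers in 10 dimensions: VC(H) = d + 1 = 11 (illustrative numbers)
print(vc_sample_size(vc=11, epsilon=0.1, delta=0.05))   # ~6,400 examples
print(vc_error_gap(vc=11, m=10_000, delta=0.05))        # ~0.10
```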
Structural Risk Minimization 
• Question: Is there any better criterion for choosing a 
hypothesis than empirical risk minimization? 
• Answer: choose h (and H) to minimize a bound on the 
expected true error (Structural Risk Minimization) 
• h* = argmin over h of [ errortrain(h) + sqrt((VC(H)·(ln(2m/VC(H)) + 1) + ln(4/δ)) / m) ] 
• i.e. pick the hypothesis that minimizes the structural risk
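A minimal sketch of SRM over a few nested hypothesis classes: given (training error, VC dimension) pairs, it picks the class minimizing training error plus the VC penalty. The candidate numbers are hypothetical.

```python
import math

def vc_penalty(vc, m, delta):
    """Complexity term from the VC bound (same expression as above)."""
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

def structural_risk_minimization(candidates, m, delta):
    """candidates: list of (train_error, vc_dim) pairs, one per nested class H1 ⊂ H2 ⊂ ...
    Returns the index of the class minimizing training error + VC penalty, and all risks."""
    risks = [err + vc_penalty(vc, m, delta) for err, vc in candidates]
    return min(range(len(candidates)), key=lambda i: risks[i]), risks

# Hypothetical results for three nested model classes on m = 50,000 examples:
# richer classes fit the training data better but pay a larger complexity penalty.
candidates = [(0.15, 3), (0.06, 25), (0.05, 400)]
best, risks = structural_risk_minimization(candidates, m=50_000, delta=0.05)
print(best, [round(r, 3) for r in risks])   # SRM picks the middle class (index 1) here
```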
Appendix: bound for 
finite hypothesis space 
• Call a hypothesis h’ “bad” if errorD(h’) > ε, and let B ⊆ H be the set of 
bad hypotheses 
• For a single bad h’, the probability that it is consistent with one random 
training example is at most 1-ε, so the probability that it is consistent 
with all m (independent) training examples is at most (1-ε)^m 
• If the selected h has errorD(h) > ε and errortrain(h) = 0, then some 
h’ in B is consistent with the m training data points 
• h itself is one example, and there could be many more 
• Taking the union bound over B, this implies that 
Pr[∃h in H: errortrain(h) = 0 and errorD(h) > ε] ≤ |B|(1-ε)^m ≤ |H|(1-ε)^m ≤ |H|e^(-εm)
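As a sanity check, a small Monte Carlo sketch (using a toy threshold hypothesis space invented for illustration) that empirically estimates the probability that an ε-bad consistent hypothesis exists and compares it with the |H|e^(-εm) bound; the bound holds, though loosely.

```python
import numpy as np

rng = np.random.default_rng(1)

THRESHOLDS = np.linspace(0, 1, 101)        # finite hypothesis space, |H| = 101
TARGET = 0.5                               # target concept c(x) = 1[x > 0.5]
true_errors = np.abs(THRESHOLDS - TARGET)  # errorD(h) under a uniform distribution on [0,1]

def bad_consistent_exists(m, epsilon):
    """One trial: draw m examples; is some h with errorD(h) > epsilon consistent with all of them?"""
    x = rng.random(m)
    y = (x > TARGET).astype(int)
    consistent = np.array([np.all((x > t).astype(int) == y) for t in THRESHOLDS])
    return np.any(consistent & (true_errors > epsilon))

m, epsilon, trials = 50, 0.1, 5_000
empirical = np.mean([bad_consistent_exists(m, epsilon) for _ in range(trials)])
bound = len(THRESHOLDS) * np.exp(-epsilon * m)
print(f"empirical probability: {empirical:.4f}   bound |H|e^(-eps*m): {bound:.4f}")
```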
