Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.




                      PAC-learning
                                    Andrew W. Moore
                                   Associate Professor
                                School of Computer Science
                                Carnegie Mellon University
                                     www.cs.cmu.edu/~awm
                                        awm@cs.cmu.edu
                                          412-268-7599




         Copyright © 2001, Andrew W. Moore                                             Nov 30th, 2001




   Probably Approximately Correct
           (PAC) Learning
• Imagine we’re doing classification with categorical
  inputs.
• All inputs and outputs are binary.
• Data is noiseless.
• There’s a machine f(x,h) which has H possible
  settings (a.k.a. hypotheses), called h1, h2, ..., hH.




Copyright © 2001, Andrew W. Moore                                                             PAC-learning: Slide 2




Example of a machine
• f(x,h) consists of all logical sentences about X1, X2
  .. Xm that contain only logical ands.
• Example hypotheses:
• X1 ^ X3 ^ X19
• X3 ^ X18
• X7
• X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are 3 attributes, what is the
  complete set of hypotheses in f?



Copyright © 2001, Andrew W. Moore                                 PAC-learning: Slide 3




                   Example of a machine
• f(x,h) consists of all logical sentences about X1, X2
  .. Xm that contain only logical ands.
• Example hypotheses:
• X1 ^ X3 ^ X19
• X3 ^ X18
• X7
• X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are 3 attributes, what is the
  complete set of hypotheses in f? (H = 8)
              True                  X2        X3        X2 ^ X3

              X1                    X1 ^ X2   X1 ^ X3   X1 ^ X2 ^ X3
Copyright © 2001, Andrew W. Moore                                 PAC-learning: Slide 4




And-Positive-Literals Machine
• f(x,h) consists of all logical sentences about X1, X2
  .. Xm that contain only logical ands.
• Example hypotheses:
• X1 ^ X3 ^ X19
• X3 ^ X18
• X7
• X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are m attributes, how many
  hypotheses in f?



Copyright © 2001, Andrew W. Moore            PAC-learning: Slide 5




       And-Positive-Literals Machine
• f(x,h) consists of all logical sentences about X1, X2
  .. Xm that contain only logical ands.
• Example hypotheses:
• X1 ^ X3 ^ X19
• X3 ^ X18
• X7
• X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are m attributes, how many
  hypotheses in f? (H = 2^m; see the sketch below)



Copyright © 2001, Andrew W. Moore            PAC-learning: Slide 6
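
A quick sanity check of this count, as a minimal Python sketch I am adding (not part of the original slides; function names and the 0-based attribute indices are illustrative): a hypothesis of the And-Positive-Literals machine is just the subset of attributes required to be 1, with the empty subset acting as the constant True, so there are 2^m hypotheses.

```python
# Minimal sketch: enumerate the And-Positive-Literals hypothesis space.
# Each hypothesis is the set of attribute indices that must all equal 1;
# the empty set is the constant "True" hypothesis.
from itertools import combinations

def and_positive_literal_hypotheses(m):
    """Yield every hypothesis as a tuple of attribute indices."""
    for k in range(m + 1):
        yield from combinations(range(m), k)

def predict(hypothesis, x):
    """Evaluate the conjunction on a binary input vector x."""
    return all(x[i] == 1 for i in hypothesis)

hyps = list(and_positive_literal_hypotheses(3))
print(len(hyps))                    # 8, i.e. 2**3, matching H = 2^m
print(predict((0, 2), (1, 0, 1)))   # X1 ^ X3 evaluated on x = (1, 0, 1) -> True
```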




And-Literals Machine
• f(x,h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
• X1 ^ ~X3 ^ X19
• X3 ^ ~X18
• ~X7
• X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are 2 attributes, what is the complete set of hypotheses in f?

Copyright © 2001, Andrew W. Moore              PAC-learning: Slide 7




                     And-Literals Machine
• f(x,h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
• X1 ^ ~X3 ^ X19
• X3 ^ ~X18
• ~X7
• X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are 2 attributes, what is the complete set of hypotheses in f? (H = 9)

      True         True
      True         X2
      True         ~X2
      X1           True
      X1     ^     X2
      X1     ^     ~X2
      ~X1          True
      ~X1    ^     X2
      ~X1    ^     ~X2

Copyright © 2001, Andrew W. Moore              PAC-learning: Slide 8




And-Literals Machine
• f(x,h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
• X1 ^ ~X3 ^ X19
• X3 ^ ~X18
• ~X7
• X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

      True         True
      True         X2
      True         ~X2
      X1           True
      X1     ^     X2
      X1     ^     ~X2
      ~X1          True
      ~X1    ^     X2
      ~X1    ^     ~X2

Copyright © 2001, Andrew W. Moore               PAC-learning: Slide 9




                     And-Literals Machine
• f(x,h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
• X1 ^ ~X3 ^ X19
• X3 ^ ~X18
• ~X7
• X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f? (H = 3^m; see the sketch below)

      True         True
      True         X2
      True         ~X2
      X1           True
      X1     ^     X2
      X1     ^     ~X2
      ~X1          True
      ~X1    ^     X2
      ~X1    ^     ~X2

Copyright © 2001, Andrew W. Moore              PAC-learning: Slide 10
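
The 3^m count can be checked the same way with a small Python sketch (again my own illustration, not from the slides; the 0/1/2 encoding of "absent / positive literal / negated literal" is an assumption made for the example): each attribute independently appears positively, negatively, or not at all.

```python
# Minimal sketch: enumerate the And-Literals hypothesis space.
# For each attribute: 0 = not used, 1 = positive literal, 2 = negated literal.
from itertools import product

def and_literal_hypotheses(m):
    """Yield every hypothesis as a tuple of per-attribute states."""
    yield from product((0, 1, 2), repeat=m)

def predict(hypothesis, x):
    """Evaluate the conjunction of literals on a binary input vector x."""
    for state, xi in zip(hypothesis, x):
        if state == 1 and xi != 1:      # required positive literal violated
            return False
        if state == 2 and xi != 0:      # required negated literal violated
            return False
    return True

hyps = list(and_literal_hypotheses(2))
print(len(hyps))                 # 9, i.e. 3**2, matching H = 3^m
print(predict((1, 2), (1, 0)))   # X1 ^ ~X2 evaluated on x = (1, 0) -> True
```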




Lookup Table Machine
• f(x,h) consists of all truth tables mapping combinations of input attributes to true and false
• Example hypothesis: the truth table below
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

      X1   X2   X3   X4   Y
      0    0    0    0    0
      0    0    0    1    1
      0    0    1    0    1
      0    0    1    1    0
      0    1    0    0    1
      0    1    0    1    0
      0    1    1    0    0
      0    1    1    1    1
      1    0    0    0    0
      1    0    0    1    0
      1    0    1    0    0
      1    0    1    1    1
      1    1    0    0    0
      1    1    0    1    0
      1    1    1    0    0
      1    1    1    1    0
Copyright © 2001, Andrew W. Moore                                   PAC-learning: Slide 11




                  Lookup Table Machine
• f(x,h) consists of all truth tables mapping combinations of input attributes to true and false
• Example hypothesis: the truth table below
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

      H = 2^(2^m)   (a counting sketch follows this slide)

      X1   X2   X3   X4   Y
      0    0    0    0    0
      0    0    0    1    1
      0    0    1    0    1
      0    0    1    1    0
      0    1    0    0    1
      0    1    0    1    0
      0    1    1    0    0
      0    1    1    1    1
      1    0    0    0    0
      1    0    0    1    0
      1    0    1    0    0
      1    0    1    1    1
      1    1    0    0    0
      1    1    0    1    0
      1    1    1    0    0
      1    1    1    1    0
Copyright © 2001, Andrew W. Moore                                   PAC-learning: Slide 12
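
As a rough illustration of this count (a sketch I am adding, not from the slides), a lookup-table hypothesis is a complete truth table: each of the 2^m input rows gets an independent 0/1 label, giving 2^(2^m) tables. Enumeration is only feasible for tiny m.

```python
# Minimal sketch: enumerate lookup-table hypotheses for tiny m.
# A hypothesis is a full truth table: one 0/1 label per input row.
from itertools import product

def lookup_table_hypotheses(m):
    rows = list(product((0, 1), repeat=m))          # the 2^m possible inputs
    for labels in product((0, 1), repeat=len(rows)):
        yield dict(zip(rows, labels))               # one complete truth table

m = 2
hyps = list(lookup_table_hypotheses(m))
print(len(hyps))           # 16, i.e. 2**(2**2), matching H = 2^(2^m)
print(hyps[5][(1, 0)])     # the label hypothesis #5 assigns to input (1, 0)
```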




A Game
• We specify f, the machine
• Nature chooses a hidden random hypothesis h*
• Nature randomly generates R datapoints
   • How is a datapoint generated?
     1. The vector of inputs x_k = (x_k1, x_k2, ..., x_km) is drawn from a fixed unknown distribution D
     2. The corresponding output is y_k = f(x_k, h*)
• We learn an approximation of h* by choosing some h_est for which the training set error is 0 (a simulation sketch follows this slide)




 Copyright © 2001, Andrew W. Moore                      PAC-learning: Slide 13
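
Here is a small simulation of this game for the And-Positive-Literals machine, as a hedged sketch (names such as sample_datapoints and the choice of D as the uniform distribution over {0,1}^m are my own assumptions; the slides only say "fixed unknown"): Nature fixes h*, draws R labelled points, and the learner returns any hypothesis with zero training-set error.

```python
# Sketch of the learning game for the And-Positive-Literals machine.
# Assumption: D is uniform over {0,1}^m.
import random
from itertools import combinations

def predict(hypothesis, x):
    """Conjunction of positive literals, given as a tuple of attribute indices."""
    return all(x[i] == 1 for i in hypothesis)

def sample_datapoints(h_star, m, R, rng):
    """Nature draws R inputs from D and labels them with the hidden h*."""
    data = []
    for _ in range(R):
        x = tuple(rng.randint(0, 1) for _ in range(m))
        data.append((x, predict(h_star, x)))
    return data

def learn_consistent(data, m):
    """Return some h_est with zero training-set error (smallest one found)."""
    for k in range(m + 1):
        for h in combinations(range(m), k):
            if all(predict(h, x) == y for x, y in data):
                return h
    return None   # cannot happen in the noiseless setting of the slides

rng = random.Random(0)
h_star = (0, 2)                              # the hidden hypothesis X1 ^ X3
data = sample_datapoints(h_star, m=4, R=20, rng=rng)
print(learn_consistent(data, m=4))           # a consistent h_est, often h* itself
```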




    Test Error Rate
• We specify f, the machine
• Nature chooses a hidden random hypothesis h*
• Nature randomly generates R datapoints
   • How is a datapoint generated?
     1. The vector of inputs x_k = (x_k1, x_k2, ..., x_km) is drawn from a fixed unknown distribution D
     2. The corresponding output is y_k = f(x_k, h*)
• We learn an approximation of h* by choosing some h_est for which the training set error is 0
• For each hypothesis h:
• Say h is Correctly Classified (CCd) if h has zero training set error
• Define TESTERR(h)
     = Fraction of test points that h will misclassify
     = P(h misclassifies a random test point)
• Say h is BAD if TESTERR(h) > ε




 Copyright © 2001, Andrew W. Moore                      PAC-learning: Slide 14




Test Error Rate
      P(h is CCd | h is bad)
         = P(∀k ∈ Training Set, f(x_k, h) = y_k | h is bad)
         ≤ (1 − ε)^R
      (a Monte-Carlo check of this bound follows this slide)

• We specify f, the machine
• Nature chooses a hidden random hypothesis h*
• Nature randomly generates R datapoints
   • How is a datapoint generated?
     1. The vector of inputs x_k = (x_k1, x_k2, ..., x_km) is drawn from a fixed unknown distribution D
     2. The corresponding output is y_k = f(x_k, h*)
• We learn an approximation of h* by choosing some h_est for which the training set error is 0
• For each hypothesis h:
• Say h is Correctly Classified (CCd) if h has zero training set error
• Define TESTERR(h)
     = Fraction of test points that h will misclassify
     = P(h misclassifies a random test point)
• Say h is BAD if TESTERR(h) > ε




 Copyright © 2001, Andrew W. Moore                                                          PAC-learning: Slide 15
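
A quick Monte-Carlo check of this inequality, as an illustrative sketch I am adding (the simulation setup is my own, not from the slides): a hypothesis whose test error exceeds ε is caught by each independent training point with probability more than ε, so it survives all R points with probability at most (1 − ε)^R.

```python
# Sketch: empirically check P(h is CCd | h is bad) <= (1 - eps)^R.
import random

def survival_rate(test_err, R, trials=100_000, seed=0):
    """Fraction of trials in which a hypothesis with the given test error
    is correct on all R independent training points."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        if all(rng.random() > test_err for _ in range(R)):
            survived += 1
    return survived / trials

eps, R = 0.1, 30
print(survival_rate(test_err=0.15, R=R))   # a "bad" h: test error 0.15 > eps
print((1 - eps) ** R)                      # the bound (1 - eps)^R is larger
```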




    Test Error Rate
• We specify f, the machine
• Nature chooses a hidden random hypothesis h*
• Nature randomly generates R datapoints
   • How is a datapoint generated?
     1. The vector of inputs x_k = (x_k1, x_k2, ..., x_km) is drawn from a fixed unknown distribution D
     2. The corresponding output is y_k = f(x_k, h*)
• We learn an approximation of h* by choosing some h_est for which the training set error is 0
• For each hypothesis h:
• Say h is Correctly Classified (CCd) if h has zero training set error
• Define TESTERR(h)
     = Fraction of test points that h will misclassify
     = P(h misclassifies a random test point)
• Say h is BAD if TESTERR(h) > ε

      P(h is CCd | h is bad)
         = P(∀k ∈ Training Set, f(x_k, h) = y_k | h is bad)
         ≤ (1 − ε)^R

      P(we learn a bad h)
         ≤ P(the set of CCd h's contains a bad h)
         = P(∃h. h is CCd ^ h is bad)
         = P( (h_1 is CCd ^ h_1 is bad) ∨ (h_2 is CCd ^ h_2 is bad) ∨ … ∨ (h_H is CCd ^ h_H is bad) )
         ≤ Σ_{i=1..H} P(h_i is CCd ^ h_i is bad)
         ≤ Σ_{i=1..H} P(h_i is CCd | h_i is bad)
         ≤ H × max_i P(h_i is CCd | h_i is bad)
         ≤ H (1 − ε)^R
      (the third-to-last step uses P(A ^ B) = P(A | B) P(B) ≤ P(A | B) for each term;
       a numeric illustration of the final bound follows this slide)
 Copyright © 2001, Andrew W. Moore                                                          PAC-learning: Slide 16
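
To get a feel for the final bound, here is a tiny sketch (my own illustration, not from the slides) that evaluates H(1 − ε)^R for the And-Literals machine and shows how quickly it shrinks as R grows:

```python
# Sketch: evaluate the union bound P(we learn a bad h) <= H * (1 - eps)^R.
def bad_h_bound(H, eps, R):
    return H * (1 - eps) ** R

H = 3 ** 10                                   # And-Literals machine with m = 10
for R in (100, 300, 500):
    print(R, bad_h_bound(H, eps=0.05, R=R))   # drops below 1, then toward 0
```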




PAC Learning
 • Choose R such that, with probability less than δ, we'll
   select a bad h_est (i.e. an h_est which makes mistakes
   more than a fraction ε of the time)
 • Probably Approximately Correct
 • As we just saw, this can be achieved by choosing
   R such that

                  δ = P(we learn a bad h) ≤ H (1 − ε)^R

 • i.e. R such that (a helper that evaluates this follows this slide)

                  R ≥ (0.69/ε) (log2 H + log2(1/δ))
 Copyright © 2001, Andrew W. Moore                                             PAC-learning: Slide 17
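
A small helper that evaluates this sample-size formula, as a sketch under the slides' assumptions (finite H, noiseless data, a learner that returns a zero-training-error hypothesis); the function name is my own:

```python
# Sketch: records needed so that, with probability at least 1 - delta,
# the learned zero-training-error hypothesis has test error at most eps.
import math

def pac_sample_size(H, eps, delta):
    return math.ceil((0.69 / eps) * (math.log2(H) + math.log2(1 / delta)))

# Example: And-Literals machine with m = 20 attributes, so H = 3^20.
print(pac_sample_size(H=3 ** 20, eps=0.05, delta=0.01))   # a few hundred records
```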




                                     PAC in action
Machine                 Example Hypothesis            H                   R required to PAC-learn
And-positive-literals   X3 ^ X7 ^ X8                  2^m                 (0.69/ε)(m + log2(1/δ))
And-literals            X3 ^ ~X7                      3^m                 (0.69/ε)((log2 3) m + log2(1/δ))
Lookup Table            a full truth table over       2^(2^m)             (0.69/ε)(2^m + log2(1/δ))
                        X1..X4 (as on Slide 11)
And-lits or And-lits    (X1 ^ X5) v (X2 ^ ~X7 ^ X8)   (3^m)^2 = 3^(2m)    (0.69/ε)((2 log2 3) m + log2(1/δ))
 Copyright © 2001, Andrew W. Moore                                             PAC-learning: Slide 18




PAC for decision trees of depth k
• Assume m attributes
• Hk = Number of decision trees of depth k

• H0 =2
• Hk+1 = (#choices of root attribute) *
         (# possible left subtrees) *
          (# possible right subtrees)
                  = m * Hk * Hk

• Write Lk = log2 Hk
• L0 = 1
• Lk+1 = log2 m + 2 Lk
• So Lk = (2^k − 1)(1 + log2 m) + 1
• So to PAC-learn, we need (a numeric sketch follows this slide)

                  R ≥ (0.69/ε) ((2^k − 1)(1 + log2 m) + 1 + log2(1/δ))
Copyright © 2001, Andrew W. Moore                                             PAC-learning: Slide 19
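
The same sample-size calculation specialised to depth-k decision trees, as an illustrative sketch using the log2 Hk = (2^k − 1)(1 + log2 m) + 1 formula from the slide above (function names are my own):

```python
# Sketch: sample size to PAC-learn depth-k decision trees over m attributes,
# using log2 H_k = (2^k - 1)(1 + log2 m) + 1 from the slide above.
import math

def decision_tree_log2_H(m, k):
    return (2 ** k - 1) * (1 + math.log2(m)) + 1

def pac_sample_size_trees(m, k, eps, delta):
    return math.ceil((0.69 / eps) *
                     (decision_tree_log2_H(m, k) + math.log2(1 / delta)))

print(pac_sample_size_trees(m=10, k=3, eps=0.05, delta=0.01))
```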




                 What you should know
• Be able to understand every step in the math that gets you to

                  δ = P(we learn a bad h) ≤ H (1 − ε)^R

• Understand that you thus need this many records to PAC-learn a machine with H hypotheses

                  R ≥ (0.69/ε) (log2 H + log2(1/δ))

• Understand examples of deducing H for various machines
Copyright © 2001, Andrew W. Moore                                             PAC-learning: Slide 20




