Introduction to Machine Learning
Vapnik–Chervonenkis Dimension
Andres Mendez-Vazquez
July 22, 2018
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing m_H(N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the VC Dimension
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
Until Now
We have been learning
a lot of functions to approximate the model f of a given data set D.
Data Observation/Data Collection
[Figure: the phenomenon generating the samples vs. what we can observe]
Example
Consider the following data set D
Consider a Boolean target function over a three-dimensional input space X = {0, 1}^3
With a data set D
n  x_n  y_n
1  000  0
2  001  1
3  010  1
4  011  0
5  100  1
We have the following
The input space has 2^3 = 8 possible points.
Therefore, there are 2^{2^3} = 256 possible functions for f.
Learning outside the data D: basically, we want a g that generalizes outside D.
n  x_n  y_n  g  f1 f2 f3 f4 f5 f6 f7 f8
1  000  0    0  0  0  0  0  0  0  0  0
2  001  1    1  1  1  1  1  1  1  1  1
3  010  1    1  1  1  1  1  1  1  1  1
4  011  0    0  0  0  0  0  0  0  0  0
5  100  1    1  1  1  1  1  1  1  1  1
6  101  ?    ?  0  0  0  0  1  1  1  1
7  110  ?    ?  0  0  1  1  0  0  1  1
8  111  ?    ?  0  1  0  1  0  1  0  1
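To make the counting concrete, here is a small Python sketch (an illustration, not part of the original slides; all names are hypothetical) that enumerates the 2^{2^3} = 256 Boolean functions on X = {0,1}^3 and keeps the ones that agree with D on the five observed points:

```python
from itertools import product

# The eight points of X = {0,1}^3, in the order used on the slide.
X = [(0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1)]

# The observed data set D: the first five points with their labels y_n.
D = {(0,0,0): 0, (0,0,1): 1, (0,1,0): 1, (0,1,1): 0, (1,0,0): 1}

# Every Boolean function on X is a tuple of 8 output bits -> 2**8 = 256 functions.
consistent = [f for f in product([0, 1], repeat=len(X))
              if all(f[X.index(x)] == y for x, y in D.items())]

print(len(consistent))  # 8: one for each way of labeling the three unseen points
```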
Here is the Dilemma!!!
Each of f1, f2, ..., f8 is a possible true f.
Any of them is an equally good candidate for f.
Therefore
The quality of the learning will be determined by how close our
prediction is to the true value.
Therefore, we have
In order to select a g, we need a hypothesis set H
so that our training procedure can select such a g.
Further, any of f1, f2, ..., f8 is a good choice for f.
Therefore, it does not matter how near we are to the bits in D.
Our problem: we want to generalize to the data outside D.
However, it makes no difference whether our hypothesis is correct or incorrect on D.
We want to Generalize
But, if we use only a deterministic approach to H,
our attempt to use H to learn g is a waste of time!!!
Therefore
We have the "Real Probabilities"
P[Pick a Red marble] = µ
P[Pick a Blue marble] = 1 − µ
However, the value of µ is not known.
Thus, we draw N samples from the bin independently.
Here, the fraction of red marbles in the sample is ν.
Question: Can ν be used to learn about µ?
Two Answers... Possible vs. Probable
No!!! Because we can only see the sample.
For example, the sample can be mostly blue while the bin is mostly red.
Yes!!!
The sample frequency ν is likely close to the bin frequency µ.
What does ν say about µ?
We have the following hypothesis
In a big sample (large N), ν is probably close to µ (within ε).
How?
Hoeffding's Inequality.
We have the following theorem
Theorem (Hoeffding's inequality)
Let Z_1, ..., Z_N be independent bounded random variables with Z_i ∈ [a, b] for all i, where −∞ < a ≤ b < ∞. Then

P[ (1/N) ∑_{i=1}^{N} (Z_i − E[Z_i]) ≥ t ] ≤ exp( −2Nt² / (b − a)² )

and

P[ (1/N) ∑_{i=1}^{N} (Z_i − E[Z_i]) ≤ −t ] ≤ exp( −2Nt² / (b − a)² )

for all t ≥ 0.
Therefore
Assume that the Z_i are the random variables obtained from the N samples.
Then the Z_i take values in {0, 1}, so (b − a)² = 1 and E[Z_i] = µ. Therefore we have...
First inequality, for any ε > 0 and N:

P[ (1/N) ∑_{i=1}^{N} Z_i − µ ≥ ε ] ≤ exp( −2Nε² )

Second inequality, for any ε > 0 and N:

P[ (1/N) ∑_{i=1}^{N} Z_i − µ ≤ −ε ] ≤ exp( −2Nε² )
Here
We can use the fact that

ν = (1/N) ∑_{i=1}^{N} Z_i

Putting it all together, we have

P( ν − µ ≥ ε or ν − µ ≤ −ε ) ≤ P( ν − µ ≥ ε ) + P( ν − µ ≤ −ε )

Finally

P( |ν − µ| ≥ ε ) ≤ 2 exp( −2Nε² )
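As a quick sanity check, a minimal Python sketch (illustrative only, with hypothetical values of µ, N, and ε) that estimates P(|ν − µ| ≥ ε) by simulation and compares it with the Hoeffding bound 2 exp(−2Nε²):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.6, 100, 0.1, 100_000   # hypothetical bin and sample size

# Draw many samples of size N from a Bernoulli(mu) bin and compute nu each time.
nu = rng.binomial(N, mu, size=trials) / N

empirical = np.mean(np.abs(nu - mu) >= eps)   # estimated P(|nu - mu| >= eps)
bound = 2 * np.exp(-2 * N * eps**2)           # Hoeffding bound

print(f"empirical: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
# The empirical frequency stays below the bound (up to simulation noise).
```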
Making Possible
It is possible to estimate ν ≈ µ.
How do we connect this with Learning?
Learning
We want to find a function f : X −→ Y which is unknown!!!
Here we assume that each ball in the bin is a sample x ∈ X.
Thus, it is necessary to select a hypothesis.
Basically, we want a hypothesis h:
If h(x) = f(x), we color the sample blue.
If h(x) ≠ f(x), we color the sample red.
Here a Small Remark
Here, we are not talking about classes
when we talk about blue and red balls, but about whether we identify the correct label:

y_h = h(x) = f(x) = y    or    y_h = h(x) ≠ f(x) = y

Still, the use of blue and red balls allows us
to see our Learning Problem as a Bernoulli distribution.
Swiss mathematician Jacob Bernoulli
Definition
The Bernoulli distribution is a discrete distribution having two possible outcomes, X = 0 or X = 1,
with the following probabilities

P(X = 0 | p) = 1 − p,   P(X = 1 | p) = p

Also expressed as

P(X = k | p) = p^k (1 − p)^{1−k}
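A small illustration (hypothetical target, hypothesis, and input distribution; not from the slides) of how the red/blue colouring turns each sample into a Bernoulli trial, where the indicator of h(x) ≠ f(x) equals 1 with probability µ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: X = {0, 1}, target f(x) = x, hypothesis h(x) = 0.
# They disagree only on x = 1, which occurs with probability 0.3 under P(x).
x = rng.binomial(1, 0.3, size=10_000)    # samples from P(x)
f_x = x                                  # target labels f(x)
h_x = np.zeros_like(x)                   # hypothesis predictions h(x)

# Colouring: Z = 1 (red) when h(x) != f(x), Z = 0 (blue) otherwise -> Bernoulli(mu).
Z = (h_x != f_x).astype(int)
print(Z.mean())   # approximately 0.3, i.e. mu = P[h(x) != f(x)]
```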
Thus
We define Ein (in-sample error)

Ein(h) = (1/N) ∑_{n=1}^{N} I( h(x_n) ≠ f(x_n) )

We have made explicit the dependency of Ein on the particular h that we are considering.
Now Eout (out-of-sample error)

Eout(h) = P( h(x) ≠ f(x) ) = µ

Where
the probability is based on the distribution P over X which is used to sample the data points x.
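A quick numpy sketch (illustrative only; the target, hypothesis, and input distribution below are hypothetical) of how Ein is computed on a sample and how Eout can be estimated on a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear target and a fixed linear hypothesis on X = R^2.
f = lambda X: np.sign(X @ np.array([1.0, -1.0]))   # unknown target
h = lambda X: np.sign(X @ np.array([0.8, -1.2]))   # our fixed hypothesis

X_train = rng.normal(size=(20, 2))                 # data set D, N = 20
E_in = np.mean(h(X_train) != f(X_train))           # (1/N) sum I(h(x_n) != f(x_n))

X_fresh = rng.normal(size=(200_000, 2))            # large fresh sample from P(x)
E_out = np.mean(h(X_fresh) != f(X_fresh))          # Monte Carlo estimate of P[h(x) != f(x)]

print(f"E_in = {E_in:.3f}, E_out ~ {E_out:.3f}")
```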
Generalization Error
Definition (Generalization Error/out-of-sample error)
Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and an underlying distribution D, the generalization error or risk of h is defined by

R(h) = P_{x∼D}( h(x) ≠ f(x) ) = E_{x∼D}[ I_{h(x)≠f(x)} ]

where I_ω is the indicator function of the event ω. This comes from the fact that 1 · P(A) + 0 · P(A^c) = E[I_A].
Empirical Error
Definition (Empirical Error/in-sample error)
Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and a sample X = {x_1, x_2, ..., x_N}, the empirical error or empirical risk of h is defined by:

R̂(h) = (1/N) ∑_{i=1}^{N} I_{h(x_i)≠f(x_i)}
Remark
The Hoeffding Inequality still applies to each bin individually
Now, we need to consider all the bins simultaneously.
Here, we have the following situation
h is fixed before the data set is generated!!!
If you are allowed to change h after you generate the data set
The Hoeffding Inequality no longer holds
Therefore
With multiple hypotheses in H
The Learning Algorithm chooses the final hypothesis g based on D
after generating the data.
The statement we would like to make is not
"P(|Ein(h_m) − Eout(h_m)| ≥ ε) is small" for a fixed h_m.
We would rather have
"P(|Ein(g) − Eout(g)| ≥ ε) is small" for the final hypothesis g.
We have two rules
First one
If A1 =⇒ A2, then P(A1) ≤ P(A2).
If you have any set of events A1, A2, ..., AM:

P( A1 ∪ A2 ∪ · · · ∪ AM ) ≤ ∑_{m=1}^{M} P(A_m)
Thus
We have

P( |Ein(g) − Eout(g)| ≥ ε ) ≤ P[ |Ein(h1) − Eout(h1)| ≥ ε
                                 or |Ein(h2) − Eout(h2)| ≥ ε
                                 · · ·
                                 or |Ein(hM) − Eout(hM)| ≥ ε ]
Then
We have

P( |Ein(g) − Eout(g)| ≥ ε ) ≤ ∑_{m=1}^{M} P[ |Ein(h_m) − Eout(h_m)| ≥ ε ]

Thus

P( |Ein(g) − Eout(g)| ≥ ε ) ≤ 2M exp( −2Nε² )
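A one-line sketch (illustrative numbers only) showing how loose this union bound becomes as the number of hypotheses M grows:

```python
import numpy as np

def hoeffding_union_bound(M, N, eps):
    """Bound on P(|Ein(g) - Eout(g)| >= eps) over M hypotheses and N samples."""
    return 2 * M * np.exp(-2 * N * eps**2)

# Hypothetical values: with N = 1000 and eps = 0.05 the bound is informative
# for small M but exceeds 1 (says nothing) once M is large.
for M in (1, 10, 1_000, 100_000):
    print(M, hoeffding_union_bound(M, N=1_000, eps=0.05))
```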
We have
Something Notable
We have introduced two apparently conflicting arguments about the
feasibility of learning.
We have two possibilities
One argument says that we cannot learn anything outside of D.
The other says it is possible!!!
Here, we introduce the probabilistic answer
This will solve our conundrum!!!
Then
The Deterministic Answer
Do we have something to say about f outside of D? The answer is
NO.
The Probabilistic Answer
Is D telling us something likely about f outside of D? The answer is
YES
The reason why
We approach our Learning from a Probabilistic point of view!!!
For example
We could have hypotheses based on hyperplanes
Linear regression output:

h(x) = ∑_{i=1}^{d} w_i x_i = w^T x

Therefore

Ein(h) = (1/N) ∑_{n=1}^{N} ( h(x_n) − y_n )²
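A minimal numpy sketch (hypothetical data and weights, for illustration only) of this hypothesis and its in-sample squared error:

```python
import numpy as np

rng = np.random.default_rng(3)

N, d = 50, 3
X = rng.normal(size=(N, d))                                     # data points x_n in R^d
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)   # noisy targets y_n

w = np.array([1.5, -0.8, 0.3])     # some hypothesis h(x) = w^T x

h = X @ w                          # h(x_n) for all n
E_in = np.mean((h - y) ** 2)       # (1/N) sum (h(x_n) - y_n)^2
print(f"E_in(h) = {E_in:.4f}")
```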
Clearly, we have used loss functions
Mostly to give meaning to h ≈ f
by Error Measures E(h, f),
using pointwise definitions

e( h(x), f(x) )

Examples
Squared Error: e(h(x), f(x)) = [h(x) − f(x)]²
Binary Error: e(h(x), f(x)) = I[ h(x) ≠ f(x) ]
Therefore, we have
The Overall Error
E(h, f) = average of the pointwise errors e(h(x), f(x))
In-Sample Error

Ein(h) = (1/N) ∑_{i=1}^{N} e( h(x_i), f(x_i) )

Out-of-sample error

Eout(h) = E_X[ e( h(x), f(x) ) ]
We have the following Process
Assuming P(y|x) instead of y = f(x),
then a data point (x, y) is now generated by the joint distribution

P(x, y) = P(x) P(y|x)

Therefore
A noisy target is a deterministic target f(x) = E[y|x] plus added noise y − f(x).
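A small sketch (with hypothetical choices of P(x) and P(y|x), for illustration) of generating data points from the joint distribution P(x, y) = P(x) P(y|x), where the target is noisy:

```python
import numpy as np

rng = np.random.default_rng(4)

N = 5
x = rng.uniform(-1, 1, size=N)       # x ~ P(x), a hypothetical input distribution
f_x = np.sin(np.pi * x)              # deterministic part f(x) = E[y|x] (assumed)
y = f_x + 0.2 * rng.normal(size=N)   # y ~ P(y|x): f(x) plus the noise y - f(x)

for xi, yi in zip(x, y):
    print(f"(x, y) = ({xi:+.3f}, {yi:+.3f})")
```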
Finally, we have the Learning Process
[Figure: the unknown target distribution P(y|x), the input probability distribution P(x), and the training examples fed to the learning algorithm]
Therefore
Distinction between P (y|x) and P (x)
Both convey probabilistic aspects of x and y.
Therefore
1 The Target distribution P (y|x) is what we are trying to learn.
2 The Input distribution P (x) quantifies relative importance of x.
Finally
Merging P (x, y) = P (y|x) P (x) mixes the two concepts
Therefore
Learning is feasible because it is likely that

Eout(g) ≈ Ein(g)

Therefore, we need g ≈ f, that is

Eout(g) = P( g(x) ≠ f(x) ) ≈ 0

How do we achieve this?

Eout(g) ≈ Ein(g) = (1/N) ∑_{n=1}^{N} I( g(x_n) ≠ f(x_n) )
Then
We make, at the same time,

Ein(g) ≈ 0

to make the error of our selected hypothesis g small with respect to the real function f.
Learning splits into two questions:
1 Can we make Eout(g) close enough to Ein(g)?
2 Can we make Ein(g) small enough?
We have that
The out-of-sample error

Eout(h) = P( h(x) ≠ f(x) )

measures how well our training on D
has generalized to data that we have not seen before.
Remark
Eout is based on the performance over the entire input space X.
Testing Data Set
Intuitively
we want to estimate the value of Eout using a sample of data points.
Something Notable
These points must be ’fresh’ test points that have not been used for
training.
Basically
Our Testing Set.
Thus
It is possible to define
the generalization error as the discrepancy between Ein and Eout.
Therefore
The Hoeffding Inequality is a way to characterize the generalization error with a probabilistic bound:

P( |Ein(g) − Eout(g)| ≥ ε ) ≤ 2M exp( −2Nε² )

for any ε > 0.
Reinterpreting This
Assume a tolerance level δ, for example δ = 0.0005.
It is possible to say that, with probability 1 − δ:

Eout(g) < Ein(g) + √( (1/(2N)) ln(2M/δ) )
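A small sketch (illustrative numbers only) of this bound: given N, M, and δ it returns the term added to Ein(g), and inverting it gives the sample size needed for a desired ε:

```python
import numpy as np

def generalization_term(N, M, delta):
    """sqrt((1/(2N)) ln(2M/delta)): the amount added to Ein(g) with prob. >= 1 - delta."""
    return np.sqrt(np.log(2 * M / delta) / (2 * N))

def samples_needed(eps, M, delta):
    """Smallest N making the term <= eps (solve eps = sqrt(ln(2M/delta)/(2N)))."""
    return int(np.ceil(np.log(2 * M / delta) / (2 * eps**2)))

# Hypothetical values: M = 1000 hypotheses, tolerance delta = 0.05.
print(generalization_term(N=10_000, M=1_000, delta=0.05))   # about 0.023
print(samples_needed(eps=0.05, M=1_000, delta=0.05))        # about 2120
```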
Proof
We have the complement of the Hoeffding probability for the absolute value:

P( |Eout(g) − Ein(g)| < ε ) ≥ 1 − 2M exp( −2Nε² )

Therefore, we have

P( −ε < Eout(g) − Ein(g) < ε ) ≥ 1 − 2M exp( −2Nε² )

This implies that, with probability at least 1 − 2M exp(−2Nε²),

Eout(g) < Ein(g) + ε
We have
The following inequality also holds:

−ε < Eout(g) − Ein(g)  ⇒  Eout(g) > Ein(g) − ε

Thus
Not only do we want our hypothesis g to do well out of sample,

Eout(g) < Ein(g) + ε,

but we also want to know how well we did with our H.
Thus, Eout(g) > Ein(g) − ε assures that it is not possible to do much better,
given any hypothesis with higher

Ein(h) = (1/N) ∑_{n=1}^{N} I( h(x_n) ≠ f(x_n) )

than g.
Therefore
But, we want to know how well we did with our H.
Thus, Eout(g) > Ein(g) − ε assures that it is not possible to do much better:
given any hypothesis h with higher in-sample error than g,

Ein(h) = (1/N) ∑_{n=1}^{N} I( h(x_n) ≠ f(x_n) ),

it will have a higher lower bound on Eout(h), given

Eout(h) > Ein(h) − ε
The Infiniteness of H
A problem with the error bound, given its dependency on M:

√( (1/(2N)) ln(2M/δ) )

What happens when M becomes infinite?
The number of hypotheses in H becomes infinite,
thus the bound becomes infinite.
Problem: almost all interesting learning models have infinite H...
For example... in our linear regression... f(x) = w^T x
Therefore, we need to replace M
We need to find a finite substitute with a finite range of values.
For this, we notice that

|Ein(h1) − Eout(h1)| ≥ ε  or  |Ein(h2) − Eout(h2)| ≥ ε  or · · · or  |Ein(hM) − Eout(hM)| ≥ ε
We have
This guarantees |Ein(g) − Eout(g)| ≥ ε.
Thus, we can look at the events B_m for which

|Ein(h_m) − Eout(h_m)| ≥ ε

Then

P( B_1 or B_2 · · · or B_M ) ≤ ∑_{m=1}^{M} P[B_m]
Now, we have the following
Example
We have a gross overestimate.
Basically, if h_i and h_j are quite similar, the two events

|Ein(h_i) − Eout(h_i)| ≥ ε  and  |Ein(h_j) − Eout(h_j)| ≥ ε

are likely to coincide!!!
We have
Something Notable
In a typical learning model, many hypotheses are indeed very similar.
The mathematical theory of generalization hinges on this observation:
we only need to account for the overlap between different hypotheses
in order to substitute for M.
Consider
A finite data set

X = {x_1, x_2, ..., x_N}

and a set of hypotheses h ∈ H such that

h : X → {−1, +1}

Applying h to X, we get an N-tuple ( h(x_1), h(x_2), ..., h(x_N) ) of ±1.
Such an N-tuple is called a Dichotomy,
given that it splits x_1, x_2, ..., x_N into two groups...
Something Important
Each h ∈ H generates a dichotomy on x1, ..., xN
However, two different h’s may generate the same dichotomy if they
generate the same pattern
Remark
Definition
Let x_1, x_2, ..., x_N ∈ X. The dichotomies generated by H on these points are defined by

H(x_1, x_2, ..., x_N) = { (h(x_1), h(x_2), ..., h(x_N)) | h ∈ H }

Therefore
We can see H(x_1, x_2, ..., x_N) as a set of hypotheses viewed through the geometry of the points.
Thus
A large H(x_1, x_2, ..., x_N) means H is more diverse.
Growth function, Our Replacement of M
Definition
The growth function is defined for a hypothesis set H by

m_H(N) = max_{x_1,...,x_N ∈ X} #H(x_1, x_2, ..., x_N)

where # denotes the cardinality (number of elements) of a set.
Therefore
m_H(N) is the maximum number of dichotomies that can be generated by H on any N points.
We remove the dependency on the entire X.
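A brute-force sketch (illustration only, with a hypothetical hypothesis set) of the contrast between M and m_H(N): even a huge finite H may generate very few distinct dichotomies on N points. Here H is a large set of one-dimensional threshold ("positive ray") hypotheses:

```python
import numpy as np

def num_dichotomies(hypotheses, points):
    """#H(x_1, ..., x_N): number of distinct dichotomies H generates on the points."""
    return len({tuple(h(x) for x in points) for h in hypotheses})

# Hypothetical finite H: 10,000 threshold hypotheses h_a(x) = sign(x - a).
H = [(lambda x, a=a: +1 if x > a else -1) for a in np.linspace(-2, 2, 10_000)]

points = np.random.default_rng(5).uniform(-1, 1, size=8)   # N = 8 points
print(len(H), num_dichotomies(H, points))   # M = 10000 hypotheses, at most N + 1 = 9 dichotomies
```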
Therefore
We have that
both M and m_H(N) are measures of the number of hypotheses in H.
However, we avoid considering all of X:
now we only consider N points instead of the entire X.
Upper Bound for m_H(N)
First, we know that

H(x_1, x_2, ..., x_N) ⊆ {−1, +1}^N

Hence, the value of m_H(N) is at most #{−1, +1}^N:

m_H(N) ≤ 2^N
Therefore
If H is capable of generating all possible dichotomies on x_1, x_2, ..., x_N,
then

H(x_1, x_2, ..., x_N) = {−1, +1}^N  and  #H(x_1, x_2, ..., x_N) = 2^N

We can say that
H can shatter x_1, x_2, ..., x_N,
meaning
H is as diverse as can be on this particular sample.
Shattering
Definition
A set X of N ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of X, that is, when

m_H(N) = 2^N
Thus, we have that
As we change the threshold a (of the positive ray), we get N + 1 different dichotomies:

m_H(N) = N + 1

Now, we have the case of positive intervals:
H consists of all hypotheses in one dimension that return +1 within some interval and −1 otherwise.
Therefore
We have
The line is again split by the points into N + 1 regions.
Furthermore
The dichotomy we get is decided by which two regions contain the end values of the interval.
Therefore, the number of possible dichotomies is

C(N + 1, 2) + 1

(the extra 1 accounts for the all −1 dichotomy, obtained when both end values fall in the same region).
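A brute-force check (a sketch under the stated setup, not from the slides) that counts the dichotomies of positive intervals on random point sets and compares them with the formula above:

```python
import numpy as np
from math import comb

def interval_dichotomies(points):
    """Distinct dichotomies of 'positive interval' hypotheses: +1 inside [a, b], -1 outside."""
    pts = np.sort(np.asarray(points, dtype=float))
    # One representative end value per region: before all points, between, after all.
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    dich = set()
    for a in cuts:
        for b in cuts:
            if a <= b:
                dich.add(tuple(np.where((points >= a) & (points <= b), +1, -1)))
    return dich

for N in (2, 3, 5, 8):
    pts = np.random.default_rng(6).uniform(0, 1, N)
    print(N, len(interval_dichotomies(pts)), comb(N + 1, 2) + 1)  # brute force vs formula
```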
Finally
In the case of a Convex Set in R²,
H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere.
Remember
We have that

P( |Ein(g) − Eout(g)| ≥ ε ) ≤ 2M exp( −2Nε² )

What if m_H(N) replaces M?
If m_H(N) is polynomial, we have an excellent case!!!
Therefore, we need to prove that
m_H(N) is polynomial.
Important
Something Notable
In general, it is easier to find a break point for H than to compute the
full growth function for that H.
Using this concept
We are ready to define the concept of Vapnik–Chervonenkis (VC)
dimension.
VC-Dimension
Definition
The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H (those points need to be in "general position"):

VCdim(H) = max{ k | m_H(k) = 2^k }

A set containing k points, for arbitrary k, is in general linear position if and only if no (k − 2)-dimensional flat contains them all.
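A brute-force sketch (illustration only) of this definition: search for the largest k such that some random k-point set is shattered. For the positive rays used earlier, the VC-dimension found this way is 1:

```python
import numpy as np

def ray_dichotomies(points):
    """Dichotomies of positive rays h(x) = sign(x - a) on the given points."""
    pts = np.sort(np.asarray(points, dtype=float))
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    return {tuple(np.where(points > a, +1, -1)) for a in cuts}

def shatters(dichotomy_fn, points):
    """True when all 2^N dichotomies are realized on `points`."""
    return len(dichotomy_fn(points)) == 2 ** len(points)

def vc_dim_estimate(dichotomy_fn, max_k=6, trials=500, rng=np.random.default_rng(7)):
    """Largest k for which some random k-point set is shattered (brute-force search)."""
    d = 0
    for k in range(1, max_k + 1):
        if any(shatters(dichotomy_fn, rng.uniform(0, 1, k)) for _ in range(trials)):
            d = k
    return d

print(vc_dim_estimate(ray_dichotomies))   # 1: positive rays shatter one point but never two
```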
Important Remarks
Remark 1
if V Cdim (H) = d, there exists a set of size d that can be fully
shattered.
Remark 2
This does not imply that all sets of size d or less are fully shattered
This is typically the case!!!
Now, we define B(N, k)
Definition
B(N, k) is the maximum number of dichotomies on N points such
that no subset of size k of the N points can be shattered by these
dichotomies.
Something Notable
The definition of B(N, k) assumes a break point k!!!
Further
Since B(N, k) is a maximum,
it is an upper bound for m_H(N) under a break point k:

m_H(N) ≤ B(N, k)  if k is a break point for H.

Then
we need to find a bound for B(N, k) to prove that m_H(N) is polynomial.
Why?
Something Notable
B(N, 1) = 1 for all N: if no subset of size 1 can be shattered,
then only one dichotomy can be allowed,
because a second, different dichotomy would have to differ on at least one point,
and then that subset of size 1 would be shattered.
Second
B(1, k) = 2 for k > 1, since there do not even exist subsets of size k:
the constraint is vacuously true and we have the 2 possible dichotomies, +1 and −1.
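Assuming the recursive bound B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1) that the following slides develop from the partition, a short sketch (illustrative, treating the bound as an equality to tabulate it) shows that B(N, k) grows only polynomially in N for a fixed break point k:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def B(N, k):
    """Tabulate the bound using the base cases on this slide plus the recursion
    B(N, k) <= B(N-1, k) + B(N-1, k-1) (assumed here; derived in the next slides)."""
    if k == 1:
        return 1    # no 1-point subset may be shattered -> only one dichotomy
    if N == 1:
        return 2    # k > 1: the constraint is vacuous -> both dichotomies
    return B(N - 1, k) + B(N - 1, k - 1)

# With these base cases the recursion yields sum_{i=0}^{k-1} C(N, i),
# a polynomial of degree k - 1 in N, far below 2^N.
for N in (5, 10, 20):
    print(N, B(N, 3), 2 ** N)
```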
106 / 135
191. Images/cinvestav
Why?
Something Notable
B(N, 1) = 1 for all N since if no subset of size 1 can be shattered
Then only one dichotomy can be allowed.
Because a second different dichotomy must differ on at least one point
and then that subset of size 1 would be shattered.
Second
B(1, k) = 2 for k > 1 since there do not even exist subsets of size k.
Because the constraint is vacuously true and we have 2 possible
dichotomies +1 and −1.
106 / 135
192. Images/cinvestav
Why?
Something Notable
B(N, 1) = 1 for all N since if no subset of size 1 can be shattered
Then only one dichotomy can be allowed.
Because a second different dichotomy must differ on at least one point
and then that subset of size 1 would be shattered.
Second
B(1, k) = 2 for k > 1 since there do not even exist subsets of size k.
Because the constraint is vacuously true and we have 2 possible
dichotomies +1 and −1.
106 / 135
193. Images/cinvestav
Why?
Something Notable
B(N, 1) = 1 for all N since if no subset of size 1 can be shattered
Then only one dichotomy can be allowed.
Because a second different dichotomy must differ on at least one point
and then that subset of size 1 would be shattered.
Second
B(1, k) = 2 for k > 1 since there do not even exist subsets of size k.
Because the constraint is vacuously true and we have 2 possible
dichotomies +1 and −1.
106 / 135
194. Images/cinvestav
Why?
Something Notable
B(N, 1) = 1 for all N since if no subset of size 1 can be shattered
Then only one dichotomy can be allowed.
Because a second different dichotomy must differ on at least one point
and then that subset of size 1 would be shattered.
Second
B(1, k) = 2 for k > 1 since there do not even exist subsets of size k.
Because the constraint is vacuously true and we have 2 possible
dichotomies +1 and −1.
106 / 135
What does this partition mean?
First, consider the dichotomies on x1 x2 · · · xN−1
Some appear only ONCE (with either +1 or −1 at xN).
We collect them in S1.
The remaining dichotomies appear twice
Once with +1 and once with −1 in the xN column.
Therefore, we collect them in three sets
The ones appearing with only one dichotomy
We use the set S1; let α be the number of dichotomies in S1.
The others go into two different sets
S2+: the ones with xN = +1.
S2−: the ones with xN = −1.
Let β be the number of dichotomies in S2+ (equivalently, in S2−).
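A small sketch of this bookkeeping (not from the slides): given a list of dichotomies on x1, ..., xN, it splits them into S1, S2+ and S2− by counting how often each prefix on the first N − 1 points appears, and reports α and β. The example dichotomy list and the helper name partition_dichotomies are hypothetical.

from collections import Counter

def partition_dichotomies(dichotomies):
    """Split dichotomies on x1..xN into S1 (prefix on x1..xN-1 appears once)
    and S2+/S2- (prefix appears with both labels at xN), as in the B(N, k) argument."""
    prefix_count = Counter(d[:-1] for d in dichotomies)
    s1 = [d for d in dichotomies if prefix_count[d[:-1]] == 1]
    s2_plus = [d for d in dichotomies if prefix_count[d[:-1]] == 2 and d[-1] == +1]
    s2_minus = [d for d in dichotomies if prefix_count[d[:-1]] == 2 and d[-1] == -1]
    return s1, s2_plus, s2_minus

# A hypothetical maximal set of dichotomies on N = 3 points with break point k = 2
dichotomies = [(-1, -1, -1), (-1, -1, +1), (-1, +1, -1), (+1, -1, -1)]
s1, s2p, s2m = partition_dichotomies(dichotomies)
alpha, beta = len(s1), len(s2p)
print("alpha =", alpha, "beta =", beta, "alpha + 2*beta =", alpha + 2 * beta)   # 2, 1, 4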
Therefore
We have the following
B(N, k) = α + 2β
The total number of different dichotomies on the first N − 1 points
It is α + β.
Additionally, no subset of size k of these first N − 1 points can be shattered by these dichotomies
Since no k-subset of all N points can be shattered:
α + β ≤ B(N − 1, k)
By definition of B.
Then
Further, no subset of size k − 1 of the first N − 1 points can be shattered by the dichotomies in S2+
If there existed such a subset, then taking the corresponding dichotomies in S2− and adding the point xN,
you would obtain a subset of size k that can be shattered, a contradiction with the definition of B(N, k).
Therefore
β ≤ B(N − 1, k − 1)
Then, we have
B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)
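As a quick worked instance (not on the slides), take N = 3 and break point k = 2, and unroll the recursion from the base cases B(1, 2) = 2 and B(·, 1) = 1:
B(2, 2) ≤ B(1, 2) + B(1, 1) = 2 + 1 = 3
B(3, 2) ≤ B(2, 2) + B(2, 1) ≤ 3 + 1 = 4
which matches the sum of binomial coefficients 1 + 3 = 4 appearing in Sauer's Lemma below.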
Connecting the Growth Function with the VCdim
Sauer's Lemma
For all k ∈ ℕ, the following inequality holds:
B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i}
Proof
For k = 1
B(N, 1) \le B(N-1, 1) + B(N-1, 0) = 1 + 0 = 1 = \binom{N}{0}
Then, by induction
We assume that the statement is true for all N ≤ N0 and all k.
Now
We need to prove the statement for N = N0 + 1 and all k
Observation: this is true for k = 1, given that B(N, 1) = 1.
Now, consider k ≥ 2; by the recursion, B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1).
Therefore, applying the induction hypothesis to both terms
B(N_0 + 1, k) \le \sum_{i=0}^{k-1} \binom{N_0}{i} + \sum_{i=0}^{k-2} \binom{N_0}{i}
Therefore
We have the following
B(N_0 + 1, k) \le 1 + \sum_{i=1}^{k-1} \binom{N_0}{i} + \sum_{i=1}^{k-1} \binom{N_0}{i-1}
             = 1 + \sum_{i=1}^{k-1} \left[ \binom{N_0}{i} + \binom{N_0}{i-1} \right]
             = 1 + \sum_{i=1}^{k-1} \binom{N_0 + 1}{i}
             = \sum_{i=0}^{k-1} \binom{N_0 + 1}{i}
Because (Pascal's rule)
\binom{N_0}{i} + \binom{N_0}{i-1} = \binom{N_0 + 1}{i}
Now
We have, in conclusion, for all k
B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i}
Therefore, if k is a break point for H
m_H(N) \le B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i}
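A small numeric sanity check (a sketch, not from the slides): build the B(N, k) upper-bound table from the base cases and the recursion above, and compare it with the binomial sum of Sauer's Lemma. The helper names b_upper_bound and sauer_sum are hypothetical.

from math import comb

def b_upper_bound(n_max, k_max):
    """Table of upper bounds for B(N, k) from B(N, 1) = 1, B(1, k) = 2 (k > 1)
    and the recursion B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    table = {}
    for n in range(1, n_max + 1):
        for k in range(1, k_max + 1):
            if k == 1:
                table[(n, k)] = 1
            elif n == 1:
                table[(n, k)] = 2
            else:
                table[(n, k)] = table[(n - 1, k)] + table[(n - 1, k - 1)]
    return table

def sauer_sum(n, k):
    """Right-hand side of Sauer's Lemma: sum_{i=0}^{k-1} C(n, i)."""
    return sum(comb(n, i) for i in range(k))

table = b_upper_bound(10, 5)
assert all(table[(n, k)] <= sauer_sum(n, k) for n in range(1, 11) for k in range(1, 6))
print("B(10, 3) bound from the recursion:", table[(10, 3)])    # 56
print("Sauer sum for N = 10, k = 3:      ", sauer_sum(10, 3))  # 1 + 10 + 45 = 56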
Therefore
Starting from the bound of Sauer's Lemma, and using the fact that (N/k)^{k−i} ≥ 1 for i ≤ k − 1 (assuming N ≥ k), we have
m_H(N) \le \left(\frac{N}{k}\right)^{k} \sum_{i=0}^{N} \binom{N}{i} \left(\frac{k}{N}\right)^{i} = \left(\frac{N}{k}\right)^{k} \left(1 + \frac{k}{N}\right)^{N}
Given that 1 + x ≤ e^{x}
m_H(N) \le \left(\frac{N}{k}\right)^{k} e^{k} = \left(\frac{e}{k}\right)^{k} N^{k} = O\left(N^{k}\right)
Therefore
We have that
m_H(N) is bounded by the polynomial N^{k−1} + 1; i.e., if m_H(k) < 2^k for some k (a break point exists), then m_H(N) grows only polynomially in N.
We are not depending on the number of hypotheses!!!!
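As a rough numeric illustration (not from the slides): for N = 100 and break point k = 3, Sauer's Lemma gives mH(100) ≤ 1 + 100 + 4950 = 5051 ≤ N^{k−1} + 1 = 10001, while (eN/k)^k ≈ 7.4 × 10^5; all of these are vanishingly small compared with the 2^100 ≈ 1.3 × 10^30 dichotomies possible without a break point.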
Remark about m_H(N)
We have bounded the number of effective hypotheses
Yes!!! We can have M hypotheses, but the number of dichotomies they generate on N points is bounded by m_H(N).
VC-Dimension Again
Definition
The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H (those points need to be in "general position"):
VCdim(H) = max{k | mH(k) = 2^k}
Something Notable
If mH(N) = 2^N for all N, then VCdim(H) = ∞.
Remember
We have the following
E_{in}(g) < E_{out}(g) + \sqrt{\frac{1}{2N} \ln\frac{2M}{\delta}}
Instead of using M, we use m_H(N)
We can use the growth function as the effective way to bound:
E_{in}(g) < E_{out}(g) + \sqrt{\frac{1}{2N} \ln\frac{2\, m_H(N)}{\delta}}
VC Generalization Bound
Theorem (VC Generalization Bound)
For any tolerance δ > 0 and any hypothesis set H with VCdim(H) = k, with probability at least 1 − δ,
E_{in}(g) < E_{out}(g) + \sqrt{\frac{2k}{N}\ln\frac{eN}{k}} + \sqrt{\frac{1}{2N}\ln\frac{1}{\delta}}
Something Notable
This bound only fails when VCdim(H) = ∞!!!
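To see how the deviation term behaves with the sample size, here is a minimal sketch (not from the slides) that evaluates it for a hypothetical VCdim of k = 3 and δ = 0.05; vc_bound_term is a hypothetical helper.

from math import log, sqrt, e

def vc_bound_term(n, k, delta):
    """Deviation term in the bound above: sqrt(2k/N ln(eN/k)) + sqrt(ln(1/delta)/(2N))."""
    return sqrt(2 * k / n * log(e * n / k)) + sqrt(log(1 / delta) / (2 * n))

k, delta = 3, 0.05                      # hypothetical VC dimension and tolerance
for n in (100, 1_000, 10_000, 100_000):
    print(f"N = {n:>6}: deviation term = {vc_bound_term(n, k, delta):.3f}")

The term shrinks roughly like sqrt(ln N / N), so more data tightens the guarantee for any fixed, finite VC dimension.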
Proof
Although we will not go through it here
We remark that it is possible to use the Rademacher complexity to manage the number of overlapping hypotheses (which can be infinite).
We will stop here, but
I encourage you to look further into the proof...
About the Proof
For more, take a look at
"A Probabilistic Theory of Pattern Recognition" by Luc Devroye et al.
"Foundations of Machine Learning" by Mehryar Mohri et al.
This is the equivalent of using Measure Theory to understand the innards of Probability
We are professionals, we must understand!!!
G-composition of H
Let G be a layered directed acyclic graph
Where directed edges go from one layer l to the next layer l + 1.
Now, we have a hypothesis set H
N input nodes with in-degree 0.
Intermediate nodes with in-degree r.
A single output node with out-degree 0.
H is our hypothesis set over the Euclidean space R^r
Basically, each node represents a hypothesis ci : R^r → {−1, 1}, by means of tanh.
Therefore
We have that
The whole network represents a hypothesis from R^N to {−1, 1}.
Therefore, the entire hypothesis is a composition of the node concepts.
The resulting class of hypotheses is called the G-composition of H, denoted H_G.
We have the following theorem
Theorem (Kearns and Vazirani, 1994)
Let G be a layered directed acyclic graph with N input nodes and s ≥ 2 internal nodes, each of in-degree r.
Let H be a hypothesis set over R^r with VCdim(H) = d, and let H_G be the G-composition of H. Then
VCdim(H_G) ≤ 2ds log2(es)
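To get a feel for the numbers, here is a minimal sketch (not from the slides) that evaluates this bound for a hypothetical small network; it assumes each internal node is a linear threshold unit on R^r, whose VC dimension is d = r + 1, and that there are s internal nodes.

from math import log2, e, ceil

def g_composition_vc_bound(d, s):
    """Kearns-Vazirani bound on the VC dimension of the G-composition: 2*d*s*log2(e*s)."""
    return 2 * d * s * log2(e * s)

r = 10          # in-degree of each internal node (hypothetical)
d = r + 1       # VC dimension of a single linear threshold unit on R^r (assumption)
s = 5           # number of internal nodes (hypothetical)
print("VCdim(H_G) <=", ceil(g_composition_vc_bound(d, s)))   # about 415

With r = 10 and s = 5 the bound is roughly 415, already much larger than the d = 11 of a single node, which is the price paid for composing hypotheses.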