Introduction to Machine Learning
Vapnik–Chervonenkis Dimension
Andres Mendez-Vazquez
July 22, 2018
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the VC Dimension
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
Until Now
We have been learning
A lot of functions to approximate the model f of a given data set D.
[Figure: a phenomenon generating samples, what we can observe, and the data observation/collection step]
The Question
But we never asked ourselves
Are we really able to learn f from D?
Example
Consider the following data set D
Consider a Boolean target function over a three-dimensional input space X = {0, 1}^3
With a data set D

n   x_n   y_n
1   000   0
2   001   1
3   010   1
4   011   0
5   100   1
We have the following
The input space has 2^3 = 8 possible points
Therefore, there are 2^{2^3} = 256 possible Boolean functions f, of which 2^3 = 8 agree with D on the five observed points
Learning outside the data D: basically, we want a g that generalizes outside D

n   x_n   y_n   g   f1  f2  f3  f4  f5  f6  f7  f8
1   000   0     0   0   0   0   0   0   0   0   0
2   001   1     1   1   1   1   1   1   1   1   1
3   010   1     1   1   1   1   1   1   1   1   1
4   011   0     0   0   0   0   0   0   0   0   0
5   100   1     1   1   1   1   1   1   1   1   1
6   101   ?     ?   0   0   0   0   1   1   1   1
7   110   ?     ?   0   0   1   1   0   0   1   1
8   111   ?     ?   0   1   0   1   0   1   0   1
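To make this concrete, here is a minimal Python sketch (not part of the original slides) that enumerates the 2^3 = 8 candidate targets f1, ..., f8 of the table above: all of them agree with D on the five observed points, yet for each unseen input they split evenly between the two labels, so no choice of g can be guaranteed to match the true f outside D.

from itertools import product

# The data set D: five of the eight points of X = {0,1}^3 with their labels.
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

# The three inputs that are not observed in D.
unseen = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]

# Every assignment of labels to the unseen points gives a target f
# that is fully consistent with D: 2^3 = 8 candidates in total.
candidates = []
for labels in product([0, 1], repeat=len(unseen)):
    f = dict(D)
    f.update(zip(unseen, labels))
    candidates.append(f)

print(len(candidates))          # 8 candidate target functions
# All candidates agree on D ...
print(all(all(f[x] == y for x, y in D.items()) for f in candidates))
# ... but on each unseen point the candidates split evenly between 0 and 1,
# so no single g can match all of them outside D.
for x in unseen:
    votes = [f[x] for f in candidates]
    print(x, votes.count(0), votes.count(1))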
Here is the Dilemma!!!
Each of f1, f2, ..., f8
is a possible real f, the true f.
Any of them is an equally possible and equally good f.
Therefore
The quality of the learning will be determined by how close our prediction is to the true value.
Therefore, we have
In order to select a g, we need a hypothesis set H
so that our training procedure can select such a g.
Further, any of f1, f2, ..., f8 is a good choice for f
Therefore, it does not matter how closely we match the bits in D
Our problem is that we want to generalize to the data outside D
However, whether our hypothesis is correct or incorrect on D makes no difference for that goal.
We want to Generalize
But, if we use only a deterministic approach to H,
our attempts to use H to learn g are a waste of time!!!
Consider a "bin" with red and blue marbles
Going back to our example
[Figure: a bin of red and blue marbles, and samples drawn from it]
Therefore
We have the "real probabilities"

P[\text{Pick a red marble}] = \mu
P[\text{Pick a blue marble}] = 1 - \mu

However, the value of µ is not known
Thus, we draw N samples from the bin in an independent way.
Here, the fraction of red marbles in the sample is equal to ν
Question: Can ν be used to learn about µ?
Two Answers... Possible vs. Probable
No!!! Because we can only see the samples
For example, the sample can be mostly blue while the bin is mostly red.
Yes!!!
The sample frequency ν is likely close to the bin frequency µ.
What does ν say about µ?
We have the following hypothesis
In a big sample (large N), ν is probably close to µ (within ε).
How?
Hoeffding’s Inequality.
We have the following theorem
Theorem (Hoeffding’s inequality)
Let Z_1, ..., Z_N be independent bounded random variables with Z_i \in [a, b] for all i, where -\infty < a \leq b < \infty. Then

P\left(\frac{1}{N}\sum_{i=1}^{N}(Z_i - E[Z_i]) \geq t\right) \leq \exp\left(-\frac{2Nt^2}{(b-a)^2}\right)

and

P\left(\frac{1}{N}\sum_{i=1}^{N}(Z_i - E[Z_i]) \leq -t\right) \leq \exp\left(-\frac{2Nt^2}{(b-a)^2}\right)

for all t \geq 0.
Therefore
Assume that the Z_i are the random variables obtained from the N samples
Then the values Z_i \in \{0, 1\} (so a = 0, b = 1 and E[Z_i] = \mu), therefore we have...
First inequality, for any \epsilon > 0 and N

P\left(\frac{1}{N}\sum_{i=1}^{N} Z_i - \mu \geq \epsilon\right) \leq \exp\left(-2N\epsilon^2\right)

Second inequality, for \epsilon > 0 and N

P\left(\frac{1}{N}\sum_{i=1}^{N} Z_i - \mu \leq -\epsilon\right) \leq \exp\left(-2N\epsilon^2\right)
Here
We can use the fact that

\nu = \frac{1}{N}\sum_{i=1}^{N} Z_i

Putting it all together, we have

P(\nu - \mu \geq \epsilon \text{ or } \nu - \mu \leq -\epsilon) \leq P(\nu - \mu \geq \epsilon) + P(\nu - \mu \leq -\epsilon)

Finally

P(|\nu - \mu| \geq \epsilon) \leq 2\exp\left(-2N\epsilon^2\right)
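As a sanity check of this inequality, the following sketch (not from the slides; the values of µ, N and ε are illustrative) estimates P(|ν − µ| ≥ ε) by repeatedly sampling the bin and compares it with the bound 2·exp(−2Nε²).

import random
import math

def deviation_probability(mu=0.6, N=1000, eps=0.05, trials=20000, seed=0):
    # Estimate P(|nu - mu| >= eps) by repeated sampling from the bin.
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(N)) / N   # sample frequency
        if abs(nu - mu) >= eps:
            bad += 1
    return bad / trials

empirical = deviation_probability()
hoeffding = 2 * math.exp(-2 * 1000 * 0.05 ** 2)   # 2*exp(-2*N*eps^2)
print(empirical, "<=", hoeffding)   # the empirical value stays below the bound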
Therefore
We have the following
If ε is small enough and N is large, then with high probability ν ≈ µ.
Making Possible
Possible to estimate ν ≈ µ
How do we connect this with learning?
Learning
We want to find a function f : X → Y which is unknown!!!
Here we assume that each ball in the bin is a sample x ∈ X.
Thus, it is necessary to select a hypothesis
Basically, we want to have a hypothesis h:
If h(x) = f(x), we color the sample blue.
If h(x) ≠ f(x), we color the sample red.
Here a Small Remark
Here, we are not talking about classes
When talking about blue and red balls, we only ask whether we identify the correct label:

y_h = h(x) = f(x) = y

or

y_h = h(x) \neq f(x) = y

Still, the use of blue and red balls allows us
to see our learning problem as a Bernoulli distribution
Swiss mathematician Jacob Bernoulli
Definition
The Bernoulli distribution is a discrete distribution having two possible outcomes, X = 0 or X = 1,
with the following probabilities

P(X|p) = \begin{cases} 1 - p & \text{if } X = 0 \\ p & \text{if } X = 1 \end{cases}

Also expressed as

P(X = k|p) = p^k (1 - p)^{1-k}
Thus
We define E_in (in-sample error)

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} I\left(h(x_n) \neq f(x_n)\right)

We have made explicit the dependency of E_in on the particular h that we are considering.
Now E_out (out-of-sample error)

E_{out}(h) = P\left(h(x) \neq f(x)\right) = \mu

Where
The probability is based on the distribution P over X which is used to sample the data points x.
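A minimal sketch of these two quantities on a toy one-dimensional problem (the target, the hypothesis and the input distribution below are illustrative assumptions, not from the slides): E_in is computed on the N sampled points, while E_out is approximated with a large fresh sample from the same distribution P.

import random

random.seed(1)

def f(x):            # the (unknown) target: sign of x, encoded as +1/-1
    return 1 if x >= 0.0 else -1

def h(x):            # a candidate hypothesis: a slightly shifted threshold
    return 1 if x >= 0.1 else -1

def error(hyp, points):
    # fraction of points where the hypothesis disagrees with the target
    return sum(hyp(x) != f(x) for x in points) / len(points)

sample = [random.uniform(-1, 1) for _ in range(100)]              # the data set D
E_in = error(h, sample)
E_out = error(h, [random.uniform(-1, 1) for _ in range(100000)])  # fresh points
print(E_in, E_out)   # both should be close to 0.05 = P(0 <= x < 0.1)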
Generalization Error
Definition (Generalization Error / out-of-sample error)
Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and an underlying distribution D, the generalization error or risk of h is defined by

R(h) = P_{x \sim D}\left(h(x) \neq f(x)\right) = E_{x \sim D}\left[I_{h(x) \neq f(x)}\right]

where I_\omega is the indicator function of the event \omega. (This comes from the fact that 1 \cdot P(A) + 0 \cdot P(A^c) = E[I_A].)
Empirical Error
Definition (Empirical Error / in-sample error)
Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and a sample X = {x_1, x_2, ..., x_N}, the empirical error or empirical risk of h is defined by

\widehat{R}(h) = \frac{1}{N}\sum_{i=1}^{N} I_{h(x_i) \neq f(x_i)}
Basically
We have, for a single fixed h,

P(|E_{in}(h) - E_{out}(h)| \geq \epsilon) \leq 2\exp\left(-2N\epsilon^2\right)

Now, we need to consider an entire set of hypotheses, H

H = \{h_1, h_2, ..., h_M\}
Therefore
Each Hypothesis is a scenario in the bin space
Remark
The Hoeffding Inequality still applies to each bin individually
Now, we need to consider all the bins simultaneously.
Here, we have the following situation
h is fixed before the data set is generated!!!
If you are allowed to change h after you generate the data set
The Hoeffding Inequality no longer holds
Therefore
With multiple hypotheses in H
The learning algorithm chooses the final hypothesis g based on D, i.e., after generating the data.
The statement we would like to make is not
"P(|E_{in}(h_m) - E_{out}(h_m)| \geq \epsilon) is small" for a fixed h_m.
We would rather have
"P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) is small" for the final hypothesis g.
Therefore
Something Notable
The hypothesis g is not fixed ahead of time, before generating the data.
Thus we need to bound

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon)

in a way that does not depend on which g the algorithm picks.
We have two rules
First one
If A_1 \Rightarrow A_2, then P(A_1) \leq P(A_2)
Second, the union bound: if you have any set of events A_1, A_2, ..., A_M

P(A_1 \cup A_2 \cup \cdots \cup A_M) \leq \sum_{m=1}^{M} P(A_m)
Therefore
Since the final hypothesis g is one of h_1, ..., h_M, we have the implication

|E_{in}(g) - E_{out}(g)| \geq \epsilon \;\Rightarrow\; |E_{in}(h_1) - E_{out}(h_1)| \geq \epsilon
\text{ or } |E_{in}(h_2) - E_{out}(h_2)| \geq \epsilon
\;\cdots\;
\text{ or } |E_{in}(h_M) - E_{out}(h_M)| \geq \epsilon
Thus
Applying the first rule to this implication, we have

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq P\big[|E_{in}(h_1) - E_{out}(h_1)| \geq \epsilon
\text{ or } |E_{in}(h_2) - E_{out}(h_2)| \geq \epsilon
\;\cdots\;
\text{ or } |E_{in}(h_M) - E_{out}(h_M)| \geq \epsilon\big]
Then
By the union bound, we have

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq \sum_{m=1}^{M} P\left(|E_{in}(h_m) - E_{out}(h_m)| \geq \epsilon\right)

Thus, applying Hoeffding to each h_m,

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq 2M\exp\left(-2N\epsilon^2\right)
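The role of the factor M can also be checked numerically. The following sketch (illustrative assumptions, not from the slides) fixes M threshold hypotheses before the data is generated, draws many data sets, and estimates the probability that at least one hypothesis has |E_in − E_out| ≥ ε; this stays below 2M·exp(−2Nε²), although the bound is quite loose here because the hypotheses are far from independent.

import random
import math

def worst_deviation_probability(M=20, N=200, eps=0.1, trials=5000, seed=0):
    rng = random.Random(seed)
    thresholds = [m / M for m in range(M)]        # M fixed hypotheses h_m(x) = sign(x - t)
    # For x uniform on [-1, 1] and target f(x) = sign(x), E_out(h_m) = t/2.
    e_out = [t / 2 for t in thresholds]
    bad = 0
    for _ in range(trials):
        xs = [rng.uniform(-1, 1) for _ in range(N)]   # one data set D
        for t, eo in zip(thresholds, e_out):
            e_in = sum((x >= t) != (x >= 0) for x in xs) / N
            if abs(e_in - eo) >= eps:
                bad += 1
                break                                  # at least one hypothesis deviates
    return bad / trials

print(worst_deviation_probability(), "<=", 2 * 20 * math.exp(-2 * 200 * 0.1 ** 2))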
We have
Something Notable
We have introduced two apparently conflicting arguments about the feasibility of learning.
We have two possibilities
One argument says that we cannot learn anything outside of D.
The other says it is possible!!!
Here, we introduce the probabilistic answer
This will solve our conundrum!!!
Then
The Deterministic Answer
Do we have something to say about f outside of D? The answer is
NO.
The Probabilistic Answer
Is D telling us something likely about f outside of D? The answer is
YES
The reason why
We approach our Learning from a Probabilistic point of view!!!
For example
We could have hypotheses based on hyperplanes
Linear regression output:

h(x) = \sum_{i=1}^{d} w_i x_i = w^T x

Therefore

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} \left(h(x_n) - y_n\right)^2
Clearly, we have used loss functions
Mostly to give meaning to h ≈ f
By error measures E(h, f)
Defined by using pointwise errors

e(h(x), f(x))

Examples
Squared error: e(h(x), f(x)) = [h(x) - f(x)]^2
Binary error: e(h(x), f(x)) = I[h(x) \neq f(x)]
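A minimal sketch of the two pointwise error measures and the in-sample error they induce (the hypotheses and sample below are illustrative, not from the slides).

def squared_error(h_x, f_x):
    # e(h(x), f(x)) = [h(x) - f(x)]^2, used for real-valued targets
    return (h_x - f_x) ** 2

def binary_error(h_x, f_x):
    # e(h(x), f(x)) = I[h(x) != f(x)], used for classification
    return int(h_x != f_x)

def in_sample_error(pointwise_e, h, f, xs):
    # E_in(h): the average of the pointwise errors over the sample
    return sum(pointwise_e(h(x), f(x)) for x in xs) / len(xs)

# Example usage with illustrative hypotheses on a small sample.
xs = [-1.0, 0.05, 0.3, 0.8]
print(in_sample_error(binary_error, lambda x: x > 0.1, lambda x: x > 0.0, xs))  # 0.25
print(in_sample_error(squared_error, lambda x: 0.5 * x, lambda x: x, xs))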
Therefore, we have
The Overall Error
E(h, f) = average of the pointwise errors e(h(x), f(x))
In-sample error

E_{in}(h) = \frac{1}{N}\sum_{i=1}^{N} e\left(h(x_i), f(x_i)\right)

Out-of-sample error

E_{out}(h) = E_{x}\left[e(h(x), f(x))\right]
We have the following Process
Assuming P(y|x) instead of y = f(x)
Then a data point (x, y) is now generated by the joint distribution

P(x, y) = P(x) P(y|x)

Therefore
A noisy target is a deterministic target f(x) = E[y|x] plus added noise y - f(x).
Finally, we have the Learning Process
[Figure: the learning diagram — an unknown target distribution P(y|x), an input probability distribution P(x), and the training examples drawn from them]
Therefore
Distinction between P (y|x) and P (x)
Both convey probabilistic aspects of x and y.
Therefore
1 The Target distribution P (y|x) is what we are trying to learn.
2 The Input distribution P (x) quantifies relative importance of x.
Finally
Merging P (x, y) = P (y|x) P (x) mixes the two concepts
Therefore
Learning is feasible because it is likely that

E_{out}(g) \approx E_{in}(g)

However, what we really need is g ≈ f, that is,

E_{out}(g) = P\left(g(x) \neq f(x)\right) \approx 0

How do we achieve this?

E_{out}(g) \approx E_{in}(g) = \frac{1}{N}\sum_{n=1}^{N} I\left(g(x_n) \neq f(x_n)\right)
Then
We make, at the same time,

E_{in}(g) \approx 0

to make the error of our selected hypothesis g with respect to the real function f small.
Learning splits into two questions
1 Can we make E_{out}(g) close enough to E_{in}(g)?
2 Can we make E_{in}(g) small enough?
Therefore, we have
Nice connection with the bias-variance trade-off
[Figure: in-sample and out-of-sample error as a function of model complexity]
We have that
The out-of-sample error

E_{out}(h) = P\left(h(x) \neq f(x)\right)

It measures how well our training on D
has generalized to data that we have not seen before.
Remark
E_out is based on the performance over the entire input space X.
Testing Data Set
Intuitively
We want to estimate the value of E_out using a sample of data points.
Something Notable
These points must be 'fresh' test points that have not been used for training.
Basically
Our testing set.
Thus
It is possible to define
the generalization error as the discrepancy between E_in and E_out.
Therefore
The Hoeffding Inequality is a way to characterize the generalization error with a probabilistic bound

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq 2M\exp\left(-2N\epsilon^2\right)

for any \epsilon > 0.
Reinterpreting This
Assume a tolerance level δ, for example δ = 0.0005
It is possible to say that with probability 1 − δ:

E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}
Proof
Taking the complement of the Hoeffding probability with the absolute value

P(|E_{out}(g) - E_{in}(g)| < \epsilon) \geq 1 - 2M\exp\left(-2N\epsilon^2\right)

Therefore, with at least that probability we have

-\epsilon < E_{out}(g) - E_{in}(g) < \epsilon

This implies

E_{out}(g) < E_{in}(g) + \epsilon
Therefore
We simply set

\delta = 2M\exp\left(-2N\epsilon^2\right)

Then, taking logarithms,

\ln\frac{2M}{\delta} = 2N\epsilon^2

Therefore

\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}
Generalization Bound
This inequality is known as a generalization bound

E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}
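In practice, the bound is often read as an "error bar" on E_in. A small helper (a sketch, not from the slides) evaluating ε = sqrt((1/2N) ln(2M/δ)) shows that it shrinks like 1/sqrt(N) but grows only logarithmically with M.

import math

def generalization_error_bar(N, M, delta):
    # epsilon = sqrt((1/(2N)) * ln(2M / delta)) from the Hoeffding-based bound
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# With probability at least 1 - delta, E_out(g) <= E_in(g) + error bar.
for N in (100, 1000, 10000):
    print(N, generalization_error_bar(N, M=1000, delta=0.05))
# The bar shrinks like 1/sqrt(N) but grows only logarithmically with M.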
We have
The following inequality also holds

-\epsilon < E_{out}(g) - E_{in}(g) \;\Rightarrow\; E_{out}(g) > E_{in}(g) - \epsilon

Thus
Not only do we want our hypothesis g to do well on the out-of-sample data,

E_{out}(g) < E_{in}(g) + \epsilon

but we also want to know how well we did with our H.
The bound E_{out}(g) > E_{in}(g) - \epsilon assures that it was not possible to do much better,
given any hypothesis with a higher in-sample error

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} I\left(h(x_n) \neq f(x_n)\right)

than g.
Therefore
But, we want to know how well we did with our H
Thus, E_{out}(g) > E_{in}(g) - \epsilon assures that it was not possible to do much better:
given any hypothesis h with a higher in-sample error than g,

E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} I\left(h(x_n) \neq f(x_n)\right),

it will also have a higher lower bound on its out-of-sample error, given

E_{out}(h) > E_{in}(h) - \epsilon
The Infiniteness of H
A problem with the error bound: its dependency on M

\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}

What happens when M becomes infinite?
The number of hypotheses in H becomes infinite,
thus the bound becomes infinite (and therefore useless).
Problem: almost all interesting learning models have an infinite H...
For example, our linear regression, h(x) = w^T x.
Therefore, we need to replace M
We need to find a finite substitute that takes a finite range of values
For this, we go back to the union of events

|E_{in}(h_1) - E_{out}(h_1)| \geq \epsilon \text{ or } |E_{in}(h_2) - E_{out}(h_2)| \geq \epsilon \;\cdots\; \text{ or } |E_{in}(h_M) - E_{out}(h_M)| \geq \epsilon
We have
This union is what guarantees (covers) the event |E_{in}(g) - E_{out}(g)| \geq \epsilon
Thus, we can look at the events B_m for which

|E_{in}(h_m) - E_{out}(h_m)| \geq \epsilon

Then

P(B_1 \text{ or } B_2 \cdots \text{ or } B_M) \leq \sum_{m=1}^{M} P[B_m]
Now, we have the following
Example
This union bound is a gross overestimate
Basically, if h_i and h_j are quite similar, the two events

|E_{in}(h_i) - E_{out}(h_i)| \geq \epsilon \text{ and } |E_{in}(h_j) - E_{out}(h_j)| \geq \epsilon

are likely to coincide!!!
We have
Something Notable
In a typical learning model, many hypotheses are indeed very similar.
The mathematical theory of generalization hinges on this observation
We only need to account for the overlaps between different hypotheses in order to substitute M.
Consider
A finite data set

X = \{x_1, x_2, ..., x_N\}

And we consider hypotheses h ∈ H such that

h : X \rightarrow \{-1, +1\}

When applied to X, each h gives an N-tuple (h(x_1), h(x_2), ..., h(x_N)) of ±1's.
Such an N-tuple is called a dichotomy
Given that it splits x_1, x_2, ..., x_N into two groups...
Dichotomy
Definition
Given a hypothesis set H, a dichotomy of a set X is one of the
possible ways of labeling the points of X using a hypothesis in H.
Examples of Dichotomies
Here the first dichotomy can be generated by a perceptron
[Figure: the dichotomy generated by a perceptron, with the points labeled Class +1 on one side of the separating boundary and Class -1 on the other]
Something Important
Each h ∈ H generates a dichotomy on x1, ..., xN
However, two different h’s may generate the same dichotomy if they
generate the same pattern
Remark
Definition
Let x_1, x_2, ..., x_N ∈ X. The dichotomies generated by H on these points are defined by

H(x_1, x_2, ..., x_N) = \{(h(x_1), h(x_2), ..., h(x_N)) \mid h \in H\}

Therefore
We can see H(x_1, x_2, ..., x_N) as the set of hypotheses as seen through the geometry of these points.
Thus
A large H(x_1, x_2, ..., x_N) means H is more diverse.
Growth function, Our Replacement of M
Definition
The growth function is defined for a hypothesis set H by

m_H(N) = \max_{x_1,...,x_N \in X} \# H(x_1, x_2, ..., x_N)

where # denotes the cardinality (number of elements) of a set.
Therefore
m_H(N) is the maximum number of dichotomies that can be generated by H on any N points.
This removes the dependency on the entire input space X.
Therefore
We have that
Like M, m_H(N) is a measure of the number of (effective) hypotheses in H.
However, we avoid considering all of X
Now we only consider N points instead of the entire X.
Upper Bound for mH (N)
First, we know that

H(x_1, x_2, ..., x_N) \subseteq \{-1, +1\}^N

Hence, the value of m_H(N) is at most \#\{-1, +1\}^N:

m_H(N) \leq 2^N
Therefore
If H is capable of generating all possible dichotomies on x_1, x_2, ..., x_N
Then

H(x_1, x_2, ..., x_N) = \{-1, +1\}^N \quad \text{and} \quad \#H(x_1, x_2, ..., x_N) = 2^N

We can say that
H can shatter x_1, x_2, ..., x_N
Meaning
H is as diverse as can be on this particular sample.
Shattering
Definition
A set X of N ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of X, that is, when

m_H(N) = 2^N
Example
Positive Rays
Imagine an input space on R, with H consisting of all hypotheses h : R → {−1, +1} of the form

h(x) = \text{sign}(x - a)

[Figure: N points on the real line; the threshold a assigns -1 to the points on its left and +1 to the points on its right]
Thus, we have that
As we change a, we get N + 1 different dichotomies
mH (N) = N + 1
Now, we have the case of positive intervals
H consists of all hypotheses in one dimension that return +1 within
some interval and −1 otherwise.
Therefore
We have
The line is again split by the points into N + 1 regions.
Furthermore
The dichotomy we get is decided by which two regions contain the end values of the interval.
Therefore, the number of such dichotomies is

\binom{N+1}{2}
Additionally
If both end values of the interval fall in the same region, we get the all-(−1) dichotomy.
Then

m_H(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1
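Both growth functions can be verified by brute force. The following sketch (not from the slides) enumerates the dichotomies that positive rays and positive intervals generate on N points of the real line and checks the counts N + 1 and ½N² + ½N + 1.

from itertools import combinations

def ray_dichotomies(points):
    # h(x) = sign(x - a): sliding the threshold a across the sorted points
    # produces every dichotomy (a point at the threshold is taken as -1 here).
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return {tuple(+1 if x > a else -1 for x in points) for a in cuts}

def interval_dichotomies(points):
    # +1 inside an interval (l, r), -1 outside; it suffices to place the
    # endpoints between consecutive points (plus the empty interval).
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    ds = {tuple([-1] * len(points))}          # empty interval: all -1
    for l, r in combinations(cuts, 2):
        ds.add(tuple(+1 if l < x < r else -1 for x in points))
    return ds

points = [0.3, 1.7, 2.2, 4.0, 5.5]            # N = 5 arbitrary points
N = len(points)
print(len(ray_dichotomies(points)), N + 1)                      # 6 6
print(len(interval_dichotomies(points)), N * (N + 1) // 2 + 1)  # 16 16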
Finally
In the case of convex sets in R^2
H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere.
Therefore
We have the following

m_H(N) = 2^N

This can be seen by placing the N points in convex position (e.g., on a circle), so that any dichotomy is realized by the convex hull of its positively labeled points.
Remember
We have that

P(|E_{in}(g) - E_{out}(g)| \geq \epsilon) \leq 2M\exp\left(-2N\epsilon^2\right)

What if m_H(N) replaces M?
If m_H(N) is polynomial in N, we have an excellent case!!!
Therefore, we need to prove that
m_H(N) is polynomial.
Break Point
Definition
If no data set of size k can be shattered by H, then k is said to be a break point for H:

m_H(k) < 2^k
Example
For the perceptron in two dimensions, we have a break point at k = 4
[Figure: three points can be shattered by a line, but no set of four points can be]
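The break point can be checked by brute force (a sketch, not from the slides; it assumes numpy and scipy are available). A labeling y of points X is realizable by a 2D perceptron iff the linear feasibility problem "find w, b with y_i(w·x_i + b) ≥ 1" has a solution; counting realizable labelings for 3 points in general position gives 2^3 = 8, while for 4 points fewer than 2^4 = 16 labelings are realizable.

from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def count_dichotomies(points):
    return sum(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]             # general position
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # e.g. the unit square
print(count_dichotomies(three))   # 8  = 2^3: three points can be shattered
print(count_dichotomies(four))    # 14 < 2^4: four points cannot (k = 4 is a break point)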
Important
Something Notable
In general, it is easier to find a break point for H than to compute the
full growth function for that H.
Using this concept
We are ready to define the concept of Vapnik–Chervonenkis (VC)
dimension.
VC-Dimension
Definition
The VC-dimension of a hypothesis set H is the size of the largest set
that can be fully shattered by H (Those points need to be in “General
Position”):
V Cdim (H) = max k|mH (k) = 2k
A set containing k points, for arbitrary k, is in general linear position
if and only if no (k − 1) −dimensional flat contains them all
100 / 135
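The definition can be turned directly into a brute-force computation for very small hypothesis sets. The sketch below (not from the slides, and using a toy class of closed intervals on the real line rather than the perceptron) enumerates the dichotomies realizable on k points and looks for the largest k with mH(k) = 2^k.

```python
# Sketch: estimate the VC dimension of the class of closed intervals [a, b] on
# the real line, h(x) = +1 iff a <= x <= b, directly from the definition
# VCdim = max{ k : m_H(k) = 2^k }.
from itertools import combinations

def interval_dichotomies(xs):
    """All distinct labelings of the points xs realizable by some interval."""
    labelings = {tuple(-1 for _ in xs)}          # interval containing no point
    for a in xs:
        for b in xs:
            if a <= b:                            # endpoints among the points suffice
                labelings.add(tuple(+1 if a <= x <= b else -1 for x in xs))
    return labelings

def growth_function(k, domain):
    """m_H(k) over subsets of a finite domain (for intervals, every set of
    distinct points realizes the maximum, so this equals the true m_H(k))."""
    return max(len(interval_dichotomies(sub)) for sub in combinations(domain, k))

domain = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
for k in range(1, 5):
    print(k, growth_function(k, domain), 2 ** k)
# m_H(1) = 2, m_H(2) = 4 = 2^2, m_H(3) = 7 < 8, so VCdim(intervals) = 2.
```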
Images/cinvestav
Important Remarks
Remark 1
If V Cdim (H) = d, there exists a set of size d that can be fully
shattered.
Remark 2
This does not imply that all sets of size d or less are fully shattered;
in fact, that is typically not the case!!!
101 / 135
Images/cinvestav
Why? General Linear Position
For example, in the Perceptron:
(Figure: a configuration of points that is not in general position.)
102 / 135
Images/cinvestav
Now, we define B(N, k)
Definition
B(N, k) is the maximum number of dichotomies on N points such
that no subset of size k of the N points can be shattered by these
dichotomies.
Something Notable
The definition of B(N, k) assumes a break point k!!!
103 / 135
Images/cinvestav
Further
Since B(N, k) is a maximum
It is an upper bound for mH (N) under a break point k.
mH (N) ≤ B (N, k) if k is a break point for H.
Then
We need to find a bound for B (N, k) to prove that mH (N) is
polynomial.
104 / 135
Images/cinvestav
Therefore
Thus, we start with two boundary conditions k = 1 and N = 1
B (N, 1) = 1
B (1, k) = 2   for k > 1
105 / 135
Images/cinvestav
Why?
Something Notable
B(N, 1) = 1 for all N: if no subset of size 1 can be shattered,
then only one dichotomy can be allowed,
because a second, different dichotomy would have to differ on at least one point,
and that subset of size 1 would then be shattered.
Second
B(1, k) = 2 for k > 1 since there do not even exist subsets of size k.
Because the constraint is vacuously true and we have 2 possible
dichotomies +1 and −1.
106 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
107 / 135
Images/cinvestav
B(N, k) Dichotomies, N ≥ 2 and k ≥ 2
Each row is a dichotomy on x1 x2 · · · xN−1 xN, grouped by its pattern on the first N − 1 points:
S1 (α rows): patterns on x1 · · · xN−1 that appear exactly once, with either xN = +1 or xN = −1, but not both.
S2+ (β rows): patterns on x1 · · · xN−1 that appear with both extensions; these rows have xN = +1.
S2− (β rows): the same β patterns on x1 · · · xN−1, now with xN = −1.
108 / 135
Images/cinvestav
What does this partition mean?
First, consider the dichotomies restricted to x1 x2 · · · xN−1
Some appear only once (with either +1 or −1 at xN, but not both).
We collect them in S1.
The remaining dichotomies appear twice:
once with +1 and once with −1 in the xN column.
109 / 135
Images/cinvestav
Therefore, we collect them in three sets
The ones with only one Dichotomy
We use the set S1
The others in two different sets:
S2+: the ones with xN = +1.
S2−: the ones with xN = −1.
110 / 135
Images/cinvestav
Therefore
We have the following
B (N, k) = α + 2β
The total number of different dichotomies on the first N − 1 points
is α + β.
Additionally, no subset of size k of these first N − 1 points can be
shattered by these dichotomies,
since no k-subset of all N points can be shattered:
α + β ≤ B (N − 1, k)
By definition of B.
111 / 135
Images/cinvestav
Then
Further, no subset of size k − 1 of the first N − 1 points can be
shattered by the dichotomies in S2+:
if there existed such a subset, then taking the corresponding set of
dichotomies in S2− together with the point xN,
you would finish with a subset of size k that can be shattered, a contradiction
with the definition of B (N, k).
Therefore
β ≤ B (N − 1, k − 1)
Then, we have
B (N, k) ≤ B(N − 1, k) + B (N − 1, k − 1)
112 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
113 / 135
Images/cinvestav
Connecting the Growth Function with the V Cdim
Sauer’s Lemma
For all k ∈ N , the following inequality holds:
$B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
114 / 135
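A quick way to convince yourself of the statement is to run the recurrence B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1) with the boundary conditions above and compare it with the binomial sum. The following sketch (not from the slides) does exactly that: treating the recurrence as an equality gives an upper bound for B(N, k), and it coincides with the closed-form sum.

```python
# Sketch: compute the recursive upper bound for B(N, k) from the boundary
# conditions B(N, 1) = 1 and B(1, k) = 2 (k > 1) plus the recurrence, and
# compare it with the Sauer bound  sum_{i=0}^{k-1} C(N, i).
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound obtained by running the recurrence with equality."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

def sauer_sum(N, k):
    return sum(comb(N, i) for i in range(k))

for N in range(1, 8):
    for k in range(1, 6):
        assert B_upper(N, k) == sauer_sum(N, k)
print("recursive bound matches sum_{i=0}^{k-1} C(N, i) for all tested N, k")
```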
Images/cinvestav
Proof
Proof
For k = 1
$B(N, 1) \leq B(N-1, 1) + B(N-1, 0) = 1 + 0 = \binom{N}{0}$
Then, by induction
We assume that the statement is true for N ≤ N0 and all k.
115 / 135
Images/cinvestav
Now
We need to prove this for N = N0 + 1 and all k
Observation: This is true for k = 1 given
B (N, 1) = 1
Now, consider k ≥ 2, where
$B(N_0 + 1, k) \leq B(N_0, k) + B(N_0, k - 1)$
Therefore
$B(N_0 + 1, k) \leq \sum_{i=0}^{k-1} \binom{N_0}{i} + \sum_{i=0}^{k-2} \binom{N_0}{i}$
116 / 135
Images/cinvestav
Therefore
We have the following
$B(N_0 + 1, k) \leq 1 + \sum_{i=1}^{k-1} \binom{N_0}{i} + \sum_{i=1}^{k-1} \binom{N_0}{i-1}$
$= 1 + \sum_{i=1}^{k-1} \left[ \binom{N_0}{i} + \binom{N_0}{i-1} \right]$
$= 1 + \sum_{i=1}^{k-1} \binom{N_0 + 1}{i}$
$= \sum_{i=0}^{k-1} \binom{N_0 + 1}{i}$
Because
$\binom{N_0}{i} + \binom{N_0}{i-1} = \binom{N_0 + 1}{i}$
117 / 135
Images/cinvestav
Now
We have in conclusion for all k
$B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
Therefore
$m_H(N) \leq B(N, k) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
118 / 135
Images/cinvestav
Then
Theorem
If $m_H(k) < 2^{k}$ for some value k, then
$m_H(N) \leq \sum_{i=0}^{k-1} \binom{N}{i}$
119 / 135
Images/cinvestav
Finally
Corollary
Let H be a hypothesis set with V Cdim (H) = k. Then, for all
N ≥ k,
$m_H(N) \leq \left(\frac{eN}{k}\right)^{k} = O\left(N^{k}\right)$
120 / 135
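A few numbers (not from the slides) make the corollary tangible: for a fixed VC dimension k, the Sauer sum and the polynomial (eN/k)^k grow slowly, while 2^N explodes. The choice k = 3 below is an arbitrary example.

```python
# Sketch: compare the Sauer sum, the polynomial bound (eN/k)^k, and 2^N.
from math import e, comb

k = 3
for N in (5, 10, 20, 50, 100):
    sauer = sum(comb(N, i) for i in range(k + 1))    # sum_{i=0}^{k} C(N, i)
    poly = (e * N / k) ** k                          # the corollary's bound
    print(f"N={N:3d}  Sauer sum={sauer:10d}  (eN/k)^k={poly:14.1f}  2^N={2**N:.3e}")
```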
Images/cinvestav
We have
Proof
$m_H(N) \leq \sum_{i=0}^{k} \binom{N}{i}$
$\leq \sum_{i=0}^{k} \binom{N}{i} \left(\frac{N}{k}\right)^{k-i}$   (using N ≥ k, so each factor $\left(\frac{N}{k}\right)^{k-i} \geq 1$)
$\leq \sum_{i=0}^{N} \binom{N}{i} \left(\frac{N}{k}\right)^{k-i}$
$= \left(\frac{N}{k}\right)^{k} \sum_{i=0}^{N} \binom{N}{i} \left(\frac{k}{N}\right)^{i}$
121 / 135
Images/cinvestav
Therefore
We have
$m_H(N) \leq \left(\frac{N}{k}\right)^{k} \sum_{i=0}^{N} \binom{N}{i} \left(\frac{k}{N}\right)^{i} = \left(\frac{N}{k}\right)^{k} \left(1 + \frac{k}{N}\right)^{N}$
Given that $1 + x \leq e^{x}$,
$m_H(N) \leq \left(\frac{N}{k}\right)^{k} \left(e^{k/N}\right)^{N} = \left(\frac{N}{k}\right)^{k} e^{k} = \left(\frac{e}{k}\right)^{k} N^{k} = O\left(N^{k}\right)$
122 / 135
Images/cinvestav
Therefore
We have that
$m_H(N)$ is bounded by a polynomial of degree k − 1 in N, i.e. if $m_H(k) < 2^{k}$, we have that
$m_H(N)$ is polynomial in N
We are not depending on the number of hypotheses!!!!
123 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
124 / 135
Images/cinvestav
Remark about mH (N)
We have bounded the number of effective hypotheses
Yes!!! We can have M hypotheses, but the number of
dichotomies generated by them is bounded by mH (N)
125 / 135
Images/cinvestav
VC-Dimension Again
Definition
The VC-dimension of a hypothesis set H is the size of the largest set
that can be fully shattered by H (Those points need to be in “General
Position”):
$VC_{dim}(H) = \max\left\{ k \mid m_H(k) = 2^{k} \right\}$
Something Notable
If $m_H(N) = 2^{N}$ for all N, then $VC_{dim}(H) = \infty$
126 / 135
Images/cinvestav
Remember
We have the following: with probability at least 1 − δ,
$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$
Instead of using M, we use mH (N)
We can use our growth function as the effective way to bound:
$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N} \ln \frac{2 m_H(N)}{\delta}}$
127 / 135
Images/cinvestav
VC Generalized Bound
Theorem (VC Generalized Bound)
For any tolerance δ > 0 and any hypothesis set H with
V Cdim (H) = k,
$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{2k}{N} \ln \frac{eN}{k}} + \sqrt{\frac{1}{2N} \ln \frac{1}{\delta}}$
with probability ≥ 1 − δ
Something Notable
This bound only fails when $VC_{dim}(H) = \infty$!!!
128 / 135
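As a rough numerical illustration (not from the slides), the generalization-gap term of this bound can be evaluated directly; the values of k, N, and δ below are arbitrary choices.

```python
# Sketch: evaluate sqrt((2k/N) ln(eN/k)) + sqrt((1/2N) ln(1/delta)) for a few N.
from math import sqrt, log, e

def vc_gap(N, k, delta):
    return sqrt((2 * k / N) * log(e * N / k)) + sqrt(log(1 / delta) / (2 * N))

k, delta = 10, 0.05
for N in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N={N:>9,d}   Eout <= Ein + {vc_gap(N, k, delta):.4f}")
```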
Images/cinvestav
Proof
Although we will not talk about it,
we remark that it is possible to use the Rademacher complexity
to manage the number of overlapping hypotheses (which can be
infinite).
We will stop here, but
I encourage you to look further into the proof...
129 / 135
Images/cinvestav
About the Proof
For more, take a look at:
“A Probabilistic Theory of Pattern Recognition” by Luc Devroye et al.
“Foundations of Machine Learning” by Mehryar Mohri et al.
This is the equivalent of using Measure Theory to understand the
innards of Probability
We are professionals, we must understand!!!
130 / 135
Images/cinvestav
Outline
1 Is Learning Feasible?
Introduction
The Dilemma
A Binary Problem, Solving the Dilemma
Hoeffding’s Inequality
Error in the Sample and Error in the Phenomena
Formal Definitions
Back to the Hoeffding’s Inequality
The Learning Process
Feasibility of Learning
Example
Overall Error
2 Vapnik-Chervonenkis Dimension
Theory of Generalization
Generalization Error
Reinterpretation
Subtlety
A Problem with M
Dichotomies
Shattering
Example of Computing mH (N)
What are we looking for?
Break Point
VC-Dimension
Partition B(N, k)
Connecting the Growth Function with the V Cdim
VC Generalization Bound Theorem
3 Example
Multi-Layer Perceptron
131 / 135
Images/cinvestav
As you remember from previous classes
We have architectures like
132 / 135
Images/cinvestav
G-composition of H
Let G be a layered directed acyclic graph
Where directed edges go from one layer l to the next layer l + 1.
Now, we have a set of hypotheses H
N input nodes with in-degree 0
Intermediate nodes with in-degree r
A single output node with out-degree 0
H is our hypothesis set over the Euclidean space R^r
Basically, each internal node represents a hypothesis ci : R^r → {−1, 1}, by
means of tanh.
133 / 135
Images/cinvestav
Therefore
We have that
The neural network represents a hypothesis from R^N to {−1, 1}
Therefore, the entire hypothesis is a composition of concepts
This is called a G-composition of H.
134 / 135
Images/cinvestav
We have the following theorem
Theorem (Kearns and Vazirani, 1994)
Let G be a layered directed acyclic graph with N input nodes and
s ≥ 2 internal nodes, each of in-degree at most r.
Let H be a hypothesis set over R^r with V Cdim (H) = d, and let HG be the
G-composition of H. Then
$VC_{dim}(H_G) \leq 2ds \log_2(es)$
135 / 135
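As a rough illustration (not from the slides), the bound can be evaluated for a concrete network. The sketch below assumes threshold units, for which the VC dimension of a single node with r inputs is d = r + 1 (a standard fact for halfspaces, used here as an assumption); the network size is an arbitrary example.

```python
# Sketch: plug an example network into VCdim(H_G) <= 2 d s log2(e s), where s
# is the number of internal nodes and d is the VC dimension of a single node.
from math import log2, e, ceil

def vc_upper_bound(d, s):
    """Kearns-Vazirani style upper bound on the VC dimension of the G-composition."""
    return 2 * d * s * log2(e * s)

# Example: a 10-5-1 network of threshold units.
# Internal nodes s = 5 + 1 = 6; each node sees at most r = 10 inputs, so d = r + 1.
r, s = 10, 6
d = r + 1
print(ceil(vc_upper_bound(d, s)))   # an upper bound on VCdim of the whole network
```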

More Related Content

Similar to 17 vapnik chervonenkis dimension

end behavior.....pptx
end behavior.....pptxend behavior.....pptx
end behavior.....pptxLunaLedezma3
 
BOLD Presentation Zane
BOLD Presentation ZaneBOLD Presentation Zane
BOLD Presentation ZaneZane Ricks
 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10chidabdu
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Hitomi Yanaka
 
Multiple comparison problem
Multiple comparison problemMultiple comparison problem
Multiple comparison problemJiri Haviger
 
8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)Greg Stevens
 
Applied logic: A mastery learning approach
Applied logic: A mastery learning approachApplied logic: A mastery learning approach
Applied logic: A mastery learning approachjohn6938
 
Critique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docxCritique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docxfaithxdunce63732
 
Machine Learning Foundations
Machine Learning FoundationsMachine Learning Foundations
Machine Learning FoundationsAlbert Y. C. Chen
 
AA Section 7-1
AA Section 7-1AA Section 7-1
AA Section 7-1Jimbo Lamb
 

Similar to 17 vapnik chervonenkis dimension (20)

end behavior.....pptx
end behavior.....pptxend behavior.....pptx
end behavior.....pptx
 
BOLD Presentation Zane
BOLD Presentation ZaneBOLD Presentation Zane
BOLD Presentation Zane
 
Active Learning - DCS Presentation
Active Learning - DCS PresentationActive Learning - DCS Presentation
Active Learning - DCS Presentation
 
PREDICT 422 - Module 1.pptx
PREDICT 422 - Module 1.pptxPREDICT 422 - Module 1.pptx
PREDICT 422 - Module 1.pptx
 
Lecture5 xing
Lecture5 xingLecture5 xing
Lecture5 xing
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10
 
Cognitive Load Theory for PPT
Cognitive Load Theory for PPT Cognitive Load Theory for PPT
Cognitive Load Theory for PPT
 
151028_abajpai1
151028_abajpai1151028_abajpai1
151028_abajpai1
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?
 
02 math essentials
02 math essentials02 math essentials
02 math essentials
 
Multiple comparison problem
Multiple comparison problemMultiple comparison problem
Multiple comparison problem
 
Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...
Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...
Deep Learning Opening Workshop - Improving Generative Models - Junier Oliva, ...
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)8R-wk15(Nov30-Dec4)
8R-wk15(Nov30-Dec4)
 
Applied logic: A mastery learning approach
Applied logic: A mastery learning approachApplied logic: A mastery learning approach
Applied logic: A mastery learning approach
 
Chi square analysis
Chi square analysisChi square analysis
Chi square analysis
 
Critique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docxCritique of image~$itique.docxCritique of imagecritique.do.docx
Critique of image~$itique.docxCritique of imagecritique.do.docx
 
Machine Learning Foundations
Machine Learning FoundationsMachine Learning Foundations
Machine Learning Foundations
 
AA Section 7-1
AA Section 7-1AA Section 7-1
AA Section 7-1
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 

Recently uploaded (20)

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 

17 vapnik chervonenkis dimension

  • 1. Introduction to Machine Learning Vapnik–Chervonenkis Dimension Andres Mendez-Vazquez July 22, 2018 1 / 135
  • 2. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 2 / 135
  • 3. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 3 / 135
  • 4. Images/cinvestav Until Now We have been learning A Lot of functions to approximate the models f of a given data set D. Data Observation/Data Collection What we can observePhenomea Generating Samples 4 / 135
  • 5. Images/cinvestav The Question But Never asked ourselves if Are we able to really learn f from D? 5 / 135
  • 6. Images/cinvestav Example Consider the following data set D Consider a Boolean target function over a three-dimensional input space X = {0, 1}3 With a data set D n xn yn 1 000 0 2 001 1 3 010 1 4 011 0 5 100 1 6 / 135
  • 7. Images/cinvestav Example Consider the following data set D Consider a Boolean target function over a three-dimensional input space X = {0, 1}3 With a data set D n xn yn 1 000 0 2 001 1 3 010 1 4 011 0 5 100 1 6 / 135
  • 8. Images/cinvestav We have the following We have the space of input has 23 possibilities Therefore, we have 223 possible functions for f Learning outside the data D, basically we want a g that generalize outside D n xn yn g f1 f2 f3 f4 f5 f6 f7 f8 1 000 0 0 0 0 0 0 0 0 0 0 2 001 1 1 1 1 1 1 1 1 1 1 3 010 1 1 1 1 1 1 1 1 1 1 4 011 0 0 0 0 0 0 0 0 0 0 5 100 1 1 1 1 1 1 1 1 1 1 6 101 ? 0 0 0 0 1 1 1 1 7 110 ? 0 0 1 1 0 0 1 1 7 110 ? 0 1 0 1 0 1 0 1 7 / 135
  • 9. Images/cinvestav We have the following We have the space of input has 23 possibilities Therefore, we have 223 possible functions for f Learning outside the data D, basically we want a g that generalize outside D n xn yn g f1 f2 f3 f4 f5 f6 f7 f8 1 000 0 0 0 0 0 0 0 0 0 0 2 001 1 1 1 1 1 1 1 1 1 1 3 010 1 1 1 1 1 1 1 1 1 1 4 011 0 0 0 0 0 0 0 0 0 0 5 100 1 1 1 1 1 1 1 1 1 1 6 101 ? 0 0 0 0 1 1 1 1 7 110 ? 0 0 1 1 0 0 1 1 7 110 ? 0 1 0 1 0 1 0 1 7 / 135
  • 10. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 8 / 135
  • 11. Images/cinvestav Here is the Dilemma!!! Each of the f1, f2, ..., f8 It is a possible real f, the true f. Any of them is a possible good f Therefore The quality of the learning will be determined by how close our prediction is to the true value. 9 / 135
  • 12. Images/cinvestav Here is the Dilemma!!! Each of the f1, f2, ..., f8 It is a possible real f, the true f. Any of them is a possible good f Therefore The quality of the learning will be determined by how close our prediction is to the true value. 9 / 135
  • 13. Images/cinvestav Therefore, we have In order to select a g, we need to have an hypothesis H To be able to select such g by our training procedure. Further, any of the f1, f2, ..., f8 is a good choice for f Therefore, it does not matter how near we are to the bits in D Our problem, we want to generalize to the data outside D However, it does not make any difference if our Hypothesis is correct or incorrect in D 10 / 135
  • 14. Images/cinvestav Therefore, we have In order to select a g, we need to have an hypothesis H To be able to select such g by our training procedure. Further, any of the f1, f2, ..., f8 is a good choice for f Therefore, it does not matter how near we are to the bits in D Our problem, we want to generalize to the data outside D However, it does not make any difference if our Hypothesis is correct or incorrect in D 10 / 135
  • 15. Images/cinvestav Therefore, we have In order to select a g, we need to have an hypothesis H To be able to select such g by our training procedure. Further, any of the f1, f2, ..., f8 is a good choice for f Therefore, it does not matter how near we are to the bits in D Our problem, we want to generalize to the data outside D However, it does not make any difference if our Hypothesis is correct or incorrect in D 10 / 135
  • 16. Images/cinvestav We want to Generalize But, If we want to use only a deterministic approach to H Our Attempts to use H to learn g is a waste of time!!! 11 / 135
  • 17. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 12 / 135
  • 18. Images/cinvestav Consider a “bin” with red and green marbles Going back to our example BIN SAMPLES 13 / 135
  • 19. Images/cinvestav Therefore We have the “Real Probabilities” P [Pick a Red marble] = µ P [Pick a Blue marble] = 1 − µ However, the value of µ is not know Thus, we sample the space for N samples in an independent way. Here, the fraction of real marbles is equal to ν Question: Can ν can be used to know about µ? 14 / 135
  • 20. Images/cinvestav Therefore We have the “Real Probabilities” P [Pick a Red marble] = µ P [Pick a Blue marble] = 1 − µ However, the value of µ is not know Thus, we sample the space for N samples in an independent way. Here, the fraction of real marbles is equal to ν Question: Can ν can be used to know about µ? 14 / 135
  • 21. Images/cinvestav Therefore We have the “Real Probabilities” P [Pick a Red marble] = µ P [Pick a Blue marble] = 1 − µ However, the value of µ is not know Thus, we sample the space for N samples in an independent way. Here, the fraction of real marbles is equal to ν Question: Can ν can be used to know about µ? 14 / 135
  • 22. Images/cinvestav Two Answers... Possible vs. Probable No!!! Because we can see only the samples For example, Sample an be mostly blue while bin is mostly red. Yes!!! Sample frequency ν is likely close to bin frequency µ. 15 / 135
  • 23. Images/cinvestav Two Answers... Possible vs. Probable No!!! Because we can see only the samples For example, Sample an be mostly blue while bin is mostly red. Yes!!! Sample frequency ν is likely close to bin frequency µ. 15 / 135
  • 24. Images/cinvestav What does ν say about µ? We have the following hypothesis In a big sample (large N ), ν is probably close to µ (within ). How? Hoeffding’s Inequality . 16 / 135
  • 25. Images/cinvestav What does ν say about µ? We have the following hypothesis In a big sample (large N ), ν is probably close to µ (within ). How? Hoeffding’s Inequality . 16 / 135
  • 26. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 17 / 135
  • 27. Images/cinvestav We have the following theorem Theorem (Hoeffding’s inequality) Let Z1, ..., Zn be independent bounded random variables with Zi ∈ [a, b] for all i, where −∞ < a ≤ b < ∞. Then P 1 N N i=1 (Zi − E [Zi]) ≥ t ≤ exp − 2Nt2 (b−a)2 and P 1 N N i=1 (Zi − E [Zi]) ≤ −t ≤ exp − 2Nt2 (b−a)2 for all t ≥ 0. 18 / 135
  • 28. Images/cinvestav Therefore Assume that the Zi are the random variables from the N samples Then, we have that values for Zi ∈ {0, 1} therefore we have that... First inequality, for any > 0 and N P 1 N N i=1 Zi − µ ≥ ≤ exp−2N 2 Second inequality, for > 0 and N P 1 N N i=1 Zi − µ ≤ ≤ exp−2N 2 19 / 135
  • 29. Images/cinvestav Therefore Assume that the Zi are the random variables from the N samples Then, we have that values for Zi ∈ {0, 1} therefore we have that... First inequality, for any > 0 and N P 1 N N i=1 Zi − µ ≥ ≤ exp−2N 2 Second inequality, for > 0 and N P 1 N N i=1 Zi − µ ≤ ≤ exp−2N 2 19 / 135
  • 30. Images/cinvestav Therefore Assume that the Zi are the random variables from the N samples Then, we have that values for Zi ∈ {0, 1} therefore we have that... First inequality, for any > 0 and N P 1 N N i=1 Zi − µ ≥ ≤ exp−2N 2 Second inequality, for > 0 and N P 1 N N i=1 Zi − µ ≤ ≤ exp−2N 2 19 / 135
  • 31. Images/cinvestav Here We can use the fact that ν = 1 N N i=1 Zi Putting all together, we have P (ν − µ ≥ or ν − µ ≤ ) ≤ P (ν − µ ≥ ) + P (ν − µ ≤ ) Finally P (|ν − µ| ≥ ) ≤ 2 exp−2N 2 20 / 135
  • 32. Images/cinvestav Here We can use the fact that ν = 1 N N i=1 Zi Putting all together, we have P (ν − µ ≥ or ν − µ ≤ ) ≤ P (ν − µ ≥ ) + P (ν − µ ≤ ) Finally P (|ν − µ| ≥ ) ≤ 2 exp−2N 2 20 / 135
  • 33. Images/cinvestav Here We can use the fact that ν = 1 N N i=1 Zi Putting all together, we have P (ν − µ ≥ or ν − µ ≤ ) ≤ P (ν − µ ≥ ) + P (ν − µ ≤ ) Finally P (|ν − µ| ≥ ) ≤ 2 exp−2N 2 20 / 135
  • 34. Images/cinvestav Therefore We have the following If is small enough and as long as N is large 21 / 135
  • 35. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 36. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 37. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 38. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 39. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 40. Images/cinvestav Making Possible Possible to estimate ν ≈ µ How do we connect with Learning? Learning We want to find a function f : X −→ Y which is unknown!!! Here we assume that each ball in the bin is a sample x ∈ X. Thus, it is necessary to select an hypothesis Basically, we want to have an hypothesis h: h (x) = f (x) we color the sample blue. h (x) = f (x) we color the sample red. 22 / 135
  • 41. Images/cinvestav Here a Small Remark Here, we are not talking about classes When talking about blue and red balls, but if we are able to identify the correct label: yh = h (x) =f (x) = y or yh = h (x) =f (x) = y Still, the use of blue and red balls allows to see our Learning Problem as a Bernoulli distribution 23 / 135
  • 42. Images/cinvestav Here a Small Remark Here, we are not talking about classes When talking about blue and red balls, but if we are able to identify the correct label: yh = h (x) =f (x) = y or yh = h (x) =f (x) = y Still, the use of blue and red balls allows to see our Learning Problem as a Bernoulli distribution 23 / 135
  • 43. Images/cinvestav Swiss mathematician Jacob Bernoulli Definition The Bernoulli distribution is a discrete distribution having two possible outcomes X = 0 or X = 1. With the following probabilities P (X|p) = 1 − p if X = 0 p if X = 1 Also expressed as P (X = k|p) = (p)k (1 − p)1−k 24 / 135
  • 44. Images/cinvestav Swiss mathematician Jacob Bernoulli Definition The Bernoulli distribution is a discrete distribution having two possible outcomes X = 0 or X = 1. With the following probabilities P (X|p) = 1 − p if X = 0 p if X = 1 Also expressed as P (X = k|p) = (p)k (1 − p)1−k 24 / 135
  • 45. Images/cinvestav Swiss mathematician Jacob Bernoulli Definition The Bernoulli distribution is a discrete distribution having two possible outcomes X = 0 or X = 1. With the following probabilities P (X|p) = 1 − p if X = 0 p if X = 1 Also expressed as P (X = k|p) = (p)k (1 − p)1−k 24 / 135
  • 46. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 25 / 135
  • 47. Images/cinvestav Thus We define Ein (in-sample error) Ein (h) = 1 N N n=1 I (h (xn) = f (xn)) We have made explicit the dependency of Ein on the particular h that we are considering. Now Eout (out-of-sample error) Eout (h) = P (h (x) = f (x)) = µ Where The probability is based on the distribution P over X which is used to sample the data points x. 26 / 135
  • 48. Images/cinvestav Thus We define Ein (in-sample error) Ein (h) = 1 N N n=1 I (h (xn) = f (xn)) We have made explicit the dependency of Ein on the particular h that we are considering. Now Eout (out-of-sample error) Eout (h) = P (h (x) = f (x)) = µ Where The probability is based on the distribution P over X which is used to sample the data points x. 26 / 135
  • 49. Images/cinvestav Thus We define Ein (in-sample error) Ein (h) = 1 N N n=1 I (h (xn) = f (xn)) We have made explicit the dependency of Ein on the particular h that we are considering. Now Eout (out-of-sample error) Eout (h) = P (h (x) = f (x)) = µ Where The probability is based on the distribution P over X which is used to sample the data points x. 26 / 135
  • 51. Images/cinvestav Generalization Error Definition (Generalization Error/out-of-sample error) Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and an underlying distribution D, the generalization error or risk of h is defined by R(h) = P_{x∼D}(h(x) ≠ f(x)) = E_{x∼D}[I_{h(x)≠f(x)}] where I_ω is the indicator function of the event ω. (This comes from the fact that 1 · P(A) + 0 · P(Ā) = E[I_A].) 28 / 135
  • 52. Images/cinvestav Empirical Error Definition (Empirical Error/in-sample error) Given a hypothesis/proposed model h ∈ H, a target concept/real model f ∈ F, and a sample X = {x1, x2, ..., xN}, the empirical error or empirical risk of h is defined by: R̂(h) = (1/N) Σ_{i=1}^{N} I_{h(xi)≠f(xi)} 29 / 135
  • 54. Images/cinvestav Basically We have P(|Ein(h) − Eout(h)| ≥ ε) ≤ 2 e^{−2Nε²} Now, we need to consider an entire set of hypotheses, H H = {h1, h2, ..., hM} 31 / 135
  • 56. Images/cinvestav Therefore Each Hypothesis is a scenario in the bin space 32 / 135
  • 57. Images/cinvestav Remark The Hoeffding Inequality still applies to each bin individually Now, we need to consider all the bins simultaneously. Here, we have the following situation h is fixed before the data set is generated!!! If you are allowed to change h after you generate the data set The Hoeffding Inequality no longer holds 33 / 135
  • 61. Images/cinvestav Therefore With multiple hypotheses in H The Learning Algorithm chooses the final hypothesis g based on D, i.e., after generating the data. The statement we would like to make is not "P(|Ein(hm) − Eout(hm)| ≥ ε) is small" We would rather have "P(|Ein(g) − Eout(g)| ≥ ε) is small" for the final hypothesis g. 35 / 135
  • 64. Images/cinvestav Therefore Something Notable The hypothesis g is not fixed ahead of time, before generating the data Thus we need to bound P(|Ein(g) − Eout(g)| ≥ ε) in a way that does not depend on which g the algorithm picks. 36 / 135
  • 66. Images/cinvestav We have two rules First one: if A1 =⇒ A2, then P(A1) ≤ P(A2) Second, for any set of events A1, A2, ..., AM (the union bound) P(A1 ∪ A2 ∪ · · · ∪ AM) ≤ Σ_{m=1}^{M} P(Am) 37 / 135
  • 68. Images/cinvestav Therefore Since the final hypothesis g is one of h1, ..., hM (no independence assumption is needed, only the first rule above) |Ein(g) − Eout(g)| ≥ ε =⇒ |Ein(h1) − Eout(h1)| ≥ ε or |Ein(h2) − Eout(h2)| ≥ ε · · · or |Ein(hM) − Eout(hM)| ≥ ε 38 / 135
  • 69. Images/cinvestav Thus We have P(|Ein(g) − Eout(g)| ≥ ε) ≤ P[|Ein(h1) − Eout(h1)| ≥ ε or |Ein(h2) − Eout(h2)| ≥ ε · · · or |Ein(hM) − Eout(hM)| ≥ ε] 39 / 135
  • 70. Images/cinvestav Then We have, by the union bound, P(|Ein(g) − Eout(g)| ≥ ε) ≤ Σ_{m=1}^{M} P(|Ein(hm) − Eout(hm)| ≥ ε) Thus P(|Ein(g) − Eout(g)| ≥ ε) ≤ 2M e^{−2Nε²} 40 / 135
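The effect of choosing g after seeing the data can be simulated with a hypothetical coin-flipping experiment (illustrative only, not from the slides): every "hypothesis" is a fair coin with Eout = 0.5, Ein(hm) is the fraction of heads in N flips, and g is the hypothesis with the smallest in-sample error. The deviation probability for g is far larger than for a single fixed h, yet it stays under the 2M e^{−2Nε²} bound.

```python
import numpy as np

rng = np.random.default_rng(1)

M, N, eps = 50, 100, 0.1          # hypotheses, sample size, tolerance
trials = 2000
E_out = 0.5                       # every "hypothesis" is a fair coin

deviate_single, deviate_chosen = 0, 0
for _ in range(trials):
    flips = rng.integers(0, 2, size=(M, N))   # row m = pointwise errors of hypothesis m
    E_in = flips.mean(axis=1)
    # fixed hypothesis: look at h_1 only
    deviate_single += abs(E_in[0] - E_out) >= eps
    # learned hypothesis: pick g = argmin E_in *after* seeing the data
    g = np.argmin(E_in)
    deviate_chosen += abs(E_in[g] - E_out) >= eps

bound_single = 2 * np.exp(-2 * N * eps**2)
bound_union = min(1.0, 2 * M * np.exp(-2 * N * eps**2))

print(f"P[|E_in(h_1) - E_out| >= eps] ~ {deviate_single / trials:.3f}  (bound {bound_single:.3f})")
print(f"P[|E_in(g)   - E_out| >= eps] ~ {deviate_chosen / trials:.3f}  (bound {bound_union:.3f})")
```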
  • 73. Images/cinvestav We have Something Notable We have introduced two apparently conflicting arguments about the feasibility of learning. We have two possibilities One argument says that we cannot learn anything outside of D. The other says it is possible!!! Here, we introduce the probabilistic answer This will solve our conundrum!!! 42 / 135
  • 77. Images/cinvestav Then The Deterministic Answer Do we have something to say about f outside of D? The answer is NO. The Probabilistic Answer Is D telling us something likely about f outside of D? The answer is YES. The reason why We approach our learning from a probabilistic point of view!!! 43 / 135
  • 81. Images/cinvestav For example We could have hypotheses based on hyperplanes Linear regression output: h(x) = Σ_{i=1}^{d} w_i x_i = w^T x Therefore Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)² 45 / 135
  • 83. Images/cinvestav Clearly, we have used loss functions Mostly to give meaning to h ≈ f By error measures E(h, f) By using pointwise definitions e(h(x), f(x)) Examples Squared error: e(h(x), f(x)) = [h(x) − f(x)]² Binary error: e(h(x), f(x)) = I[h(x) ≠ f(x)] 46 / 135
  • 87. Images/cinvestav Therefore, we have The Overall Error E(h, f) = average of the pointwise errors e(h(x), f(x)) In-sample error Ein(h) = (1/N) Σ_{i=1}^{N} e(h(xi), f(xi)) Out-of-sample error Eout(h) = E_X[e(h(x), f(x))] 48 / 135
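A small sketch of these pointwise errors (the target, hypothesis and distribution below are invented for the example): Ein is the in-sample average of e(h(x), f(x)), and Eout is estimated as an expectation over fresh points.

```python
import numpy as np

rng = np.random.default_rng(2)

def squared_error(hx, fx):
    return (hx - fx) ** 2

def binary_error(hx, fx):
    return (hx != fx).astype(float)

# hypothetical 1-D setup: real model f, proposed model h
f = lambda x: 2.0 * x + 1.0
h = lambda x: 1.8 * x + 1.2

x = rng.uniform(-1.0, 1.0, size=500)              # the sample D
x_fresh = rng.uniform(-1.0, 1.0, size=200_000)    # fresh points, to estimate E_X[.]

E_in_sq = np.mean(squared_error(h(x), f(x)))
E_out_sq = np.mean(squared_error(h(x_fresh), f(x_fresh)))

E_in_bin = np.mean(binary_error(np.sign(h(x)), np.sign(f(x))))
E_out_bin = np.mean(binary_error(np.sign(h(x_fresh)), np.sign(f(x_fresh))))

print(f"squared error: E_in = {E_in_sq:.4f}, E_out = {E_out_sq:.4f}")
print(f"binary  error: E_in = {E_in_bin:.4f}, E_out = {E_out_bin:.4f}")
```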
  • 90. Images/cinvestav We have the following Process Assuming P(y|x) instead of y = f(x) Then a data point (x, y) is now generated by the joint distribution P(x, y) = P(x) P(y|x) Therefore A noisy target is a deterministic target plus added noise: y = f(x) + (y − f(x)), with f(x) = E[y|x] and noise term y − f(x) 49 / 135
  • 92. Images/cinvestav Finally, we have as Learning Process [diagram: unknown target distribution P(y|x), input probability distribution P(x), and the training examples they generate] 50 / 135
  • 93. Images/cinvestav Therefore Distinction between P(y|x) and P(x) Both convey probabilistic aspects of x and y. Therefore 1 The target distribution P(y|x) is what we are trying to learn. 2 The input distribution P(x) quantifies the relative importance of x. Finally Merging them as P(x, y) = P(y|x) P(x) mixes the two concepts 51 / 135
  • 96. Images/cinvestav Therefore Learning is feasible because It is likely that Eout(g) ≈ Ein(g) Therefore, we need g ≈ f Eout(g) = P(g(x) ≠ f(x)) ≈ 0 How do we achieve this? Eout(g) ≈ Ein(g) = (1/N) Σ_{n=1}^{N} I(g(xn) ≠ f(xn)) 52 / 135
  • 99. Images/cinvestav Then We make, at the same time, Ein(g) ≈ 0 To make the error of our selected hypothesis g small with respect to the real function f Learning splits into two questions 1 Can we make Eout(g) close enough to Ein(g)? 2 Can we make Ein(g) small enough? 53 / 135
  • 101. Images/cinvestav Therefore, we have a nice connection with the Bias-Variance trade-off [plot: out-of-sample error and in-sample error as functions of model complexity] 54 / 135
  • 103. Images/cinvestav We have that The out-of-sample error Eout(h) = P(h(x) ≠ f(x)) It measures how well our training on D has generalized to data that we have not seen before. Remark Eout is based on the performance over the entire input space X. 56 / 135
  • 106. Images/cinvestav Testing Data Set Intuitively, we want to estimate the value of Eout using a sample of data points. Something Notable These points must be 'fresh' test points that have not been used for training. Basically, our testing set. 57 / 135
  • 110. Images/cinvestav Thus It is possible to define The generalization error as the discrepancy between Ein and Eout Therefore The Hoeffding Inequality is a way to characterize the generalization error with a probabilistic bound P(|Ein(g) − Eout(g)| ≥ ε) ≤ 2M e^{−2Nε²} For any ε > 0. 59 / 135
  • 113. Images/cinvestav Reinterpreting This Assume a tolerance level δ, for example δ = 0.0005 It is possible to say that, with probability 1 − δ: Eout(g) < Ein(g) + √((1/(2N)) ln(2M/δ)) 61 / 135
  • 114. Images/cinvestav Proof We have the complement of the Hoeffding probability using the absolute value P(|Eout(g) − Ein(g)| < ε) ≥ 1 − 2M e^{−2Nε²} Therefore, we have P(−ε < Eout(g) − Ein(g) < ε) ≥ 1 − 2M e^{−2Nε²} This implies, with probability at least 1 − 2M e^{−2Nε²}, Eout(g) < Ein(g) + ε 62 / 135
  • 117. Images/cinvestav Therefore We simply set δ = 2M e^{−2Nε²} Then ln(2M/δ) = 2Nε² Therefore ε = √((1/(2N)) ln(2M/δ)) 63 / 135
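This algebra can be wrapped in a tiny helper (a sketch based on the finite-M bound above): given N, M and the tolerance δ, it returns the ε guaranteed with probability at least 1 − δ, and, conversely, the sample size needed for a target ε.

```python
import math

def generalization_eps(N, M, delta):
    """epsilon such that |E_out - E_in| < eps with prob. >= 1 - delta (finite-M bound)."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

def samples_needed(eps, M, delta):
    """smallest N for which the finite-M Hoeffding bound gives tolerance eps."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

print(generalization_eps(N=1000, M=100, delta=0.05))   # ~0.064
print(samples_needed(eps=0.05, M=100, delta=0.05))     # ~1659
```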
  • 120. Images/cinvestav Generalization Bound This inequality is known as a generalization bound Eout(g) < Ein(g) + √((1/(2N)) ln(2M/δ)) 64 / 135
  • 122. Images/cinvestav We have The following inequality also holds −ε < Eout(g) − Ein(g) ⇒ Eout(g) > Ein(g) − ε Thus Not only do we want our hypothesis g to do well on the out-of-sample points, Eout(g) < Ein(g) + ε But we also want to know how well we did with our H Thus, Eout(g) > Ein(g) − ε assures that it is not possible to do much better given any hypothesis with higher Ein(h) = (1/N) Σ_{n=1}^{N} I(h(xn) ≠ f(xn)) than g. 66 / 135
  • 125. Images/cinvestav Therefore But we want to know how well we did with our H Thus, Eout(g) > Ein(g) − ε assures that it is not possible to do much better!!! Given any hypothesis h with a higher in-sample error than g, Ein(h) = (1/N) Σ_{n=1}^{N} I(h(xn) ≠ f(xn)), its out-of-sample error is only bounded below by Eout(h) > Ein(h) − ε 67 / 135
  • 129. Images/cinvestav The Infiniteness of H A problem with the error bound, given its dependency on M √((1/(2N)) ln(2M/δ)) What happens when M becomes infinite? The number of hypotheses in H becomes infinite, and the bound becomes vacuous Problem: almost all interesting learning models have an infinite H... For example, in our linear regression h(x) = w^T x, there is one hypothesis per weight vector w 69 / 135
  • 132. Images/cinvestav Therefore, we need to replace M We need to find a finite substitute that takes finite values For this, we notice that the event of interest is |Ein(h1) − Eout(h1)| ≥ ε or |Ein(h2) − Eout(h2)| ≥ ε · · · or |Ein(hM) − Eout(hM)| ≥ ε 70 / 135
  • 133. Images/cinvestav We have The event |Ein(g) − Eout(g)| ≥ ε is guaranteed to imply this union of events Thus, we can take a look at the events Bm: the event that |Ein(hm) − Eout(hm)| ≥ ε Then P[B1 or B2 · · · or BM] ≤ Σ_{m=1}^{M} P[Bm] 71 / 135
  • 135. Images/cinvestav Now, we have the following This union bound is a gross overestimate Basically, if hi and hj are quite similar, the two events |Ein(hi) − Eout(hi)| ≥ ε and |Ein(hj) − Eout(hj)| ≥ ε are likely to coincide!!! 72 / 135
  • 137. Images/cinvestav We have Something Notable In a typical learning model, many hypotheses are indeed very similar. The mathematical theory of generalization hinges on this observation We only need to account for the overlaps between different hypotheses in order to substitute M. 73 / 135
  • 140. Images/cinvestav Consider A finite data set X = {x1, x2, ..., xN} And we consider a set of hypotheses h ∈ H such that h : X → {−1, +1} We get an N-tuple, when h is applied to X, (h(x1), h(x2), ..., h(xN)) of ±1's. Such an N-tuple is called a dichotomy Given that it splits x1, x2, ..., xN into two groups... 75 / 135
  • 143. Images/cinvestav Dichotomy Definition Given a hypothesis set H, a dichotomy of a set X is one of the possible ways of labeling the points of X using a hypothesis in H. 76 / 135
  • 144. Images/cinvestav Examples of Dichotomies [figure: a dichotomy generated by a perceptron, with points labeled Class +1 on one side of the separating line and Class −1 on the other] 77 / 135
  • 145. Images/cinvestav Something Important Each h ∈ H generates a dichotomy on x1, ..., xN However, two different h’s may generate the same dichotomy if they generate the same pattern 78 / 135
  • 146. Images/cinvestav Remark Definition Let x1, x2, ..., xN ∈ X. The dichotomies generated by H on these points are defined by H(x1, x2, ..., xN) = {(h(x1), h(x2), ..., h(xN)) | h ∈ H} Therefore We can see H(x1, x2, ..., xN) as a set of hypotheses restricted to the geometry of these points. Thus A large H(x1, x2, ..., xN) means H is more diverse. 79 / 135
  • 149. Images/cinvestav Growth function, Our Replacement of M Definition The growth function is defined for a hypothesis set H by mH(N) = max_{x1,...,xN ∈ X} #H(x1, x2, ..., xN) where # denotes the cardinality (number of elements) of a set. Therefore mH(N) is the maximum number of dichotomies that can be generated by H on any N points. We remove the dependency on the entire X 80 / 135
  • 151. Images/cinvestav Therefore We have that Like M, mH(N) is a measure of the number of hypotheses in H However, we avoid considering all of X Now we only consider N points instead of the entire X. 81 / 135
  • 153. Images/cinvestav Upper Bound for mH(N) First, we know that H(x1, x2, ..., xN) ⊆ {−1, +1}^N Hence, the value of mH(N) is at most #{−1, +1}^N: mH(N) ≤ 2^N 82 / 135
  • 156. Images/cinvestav Therefore If H is capable of generating all possible dichotomies on x1, x2, ..., xN Then, H(x1, x2, ..., xN) = {−1, +1}^N and #H(x1, x2, ..., xN) = 2^N We say that H can shatter x1, x2, ..., xN Meaning H is as diverse as can be on this particular sample. 84 / 135
  • 159. Images/cinvestav Shattering Definition A set X of N ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of X, that is, when mH(N) = 2^N 85 / 135
  • 161. Images/cinvestav Example: Positive Rays Imagine an input space on R, with H consisting of all hypotheses h : R → {−1, +1} of the form h(x) = sign(x − a) [figure: N points on the real line split by the threshold a] 87 / 135
  • 163. Images/cinvestav Thus, we have that As we change a, we get N + 1 different dichotomies mH(N) = N + 1 Now, we have the case of positive intervals H consists of all hypotheses in one dimension that return +1 within some interval and −1 otherwise. 88 / 135
  • 165. Images/cinvestav Therefore We have The line is again split by the points into N + 1 regions. Furthermore The dichotomy we get is decided by which two regions contain the end points of the interval Therefore, the number of such dichotomies is C(N + 1, 2) 89 / 135
  • 168. Images/cinvestav Additionally If the two end points fall in the same region, the dichotomy is the constant −1 Then mH(N) = C(N + 1, 2) + 1 = (1/2)N² + (1/2)N + 1 90 / 135
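Both counts are easy to check by brute force (a quick sketch, not from the slides): take N distinct points on the line, enumerate one hypothesis per relevant threshold, and count the distinct ±1 patterns they produce.

```python
import numpy as np
from itertools import combinations
from math import comb

def dichotomies_positive_rays(points):
    """Distinct patterns of h(x) = sign(x - a) on the given points."""
    pts = np.sort(np.asarray(points, dtype=float))
    # thresholds: one below all points, one between each consecutive pair, one above all
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    return {tuple(np.where(points > a, 1, -1)) for a in cuts}

def dichotomies_positive_intervals(points):
    """Distinct patterns of h(x) = +1 on (l, r) and -1 elsewhere."""
    pts = np.sort(np.asarray(points, dtype=float))
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    nonempty = {tuple(np.where((points > l) & (points < r), 1, -1))
                for l, r in combinations(cuts, 2)}
    return nonempty | {tuple([-1] * len(points))}   # add the all -1 dichotomy

N = 6
points = np.arange(1, N + 1, dtype=float)
print(len(dichotomies_positive_rays(points)), "==", N + 1)
print(len(dichotomies_positive_intervals(points)), "==", comb(N + 1, 2) + 1)
```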
  • 169. Images/cinvestav Finally In the case of convex sets in R² H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere. 91 / 135
  • 170. Images/cinvestav Therefore We have the following mH(N) = 2^N For example, place the N points on a circle: for any dichotomy, the convex hull of the +1 points is a convex set containing none of the −1 points, so every dichotomy is realized 92 / 135
  • 172. Images/cinvestav Remember We have that P(|Ein(g) − Eout(g)| ≥ ε) ≤ 2M e^{−2Nε²} What if mH(N) replaces M? If mH(N) is polynomial, we have an excellent case!!! Therefore, we need to prove that mH(N) is polynomial 94 / 135
  • 176. Images/cinvestav Break Point Definition If no data set of size k can be shattered by H, then k is said to be a break point for H: mH(k) < 2^k 96 / 135
  • 177. Images/cinvestav Example For the perceptron, we have the break point k = 4 [figure: three points that can be shattered, and four points with a labeling that cannot be realized] 97 / 135
  • 178. Images/cinvestav Important Something Notable In general, it is easier to find a break point for H than to compute the full growth function for that H. Using this concept We are ready to define the concept of Vapnik–Chervonenkis (VC) dimension. 98 / 135
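Shattering can also be checked by brute force for the 2-D perceptron (a sketch that assumes SciPy is available for the feasibility check): for every ±1 labeling of the points, a small linear program asks whether some w, b satisfies y_i(w·x_i + b) ≥ 1 for all i; the set is shattered iff every labeling is realizable. Three points in general position pass, while the four corners of a square fail, consistent with k = 4 being a break point.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: find (w, b) with y_i (w . x_i + b) >= 1 for all i."""
    N, d = X.shape
    # variables z = (w_1, ..., w_d, b); constraint -y_i (x_i . w + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    """True if the 2-D perceptron realizes every dichotomy of the rows of X."""
    return all(linearly_separable(X, np.array(y))
               for y in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # square corners

print("3 points shattered:", shattered(three))   # True
print("4 points shattered:", shattered(four))    # False (the XOR labeling fails)
```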
  • 181. Images/cinvestav VC-Dimension Definition The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H (those points need to be in "general position"): VCdim(H) = max{k | mH(k) = 2^k} A set containing k points, for arbitrary k, is in general linear position if and only if no (k − 1)-dimensional flat contains them all 100 / 135
  • 182. Images/cinvestav Important Remarks Remark 1 If VCdim(H) = d, there exists a set of size d that can be fully shattered. Remark 2 This does not imply that all sets of size d or less are fully shattered; sets that are not in general position may fail to be shattered. 101 / 135
  • 184. Images/cinvestav Why? General Linear Position [figure: for the perceptron, a set of points not in general position, e.g. collinear points, cannot be shattered] 102 / 135
  • 185. Images/cinvestav Now, we define B(N, k) Definition B(N, k) is the maximum number of dichotomies on N points such that no subset of size k of the N points can be shattered by these dichotomies. Something Notable The definition of B(N, k) assumes that k is a break point!!! 103 / 135
  • 187. Images/cinvestav Further Since B(N, k) is a maximum It is an upper bound for mH(N) whenever k is a break point: mH(N) ≤ B(N, k) if k is a break point for H. Then We need to find a bound for B(N, k) to prove that mH(N) is polynomial. 104 / 135
  • 189. Images/cinvestav Therefore Thus, we start with two boundary conditions, k = 1 and N = 1: B(N, 1) = 1 B(1, k) = 2 for k > 1 105 / 135
  • 190. Images/cinvestav Why? Something Notable B(N, 1) = 1 for all N: if no subset of size 1 can be shattered, then only one dichotomy can be allowed, because a second, different dichotomy would have to differ on at least one point, and then that subset of size 1 would be shattered. Second B(1, k) = 2 for k > 1, since there do not even exist subsets of size k: the constraint is vacuously true and we have the 2 possible dichotomies, +1 and −1. 106 / 135
  • 196. Images/cinvestav B(N, k) Dichotomies, N ≥ 2 and k ≥ 2 [table: the B(N, k) dichotomies listed as rows over the columns x1, x2, ..., xN−1, xN, partitioned into a block S1 with α rows, whose pattern on x1 · · · xN−1 appears exactly once, and a block S2 = S2+ ∪ S2− with β + β rows, whose pattern on x1 · · · xN−1 appears twice, once with xN = +1 (S2+) and once with xN = −1 (S2−)] 108 / 135
  • 197. Images/cinvestav What does this partition mean? First, consider the dichotomies restricted to x1 x2 · · · xN−1 Some patterns appear only ONCE (with either +1 or −1 at xN); we collect them in S1 The remaining patterns appear twice Once with +1 and once with −1 in the xN column. 109 / 135
  • 200. Images/cinvestav Therefore, we collect them in three sets The dichotomies whose pattern appears only once We put in the set S1 The others go in two different sets S2+, the ones with xN = +1 S2−, the ones with xN = −1 110 / 135
  • 202. Images/cinvestav Therefore We have the following B(N, k) = α + 2β The total number of different dichotomies on the first N − 1 points They are α + β. Additionally, no subset of k of these first N − 1 points can be shattered Since no k-subset of all N points can be shattered: α + β ≤ B(N − 1, k) By definition of B. 111 / 135
  • 205. Images/cinvestav Then Further, no subset of size k − 1 of the first N − 1 points can be shattered by the dichotomies in S2+ If there existed such a subset, then taking the corresponding dichotomies in S2− and adding xN You would finish with a subset of size k that can be shattered, a contradiction with the definition of B(N, k). Therefore β ≤ B(N − 1, k − 1) Then, we have B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1) 112 / 135
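The boundary conditions and this recursion give a small dynamic program (a sketch): filling the table with equality yields an upper bound on B(N, k), which can be compared against the binomial sum of Sauer's lemma on the next slide.

```python
from math import comb

def B_upper(N_max, k_max):
    """Upper bounds on B(N, k) from B(N, 1) = 1, B(1, k) = 2 (k > 1)
       and B(N, k) <= B(N-1, k) + B(N-1, k-1), taken with equality."""
    B = [[0] * (k_max + 1) for _ in range(N_max + 1)]
    for N in range(1, N_max + 1):
        B[N][1] = 1
    for k in range(2, k_max + 1):
        B[1][k] = 2
    for N in range(2, N_max + 1):
        for k in range(2, k_max + 1):
            B[N][k] = B[N - 1][k] + B[N - 1][k - 1]
    return B

N_max, k = 10, 3
B = B_upper(N_max, k)
for N in range(1, N_max + 1):
    sauer = sum(comb(N, i) for i in range(k))     # sum_{i=0}^{k-1} C(N, i)
    # the DP value coincides with the binomial sum, and both stay far below 2^N
    print(N, B[N][k], sauer, 2 ** N)
```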
  • 210. Images/cinvestav Connecting the Growth Function with the VCdim Sauer's Lemma For all N ≥ 1 and k ≥ 1, the following inequality holds: B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i) 114 / 135
  • 211. Images/cinvestav Proof For k = 1 and any N, B(N, 1) = 1 = C(N, 0), so the statement holds; similarly, it holds for N = 1 and all k. Then, by induction We assume that the statement is true for all N ≤ N0 and all k. 115 / 135
  • 213. Images/cinvestav Now We need to prove the statement for N = N0 + 1 and all k Observation: it is true for k = 1, given B(N, 1) = 1 Now, consider k ≥ 2: B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1) Therefore, by the induction hypothesis, B(N0 + 1, k) ≤ Σ_{i=0}^{k−1} C(N0, i) + Σ_{i=0}^{k−2} C(N0, i) 116 / 135
  • 216. Images/cinvestav Therefore We have the following B(N0 + 1, k) ≤ 1 + Σ_{i=1}^{k−1} C(N0, i) + Σ_{i=1}^{k−1} C(N0, i − 1) = 1 + Σ_{i=1}^{k−1} [C(N0, i) + C(N0, i − 1)] = 1 + Σ_{i=1}^{k−1} C(N0 + 1, i) = Σ_{i=0}^{k−1} C(N0 + 1, i) Because, by Pascal's identity, C(N0, i) + C(N0, i − 1) = C(N0 + 1, i) 117 / 135
  • 220. Images/cinvestav Now We have, in conclusion, for every break point k B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i) Therefore mH(N) ≤ B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i) 118 / 135
  • 222. Images/cinvestav Then Theorem If mH(k) < 2^k for some value k, then mH(N) ≤ Σ_{i=0}^{k−1} C(N, i) for all N 119 / 135
  • 223. Images/cinvestav Finally Corollary Let H be a hypothesis set with VCdim(H) = k. Then, for all N ≥ k, mH(N) ≤ (eN/k)^k = O(N^k) 120 / 135
  • 224. Images/cinvestav We have Proof (using N ≥ k) mH(N) ≤ Σ_{i=0}^{k} C(N, i) ≤ Σ_{i=0}^{k} C(N, i) (N/k)^{k−i} ≤ Σ_{i=0}^{N} C(N, i) (N/k)^{k−i} = (N/k)^k Σ_{i=0}^{N} C(N, i) (k/N)^i 121 / 135
  • 228. Images/cinvestav Therefore We have mH(N) ≤ (N/k)^k Σ_{i=0}^{N} C(N, i) (k/N)^i = (N/k)^k (1 + k/N)^N Given that 1 + x ≤ e^x, mH(N) ≤ (N/k)^k e^k = (eN/k)^k = O(N^k) 122 / 135
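The polynomial behavior is easy to see numerically (a quick sketch): for a fixed VC dimension k, the bound Σ_{i=0}^{k} C(N, i) and the cruder (eN/k)^k both grow like N^k, while 2^N explodes.

```python
from math import comb, e

def sauer_sum(N, k):
    """sum_{i=0}^{k} C(N, i): the growth-function bound when VCdim(H) = k."""
    return sum(comb(N, i) for i in range(k + 1))

k = 3
for N in (5, 10, 20, 50, 100):
    poly = (e * N / k) ** k
    print(f"N={N:4d}  sum={sauer_sum(N, k):10d}  (eN/k)^k={poly:14.1f}  2^N={2 ** N}")
```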
  • 232. Images/cinvestav Therefore We have that mH(N) is bounded by a polynomial of degree k − 1, i.e., if mH(k) < 2^k for some break point k, then mH(N) = O(N^{k−1}) is polynomial We no longer depend on the number of hypotheses!!!! 123 / 135
  • 236. Images/cinvestav Remark about mH(N) We have bounded the number of effective hypotheses Yes!!! we can have M hypotheses, but the number of dichotomies they generate on N points is bounded by mH(N) 125 / 135
  • 237. Images/cinvestav VC-Dimension Again Definition The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H (Those points need to be in “General Position”): V Cdim (H) = max k|mH (k) = 2k Something Notable If mH (N) = 2N for all N, V Cdim (H) = ∞ 126 / 135
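To make the definition concrete, here is a small self-contained example that is not in the slides: for the hypothesis class of “positive rays” $h_a(x) = \operatorname{sign}(x - a)$ on the real line, brute-force enumeration of dichotomies gives $m_H(N) = N + 1$, so $m_H(1) = 2 = 2^1$ while $m_H(2) = 3 < 2^2$, and therefore $VC_{dim}(H) = 1$.

```python
def ray_dichotomies(points):
    """Enumerate the distinct labelings of the given points produced by
    h_a(x) = +1 if x >= a else -1, as the threshold a sweeps the real line."""
    pts = sorted(points)
    # One representative threshold per gap, plus one below and one above all points.
    thresholds = ([pts[0] - 1.0]
                  + [(pts[i] + pts[i + 1]) / 2 for i in range(len(pts) - 1)]
                  + [pts[-1] + 1.0])
    return {tuple(+1 if x >= a else -1 for x in pts) for a in thresholds}

for N in range(1, 6):
    points = list(range(N))            # any N distinct points give the same count
    m = len(ray_dichotomies(points))
    print(f"N = {N}: m_H(N) = {m}, 2^N = {2 ** N}")
    assert m == N + 1
# m_H(1) = 2 = 2^1, but m_H(2) = 3 < 4, so the VC dimension of positive rays is 1.
```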
  • 239. Images/cinvestav Remember We have the following: $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$. Instead of using $M$, we use $m_H(N)$: we can use the growth function as the effective count in the bound, $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2\, m_H(N)}{\delta}}$. 127 / 135
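To see why replacing $M$ with $m_H(N)$ matters, the following sketch (illustrative only) compares the error bar $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ for a large finite $M$ against the growth-function version; the value $m_H(N) = N + 1$ from the positive-ray example above is used purely as a stand-in.

```python
from math import log, sqrt

def error_bar(count, N, delta):
    # sqrt( (1/(2N)) * ln(2 * count / delta) )
    return sqrt(log(2 * count / delta) / (2 * N))

N, delta = 1000, 0.05
M = 10 ** 6            # a large (finite) hypothesis count
m_H = N + 1            # growth function of positive rays, used as a stand-in

print("bound with M      :", round(error_bar(M, N, delta), 4))
print("bound with m_H(N) :", round(error_bar(m_H, N, delta), 4))
# The growth-function version stays informative even when M is huge (or infinite),
# because m_H(N) grows only polynomially in N once there is a break point.
```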
  • 241. Images/cinvestav VC Generalized Bound Theorem (VC Generalization Bound): For any tolerance $\delta > 0$ and any hypothesis set $H$ with $VC_{dim}(H) = k$, $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{2k}{N}\ln\frac{eN}{k}} + \sqrt{\frac{1}{2N}\ln\frac{1}{\delta}}$ with probability $\ge 1 - \delta$. Something Notable: this bound only fails (becomes vacuous) when $VC_{dim}(H) = \infty$!!! 128 / 135
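A minimal sketch of how the two terms of the theorem behave as $N$ grows, for a hypothetical VC dimension $k = 3$ and tolerance $\delta = 0.05$; it simply evaluates the right-hand side of the bound.

```python
from math import e, log, sqrt

def vc_bound(N, k, delta):
    # Generalization gap predicted by the VC bound on this slide.
    complexity = sqrt((2 * k / N) * log(e * N / k))
    confidence = sqrt(log(1 / delta) / (2 * N))
    return complexity + confidence

k, delta = 3, 0.05
for N in [100, 1_000, 10_000, 100_000]:
    print(f"N = {N:>6}: E_out - E_in <= {vc_bound(N, k, delta):.3f}")
# The gap shrinks roughly like sqrt(k * ln(N) / N), so it vanishes as N grows
# whenever VCdim(H) = k is finite.
```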
  • 243. Images/cinvestav Proof Although we will not go through it, we remark that it is possible to use the Rademacher complexity to handle the number of overlapping hypotheses (which can be infinite). We will stop here, but I encourage you to look further into the proof... 129 / 135
  • 245. Images/cinvestav About the Proof For more, take a look at “A Probabilistic Theory of Pattern Recognition” by Luc Devroye et al. and “Foundations of Machine Learning” by Mehryar Mohri et al. This is the equivalent of using Measure Theory to understand the innards of Probability. We are professionals, we must understand!!! 130 / 135
  • 247. Images/cinvestav Outline 1 Is Learning Feasible? Introduction The Dilemma A Binary Problem, Solving the Dilemma Hoeffding’s Inequality Error in the Sample and Error in the Phenomena Formal Definitions Back to the Hoeffding’s Inequality The Learning Process Feasibility of Learning Example Overall Error 2 Vapnik-Chervonenkis Dimension Theory of Generalization Generalization Error Reinterpretation Subtlety A Problem with M Dichotomies Shattering Example of Computing mH (N) What are we looking for? Break Point VC-Dimension Partition B(N, k) Connecting the Growth Function with the V Cdim VC Generalization Bound Theorem 3 Example Multi-Layer Perceptron 131 / 135
  • 248. Images/cinvestav As you remember from previous classes, we have architectures like the multi-layer perceptron shown on this slide. 132 / 135
  • 249. Images/cinvestav G-composition of H Let $G$ be a layered directed acyclic graph where directed edges go from one layer $l$ to the next layer $l + 1$. Now, we have a hypothesis set $H$: $N$ input nodes with in-degree 0, intermediate nodes with in-degree $r$, and a single output node with out-degree 0. $H$ is our hypothesis set over the Euclidean space $\mathbb{R}^r$: basically, each node represents a hypothesis $c_i : \mathbb{R}^r \to \{-1, 1\}$, by means of tanh. 133 / 135
  • 254. Images/cinvestav Therefore We have that the network represents a hypothesis from $\mathbb{R}^N$ to $\{-1, 1\}$; therefore, the entire hypothesis is a composition of the node concepts. This is called the G-composition of $H$, denoted $H_G$. 134 / 135
  • 256. Images/cinvestav We have the following theorem Theorem (Kearns and Vazirani, 1994): Let $G$ be a layered directed acyclic graph with $N$ input nodes and $s \ge 2$ internal nodes, each of in-degree $r$. Let $H$ be a hypothesis set over $\mathbb{R}^r$ with $VC_{dim}(H) = d$, and let $H_G$ be the G-composition of $H$. Then $VC_{dim}(H_G) \le 2ds\log_2(es)$. 135 / 135
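As a quick illustration of the theorem (an assumption-laden sketch, not from the slides): if each internal node were a linear threshold unit over $\mathbb{R}^r$, whose VC dimension is $d = r + 1$, the bound $2ds\log_2(es)$ for an architecture with $s$ internal nodes can be evaluated directly.

```python
from math import e, log2

def g_composition_vc_bound(d, s):
    # Kearns & Vazirani (1994): VCdim(H_G) <= 2 * d * s * log2(e * s), for s >= 2.
    assert s >= 2
    return 2 * d * s * log2(e * s)

# Hypothetical small network: internal nodes of in-degree r = 5, each node a
# linear threshold unit, whose VC dimension is d = r + 1 = 6 (an assumption
# chosen for illustration; the slides use tanh-style nodes).
r = 5
d = r + 1
for s in [2, 5, 10, 50]:
    print(f"s = {s:>2} internal nodes: VCdim(H_G) <= {g_composition_vc_bound(d, s):.1f}")
```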