f = a function in F, e.g.:
● F = set of linear functions on R^d
● F = set of neural networks with N neurons in two layers on R^d
● F = set of polynomial functions of degree d on R^d
We want to pick a good f in F, e.g.,
for some distribution on (x, y):
● f minimizing the expectation E[(f(x)-y)^2]
● f minimizing the negative log-likelihood of the data, -log P(f(x)=y)
In short: we have some set F of functions, and we want to find f in F such that the risk L(f) = E[loss(f(x), y)] is minimal.
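For concreteness, a minimal sketch (mine, not the talk's) of this empirical risk minimization, assuming F = linear functions on R^d and the squared loss above; the synthetic data and the names w_true, w_hat are illustrative:

import numpy as np

# Synthetic regression data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# For linear F and squared loss, the empirical risk minimizer is the
# least-squares solution.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.mean((X @ w_hat - y) ** 2))  # empirical risk of the chosen f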
Noisy optimization boils down to fitting data:
● y = 1 with probability A + B||x - x*||^2, and y = 0 otherwise
● Or maybe just E(y|x) = A + B||x - x*||^2
● We want to find f(x) = A + B||x - x*||^2 (fitting A, B, x*) minimizing E[(f(x)-y)^2]; see the sketch below
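One possible way to do that fit, sketched with scipy (an assumption of mine, not the talk's code; the data generation, the initialization, and the helper name risk are illustrative):

import numpy as np
from scipy.optimize import minimize

# Synthetic data: y = 1 with probability A + B*||x - x*||^2 (clipped to [0, 1]).
rng = np.random.default_rng(1)
d, n = 2, 500
x_star_true = np.array([1.0, -0.5])
A_true, B_true = 0.2, 0.8
X = rng.uniform(-2, 2, size=(n, d))
p = np.clip(A_true + B_true * np.sum((X - x_star_true) ** 2, axis=1), 0, 1)
y = rng.binomial(1, p)

def risk(theta):
    # Empirical version of E[(A + B*||x - x_star||^2 - y)^2].
    A, B, x_star = theta[0], theta[1], theta[2:]
    f = A + B * np.sum((X - x_star) ** 2, axis=1)
    return np.mean((f - y) ** 2)

res = minimize(risk, np.zeros(2 + d))  # crude initialization
print(res.x)  # estimated A, B, x*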
We want the risk L(f) to be small.
All we know is a bound of the following form: with probability >= 1-eta,
L(f) <= L_emp(f) + sqrt( (V (log(2n/V) + 1) + log(4/eta)) / n )
with L_emp(f) the empirical error, V the VC-dimension, and n the number of examples (the classic Vapnik form; exact constants vary between references).
● Assumes that the data are i.i.d.
● Distribution-free bounds
● Scale as 1/n if the empirical error is very small, 1/sqrt(n) otherwise
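To get a feel for the numbers, one can evaluate the deviation term of the bound above (using the classic Vapnik constants; other references differ):

import math

def vc_deviation(n, V, eta):
    # sqrt((V*(log(2n/V) + 1) + log(4/eta)) / n)
    return math.sqrt((V * (math.log(2 * n / V) + 1) + math.log(4 / eta)) / n)

for n in (100, 1000, 10000):
    print(n, vc_deviation(n, V=10, eta=0.05))  # shrinks like 1/sqrt(n)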
OK, but what is the VC-dimension?
First define the shattering coefficient S(n) = the maximum, over all sets of n points, of the number of distinct labelings realized by the sets {x : f(x) >= c}, f in F.
The VC-dimension is the maximum n
(possibly infinite) such that S(n) = 2^n.
Explanation? The VC-dimension is >= 7 if:
● for at least one set S of 7 points,
● each of the 2^7 subsets T of S is realized by some f: f >= c on T
and f < c on the complement S \ T...
A toy brute-force computation of S(n) is sketched below.
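For concreteness, here is that toy computation, assuming the simple class of half-lines {x : x >= c} on R (my example, not the talk's):

import numpy as np

def shattering_coefficient(points):
    # Count all labelings of `points` realizable by the sets {x : x >= c}:
    # it suffices to try one threshold per gap between sorted points.
    points = np.sort(np.asarray(points, dtype=float))
    mids = (points[:-1] + points[1:]) / 2
    thresholds = np.concatenate(([points[0] - 1], mids, [points[-1] + 1]))
    return len({tuple((points >= c).astype(int)) for c in thresholds})

for n in range(1, 6):
    print(n, shattering_coefficient(np.arange(n)), 2 ** n)
# S(n) = n + 1 < 2^n for n >= 2, so the VC-dimension of half-lines is 1.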
In many sufficiently smooth cases, the VC-dimension is
the number of parameters, but not always...
Polynomials of degree k over R^d have VC-dimension at most (d+k choose k):
linear combinations of V fixed functions
have VC-dimension at most V, and there are (d+k choose k) monomials.
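A quick computation of that monomial count (numbers chosen for illustration):

from math import comb

def monomial_count(d, k):
    # Number of monomials of degree <= k in d variables: (d+k choose k).
    return comb(d + k, k)

print(monomial_count(2, 3))   # degree-3 polynomials on R^2: 10
print(monomial_count(10, 2))  # degree-2 polynomials on R^10: 66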
● If the distribution on X is known, there are better bounds
● VC bounds are nearly optimal, but only within huge constants
● For distribution-dependent rates, there exist faster results
● For distribution-dependent rates, there are also more general results
(even with infinite VC-dimension); see:
● Donsker classes
● Glivenko-Cantelli classes (convergence, no rate)
● Fat-shattering dimension
Structural Risk Minimization:
we choose f minimizing the bound above (empirical error + VC penalty).
In particular, we can have several families F, and minimize
this bound over the families.
Principle = penalization of families with high VC-dimension; a sketch follows below.
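A sketch of that principle, assuming nested families F_i = polynomials of degree i (my toy setup, with the Vapnik penalty from above; the constants are illustrative):

import math
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)  # synthetic data

def penalty(V, n, eta=0.05):
    # VC-style deviation term, as in the bound above.
    return math.sqrt((V * (math.log(2 * n / V) + 1) + math.log(4 / eta)) / n)

best = None
for degree in range(1, 10):
    coefs = np.polyfit(x, y, degree)
    emp_err = np.mean((np.polyval(coefs, x) - y) ** 2)
    V = degree + 1  # number of parameters, used as a proxy for the VC-dimension
    bound = emp_err + penalty(V, n)
    if best is None or bound < best[0]:
        best = (bound, degree)
print("selected degree:", best[1])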
We want a small empirical error.
We want a small VC-dimension.
Overfitting = choosing a function which is
empirically good, but generalizes poorly.
VC-dimension is about avoiding overfitting.
Structural risk minimization =
minimizing the VC bound.
However, in everyday life, people use
cross-validation: choose the family of
functions with the smallest error when
● learning on half the examples
● testing on the other half (sketched below)
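The holdout procedure just described, sketched on the same kind of toy setup (again my example, not the talk's):

import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)  # synthetic data

perm = rng.permutation(n)
train, test = perm[: n // 2], perm[n // 2 :]  # learn on half, test on the rest

best = None
for degree in range(1, 10):
    coefs = np.polyfit(x[train], y[train], degree)
    test_err = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    if best is None or test_err < best[0]:
        best = (test_err, degree)
print("selected degree:", best[1])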
But: the VC-dimension remains convenient for proving theorems.
Vapnik's books: centered on Vapnik's own work, but good books
Devroye-Gyorfi-Lugosi = very good book,
mainly on the binary case
Vidyasagar = uses covering numbers, good book
- I try to promote, in the team, the idea of talks for a wide audience
- do you find that interesting?