VC-dimension: a superfast survey


VC-dimension is a great tool for machine learning (see Vapnik's books) and optimization (e.g. my paper

Published in: Education

  1. VC-dimension, very fast tutorial. O. Teytaud
  2. f = a function in F. For example:
     ● F = set of linear functions on R^d
     ● F = set of neural networks with N neurons in two layers on R^d
     ● F = set of polynomial functions with degree d on R^d
     We want to pick a good f in F. E.g. for some distribution on (x, y):
     ● f minimizing the expectation of (f(x) - y)^2
     ● f minimizing the negative log-likelihood of the data, -log P(f(x) = y)
     We have some set F of functions, and we want to find f in F such that L_f = E[something] is minimal.
     Noisy optimization boils down to fitting data:
     ● y = 1 with probability A + B||x - x*||^2, 0 otherwise
     ● Or maybe E(y|x) = A + B||x - x*||^2
     ● We want to find f(x) = A + B||x - f*||^2 minimizing E(f(x) - y)^2
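To make the "noisy optimization boils down to fitting data" remark concrete, here is a minimal sketch, assuming a one-dimensional x and plain least squares. The helper names `solve_linear_system` and `fit_quadratic_bowl` are hypothetical. The trick is that A + B(x - x*)^2 expands to a polynomial that is linear in its three coefficients, so an ordinary least-squares fit recovers A, B and x*.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, copied so the inputs are not modified.
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_quadratic_bowl(xs, ys):
    """Least-squares fit of f(x) = A + B * (x - x_star)**2 in one dimension."""
    # f(x) = c2*x^2 + c1*x + c0 with c2 = B, c1 = -2*B*x_star, c0 = A + B*x_star^2.
    features = [[x * x, x, 1.0] for x in xs]
    # Normal equations (F^T F) c = F^T y.
    gram = [[sum(f[i] * f[j] for f in features) for j in range(3)] for i in range(3)]
    moment = [sum(f[i] * y for f, y in zip(features, ys)) for i in range(3)]
    c2, c1, c0 = solve_linear_system(gram, moment)
    x_star = -c1 / (2.0 * c2)
    return c0 - c2 * x_star ** 2, c2, x_star


# Noise-free toy data with A = 1, B = 2, x* = 0.3 (illustrative values only).
xs = [i / 10 for i in range(-10, 11)]
ys = [1.0 + 2.0 * (x - 0.3) ** 2 for x in xs]
a_hat, b_hat, x_star_hat = fit_quadratic_bowl(xs, ys)
print(a_hat, b_hat, x_star_hat)
```

With noisy observations y the same fit minimizes the empirical squared error, and its third output is an estimate of the optimum x*.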
  3. Loss function: we want E L_f small. All we know is the empirical error (general case): [formula not captured in transcript]
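As a rough illustration of the empirical error, assuming it is the average loss over the observed examples (the usual definition; the slide's own formula is missing from the transcript), a sketch with hypothetical helper names could look like this:

```python
def empirical_error(f, samples, loss):
    """Average loss of f over a finite sample of (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)


def squared_loss(prediction, target):
    """Squared error between a prediction and the observed target."""
    return (prediction - target) ** 2


def identity(x):
    """A deliberately simple candidate f."""
    return x


# Three made-up observations; only the last one is mispredicted by identity.
samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
print(empirical_error(identity, samples, squared_loss))
```

The expectation E L_f is the same quantity with the sample average replaced by an average over the unknown distribution, which is why only the empirical version can be computed.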
  4. With probability >= 1 - eta: [bound formula not captured in transcript], with V the VC-dimension
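The slide's formula is not reproduced in this transcript. For orientation only, one commonly quoted form of the VC bound for binary classification with 0-1 loss is given below; the slide's version may differ in its constants. Here n is the number of examples and \hat{L}_f is the empirical error, both symbols being assumptions of this sketch; V and eta are as on the slide.

```latex
\[
  L_f \;\le\; \hat{L}_f
  + \sqrt{\frac{V\left(\ln\frac{2n}{V} + 1\right) + \ln\frac{4}{\eta}}{n}}
  \qquad \text{for all } f \in F \text{ simultaneously, with probability at least } 1 - \eta .
\]
```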
  5. Remarks:
     ● Assumes that the data are independently drawn
     ● Distribution-free bounds
     ● Scales as 1/n if the empirical error is very small, 1/sqrt(n) otherwise
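The 1/sqrt(n) regime can be checked numerically. This is a minimal sketch, assuming the classical confidence term sqrt((V(ln(2n/V) + 1) + ln(4/eta)) / n), which is one textbook form and not necessarily the one on the slide. The function name `vc_confidence_term` is hypothetical, and the faster 1/n regime for very small empirical error is not covered here.

```python
import math


def vc_confidence_term(n, vc_dim, eta):
    """Gap between true and empirical error in one classical VC bound (assumed form)."""
    if n <= 0 or vc_dim <= 0 or not 0 < eta < 1:
        raise ValueError("need n > 0, vc_dim > 0 and 0 < eta < 1")
    capacity = vc_dim * (math.log(2 * n / vc_dim) + 1)
    return math.sqrt((capacity + math.log(4 / eta)) / n)


# Quadrupling n roughly halves the term, up to the logarithmic factor.
for n in (1_000, 4_000, 16_000, 64_000):
    print(n, round(vc_confidence_term(n, vc_dim=10, eta=0.05), 4))
```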
  6. OK, but what is the VC-dimension? First define the shattering coefficient S(n) [formula not captured in transcript]. The VC-dimension is the maximum n (possibly infinite) such that S(n) = 2^n. Explanation? It is >= 7 if:
     ● for at least one set S of 7 points,
     ● each of the 2^7 subsets of S can be picked out by some f: f >= c on the subset and f < c on its complement.
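Shattering can be checked by brute force on toy classes. The sketch below assumes the standard definition of S(n), the largest number of distinct 0/1 labelings the class induces on any n points, since the slide's own formula is missing. The domain, the cut points and the two classes (thresholds and intervals on the real line) are hypothetical examples, restricted to a finite grid so that enumeration is possible.

```python
from itertools import combinations


def labelings(classifiers, points):
    """Set of distinct 0/1 labelings of `points` produced by the classifiers."""
    return {tuple(int(h(x)) for x in points) for h in classifiers}


def is_shattered(classifiers, points):
    """True if every one of the 2^n labelings of the n points is realized."""
    return len(labelings(classifiers, points)) == 2 ** len(points)


def shattering_coefficient(classifiers, domain, n):
    """Max number of labelings over all n-point subsets of a finite domain."""
    return max(len(labelings(classifiers, pts)) for pts in combinations(domain, n))


# Toy classes: x -> [x >= c] and x -> [a <= x <= b], with cuts between grid points.
domain = [0.0, 1.0, 2.0, 3.0]
cuts = [-0.5, 0.5, 1.5, 2.5, 3.5]
thresholds = [lambda x, c=c: x >= c for c in cuts]
intervals = [lambda x, a=a, b=b: a <= x <= b for a in cuts for b in cuts if a < b]

for n in (1, 2, 3):
    print(n, shattering_coefficient(thresholds, domain, n),
          shattering_coefficient(intervals, domain, n))
```

On this grid, thresholds reach 2^n labelings only for n = 1 and intervals only up to n = 2, consistent with VC-dimensions 1 and 2 for these two classes.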
  7. In many sufficiently smooth cases, the VC-dimension is the number of parameters, but not always... Polynomial of degree k over R^d: [formula not captured in transcript]. Linear combinations of V functions have VC-dimension at most V.
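The polynomial case follows from the linear-combination statement: a polynomial of degree at most k on R^d is a linear combination of monomials, and there are C(d + k, k) of them, so that count is an upper bound on the VC-dimension. The slide's own formula is missing and may be stated differently; the function name below is hypothetical.

```python
import math


def monomial_count(dimension, degree):
    """Number of monomials of total degree <= `degree` in `dimension` variables."""
    return math.comb(dimension + degree, degree)


# d = 2, k = 2: the monomials are 1, x, y, x^2, x*y, y^2.
print(monomial_count(2, 2))
```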
  8. Remarks:
     ● If the distribution on X is known, there are better bounds
     ● VC bounds are nearly optimal, up to huge constants
     ● For distribution-dependent rates, there exist faster results
     ● For distribution-dependent rates, there are more general results (even with infinite VC-dimension); see:
       ● Donsker classes
       ● Glivenko-Cantelli classes (convergence, no rate)
       ● Covering numbers
       ● Fat-shattering dimension
  9. Structural risk minimization: we choose f minimizing this bound: [formula not captured in transcript]. In particular, we can have several families F, and minimize this bound over all of them. Principle = penalization of families with high VC-dimension.
     ● We want a small empirical error
     ● We want a small VC-dimension (complexity)
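Selecting among several families by their bound can be sketched as follows. The bound used is an assumed classical form (empirical error plus sqrt((V(ln(2n/V) + 1) + ln(4/eta)) / n)), not necessarily the slide's; the family names and error values are made up, and the same eta is reused for every family, which ignores the union-bound correction a careful treatment would apply.

```python
import math


def vc_bound(empirical_error, n, vc_dim, eta):
    """Empirical error plus an assumed classical VC confidence term."""
    capacity = vc_dim * (math.log(2 * n / vc_dim) + 1)
    return empirical_error + math.sqrt((capacity + math.log(4 / eta)) / n)


def select_family(families, n, eta=0.05):
    """Return the (name, vc_dim, empirical_error) triple with the smallest bound."""
    return min(families, key=lambda fam: vc_bound(fam[2], n, fam[1], eta))


# Hypothetical numbers: richer families fit the sample better but pay a larger penalty.
families = [
    ("linear", 3, 0.20),
    ("quadratic", 6, 0.08),
    ("degree-10", 66, 0.05),
]
best = select_family(families, n=1000)
print(best[0])
```

With these numbers the middle family wins: the degree-10 family has the smallest empirical error, but its penalty term dominates.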
  10. Overfitting = choosing a function which is empirically good, but generalizes poorly. VC-dimension is about avoiding overfitting. Structural risk minimization = minimizing the VC bound. However, in everyday life, people use cross-validation: choose the family of functions such that
     ● learning on half of the examples and
     ● testing on the other half
     performs well. But: VC-dimension is convenient for proving useful theorems.
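The "learn on half, test on the other half" recipe can be sketched with a toy family of piecewise-constant regressors on [0, 1), indexed by their number of cells. The family, the data and the function names are hypothetical and chosen only to keep the example self-contained and deterministic.

```python
def fit_histogram(train, bins):
    """Piecewise-constant regressor on [0, 1): mean of y in each of `bins` cells."""
    sums, counts = [0.0] * bins, [0] * bins
    for x, y in train:
        cell = min(int(x * bins), bins - 1)
        sums[cell] += y
        counts[cell] += 1
    overall = sum(y for _, y in train) / len(train)
    # Empty cells fall back to the overall training mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * bins), bins - 1)]


def holdout_error(data, bins):
    """Learn on one half of the examples, return squared error on the other half."""
    train, test = data[0::2], data[1::2]
    f = fit_histogram(train, bins)
    return sum((f(x) - y) ** 2 for x, y in test) / len(test)


# Toy noise-free data: a step at x = 0.5.
data = [(i / 40, 1.0 if i >= 20 else 0.0) for i in range(40)]
candidates = [1, 2, 4, 8]
errors = {bins: holdout_error(data, bins) for bins in candidates}
best_bins = min(candidates, key=errors.get)
print(errors, best_bins)
```

Ties are broken in favor of the first candidate in the list, so ordering the candidates by increasing complexity picks the simplest family among the equally good ones.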
  11. References:
     ● Vapnik's books: centered on Vapnik's work, but good books
     ● Devroye-Gyorfi-Lugosi = very good book, mainly on the binary case
     ● Vidyasagar = uses covering numbers, good book
     Feedback: I try to promote, within the team, the idea of talks for a wide audience. Do you find that interesting?