# VC-dimension: a superfast survey

VC-dimension is a great tool for machine learning (see Vapnik's books) and optimization (e.g. my paper http://hal.inria.fr/inria-00452791)

Published in: Education

1. VC-dimension, very fast tutorial (O. Teytaud)
2. We have some set F of functions, and we want to find f in F such that Lf = E[something] is minimal. For example:
   ● F = the set of linear functions on R^d
   ● F = the set of neural networks with N neurons in two layers on R^d
   ● F = the set of polynomial functions of degree d on R^d
   We want to pick a good f in F, e.g. for some distribution on (x, y):
   ● f minimizing the expectation of (f(x) - y)^2
   ● f minimizing the negative log-likelihood of the data, -log P(f(x) = y)
   Noisy optimization boils down to fitting data:
   ● y = 1 with probability A + B ||x - x*||^2, and 0 otherwise
   ● or maybe E(y | x) = A + B ||x - x*||^2
   ● we want to find f(x) = A + B ||x - f*||^2 minimizing E[(f(x) - y)^2]
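As a concrete illustration of picking a good f in F by minimizing the empirical squared error, here is a minimal sketch (not from the slides) for the simplest family, F = affine functions on R, where the empirical-risk minimizer has a closed form:

```python
def fit_line(xs, ys):
    """Empirical risk minimization over F = {affine functions on R}:
    returns the f minimizing (1/n) * sum_i (f(x_i) - y_i)^2, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx          # slope of the least-squares line
    b = my - a * mx        # intercept
    return lambda x: a * x + b

f = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # data on y = 2x + 1
```

For richer families (neural networks, polynomials on R^d) the minimizer has no closed form and is found by numerical optimization, but the objective is the same empirical average.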
3. Loss function: we want E Lf to be small. In the general case, all we know is the empirical error:
   L_emp(f) = (1/n) * sum over i = 1..n of L(f(x_i), y_i)
4. With probability >= 1 - eta, for every f in F (with V the VC-dimension and n the sample size):
   E Lf <= L_emp(f) + sqrt( (V (log(2n/V) + 1) + log(4/eta)) / n )
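The deviation term can be evaluated numerically; this sketch uses one classical form of the bound (the constants vary across statements of the theorem):

```python
import math

def vc_bound(emp_err, n, V, eta):
    """Distribution-free VC bound: with probability >= 1 - eta, the true
    error of every f in F is at most the empirical error plus this
    deviation term (one classical form; constants vary across statements)."""
    dev = math.sqrt((V * (math.log(2.0 * n / V) + 1.0) + math.log(4.0 / eta)) / n)
    return emp_err + dev

vc_bound(0.10, 10_000, 10, 0.05)  # shrinks as n grows, grows with V
```

Note how the bound tightens as the sample size n grows and loosens as the VC-dimension V grows, which is exactly the trade-off exploited later by structural risk minimization.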
5. Remarks:
   ● this assumes the data are independently drawn
   ● the bounds are distribution-free
   ● they scale as 1/n if the empirical error is very small, and as 1/sqrt(n) otherwise
6. OK, but what is the VC-dimension? First define the shattering coefficient:
   S(n) = the maximum, over sets of n points, of the number of distinct binary labelings (f >= c vs. f < c, for f in F) of those points.
   The VC-dimension is the maximum n (possibly infinite) such that S(n) = 2^n. Explanation: it is >= 7 if:
   ● for at least one set S of 7 points,
   ● each of the 2^7 binary subsets of S satisfies f >= c for some f in F, with the complement satisfying f < c.
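The shattering coefficient can be checked by brute force for a toy family. This sketch (my own illustration, not from the slides) counts the labelings realized by one-sided thresholds on R, which cannot produce the labeling (1, 0) on an ordered pair of points, so their VC-dimension is 1:

```python
def shattering_count(points, classifiers):
    """Number of distinct binary labelings of `points` realized by the
    given (finite, for this toy check) family of classifiers."""
    return len({tuple(f(x) for x in points) for f in classifiers})

# Toy family: one-sided thresholds on R, f_c(x) = 1 if x >= c else 0.
pts = [0.0, 1.0]
thresholds = [lambda x, c=c: int(x >= c) for c in (-1.0, 0.5, 2.0)]
shattering_count(pts, thresholds)  # 3 labelings: (1,1), (0,1), (0,0) -- never (1,0)
```

Since S(2) = 3 < 2^2 = 4, no pair of points is shattered, while S(1) = 2 = 2^1: the VC-dimension of this family is 1.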
7. In many sufficiently smooth cases, the VC-dimension is the number of parameters, but not always... For example, polynomials of degree k over R^d: linear combinations of V functions have VC-dimension at most V, so the VC-dimension is at most the number of monomials of degree <= k.
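The monomial count giving that upper bound is a binomial coefficient; a small sketch (my own illustration):

```python
from math import comb

def monomial_count(d, k):
    """Polynomials of degree <= k on R^d are linear combinations of
    comb(d + k, k) monomials, so by the linear-combination fact their
    VC-dimension is at most this count."""
    return comb(d + k, k)

monomial_count(2, 2)  # 6 monomials: 1, x1, x2, x1^2, x1*x2, x2^2
```

For d = 1 this reduces to k + 1 coefficients, matching the "number of parameters" intuition of the slide.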
8. Remarks:
   ● if the distribution on X is known, there are better bounds
   ● VC bounds are nearly optimal, though within huge constants
   ● for distribution-dependent rates, there exist faster results
   ● for distribution-dependent rates, there are also more general results (even with infinite VC-dimension); see:
     ● Donsker classes
     ● Glivenko-Cantelli classes (convergence, but no rate)
     ● covering numbers
     ● the fat-shattering dimension
9. Structural risk minimization: we choose f minimizing the VC bound. In particular, we can have several families F and minimize the bound over all of them. Principle = penalization of families with high VC-dimension:
   ● we want a small empirical error
   ● we want a small VC-dimension (complexity)
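A minimal sketch of this selection rule, under the same classical form of the bound as earlier (the families and error figures here are made up for illustration):

```python
import math

def srm_choice(candidates, n, eta=0.05):
    """Structural risk minimization over several families: `candidates` is a
    list of (empirical_error, vc_dim) pairs, one per family; return the index
    of the family minimizing empirical error + VC complexity penalty."""
    def bound(emp, V):
        return emp + math.sqrt((V * (math.log(2.0 * n / V) + 1.0)
                                + math.log(4.0 / eta)) / n)
    return min(range(len(candidates)), key=lambda i: bound(*candidates[i]))

families = [(0.20, 2), (0.12, 10), (0.11, 200)]  # richer families fit better
srm_choice(families, n=1_000)    # returns 0: little data, the penalty dominates
srm_choice(families, n=100_000)  # returns 1: more data, a richer family wins
```

The example shows the intended behavior: with few samples the penalty steers the choice toward a simple family, and richer families are only selected once the sample size justifies their complexity.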
10. Overfitting = choosing a function which is empirically good but generalizes poorly; the VC-dimension is about avoiding overfitting, and structural risk minimization = minimizing the VC bound. However, in everyday life, people use cross-validation: choose the family of functions such that
    ● learning on half of the examples
    ● and testing on the other half
    performs well. Still, the VC-dimension remains convenient for proving useful theorems.
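The half/half procedure from the slide can be sketched in a few lines (the constant-predictor family and names here are my own illustration):

```python
import random

def half_split_error(xs, ys, fit, loss, seed=0):
    """Cross-validation as on the slide: learn on a random half of the
    examples, return the mean loss on the held-out half."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    f = fit([xs[i] for i in train], [ys[i] for i in train])
    return sum(loss(f(xs[i]), ys[i]) for i in test) / len(test)

# Toy family: constant predictors fitted by the training mean, squared loss.
fit_mean = lambda xs, ys: (lambda x, m=sum(ys) / len(ys): m)
sq_loss = lambda yhat, y: (yhat - y) ** 2
half_split_error(list(range(10)), [3.0] * 10, fit_mean, sq_loss)  # 0.0
```

Running this once per candidate family and keeping the family with the smallest held-out error is the everyday alternative to minimizing a VC bound.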
11. References:
    ● Vapnik's books: centered on Vapnik's work, but good books
    ● Devroye, Györfi, Lugosi: a very good book, mainly on the binary case
    ● Vidyasagar: uses covering numbers, a good book
    Feedback: I try to promote, in the team, the idea of talks for a wide audience. Do you find that interesting?