1. Regularization
Yow-Bang (Darren) Wang
8/1/2013
2. Outline
● VC dimension & VC bound – Frequentist viewpoint
● L1 regularization – An intuitive interpretation
● Model parameter prior – Bayesian viewpoint
● Early stopping – Also a regularization
● Conclusion
3. VC dimension & VC bound – Frequentist viewpoint
4. Regularization
● (My) definition: techniques to prevent overfitting
● Frequentists' viewpoint:
○ Regularization = suppress model complexity
○ "Usually" done by inserting a term representing model complexity into the objective function:
E(w) = E_train(w) + λ · Ω(w)
where E_train(w) is the training error, Ω(w) represents the model complexity, and λ is the trade-off weight.
5. VC dimension & VC bound
● Why suppress model complexity?
○ A theoretical bound on the testing error, called the Vapnik–Chervonenkis (VC) bound, states the following:
● To reduce the testing error, we prefer:
○ Low training error (E_train ↓)
○ Big data (N ↑)
○ Low model complexity (d_VC ↓)
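The bound itself appears only as an image in the original slides; a common statement of the VC bound, following the cited Learning From Data (Abu-Mostafa, Magdon-Ismail, Lin), is: with probability at least 1 − δ,

\[
E_{\text{test}}(h) \;\le\; E_{\text{train}}(h) \;+\; \sqrt{\frac{8}{N}\,\ln\frac{4\big((2N)^{d_{VC}} + 1\big)}{\delta}}
\]

Each preference listed on the slide (E_train ↓, N ↑, d_VC ↓) tightens this bound.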
6. VC dimension & VC bound
● d_VC: the VC dimension
○ We say a hypothesis set H has d_VC = N iff N is the largest number of instances for which there exists some set of N instances that H can classify into every possible combination of binary class labels (i.e. H shatters them).
● Example: H = {straight lines in 2D space}
(figures: example point sets, labeled 1/0, separated by straight lines)
7. VC dimension & VC bound
● d_VC: the VC dimension
○ We say a hypothesis set H has d_VC = N iff N is the largest number of instances for which there exists some set of N instances that H can classify into every possible combination of binary class labels (i.e. H shatters them).
● Example: H = {straight lines in 2D space}
○ N=2: {0,0}, {0,1}, {1,0}, {1,1}
8. VC dimension & VC bound
● d_VC: the VC dimension
○ We say a hypothesis set H has d_VC = N iff N is the largest number of instances for which there exists some set of N instances that H can classify into every possible combination of binary class labels (i.e. H shatters them).
● Example: H = {straight lines in 2D space}
○ N=2: {0,0}, {0,1}, {1,0}, {1,1}
○ N=3: {0,0,0}, {0,0,1}, ……, {1,1,1}
9. VC dimension & VC bound
● d_VC: the VC dimension
○ We say a hypothesis set H has d_VC = N iff N is the largest number of instances for which there exists some set of N instances that H can classify into every possible combination of binary class labels (i.e. H shatters them).
● Example: H = {straight lines in 2D space}
○ N=2: {0,0}, {0,1}, {1,0}, {1,1}
○ N=3: {0,0,0}, {0,0,1}, ……, {1,1,1}
○ N=4: fails, e.g. for four points in an XOR-like configuration (no straight line realizes that labeling), so d_VC = 3 for straight lines in 2D.
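The shattering example can be checked numerically. The sketch below (my own illustration, not from the slides; the point sets and function names are chosen for the example) tests whether every binary labeling of a point set is realizable by a straight line in 2D, using a linear-programming feasibility check:

```python
# Sketch: check whether straight lines in 2D shatter a given point set.
# A labeling y is linearly separable iff some (w, b) satisfies y_i (w.x_i + b) >= 1.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Feasibility LP over (w1, w2, b): y_i (w.x_i + b) >= 1 for all i."""
    A_ub, b_ub = [], []
    for (x1, x2), y in zip(points, labels):
        # y*(w1*x1 + w2*x2 + b) >= 1  <=>  -y*x1*w1 - y*x2*w2 - y*b <= -1
        A_ub.append([-y * x1, -y * x2, -y])
        b_ub.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True iff every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]           # non-collinear points
four  = [(0, 0), (1, 1), (1, 0), (0, 1)]   # XOR-like configuration
print(shattered(three))   # True  -> d_VC >= 3
print(shattered(four))    # False -> this 4-point set is not shattered
```

For the three non-collinear points every one of the 2^3 labelings is separable, while the XOR-like four-point set fails; showing d_VC = 3 exactly would additionally require that no set of four points can be shattered.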
10. Regularization – Frequentist viewpoint
● In general, more model parameters
↔ higher VC dimension
↔ higher model complexity
↔ a higher VC bound on the testing error
11. Regularization – Frequentist viewpoint
● ……Therefore, reduce model complexity
↔ reduce VC dimension
↔ reduce the number of free parameters
↔ reduce the L-0 norm of the parameters
↔ sparsity of parameters!
12. Regularization – Frequentist viewpoint
● The L-p norm of a K-dimensional vector x:
1. L-2 norm: ||x||_2 = (Σ_{k=1..K} x_k²)^{1/2}
2. L-1 norm: ||x||_1 = Σ_{k=1..K} |x_k|
3. L-0 norm: defined as the number of nonzero components of x
4. L-∞ norm: ||x||_∞ = max_k |x_k|
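As a quick numerical illustration (not part of the original slides), the four norms for a sample vector:

```python
import numpy as np

x = np.array([0.0, -3.0, 4.0, 0.0])

l2   = np.sqrt(np.sum(x**2))        # L-2 norm: 5.0
l1   = np.sum(np.abs(x))            # L-1 norm: 7.0
l0   = np.count_nonzero(x)          # L-0 "norm": 2 (number of nonzero entries)
linf = np.max(np.abs(x))            # L-inf norm: 4.0

print(l2, l1, l0, linf)
```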
13. Regularization – Frequentist viewpoint
● However, since the L-0 norm is hard to incorporate into the objective function (∵ it is not continuous), we turn to other, more tractable L-p norms.
● E.g. linear SVM:
min_{w,b} C · Σ_i max(0, 1 − y_i(wᵀx_i + b)) + (1/2)||w||²_2
where the first term is the hinge loss, C is the trade-off weight, and the second term is the L-2 regularization (a.k.a. large margin).
● Linear SVM = hinge loss + L-2 regularization!
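A minimal numerical sketch of that objective (my own illustration; the function and variable names are assumptions, not from the slides):

```python
import numpy as np

def linear_svm_objective(w, b, X, y, C=1.0):
    """Hinge loss + L-2 regularization for a linear classifier.

    X: (n, d) data matrix, y: labels in {-1, +1}, w: (d,) weights, b: bias.
    """
    margins = y * (X @ w + b)                     # y_i (w.x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()  # hinge-loss term
    reg = 0.5 * np.dot(w, w)                      # L-2 regularization term
    return C * hinge + reg
```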
14. L1 regularization – An intuitive interpretation
15. L1 Regularization – An Intuitive Interpretation
● Now we know we prefer sparse parameters
○ ↔ small L-0 norm
● ……but why do people say that minimizing the L-1 norm introduces sparsity?
● "For most large underdetermined systems of linear equations, the minimal L1-norm solution is also the sparsest solution"
○ Donoho, David L., Communications on Pure and Applied Mathematics, 2006.
16. L1 Regularization – An Intuitive Interpretation
● An intuitive interpretation: the L-p norm encodes our preference over parameter values
○ L-2 norm: equal-preference contours in parameter space are circles (||w||_2 = const)
○ L-1 norm: equal-preference contours are diamonds (||w||_1 = const)
(figure: equal-preference lines in the parameter space)
17. L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, it is more likely that the minimal training error occurs at the tip points (corners) of the parameter-preference contours
○ Assume the equal-training-error lines are concentric circles……
(figure: equal-training-error lines and the optimal solution on an L-1 contour)
18. L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, it is more likely that the minimal training error occurs at the tip points (corners) of the parameter-preference contours
○ Assume the equal-training-error lines are concentric circles……
19. L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, it is more likely that the minimal training error occurs at the tip points (corners) of the parameter-preference contours
○ Assume the equal-training-error lines are concentric circles. Then the minimum occurs at a tip point iff the center of those circles lies in the shaded areas shown in the figure, which cover a relatively large region, so this is quite probable!
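A small numerical sketch of the resulting sparsity (my own illustration, not from the slides): for a 1-D quadratic loss (w − a)²/2 plus a penalty, the L-1 penalty has the closed-form "soft-thresholding" minimizer, which is exactly zero for small a, while the L-2 penalty only shrinks a toward zero.

```python
import numpy as np

def l1_minimizer(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*|w|  ->  soft-thresholding."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_minimizer(a, lam):
    """argmin_w 0.5*(w - a)**2 + 0.5*lam*w**2  ->  plain shrinkage."""
    return a / (1.0 + lam)

a = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])    # unregularized solutions
print(l1_minimizer(a, lam=0.5))  # [-1.5, 0., 0., 0.3, 2.5]          -> exact zeros
print(l2_minimizer(a, lam=0.5))  # [-1.33, -0.2, 0.067, 0.53, 2.0]   -> shrunk, no zeros
```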
20. Model parameter prior – Bayesian viewpoint
21. Regularization – Bayesian viewpoint
● Bayesian: model parameters are probabilistic.
● Frequentist: model parameters are deterministic.
○ Frequentist picture: a fixed yet unknown universe → sampling → random observations → estimate the parameters.
○ Bayesian picture: the observations are given (fixed); the universe is unknown → estimate the parameters, assuming the universe is a certain type of model.
22. Regularization – Bayesian viewpoint
● To conclude:

              Data        Model parameters
Bayesian      Fixed       Variable
Frequentist   Variable    Fixed yet unknown
23. Regularization – Bayesian viewpoint
● E.g. L-2 regularization
● Assume the parameters w are from a Gaussian distribution with zero mean and identity covariance:
(figure: equal-probability lines in the parameter probability space)
24. Regularization – Bayesian viewpoint
● E.g. L-2 regularization
● Assume the parameters w are from a Gaussian distribution with zero mean and identity covariance:
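The derivation this slide illustrates is reconstructed here in a standard form (not copied from the slide images): with a zero-mean, identity-covariance Gaussian prior, the MAP estimate adds an L-2 penalty to the training objective.

\[
p(\mathbf{w}) \propto \exp\!\Big(-\tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2\Big)
\]
\[
\mathbf{w}_{\text{MAP}}
= \arg\max_{\mathbf{w}} \; p(D \mid \mathbf{w})\, p(\mathbf{w})
= \arg\min_{\mathbf{w}} \; \Big[-\log p(D \mid \mathbf{w}) + \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2\Big]
\]

i.e. the negative log-likelihood (training error) plus an L-2 regularization term.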
25. Early stopping – Also a regularization
26. Early Stopping
● Early stopping – stop training before the training objective reaches its optimum
● Often used in MLP training
● An intuitive interpretation (see the sketch after this slide):
○ Training iterations ↑
○ → number of weight updates ↑
○ → number of active (far-from-0) weights ↑
○ → complexity ↑
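A minimal sketch of validation-based early stopping (my own illustration; the deck gives no code, and the helper signatures grad(w, data) and loss(w, data) are assumptions) for a generic gradient-descent loop:

```python
import numpy as np

def train_with_early_stopping(grad, loss, w0, train, val,
                              lr=0.01, patience=10, max_iters=10_000):
    """Gradient descent that stops when the validation loss stops improving.

    grad(w, data) -> gradient of the training loss; loss(w, data) -> scalar loss.
    """
    w = np.asarray(w0, dtype=float)
    best_w, best_val, bad_steps = w.copy(), np.inf, 0
    for _ in range(max_iters):
        w = w - lr * grad(w, train)          # one gradient-descent update
        v = loss(w, val)                     # monitor held-out loss
        if v < best_val:
            best_val, best_w, bad_steps = v, w.copy(), 0
        else:
            bad_steps += 1
            if bad_steps >= patience:        # stop: no recent improvement
                break
    return best_w
```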
27. Early Stopping
● Theoretical proof:
○ Consider a perceptron with hinge loss (as on slide 13)
○ Assume the optimal separating hyperplane is w*, with maximal margin ρ*
○ Denote the weight at the t-th iteration as w_t, with margin ρ(w_t)
28.–30. Early Stopping
● Proof steps 1 and 2: inequalities (shown as figures, not recoverable in this transcript) involving the learning rate, the number of updates, and R, the radius of the data distribution. A reconstructed version of the argument follows below.
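The slide images are lost, but a standard version of the argument (my reconstruction in the spirit of Collobert & Bengio, 2004, under the assumptions w_0 = 0, ||x_i|| ≤ R, learning rate η, and training stopped once every example has zero hinge loss) runs as follows:

\[
\textbf{1. } \ \mathbf{w}_t = \mathbf{w}_{t-1} + \eta\, y_i \mathbf{x}_i
\ \Rightarrow\ \lVert \mathbf{w}_t \rVert \le \lVert \mathbf{w}_{t-1} \rVert + \eta R
\ \Rightarrow\ \lVert \mathbf{w}_t \rVert \le t\,\eta R
\]
\[
\textbf{2. } \ \text{at stopping, } y_i\, \mathbf{w}_t^{\top} \mathbf{x}_i \ge 1 \ \forall i
\ \Rightarrow\ \rho(\mathbf{w}_t) = \min_i \frac{y_i\, \mathbf{w}_t^{\top} \mathbf{x}_i}{\lVert \mathbf{w}_t \rVert} \ \ge\ \frac{1}{t\,\eta R}
\]

A smaller learning rate η and fewer updates t therefore give a larger lower bound on the margin, which is exactly the conclusion drawn on the next slide.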
31. Early Stopping
● Small learning rate → large margin
● Small number of updates → large margin
→ Early stopping!!!
32. Early Stopping
33. Early Stopping
(figure, caption: Training iteration ↑)
34. Conclusion
35. Conclusion
● Regularization: techniques to prevent overfitting
○ L1 norm: sparsity of parameters
○ L2 norm: large margin
○ Early stopping
○ ……etc.
● The philosophy of regularization
○ Occam's razor: "Entities must not be multiplied beyond necessity."
36. Reference
● Learning From Data - A Short Course
○ Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
● Ronan Collobert, Samy Bengio, "Links Between Perceptrons, MLPs and SVMs", in Proc. ICML (ACM), 2004.
