# A short introduction to statistical learning

Groupe de travail SPARKGEN, INRA d’Auzeville
March 12th, 2018


1. A short introduction to statistical learning. Nathalie Villa-Vialaneix, nathalie.villa-vialaneix@inra.fr, http://www.nathalievilla.org. GT “SPARKGEN”, March 12th, 2018, INRA, Toulouse. Nathalie Villa-Vialaneix | Introduction to statistical learning 1/58
2. Outline
   1. Introduction: background and notations; underfitting/overfitting; consistency.
   2. CART and random forests: introduction to CART (learning, prediction); overview of random forests (bootstrap/bagging, random forest).
   3. SVM.
   4. (not deep) Neural networks: seminal references; multi-layer perceptrons; theoretical properties of perceptrons; learning perceptrons; learning in practice.
8. Background
   - Purpose: predict Y from X.
   - What we have: n observations of (X, Y): (x1, y1), ..., (xn, yn).
   - What we want: estimate the unknown Y for new X: xn+1, ..., xm.
   - X can be: numeric variables, factors, or a combination of numeric variables and factors.
   - Y can be: a numeric variable (Y ∈ R) ⇒ (supervised) regression [Fr. régression]; a factor ⇒ (supervised) classification [Fr. discrimination].
14. Basics
   - From (xi, yi)i, definition of a machine Φn s.t. ŷnew = Φn(xnew).
   - If Y is numeric, Φn is called a regression function [Fr. fonction de régression]; if Y is a factor, Φn is called a classifier [Fr. classifieur].
   - Φn is said to be trained or learned from the observations (xi, yi)i.
   - Desirable properties: accuracy to the observations (predictions made on known data are close to the observed values) and generalization ability (predictions made on new data are also accurate). Conflicting objectives!
21. Underfitting/Overfitting [Fr. sous/sur-apprentissage] (figure slides): given a function x → y to be estimated, the observations we might have and the observations we do have, a first estimation from the observations underfits, a second one gives an accurate estimation, and a third one overfits.
27. Errors
   - Training error (measures the accuracy to the observations):
     - if y is a factor: misclassification rate $\frac{\#\{i:\ \hat{y}_i \neq y_i,\ i = 1, \dots, n\}}{n}$;
     - if y is numeric: mean square error (MSE) $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$.
   - Test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
     1. split the data into training/test sets (usually 80%/20%);
     2. train Φn on the training dataset;
     3. compute the test error on the remaining data.
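The simple-validation recipe above can be sketched in R (the language used in the practical slides); the simulated data, the polynomial degree and the noise level are illustrative choices, not from the slides:

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# 1. split the data into training/test sets (80%/20%)
train <- sample(n, size = 0.8 * n)

# 2. train Phi^n on the training dataset (here, a cubic polynomial fit)
fit <- lm(y ~ poly(x, 3), data = data.frame(x, y), subset = train)

# 3. compute the test error (MSE) on the remaining data
pred <- predict(fit, newdata = data.frame(x = x[-train]))
test_mse <- mean((y[-train] - pred)^2)
```

The training MSE of the same model would typically be a little smaller than `test_mse`, which is the point of keeping a held-out test set.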
31. Example (figure slides): observations; training/test datasets; training/test errors; summary.
33. Bias / Variance trade-off
   - This problem is also related to the well-known bias/variance trade-off:
     - bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor);
     - variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor).
   - The overall error is $\mathbb{E}(\mathrm{MSE}) = \mathrm{Bias}^2 + \mathrm{Variance}$.
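The identity on this slide is the usual decomposition of the expected squared error of an estimator $\hat{\Phi}^n(x)$ of $\Phi(x)$ at a fixed point $x$, obtained by adding and subtracting $\mathbb{E}[\hat{\Phi}^n(x)]$ (the irreducible noise term is omitted here, as on the slide):

```latex
\mathbb{E}\big[(\hat{\Phi}^n(x) - \Phi(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{\Phi}^n(x)] - \Phi(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{\Phi}^n(x) - \mathbb{E}[\hat{\Phi}^n(x)])^2\big]}_{\text{Variance}}
```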
34. Consistency in the parametric / nonparametric case
   - Parametric framework (linear methods): an assumption is made on the form of the relation between X and Y, $Y = \beta^T X + \epsilon$. β is estimated from the observations (x1, y1), ..., (xn, yn) by a given method which computes a $\hat{\beta}^n$. The estimation is said to be consistent if $\hat{\beta}^n \xrightarrow{n \to +\infty} \beta$, under (possibly) technical assumptions on X, ε, Y.
   - Nonparametric framework: the form of the relation between X and Y is unknown, $Y = \Phi(X) + \epsilon$. Φ is estimated from the observations by a given method which computes a $\hat{\Phi}^n$. The estimation is said to be consistent if $\hat{\Phi}^n \xrightarrow{n \to +\infty} \Phi$, under (possibly) technical assumptions on X, ε, Y.
37. Consistency from the statistical learning perspective [Vapnik, 1995]
   - Question: are we really interested in estimating Φ, or rather in having the smallest prediction error?
   - Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function $R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which calculates an error), $\mathbb{E}(R(\Phi^n(X), Y)) \xrightarrow{n \to +\infty} \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}(R(\Phi(X), Y))$, for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.
   - Definitions: $L^* = \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}(R(\Phi(X), Y))$ and $L_{\Phi} = \mathbb{E}(R(\Phi(X), Y))$.
39. Desirable properties from a mathematical perspective
   - Simplified framework: $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$ (binary classification).
   - Learning process: choose a machine Φn in a class of functions $\mathcal{C} \subset \{\Phi : \mathcal{X} \to \mathbb{R}\}$ (e.g., C is the set of all functions that can be built using an SVM).
   - Error decomposition: $L_{\Phi^n} - L^* \leq \big(L_{\Phi^n} - \inf_{\Phi \in \mathcal{C}} L_{\Phi}\big) + \big(\inf_{\Phi \in \mathcal{C}} L_{\Phi} - L^*\big)$, where:
     - $\inf_{\Phi \in \mathcal{C}} L_{\Phi} - L^*$ reflects the richness of C (C must be rich to ensure that this term is small);
     - $L_{\Phi^n} - \inf_{\Phi \in \mathcal{C}} L_{\Phi} \leq 2 \sup_{\Phi \in \mathcal{C}} |\hat{L}^n_{\Phi} - L_{\Phi}|$, with $\hat{L}^n_{\Phi} = \frac{1}{n}\sum_{i=1}^{n} R(\Phi(x_i), y_i)$, reflects the generalization capability of C (in the worst case, the empirical error must be close to the true error: C must not be too rich to ensure that this term is small).
43. Overview: CART, Classification And Regression Trees, introduced by [Breiman et al., 1984].
   - Advantages: classification OR regression (i.e., Y can be a numeric variable or a factor); nonparametric method, no prior assumption needed; can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method); provides an intuitive interpretation.
   - Drawbacks: trees require a large training dataset to be efficient and, as a consequence, are often too simple to provide accurate predictions.
44. Example (figure slide): X = (Gender, Age, Height) and Y = Weight. The root is split on Height < 1.60m / Height > 1.60m, then nodes are split on Gender (M/F) and Age (< 30 / > 30), down to the leaves Y1, ..., Y5. Vocabulary: root, split, node, leaf (terminal node).
45. CART learning process. Algorithm:
       Start from the root
       repeat
           move to a “new” node
           if the node is homogeneous or small enough then
               STOP
           else
               split the node into two child nodes with maximal “homogeneity”
           end if
       until all nodes are processed
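This growing procedure is what the `rpart` package implements in R; a minimal sketch on the iris data (the `minsplit`/`cp` values are illustrative, not from the slides):

```r
library(rpart)

data(iris)
# grow a classification tree; minsplit and cp act as stopping criteria
fit <- rpart(Species ~ ., data = iris,
             control = rpart.control(minsplit = 5, cp = 0.01))

# each split is chosen to maximize the homogeneity of the two child nodes
print(fit)
pred <- predict(fit, iris, type = "class")
mean(pred != iris$Species)   # training misclassification rate
```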
50. Further details
   - Homogeneity? If Y is a numeric variable: variance of (yi)i for the observations assigned to the node (the Gini index is also sometimes used). If Y is a factor: node purity, the % of observations assigned to the node whose Y values are not the node majority class.
   - Stopping criteria? Minimum node size (generally 1 or 5); minimum node purity or variance; maximum tree depth.
   - Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning...
53. Making new predictions
   - A new observation xnew is assigned to a leaf (straightforward).
   - The corresponding prediction ŷnew is: if Y is numeric, the mean value of the (training set) observations assigned to the same leaf; if Y is a factor, the majority class of the (training set) observations assigned to the same leaf.
56. Advantages/Drawbacks. Random forest: introduced by [Breiman, 2001].
   - Advantages: classification OR regression (i.e., Y can be a numeric variable or a factor); nonparametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small samples.
   - Drawbacks: black-box model, supported by only a few mathematical results (consistency...) until now.
60. Basic description
   - A fact: when the sample size is small, you might be unable to estimate properly.
   - This issue is commonly tackled by bootstrapping and, more specifically, bagging (Bootstrap Aggregating) ⇒ it reduces the variance of the estimator.
   - Bagging: combination of simple (and under-efficient) regression (or classification) functions.
   - Random forest = CART + bagging.
66. Bootstrap
   - Bootstrap sample: random sampling (with replacement) of the training dataset; samples have the same size as the original dataset.
   - General (and robust) approach to solve several problems, e.g., estimating confidence intervals (of the mean X̄, with no prior assumption on the distribution of X):
     1. build P bootstrap samples from (xi)i;
     2. use them to estimate X̄, P times;
     3. the confidence interval is based on the percentiles of the empirical distribution of X̄.
   - Also useful to estimate p-values, residuals, ...
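The three steps above can be written in a few lines of base R, here for a percentile confidence interval of the mean (the sample, P = 1000 and the 95% level are illustrative choices):

```r
set.seed(1)
x <- rexp(50)   # a sample from a non-Gaussian distribution
P <- 1000

# steps 1 and 2: P bootstrap samples, and P estimates of the mean
boot_means <- replicate(P, mean(sample(x, replace = TRUE)))

# step 3: CI from the percentiles of the empirical distribution
ci <- quantile(boot_means, c(0.025, 0.975))
```

Nothing about the distribution of X was assumed, which is the appeal of the method.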
69. Bagging
   - Average the estimates of the regression (or the classification) function obtained from B bootstrap samples.
   - Bagging with regression trees:
       1: for b = 1, ..., B do
       2:     build a bootstrap sample ξb
       3:     train a regression tree on ξb, $\hat{\phi}_b$
       4: end for
       5: estimate the regression function by $\hat{\Phi}^n(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\phi}_b(x)$
   - For classification, the predicted class is the majority-vote class.
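The bagging pseudocode above can be sketched directly with `rpart` regression trees (the simulated data and B = 25 are illustrative):

```r
library(rpart)

set.seed(1)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.2)
B <- 25

# for b = 1..B: build a bootstrap sample xi_b, train a tree phi_b on it
trees <- lapply(seq_len(B), function(b) {
  xi_b <- d[sample(n, replace = TRUE), ]
  rpart(y ~ x, data = xi_b)
})

# Phi^n(x) = (1/B) * sum_b phi_b(x): average the B tree predictions
bagged_predict <- function(newdata)
  rowMeans(sapply(trees, predict, newdata = newdata))

preds <- bagged_predict(d)
```

For a factor Y, the averaging step would be replaced by a majority vote over the B predicted classes.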
72. Random forests: CART bagging with additional variations.
   1. Each node is based on a random (and different) subset of q variables (an advisable choice for q is $\sqrt{p}$ for classification and $p/3$ for regression).
   2. Each tree is fully developed (overfitted).
   - Hyperparameters: those of the CART algorithm, plus those specific to the random forest: q and the number of trees.
   - Random forests are not very sensitive to hyperparameter settings: default values for q and 500/1000 trees should work in most cases.
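In R this corresponds to the `randomForest` package, where `mtry` is the q above (it defaults to $\sqrt{p}$ for classification) and `ntree` is the number of trees; a minimal sketch on the iris data:

```r
library(randomForest)

data(iris)
set.seed(1)
# ntree = 500 as suggested on the slide; mtry is left at its default
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

fit$err.rate[500, "OOB"]   # OOB error after 500 trees
importance(fit)            # per-variable mean decrease in node impurity
```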
75. Additional tools
   - OOB (Out-Of-Bag) error: error based on the observations not included in the “bag”. Stabilization of the OOB error is a good indication that there are enough trees in the forest.
   - Importance of a variable, to help interpretation: for a given variable Xj,
       1: randomize the values of the variable;
       2: make predictions from this new dataset;
       3: the importance is the mean decrease in accuracy (MSE or misclassification rate).
78. Basic introduction
   - Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1, 1\}$. A training set is given: (x1, y1), ..., (xn, yn).
   - SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002]. Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
82. Optimal margin classification (figure slides)
   - The margin is $\frac{1}{\|w\|_2}$; the observations lying on the margin are the support vectors.
   - w is chosen such that $\min_w \|w\|^2$ (the margin is the largest), under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect). ⇒ This ensures a good generalization capability.
86. Soft margin classification (figure slides)
   - The margin is $\frac{1}{\|w\|_2}$; support vectors as before.
   - w is chosen such that $\min_{w,\xi} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (the margin is the largest), under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect). ⇒ Allowing a few errors improves the richness of the class.
90. Non linear SVM (figure slides)
   - The original space X is mapped into a feature space H by a (non linear) map Ψ.
   - $w \in \mathcal{H}$ is chosen such that ($P_{C,\mathcal{H}}$): $\min_{w,\xi} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i$ (the margin in the feature space is the largest), under the constraints $y_i(\langle w, \Psi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect).
93. SVM from different points of view
   - A regularization problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (P^2_{\lambda,\mathcal{H}})$: $\min_{w \in \mathcal{H}} \underbrace{\frac{1}{n}\sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|_{\mathcal{H}}^2}_{\text{penalization term}}$, where $f_w(x) = \langle \Psi(x), w \rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (hinge loss). (Figure: errors versus ŷ for y = 1; blue: hinge loss; green: misclassification error.)
   - A dual problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (D_{C,\mathcal{X}})$: $\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \Psi(x_i), \Psi(x_j) \rangle_{\mathcal{H}}$, with $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$.
   - There is no need to know Ψ and H: choose a function K with a few good properties and use it as the dot product in H, $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$; the dual objective then becomes $\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ under the same constraints.
95. Which kernels?
   - Minimum properties that a kernel should fulfil: symmetry, $K(u, u') = K(u', u)$; positivity, $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i) \subset \mathbb{R}^N$, $\forall\, (x_i) \subset \mathcal{X}^N$, $\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$.
   - [Aronszajn, 1950]: there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ and a function $\Psi : \mathcal{X} \to \mathcal{H}$ such that $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$.
   - Examples: the Gaussian kernel, $K(x, x') = e^{-\gamma \|x - x'\|^2}$ for $x, x' \in \mathbb{R}^d$ (universal on every bounded subset of $\mathbb{R}^d$); the linear kernel, $K(x, x') = x^T x'$ (not universal).
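Symmetry and positivity can be checked numerically on a Gaussian kernel matrix in base R (γ = 1 and the random points are illustrative):

```r
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)   # 10 points in R^2
gamma <- 1

# Gaussian kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
K <- exp(-gamma * as.matrix(dist(X))^2)

all(abs(K - t(K)) < 1e-12)   # symmetry
# positivity: a symmetric PSD matrix has no negative eigenvalues
all(eigen(K, symmetric = TRUE, only.values = TRUE)$values > -1e-12)
```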
96. In summary, what does the solution look like? $\Phi^n(x) = \sum_i \alpha_i y_i K(x_i, x)$, where only a few $\alpha_i \neq 0$: the i such that $\alpha_i \neq 0$ are the support vectors!
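With `e1071` (the package used in the practical slides) the support vectors are directly available from a fitted model; a quick sketch checking that only a few observations have $\alpha_i \neq 0$ (the `cost` value is illustrative):

```r
library(e1071)

data(iris)
iris2 <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])
fit <- svm(Species ~ ., data = iris2, kernel = "linear", cost = 0.5)

fit$tot.nSV   # number of support vectors (observations with alpha_i != 0)
nrow(fit$SV)  # their (scaled) coordinates, one row per support vector
```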
97. I’m almost dead with all this stuff on my mind!!! What in practice?

```r
data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
```
98. (continued)

```r
library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters: cost = 0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#          kernel = "linear")
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: linear
#   cost: 0.5
#   gamma: 0.25
# Number of Support Vectors: 21
```
99. 99. I’m almost dead with all these stuffs on my mind!!! What in practice? table(res.tune\$best.model\$fitted, iris\$Species) % setosa versicolor virginica % setosa 0 0 0 % versicolor 0 45 0 % virginica 0 5 50 plot(res.tune\$best.model, data=iris, Petal.Width~Petal.Length, slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262)) Nathalie Villa-Vialaneix | Introduction to statistical learning 35/58
100. 100. I’m almost dead with all these stuffs on my mind!!! What in practice? res.tune <- tune.svm(Species ~ ., data=iris, gamma = 2^(-1:1), cost = 2^(2:4)) # Parameter tuning of ’svm’: # - sampling method: 10fold cross validation # - best parameters: # gamma cost # 0.5 4 # - best performance: 0.08 res.tune\$best.model # Call: # best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1), # cost = 2^(2:4)) # Parameters: # SVM-Type: C-classification # SVM-Kernel: radial # cost: 4 # gamma: 0.5 # Number of Support Vectors: 32 Nathalie Villa-Vialaneix | Introduction to statistical learning 36/58
101. 101. I’m almost dead with all these stuffs on my mind!!! What in practice? table(res.tune\$best.model\$fitted, iris\$Species) % setosa versicolor virginica % setosa 0 0 0 % versicolor 0 49 0 % virginica 0 1 50 plot(res.tune\$best.model, data=iris, Petal.Width~Petal.Length, slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262)) Nathalie Villa-Vialaneix | Introduction to statistical learning 37/58
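For readers working in Python, roughly the same workflow as the tune.svm calls above can be sketched with scikit-learn's GridSearchCV (the grids mirror the slides, but the CV folds, and hence the selected parameters, need not match e1071's):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target > 0                      # versicolor vs virginica only
X, y = iris.data[mask], iris.target[mask]

# 10-fold CV over cost and gamma for the radial kernel, as tune.svm does
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": 2.0 ** np.arange(2, 5),
                     "gamma": 2.0 ** np.arange(-1, 2)},
                    cv=10)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("CV accuracy:", round(grid.best_score_, 3))
```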
102. 102. Outline 1 Introduction Background and notations Underﬁtting / Overﬁtting Consistency 2 CART and random forests Introduction to CART Learning Prediction Overview of random forests Bootstrap/Bagging Random forest 3 SVM 4 (not deep) Neural networks Seminal references Multi-layer perceptrons Theoretical properties of perceptrons Learning perceptrons Learning in practice Nathalie Villa-Vialaneix | Introduction to statistical learning 38/58
105. 105. What are (artiﬁcial) neural networks? Common properties (artiﬁcial) “Neural networks”: general name for supervised and unsupervised methods developed in (vague) analogy to the brain combination (network) of simple elements (neurons) Example of graphical representation: INPUTS OUTPUTS Nathalie Villa-Vialaneix | Introduction to statistical learning 39/58
110. 110. Different types of neural networks A neural network is deﬁned by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classiﬁcation and regression); Radial basis function networks (RBF): same purpose but based on local smoothing; Self-organizing maps (SOM also sometimes called Kohonen’s maps) or Topographic maps: dedicated to unsupervised problems (clustering), self-organized; . . . In this talk, focus on MLP. Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
112. 112. MLP: Advantages/Drawbacks Advantages classiﬁcation OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: ﬂexible; good theoretical properties. Drawbacks hard to train (high computational cost, especially when d is large); overﬁt easily; “black box” models (hard to interpret). Nathalie Villa-Vialaneix | Introduction to statistical learning 41/58
113. 113. References Advised references: [Bishop, 1995, Ripley, 1996] overview of the topic from a learning (more than statistical) perspective [Devroye et al., 1996, Györﬁ et al., 2002] in dedicated chapters present statistical properties of perceptrons Nathalie Villa-Vialaneix | Introduction to statistical learning 42/58
116. 116. Analogy to the brain 1 a neuron collects signals from neighboring neurons through its dendrites 2 when the total signal is above a given threshold, the neuron is activated 3 ... and a signal is sent to other neurons through the axon. Connections which frequently lead to activating a neuron are reinforced (tend to have an increasing impact on the destination neuron) Nathalie Villa-Vialaneix | Introduction to statistical learning 43/58
117. 117. First model of artificial neuron [Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962] [diagram: inputs x(1), . . . , x(p) weighted by w1, . . . , wp, summed with a bias w0 and thresholded into f(x)] f : x ∈ R^p → 1{Σj=1..p wj x(j) + w0 ≥ 0} Nathalie Villa-Vialaneix | Introduction to statistical learning 44/58
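The McCulloch–Pitts unit fits in a few lines. A sketch (the weights are hand-picked to realize a logical AND, purely for illustration):

```python
import numpy as np

def mcp_neuron(x, w, w0):
    """Threshold unit: fires (returns 1) iff sum_j w_j x_j + w0 >= 0."""
    return 1 if np.dot(w, x) + w0 >= 0 else 0

# Hand-picked weights implementing a logical AND of two binary inputs
w, w0 = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", mcp_neuron(np.array(x), w, w0))
```

A single threshold unit can only realize linearly separable decisions; combining units into layered networks, as on the next slides, removes that limitation.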
122. 122. (artificial) Perceptron Layers MLP have one input layer (x ∈ R^p), one output layer (y ∈ R or in {1, . . . , K − 1}) and several hidden layers; no connections within a layer; connections between two consecutive layers only (feedforward). Example (regression, y ∈ R): [diagram: INPUTS x = (x(1), . . . , x(p)) → Layer 1 → Layer 2 → y OUTPUTS, a 2-hidden-layer MLP] Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
123. 123. A neuron in MLP [diagram: inputs v1, v2, v3 weighted by w1, w2, w3 and summed with a bias (biais) w0] Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
125. 125. A neuron in MLP Standard activation functions fonctions de transfert / d’activation Biologically inspired: Heaviside function h(z) = 0 if z < 0, 1 otherwise Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
126. 126. A neuron in MLP Standard activation functions Main issue with the Heaviside function: it is not continuous! Identity: h(z) = z Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
127. 127. A neuron in MLP Standard activation functions But the identity activation function gives a linear model if used with one hidden layer: not flexible enough. Logistic function: h(z) = 1 / (1 + exp(−z)) Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
128. 128. A neuron in MLP Standard activation functions Another popular activation function (useful to model positive real numbers): Rectified linear (ReLU): h(z) = max(0, z) Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
129. 129. A neuron in MLP General sigmoid sigmoid: a nondecreasing function h : R → R such that lim z→+∞ h(z) = 1 and lim z→−∞ h(z) = 0 Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
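The activation functions listed on the last few slides, in one compact sketch:

```python
import numpy as np

def heaviside(z):           # biologically inspired, but discontinuous
    return np.where(z < 0, 0.0, 1.0)

def identity(z):            # continuous, but yields a linear model
    return z

def logistic(z):            # a smooth sigmoid: values in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                # rectified linear: max(0, z)
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(heaviside(z), identity(z), logistic(z).round(3), relu(z))
```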
130. 130. Focus on one-hidden-layer perceptrons Regression case f(x) = Σk=1..Q w(2)k hk(xᵀw(1)k + w(0)k) + w(2)0, with hk a (logistic) sigmoid Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
131. 131. Focus on one-hidden-layer perceptrons Binary classification case ψ(x) = h0( Σk=1..Q w(2)k hk(xᵀw(1)k + w(0)k) + w(2)0 ) with h0 the logistic sigmoid or the identity. Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
132. 132. Focus on one-hidden-layer perceptrons Binary classification case: decision with f(x) = 0 if ψ(x) < 1/2, 1 otherwise Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
133. 133. Focus on one-hidden-layer perceptrons Extension to any classification problem in {1, . . . , K − 1}: straightforward extension to multiple classes with a multiple-output perceptron (number of output units equal to K) and a maximum-probability rule for the decision. Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
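The formulas for f(x) and ψ(x) translate directly into code. A sketch of the forward pass with made-up weights (Q = 5 hidden units, logistic activations):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, w2, b2):
    """One-hidden-layer perceptron: f(x) = sum_k w2_k h(x.W1_k + b1_k) + b2."""
    hidden = logistic(W1 @ x + b1)          # h_k(x^T w(1)_k + w(0)_k)
    return w2 @ hidden + b2                 # linear output for regression

rng = np.random.default_rng(0)
p, Q = 3, 5                                 # inputs, hidden units (illustrative)
W1, b1 = rng.normal(size=(Q, p)), rng.normal(size=Q)
w2, b2 = rng.normal(size=Q), 0.1

x = np.array([0.2, -1.0, 0.5])
f = mlp_forward(x, W1, b1, w2, b2)          # regression output
psi = logistic(f)                           # classification: pass through h0
label = 0 if psi < 0.5 else 1               # decision rule from the slide
print(f, psi, label)
```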
135. 135. Theoretical properties of perceptrons This section answers two questions: 1 can we approximate any function g : [0, 1]^p → R arbitrarily well with a perceptron? 2 when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair (X, Y), is it consistent? (i.e., does it reach the minimum possible error asymptotically as the number of observations grows to inﬁnity?) Nathalie Villa-Vialaneix | Introduction to statistical learning 48/58
137. 137. Illustration of the universal approximation property Simple example: a function to approximate: g : x ∈ [0, 1] → sin(1/(x + 0.1)); trying to approximate this function (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer Nathalie Villa-Vialaneix | Introduction to statistical learning 49/58
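One hedged way to reproduce this illustration numerically without anticipating the training algorithm (which comes later in the talk): draw the hidden-layer weights at random and fit only the output layer by least squares. This is not how MLPs are usually trained, but it is enough to see the approximation improve as Q grows (scales and seed are arbitrary):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_random_mlp(x, y, Q, rng):
    """Random hidden layer of size Q, output weights by least squares."""
    W1 = rng.normal(scale=10.0, size=(Q, 1))     # made-up scale for illustration
    b1 = rng.normal(scale=10.0, size=(Q, 1))
    H = logistic(W1 @ x[None, :] + b1)           # (Q, n) hidden activations
    H = np.vstack([H, np.ones_like(x)[None, :]]) # row of ones for the output bias
    w2, *_ = np.linalg.lstsq(H.T, y, rcond=None)
    return ((H.T @ w2 - y) ** 2).mean()          # training MSE

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(1.0 / (x + 0.1))                      # the function from the slide

errors = {Q: fit_random_mlp(x, y, Q, rng) for Q in (2, 10, 50)}
print(errors)
```

The fit error typically drops sharply between Q = 2 and Q = 50, which is the qualitative content of the universal approximation result on the next slide.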
140. 140. Universal property from a theoretical point of view Set of MLPs with Q hidden neurons: PQ(h) = { x ∈ R^p → Σk=1..Q w(2)k h(xᵀw(1)k + w(0)k) + w(2)0 : w(2)k, w(0)k ∈ R, w(1)k ∈ R^p } Set of all MLPs: P(h) = ∪Q∈N PQ(h) Universal approximation [Pinkus, 1999] If h is a non-polynomial continuous function then, for any continuous function g : [0, 1]^p → R and any ε > 0, ∃ f ∈ P(h) such that: sup x∈[0,1]^p |f(x) − g(x)| ≤ ε. Nathalie Villa-Vialaneix | Introduction to statistical learning 50/58
141. 141. Remarks on universal approximation continuity of the activation function is not required (see [Devroye et al., 1996] for a result with arbitrary sigmoids) other versions of this property are given in [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999] for different functional spaces for g none of the spaces PQ(h), for a ﬁxed Q, has this property this result can be used to show that perceptrons are consistent whenever Q log(n)/n → 0 as n → +∞ [Farago and Lugosi, 1993, Devroye et al., 1996] Nathalie Villa-Vialaneix | Introduction to statistical learning 51/58
144. 144. Empirical error minimization Given i.i.d. observations (Xi, Yi) of (X, Y), how to choose the weights w? Standard approach: minimize the empirical L2 risk: R̂n(w) = Σi=1..n [fw(Xi) − Yi]² with Yi ∈ R for the regression case; Yi ∈ {0, 1} for the classiﬁcation case, with the associated decision rule x → 1{fw(x) ≥ 1/2}. But: R̂n(w) is not convex in w ⇒ a general (non-convex) optimization problem Nathalie Villa-Vialaneix | Introduction to statistical learning 52/58
146. 146. Optimization with gradient descent Method: initialize (randomly or with some prior knowledge) the weights w(0) ∈ R^(Qp+2Q+1). Batch approach: for t = 1, . . . , T, w(t + 1) = w(t) − μ(t) ∇w R̂n(w(t)); online (or stochastic) approach: write R̂n(w) = Σi=1..n Ei(w) with Ei(w) = [fw(Xi) − Yi]² and, for t = 1, . . . , T, randomly pick i ∈ {1, . . . , n} and update: w(t + 1) = w(t) − μ(t) ∇w Ei(w(t)). Nathalie Villa-Vialaneix | Introduction to statistical learning 53/58
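Both update rules on a toy problem. To keep the sketch short it uses a linear model fw(x) = wᵀx instead of a full perceptron; the form of the updates is identical (step sizes and iteration counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Batch gradient descent: one gradient of the full risk R_n per iteration
w = np.zeros(p)
for t in range(200):
    grad = 2 * X.T @ (X @ w - y)
    w -= 1e-3 * grad

# Stochastic gradient descent: one randomly chosen term E_i per iteration
ws = np.zeros(p)
for t in range(5000):
    i = rng.integers(n)
    grad_i = 2 * (X[i] @ ws - y[i]) * X[i]
    ws -= 1e-3 * grad_i

print("batch:", w.round(2), "stochastic:", ws.round(2))
```

Each stochastic step is much cheaper (it touches one observation instead of n), which is why the online variant is attractive when n is large.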
147. 147. Discussion about practical choices for this approach the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of μ(t), but convergence can be slow the stochastic version is usually noisier and less efﬁcient per pass, but is useful for large datasets (n large) more efﬁcient algorithms exist to solve the optimization task. The one implemented in the R package nnet is a quasi-Newton method (the BFGS algorithm) in all cases, the solutions returned are, at best, local minima which strongly depend on the initialization: using more than one initialization state is advised Nathalie Villa-Vialaneix | Introduction to statistical learning 54/58
149. 149. Gradient backpropagation method [Rumelhart and Mc Clelland, 1986] The gradient backpropagation rétropropagation du gradient principle is used to easily calculate gradients in perceptrons (or in other types of neural network): This way, stochastic gradient descent alternates: a forward step which aims at calculating outputs from all observations Xi given a value of the weights w a backward step in which the gradient backpropagation principle is used to obtain the gradient for the current weights w Nathalie Villa-Vialaneix | Introduction to statistical learning 55/58
150. 150. Backpropagation in practice [one-hidden-layer MLP diagram] initialize the weights Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
151. 151. Backpropagation in practice Forward step: for all k, calculate a(1)k = Xiᵀw(1)k + w(0)k Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
152. 152. Backpropagation in practice Forward step: for all k, calculate z(1)k = hk(a(1)k) Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
153. 153. Backpropagation in practice Forward step: calculate a(2) = Σk=1..Q w(2)k z(1)k + w(2)0 Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
154. 154. Backpropagation in practice Backward step: calculate ∂Ei/∂w(2)k = δ(2) × z(1)k with δ(2) = ∂Ei/∂a(2) = ∂[h0(a(2)) − Yi]²/∂a(2) = 2 h0′(a(2)) × [h0(a(2)) − Yi] Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
155. 155. Backpropagation in practice Backward step: ∂Ei/∂w(1)kj = δ(1)k × X(j)i with δ(1)k = ∂Ei/∂a(1)k = ∂Ei/∂a(2) × ∂a(2)/∂a(1)k = δ(2) × w(2)k h′k(a(1)k) Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
156. 156. Backpropagation in practice Backward step: ∂Ei/∂w(0)k = δ(1)k Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
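The forward and backward steps above, assembled for a single observation, with a finite-difference check that the backpropagated gradient is correct. The sketch uses the logistic activation everywhere (so h′(z) = h(z)(1 − h(z))); weights and data are random stand-ins:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2):
    a1 = W1 @ x + b1                      # a(1)_k = x^T w(1)_k + w(0)_k
    z1 = logistic(a1)                     # z(1)_k = h_k(a(1)_k)
    a2 = w2 @ z1 + b2                     # a(2)
    return a1, z1, logistic(a2)           # output h_0(a(2))

def backward(x, y, W1, b1, w2, b2):
    """Gradients of E_i = [h_0(a(2)) - y]^2 via backpropagation."""
    a1, z1, out = forward(x, W1, b1, w2, b2)
    d2 = 2 * out * (1 - out) * (out - y)  # delta(2) = dE/da(2)
    d1 = d2 * w2 * z1 * (1 - z1)          # delta(1)_k = delta(2) w(2)_k h'_k(a(1)_k)
    return d2 * z1, d2, np.outer(d1, x), d1   # dE/dw2, dE/db2, dE/dW1, dE/db1

rng = np.random.default_rng(0)
p, Q = 3, 4
W1, b1 = rng.normal(size=(Q, p)), rng.normal(size=Q)
w2, b2 = rng.normal(size=Q), 0.0
x, y = rng.normal(size=p), 1.0

g_w2, g_b2, g_W1, g_b1 = backward(x, y, W1, b1, w2, b2)

# Finite-difference check on one first-layer weight
eps = 1e-6
def loss(W):
    return (forward(x, W, b1, w2, b2)[2] - y) ** 2
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)
print(abs(num - g_W1[0, 0]))              # discrepancy with backprop gradient
```

The forward pass stores a(1) and z(1) precisely because the backward pass reuses them; that is the whole point of the method.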
159. 159. Initialization and stopping of the training algorithm 1 How to initialize weights? Standard choices: w(1)jk ∼ N(0, 1/√p) and w(2)k ∼ N(0, 1/√Q). In the R package nnet, weights are sampled uniformly in [−0.5, 0.5], or in [−1/maxi |X(j)i|, 1/maxi |X(j)i|] if X(j) is large. 2 When to stop the algorithm? (gradient descent or alike) Standard choices: a bounded number of iterations T; a target value of the error R̂n(w); a target value of the decrease R̂n(w(t)) − R̂n(w(t + 1)). In the R package nnet, a combination of the three criteria is used and tunable. Nathalie Villa-Vialaneix | Introduction to statistical learning 57/58
163. 163. Strategies to avoid overﬁtting Properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method Early stopping: for Q large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase Weight decay: for Q large enough, penalize the empirical risk with a function of the weights, e.g., R̂n(w) + λ wᵀw Noise injection: perturb the input data with a random noise during the training Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
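Weight decay in its simplest closed form: for a model that is linear in its parameters, the penalized risk R̂n(w) + λ wᵀw has an explicit minimizer (ridge regression), which makes the shrinkage effect easy to see. A sketch with made-up data and λ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

def weight_decay_fit(lam):
    """Minimize sum_i (x_i^T w - y_i)^2 + lam * w^T w (closed-form solution)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w0 = weight_decay_fit(0.0)       # no penalty
w1 = weight_decay_fit(10.0)      # penalized: weights are shrunk toward 0

print(np.linalg.norm(w0), np.linalg.norm(w1))
```

For a perceptron the same penalty is simply added to R̂n(w) before taking gradients; there is no closed form, but the shrinkage mechanism is identical.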
164. 164. References Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404. Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA. Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classiﬁcation and Regression Trees. Chapman and Hall, Boca Raton, Florida, USA. Devroye, L., Györﬁ, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY, USA. Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classiﬁers. IEEE Transactions on Information Theory, 39(4):1146–1151. Györﬁ, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257. Hornik, K. (1993). Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
165. 165. Some new results on neural network approximation. Neural Networks, 6(8):1069–1072. Mc Culloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195. Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press. Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408. Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA. Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA, USA. Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791. Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactiﬁcations. Neural Networks, 12(3):467–477. Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
166. 166. Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA. Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58