# MLHEP Lectures - day 2, basic track

* linear models: logistic regression
* polynomial decision rule and polynomial regression
* SVM (Support Vector Machine), kernel trick
* Overfitting: two definitions
* Model selection
* Regularization: L1, L2, elastic net.
* Decision trees
* splitting criteria for classification and regression
* overfitting in trees: pre-stopping and post-pruning
* non-stability of trees
* feature importance
* Ensembling
* RSM, subsampling, bagging.
* Random Forest

Published in: Science
### MLHEP Lectures - day 2, basic track

1. Machine Learning in High Energy Physics, Lectures 3 & 4. Alex Rogozhnikov. Lund, MLHEP 2016.
2. Recapitulation: classification, regression; kNN classifier and regressor; ROC curve, ROC AUC.
3. Bayes optimal classifier. Given the exact density functions of the distributions, we can build an optimal classifier. We need to estimate the ratio of likelihoods: $\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$
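The ratio on the Bayes optimal classifier slide can be evaluated directly once the class densities are known. The sketch below is only an illustration: the one-dimensional Gaussian densities, the equal priors, and the helper names `gaussian_pdf` and `posterior_ratio` are assumptions, not part of the lecture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_ratio(x, prior_1, pdf_1, pdf_0):
    """p(y=1|x) / p(y=0|x) = (prior ratio) * (likelihood ratio)."""
    prior_0 = 1.0 - prior_1
    return (prior_1 / prior_0) * (pdf_1(x) / pdf_0(x))

# Hypothetical toy setup: signal ~ N(+1, 1), background ~ N(-1, 1), equal priors.
def signal_pdf(x):
    return gaussian_pdf(x, 1.0, 1.0)

def background_pdf(x):
    return gaussian_pdf(x, -1.0, 1.0)

ratio_at_zero = posterior_ratio(0.0, 0.5, signal_pdf, background_pdf)
ratio_at_one = posterior_ratio(1.0, 0.5, signal_pdf, background_pdf)
```

A ratio above 1 means the event is more likely signal than background; in this toy setup the decision boundary sits at x = 0, where the ratio equals 1.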
4. Density estimation: histograms, kernel density estimation.
5. Parametric density estimation: single Gaussian distribution; Gaussian mixtures (EM algorithm).
6. QDA (quadratic discriminant analysis). QDA follows the generative approach. The main assumption is that the distribution of events within each class is multivariate Gaussian.
7. Logistic regression. Decision function: $d(x) = \langle w, x \rangle + w_0$. Sharp rule: $\hat{y} = \operatorname{sgn} d(x)$.
8. Logistic regression. Smooth rule: $d(x) = \langle w, x \rangle + w_0$, $p_{+1}(x) = \sigma(d(x))$, $p_{-1}(x) = \sigma(-d(x))$. Optimize the weights $w, w_0$ to maximize the log-likelihood: $\mathcal{L} = -\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$
9. Logistic loss. The term "loss" refers to the quantity we are minimizing. Losses typically estimate our risks and are denoted $\mathcal{L}$: $\mathcal{L} = -\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$. LogLoss penalty for a single observation: $L(x_i, y_i) = -\ln(p_{y_i}(x_i)) = \begin{cases} \ln(1 + e^{-d(x_i)}), & y_i = +1 \\ \ln(1 + e^{d(x_i)}), & y_i = -1 \end{cases} = \ln(1 + e^{-y_i d(x_i)})$. The margin $y_i d(x_i)$ is expected to be high for all events.
10. Logistic loss $L(x_i, y_i)$ is a convex function. Simple analysis shows that $\mathcal{L}$ is a sum of convex functions w.r.t. $w$, so the optimization problem has at most one optimum. Comment: MLE is not guaranteed to be a good choice.
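The single-observation penalty $\ln(1 + e^{-y d(x)})$ from the logistic loss slides is short to write down in code. This is a minimal sketch with assumed names (`sigmoid`, `log_loss`); the branch on the sign of the margin is an added numerical-stability detail, not something the slides discuss.

```python
import math

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(d, y):
    """Logistic loss ln(1 + exp(-y * d)) for a label y in {-1, +1}."""
    margin = y * d
    # Assumption for stability: rewrite the formula for negative margins
    # so that math.exp never receives a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

loss_confident_right = log_loss(4.0, +1)   # large margin, small penalty
loss_confident_wrong = log_loss(4.0, -1)   # negative margin, large penalty
```

For moderate arguments the value agrees with $-\ln(p_{y}(x))$ computed through `sigmoid`, which is the form used on the slides.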
11. Visualization of logistic regression.
12. Gradient descent. Problem: find $w$ to minimize $\mathcal{L}$. Gradient descent: $w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$, where $\eta$ is the step size (also called shrinkage or learning rate).
14. Stochastic gradient descent (SGD). $\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$. On each iteration make a step using only one event: take $i$, a random event from the training data, and update $w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$. Each iteration is done much faster, but the training process is less stable, so smaller steps are made.
15. Stochastic gradient descent. We can decrease the learning rate $\eta_t$ over time. At iteration $t$: $w \leftarrow w - \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$. This process converges to a local minimum if: $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$, $\eta_t > 0$.
16. SGD with momentum. SGD (and GD) has problems with narrow valleys (when the Hessian is very far from identity). Improvement: use momentum, which accumulates the gradient, $0.9 < \gamma < 1$: $v \leftarrow \gamma v + \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$, $w \leftarrow w - v$.
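The update rules from the SGD slides can be combined into one small training loop for logistic regression. The following is a sketch under assumptions: the schedule $\eta_t = \eta_0 / t$ (one choice that satisfies the convergence conditions), the value $\gamma = 0.95$, the clamping inside `sigmoid`, the toy dataset, and the name `sgd_momentum` are all illustrative.

```python
import math
import random

def sigmoid(t):
    # Clamp the argument so math.exp cannot overflow (added safety detail).
    t = max(-30.0, min(30.0, t))
    return 1.0 / (1.0 + math.exp(-t))

def sgd_momentum(X, y, n_steps=5000, eta0=0.5, gamma=0.95, seed=0):
    """Single-event SGD with momentum for the logistic loss.

    Each row of X starts with the constant feature 1 (so w[0] plays the
    role of w_0); labels y are in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    v = [0.0] * len(X[0])
    for t in range(1, n_steps + 1):
        eta_t = eta0 / t                      # sum eta_t = inf, sum eta_t^2 < inf
        i = rng.randrange(len(X))             # random event from training data
        d = sum(wj * xj for wj, xj in zip(w, X[i]))
        # Gradient of ln(1 + exp(-y d)) w.r.t. w is -y * sigmoid(-y d) * x.
        coef = -y[i] * sigmoid(-y[i] * d)
        for j, xj in enumerate(X[i]):
            v[j] = gamma * v[j] + eta_t * coef * xj   # v <- gamma v + eta_t dL/dw
            w[j] -= v[j]                              # w <- w - v
    return w

# Hypothetical toy data: one feature, the label is the sign of the feature.
X_toy = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y_toy = [-1, -1, +1, +1]
w_fit = sgd_momentum(X_toy, y_toy)
```

Setting `gamma=0.0` recovers plain SGD with a decreasing step size, which makes it easy to compare the two variants on the same data.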
17. Stochastic optimization methods.
18. Stochastic optimization methods
    * applied to an additive loss function $\mathcal{L} = \sum_i L(x_i, y_i)$
    * should be preferred when optimization time is the bottleneck
    * more advanced modifications exist: AdaDelta, RMSProp, Adam; these use an adaptive step size (individually for each parameter), which is crucial when the scale of gradients is very different
    * in practice, gradients are computed using minibatches (small groups of 16 to 256 samples), not on an event-by-event basis
21. Polynomial decision rule. $d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$ is again a linear model: introduce an extended set of features $z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$, so that $d(x) = \sum_i w_i z_i = \langle w, z \rangle$, and reuse logistic regression. We can add $x_0 = 1$ as one more variable to the dataset and forget about the $w_0$ term: $d(x) = \langle w, x \rangle$.
22. Polynomial regression is done in the same way. E.g., to fit a polynomial of one variable, we construct for each event a vector $\tilde{x} = (1, x, x^2, x^3, \dots, x^d)$ and train a linear regression: $d(\tilde{x}) = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$.
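The extended feature set $z$ from the polynomial decision rule can be built explicitly. This sketch uses assumed names (`quadratic_features`, `decision`) and keeps each product $x_i x_j$ only once (for $i \le j$), which is equivalent to summing over all pairs with merged weights.

```python
from itertools import combinations_with_replacement

def quadratic_features(x):
    """Map x = (x_1, ..., x_n) to z = (1, x_i ..., x_i * x_j for i <= j)."""
    z = [1.0]
    z.extend(x)
    z.extend(a * b for a, b in combinations_with_replacement(x, 2))
    return z

def decision(w, x):
    """Linear decision function <w, z> in the extended feature space."""
    z = quadratic_features(x)
    return sum(wi * zi for wi, zi in zip(w, z))

# Hypothetical weights encoding d(x) = x_1^2 + x_2^2 - 1 (a circular boundary).
w_circle = [-1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
inside = decision(w_circle, [0.0, 0.0])
outside = decision(w_circle, [2.0, 0.0])
```

The rule is quadratic in the original features but linear in $z$, so any linear-model trainer can fit the weights without modification.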
23. Projecting into a space of higher dimension. SVM with polynomial kernel visualization.
24. Logistic regression overview
    * classifier based on a linear decision rule
    * training is reduced to convex optimization
    * stochastic optimization can be used
    * can handle > 1000 features, but requires regularization (see later)
    * no interaction between features
    * other decision rules are achieved by adding new features
25. Support Vector Machine [Vapnik, Chervonenkis, 1963]. SVM selects the decision rule with the maximal possible margin (rule A).
26. Hinge loss function. SVM uses a different loss function: $L_{hinge}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$. Margin $y_i d(x_i) > 1 \to$ no penalty. (Only signal losses are compared on the plot.)
27. Kernel trick. $P$ is a projection operator (which "adds new features"): $d(x) = \langle w, x \rangle \to d(x) = \langle w, P(x) \rangle_{new}$. Assume that the optimal $w = \sum_i \alpha_i P(x_i)$ (a combination of support vectors) and look for $\alpha_i$: $d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle_{new} = \sum_i \alpha_i K(x_i, x)$. We need only the kernel, not the projection operator: $K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{new}$.
28. Kernel trick. Polynomial kernel: $K(x, \tilde{x}) = (1 + x^T \tilde{x})^d$; the projection contains all monomials up to degree $d$. A popular kernel is the Gaussian radial basis function (RBF): $K(x, \tilde{x}) = e^{-c \|x - \tilde{x}\|^2}$. It corresponds to a projection to a Hilbert space. Exercise: find the corresponding projection.
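The identity $K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle$ can be checked numerically for the polynomial kernel of degree 2 in two dimensions. The explicit projection in `project_degree2` (with its $\sqrt{2}$ scaling factors) is one standard choice written out for illustration; the function names are assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, x_tilde, degree=2):
    """Polynomial kernel (1 + <x, x_tilde>)^degree."""
    return (1.0 + dot(x, x_tilde)) ** degree

def rbf_kernel(x, x_tilde, c=1.0):
    """Gaussian RBF kernel exp(-c * ||x - x_tilde||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_tilde))
    return math.exp(-c * sq_dist)

def project_degree2(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]

a, b = [1.0, 2.0], [3.0, -1.0]
kernel_value = poly_kernel(a, b)                                # (1 + 3 - 2)^2
explicit_value = dot(project_degree2(a), project_degree2(b))    # same number
```

The kernel needs one dot product in the original space, while the explicit route works with six features here and many more as the dimension and degree grow.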
29. SVM + RBF kernel.
30. SVM + RBF kernel.
31. Overfitting. $k$NN with $k = 1$ gives ideal classification of training data. SVM with a small radius of the RBF kernel has the same property.
32. Overfitting. Same issues for regression. Given a high enough degree, a polynomial can pass through any set of points and thus reach zero error.
33. There are two definitions of overfitting, which often coincide. Difference-overfitting (academic definition): there is a significant difference in quality of predictions between train and holdout. Complexity-overfitting (practitioners' definition): the formula has too high complexity (e.g., too many parameters), and increasing the number of parameters leads to lower quality.
35. Model selection. Given two models, which one should we select? ML is about inference of statistical dependencies, which give us the ability to predict. The best model is the one that gives better predictions for new observations. The simplest way to control this is to check quality on a holdout, a sample not used during training (cross-validation). This gives an unbiased estimate of quality for new data.
    * estimates have variance
    * multiple testing introduces bias (solution: train + validation + test, as on Kaggle)
37. Difference-overfitting is inessential, provided that we measure quality on a holdout sample (though it is easy to check and sometimes helpful). Complexity-overfitting is a problem: we need to test different parameters for optimality (more examples throughout the course). Don't use distribution comparison to detect overfitting.
38. Break.
39. Reminder: linear regression. We can use a linear function for regression: $d(x) = \langle w, x \rangle$. Minimize MSE: $\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$. Explicit solution: $\left( \sum_i x_i x_i^T \right) w = \sum_i y_i x_i$.
40. Regularization: motivation. When the number of parameters is high (compared to the number of observations):
    * it is hard to estimate all parameters reliably
    * linear regression with MSE: in $d$-dimensional space you can find a hyperplane through any $d$ points
    * the solution is non-unique if $n < d$: the matrix $\sum_i x_i x_i^T$ degenerates
    * Solution 1: manually decrease the dimensionality of the problem
    * Solution 2: use regularization
41. Regularization. When the number of parameters in a model is high, overfitting is very probable. Solution: add a regularization term to the loss function: $\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{reg} \to \min$. $L_2$ regularization: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2$. $L_1$ regularization: $\mathcal{L}_{reg} = \beta \sum_j |w_j|$. $L_1 + L_2$ regularization: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$.
42. $L_2$, $L_1$ regularizations. Dependence of parameters (components of $w$) on the regularization (stronger regularization to the left). Plots: $L_2$ regularization; $L_1$ (solid), $L_1 + L_2$ (dashed).
43. Regularizations. $L_1$ regularization encourages sparsity (many coefficients in $w$ turn to zero).
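The effect of the two penalties is easiest to see for a model with a single weight and no intercept, where the regularized MSE $\frac{1}{N} \sum_i (w x_i - y_i)^2 + \mathcal{L}_{reg}$ has a closed-form minimum. The sketch below is an illustration under that single-feature assumption; `fit_l2` and `fit_l1` are hypothetical helper names, and the formulas follow from setting the (sub)gradient in $w$ to zero.

```python
import math

def fit_l2(xs, ys, alpha):
    """Minimize (1/N) sum (w x - y)^2 + alpha * w^2 for a scalar w."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + n * alpha)

def fit_l1(xs, ys, beta):
    """Minimize (1/N) sum (w x - y)^2 + beta * |w| for a scalar w.

    The minimizer is a soft-thresholded least-squares solution: it is
    exactly zero once beta is large enough.
    """
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(sxy) - n * beta / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Hypothetical toy data lying exactly on y = 2 x.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_plain = fit_l2(xs, ys, 0.0)      # no regularization
w_ridge = fit_l2(xs, ys, 1.0)      # shrunk towards zero, but non-zero
w_lasso = fit_l1(xs, ys, 20.0)     # strong L1 penalty drives w to zero
```

The $L_2$ penalty only shrinks the weight, while the $L_1$ penalty sets it exactly to zero once $\beta$ is large enough, which is the sparsity effect mentioned on the slide.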
45. $L_p$ regularizations: $\mathcal{L}_p = \sum_i w_i^p$. What is the expression for $L_0$? $\mathcal{L}_0 = \sum_i [w_i \neq 0]$. But nobody uses it, nor even $L_p$ with $0 < p < 1$. Why? Because it is not convex.
46. Regularization summary
    * an important tool to fight overfitting (= poor generalization on new data)
    * different modifications exist for other models
    * makes it possible to handle really many features; machine learning should detect important features itself
    * from the mathematical point of view: turns a convex problem into a strongly convex one (NB: only for linear models)
    * from the practical point of view: softly limits the space of parameters
    * breaks scale-invariance of linear models
47. SVM and regularization. The width of the margin is $\frac{1}{\|w\|}$, so the SVM loss is actually: $\mathcal{L} = \frac{1}{2} \|w\|^2 + C \sum_i L_{hinge}(x_i, y_i)$
    * the first term maximizes the margin
    * the second term penalizes samples that are not on the correct side of the margin
    * $C$ controls the trade-off
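The SVM objective from the slide above is a sum of a margin term and hinge penalties, and can be evaluated for any candidate weights. A minimal sketch, with assumed names `hinge` and `svm_objective` and hypothetical one-dimensional data:

```python
def hinge(margin):
    """Hinge loss max(0, 1 - margin), where margin = y * d(x)."""
    return max(0.0, 1.0 - margin)

def svm_objective(w, w0, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses over the sample."""
    reg = 0.5 * sum(wj * wj for wj in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        d = sum(wj * xj for wj, xj in zip(w, xi)) + w0
        penalty += hinge(yi * d)
    return reg + C * penalty

# Hypothetical data: margins are 2.0, 0.5 and 3.0 for w = [1.0], w0 = 0.
X_toy = [[2.0], [0.5], [-3.0]]
y_toy = [+1, +1, -1]
objective = svm_objective([1.0], 0.0, X_toy, y_toy, C=10.0)
```

Only the second event, which lies inside the margin, contributes a penalty; a larger `C` makes such violations more expensive relative to the margin width.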
48. Linear models summary
    * linear decision function at the core
    * reduced to optimization problems
    * losses are additive, $\mathcal{L} = \sum_i L(x_i, y_i)$, so stochastic optimization is applicable
    * can support nonlinear decisions w.r.t. the original features by using kernels
    * apply regularization to avoid bad situations and overfitting
49. Decision Trees.
50. Decision tree. Example: predict outside play based on weather conditions.
51. Decision tree: binary tree.
52. Decision tree: splitting space.
55. Decision tree
    * fast & intuitive prediction
    * but building an optimal decision tree is an NP-complete problem
    * building a tree using greedy optimization: start from the root (a tree with only one leaf), each time split one leaf into two, repeat the process for children if needed
    * need a criterion to select the best splitting (feature and threshold)
56. Splitting criterion. $\text{TreeImpurity} = \sum_{\text{leaf}} \text{impurity}(\text{leaf}) \times \text{size}(\text{leaf})$. Several impurity functions: Misclass. $= \min(p, 1 - p)$; Gini $= p(1 - p)$; Entropy $= -p \log p - (1 - p) \log(1 - p)$, where $p$ is the portion of signal events in a leaf, $1 - p$ is the portion of background events, and $\text{size}(\text{leaf})$ is the number of training events in a leaf.
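One greedy step of tree building amounts to scanning candidate thresholds and keeping the one with the lowest total impurity. The sketch below covers a single feature with labels 1 (signal) and 0 (background); the helper names and the toy data are assumptions, and real implementations scan all features and use faster incremental updates.

```python
import math

def misclass(p):
    return min(p, 1.0 - p)

def gini(p):
    return p * (1.0 - p)

def entropy(p):
    if p == 0.0 or p == 1.0:
        return 0.0          # limit of -p log p as p -> 0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def leaf_cost(labels, impurity):
    """impurity(leaf) * size(leaf) for labels 1 (signal) / 0 (background)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return impurity(p) * len(labels)

def best_threshold(values, labels, impurity=gini):
    """Return (cost, threshold) of the best split 'value <= threshold'."""
    best = (float("inf"), None)
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not right:
            continue        # a split must produce two non-empty leaves
        cost = leaf_cost(left, impurity) + leaf_cost(right, impurity)
        if cost < best[0]:
            best = (cost, t)
    return best

# Hypothetical feature values and labels: the classes separate at value 2.
split = best_threshold([1, 2, 3, 4], [0, 0, 1, 1])
```

Swapping `impurity=entropy` or `impurity=misclass` into `best_threshold` shows how the choice of criterion changes which splits look attractive.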
57. Splitting criterion. Impurity as a function of $p$.
58. Splitting criterion: why not misclassification?
59. Decision trees for regression. Greedy optimization (minimizing MSE): $\text{TreeMSE} \sim \sum_i (y_i - \hat{y}_i)^2$. Can be rewritten as: $\text{TreeMSE} \sim \sum_{\text{leaf}} \text{MSE}(\text{leaf}) \times \text{size}(\text{leaf})$. $\text{MSE}(\text{leaf})$ is like an 'impurity' of the leaf: $\text{MSE}(\text{leaf}) = \frac{1}{\text{size}(\text{leaf})} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$
65. Decision trees instability. A small variation in the training dataset produces a different classification rule.
66. The tree keeps splitting until each event is correctly classified.
68. Pre-stopping. We can stop the process of splitting by imposing different restrictions:
    * limit the depth of the tree
    * set the minimal number of samples needed to split a leaf
    * limit the minimal number of samples in a leaf
    * more advanced: maximal number of leaves in the tree
    * any combination of the rules above is possible
69. Figure panels: no pre-stopping; max_depth; min # of samples in leaf; maximal number of leaves.
70. Post-pruning. When a tree is already built, we can try to optimize it to simplify the formula. Generally much slower than pre-stopping.
73. Decision tree overview
    * very intuitive algorithm for regression and classification
    * fast prediction
    * scale-independent
    * supports multiclass classification
    * but: training an optimal tree is NP-complete
    * but: trained greedily by optimizing the Gini index or entropy (fast!)
    * but: non-stable
    * but: uses only trivial conditions
74. Missing values in decision trees. If the event being predicted lacks feature $x_1$, we use prior probabilities.
75. Feature importances. Different approaches exist to measure the importance of a feature in the final model. Importance of a feature $\neq$ quality provided by one feature.
79. Feature importances
    * tree: counting the number of splits made over this feature
    * tree: counting the gain in purity (e.g. Gini); fast and adequate
    * model-agnostic recipe: train without one feature, compare quality on test with/without that feature; requires many evaluations
    * model-agnostic recipe: feature shuffling; take one column in the test dataset and shuffle it, then compare quality with/without shuffling
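The feature shuffling recipe works with any trained model that can be called on a row of features. In this sketch the "model" is a hypothetical hand-written rule that looks only at the first column, accuracy is the assumed quality metric, and `shuffling_importance` is an illustrative name.

```python
import random

def accuracy(model, X, y):
    """Fraction of events for which the model predicts the true label."""
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

def shuffling_importance(model, X, y, column, seed=0):
    """Drop in accuracy after shuffling one column of the test data."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_shuffled = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(X, shuffled)
    ]
    return accuracy(model, X, y) - accuracy(model, X_shuffled, y)

# Hypothetical "trained" model that uses only the first feature.
def toy_model(x):
    return +1 if x[0] > 0 else -1

X_test = [[-2.0, 5.0], [-1.0, 3.0], [1.0, 4.0], [2.0, 1.0]]
y_test = [-1, -1, +1, +1]

importance_used = shuffling_importance(toy_model, X_test, y_test, column=0)
importance_unused = shuffling_importance(toy_model, X_test, y_test, column=1)
```

Shuffling a column the model never reads leaves the predictions unchanged, so its importance is exactly zero; averaging over several seeds reduces the noise of the estimate for the columns that do matter.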
80. Ensembles.
81. Composition of models. Basic motivation: improve the quality of classification by reusing the strong sides of different classifiers / regressors.
82. Simple voting. Averaging predictions: $\hat{y} = [-1, +1, +1, +1, -1] \Rightarrow P_{+1} = 0.6, P_{-1} = 0.4$. Averaging predicted probabilities: $P_{\pm 1}(x) = \frac{1}{J} \sum_{j=1}^{J} p_{\pm 1, j}(x)$. Averaging decision functions: $D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$.
83. Weighted voting. The way to introduce importance of classifiers: $D(x) = \sum_j \alpha_j d_j(x)$. General case of ensembling: $D(x) = f(d_1(x), d_2(x), \dots, d_J(x))$.
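The voting rules above reduce to simple averages. A minimal sketch with assumed helper names; uniform weights $\alpha_j = 1/J$ reproduce simple averaging of decision functions.

```python
def vote_fractions(votes):
    """Turn hard predictions in {-1, +1} into (P_plus, P_minus)."""
    p_plus = sum(1 for v in votes if v == +1) / len(votes)
    return p_plus, 1.0 - p_plus

def weighted_decision(decisions, alphas=None):
    """D(x) = sum_j alpha_j * d_j(x); uniform weights 1/J by default."""
    if alphas is None:
        alphas = [1.0 / len(decisions)] * len(decisions)
    return sum(a * d for a, d in zip(alphas, decisions))

# The example from the simple voting slide.
p_plus, p_minus = vote_fractions([-1, +1, +1, +1, -1])

# Hypothetical decision-function values of three base classifiers for one event.
average_d = weighted_decision([1.0, -1.0, 3.0])
weighted_d = weighted_decision([1.0, -1.0], alphas=[2.0, 0.5])
```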
84. Problems
    * base classifiers tend to be very close to each other
    * need to keep variation and still have good quality of base classifiers
85. Decision tree reminder.
86. Generating training subset
    * subsampling: taking a fixed part of the samples (sampling without replacement)
    * bagging (Bootstrap AGGregating): sampling with replacement. If #generated samples = length of the dataset, the fraction of unique samples in the new dataset is $1 - \frac{1}{e} \sim 63.2\%$
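The 63.2% figure follows from the probability $(1 - 1/n)^n \to 1/e$ that a given event is never drawn in $n$ draws with replacement. The sketch below compares that expectation with one simulated bootstrap sample; the function names, the seed, and the dataset size are arbitrary assumptions.

```python
import math
import random

def expected_unique_fraction(n):
    """Expected fraction of distinct events in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def bootstrap_indices(n, rng):
    """Bagging: draw n indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def subsample_indices(n, fraction, rng):
    """Subsampling: draw a fixed part of the indices without replacement."""
    return rng.sample(range(n), int(fraction * n))

rng = random.Random(0)
n = 10000
bag = bootstrap_indices(n, rng)
observed_unique_fraction = len(set(bag)) / n
half = subsample_indices(100, 0.5, rng)
```

The events left out of a bootstrap sample (about 36.8% of the data) are never seen by the model trained on it.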
87. Random subspace model (RSM). Generating a subspace of features by taking a random subset of features.
88. Random Forest [Leo Breiman, 2001]. Random forest is a composition of decision trees. Each individual tree is trained on a subset of training data obtained by
    * bagging samples
    * taking a random subset of features

    Predictions of random forest are obtained via simple voting.
91. Figure panels: data; optimal boundary; 50 trees; 2000 trees.
94. Overfitting
    * overfitted (in the sense that predictions for train and test are different)
    * doesn't overfit: increasing complexity (adding more trees) doesn't spoil a classifier
96. Works with features of different nature. Stable to noise in data. From 'Testing 179 Classifiers on 121 Datasets': The classifiers most likely to be the bests are the random forest (RF) versions, the best of which [...] achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets.
99. Random Forest overview
    * impressively simple
    * trees can be trained in parallel
    * doesn't overfit
    * doesn't require much tuning; effectively only one parameter: the number of features used in each tree. Recommendation: $N_{used} = \sqrt{N_{features}}$
    * hardly interpretable
    * trained trees take much space; some kind of pre-stopping is required in practice
    * doesn't fix mistakes made by previous trees