* AdaBoost

* Gradient Boosting for regression

* Gradient Boosting for classification

* Second-order information

* Losses: regression, classification, ranking

Multiclass classification:

* ensembling

* softmax modifications

Feature engineering

Output engineering

Feature selection

Dimensionality reduction:

* PCA

* LDA, CSP

* LLE

* Isomap

Hyperparameter optimization

* ML-based approach

* Gaussian processes

- 1. Machine Learning in High Energy Physics Lectures 5 & 6 Alex Rogozhnikov Lund, MLHEP 2016 1 / 101
- 2. Linear models: linear regression. Decision function $d(x) = \langle w, x \rangle + w_0$. Minimizing MSE: $\mathcal{L} = \frac{1}{N} \sum_i L_{\text{mse}}(x_i, y_i) \to \min$, where $L_{\text{mse}}(x_i, y_i) = (d(x_i) - y_i)^2$. 1 / 101
- 3. Linear models: logistic regression. Decision function $d(x) = \langle w, x \rangle + w_0$, labels $y_i = \pm 1$. Minimizing logistic loss: $\mathcal{L} = \sum_i L_{\text{logistic}}(x_i, y_i) \to \min$, where the penalty for a single observation is $L_{\text{logistic}}(x_i, y_i) = \ln(1 + e^{-y_i d(x_i)})$. 2 / 101
- 4. Linear models: support vector machine (SVM). Hinge loss: $L_{\text{hinge}}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$; inside the margin, $y_i d(x_i) > 1 \Rightarrow$ no penalty. 3 / 101
- 5. Kernel trick. We can project data into a higher-dimensional space, e.g. by adding new features. Hopefully, the distributions are separable in the new space. 4 / 101
- 6. Kernel trick. $P$ is a projection operator: $w = \sum_i \alpha_i P(x_i)$, so $d(x) = \langle w, P(x) \rangle_{\text{new}} = \sum_i \alpha_i K(x_i, x)$. We only need the kernel $K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{\text{new}}$. Popular choices: polynomial kernel and RBF kernel. 5 / 101
- 7. Regularizations. $\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$. $L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$; $L_1$ regularization: $\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$; $L_1 + L_2$: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$. 6 / 101
- 8. Stochastic optimization methods. Stochastic gradient descent: take $i$, a random event from the training data, and update $w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$ (applicable to additive loss functions). 7 / 101
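As an aside not in the original slides, the update rule above can be sketched in a few lines of numpy. Everything here (the toy data, the learning rate, the number of steps) is made up for illustration:

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, n_steps=2000, seed=0):
    """SGD for logistic loss L = ln(1 + exp(-y d(x))); labels must be +-1."""
    rng = np.random.RandomState(seed)
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        i = rng.randint(len(X))                     # pick a random event
        margin = y[i] * (X[i] @ w + w0)
        grad_coef = -y[i] / (1.0 + np.exp(margin))  # dL/dd for the logistic loss
        w -= eta * grad_coef * X[i]                 # w <- w - eta dL/dw
        w0 -= eta * grad_coef
    return w, w0

# Toy 1-D separable problem
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
w, w0 = sgd_logistic(X, y)
pred = np.sign(X @ w + w0)
```

On this separable toy set the learned boundary lands between the two classes, so the signs of the decision function recover the labels.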
- 9. Decision trees. Building an optimal tree is NP-hard; heuristic: use greedy optimization. Optimization criteria (impurities): misclassification, Gini, entropy. 8 / 101
- 10. Decision trees for regression Optimizing MSE, prediction inside a leaf is constant. 9 / 101
- 11. Overfitting in decision trees: pre-stopping, post-pruning. Trees are unstable to changes in the training dataset. 10 / 101
- 12. Random Forest. Many trees built independently: bagging of samples, subsampling of features. Simple voting is used to get the prediction of an ensemble. 11 / 101
- 13. Random Forest is overfitted (in the sense that predictions for train and test are different), but doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier. 12 / 101
- 14. Random Forest: simple and parallelizable, doesn't require much tuning; hardly interpretable, but feature importances can be computed; doesn't focus on samples poorly classified at previous stages. 13 / 101
- 15. Ensembles. Averaging decision functions: $D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$. Weighted decision: $D(x) = \sum_j \alpha_j d_j(x)$. 14 / 101
- 17. Sample weights in ML. Can be used with many estimators. We now have triples $x_i, y_i, w_i$ ($i$ is the index of an event); the weight corresponds to the frequency of an observation. Expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event; global normalization of weights doesn't matter. Example for logistic regression: $\mathcal{L} = \sum_i w_i L(x_i, y_i) \to \min$. 16 / 101
- 18. Weights (parameters) of a classifier $\ne$ sample weights. In code: tree = DecisionTreeClassifier(max_depth=4); tree.fit(X, y, sample_weight=weights). Sample weights are a convenient way to regulate the importance of training events. Only sample weights are meant when talking about AdaBoost. 17 / 101
- 19. AdaBoost [Freund, Schapire, 1995]. In bagging, information from previous trees is not taken into account. Adaptive Boosting is a weighted composition of weak learners: $D(x) = \sum_j \alpha_j d_j(x)$. We assume $d_j(x) = \pm 1$ and labels $y_i = \pm 1$; the $j$-th weak learner misclassified the $i$-th event iff $y_i d_j(x_i) = -1$. 18 / 101
- 20. AdaBoost. Weak learners are built in sequence; each classifier is trained using different sample weights, initially $w_i = 1$ for each training sample. After building the $j$-th base classifier: 1. compute the total weight of correctly and wrongly classified events and set $\alpha_j = \frac{1}{2} \ln \left( \frac{w_{\text{correct}}}{w_{\text{wrong}}} \right)$; 2. increase the weight of misclassified samples: $w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$. 19 / 101
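The reweighting loop above is easy to sketch with depth-1 trees (stumps) as weak learners. This is a hypothetical minimal implementation for illustration, not the code used in the lectures:

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the threshold classifier with the lowest weighted error."""
    best_err, best = np.inf, None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1, -1):
                pred = np.where(X[:, f] > t, s, -s)
                err = w[pred != y].sum() / w.sum()
                if err < best_err:
                    best_err, best = err, (f, t, s)
    return best, best_err

def adaboost(X, y, n_rounds=10):
    w = np.ones(len(X))                        # initially w_i = 1
    learners = []
    for _ in range(n_rounds):
        (f, t, s), err = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # alpha_j = 1/2 ln(w_correct / w_wrong)
        pred = np.where(X[:, f] > t, s, -s)
        w = w * np.exp(-alpha * y * pred)      # increase weight of misclassified events
        learners.append((alpha, f, t, s))
    return learners

def predict(learners, X):
    D = sum(a * np.where(X[:, f] > t, s, -s) for a, f, t, s in learners)
    return np.sign(D)

# OR-like toy data: no single stump classifies it, but the ensemble does
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, 1])
labels = predict(adaboost(X, y), X)
```

Note that no single stump separates this dataset, while the weighted composition reaches zero training error after a few rounds.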
- 21. AdaBoost example. Decision trees of depth 1 will be used. 20 / 101
- 22. 21 / 101
- 23. 22 / 101
- 24. (1, 2, 3, 100 trees) 23 / 101
- 25. AdaBoost secret. $D(x) = \sum_j \alpha_j d_j(x)$, $\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp(-y_i D(x_i)) \to \min$. The sample weight is equal to the penalty for an event: $w_i = L(x_i, y_i) = \exp(-y_i D(x_i))$; $\alpha_j$ is obtained as a result of analytical optimization. Exercise: prove the formula for $\alpha_j$. 24 / 101
- 26. Loss function of AdaBoost 25 / 101
- 27. AdaBoost summary: able to combine many weak learners; takes mistakes into account; simple, the overhead for boosting is negligible; too sensitive to outliers. In scikit-learn, one can run AdaBoost over other algorithms. 26 / 101
- 28. Gradient Boosting 27 / 101
- 29. Decision trees for regression 28 / 101
- 30. Gradient boosting to minimize MSE. Say we're trying to build an ensemble to minimize MSE: $\sum_i (D(x_i) - y_i)^2 \to \min$. The ensemble's prediction is obtained by taking the sum $D(x) = \sum_j d_j(x)$, so $D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$. Assuming that we have already built $j - 1$ estimators, how do we train the next one? 29 / 101
- 31. A natural solution is to greedily minimize MSE: $\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$. Introduce the residual $R_j(x_i) = y_i - D_{j-1}(x_i)$; now we simply need to minimize MSE: $\sum_i (d_j(x_i) - R_j(x_i))^2 \to \min$. So the $j$-th estimator (tree) is trained using the data $x_i, R_j(x_i)$. 30 / 101
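Not from the slides, but the residual-fitting loop can be sketched with depth-1 regression trees; the dataset and tree count are arbitrary illustrations:

```python
import numpy as np

def fit_reg_stump(X, y):
    """Depth-1 regression tree: best MSE split, leaf values = leaf means."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, f, t, left.mean(), right.mean())
    return best[1:]

def gb_mse(X, y, n_trees=20):
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        residual = y - pred                      # R_j(x_i) = y_i - D_{j-1}(x_i)
        f, t, vl, vr = fit_reg_stump(X, residual)
        trees.append((f, t, vl, vr))
        pred += np.where(X[:, f] <= t, vl, vr)   # D_j = D_{j-1} + d_j
    return trees, pred

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([1.0, 3.0, 2.0, 4.0])
_, pred = gb_mse(X, y)
```

Each stump fits the current residual, so the training MSE shrinks geometrically toward zero on this tiny dataset.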
- 32. Example: regression with GB using regression trees of depth=2 31 / 101
- 33. number of trees = 1, 2, 3, 100 32 / 101
- 34. Gradient Boosting visualization 33 / 101
- 35. Gradient Boosting [Friedman, 1999]: a composition of weak regressors, $D(x) = \sum_j \alpha_j d_j(x)$. Borrow the approach to encoding probabilities from logistic regression: $p_{+1}(x) = \sigma(D(x))$, $p_{-1}(x) = \sigma(-D(x))$. Optimization of the log-likelihood ($y_i = \pm 1$): $\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$. 34 / 101
- 37. Gradient Boosting. Optimization problem: find all $\alpha_j$ and weak learners $d_j$ such that $\mathcal{L} = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$. Mission impossible. Main point: greedy optimization of the loss function by training one more weak learner; each new estimator $d_j$ follows the gradient of the loss function. 36 / 101
- 38. Gradient boosting ~ steepest gradient descent. $D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x)$, i.e. $D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$. At the $j$-th iteration: compute the pseudo-residual $R(x_i) = -\frac{\partial \mathcal{L}}{\partial D(x_i)} \big|_{D(x) = D_{j-1}(x)}$; train regressor $d_j$ to minimize MSE: $\sum_i (d_j(x_i) - R(x_i))^2 \to \min$; find the optimal $\alpha_j$. Important exercise: compute pseudo-residuals for MSE and logistic losses. 37 / 101
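If you want to check your answer to the exercise, the analytic pseudo-residuals can be compared against finite differences. Fair warning: the formulas in this sketch are the answers.

```python
import numpy as np

def residual_mse(D, y):
    # L = sum_i (D_i - y_i)^2  =>  R_i = -dL/dD_i = 2 (y_i - D_i)
    return 2 * (y - D)

def residual_logloss(D, y):
    # L = sum_i ln(1 + e^{-y_i D_i})  =>  R_i = y_i / (1 + e^{y_i D_i})
    return y / (1 + np.exp(y * D))

def numeric_residual(loss, D, y, eps=1e-6):
    """-dL/dD_i by central finite differences, to verify the formulas."""
    out = np.zeros_like(D)
    for i in range(len(D)):
        Dp, Dm = D.copy(), D.copy()
        Dp[i] += eps
        Dm[i] -= eps
        out[i] = -(loss(Dp, y) - loss(Dm, y)) / (2 * eps)
    return out

D = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, -1.0, 1.0])
mse = lambda D, y: ((D - y) ** 2).sum()
logloss = lambda D, y: np.log1p(np.exp(-y * D)).sum()
ok_mse = np.allclose(residual_mse(D, y), numeric_residual(mse, D, y), atol=1e-4)
ok_log = np.allclose(residual_logloss(D, y), numeric_residual(logloss, D, y), atol=1e-4)
```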
- 39. Additional GB tricks. To make training more stable, add a learning rate $\eta$: $D_j(x) = \eta \sum_j \alpha_j d_j(x)$. Randomization to fight noise and build different trees: subsampling of features, subsampling of training samples. 38 / 101
- 40. AdaBoost is a particular case of gradient boosting with a different target loss function*: $\mathcal{L}_{\text{ada}} = \sum_i e^{-y_i D(x_i)} \to \min$. This loss function is called ExpLoss or AdaLoss. *(also, AdaBoost expects that $d_j(x_i) = \pm 1$) 39 / 101
- 41. Loss functions. Gradient boosting can optimize different smooth loss functions. Regression, $y \in \mathbb{R}$: Mean Squared Error $\sum_i (d(x_i) - y_i)^2$, Mean Absolute Error $\sum_i |d(x_i) - y_i|$. Binary classification, $y_i = \pm 1$: ExpLoss (aka AdaLoss) $\sum_i e^{-y_i d(x_i)}$, LogLoss $\sum_i \log(1 + e^{-y_i d(x_i)})$. 40 / 101
- 42. 41 / 101
- 43. 42 / 101
- 44. Usage of second-order information. For an additive loss function, apart from the gradients $g_i$, we can make use of the second derivatives $h_i$. E.g. select the leaf value using a second-order step: $\mathcal{L} = \sum_i L(D_j(x_i), y_i) = \sum_i L(D_{j-1}(x_i) + d_j(x_i), y_i) \approx \sum_i \left[ L(D_{j-1}(x_i), y_i) + g_i d_j(x_i) + \frac{h_i}{2} d_j^2(x_i) \right] = \mathcal{L}_{j-1} + \sum_{\text{leaf}} \left( g_{\text{leaf}} w_{\text{leaf}} + \frac{h_{\text{leaf}}}{2} w_{\text{leaf}}^2 \right) \to \min$. 43 / 101
- 45. Using second-order information. Independent optimization in each leaf; explicit solution for the optimal values in the leaves: $\mathcal{L} \approx \mathcal{L}_{j-1} + \sum_{\text{leaf}} \left( g_{\text{leaf}} w_{\text{leaf}} + \frac{h_{\text{leaf}}}{2} w_{\text{leaf}}^2 \right) \to \min$, where $g_{\text{leaf}} = \sum_{i \in \text{leaf}} g_i$, $h_{\text{leaf}} = \sum_{i \in \text{leaf}} h_i$. The optimum is $w_{\text{leaf}} = -\frac{g_{\text{leaf}}}{h_{\text{leaf}}}$. 44 / 101
- 46. Using second-order information: recipe. On each iteration of gradient boosting: 1. train a tree to follow the gradient (minimize MSE with the gradient); 2. change the values assigned in the leaves to $w_{\text{leaf}} \leftarrow -\frac{g_{\text{leaf}}}{h_{\text{leaf}}}$; 3. update predictions: $D_j(x) = D_{j-1}(x) + \eta d_j(x)$ (no weight for the estimator: $\alpha_j = 1$). This improvement is quite cheap and allows smaller GBs to be more effective. We can also use information about hessians at the tree-building step (step 1). 45 / 101
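As a small check, not from the slides, here is the second-order leaf value computed for the LogLoss defined earlier; the predictions and labels are invented for the example, and the $g_i$, $h_i$ formulas are the LogLoss derivatives:

```python
import numpy as np

# For LogLoss L = sum_i ln(1 + e^{-y_i D_i}), the derivatives w.r.t. D_i are
#   g_i = -y_i * sigma(-y_i D_i)            (first derivative)
#   h_i = sigma(y_i D_i) * sigma(-y_i D_i)  (second derivative)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

D = np.array([0.2, -0.4, 0.1])   # current ensemble predictions inside one leaf
y = np.array([1.0, 1.0, -1.0])

g = -y * sigma(-y * D)
h = sigma(y * D) * sigma(-y * D)
w_leaf = -g.sum() / h.sum()      # explicit second-order (Newton) step

logloss = lambda d: np.log1p(np.exp(-y * d)).sum()
improved = logloss(D + w_leaf) < logloss(D)   # the step should reduce the loss
```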
- 47. Multiclass classification: ensembling. One-vs-one ($\frac{n_{\text{classes}} \times (n_{\text{classes}} - 1)}{2}$ classifiers), one-vs-rest ($n_{\text{classes}}$ classifiers); scikit-learn implements those as meta-algorithms. 46 / 101
- 48. Multiclass classification: modifying an algorithm. Most classifiers have natural generalizations to multiclass classification. Example for logistic regression: for each class $c \in 1, 2, \ldots, C$, introduce a vector $w_c$ and set $d_c(x) = \langle w_c, x \rangle$. Convert to probabilities using the softmax function: $p_c(x) = \frac{e^{d_c(x)}}{\sum_{\tilde{c}} e^{d_{\tilde{c}}(x)}}$, and minimize LogLoss: $\mathcal{L} = -\sum_i \log p_{y_i}(x_i)$. 47 / 101
- 49. Softmax function: a typical way to convert $n$ numbers to $n$ probabilities. The mapping is surjective but not injective ($n$ dimensions to $n - 1$); it is invariant to a global shift: $d_c(x) \to d_c(x) + \text{const}$. For the case of two classes: $p_1(x) = \frac{e^{d_1(x)}}{e^{d_1(x)} + e^{d_2(x)}} = \frac{1}{1 + e^{-(d_1(x) - d_2(x))}}$, which coincides with the logistic function for $d(x) = d_1(x) - d_2(x)$. 48 / 101
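The properties above (shift invariance, reduction to the logistic function for two classes) are easy to verify numerically; this sketch is illustrative, with arbitrary scores:

```python
import numpy as np

def softmax(d):
    z = np.exp(d - d.max())   # subtracting the max exploits shift invariance for stability
    return z / z.sum()

d = np.array([1.0, 2.0, 0.5])
p = softmax(d)

# d_c -> d_c + const leaves the probabilities unchanged
shift_invariant = np.allclose(p, softmax(d + 100.0))

# for two classes, softmax reduces to the logistic function of d = d1 - d2
two_class = np.isclose(softmax(np.array([2.0, 0.0]))[0], 1 / (1 + np.exp(-2.0)))
```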
- 50. Loss function: ranking example. In ranking we need to order items by $y_i$: $y_i < y_j \Rightarrow d(x_i) < d(x_j)$. We can penalize for misordering: $\mathcal{L} = \sum_{i, \tilde{i}} L(x_i, x_{\tilde{i}}, y_i, y_{\tilde{i}})$, where $L(x, \tilde{x}, y, \tilde{y}) = \sigma(d(x) - d(\tilde{x}))$ if $y < \tilde{y}$, and $0$ otherwise. 49 / 101
- 51. Adapting boosting. By modifying boosting or changing the loss function we can solve different problems: classification, regression, ranking. HEP-specific examples in Tatiana's lecture tomorrow. 50 / 101
- 52. Gradient Boosting classification playground 51 / 101
- 53. 3-minute break 52 / 101
- 54. Recapitulation: AdaBoost minimizes $\mathcal{L}_{\text{ada}} = \sum_i e^{-y_i D(x_i)} \to \min$ by increasing the weights of misclassified samples: $w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$. 53 / 101
- 55. Gradient Boosting overview: a powerful ensembling technique (typically used over trees, GBDT); a general way to optimize differentiable losses; can be adapted to other problems; 'follows' the gradient of the loss at each step, making steps in the space of functions; the gradient for poorly-classified events is higher; increasing the number of trees can lead to overfitting (= getting worse quality on new data); requires tuning, works better when trees are not complex; widely used in practice. 54 / 101
- 56. Feature engineering = creating features to get the best result with ML. An important step, mostly relying on domain knowledge; requires some understanding; most of practitioners' time is spent on this step. 55 / 101
- 58. Feature engineering. Analyze available features: scale and shape of features. Analyze which information is lacking; challenge example: maybe subleading jets matter? Validate your guesses. Machine learning is a proper tool for checking your understanding of data. 57 / 101
- 59. Linear models example. A single event with a sufficiently large value of a feature can break almost all linear models. Heavy-tailed distributions are harmful; pre-transforming is required: logarithm, power transform, throwing out outliers. The same tricks actually help more advanced methods too. Which transformation is the best for Random Forest? 58 / 101
- 60. One-hot encoding. Categorical features (= 'not orderable') are easier for ML to operate with once one-hot encoded. 59 / 101
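In practice one would use a library encoder (e.g. scikit-learn's OneHotEncoder or pandas get_dummies); the transformation itself fits in a few lines, shown here on an invented categorical column:

```python
import numpy as np

def one_hot(values):
    """Encode a categorical column as indicator columns; no order is imposed."""
    categories = sorted(set(values))
    encoded = np.array([[1 if v == c else 0 for c in categories] for v in values])
    return encoded, categories

X, cats = one_hot(["muon", "electron", "muon", "tau"])
# each row has exactly one 1, in the column of its category
```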
- 61. Decision tree example. $\eta_{\text{lepton}}$ is hard for a tree to use, since it provides no good splitting. Don't forget that a tree can't reconstruct linear combinations; take care of this. 60 / 101
- 63. Example of a feature: invariant mass. Using HEP coordinates, the invariant mass of two products is $m_{\text{inv}}^2 \approx 2 p_{T1} p_{T2} (\cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2))$. It can't be recovered with ensembles of trees of depth < 4 when using only canonical features. What about the invariant mass of 3 particles? (see Vicens' talk today). Good features are ones that are explainable by physics. Start from the simplest and most natural; mind the cost of computing the features. 62 / 101
- 64. Output engineering. Typically not discussed, but the target of learning plays an important role. Example: predicting the number of upvotes for a comment, assuming the error is MSE: $\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$. Consider two cases: 100 comments when 0 was predicted; 1200 comments when 1000 was predicted. 63 / 101
- 65. Output engineering. The impact of highly-commented articles is ridiculously large. We need to predict the order, not the exact number of comments. Possible solutions: an alternate loss function, e.g. MAPE; or apply a logarithm to the target and predict $\log(\#\text{comments})$. The evaluation score should be changed accordingly. 64 / 101
- 66. Sample weights are typically used to estimate the contribution of an event (how often we expect it to happen). Sample weights also matter in some situations: highly imbalanced datasets (e.g. 99% of events in class 0) tend to cause problems during optimization; changing sample weights to balance the dataset frequently helps. 65 / 101
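One common balancing scheme (an illustration, not prescribed by the slides) gives every class the same total weight:

```python
import numpy as np

def balanced_weights(y):
    """w_i = N / (n_classes * n_{class of i}); each class gets equal total weight."""
    classes, counts = np.unique(y, return_counts=True)
    per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([per_class[v] for v in y])

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # 90% / 10% imbalance
w = balanced_weights(y)
```

These weights can then be passed as sample_weight to estimators that accept it.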
- 67. Feature selection. Why? Speed up training / prediction; reduce the time of data preparation in the pipeline; help the algorithm 'focus' on finding reliable dependencies; useful when the amount of training data is limited. Problem: find the subset of features which provides the best quality. 66 / 101
- 70. Feature selection. Exhaustive search: $2^d$ cross-validations, an incredible amount of resources; too many cross-validation cycles lead to overly-optimistic quality on the test data. A basic nice solution: estimate importances with RF / GBDT. Filtering methods: eliminate variables which seem not to carry statistical information about the target, e.g. by measuring Pearson correlation or mutual information. Example: all angles $\phi$ will be thrown out. 69 / 101
- 71. Feature selection: embedded methods, where feature selection is a part of training. Example: $L_1$-regularized linear models. Forward selection: start from an empty set of features; for each feature in the dataset, check if adding this feature improves the quality. Backward elimination: almost the same, but this time we iteratively eliminate features. A bidirectional combination of the above is possible; some algorithms can use a previously trained model as the new starting point for optimization. 70 / 101
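Forward selection as described above can be sketched generically; the score function here (a least-squares fit quality) is a stand-in for whatever cross-validated quality one would use in practice, and the dataset is synthetic:

```python
import numpy as np

def forward_selection(X, y, score, max_features=None):
    """Greedily add the feature that most improves score(X_subset, y)."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        trials = [(score(X[:, selected + [f]], y), f) for f in remaining]
        s, f = max(trials)
        if s <= best_score:
            break                  # no remaining feature improves the quality
        best_score = s
        selected.append(f)
        remaining.remove(f)
    return selected

def lsq_score(Xs, y):
    """Toy quality: negative sum of squared residuals of a least-squares fit."""
    coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return -(((Xs @ coef) - y) ** 2).sum()

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 1] - 2 * X[:, 4] + 0.01 * rng.normal(size=200)   # only features 1, 4 matter
chosen = forward_selection(X, y, lsq_score, max_features=2)
```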
- 72. Unsupervised dimensionality reduction 71 / 101
- 73. Principal component analysis [Pearson, 1901] PCA is ﬁnding axes along which variance is maximal 72 / 101
- 74. PCA description. PCA is based on the principal axis theorem: $Q = U \Lambda U^T$, where $Q$ is the covariance matrix of the dataset, $U$ is an orthogonal matrix, and $\Lambda$ is a diagonal matrix, $\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. 73 / 101
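The decomposition above maps directly onto numpy's symmetric eigendecomposition; the correlated toy data below is invented for the example:

```python
import numpy as np

rng = np.random.RandomState(42)
# Correlated 2-D data: most variance lies along one direction
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.8], [0.0, 0.3]])
Xc = X - X.mean(axis=0)                # center the data

Q = np.cov(Xc, rowvar=False)           # covariance matrix Q
lam, U = np.linalg.eigh(Q)             # Q = U diag(lam) U^T, ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]       # sort so that lambda_1 >= lambda_2 >= ...

projected = Xc @ U[:, :1]              # projection onto the first principal axis
```

The variance of the projection onto the first axis equals the top eigenvalue, which is exactly the "axis of maximal variance" statement on the previous slide.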
- 75. PCA optimization visualized 74 / 101
- 76. PCA: eigenfaces. Emotion = α [scared] + β [laughs] + γ [angry] + ... 75 / 101
- 77. Locally linear embedding handles the case of non-linear dimensionality reduction. Express each sample as a convex combination of its neighbours: $\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}} x_{\tilde{i}} \right\|^2 \to \min_w$. 76 / 101
- 78. Locally linear embedding: $\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}} x_{\tilde{i}} \right\|^2 \to \min_w$ subject to the constraints $\sum_{\tilde{i}} w_{i\tilde{i}} = 1$, $w_{i\tilde{i}} \ge 0$, and $w_{i\tilde{i}} = 0$ if $i, \tilde{i}$ are not neighbors. Then find an optimal mapping for all points simultaneously ($y_i$ are images, i.e. positions in the new space): $\sum_i \left\| y_i - \sum_{\tilde{i}} w_{i\tilde{i}} y_{\tilde{i}} \right\|^2 \to \min_y$. 77 / 101
- 79. PCA and LLE 78 / 101
- 80. Isomap Isomap is targeted to preserve geodesic distance on the manifold between two points 79 / 101
- 81. Supervised dimensionality reduction 80 / 101
- 82. Fisher's LDA (Linear Discriminant Analysis) [1936]. Original idea: find the projection that discriminates the classes best. 81 / 101
- 83. Fisher's LDA. Mean and variance within a single class $c$ ($c \in \{1, 2, \ldots, C\}$): $\mu_c = \langle x \rangle_{\text{events of class } c}$, $\sigma_c = \langle \|x - \mu_c\|^2 \rangle_{\text{events of class } c}$. Total within-class variance: $\sigma_{\text{within}} = \sum_c p_c \sigma_c$; total between-class variance: $\sigma_{\text{between}} = \sum_c p_c \|\mu_c - \mu\|^2$. Goal: find a projection to maximize the ratio $\frac{\sigma_{\text{between}}}{\sigma_{\text{within}}}$. 82 / 101
- 84. Fisher's LDA 83 / 101
- 85. LDA: solving the optimization problem. We are interested in finding a 1-dimensional projection $w$: $\frac{w^T \Sigma_{\text{between}} w}{w^T \Sigma_{\text{within}} w} \to \max_w$. This is naturally connected to the generalized eigenvalue problem; the projection vector corresponds to the highest generalized eigenvalue. Applied to a classification problem with $C$ classes, it finds a subspace of $C - 1$ components. Fisher's LDA is a basic popular binary classification technique. 84 / 101
- 86. Common spatial patterns. When we expect that each class is close to some linear subspace, we can (naively) find this subspace for each class by PCA, or (better idea) take into account the variation of the other data and optimize accordingly. The natural generalization is to take several components: $W \in \mathbb{R}^{n \times n_1}$ is a projection matrix, where $n$ and $n_1$ are the numbers of dimensions in the original and new spaces; $\text{tr} \, W^T \Sigma_{\text{class}} W \to \max_W$ subject to $W^T \Sigma_{\text{total}} W = I$. Frequently used in the neurosciences, in particular in BCI based on EEG / MEG. 85 / 101
- 87. Common spatial patterns. The patterns found describe the projection into a 6-dimensional space. 86 / 101
- 88. Dimensionality reduction summary: capable of extracting sensible features from highly-dimensional data; frequently used to 'visualize' the data; nonlinear methods rely on distances in the space; works well with highly-dimensional spaces whose features are of the same nature. 87 / 101
- 90. Finding optimal hyperparameters: some algorithms have many parameters (regularizations, depth, learning rate, ...); not all the parameters can be guessed; checking all combinations takes too long. We need automated hyperparameter optimization! 89 / 101
- 94. Finding optimal parameters: randomly picking parameters is a partial solution; given a target metric, we can optimize it, but there is no gradient with respect to the parameters, the results are noisy, and reconstructing the function is a problem. Before running grid optimization, make sure your metric is stable (e.g. by train/testing on different subsets). Overfitting (= getting a too-optimistic estimate of quality on a holdout) by using many attempts is a real issue. 93 / 101
- 95. Optimal grid search: stochastic optimization (Metropolis-Hastings, annealing) requires too many evaluations and uses only the last checked combination; regression techniques reuse all known information (ML to optimize ML!). 94 / 101
- 96. Optimal grid search using regression. General algorithm (a point of the grid = a set of parameters): 1. evaluate at random points; 2. build a regression model based on known results; 3. select the point with the best expected quality according to the trained model; 4. evaluate the quality at this point; 5. go to 2 if there are not enough evaluations. Why not use linear regression? Exploration vs. exploitation trade-off: should we explore poorly-covered regions, or try to enhance what currently seems to be optimal? 95 / 101
- 97. Gaussian processes for regression. Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are the mean and covariance functions: $m(x) = \mathbb{E}\, Y(x)$, $K(x, \tilde{x}) = \text{cov}(Y(x), Y(\tilde{x}))$. $m(x)$ represents our prior expectation of quality (may be taken constant); $K(x, \tilde{x})$ represents the influence of known results on the expectation of values at new points. The RBF kernel is used here too: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|^2)$; another popular choice: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|)$. We can model the posterior distribution of results at each point. 96 / 101
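A minimal sketch of the GP posterior with the RBF kernel above, assuming a zero prior mean and a small noise term for numerical stability (the training points are invented):

```python
import numpy as np

def rbf(a, b, c=1.0):
    """K(x, x~) = exp(-c ||x - x~||^2), for 1-D inputs."""
    return np.exp(-c * (a[:, None] - b[None, :]) ** 2)

def gp_posterior(x_train, y_train, x_test, c=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP with RBF kernel."""
    K = rbf(x_train, x_train, c) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train, c)
    Kss = rbf(x_test, x_test, c)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
mean, var = gp_posterior(x, y, np.array([1.0, 5.0]))
```

Near a training point the posterior mean follows the observation and the variance collapses; far away (x = 5) the mean reverts to the prior and the variance returns to the prior level, which is exactly what exploration strategies exploit.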
- 98. Gaussian Process Demo on Mathematica 97 / 101
- 99. Gaussian processes model the posterior distribution at each point of the grid: we know at which points the quality is already well-estimated, and we are able to find regions which need exploration. See also this demo. 98 / 101
- 101. Summary about hyperoptimization: parameters can be tuned automatically; be sure that the metric being optimized is stable; mind the optimistic quality estimation (resolved by one more holdout). But this is not what you should spend your time on: the gain from properly cooking features / reconsidering the problem is much higher. 100 / 101
- 102. 101 / 101
