
MLHEP Lectures - day 1, basic track

Introduction to machine learning terminology.
Applications within High Energy Physics and outside HEP.

* Basic problems: classification and regression.
* Nearest neighbours approach and spatial indices
* Overfitting (intro)
* Curse of dimensionality
* ROC curve, ROC AUC
* Bayes optimal classifier
* Density estimation: KDE and histograms
* Parametric density estimation
* Mixtures for density estimation and EM algorithm
* Generative approach vs discriminative approach
* Linear decision rule, intro to logistic regression
* Linear regression

  1. 1. Machine Learning in High Energy Physics Lectures 1 & 2 Alex Rogozhnikov Lund, MLHEP 2016 1 / 87
  2. 2. Intro notes two tracks: introductory course (this one) advanced track: Mon, Tue, Wed, then two tracks are merged Introductory track: two lectures and two practice seminars on each day Kaggle challenges 'Triggers' — only for advanced track, lasts for 3 days 'Higgs' — for both tracks, lasts for 7 days know material? Spend more time on challenges! 1 / 87
  3. 3. Intro notes — 2 chat rooms gitter if you want to share something between teams — please do it publicly (via chat) repository glossary is in the repository 2 / 87
  4. 4. What is Machine Learning about? a method of teaching computers to make and improve predictions or behaviors based on some data? a field of computer science, probability theory, and optimization theory which allows complex tasks to be solved for which a logical/procedural approach would not be possible or feasible? a type of AI that provides computers with the ability to learn without being explicitly programmed? somewhere in between statistics, AI, optimization theory, signal processing and pattern matching? 3 / 87
  5. 5. What is Machine Learning about Inference of statistical dependencies which give us ability to predict 4 / 87
  6. 6. What is Machine Learning about Inference of statistical dependencies which give us ability to predict Data is cheap, knowledge is precious 5 / 87
  7. 7. Machine Learning is used in search engines spam detection security: virus detection, DDOS defense computer vision and speech recognition market basket analysis, customer relationship management (CRM), churn prediction credit scoring / insurance scoring, fraud detection health monitoring traffic jam prediction, self-driving cars advertisement systems / recommendation systems / news clustering 6 / 87
  8. 8. Machine Learning is used in search engines spam detection security: virus detection, DDOS defense computer vision and speech recognition market basket analysis, customer relationship management (CRM), churn prediction credit scoring / insurance scoring, fraud detection health monitoring traffic jam prediction, self-driving cars advertisement systems / recommendation systems / news clustering and hundreds more 7 / 87
  9. 9. Machine Learning in High Energy Physics Triggers (LHCb, CMS to join soon) Particle identification Calibration Tagging Stripping line Analysis 8 / 87
  10. 10. Machine Learning in High Energy Physics Triggers (LHCb, CMS to join soon) Particle identification Calibration Tagging Stripping line Analysis On each stage different data is used and different information is inferred, but the ideas behind them are quite similar. 9 / 87
  11. 11. General notion In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$: $i$ is an index of event, $x_i$ is a vector of features available for event, $y_i$ is a target — the value we need to predict. features = observables = variables 10 / 87
  12. 12. Classification problem $y_i \in Y$, where $Y$ is a finite set of labels. Examples particle identification based on information about track: $x_i = (p, \eta, E, \text{charge}, \chi^2_{PV}, \text{FlightTime})$, $Y = \{\text{electron}, \text{muon}, \text{pion}, \ldots\}$ binary classification: $Y = \{0, 1\}$, $1$ is signal, $0$ is background 11 / 87
  13. 13. Regression problem $y \in \mathbb{R}$ Examples: predicting the price of a house by its position predicting number of customers / money income reconstructing real momentum of particle 12 / 87
  14. 14. Regression problem $y \in \mathbb{R}$ Examples: predicting the price of a house by its position predicting number of customers / money income reconstructing real momentum of particle Why do we need automatic classification/regression? in applications up to thousands of features higher quality much faster adaptation to new problems 13 / 87
  15. 15. Classification based on nearest neighbours Given training set of objects and their labels $\{x_i, y_i\}$ we predict the label for the new observation $x$: $\hat{y} = y_j, \quad j = \arg\min_i \rho(x, x_i)$ Here and after $\rho(x, \tilde{x})$ is the distance in the space of features. 14 / 87
  16. 16. Visualization of decision rule Consider a classification problem with 2 features: $x_i = (x_i^1, x_i^2)$, $y_i \in Y = \{0, 1\}$ 15 / 87
  17. 17. Nearest Neighbours ($k$NN) A better way is to use $k$ neighbours: $p_{\tilde{y}}(x) = \dfrac{\#\text{ of kNN events of } x \text{ in class } \tilde{y}}{k}$ 16 / 87
  18. 18. 17 / 87
  19. 19. k = 1, 2, 5, 30 18 / 87
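
As a quick illustration of the $k$NN classifier from the slides above, here is a minimal sketch assuming scikit-learn (mentioned at the end of the lecture); the synthetic two-feature dataset and the particular k values are choices made for the example, not part of the lecture material.

```python
# Minimal kNN classification sketch (illustrative, not from the slides).
# Class probabilities are estimated as the fraction of the k nearest
# training events belonging to each class.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# two Gaussian blobs play the role of background (0) and signal (1)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.normal(2, 1, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 2, 5, 30]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # k = 1 looks "ideal" on the training set (the closest neighbour is the event itself)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```
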
  20. 20. Overfitting What is the quality of classification on training dataset when $k = 1$? 19 / 87
  21. 21. Overfitting What is the quality of classification on training dataset when $k = 1$? answer: it is ideal (closest neighbor is event itself) 20 / 87
  22. 22. Overfitting What is the quality of classification on training dataset when $k = 1$? answer: it is ideal (closest neighbor is event itself) quality is lower when $k > 1$ 21 / 87
  23. 23. Overfitting What is the quality of classification on training dataset when $k = 1$? answer: it is ideal (closest neighbor is event itself) quality is lower when $k > 1$ this doesn't mean $k = 1$ is the best, it means we cannot use training events to estimate quality when the classifier's decision rule is too complex and captures details from training data that are not relevant to the distribution, we call this overfitting (more details tomorrow) 22 / 87
  24. 24. Regression using $k$NN Regression with nearest neighbours is done by averaging of output $\hat{y} = \frac{1}{k} \sum_{j \in \text{knn}(x)} y_j$ 23 / 87
  25. 25. $k$NN with weights Average neighbours' output with weights: the closer the neighbour, the higher the weight of its contribution, i.e. $w_j = 1 / \rho(x, x_j)$: $\hat{y} = \dfrac{\sum_{j \in \text{knn}(x)} w_j y_j}{\sum_{j \in \text{knn}(x)} w_j}$ 24 / 87
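
A minimal sketch of $k$NN regression with and without distance weights, assuming scikit-learn's KNeighborsRegressor (whose weights='distance' option implements $w_j = 1/\rho(x, x_j)$); the toy sine data is made up for illustration.

```python
# Sketch of kNN regression with inverse-distance weights, matching the
# weighted-average rule on the slide; synthetic 1-D data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

knn_uniform = KNeighborsRegressor(n_neighbors=5, weights='uniform').fit(X, y)
knn_weighted = KNeighborsRegressor(n_neighbors=5, weights='distance').fit(X, y)

x_new = np.array([[3.0]])
print(knn_uniform.predict(x_new), knn_weighted.predict(x_new))
```
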
  26. 26. Computational complexity Given that dimensionality of space is $d$ and there are $n$ training samples: training time ~ O(save a link to the data) prediction time: $n \times d$ for each sample 25 / 87
  27. 27. Spatial index: ball tree 26 / 87
  28. 28. Ball tree training time ~ $O(d \times n \log(n))$ prediction time ~ $\log(n) \times d$ for each sample Other options exist: KD-tree. 27 / 87
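
A small sketch of building and querying a spatial index, assuming scikit-learn's BallTree (KDTree has the same interface); the data size and dimensionality are arbitrary toy choices.

```python
# Sketch of a spatial index query with scikit-learn's BallTree.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 5))        # n = 10000 samples, d = 5 features

tree = BallTree(X)                     # building the tree ~ O(d * n * log n)
dist, ind = tree.query(X[:3], k=5)     # 5 nearest neighbours of 3 query points
print(ind.shape, dist.shape)           # (3, 5) (3, 5)
```
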
  29. 29. Overview of $k$NN Awesomely simple classifier and regressor Provides too optimistic quality on training data Quite slow, though optimizations exist Too sensitive to scale of features Hard times with data of high dimensions 28 / 87
  30. 30. Sensitivity to scale of features Euclidean distance: $\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2$ 29 / 87
  31. 31. Sensitivity to scale of features Euclidean distance: $\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2$ Change scale of first feature: $\rho(x, \tilde{x})^2 = (10 x_1 - 10 \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2 \sim 100\,(x_1 - \tilde{x}_1)^2$ 30 / 87
  32. 32. Sensitivity to scale of features Euclidean distance: $\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2$ Change scale of first feature: $\rho(x, \tilde{x})^2 = (10 x_1 - 10 \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2 \sim 100\,(x_1 - \tilde{x}_1)^2$ Scaling of features frequently increases quality. 31 / 87
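
A minimal sketch of how feature scaling affects $k$NN, assuming scikit-learn's StandardScaler and a synthetic dataset where one feature has a much larger scale; the numbers are illustrative only.

```python
# Sketch: standardizing features before kNN, since Euclidean distance is
# dominated by features with large scales. Synthetic data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.normal(1, 1, size=(500, 2))])
X[:, 0] *= 1000.0          # first feature measured in "different units"
y = np.array([0] * 500 + [1] * 500)

raw = KNeighborsClassifier(n_neighbors=10)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
print(cross_val_score(raw, X, y).mean(), cross_val_score(scaled, X, y).mean())
```
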
  33. 33. Distance function matters Minkowski distance $\rho(x, \tilde{x})^p = \sum_l (x_l - \tilde{x}_l)^p$ Canberra $\rho(x, \tilde{x}) = \sum_l \frac{|x_l - \tilde{x}_l|}{|x_l| + |\tilde{x}_l|}$ Cosine metric $\rho(x, \tilde{x}) = \frac{\langle x, \tilde{x} \rangle}{|x|\,|\tilde{x}|}$ 32 / 87
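
The distances above can be computed with scipy.spatial.distance, as in this short sketch; note that scipy's `cosine` is a dissimilarity (1 minus the cosine metric on the slide), so it differs from the slide's definition by that offset.

```python
# Sketch of the distance functions listed above, via scipy.spatial.distance.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
x_tilde = np.array([2.0, 0.5, 3.5])

print(distance.minkowski(x, x_tilde, p=3))   # Minkowski distance with p = 3
print(distance.canberra(x, x_tilde))         # sum |x_l - y_l| / (|x_l| + |y_l|)
print(distance.cosine(x, x_tilde))           # 1 - <x, y> / (|x| |y|)
```
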
  34. 34. Problems with high dimensions With higher dimensions the neighbouring points are further away. Example: consider $n$ training data points distributed uniformly in the unit cube, $d \gg 1$: the expected number of points in a ball of radius $r$ is proportional to $r^d$ to collect the same number of neighbours, we need $r = \text{const}^{1/d} \to 1$ $k$NN suffers from curse of dimensionality. 33 / 87
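
A tiny numeric sketch of the statement above: the radius needed to capture a fixed fraction of uniformly distributed points behaves like $\text{const}^{1/d}$ and approaches 1 as $d$ grows (the 1% fraction is an arbitrary choice for illustration).

```python
# How large a ball (up to constants) is needed to contain a fixed fraction
# of points uniformly distributed in the unit cube, as dimension grows.
fraction = 0.01                      # we want ~1% of the data as neighbours
for d in [1, 2, 5, 10, 50, 100]:
    r = fraction ** (1.0 / d)        # radius scale required in d dimensions
    print(d, round(r, 3))            # tends to 1: neighbours are no longer "local"
```
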
  35. 35. Measuring quality of binary classification The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red) Which classifier provides better discrimination? 34 / 87
  36. 36. Measuring quality of binary classification The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red) Which classifier provides better discrimination? Discrimination is identical in all three cases 35 / 87
  37. 37. ROC curve demonstration 36 / 87
  38. 38. ROC curve 37 / 87
  39. 39. ROC curve These distributions have the same ROC curve: (the ROC curve is the dependency of passed signal vs passed background) 38 / 87
  40. 40. ROC curve Defined only for binary classification Contains important information: all possible combinations of signal and background efficiencies you may achieve by setting threshold 39 / 87
  41. 41. ROC curve Defined only for binary classification Contains important information: all possible combinations of signal and background efficiencies you may achieve by setting threshold Particular values of thresholds (and initial pdfs) don't matter, ROC curve doesn't contain this information 40 / 87
  42. 42. ROC curve Defined only for binary classification Contains important information: all possible combinations of signal and background efficiencies you may achieve by setting threshold Particular values of thresholds (and initial pdfs) don't matter, ROC curve doesn't contain this information ROC curve = information about order of events: b b s b s b ... s s b s s 41 / 87
  43. 43. ROC curve Defined only for binary classification Contains important information: all possible combinations of signal and background efficiencies you may achieve by setting threshold Particular values of thresholds (and initial pdfs) don't matter, ROC curve doesn't contain this information ROC curve = information about order of events: b b s b s b ... s s b s s Comparison of algorithms should be based on the information from ROC curve. 42 / 87
  44. 44. Terminology and Conventions fpr = background efficiency = b tpr = signal efficiency = s 43 / 87
  45. 45. Terminology and Conventions fpr = background efficiency = b tpr = signal efficiency = s 44 / 87
  46. 46. ROC AUC (area under the ROC curve) ROC AUC = $P(r_b < r_s)$, where $r_b, r_s$ are predictions of random background and signal events. 45 / 87
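
A minimal sketch of computing the ROC curve and ROC AUC with scikit-learn on toy Gaussian scores, together with a direct check of ROC AUC $= P(r_b < r_s)$; the score distributions are made up for the example.

```python
# ROC curve and ROC AUC on toy classifier scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
y_true = np.array([0] * 1000 + [1] * 1000)
scores = np.concatenate([rng.normal(0, 1, 1000),    # background predictions r_b
                         rng.normal(1, 1, 1000)])   # signal predictions r_s

fpr, tpr, thresholds = roc_curve(y_true, scores)    # background eff. vs signal eff.
print(roc_auc_score(y_true, scores))

# direct check of ROC AUC = P(r_b < r_s) on the same toy data
r_b, r_s = scores[y_true == 0], scores[y_true == 1]
print(np.mean(r_b[:, None] < r_s[None, :]))
```
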
  47. 47. Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very few background) 46 / 87
  48. 48. Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very few background) Applications frequently demand a different metric. 47 / 87
  49. 49. $n$-minutes break 48 / 87
  50. 50. Recapitulation 1. Statistical ML: applications and problems 2. ML in HEP 3. $k$ nearest neighbours classifier and regressor. 4. ROC curve, ROC AUC 49 / 87
  51. 51. Statistical Machine Learning Machine learning we use in practice is based on statistics Main assumption: the data is generated from probabilistic distribution $p(x, y)$ Does there really exist the distribution of people / pages / texts? 50 / 87
  52. 52. Statistical Machine Learning Machine learning we use in practice is based on statistics Main assumption: the data is generated from probabilistic distribution $p(x, y)$ Does there really exist the distribution of people / pages / texts? In HEP these distributions do exist 51 / 87
  53. 53. Optimal classification. Bayes optimal classifier Assuming that we know real distributions $p(x, y)$ we reconstruct $p(y|x)$ using Bayes' rule: $p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x|y)}{p(x)}$, $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}$ 52 / 87
  54. 54. Optimal classification. Bayes optimal classifier Assuming that we know real distributions $p(x, y)$ we reconstruct $p(y|x)$ using Bayes' rule: $p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x|y)}{p(x)}$, $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}$ Lemma (Neyman–Pearson): The best classification quality is provided by $\frac{p(y=1 \mid x)}{p(y=0 \mid x)}$ (Bayes optimal classifier) 53 / 87
  55. 55. Optimal Binary Classification Bayes optimal classifier has the highest possible ROC curve. $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)}{p(y=0)} \times \frac{p(x \mid y=1)}{p(x \mid y=0)}$ Since the classification quality depends only on order, $p(y=1 \mid x)$ gives optimal classification quality too! 54 / 87
  56. 56. Optimal Binary Classification Bayes optimal classifier has the highest possible ROC curve. $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)}{p(y=0)} \times \frac{p(x \mid y=1)}{p(x \mid y=0)}$ Since the classification quality depends only on order, $p(y=1 \mid x)$ gives optimal classification quality too! How can we estimate terms from this expression? 55 / 87
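
A toy sketch of the Bayes optimal classifier when the densities are known exactly: two 1-D Gaussians with assumed parameters and priors, classified by the posterior ratio above (all numbers are illustrative, not from the slides).

```python
# Bayes optimal classification via the likelihood ratio, with known densities.
import numpy as np
from scipy.stats import norm

p_signal, p_background = 0.5, 0.5           # p(y=1), p(y=0) -- assumed priors
signal_pdf = norm(loc=1.0, scale=1.0)       # p(x | y=1)
background_pdf = norm(loc=-1.0, scale=1.0)  # p(x | y=0)

def posterior_ratio(x):
    # p(y=1|x) / p(y=0|x) = [p(y=1)/p(y=0)] * [p(x|y=1)/p(x|y=0)]
    return (p_signal / p_background) * signal_pdf.pdf(x) / background_pdf.pdf(x)

x = np.linspace(-3, 3, 7)
print(posterior_ratio(x) > 1)               # predicted labels (True = signal)
```
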
  57. 57. Histograms density estimation Counting number of samples in each bin and normalizing. fast choice of binning is crucial number of bins grows exponentially → curse of dimensionality 56 / 87
  58. 58. Kernel density estimation $f(x) = \frac{1}{nh} \sum_i K\left(\frac{x - x_i}{h}\right)$ $K(x)$ is the kernel, $h$ is the bandwidth Typically, gaussian kernel is used, but there are many others. Approach is very close to weighted $k$NN. 57 / 87
  59. 59. Kernel density estimation bandwidth selection Silverman's rule of thumb: $h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$ 58 / 87
  60. 60. Kernel density estimation bandwidth selection Silverman's rule of thumb: $h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$ may be irrelevant if the data is far from being gaussian 59 / 87
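
A minimal NumPy sketch of a 1-D Gaussian kernel density estimate with Silverman's bandwidth, mirroring the two formulas above; the training sample is synthetic.

```python
# Gaussian KDE written out by hand: f(x) = 1/(n h) * sum_i K((x - x_i) / h).
import numpy as np

rng = np.random.RandomState(0)
x_train = rng.normal(0, 1, size=200)
n = len(x_train)

h = x_train.std() * (4.0 / (3.0 * n)) ** 0.2      # Silverman's rule of thumb

def kde(x):
    u = (x[:, None] - x_train[None, :]) / h       # (x - x_i) / h for all pairs
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # gaussian kernel
    return K.sum(axis=1) / (n * h)

grid = np.linspace(-3, 3, 5)
print(kde(grid))
```
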
  61. 61. Parametric density estimation Family of density functions: $f(x; \theta)$. Problem: estimate parameters of a Gaussian distribution. $f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$ 60 / 87
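
A short sketch of parametric density estimation for the Gaussian family: estimate $\mu$ and $\Sigma$ from data and evaluate the fitted density; it assumes scipy.stats.multivariate_normal, and the toy data parameters are made up.

```python
# Fit a multivariate Gaussian to data and evaluate the fitted density.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_hat = X.mean(axis=0)                  # estimate of mu
sigma_hat = np.cov(X, rowvar=False)      # estimate of the covariance matrix

fitted = multivariate_normal(mean=mu_hat, cov=sigma_hat)
print(mu_hat, fitted.pdf([0.0, 0.0]))    # fitted parameters and density at a point
```
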
  62. 62. QDA (Quadratic discriminant analysis) Reconstructing probabilities $p(x \mid y=1)$, $p(x \mid y=0)$ from data, assuming those are multidimensional normal distributions: $p(x \mid y=0) \sim \mathcal{N}(\mu_0, \Sigma_0)$, $p(x \mid y=1) \sim \mathcal{N}(\mu_1, \Sigma_1)$ $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)}{p(y=0)} \frac{p(x \mid y=1)}{p(x \mid y=0)} = \text{const} \cdot \frac{n_1}{n_0} \cdot \frac{\exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)\right)}{\exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)\right)} = \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \text{const}\right)$ 61 / 87
  63. 63. 62 / 87
  64. 64. QDA complexity $n$ samples, $d$ dimensions training consists of fitting $p(x \mid y=0)$ and $p(x \mid y=1)$ and takes $O(n d^2 + d^3)$: computing covariance matrix $O(n d^2)$ inverting covariance matrix $O(d^3)$ prediction takes $O(d^2)$ for each sample, spent on computing dot product 63 / 87
  65. 65. QDA overview simple decision rule fast prediction many parameters to reconstruct in high dimensions data almost never has a gaussian distribution 64 / 87
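
A minimal QDA sketch, assuming scikit-learn's QuadraticDiscriminantAnalysis and synthetic Gaussian classes; the fitted means and posterior probabilities correspond to the quantities in the QDA formula above.

```python
# QDA on synthetic Gaussian classes: the library fits (mu_0, Sigma_0),
# (mu_1, Sigma_1) and classifies via the likelihood ratio.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 1], [[1.5, -0.4], [-0.4, 0.8]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.means_)                # fitted mu_0, mu_1
print(qda.predict_proba(X[:3]))  # p(y | x) for a few events
```
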
  66. 66. Gaussian mixtures for density estimation Mixture of distributions: $f(x) = \sum_{c \,\in\, \text{components}} \pi_c f_c(x; \theta_c)$, $\sum_{c \,\in\, \text{components}} \pi_c = 1$ Mixture of Gaussian distributions: $f(x) = \sum_{c \,\in\, \text{components}} \pi_c f(x; \mu_c, \Sigma_c)$ Parameters to be found: $\pi_1, \ldots, \pi_C$, $\mu_1, \ldots, \mu_C$, $\Sigma_1, \ldots, \Sigma_C$ 65 / 87
  67. 67. 66 / 87
  68. 68. Gaussian mixtures: finding parameters Criterion is maximizing likelihood (using MLE to find optimal parameters): $\sum_i \log f(x_i; \theta) \to \max_\theta$ no analytic solution we can use general-purpose optimization methods 67 / 87
  69. 69. Gaussian mixtures: finding parameters Criterion is maximizing likelihood (using MLE to find optimal parameters): $\sum_i \log f(x_i; \theta) \to \max_\theta$ no analytic solution we can use general-purpose optimization methods In mixtures parameters are split in two groups: $\theta_1, \ldots, \theta_C$ — parameters of components $\pi_1, \ldots, \pi_C$ — contributions of components 68 / 87
  70. 70. Expectation-Maximization algorithm [Dempster et al., 1977] Idea: introduce a set of hidden variables $\pi_c(x)$ Expectation: $\pi_c(x) \leftarrow p(x \in c) = \frac{\pi_c f_c(x; \theta_c)}{\sum_{\tilde{c}} \pi_{\tilde{c}} f_{\tilde{c}}(x; \theta_{\tilde{c}})}$ Maximization: $\pi_c \leftarrow \frac{1}{n} \sum_i \pi_c(x_i)$, $\theta_c \leftarrow \arg\max_\theta \sum_i \pi_c(x_i) \log f_c(x_i; \theta_c)$ Maximization step is trivial for Gaussian distributions. EM-algorithm is more stable and has good convergence properties. 69 / 87
  71. 71. EM algorithm 70 / 87
  72. 72. EM algorithm 71 / 87
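
A short sketch of fitting a Gaussian mixture, assuming scikit-learn's GaussianMixture (which runs the EM algorithm internally); the one-dimensional toy data and the choice of two components are illustrative only.

```python
# Gaussian mixture fitted by EM; the attributes below correspond to
# pi_c, mu_c and the "expectation" probabilities pi_c(x) from the slides.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 0.5, size=(300, 1)),
               rng.normal(1, 1.0, size=(700, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)             # pi_1, ..., pi_C
print(gmm.means_.ravel())       # mu_1, ..., mu_C
print(gmm.predict_proba(X[:3])) # pi_c(x) for a few points
```
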
  73. 73. Classification model based on mixtures density estimation is called MDA (mixture discriminant analysis) Generative approach Generative approach: trying to reconstruct $p(x, y)$, then use Bayes classification formula to predict. QDA, MDA are generative classifiers. 72 / 87
  74. 74. Classification model based on mixtures density estimation is called MDA (mixture discriminant analysis) Generative approach Generative approach: trying to reconstruct $p(x, y)$, then use Bayes classification formula to predict. QDA, MDA are generative classifiers. Problems of generative approach Real life distributions hardly can be reconstructed Especially in high-dimensional spaces So, we switch to discriminative approach: guessing $p(y|x)$ directly 73 / 87
  75. 75. Classification: truck vs car 74 / 87
  76. 76. If we can avoid density estimation, we'd better do it. 75 / 87
  77. 77. Linear decision rule Decision function is linear: $d(x) = \langle w, x \rangle + w_0$ $d(x) > 0 \to \hat{y} = +1$, $d(x) < 0 \to \hat{y} = -1$ This is a parametric model (finding parameters $w, w_0$). QDA & MDA are parametric as well. 76 / 87
  78. 78. Finding Optimal Parameters A good initial guess: get such $w, w_0$ that error of classification is minimal: $\mathcal{L} = \sum_{i \in \text{events}} \mathbb{1}_{y_i \ne \hat{y}_i}$, $\hat{y}_i = \text{sgn}(d(x_i))$ Notation: $\mathbb{1}_{\text{true}} = 1$, $\mathbb{1}_{\text{false}} = 0$. Discontinuous optimization (arrrrgh!) 77 / 87
  79. 79. Finding Optimal Parameters - 2 Discontinuous optimization solution: let's make decision rule smooth $p_{+1}(x) = f(d(x))$, $p_{-1}(x) = 1 - p_{+1}(x)$, where $f(0) = 0.5$, $f(x) > 0.5$ if $x > 0$, $f(x) < 0.5$ if $x < 0$ 78 / 87
  80. 80. Logistic function $\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$ Properties 1. monotonic, $\sigma(x) \in (0, 1)$ 2. $\sigma(x) + \sigma(-x) = 1$ 3. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ 4. $2\sigma(x) = 1 + \tanh(x/2)$ 79 / 87
  81. 81. Logistic regression Define probabilities obtained with logistic function: $d(x) = \langle w, x \rangle + w_0$, $p_{+1}(x) = \sigma(d(x))$, $p_{-1}(x) = \sigma(-d(x))$ and optimize log-likelihood: $\mathcal{L} = -\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$, where $L(x_i, y_i) = -\ln(p_{y_i}(x_i))$ Important exercise: find an expression and build a plot for $L(x_i, y_i)$ 80 / 87
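
A minimal sketch of logistic regression on synthetic data, assuming scikit-learn's LogisticRegression: it recovers $w, w_0$, applies the sigmoid to $d(x)$, and evaluates the negative log-likelihood from the slide (scikit-learn itself minimizes a regularized version of this loss).

```python
# Logistic regression: sigmoid of a linear decision function and its log-loss.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, size=(500, 2)),
               rng.normal(1, 1, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)

clf = LogisticRegression().fit(X, y)
w, w0 = clf.coef_[0], clf.intercept_[0]
d = X @ w + w0                   # d(x) = <w, x> + w_0
p_plus = sigmoid(d)              # p_{+1}(x) = sigma(d(x))

# average negative log-likelihood (the loss being minimized, up to regularization)
loss = -np.mean(y * np.log(p_plus) + (1 - y) * np.log(1 - p_plus))
print(loss)
```
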
  82. 82. Linear model for regression How to use a linear function $d(x) = \langle w, x \rangle + w_0$ for regression? Simplification of notation: $x_0 = 1$, $x = (1, x_1, \ldots, x_d)$, so $d(x) = \langle w, x \rangle$. 81 / 87
  83. 83. Linear regression (ordinary least squares) We can use a linear function $d(x) = \langle w, x \rangle$ for regression: $d(x_i) = y_i$ This is a linear system with $d + 1$ variables and $n$ equations. Minimize OLS aka MSE (mean squared error): $\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$ Explicit solution: $\left(\sum_i x_i x_i^T\right) w = \sum_i y_i x_i$ 82 / 87
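
A short NumPy sketch of the explicit OLS solution $\left(\sum_i x_i x_i^T\right) w = \sum_i y_i x_i$ on toy data, with a constant feature playing the role of $w_0$; the true weights are made up for the example.

```python
# Ordinary least squares via the normal equations.
import numpy as np

rng = np.random.RandomState(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # x = (1, x_1, ..., x_d)
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, size=n)

# solve (sum_i x_i x_i^T) w = sum_i y_i x_i, i.e. (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                     # close to true_w
```
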
  84. 84. Linear regression can use some other error but no explicit solution in other cases demonstrates properties of linear models reliable estimates when $n \gg d$ able to completely fit to the data if $n = d$ undefined when $n < d$ 83 / 87
  85. 85. Data Scientist Pipeline Experiments in appropriate high-level language or environment After experiments are over — implement final algorithm in low-level language (C++, CUDA, FPGA) Second point is not always needed 84 / 87
  86. 86. Scientific Python NumPy vectorized computations in python Matplotlib for drawing Pandas for data manipulation and analysis (based on NumPy) 85 / 87
  87. 87. Scientific Python Scikit-learn most popular library for machine learning Scipy libraries for science and engineering Root_numpy convenient way to work with ROOT files 86 / 87
  88. 88. 87 / 87
