
2019.09.10 - Barthelemy - Normalisation en analyse de données (Normalization in data analysis)

Presentation "L'importance de la normalisation de données" (The importance of data normalization), given at Lyon Data Science on 10/09/2019

  1. L'importance de la normalisation en analyse de données (The importance of normalization in data analysis)
     Lyon Data Science - 10/09/2019
     Quentin Barthélemy, R&D Engineer, PhD @ Foxstream
  2. Foxstream, Vaulx-en-Velin (69): end-to-end control of video
     Two domains:
     • video management system: camera management, recording, broadcasting, search, export
     • real-time video protection: outdoor intrusion detection, queue monitoring
     SME, 35 people:
     • development team: 10 engineers
     • research and innovation team: 3 PhDs and 2 PhD students
  3. Data analysis workflow
     Machine learning pipeline: Raw data → (Pre-processing) → Data → (Feature extraction) → Features → (Modeling) → Model → (Scoring) → Scores
     Normalization:
     • an ambiguous term
     • not reduced to a pre-processing step
  4. Outline
     1. Normalizations
        a) Univariate data
        b) Multivariate data
     2. Applications
        a) Machine learning
        b) Anomaly detection
        c) Deep learning
        d) Statistics
  5. Normalizations
  6. Normalizations
     Data: heterogeneous values $x_i$, with different ranges ($\mu$: mean, $\sigma$: standard deviation)
     • z-score standardization: $z_i = \frac{x_i - \mu}{\sigma}$
       → homogeneous values
       → the z-score is a distance to the distribution
     • hypothesis: Gaussian/normal distribution $\mathcal{N}(\mu, \sigma^2)$, with density $f(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2}$
       → the z-score value has a strong meaning
     Valid normalization? It depends on the data… (sketch below)
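A minimal sketch of z-score standardization, assuming NumPy (the slides do not name an implementation; the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=170.0, scale=10.0, size=1000)  # heterogeneous raw values (illustrative)

z = (x - x.mean()) / x.std()  # z_i = (x_i - mu) / sigma: distance to the distribution, in units of sigma
print(z.mean(), z.std())      # ~0 and ~1 after standardization
```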
  7. Normalizations: non-Gaussian data
     • problems:
       • the z-score does not have the same meaning for the left and right parts of the distribution
       • the mean and standard deviation are sensitive to values present in the tail
     • z-score normalization does not change the shape of the distribution
       → outputs are not necessarily Gaussian: normalized → rescaled
     • geometric z-score: $z_g = \frac{\log(x_i / \mu_g)}{\log(\sigma_g)}$, adapted to log-normal distributions ($\mu_g$: geometric mean, $\sigma_g$: geometric standard deviation) (sketch below)
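A sketch of the geometric z-score, under the assumption that the geometric mean and geometric standard deviation are computed on the log scale (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.5, size=1000)  # strictly positive, right-skewed data

mu_g = np.exp(np.mean(np.log(x)))    # geometric mean
sigma_g = np.exp(np.std(np.log(x)))  # geometric standard deviation

z_g = np.log(x / mu_g) / np.log(sigma_g)  # geometric z-score from the slide
print(z_g.mean(), z_g.std())              # ~0 and ~1, computed on the log scale
```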
  8. Normalizations: robust statistics
     • 1st order statistic, central tendency: $\mu = \arg\min_m \sum_i |x_i - m|^p$, giving
       • the mean for $p = 2$
       • the median for $p = 1$
       • the mode, i.e. the most frequent value (≠ maximal value)
     • 2nd order statistic, dispersion:
       • standard deviation
       • median absolute deviation (MAD)
       • dispersion ratio
     • z-score $\frac{x - \text{mean}}{\text{std}}$ vs. robust rescaling $\frac{x - \text{med}}{1.4826 \times \text{mad}}$: equal for symmetric distributions (sketch below)
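A sketch contrasting the z-score with robust rescaling, assuming SciPy's median_abs_deviation helper (the slide only gives the formula):

```python
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 990), rng.normal(50, 1, 10)])  # 1% outliers in the tail

z_classic = (x - x.mean()) / x.std()  # mean and std are inflated by the outliers
z_robust = (x - np.median(x)) / (1.4826 * median_abs_deviation(x))

print(z_classic[-1], z_robust[-1])  # the outlier stands out far more after robust rescaling (~10 vs. ~50)
```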
  9. Normalizations: other rescalings
     • mean normalization: $x \leftarrow \frac{x - \text{mean}}{\max - \min}$ → $x_i \in [-1, 1]$
     • min-max: $x \leftarrow \frac{x - \min}{\max - \min}$ → $x_i \in [0, 1]$
     • etc.
     Caveat: min and max are sensitive to outliers.
     Unit norm: $x \leftarrow \frac{x}{\|x\|_p}$, giving
     • unitary energy, using the Euclidean norm ($p = 2$)
     • maximal absolute value = 1, using the infinity norm ($p = \infty$) → $x_i \in [-1, 1]$
     • etc.
     normalized → normed (sketch below)
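A brief sketch of the two families above, min-max rescaling per feature and unit-norm normalization per sample; scikit-learn is one possible choice, not prescribed by the slides:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])  # two features with very different ranges

X_minmax = MinMaxScaler().fit_transform(X)          # each column rescaled to [0, 1]
X_normed = Normalizer(norm="l2").fit_transform(X)   # each row divided by its Euclidean norm
```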
  10. Multivariate normalizations
      Normalizations for multiple variables/components/features, $x \in \mathbb{R}^N$:
      • component-wise normalization: $z_n = \frac{x_n - \mu_n}{\sigma_n}$
      • multi-component normalization, whitening/sphering: $z = \Sigma^{-1/2}(x - \mu)$, with $\Sigma$ the covariance matrix (≠ dimension reduction)
        → decorrelates the components
        → improves problem conditioning (sketch below)
      [Figure: scatter plots of original, centered, standardized, and whitened data]
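A sketch of whitening through the inverse square root of the covariance matrix, here via eigendecomposition (one of several valid matrix square roots):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 1.5], [1.5, 2.0]], size=1000)  # correlated components

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(Sigma)                       # Sigma = V diag(w) V^T
Sigma_inv_sqrt = eigvecs @ np.diag(eigvals**-0.5) @ eigvecs.T  # Sigma^{-1/2}

Z = (X - mu) @ Sigma_inv_sqrt                                  # z = Sigma^{-1/2} (x - mu)
print(np.cov(Z, rowvar=False).round(2))                        # ~identity: decorrelated components
```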
  11. Quantile transformation
      Non-linear transformation based on the quantile function (sketch below)
      → modifies the shape of the distribution (to Gaussian, uniform, …)
      normalized → Gaussianized
      [Figure: histograms of original, standardized, and Gaussianized data]
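One possible implementation of this transformation is scikit-learn's QuantileTransformer; the data below are simulated for illustration:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # skewed, non-Gaussian data

qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)
x_gauss = qt.fit_transform(x)  # the shape of the distribution is changed, not just rescaled
```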
  12. Applications
  13. Machine learning
      Model optimization during training:
      • problem not well-conditioned → slow convergence of gradient descent, or non-convergence
      • normalization (often called feature scaling) → error surface with a more spherical shape → better convergence of curvature-ignorant optimizers, like gradient descent (sketch below)
      Gradient descent optimizers: stochastic gradient descent (SGD), momentum, AdaGrad, RMSprop, Adadelta, Nesterov, etc.
      No normalization needed for decision trees, random forests, Naive Bayes, LDA, etc.
      [Figure: error surface over parameters p1 and p2, raw vs. normalized features]
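An illustrative sketch of feature scaling before a gradient-descent-based model; the dataset and classifier are arbitrary choices, not from the talk:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # features with very different ranges

raw = SGDClassifier(random_state=0)
scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

print(cross_val_score(raw, X, y).mean())     # often poor: badly conditioned problem
print(cross_val_score(scaled, X, y).mean())  # typically much better after standardization
```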
  14. Anomaly detection
      Many methods for anomaly detection [Chandola2009]
      1. Modeling the features
         • univariate case (1 feature): z-score + threshold
         • multivariate case (vector of $n$ features):
           • multiple z-scores + multiple thresholds
           • Mahalanobis distance $d = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$, i.e. the Euclidean distance after whitening (sketch below)
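A short sketch of Mahalanobis-distance scoring, assuming NumPy; the threshold is left to the user, as on the slide:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.5], [1.5, 2.0]], size=500)  # reference data

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

x = np.array([3.0, -3.0])                     # candidate anomaly
d = np.sqrt((x - mu) @ Sigma_inv @ (x - mu))  # Mahalanobis distance
print(d)  # flag as anomalous if d exceeds a chosen threshold
```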
  15. Anomaly detection
      2. Modeling the distances to the average features (sketch after this list)
         a) compute the average features (scalar, vector or matrix)
         b) compute the distances between the average and all samples
         c) compute the z-scores of the distances
         d) detect outliers with a z-score threshold
      Anomaly detector:
      • one-class classifier with unsupervised training
      • few data needed for calibration
      • simple and intuitive
      • flexible: the mean and the distance can be adapted to the features
        • Riemannian mean and distance for neurophysiological time-series [Barachant2013]
        • optimal transport for acoustic spectra [Alaoui2019]
        • multi-modal extension [Saifutdinova2019]
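A minimal sketch of steps a) to d) with a Euclidean mean and distance; function names are ours, and the slide notes that both the mean and the distance can be swapped (e.g. for Riemannian geometry):

```python
import numpy as np

def fit_detector(X):
    """a) average features; b) distances to the average; then their mean and std."""
    mu = X.mean(axis=0)
    d = np.linalg.norm(X - mu, axis=1)
    return mu, d.mean(), d.std()

def is_anomaly(x, mu, d_mean, d_std, z_th=2.5):
    """c) z-score of the distance; d) compare it to the threshold."""
    z = (np.linalg.norm(x - mu) - d_mean) / d_std
    return z > z_th

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(200, 3))  # unlabeled calibration data (unsupervised)
params = fit_detector(X)
print(is_anomaly(np.array([5.0, 5.0, 5.0]), *params))   # True: far from the average
print(is_anomaly(np.array([0.1, -0.2, 0.3]), *params))  # False
```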
  16. Anomaly detection
      Example of the Riemannian potato [Barachant2013]
      Electroencephalography (EEG): multivariate time-series, here channels Fz and C4
      Each time window $i$ is summarized by its covariance matrix
      $\Sigma_i = \begin{pmatrix} \mathrm{var}(\mathrm{Fz}) & \mathrm{cov}(\mathrm{Fz}, \mathrm{C4}) \\ \mathrm{cov}(\mathrm{C4}, \mathrm{Fz}) & \mathrm{var}(\mathrm{C4}) \end{pmatrix}$
      [Figure: covariance matrices plotted in the (var(C4), cov(C4, Fz)) plane, with their mean and the detection threshold]
  17. to 23. Anomaly detection: example of the Riemannian potato [Barachant2013], continued
      [Animation: the figure of slide 16 is updated frame by frame as new covariance matrices arrive]
  24. Deep learning
      Batch normalization in Inception v2 [Ioffe2015]
      • problem during training: updates of the preceding layers → internal covariate shift → instabilities in the training process
      • batch normalization: compute a z-score on each batch (sketch below)
        → reduces the amount of variation
        → for the same learning rate (LR), it speeds up learning
      Deep learning: no need to normalize/pre-process the input data
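A minimal NumPy sketch of the batch-normalization computation in training mode; the learnable gamma/beta of [Ioffe2015] are shown as fixed defaults, and the inference-time running averages are omitted:

```python
import numpy as np

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """z-score each feature over the batch dimension, then scale and shift."""
    mu = batch.mean(axis=0)
    var = batch.var(axis=0)
    z = (batch - mu) / np.sqrt(var + eps)
    return gamma * z + beta

rng = np.random.default_rng(6)
activations = rng.normal(3.0, 7.0, size=(32, 64))     # batch of 32 samples, 64 features
normed = batch_norm(activations)
print(normed.mean().round(6), normed.std().round(3))  # ~0 and ~1 per batch
```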
  25. Statistical hypothesis tests
      Example of a medical trial: subjects are randomly assigned to 2 treatments and evaluated with a clinical scale, giving data $D$
      $H_0$: the null hypothesis, which we are trying to reject, is that there is no difference between the two treatments
      We want to answer the following question: under $H_0$, what is the probability of obtaining the difference in the means that we observed in the experimental data?
      → use an unpaired Student's t-test and obtain a p-value, $p(D|H_0)$ (sketch below)
      Assumptions: Gaussian distributions of the scales, equal variances, etc.
      To do: check the normality/Gaussianity of the data, use non-parametric tests (Wilcoxon, Mann-Whitney, etc.)
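A sketch of the unpaired Student's t-test with SciPy, on simulated clinical scores (all numbers are illustrative, not from the talk):

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(7)
treatment_a = rng.normal(52.0, 8.0, size=40)  # clinical-scale scores, group A (simulated)
treatment_b = rng.normal(48.0, 8.0, size=40)  # clinical-scale scores, group B (simulated)

t, p = ttest_ind(treatment_a, treatment_b)  # unpaired t-test: assumes Gaussianity, equal variances
print(p)                                    # the p-value: p(D|H0)

u, p_np = mannwhitneyu(treatment_a, treatment_b)  # non-parametric alternative
print(p_np)
```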
  26. Statistical hypothesis tests
      Example of the Higgs boson (04/07/2012)
      $H_0$: a world without the Higgs; $H_1$: a world with the Higgs; $p(D|H_0) \approx 3 \times 10^{-7}$
      • Bayes' theorem: $p(H_0|D) = \frac{p(D|H_0) \, p(H_0)}{p(D)}$
      • Headline explanation: $p(H_1|D) = 0.999999 = 1 - p(H_0|D)$, obtained by wrongly equating $p(H_0|D)$ with $p(D|H_0)$ (sketch below)
      https://en.wikipedia.org/wiki/Misuse_of_p-values
      Conclusion: before bashing the p-value, let's start computing it correctly
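A hedged numerical illustration of the slide's point that $p(H_0|D) \neq p(D|H_0)$; only the p-value comes from the slide, while the prior and the likelihood under $H_1$ are arbitrary assumptions made up for this sketch:

```python
p_D_given_H0 = 3e-7  # p-value quoted on the slide
prior_H0 = 0.5       # assumed prior, NOT from the talk
p_D_given_H1 = 0.5   # assumed likelihood under H1, NOT from the talk

p_D = p_D_given_H0 * prior_H0 + p_D_given_H1 * (1.0 - prior_H0)  # total probability
p_H0_given_D = p_D_given_H0 * prior_H0 / p_D                     # Bayes' theorem
print(p_H0_given_D)  # differs from p(D|H0): equating the two is the headline's error
```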
  27. Conclusion
  28. Take-home messages
      • Normalization is crucial
      • Normalization is an ambiguous word: normalized can mean normed, rescaled, whitened, or Gaussianized
  29. Take-home messages
      • Avoid black boxes, because there are hypotheses behind each equation
      • Check your data, plot histograms
  30. Thank you
      Contact: q.barthelemy@foxstream.fr
