
DutchMLSchool. Logistic Regression, Deepnets, Time Series


DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.

Published in: Data & Analytics

  1. 1st edition | July 8-11, 2019
  2. #DutchMLSchool, BigML, Inc. Logistic Regression, Deepnets and Univariate Time Series: Going Further With Supervised Learning. Charles Parker, VP Machine Learning Algorithms
  3. Logistic Regression
  4. Supervised learning review. Classification: features (animal, state, …, proximity) predict a discrete label (action), e.g. tiger/hungry/close → run, elephant/happy/far → take picture. Regression: the same kind of features predict a numeric label (min_kmh), e.g. tiger/hungry/close → 70, hippo/angry/far → 10.
  5. Logistic Regression. Potential confusion: classification implies a discrete objective, so how can this be a "regression"? Logistic Regression is a classification algorithm.
  6. Linear Regression
  7. Linear Regression
  8. Polynomial Regression
  9. Regression. Key take-away: regression is the process of "fitting" a function to the data. Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE. Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE. Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE. New problem: what if we want to do a classification problem, T/F or 1/0? What function can we fit to discrete data?
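The idea of "fitting" a function to data can be made concrete in a few lines. This is a minimal sketch of closed-form simple linear regression on a toy dataset (the data and names are illustrative, not from the slides):

```python
# Fit beta0 + beta1 * x by ordinary least squares, in closed form.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx
    return beta0, beta1

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]                 # exactly y = 1 + 2x
beta0, beta1 = fit_linear(xs, ys)    # recovers (1.0, 2.0)
```

Quadratic regression works the same way, just with an extra (INPUT)² column in the design.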
  10. Discrete Data Function?
  11. Discrete Data Function?
  12. Logistic Function: f(x) = 1 / (1 + e⁻ˣ). As x → -∞, f(x) → 0; as x → ∞, f(x) → 1. Goal: looks promising, but still not "discrete". What about the "green" in the middle? Let's change the problem…
  13. Modeling Probabilities: on the left P ≈ 0, in the middle 0 < P < 1, on the right P ≈ 1.
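The logistic function from slide 12 is trivially checkable in code; it squashes any real input into (0, 1), which is what lets us read the output as a probability:

```python
import math

# f(x) = 1 / (1 + e^(-x)): approaches 0 as x -> -inf, 1 as x -> +inf.
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

low, mid, high = logistic(-10), logistic(0), logistic(10)
# low is near 0, mid is exactly 0.5, high is near 1
```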
  14. Logistic Regression. Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective. Caveat: it assumes that the output is linearly related to the "predictors". Question: how do we "fit" the logistic function to real data?
  15. Logistic Regression: for i dimensions, 𝑿 = [x₁, x₂, ⋯, xᵢ], we solve P(𝑿) = 1 / (1 + e^(−f(𝑿))), where f(𝑿) = β₀ + 𝞫·𝑿 = β₀ + β₁x₁ + ⋯ + βᵢxᵢ.
  16. Interpreting Coefficients. LR computes β₀ and a coefficient βⱼ for each feature xⱼ. Negative βⱼ → negatively correlated: xⱼ↑ then P(𝑿)↓. Positive βⱼ → positively correlated: xⱼ↑ then P(𝑿)↑. "Larger" βⱼ → more impact; "smaller" βⱼ → less impact. βⱼ "size" should not be confused with field importance.
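The coefficient-sign rules on slide 16 can be seen directly by plugging numbers into the model. The betas here are made up for illustration, not learned from data:

```python
import math

# P(X) = logistic(beta0 + sum_j beta_j * x_j)
def p(x, betas, beta0=0.0):
    f = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-f))

betas = [2.0, -1.5]               # x1 positively, x2 negatively correlated
base = p([1.0, 1.0], betas)
more_x1 = p([2.0, 1.0], betas)    # raising x1 raises P(X)
more_x2 = p([1.0, 2.0], betas)    # raising x2 lowers P(X)
```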
  17. LR versus DT. Logistic Regression: expects a "smooth" linear relationship with predictors; is concerned with the probability of a discrete outcome; has lots of parameters to get wrong (regularization, scaling, codings); is slightly less prone to over-fitting; because it fits a shape, it might work better when less data is available. Decision Tree: adapts well to ragged non-linear relationships; no concerns about the task (classification, regression, multi-class are all fine); virtually parameter free; slightly more prone to over-fitting; prefers surfaces parallel to the parameter axes, but given enough data it will discover any shape.
  18. Summary. Logistic Regression is a classification algorithm that models the probabilities of each class. It expects a linear relationship between the features and the objective, and there are ways to fix it when that does not hold. LR outputs a set of coefficients to interpret: scale relates to size of impact, sign relates to direction of impact.
  19. Deep Neural Networks
  20. Power To The People! Why another supervised learning algorithm? Deep neural networks have been shown to be state of the art in several niche applications: vision, speech recognition, NLP. While powerful, these networks have historically been difficult for novices to train.
  21. Goals of BigML Deepnets. What BigML Deepnets are not (yet): convolutional networks (coming soon!) or recurrent networks (e.g., LSTM networks); those solve particular types of sub-problems and are carefully engineered by experts to do so. Can we bring some of the power of deep neural networks to your problem, even if you have no deep learning expertise? Let's try to separate deep neural network myths from realities.
  22. Myth #1: deep neural networks are the next step in evolution, destined to perfect humanity or destroy it utterly.
  23. Some Weaknesses. Trees. Pro: massive representational power that expands as the data gets larger, plus efficient search through this space. Con: difficult to represent smooth functions and functions of many variables; ensembles mitigate some of these difficulties. Logistic Regression. Pro: some smooth, multivariate functions are not a problem; fast optimization. Con: parametric, so if the decision boundary is nonlinear, tough luck. Can these weaknesses be mitigated?
  24. Logistic Level Up (diagram): a logistic regression drawn as a network, with inputs feeding outputs.
  25. Logistic Level Up (diagram): the class "a" output node computes logistic(w, b) over the weighted inputs wᵢ.
  26. Logistic Level Up (diagram): now insert a hidden layer between the inputs and the outputs.
  27. Logistic Level Up (diagram): each hidden node (e.g., hidden node 1) is itself a logistic unit, logistic(w, b).
  28. Logistic Level Up (diagram): why stop there? n hidden nodes?
  29. Logistic Level Up (diagram): n hidden layers?
  30. Logistic Level Up (diagram): the full network, with the hidden nodes and the class "a" output all computing logistic(w, b).
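The "logistic level up" sequence above can be sketched as a forward pass: a hidden layer of logistic units feeding one more logistic unit for class "a". The weights here are arbitrary placeholders, not trained values:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# One layer: each node applies logistic(w . inputs + b).
def layer(inputs, weights, biases):
    return [logistic(sum(w * i for w, i in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.0]
hidden = layer(x, weights=[[1.0, -2.0], [0.5, 0.5]], biases=[0.0, 0.1])
output = layer(hidden, weights=[[1.5, -1.0]], biases=[-0.2])[0]  # P(class "a")
```

Stacking more calls to `layer` gives the "n hidden layers" case.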
  31. Myth #2: deep neural networks are great for the established marquee applications, but less interesting for general use.
  32. Parameter Paralysis. Parameter name → possible values: descent algorithm → Adam, RMSProp, Adagrad, Momentum, FTRL; number of hidden layers → 0-32; activation function (per layer) → relu, tanh, sigmoid, softplus, etc.; number of nodes (per layer) → 1-8192; learning rate → 0-1; dropout rate → 0-1; batch size → 1-1024; batch normalization → true/false; learn residuals → true/false; missing numerics → true/false; objective weights → weight per class. And that's ignoring the parameters that are specific to the descent algorithm.
  33. What Can We Do? Clearly there are too many parameters to fuss with, and setting them takes significant expert knowledge. Solution: metalearning (a good initial guess). Solution: network search (try a bunch).
  34. Bayesian Parameter Optimization (diagram sequence, slides 34-38): candidate structures 1-6 are modeled and evaluated one at a time, producing scores (0.75, 0.48, 0.91, …); the accumulated structure → performance pairs become a machine learning problem in their own right ("Machine Learning!"), and the learned model suggests which structure to evaluate next.
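The loop on slides 34-38 can be sketched in miniature. Everything here is made up for illustration: the one-dimensional "structure" space, the synthetic `evaluate()`, and the nearest-neighbour "surrogate". A real Bayesian optimizer would fit a probabilistic surrogate model and balance exploration against exploitation:

```python
import random

random.seed(0)

def evaluate(n_nodes):
    # Stand-in for "train a network with n_nodes and score it":
    # a synthetic curve that peaks at 64 nodes.
    return 1.0 - abs(n_nodes - 64) / 256.0

def search(candidates, budget=5):
    history = []                                  # evaluated (structure, score)
    for _ in range(budget):
        tried = [h[0] for h in history]
        untried = [c for c in candidates if c not in tried]
        if len(history) < 2:
            choice = random.choice(untried)       # seed with random picks
        else:
            # Predict each untried structure's score from its nearest
            # evaluated neighbour, then try the most promising one.
            def predicted(c):
                return min(history, key=lambda h: abs(h[0] - c))[1]
            choice = max(untried, key=predicted)
        history.append((choice, evaluate(choice)))
    return max(history, key=lambda h: h[1])

best_structure, best_score = search([1, 8, 16, 32, 64, 128, 256])
```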
  39. Benchmarking. The ML world is filled with crummy benchmarks: not enough datasets, no cross-validation, only one metric. Solution: roll our own: 50+ datasets, 5 replications of 10-fold CV, 10 different metrics, 30+ competing algorithms (R, scikit-learn, weka, xgboost). http://www.clparker.org/ml_benchmark/
  40. Myth #3: deep neural networks are not interpretable.
  41. Explainability. Recent work in model interpretation applies broadly to any model: feature importance (overall) and prediction explanation (feature importance for a given prediction). Most (good) techniques rely on data perturbation and multiple predictions.
  42. Myth #4: deep neural networks have such spectacular performance that all other supervised learning techniques are now irrelevant.
  43. Caveat Emptor. Things that make deep learning less useful: small data (where that could still be thousands of instances); problems where you could benefit by iterating quickly (better features always beat better models); problems that are easy, or for which top-of-the-line performance isn't absolutely critical. Remember, deep learning is just another sort of supervised learning algorithm. "…deep learning has existed in the neural network community for over 20 years. Recent advances are driven by some relatively minor improvements in algorithms and models and by the availability of large data sets and much more powerful collections of computers." — Stuart Russell
  44. Univariate Time Series
  45. Beyond IID Data. Traditional machine learning data is assumed to be IID: independent (points carry no information about each other's class) and identically distributed (all drawn from the same distribution). But what if you want to predict just the next value in a sequence? Is all lost? Applications: predicting battery life from charge-discharge cycles; predicting sales for the next day/week/month.
  46. Machine Learning Data. A table of (Color, Mass, Type) rows, e.g. (red, 11, pen), (green, 45, apple), (yellow, 555, pineapple). Discovering patterns within the data: Color = "red" → Mass < 100; Type = "pineapple" → Color ≠ "blue"; Color = "blue" → PPAP = "pen".
  47. Machine Learning Data. The same table with its rows reshuffled: the patterns remain valid despite reshuffling.
  48. Time Series Data. Yearly pineapple harvest in tons, 1986-2000: 50.74, 22.03, 50.69, 40.38, 29.80, 9.90, 73.93, 22.95, 139.09, 115.17, 193.88, 175.31, 223.41, 295.03, 450.53. Plotted in order, the series shows a clear upward trend.
  49. Time Series Data. The same values with the years reshuffled: the trend disappears; the patterns are invalid after shuffling.
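The shuffling point can be quantified: the lag-1 autocorrelation of a trending series collapses once the rows are reordered, which is exactly why order-blind resampling destroys time-series patterns. The data here is synthetic, not the slides' harvest numbers:

```python
import random
import statistics

random.seed(1)

def lag1_autocorr(xs):
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den

series = [t + random.gauss(0, 1) for t in range(50)]  # noisy upward trend
shuffled = list(series)
random.shuffle(shuffled)

ordered_r = lag1_autocorr(series)     # close to 1: order carries signal
shuffled_r = lag1_autocorr(shuffled)  # near 0: the signal is gone
```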
  50. Prediction: use the data from the past to predict the future.
  51. Exponential Smoothing
  52. Exponential Smoothing. (Chart: the weight assigned to each lag, decaying exponentially from about 0.2 at lag 1 toward 0 by lag 13.)
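Simple (single) exponential smoothing is the building block behind that decaying-weight chart: with smoothing weight alpha, the observation k steps back gets weight alpha·(1 − alpha)ᵏ. A minimal sketch, fed the first five harvest values from the Time Series Data slide:

```python
def exponential_smoothing(xs, alpha=0.5):
    level = xs[0]
    for x in xs[1:]:
        level = alpha * x + (1 - alpha) * level   # blend new point into level
    return level                                  # flat one-step-ahead forecast

forecast = exponential_smoothing([50.74, 22.03, 50.69, 40.38, 29.80])
```

Adding trend and seasonality terms to this recurrence gives the model family on the slides that follow.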
  53. Trend. (Charts, Apr-Jul: an additive trend grows by a constant amount per step; a multiplicative trend grows by a constant factor, so it curves upward.)
  54. Seasonality. (Charts: additive seasonality has swings of constant size; multiplicative seasonality has swings that scale with the level of the series.)
  55. Error. (Charts: additive error adds noise of constant size; multiplicative error adds noise that scales with the level of the series.)
  56. Model Types. Each model is labeled (Error, Trend, Seasonality), with error ∈ {A, M}, trend ∈ {N, A, Ad, M, Md}, and seasonality ∈ {N, A, M}:

      Trend \ Seasonality      None            Additive        Multiplicative
      None                     A,N,N  M,N,N    A,N,A  M,N,A    A,N,M  M,N,M
      Additive                 A,A,N  M,A,N    A,A,A  M,A,A    A,A,M  M,A,M
      Additive + Damped        A,Ad,N M,Ad,N   A,Ad,A M,Ad,A   A,Ad,M M,Ad,M
      Multiplicative           A,M,N  M,M,N    A,M,A  M,M,A    A,M,M  M,M,M
      Multiplicative + Damped  A,Md,N M,Md,N   A,Md,A M,Md,A   A,Md,M M,Md,M

      Example: M,N,A = multiplicative error, no trend, additive seasonality.
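The three-letter labels in the table decode mechanically. The labels follow the slide; the helper function itself is just an illustration, not part of any library:

```python
# Decode an (Error, Trend, Seasonality) label such as "M,N,A".
ERROR = {"A": "additive", "M": "multiplicative"}
TREND = {"N": "none", "A": "additive", "Ad": "additive damped",
         "M": "multiplicative", "Md": "multiplicative damped"}
SEASONALITY = {"N": "none", "A": "additive", "M": "multiplicative"}

def describe(label):
    error, trend, seasonality = label.split(",")
    return {"error": ERROR[error], "trend": TREND[trend],
            "seasonality": SEASONALITY[seasonality]}

m_n_a = describe("M,N,A")
# {'error': 'multiplicative', 'trend': 'none', 'seasonality': 'additive'}
```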
  57. Evaluating Model Fit. AIC: Akaike Information Criterion; tries to trade off accuracy and model complexity. AICc: like the AIC, but with a sample-size correction. BIC: Bayesian Information Criterion; like the AIC, but penalizes large numbers of parameters more harshly. R-squared: raw performance; the number of model parameters isn't considered.
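The slides don't give formulas, so here are the standard least-squares forms of these criteria (n = sample size, k = number of fitted parameters, rss = residual sum of squares), with a toy comparison showing the complexity penalty at work:

```python
import math

def aic(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def aicc(rss, n, k):
    # AIC plus a small-sample correction term.
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(rss, n, k):
    # Like AIC, but the k*ln(n) penalty grows with sample size.
    return n * math.log(rss / n) + k * math.log(n)

# A bigger model must buy a real drop in RSS to be worth its parameters:
small_model = aic(rss=120.0, n=30, k=2)
big_model = aic(rss=115.0, n=30, k=8)   # barely better fit, 6 extra params
```

Lower is better for all three criteria; here the small model wins despite its slightly worse fit.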
  58. Co-organized by: Sponsor: Business Partners:
