
# DutchMLSchool. Logistic Regression, Deepnets, Time Series


DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.

Published in: Data & Analytics

### DutchMLSchool. Logistic Regression, Deepnets, Time Series

1. 1st edition | July 8-11, 2019
2. #DutchMLSchool | BigML, Inc. Logistic Regression, Deepnets and Univariate Time Series: Going Further With Supervised Learning. Charles Parker, VP Machine Learning Algorithms
3. Logistic Regression
4. Supervised learning review. Classification predicts a discrete label (e.g., animal=tiger, state=hungry, …, proximity=close → action=run; animal=elephant, state=happy, …, proximity=far → action=take picture). Regression predicts a numeric label (e.g., animal=tiger, state=hungry, …, proximity=close → min_kmh=70; animal=hippo, state=angry, …, proximity=far → min_kmh=10).
5. Logistic Regression. Potential confusion: classification implies a discrete objective, so how can this be a "regression"? Logistic Regression is a classification algorithm.
6. Linear Regression
7. Linear Regression
8. Polynomial Regression
9. Regression. Key take-away: regression is the process of "fitting" a function to the data. Linear Regression: 𝛽₀ + 𝛽₁·(INPUT) ≈ OBJECTIVE. Quadratic Regression: 𝛽₀ + 𝛽₁·(INPUT) + 𝛽₂·(INPUT)² ≈ OBJECTIVE. Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE. New problem: what if we want to do a classification problem (T/F or 1/0)? What function can we fit to discrete data?
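The "fitting" idea on slide 9 can be sketched in a few lines: a closed-form least-squares fit of 𝛽₀ + 𝛽₁·x. The data points below are invented purely for illustration.

```python
# Minimal sketch of "fitting" a function to data: closed-form least
# squares for beta0 + beta1*x. The toy data is made up for illustration.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]  # roughly 1 + 2x plus a little noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# beta1 = cov(x, y) / var(x); beta0 makes the line pass through the means.
beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
beta0 = mean_y - beta1 * mean_x
```

The same "fit a function" recipe carries over to the quadratic and decision-tree variants; only the family of candidate functions changes.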
10. Discrete Data Function?
11. Discrete Data Function?
12. Logistic Function: f(x) = 1 / (1 + e⁻ˣ). As x → -∞, f(x) → 0; as x → ∞, f(x) → 1. Looks promising, but still not "discrete". What about the "green" in the middle? Let's change the problem…
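The logistic function from slide 12 is one line of code; a quick sketch of its limiting behavior:

```python
import math

def logistic(x):
    """f(x) = 1 / (1 + e^(-x)): squashes any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# f(0) is exactly 0.5; large |x| pushes f toward the 0/1 extremes.
print(logistic(0.0))    # 0.5
print(logistic(20.0))   # very close to 1
print(logistic(-20.0))  # very close to 0
```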
13. Modeling Probabilities: P ≈ 0, 0 < P < 1, P ≈ 1.
14. Logistic Regression. Clarification: LR is a classification algorithm that uses a regression to model the probability of the discrete objective. Caveats: it assumes that the output is linearly related to the "predictors". Question: how do we "fit" the logistic function to real data?
15. Logistic Regression. For i dimensions, X = [x₁, x₂, ⋯, xᵢ], we solve P(X) = 1 / (1 + e⁻ᶠ⁽ᕽ⁾), where f(X) = 𝛽₀ + 𝞫·X = 𝛽₀ + 𝛽₁x₁ + ⋯ + 𝛽ᵢxᵢ.
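Slide 15's formula, translated directly into code. The coefficient and input values in the example call are arbitrary, chosen only to exercise the function:

```python
import math

def probability(beta0, betas, xs):
    """P(X) for logistic regression over an i-dimensional input X."""
    # f(X) = beta0 + beta1*x1 + ... + betai*xi
    f = beta0 + sum(b * x for b, x in zip(betas, xs))
    # P(X) = 1 / (1 + e^(-f(X)))
    return 1.0 / (1.0 + math.exp(-f))

# With all coefficients zero, f(X) = 0 and P(X) = 0.5 regardless of input.
print(probability(0.0, [0.0, 0.0], [3.0, -7.0]))  # 0.5
```

Fitting means searching for the 𝛽 values that make these probabilities match the observed labels, typically by maximizing the likelihood.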
16. Interpreting Coefficients. LR computes 𝛽₀ and a coefficient 𝛽ⱼ for each feature xⱼ. Negative 𝛽ⱼ → negatively correlated: xⱼ↑ then P(X)↓. Positive 𝛽ⱼ → positively correlated: xⱼ↑ then P(X)↑. "Larger" 𝛽ⱼ → more impact: a small change in xⱼ moves P(X) a lot. "Smaller" 𝛽ⱼ → less impact: a large change in xⱼ moves P(X) only a little. 𝛽ⱼ "size" should not be confused with field importance.
17. LR versus DT. Logistic Regression: expects a "smooth" linear relationship with predictors; is concerned with the probability of a discrete outcome; has lots of parameters to get wrong (regularization, scaling, codings); is slightly less prone to over-fitting; because it fits a shape, it might work better when less data is available. Decision Tree: adapts well to ragged non-linear relationships; no concerns (classification, regression, and multi-class are all fine); virtually parameter free; slightly more prone to over-fitting; prefers surfaces parallel to parameter axes, but given enough data will discover any shape.
18. Summary. Logistic Regression is a classification algorithm that models the probability of each class. It expects a linear relationship between the features and the objective, and we saw how to fix it when that fails. LR outputs a set of coefficients, and we saw how to interpret them: scale relates to impact; sign relates to direction of impact.
19. Deep Neural Networks
20. Power To The People! Why another supervised learning algorithm? Deep neural networks have been shown to be state of the art in several niche applications: vision, speech recognition, NLP. While powerful, these networks have historically been difficult for novices to train.
21. Goals of BigML Deepnets. What BigML Deepnets are not (yet): convolutional networks (coming soon!) and recurrent networks (e.g., LSTM networks); these solve a particular type of sub-problem, and are carefully engineered by experts to do so. Can we bring some of the power of deep neural networks to your problem, even if you have no deep learning expertise? Let's try to separate deep neural network myths from realities.
22. Myth #1: Deep neural networks are the next step in evolution, destined to perfect humanity or destroy it utterly.
23. Some Weaknesses. Trees. Pro: massive representational power that expands as the data gets larger, with efficient search through this space. Con: difficult to represent smooth functions and functions of many variables; ensembles mitigate some of these difficulties. Logistic Regression. Pro: some smooth, multivariate functions are not a problem, and optimization is fast. Con: parametric; if the decision boundary is nonlinear, tough luck. Can these be mitigated?
24. Logistic Level Up (diagram: inputs → outputs)
25. Logistic Level Up (diagram: class "a" = logistic(w, b) over weighted inputs wᵢ)
26. Logistic Level Up (diagram: inputs → hidden layer → outputs)
27. Logistic Level Up (diagram: hidden node 1 = logistic(w, b); class "a" = logistic(w, b) over hidden nodes)
28. Logistic Level Up (what about n hidden nodes?)
29. Logistic Level Up (what about n hidden layers?)
30. Logistic Level Up (diagram: class "a" = logistic(w, b); hidden node 1 = logistic(w, b))
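The "level up" in slides 24-30 is logistic regressions feeding logistic regressions. A sketch with invented weights (2 inputs, one 3-node hidden layer, 1 output); in a real deepnet, training would learn these values:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(weights, biases, inputs):
    # Each node in a layer is its own logistic regression over the inputs.
    return [logistic(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def network(inputs):
    # Weights and biases are arbitrary here, for illustration only.
    hidden = layer([[0.5, -1.0], [1.2, 0.3], [-0.7, 0.9]],
                   [0.1, -0.2, 0.0], inputs)
    (output,) = layer([[1.0, -0.5, 0.8]], [0.0], hidden)
    return output  # probability of class "a"

print(network([1.0, 2.0]))  # a probability strictly between 0 and 1
```

Stacking more hidden layers is just more calls to `layer`, which is exactly the "n hidden layers?" question on slide 29.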
31. Myth #2: Deep neural networks are great for the established marquee applications, but less interesting for general use.
32. Parameter Paralysis. Descent algorithm: Adam, RMSProp, Adagrad, Momentum, FTRL. Number of hidden layers: 0-32. Activation function (per layer): relu, tanh, sigmoid, softplus, etc. Number of nodes (per layer): 1-8192. Learning rate: 0-1. Dropout rate: 0-1. Batch size: 1-1024. Batch normalization: true/false. Learn residuals: true/false. Missing numerics: true/false. Objective weights: weight per class. And that's ignoring the parameters that are specific to the descent algorithm.
33. What Can We Do? Clearly there are too many parameters to fuss with, and setting them takes significant expert knowledge. Solution: metalearning (a good initial guess). Solution: network search (try a bunch).
34. Bayesian Parameter Optimization. Model and evaluate candidate structures 1-6.
35. Bayesian Parameter Optimization. Structure 1 evaluates to 0.75.
36. Bayesian Parameter Optimization. Structure 2 evaluates to 0.48.
37. Bayesian Parameter Optimization. Structure 3 evaluates to 0.91.
38. Bayesian Parameter Optimization. Machine learning! Learn a model of structure → performance, and use it to pick the next structures to model and evaluate.
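The loop in slides 34-38 can be caricatured as: evaluate a few structures, then let the results so far guide which structure to try next. A toy sketch; the scoring function, the `nodes` parameter, and the "propose near the best" rule are all invented stand-ins (a real Bayesian optimizer fits a probabilistic surrogate model of structure → performance):

```python
import random

def evaluate(structure):
    # Stand-in for "train a deepnet with this structure and score it".
    # This quadratic, peaking at 64 nodes, is invented for illustration.
    return 1.0 - ((structure["nodes"] - 64) / 64.0) ** 2

random.seed(42)
tried = [{"nodes": n} for n in (8, 32, 128)]   # initial structures
scores = [evaluate(s) for s in tried]

for _ in range(10):
    # Crude surrogate: propose a structure near the best one seen so far.
    best = tried[scores.index(max(scores))]
    candidate = {"nodes": max(1, best["nodes"] + random.randint(-16, 16))}
    tried.append(candidate)
    scores.append(evaluate(candidate))

best = tried[scores.index(max(scores))]
```

Each evaluation is expensive (a full model training), which is why spending a little computation deciding *where* to evaluate next pays off.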
39. Benchmarking. The ML world is filled with crummy benchmarks: not enough datasets, no cross-validation, only one metric. Solution: roll our own. 50+ datasets, 5 replications of 10-fold CV, 10 different metrics, 30+ competing algorithms (R, scikit-learn, weka, xgboost). http://www.clparker.org/ml_benchmark/
40. Myth #3: Deep neural networks are not interpretable.
41. Explainability. Recent work in model interpretation applies broadly to any model: feature importance (overall) and prediction explanation (feature importance for a given prediction). Most (good) techniques rely on data perturbation and multiple predictions.
42. Myth #4: Deep neural networks have such spectacular performance that all other supervised learning techniques are now irrelevant.
43. Caveat Emptor. Things that make deep learning less useful: small data (where that could still be thousands of instances); problems where you could benefit by iterating quickly (better features always beat better models); problems that are easy, or for which top-of-the-line performance isn't absolutely critical. Remember, deep learning is just another sort of supervised learning algorithm. "…deep learning has existed in the neural network community for over 20 years. Recent advances are driven by some relatively minor improvements in algorithms and models and by the availability of large data sets and much more powerful collections of computers." — Stuart Russell
44. Univariate Time Series
45. Beyond IID Data. Traditional machine learning data is assumed to be IID: independent (points carry no information about each other's class) and identically distributed (all come from the same distribution). But what if you want to predict just the next value in a sequence? Is all lost? Applications: predicting battery life from charge-discharge cycles; predicting sales for the next day/week/month.
46. Machine Learning Data. A table of (color, mass, type) rows: (red, 11, pen), (green, 45, apple), (red, 53, apple), (yellow, 0, pen), (blue, 2, pen), (green, 422, pineapple), (yellow, 555, pineapple), (blue, 7, pen). Discovering patterns within data: color = "red" → mass < 100; type = "pineapple" → color ≠ "blue"; color = "blue" → type = "pen".
47. Machine Learning Data. The same rows in a different order: the patterns remain valid despite reshuffling.
48. Time Series Data. Pineapple harvest (tons) by year, 1986-2000: 50.74, 22.03, 50.69, 40.38, 29.80, 9.90, 73.93, 22.95, 139.09, 115.17, 193.88, 175.31, 223.41, 295.03, 450.53. The chart shows a clear upward trend.
49. Time Series Data. The same values shuffled across the years: the patterns are invalid after shuffling.
50. Prediction. Use the data from the past to predict the future.
51. Exponential Smoothing
52. Exponential Smoothing (chart: the weight on a past observation decays with its lag, e.g., from about 0.2 at lag 1 toward 0 by lag 13).
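The geometrically decaying weights on slide 52 fall out of the simple exponential smoothing recurrence; a minimal sketch:

```python
def smooth(series, alpha):
    """Simple exponential smoothing: the level is a blend of the newest
    observation and the previous level, so the weight on an observation
    decays geometrically with its lag."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# A constant series smooths to itself; the final level is the one-step
# forecast for the next value.
print(smooth([10.0, 10.0, 10.0], 0.3))  # 10.0
```

Alpha near 1 trusts recent observations; alpha near 0 trusts the longer history. The trend and seasonality components on the following slides extend this same recurrence.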
53. Trend (charts: additive vs. multiplicative trend).
54. Seasonality (charts: additive vs. multiplicative seasonality).
55. Error (charts: additive vs. multiplicative error).
56. Model Types. Each model is a triple (Error, Trend, Seasonality): error is A (additive) or M (multiplicative); trend is N (none), A (additive), Ad (additive damped), M (multiplicative), or Md (multiplicative damped); seasonality is N (none), A (additive), or M (multiplicative). For example, M,N,A means multiplicative error, no trend, additive seasonality.
57. Evaluating Model Fit. AIC: Akaike Information Criterion; tries to trade off accuracy and model complexity. AICc: like the AIC, but with a sample-size correction. BIC: Bayesian Information Criterion; like the AIC but penalizes large numbers of parameters more harshly. R-squared: raw performance; the number of model parameters isn't considered.
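The criteria on slide 57 are simple formulas once you have a model's log-likelihood L, parameter count k, and sample size n. A sketch of the standard definitions (lower is better for all three; the example values are arbitrary):

```python
import math

def aic(log_lik, k):
    # Complexity penalty of 2 per parameter, minus 2 * log-likelihood.
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, n):
    # AIC plus a correction term that matters mainly for small samples.
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(log_lik, k, n):
    # Penalizes each parameter by log(n): harsher than AIC once n > e^2.
    return k * math.log(n) - 2 * log_lik

print(aic(-100.0, 3))        # 206.0
print(aicc(-100.0, 3, 100))  # 206.25
```

This is why the slide orders them by harshness: for any realistic n, BIC penalizes extra parameters more than AIC, while R-squared ignores parameter count entirely.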