DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Logistic Regression, Deepnets, and Univariate Time Series
Going Further With Supervised Learning
Charles Parker
VP Machine Learning Algorithms
Regression
• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
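As a minimal sketch of what "fitting" means here (illustrative only; assumes NumPy and synthetic data, neither of which is from the deck):

# Fit linear and quadratic regressions to noisy toy data.
import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=x.shape)   # noisy OBJECTIVE
b1, b0 = np.polyfit(x, y, deg=1)        # linear:    b0 + b1*INPUT
c2, c1, c0 = np.polyfit(x, y, deg=2)    # quadratic: c0 + c1*INPUT + c2*INPUT^2
print("linear fit coefficients:   ", b0, b1)
print("quadratic fit coefficients:", c0, c1, c2)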
Key Take-Away: Regression is the process of "fitting" a function to the data
NEW PROBLEM
• What if we want to do a classification problem: T/F or 1/0?
• What function can we fit to discrete data?
Logistic Regression
Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective
Caveats:
• Assumes that the output is linearly related to the "predictors"
• Question: how do we "fit" the logistic function to real data?
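For reference, the function being fit is the logistic (sigmoid) of a linear combination of the predictors; in standard notation (not copied from the slide):

P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}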
Interpreting Coefficients
• LR computes β₀ and coefficients βⱼ for each feature xⱼ
• negative βⱼ → negatively correlated: xⱼ↑ then P(X)↓
• positive βⱼ → positively correlated: xⱼ↑ then P(X)↑
• "larger" βⱼ → more impact: a small change in xⱼ produces a large change in P(X)
• "smaller" βⱼ → less impact: even a large change in xⱼ produces only a small change in P(X)
• βⱼ "size" should not be confused with field importance
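A standard way to read these coefficients (a general logistic-regression fact rather than something stated on the slide): the model is linear in the log-odds,

\log \frac{P(\mathbf{X})}{1 - P(\mathbf{X})} = \beta_0 + \sum_j \beta_j x_j

so a one-unit increase in xⱼ multiplies the odds by e^{βⱼ}: the sign of βⱼ gives the direction of the effect and its magnitude the strength.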
LR versus DT
• Expects a "smooth" linear
relationship with predictors.
• LR is concerned with probability of
a discrete outcome.
• Lots of parameters to get wrong:
regularization, scaling, codings
• Slightly less prone to over-fitting
• Because fits a shape, might work
better when less data available.
• Adapts well to ragged non-linear
relationships
• No concern: classification,
regression, multi-class all fine.
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to
parameter axes, but given enough
data will discover any shape.
Logistic Regression Decision Tree
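A hedged sketch of the contrast (assumes scikit-learn; the dataset and settings are illustrative, not from the deck):

# A linear model vs. a tree on a ragged, non-linear decision boundary.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
lr = LogisticRegression()        # expects a linear boundary
dt = DecisionTreeClassifier()    # axis-parallel splits, no linearity assumption
print("LR accuracy:", cross_val_score(lr, X, y, cv=5).mean())
print("DT accuracy:", cross_val_score(dt, X, y, cv=5).mean())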
Summary
• Logistic Regression is a classification algorithm that models the probabilities of each class
• Expects a linear relationship between the features and the objective (and we saw how to work around it when that assumption fails)
• LR outputs a set of coefficients, and we saw how to interpret them:
• Scale relates to impact
• Sign relates to direction of impact
Power To The People!
• Why another supervised learning algorithm?
• Deep neural networks have been shown to be
state of the art in several niche applications
• Vision
• Speech recognition
• NLP
• While powerful, these networks have historically
been difficult for novices to train
Goals of BigML Deepnets
• What BigML Deepnets are not (yet)
• Convolutional networks (Coming Soon!)
• Recurrent networks (e.g., LSTM Networks)
• These solve a particular type of sub-problem, and
are carefully engineered by experts to do so
• Can we bring some of the power of Deep Neural
Networks to your problem, even if you have no
deep learning expertise?
• Let’s try to separate deep neural network myths
from realities
Some Weaknesses
• Trees
• Pro: Massive representational power that expands as the data
gets larger; efficient search through this space
• Con: Difficult to represent smooth functions and functions of
many variables
• Ensembles mitigate some of these difficulties
• Logistic Regression
• Pro: Some smooth, multivariate functions are not a problem; fast optimization
• Con: Parametric: if the decision boundary is non-linear, tough luck
• Can these be mitigated?
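One way of mitigating both, sketched with scikit-learn's MLP standing in for a deepnet (illustrative data and settings, not the deck's own example):

# A small neural network handles a smooth, non-linear boundary that defeats LR.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
lr = LogisticRegression()    # a linear boundary struggles on concentric circles
net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
print("LR accuracy: ", cross_val_score(lr, X, y, cv=5).mean())
print("MLP accuracy:", cross_val_score(net, X, y, cv=5).mean())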
Parameter Paralysis
Parameter Name: Possible Values
• Descent Algorithm: Adam, RMSProp, Adagrad, Momentum, FTRL
• Number of hidden layers: 0 - 32
• Activation Function (per layer): relu, tanh, sigmoid, softplus, etc.
• Number of nodes (per layer): 1 - 8192
• Learning Rate: 0 - 1
• Dropout Rate: 0 - 1
• Batch size: 1 - 1024
• Batch Normalization: True, False
• Learn Residuals: True, False
• Missing Numerics: True, False
• Objective weights: Weight per class
… and that's ignoring the parameters that are specific to the descent algorithm.
What Can We Do?
• Clearly there are too many parameters to fuss with
• Setting them takes significant expert knowledge
• Solution: Metalearning (a good initial guess)
• Solution: Network search (try a bunch)
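A hedged sketch of the "try a bunch" idea using scikit-learn's randomized search (the search space below is illustrative; BigML's own metalearning and network search are not shown here):

# Random search over a few network hyperparameters, scored by cross-validation.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
space = {
    "hidden_layer_sizes": [(16,), (64,), (64, 64), (128, 64)],
    "activation": ["relu", "tanh", "logistic"],
    "learning_rate_init": [0.0001, 0.001, 0.01, 0.1],
    "batch_size": [32, 128, 512],
}
search = RandomizedSearchCV(MLPClassifier(max_iter=500, random_state=0),
                            space, n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print("best parameters:", search.best_params_)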
Benchmarking
• The ML world is filled with crummy benchmarks
• Not enough datasets
• No cross-validation
• Only one metric
• Solution: Roll our own
• 50+ datasets, 5 replications of 10-fold CV
• 10 different metrics
• 30+ competing algorithms (R, scikit-learn, weka, xgboost)
http://www.clparker.org/ml_benchmark/
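A minimal sketch of that evaluation protocol, replicated 10-fold CV with more than one metric (assumes scikit-learn; the model and dataset are placeholders, not those used in the benchmark):

# 5 replications of 10-fold cross-validation, scored with several metrics.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_validate(LogisticRegression(max_iter=5000), X, y, cv=cv,
                        scoring=["accuracy", "roc_auc", "f1"])
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})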
Explainability
• Recent work in model interpretation applies
broadly to any model
• Feature importance (overall)
• Prediction explanation (feature importance
for a given prediction)
• Most (good) techniques rely on data perturbation
and multiple predictions
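One widely used perturbation-based technique is permutation importance; a sketch with scikit-learn (not necessarily the method BigML uses):

# Shuffle one feature at a time and measure how much the model's score drops.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)   # any fitted model works
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)   # overall importance of each feature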
Caveat Emptor
• Things that make deep learning less useful:
• Small data (where that could still be thousands of instances)
• Problems where you could benefit by iterating quickly (better features always beat better models)
• Problems that are easy, or for which top-of-the-line
performance isn’t absolutely critical
• Remember deep learning is just another sort
of supervised learning algorithm
“…deep learning has existed in the neural network community for over 20 years. Recent advances are
driven by some relatively minor improvements in algorithms and models and by the availability of large
data sets and much more powerful collections of computers.” — Stuart Russell
Beyond IID Data
• Traditional machine learning data is assumed to
be IID
• Independent (points have no information about each
other’s class) and
• Identically distributed (come from the same distribution)
• But what if you want to predict just the next value
in a sequence? Is all lost?
• Applications
• Predicting battery life from charge-discharge cycles
• Predicting sales for the next day/week/month
Machine Learning Data
Color Mass Type
red 11 pen
green 45 apple
red 53 apple
yellow 0 pen
blue 2 pen
green 422 pineapple
yellow 555 pineapple
blue 7 pen
Discovering patterns within data:
• Color = "red" ⇒ Mass < 100
• Type = "pineapple" ⇒ Color ≠ "blue"
• Color = "blue" ⇒ PPAP = "pen"
Machine Learning Data
Color Mass Type
red 53 apple
blue 2 pen
red 11 pen
blue 7 pen
green 45 apple
yellow 555 pineapple
green 422 pineapple
yellow 0 pen
Patterns remain valid despite reshuffling:
• Color = "red" ⇒ Mass < 100
• Type = "pineapple" ⇒ Color ≠ "blue"
• Color = "blue" ⇒ PPAP = "pen"
Evaluating Model Fit
• AIC: Akaike Information Criterion; tries to trade off
accuracy and model complexity
• AICc: Like the AIC, but with a sample size
correction
• BIC: Bayesian Information Criterion; like the AIC
but penalizes large numbers of parameters more
harshly
• R-squared: Raw performance, the number of
model parameters isn’t considered
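For reference, the standard definitions, with k model parameters, n data points, and maximized likelihood \hat{L} (textbook formulas, not taken from the slide):

\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}

Lower values of AIC, AICc, and BIC indicate a better accuracy/complexity trade-off, whereas R-squared never decreases as parameters are added.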