Supervised Learning (Part II): Logistic Regression, Deepnets, and Time Series, by BigML.
MLSEV 2019: 1st edition of the Machine Learning School in Seville, Spain.
Logistic Regression
Modeling probabilities (for classification only)
Charles Parker
VP Machine Learning Algorithms
Supervised learning review

Classification (the objective, or "label", is a category):

    animal     state    …   proximity   action
    tiger      hungry   …   close       run
    elephant   happy    …   far         take picture
    …          …        …   …           …

Regression (the objective is a number):

    animal     state    …   proximity   min_kmh
    tiger      hungry   …   close       70
    hippo      angry    …   far         10
    …          …        …   …           …

In both cases the last column is the label we want to predict.
Logistic Regression

Potential confusion: Logistic Regression is a classification algorithm. But classification implies a discrete objective, so how can this be a "regression"?
Regression

Key take-away: regression is the process of "fitting" a function to the data.
• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE

New problem: what if we want to do a classification problem (T/F or 1/0)? What function can we fit to discrete data?
Logistic Function

f(x) = 1 / (1 + e^(−x))

Goal: as x → −∞, f(x) → 0; as x → ∞, f(x) → 1.
• Looks promising, but still not "discrete"
• What about the "green" transition region in the middle of the curve?
• Let's change the problem…
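A minimal sketch of the logistic function's limiting behavior (illustrative only):

```python
# The logistic function squashes any real input into (0, 1).
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(-10))  # ~0.00005: tends to 0 as x -> -inf
print(logistic(0))    # 0.5, the midpoint
print(logistic(10))   # ~0.99995: tends to 1 as x -> +inf
```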
Logistic Regression

Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective.

Caveats:
• Assumes that the output is linearly related to the "predictors"
• Question: how do we "fit" the logistic function to real data?
Warning: some math

"Someone told me that each equation I included in the book would halve the sales. In the end, however, I did put in one equation … I hope that this will not scare off half of my potential readers." — Stephen Hawking, A Brief History of Time
Logistic Regression

P(x) = 1 / (1 + e^−(β₀ + β₁x))

• β₀ is the "intercept"; β₁ is the "coefficient"
• Solve for β₀ and β₁ to fit the logistic function
• How? The inverse of the logistic function is called the "logit":

logit(P(x)) = ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• In which case solving is now a linear regression
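A hedged sketch of the fitting step using scikit-learn (one common way to solve for β₀ and β₁; BigML's own solver may differ). The toy parameters β₀ = −1 and β₁ = 2 are assumptions for the example:

```python
# Fit beta_0 and beta_1 on one-feature toy data, then check the logit
# identity: ln(P / (1 - P)) = beta_0 + beta_1 * x.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=(500, 1))
p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x[:, 0])))  # assumed true params
y = rng.binomial(1, p)

model = LogisticRegression(C=1e6)  # large C ~ almost no regularization
model.fit(x, y)
b0, b1 = model.intercept_[0], model.coef_[0, 0]
print(f"beta_0 = {b0:.2f}, beta_1 = {b1:.2f}")  # close to -1 and 2

# The predicted probability passed through the logit is linear in x:
p_hat = model.predict_proba([[1.0]])[0, 1]
print(np.log(p_hat / (1 - p_hat)), b0 + b1 * 1.0)  # these two match
```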
Interpreting Coefficients

• LR computes β₀ and a coefficient βⱼ for each feature xⱼ
• Negative βⱼ → negatively correlated: xⱼ↑ then P(X)↓
• Positive βⱼ → positively correlated: xⱼ↑ then P(X)↑
• "Larger" βⱼ → more impact: a small change in xⱼ moves P(X) a lot
• "Smaller" βⱼ → less impact: even a large change in xⱼ moves P(X) only a little
• βⱼ "size" should not be confused with field importance (it also reflects the scale of the feature)
LR versus DT

Logistic Regression:
• Expects a "smooth" linear relationship with predictors
• Concerned with the probability of a discrete outcome
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, it might work better when less data is available

Decision Tree:
• Adapts well to ragged, non-linear relationships
• No concern: classification, regression, multi-class are all fine
• Virtually parameter-free
• Slightly more prone to over-fitting
• Prefers decision surfaces parallel to the parameter axes, but given enough data will discover any shape
Summary

• Logistic Regression is a classification algorithm that models the probability of each class
• It expects a linear relationship between the features and the objective (and we saw how to work around deviations from that)
• LR outputs a set of coefficients, and we saw how to interpret them:
  • Scale relates to size of impact
  • Sign relates to direction of impact
Deepnets and Time Series
Going Further With Supervised Learning
Charles Parker
VP Machine Learning Algorithms
Power To The People!

• Why another supervised learning algorithm?
• Deep neural networks have been shown to be state of the art in several niche applications:
  • Vision
  • Speech recognition
  • NLP
• While powerful, these networks have historically been difficult for novices to train
Goals of BigML Deepnets

• What BigML Deepnets are not (yet):
  • Convolutional networks
  • Recurrent networks (e.g., LSTM networks)
  • These solve particular types of sub-problems, and are carefully engineered by experts to do so
• Can we bring some of the power of deep neural networks to your problem, even if you have no deep learning expertise?
• Let's try to separate deep neural network myths from realities
Myth #1
Deep neural networks are the next step in evolution,
destined to perfect humanity or destroy it utterly.
Some Weaknesses

• Trees
  • Pro: massive representational power that expands as the data gets larger; efficient search through this space
  • Con: difficult to represent smooth functions and functions of many variables
  • Ensembles mitigate some of these difficulties
• Logistic Regression
  • Pro: some smooth, multivariate functions are not a problem; fast optimization
  • Con: parametric; if the decision boundary is nonlinear, tough luck
• Can these be mitigated?
Logistic Level Up

[Diagram build-up over several slides:] A logistic regression is a single unit that computes logistic(w, b) over the weighted inputs wᵢ to give the probability of class "a". Now insert a hidden layer between inputs and outputs, where each hidden node computes its own logistic(w, b), and feed those activations into the output unit. Why stop there? n hidden nodes per layer? n hidden layers? Stacking logistic units this way gives a deep neural network.
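A minimal sketch of this "stacked logistics" view, with assumed shapes and random weights (illustrative only, not BigML's implementation):

```python
# A network as stacked logistic regressions: each hidden node computes
# logistic(w.x + b); the output node is another logistic over the
# hidden activations.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden layer of logistic units feeding a logistic output."""
    h = logistic(W_hidden @ x + b_hidden)   # n hidden nodes
    return logistic(w_out @ h + b_out)      # P(class "a")

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # 4 input features
W = rng.normal(size=(8, 4))                 # 8 hidden nodes
b = rng.normal(size=8)
w_o, b_o = rng.normal(size=8), 0.0
print(forward(x, W, b, w_o, b_o))           # a probability in (0, 1)
```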
Myth #2
Deep neural networks are great for the established
marquee applications, but less interesting for general use.
Parameter Paralysis

    Parameter Name                   Possible Values
    Descent Algorithm                Adam, RMSProp, Adagrad, Momentum, FTRL
    Number of hidden layers          0–32
    Activation Function (per layer)  relu, tanh, sigmoid, softplus, etc.
    Number of nodes (per layer)      1–8192
    Learning Rate                    0–1
    Dropout Rate                     0–1
    Batch size                       1–1024
    Batch Normalization              True, False
    Learn Residuals                  True, False
    Missing Numerics                 True, False
    Objective weights                Weight per class

… and that's ignoring the parameters that are specific to the descent algorithm.
What Can We Do?
• Clearly there are too many parameters to fuss with
• Setting them takes significant expert knowledge
• Solution: Metalearning (a good initial guess)
• Solution: Network search (try a bunch)
Bayesian Parameter Optimization

[Diagram build-up over several slides:] Candidate structures 1–6 feed into a "model and evaluate" step one at a time. As evaluation scores come back (0.75, 0.48, 0.91, …), they become training data for a model that maps structure → performance, and that model picks the most promising structure to evaluate next. Machine learning, applied to machine learning!
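A minimal sketch of the idea under stated assumptions: the structure encoding, ranges, and the stand-in `evaluate` function below are all illustrative, not BigML's actual search, and the surrogate here is a random forest rather than the Gaussian-process surrogate classically used in Bayesian optimization:

```python
# Surrogate-based search over network structures: evaluate a few,
# fit a regressor mapping structure -> performance, and use it to
# pick the next structure to try.
import random
from sklearn.ensemble import RandomForestRegressor

def random_structure():
    return [random.randint(1, 8),      # number of hidden layers
            random.randint(8, 512),    # nodes per layer
            random.uniform(0.0, 0.5)]  # dropout rate

def evaluate(structure):
    # Stand-in for "train a network, cross-validate, return a score".
    layers, nodes, dropout = structure
    return 1.0 - abs(layers - 3) * 0.05 - abs(nodes - 128) / 1024 - dropout * 0.1

tried, scores = [], []
for _ in range(5):                       # seed with a few evaluations
    s = random_structure()
    tried.append(s); scores.append(evaluate(s))

for _ in range(20):                      # surrogate-guided search
    surrogate = RandomForestRegressor(n_estimators=50).fit(tried, scores)
    candidates = [random_structure() for _ in range(100)]
    best = max(candidates, key=lambda c: surrogate.predict([c])[0])
    tried.append(best); scores.append(evaluate(best))

print(max(zip(scores, tried)))           # best score and its structure
```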
Benchmarking

• The ML world is filled with crummy benchmarks:
  • Not enough datasets
  • No cross-validation
  • Only one metric
• Solution: roll our own
  • 50+ datasets, 5 replications of 10-fold CV
  • 10 different metrics
  • 30+ competing algorithms (R, scikit-learn, weka, xgboost)
• http://www.clparker.org/ml_benchmark/
Myth #3
Deep neural networks are not interpretable
Explainability

• Recent work in model interpretation applies broadly to any model:
  • Feature importance (overall)
  • Prediction explanation (feature importance for a given prediction)
• Most (good) techniques rely on data perturbation and multiple predictions, as in the sketch below
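A minimal sketch of one common perturbation-based technique, permutation importance. It assumes a generic fitted model with a `predict` method; the names are illustrative, not BigML's API:

```python
# Shuffle one feature at a time and measure how much the model's
# accuracy drops; a bigger drop means the feature mattered more.
import numpy as np

def permutation_importance(model, X, y, n_repeats=10, rng=None):
    rng = rng or np.random.default_rng(0)
    baseline = (model.predict(X) == y).mean()
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])           # perturb feature j only
            drops.append(baseline - (model.predict(Xp) == y).mean())
        importances.append(np.mean(drops))  # average accuracy drop
    return importances
```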
Myth #4
Deep neural networks have such spectacular performance
that all other supervised learning techniques are now irrelevant
Caveat Emptor

• Things that make deep learning less useful:
  • Small data (where "small" could still be thousands of instances)
  • Problems where you could benefit by iterating quickly (better features always beat better models)
  • Problems that are easy, or for which top-of-the-line performance isn't absolutely critical
• Remember: deep learning is just another sort of supervised learning algorithm

"…deep learning has existed in the neural network community for over 20 years. Recent advances are driven by some relatively minor improvements in algorithms and models and by the availability of large data sets and much more powerful collections of computers." — Stuart Russell
Beyond IID Data

• Traditional machine learning data is assumed to be IID:
  • Independent (points carry no information about each other's class), and
  • Identically distributed (all come from the same distribution)
• But what if you want to predict just the next value in a sequence? Is all lost?
• Applications:
  • Predicting battery life from charge-discharge cycles
  • Predicting sales for the next day/week/month
Machine Learning Data

    Color    Mass   Type
    red      11     pen
    green    45     apple
    red      53     apple
    yellow   0      pen
    blue     2      pen
    green    422    pineapple
    yellow   555    pineapple
    blue     7      pen

Discovering patterns within data:
• Color = "red" → Mass < 100
• Type = "pineapple" → Color ≠ "blue"
• Color = "blue" → Type = "pen"
Machine Learning Data

    Color    Mass   Type
    red      53     apple
    blue     2      pen
    red      11     pen
    blue     7      pen
    green    45     apple
    yellow   555    pineapple
    green    422    pineapple
    yellow   0      pen

Patterns remain valid despite reshuffling:
• Color = "red" → Mass < 100
• Type = "pineapple" → Color ≠ "blue"
• Color = "blue" → Type = "pen"
Time Series Data

    Year   Pineapple Harvest (tons)
    1986    50.74
    1987    22.03
    1988    50.69
    1989    40.38
    1990    29.80
    1991     9.90
    1992    73.93
    1993    22.95
    1994   139.09
    1995   115.17
    1996   193.88
    1997   175.31
    1998   223.41
    1999   295.03
    2000   450.53

[Plot: pineapple harvest in tons (0–500) against year (1986–2000); the series shows a clear upward trend.]
Time Series Data

    Year   Pineapple Harvest (tons)
    1986   139.09
    1987   175.31
    1988     9.91
    1989    22.95
    1990   450.53
    1991    73.93
    1992    40.38
    1993    22.03
    1994   295.03
    1995    50.74
    1996    29.80
    1997   223.41
    1998   115.17
    1999   193.88
    2000    50.69

[Plot: the same values plotted after shuffling; the trend disappears.]

Patterns are invalid after shuffling.
Prediction
Use the data from the past to predict the future
Exponential Smoothing

[Plot: weight (0–0.2) against lag (1–13); the weight given to a past observation decays exponentially with its lag, so recent values dominate the forecast.]
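A minimal sketch of simple exponential smoothing, the simplest member of this family (illustrative; BigML's time series models are more elaborate):

```python
# Each new smoothed value mixes the latest observation with the previous
# smoothed value, so the weight on an observation decays exponentially
# with its lag: alpha, alpha*(1-alpha), alpha*(1-alpha)**2, ...
def exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

harvest = [50.74, 22.03, 50.69, 40.38, 29.80, 9.90, 73.93]
print(exponential_smoothing(harvest))
```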
Trendy

[Plots, Apr–Jul: an additive trend (y from 0 to 50) grows by a constant amount per period; a multiplicative trend (y from 0 to 200) grows by a constant factor per period.]
56. BigML, Inc X#MLSEV: Time Series / Deepnets
Seasonality

[Plots, periods 1–19: additive seasonality (y from 0 to 120) has swings of constant size; multiplicative seasonality (y from 0 to 140) has swings that scale with the level of the series.]
Error

[Plots, periods 1–19: additive error adds noise of constant size; multiplicative error adds noise proportional to the level of the series.]
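A minimal sketch contrasting additive and multiplicative components on synthetic data (illustrative only):

```python
# Additive effects add a constant amount; multiplicative effects
# scale with the level of the series.
import math

level = [10 + 2 * t for t in range(20)]                 # rising level
additive = [l + 5 * math.sin(2 * math.pi * t / 4)
            for t, l in enumerate(level)]
multiplicative = [l * (1 + 0.3 * math.sin(2 * math.pi * t / 4))
                  for t, l in enumerate(level)]
# Additive swings stay the same size; multiplicative swings grow
# as the level grows.
```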
Evaluating Model Fit

• AIC: Akaike Information Criterion; tries to trade off accuracy and model complexity
• AICc: like the AIC, but with a sample-size correction
• BIC: Bayesian Information Criterion; like the AIC, but penalizes large numbers of parameters more harshly
• R-squared: raw performance; the number of model parameters isn't considered
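For reference, the standard definitions, with k the number of model parameters, n the sample size, and L̂ the maximized likelihood:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}$$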
Linear Splitting

    Year   Pineapple Harvest (tons)
    1986   139.09
    1987   175.31
    1988     9.91
    1989    22.95
    1990   450.53
    1991    73.93
    1992    40.38
    1993    22.03
    1994   295.03
    1995   115.17

[Tables: the same data shown twice, with different held-out rows highlighted.] A random split holds out rows scattered throughout the series, which leaks information about the future into training. A linear split trains on the earlier years and holds out the most recent ones, preserving the temporal order, which is what evaluating a time series model requires. (A minimal sketch follows.)
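A minimal sketch of a linear split (illustrative; the 80/20 cut point is an assumption):

```python
# Split time series data linearly rather than randomly, so the test
# set is strictly later than everything in the training set.
years = list(range(1986, 1996))
harvest = [139.09, 175.31, 9.91, 22.95, 450.53,
           73.93, 40.38, 22.03, 295.03, 115.17]

split = int(len(years) * 0.8)              # hold out the last 20%
train = list(zip(years[:split], harvest[:split]))
test = list(zip(years[split:], harvest[split:]))
print(test)                                # [(1994, 295.03), (1995, 115.17)]
```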