Valencian Summer School in Machine Learning
3rd edition
September 14-15, 2017
BigML, Inc 2
Ensembles
Making trees unstoppable
Poul Petersen
CIO, BigML, Inc
What is an Ensemble?
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?
No Model is Perfect
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• DT/NN can model any decision boundary with enough
training data, but the solution is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…
No Data is Perfect
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model, by overfitting
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there are always mistakes in your data
Ensemble Techniques
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• We want to ensure diversity: it’s not useful to use an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms
Simple Example
Partition the data… then model each partition…
For predictions, use the model for the same partition.
Decision Forest

[Diagram: DATASET → SAMPLE 1–4 → MODEL 1–4 → PREDICTION 1–4 → COMBINER → PREDICTION]
Ensembles Demo #1
Decision Forest Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Number of models: How many trees to build
• Sampling options:
• Deterministic / Random
• Replacement:
• Allows sampling the same instance more than once
• A full-size sample with replacement effectively covers ≈ 63.21% of the distinct instances
• “Full size” samples with zero covariance (a good thing)
• At prediction time
• Combiner…
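The ≈ 63.21% figure can be checked with a quick sketch (plain Python, not BigML's implementation): sampling n instances with replacement touches about 1 − 1/e ≈ 63.2% of the distinct instances.

```python
import random

def unique_fraction(n, seed=0):
    """Sample n instances with replacement; return the fraction of
    distinct instances that appear at least once."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

# For large n this settles near 1 - 1/e ~= 0.6321
print(round(unique_fraction(100_000), 3))
```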
Quick Review

Classification (label: action)

| animal | state | … | proximity | action |
| tiger | hungry | … | close | run |
| elephant | happy | … | far | take picture |
| … | … | … | … | … |

Regression (label: min_kmh)

| animal | state | … | proximity | min_kmh |
| tiger | hungry | … | close | 70 |
| hippo | angry | … | far | 10 |
| … | … | … | … | … |
Ensemble Combiners
• Regression: Average of the predictions and expected error
• Classification:
• Plurality - majority wins.
• Confidence Weighted - majority wins but each vote is
weighted by the confidence.
• Probability Weighted - each tree votes the distribution at
its leaf node.
• K Threshold - votes the specified class only if the
required number of trees predicts it. For example, allowing
a “True” vote if and only if at least 9 out of 10 trees vote
“True”.
• Confidence Threshold - only votes the specified class if
the minimum confidence is met.
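To make the first two combiners concrete, here is a minimal sketch in plain Python (the vote format is a hypothetical (class, confidence) pair, not BigML's internal representation):

```python
from collections import defaultdict

def plurality(votes):
    """Majority wins: each tree casts one unweighted vote."""
    counts = defaultdict(int)
    for cls, _conf in votes:
        counts[cls] += 1
    return max(counts, key=counts.get)

def confidence_weighted(votes):
    """Majority wins, but each vote is weighted by its confidence."""
    weights = defaultdict(float)
    for cls, conf in votes:
        weights[cls] += conf
    return max(weights, key=weights.get)

# One very confident tree against two barely-confident ones:
votes = [("True", 0.99), ("False", 0.50), ("False", 0.45)]
print(plurality(votes))            # -> False (two votes to one)
print(confidence_weighted(votes))  # -> True (0.99 outweighs 0.50 + 0.45)
```

The example shows the two combiners can disagree: plurality counts heads, while confidence weighting lets one very sure tree outvote two unsure ones.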
Ensembles Demo #2
Outlier Example

| Diameter | Color | Shape | Fruit |
| 4 | red | round | plum |
| 5 | red | round | apple |
| 5 | red | round | apple |
| 6 | red | round | plum |
| 7 | red | round | apple |

What is a round, red 6cm fruit?

All Data: “plum”
Sample 1: “plum”
Sample 2: “apple”
Sample 3: “apple”
Combined: “apple”
Random Decision Forest

[Diagram: DATASET → SAMPLE 1–4 → MODEL 1–4 → PREDICTION 1–4 → COMBINER → PREDICTION, as for a Decision Forest]
RDF Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Decision Forest parameters still available
• Number of models, Sampling, etc.
• Random candidates:
• The number of features to consider at each split
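The random-candidates idea can be sketched in a few lines of plain Python (the feature names are hypothetical):

```python
import random

def candidate_features(features, random_candidates, rng):
    """Draw the random subset of features a single split may consider."""
    return rng.sample(features, random_candidates)

# Hypothetical feature names; each split in each tree draws its own
# candidate subset, which is what keeps the individual trees diverse.
features = ["diameter", "color", "shape", "weight", "texture"]
rng = random.Random(42)
print(candidate_features(features, 2, rng))
print(candidate_features(features, 2, rng))  # a (likely) different pair
```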
Ensembles Demo #3
Boosting

"Hey Model 1, what do you predict is the sale price of this home?"

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | LAST SALE PRICE | MODEL 1 PREDICTED SALE PRICE | ERROR |
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 360000 | 360750 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 307500 | 306875 | -625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 600000 | 587500 | -12500 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 435350 | 435350 | 0 |

"Hey Model 2, how much error do you predict Model 1 just made?"

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | ERROR | MODEL 2 PREDICTED ERROR |
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 750 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 625 | 625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 12500 | 12393.83333 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 0 | 6879.67857 |

Why stop at one iteration?
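The two questions above can be replayed numerically with the slide's values (plain Python; the trained models are represented only by their outputs):

```python
# The trained models are represented only by their outputs from the tables.
actual  = [360000, 307500, 600000, 435350]   # LAST SALE PRICE
model_1 = [360750, 306875, 587500, 435350]   # MODEL 1 predicted price

# "Hey Model 1, what do you predict?" -> its error becomes a new objective:
errors = [p - a for p, a in zip(model_1, actual)]
print(errors)  # [750, -625, -12500, 0]

# If Model 2 predicted these errors perfectly, correcting Model 1's
# predictions by them would recover the actual prices exactly:
corrected = [p - e for p, e in zip(model_1, errors)]
assert corrected == actual
```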
Boosting

[Diagram: DATASET → MODEL 1 (Iteration 1), DATASET 2 → MODEL 2 (Iteration 2), DATASET 3 → MODEL 3 (Iteration 3), DATASET 4 → MODEL 4 (Iteration 4), etc.; PREDICTION 1–4 → SUM → PREDICTION]
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
Boosting Config

[Diagram: the same boosting pipeline, with the “OUT OF BAG” SAMPLES from each iteration's dataset held out and used for the early-stopping tests]
Boosting Config
• Early holdout: tests with a portion of the dataset
• None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work.
Iterations

Boosted Ensemble #1: the early stop is reached before the iteration limit

1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50

This is OK because the early stop means the iterative improvement is small and we have "converged" before being forcibly stopped by the iteration limit.

Boosted Ensemble #2: the iteration limit is reached before the early stop

1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50

This is NOT OK because the hard limit on iterations stopped the boosting long before there were enough iterations to achieve the best quality.
Boosting Config
• Learning Rate: Controls how aggressively boosting will fit the data:
• Larger values ~ maybe quicker fit, but risk of overfitting
• You can combine sampling with Boosting!
• Samples with Replacement
• Add Randomize
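A minimal boosting loop showing the learning-rate tradeoff (plain Python; `mean_fit` is a stand-in for training a tree on the residuals, so predictions can only converge to the target mean):

```python
def boost(targets, fit, iterations, learning_rate):
    """Minimal boosting sketch: each iteration fits a model to the
    current residuals and adds a damped correction."""
    predictions = [0.0] * len(targets)
    for _ in range(iterations):
        residuals = [t - p for t, p in zip(targets, predictions)]
        correction = fit(residuals)
        predictions = [p + learning_rate * c
                       for p, c in zip(predictions, correction)]
    return predictions

# Stand-in for training a tree: just predict the mean residual.
def mean_fit(residuals):
    return [sum(residuals) / len(residuals)] * len(residuals)

targets = [10.0, 20.0, 30.0]
# learning_rate=1.0 closes the gap (toward the target mean) in one step;
# 0.5 needs several iterations to get equally close.
print(boost(targets, mean_fit, 1, 1.0))
print(boost(targets, mean_fit, 10, 0.5))
```

With a real tree as `fit`, each iteration would also capture per-row structure; the damping is what trades fitting speed against overfitting risk.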
Boosting Randomize

[Diagram: the same boosting pipeline (DATASET → MODEL 1, DATASET 2 → MODEL 2, …, PREDICTION 1–4 → SUM → PREDICTION), with randomization added to each iteration's model]
Boosting Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
Ensembles Demo #4
Wait a Second… what about classification?

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | diabetes | MODEL 1 predicted diabetes | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | TRUE | TRUE | ? |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | FALSE | TRUE | ? |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | TRUE | FALSE | ? |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | FALSE | FALSE | ? |

… we could try coding the classes as numbers:

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | diabetes | MODEL 1 predicted diabetes | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 1 | 0 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 1 | -1 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 0 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0 | 0 |
Wait a Second… but then what about multiple classes?

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color | MODEL 1 predicted favorite color | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED | BLUE | ? |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN | GREEN | ? |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE | RED | ? |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED | GREEN | ? |
Boosting Classification

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED |

Iteration 1: MODEL 1 (RED/NOT RED)
Class RED Probability: 0.9, 0.7, 0.46, 0.12
Class RED ERROR: 0.1, -0.7, 0.54, -0.12

Iteration 2: MODEL 2 (RED/NOT RED ERR), trained with the Class RED ERROR as the objective

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0.1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | -0.7 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 0.54 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | -0.12 |

PREDICTED ERROR: 0.05, -0.54, 0.32, -0.22

MODEL 1 (BLUE/NOT BLUE)
Class BLUE Probability: 0.1, 0.3, 0.54, 0.88
Class BLUE ERROR: -0.1, 0.7, -0.54, 0.12

…and repeat for each class at each iteration
Boosting Classification

[Diagram: DATASET → MODELS 1 per class → PREDICTIONS 1 per class (Iteration 1); DATASETS 2 per class → MODELS 2 per class → PREDICTIONS 2 per class (Iteration 2); likewise for Iterations 3, 4, etc.; a combiner turns the per-class predictions into a PROBABILITY per class]
Ensembles Demo #5
Stacked Generalization

[Diagram: SOURCE → DATASET → MODEL, ENSEMBLE, and LOGISTIC REGRESSION; each produces a BATCH PREDICTION, giving an EXTENDED DATASET; a final LOGISTIC REGRESSION is trained on the extended dataset]
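A sketch of the extended-dataset step (plain Python; the three base models are hypothetical stand-ins returning fixed probabilities, not trained BigML resources):

```python
# Rows of the source dataset (feature x, objective y):
rows = [
    {"x": 1.0, "y": "True"},
    {"x": 2.0, "y": "False"},
]

# Hypothetical stand-ins for the trained model, ensemble and logistic
# regression: each maps a row to a probability of the "True" class.
base_models = {
    "model":    lambda row: 0.9 if row["x"] < 1.5 else 0.2,
    "ensemble": lambda row: 0.8 if row["x"] < 1.5 else 0.3,
    "lr":       lambda row: 0.7 if row["x"] < 1.5 else 0.1,
}

# The batch predictions become new columns: the "extended dataset"
# on which the final (meta) logistic regression would be trained.
extended = [
    {**row, **{name: m(row) for name, m in base_models.items()}}
    for row in rows
]
print(extended[0])  # original features plus one prediction per base model
```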
Which Ensemble Method
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomize features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially
Too Many Parameters?
• How many trees?
• How many nodes?
• Missing splits?
• Random candidates?
• Too many parameters?
SMACdown!
Summary
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation
Logistic Regressions
Modeling probabilities
Poul Petersen
CIO, BigML, Inc
Logistic Regression

Potential Confusion: Logistic Regression is a classification algorithm.

• Classification implies a discrete objective. How can this be a regression?
• Why do we need another classification algorithm?
• more questions…
Linear Regression
Polynomial Regression
Regression

Key Take-Away: Regression is the process of "fitting" a function to the data.

• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
• Problem:
• What if we want to do a classification problem: T/F or 1/0?
• What function can we fit to discrete data?
Discrete Data Function?
Logistic Function

Goal:

f(x) = 1 / (1 + e⁻ˣ)

as x → −∞, f(x) → 0
as x → +∞, f(x) → 1

• Looks promising, but still not "discrete"
• What about the "green" in the middle?
• Let’s change the problem…
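The function and its limits, as a sketch:

```python
import math

def logistic(x):
    """f(x) = 1 / (1 + e^(-x)): squashes any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(-10))  # close to 0
print(logistic(0))    # exactly 0.5
print(logistic(10))   # close to 1
```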
Modeling Probabilities

P ≈ 0        0 < P < 1        P ≈ 1
Logistic Regression

Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective.

Caveats:
• Assumes that the output is linearly related to the "predictors"
• What? (hang in there…)
• Sometimes we can "fix" this with feature engineering
• Question: how do we "fit" the logistic function to real data?
Logistic Regression

Given training data consisting of inputs x and probabilities P, fit the logistic function:

P(x) = 1 / (1 + e^−(β₀ + β₁x))

• β₀ is the "intercept", β₁ is the "coefficient"
• Solve for β₀ and β₁ to fit the logistic function
• How? The inverse of the logistic function is called the "logit":

ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• In which case solving is now a linear regression
• But this is only one dimension, that is, one feature x…
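A numeric check of the logit relationship (β₀, β₁ and x here are made-up values):

```python
import math

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def logit(p):
    """ln(p / (1 - p)): the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Applying the logit to the logistic output recovers the linear part,
# which is why the fit reduces to a linear problem:
b0, b1, x = -1.0, 2.0, 0.75
p = logistic(x, b0, b1)
print(logit(p))  # ~0.5, i.e. b0 + b1 * x
```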
Logistic Regression

For "i" dimensions, X = [x₁, x₂, ⋯, xᵢ], we solve:

P(X) = 1 / (1 + e^−f(X))

where:

f(X) = β₀ + 𝞫·X = β₀ + β₁x₁ + ⋯ + βᵢxᵢ
Interpreting Coefficients

• LR computes β₀ and a coefficient βⱼ for each feature xⱼ
• negative βⱼ → negatively correlated: xⱼ↑ then P(X)↓
• positive βⱼ → positively correlated: xⱼ↑ then P(X)↑
• "larger" βⱼ → more impact: a small change in xⱼ produces a large change in P(X)
• "smaller" βⱼ → less impact: a large change in xⱼ produces only a small change in P(X)
• βⱼ "size" should not be confused with field importance
• Can include a coefficient for "missing" (if enabled):
f(X) = β₀ + ⋯ + βⱼxⱼ + βⱼ₊₁[xⱼ ≡ Missing] + ⋯
• Binary classification (true/false) coefficients are complementary:
P(True) ≡ 1 − P(False)
LR Demo #1
LR Parameters
1. Default Numeric: Replaces missing numeric values
2. Missing Numeric: Adds a field for missing numerics
3. Stats: Extended statistics, ex: p-value (runs slower)
4. Bias: Enables/Disables the intercept term - 𝛽₀
• Don’t disable this…
5. Regularization: Reduces over-fitting by minimizing 𝛽𝑗
• L1: prefers reducing individual coefficients
• L2 (default): prefers reducing all coefficients
6. Strength "C": Higher values reduce regularization
7. EPS: The minimum error improvement between steps before stopping
Larger values stop earlier but quality may be lower
8. Auto-scaling: Ensures that all features contribute equally
• Don’t change this unless you have a specific reason
LR Questions

Questions:
• How do we handle multiple classes?
• A binary class (True/False) only needs to solve for one: P(True) ≡ 1 − P(False)
• What about non-numeric inputs?
• Text/Items fields
• Categorical fields
LR - Multi Class

• Instead of a binary class, ex: [ true, false ], we have multi-class, ex: [ red, green, blue, … ]
• "k" classes: C = [c₁, c₂, ⋯, cₖ]
• solve a one-vs-rest LR for each class:

ln( P(c₁) / (1 − P(c₁)) ) = β₁,₀ + 𝞫₁·X
ln( P(c₂) / (1 − P(c₂)) ) = β₂,₀ + 𝞫₂·X
⋯
ln( P(cₖ) / (1 − P(cₖ)) ) = βₖ,₀ + 𝞫ₖ·X

• Result: 𝞫ⱼ for each class cⱼ
• apply a combiner to ensure all probabilities add to 1
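The one-vs-rest step plus the normalizing combiner can be sketched as follows (the linear scores are made-up values, and BigML's actual combiner may differ):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up one-vs-rest linear scores (beta_k,0 + beta_k . X) for one input:
scores = {"red": 1.2, "green": -0.4, "blue": 0.1}

# Each class gets its own logistic output; these need not sum to 1:
raw = {c: logistic(z) for c, z in scores.items()}

# A simple combiner: normalize so the probabilities add to 1.
total = sum(raw.values())
probabilities = {c: p / total for c, p in raw.items()}
print(probabilities)
```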
LR - Field Codings

• LR is expecting numeric values to perform the regression.
• How do we handle categorical values, or text?

One-hot encoding

| Class | color=red | color=blue | color=green | color=NULL |
| red | 1 | 0 | 0 | 0 |
| blue | 0 | 1 | 0 | 0 |
| green | 0 | 0 | 1 | 0 |
| MISSING | 0 | 0 | 0 | 1 |

• Only one feature is "hot" for each class
• This is the default
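A sketch of one-hot encoding for the table above (plain Python; the field naming is illustrative):

```python
def one_hot(value, classes):
    """One-hot encode a categorical value: one 0/1 field per class,
    plus a field for a missing value."""
    fields = classes + ["NULL"]
    hot = value if value in classes else "NULL"
    return {f"color={f}": int(f == hot) for f in fields}

classes = ["red", "blue", "green"]
print(one_hot("blue", classes))  # only color=blue is 1
print(one_hot(None, classes))    # only color=NULL is 1
```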
LR - Field Codings

Dummy Encoding

| Class | color_1 | color_2 | color_3 |
| *red* | 0 | 0 | 0 |
| blue | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| MISSING | 0 | 0 | 1 |

• Chooses a *reference class*, here red
• Requires one fewer degree of freedom
LR - Field Codings

Contrast Encoding

| Class | field | "influence" |
| red | 0.5 | positive |
| blue | -0.25 | negative |
| green | -0.25 | negative |
| MISSING | 0 | excluded |

• Field values must sum to zero
• Allows comparison between classes
LR - Field Codings

Which one to use?
• One-hot is the default
• Use this unless you have a specific need
• Dummy
• Use when there is a control group in mind, which becomes the reference class
• Contrast
• Allows for testing specific hypotheses about relationships
• Ex: customers give a "rating" of bad / ok / good

| rating | Contrast Encoding |
| bad | -0.66 |
| ok | 0.33 |
| good | 0.33 |

Hypothesis: a good and an ok review have the same impact, but a bad review has a negative impact twice as great.

| rating | Contrast Encoding |
| bad | -0.5 |
| ok | 0 |
| good | 0.5 |

Hypothesis: a good and a bad review have an equal but opposite impact, while an ok rating has no impact.
LR - Field Codings

Text / Items
• Text/Items field types are handled by creating a field for each text token/item and setting it to 1 or 0

| Text | "hippo" | "safari" | "zebra" |
| “we saw hippos and zebras…” | 1 | 0 | 1 |
| “The best safari for seeing zebras” | 0 | 1 | 1 |
| “The Oregon coast is rainy in winter” | 0 | 0 | 0 |
| “Have you ever tried a hippo burger” | 1 | 0 | 0 |
LR Demo #2
Curvilinear LR
• Logistic Regression is expecting a linear relationship
between the features and the objective
• Remember - it’s a linear regression under the hood
• This is actually pretty common in natural datasets
• But non-linear relationships will impact model quality
• This can be addressed by adding non-linear
transformations to the features
• Knowing which transformations to apply requires
• domain knowledge
• experimentation
• both
Curvilinear LR

Instead of

β₀ + β₁x₁

we could add a feature

β₀ + β₁x₁ + β₂x₂

where

x₂ ≡ x₁²

It is possible to add any higher-order terms or other functions to match the shape of the data.
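A sketch of the engineered feature (the coefficients β₀ = −5, β₂ = 1 are made-up values, not fitted ones):

```python
import math

def predict(features, b0, betas):
    """Logistic regression prediction for one row."""
    z = b0 + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

x1 = 3.0
engineered = [x1, x1 ** 2]   # x2 = x1^2, the added feature
# Made-up coefficients that put all the weight on the x1^2 term:
p = predict(engineered, b0=-5.0, betas=[0.0, 1.0])
print(round(p, 3))  # ~0.982: driven entirely by the quadratic term
```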
LR Demo #3
LR vs DT

Logistic Regression:
• Expects a "smooth" linear relationship with the predictors.
• LR is concerned with the probability of a discrete outcome.
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, might work better when less data is available.

Decision Tree:
• Adapts well to ragged non-linear relationships
• No concern: classification, regression, multi-class all fine.
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to the parameter axes, but given enough data will discover any shape.
LR Demo #4
Summary
• Logistic Regression is a classification algorithm that
models the probabilities of each class
• How the algorithm works and why this is important
• Expects a linear relationship between the features
and the objective, and how to fix it
• Categorical encodings
• LR outputs a set of coefficients, and how to interpret them:
• Scale relates to impact
• Sign relates to direction of impact
• Guidelines for comparing to Decision Trees
VSSML17 L2. Ensembles and Logistic Regressions

VSSML17 L2. Ensembles and Logistic Regressions

  • 1.
    Valencian Summer Schoolin Machine Learning 3rd edition September 14-15, 2017
  • 2.
    BigML, Inc 2 Ensembles Makingtrees unstoppable Poul Petersen CIO, BigML, Inc
  • 3.
    BigML, Inc 3Ensembles whatis an Ensemble? • Rather than build a single model… • Combine the output of several typically “weaker” models into a powerful ensemble… • Q1: Why is this necessary? • Q2: How do we build “weaker” models? • Q3: How do we “combine” models?
  • 4.
    BigML, Inc 4Ensembles NoModel is Perfect • A given ML algorithm may simply not be able to exactly model the “real solution” of a particular dataset. • Try to fit a line to a curve • Even if the model is very capable, the “real solution” may be elusive • DT/NN can model any decision boundary with enough training data, but the solution is NP-hard • Practical algorithms involve random processes and may arrive at different, yet equally good, “solutions” depending on the starting conditions, local optima, etc. • If that wasn’t bad enough…
  • 5.
    BigML, Inc 5Ensembles NoData is Perfect • Not enough data! • Always working with finite training data • Therefore, every “model” is an approximation of the “real solution” and there may be several good approximations. • Anomalies / Outliers • The model is trying to generalize from discrete training data. • Outliers can “skew” the model, by overfitting • Mistakes in your data • Does the model have to do everything for you? • But really, there is always mistakes in your data
  • 6.
    BigML, Inc 6Ensembles EnsemblesTechniques • Key Idea: • By combining several good “models”, the combination may be closer to the best possible “model” • we want to ensure diversity. It’s not useful to use an ensemble of 100 models that are all the same • Training Data Tricks • Build several models, each with only some of the data • Introduce randomness directly into the algorithm • Add training weights to “focus” the additional models on the mistakes made • Prediction Tricks • Model the mistakes • Model the output of several different algorithms
  • 7.
  • 8.
  • 9.
    BigML, Inc 9Ensembles SimpleExample Partition the data… then model each partition… For predictions, use the model for the same partition ?
  • 10.
    BigML, Inc 10Ensembles DecisionForest MODEL 1 DATASET SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 MODEL 2 MODEL 3 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION COMBINER
  • 11.
  • 12.
    BigML, Inc 12Ensembles DecisionForest Config • Individual tree parameters are still available • Balanced objective, Missing splits, Node Depth, etc. • Number of models: How many trees to build • Sampling options: • Deterministic / Random • Replacement: • Allows sampling the same instance more than once • Effectively the same as ≈ 63.21% • “Full size” samples with zero covariance (good thing) • At prediction time • Combiner…
  • 13.
    BigML, Inc 13Ensembles QuickReview animal state … proximity action tiger hungry … close run elephant happy … far take picture … … … … … Classification animal state … proximity min_kmh tiger hungry … close 70 hippo angry … far 10 … …. … … … Regression label
  • 14.
    BigML, Inc 14Ensembles EnsembleCombiners • Regression: Average of the predictions and expected error • Classification: • Plurality - majority wins. • Confidence Weighted - majority wins but each vote is weighted by the confidence. • Probability Weighted - each tree votes the distribution at it’s leaf node. • K Threshold - only votes if the specified class and required number of trees is met. For example, allowing a “True” vote if and only if at least 9 out of 10 trees vote “True”. • Confidence Threshold - only votes the specified class if the minimum confidence is met.
  • 15.
  • 16.
    BigML, Inc 16Ensembles OutlierExample Diameter Color Shape Fruit 4 red round plum 5 red round apple 5 red round apple 6 red round plum 7 red round apple All Data: “plum” Sample 2: “apple” Sample 3: “apple” Sample 1: “plum” }“apple” What is a round, red 6cm fruit?
  • 17.
    BigML, Inc 17Ensembles RandomDecision Forest MODEL 1 DATASET SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 MODEL 2 MODEL 3 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 SAMPLE 1 PREDICTION COMBINER
  • 18.
    BigML, Inc 18Ensembles RDFConfig • Individual tree parameters are still available • Balanced objective, Missing splits, Node Depth, etc. • Decision Forest parameters still available • Number of model, Sampling, etc • Random candidates: • The number of features to consider at each split
  • 19.
  • 20.
    BigML, Inc 20Ensembles Boosting ADDRESSBEDS BATHS SQFT LOT SIZE YEAR BUILT LATITUDE LONGITUDE LAST SALE PRICE 1522 NW Jonquil 4 3 2424 5227 1991 44.594828 -123.269328 360000 7360 NW Valley Vw 3 2 1785 25700 1979 44.643876 -123.238189 307500 4748 NW Veronica 5 3.5 4135 6098 2004 44.5929659 -123.306916 600000 411 NW 16th 3 2825 4792 1938 44.570883 -123.272113 435350 MODEL 1 PREDICTED SALE PRICE 360750 306875 587500 435350 ERROR 750 -625 -12500 0 ADDRESS BEDS BATHS SQFT LOT SIZE YEAR BUILT LATITUDE LONGITUDE ERROR 1522 NW Jonquil 4 3 2424 5227 1991 44.594828 -123.269328 750 7360 NW Valley Vw 3 2 1785 25700 1979 44.643876 -123.238189 625 4748 NW Veronica 5 3.5 4135 6098 2004 44.5929659 -123.306916 12500 411 NW 16th 3 2825 4792 1938 44.570883 -123.272113 0 MODEL 2 PREDICTED ERROR 750 625 12393.83333 6879.67857 Why  stop  at  one  iteration? "Hey Model 1, what do you predict is the sale price of this home?" "Hey Model 2, how much error do you predict Model 1 just made?"
  • 21.
    BigML, Inc 21Ensembles Boosting DATASETMODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 22.
    BigML, Inc 22Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples
  • 23.
    BigML, Inc 23Ensembles BoostingConfig “OUT OF BAG” SAMPLES DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 24.
    BigML, Inc 24Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples • Early holdout: tests with a portion of the dataset • None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work.
  • 25.
    BigML, Inc 25Ensembles Iterations Boosted  Ensemble  #1 1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50 Early Stop # Iterations 1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50 Boosted  Ensemble  #2 Early Stop# Iterations This is OK because the early stop means the iterative improvement is small and we have "converged" before being forcibly stopped by the # iterations This is NOT OK because the hard limit on iterations stopped improving the quality of the boosting long before there was enough iterations to have achieved the best quality.
  • 26.
    BigML, Inc 26Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples • Early holdout: tests with a portion of the dataset • None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work. • Learning Rate: Controls how aggressively boosting will fit the data: • Larger values ~ maybe quicker fit, but risk of overfitting • You can combine sampling with Boosting! • Samples with Replacement • Add Randomize
  • 27.
    BigML, Inc 27Ensembles BoostingRandomize DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 28.
    BigML, Inc 28Ensembles BoostingRandomize DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 29.
    BigML, Inc 29Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples • Early holdout: tests with a portion of the dataset • None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work. • Learning Rate: Controls how aggressively boosting will fit the data: • Larger values ~ maybe quicker fit, but risk of overfitting • You can combine sampling with Boosting! • Samples with Replacement • Add Randomize • Individual tree parameters are still available • Balanced objective, Missing splits, Node Depth, etc.
  • 30.
  • 31.
    BigML, Inc 31Ensembles Waita Second… pregnancies plasma glucose blood pressure triceps skin thickness insulin bmi diabetes pedigree age diabetes 6 148 72 35 0 33.6 0.627 50 TRUE 1 85 66 29 0 26.6 0.351 31 FALSE 8 183 64 0 0 23.3 0.672 32 TRUE 1 89 66 23 94 28.1 0.167 21 FALSE MODEL 1 predicted diabetes TRUE TRUE FALSE FALSE ERROR ? ? ? ? …  what  about  classification? pregnancies plasma glucose blood pressure triceps skin thickness insulin bmi diabetes pedigree age diabetes 6 148 72 35 0 33.6 0.627 50 1 1 85 66 29 0 26.6 0.351 31 0 8 183 64 0 0 23.3 0.672 32 1 1 89 66 23 94 28.1 0.167 21 0 MODEL 1 predicted diabetes 1 1 0 0 ERROR 0 -1 1 0 …  we  could  try
  • 32.
    BigML, Inc 32Ensembles Waita Second… pregnancies plasma glucose blood pressure triceps skin thickness insulin bmi diabetes pedigree age favorite color 6 148 72 35 0 33.6 0.627 50 RED 1 85 66 29 0 26.6 0.351 31 GREEN 8 183 64 0 0 23.3 0.672 32 BLUE 1 89 66 23 94 28.1 0.167 21 RED MODEL 1 predicted favorite color BLUE GREEN RED GREEN ERROR ? ? ? ? …  but  then  what  about  multiple  classes?
  • 33.
BigML, Inc 33 Ensembles
Boosting Classification

favorite color: RED, GREEN, BLUE, RED

Iteration 1, MODEL 1 (RED / NOT RED):
  Class RED probability: 0.9, 0.7, 0.46, 0.12
  Class RED error:       0.1, -0.7, 0.54, -0.12

Iteration 2, MODEL 2 (RED / NOT RED, fit to the error):
  Predicted error: 0.05, -0.54, 0.32, -0.22

Iteration 1, MODEL 1 (BLUE / NOT BLUE):
  Class BLUE probability: 0.1, 0.3, 0.54, 0.88
  Class BLUE error:       -0.1, 0.7, -0.54, 0.12

…and repeat for each class at each iteration.
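The per-class residual scheme above can be sketched numerically (illustrative values, not BigML's implementation): the "error" a one-vs-rest model leaves behind is the class indicator (1 or 0) minus the probability it predicted.

```python
# Toy sketch of the residuals the next boosting iteration fits:
# residual = class indicator (1/0) - predicted class probability.
rows = ["RED", "GREEN", "BLUE", "RED"]        # actual favorite color
p_red = [0.9, 0.7, 0.46, 0.12]                # hypothetical RED/NOT RED output

is_red = [1.0 if c == "RED" else 0.0 for c in rows]
red_error = [t - p for t, p in zip(is_red, p_red)]
# the next RED/NOT RED model is then trained to predict `red_error`,
# and the same is done for every other class at every iteration
```

With this definition a confident correct prediction leaves a residual near 0, while a confident wrong prediction leaves a residual near ±1, which is exactly what the next model should focus on.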
BigML, Inc 34 Ensembles
Boosting Classification

Iteration 1: DATASET → MODELS (1 per class) → PREDICTIONS (1 per class)
Iteration 2: DATASETS (2 per class) → MODELS (2 per class) → PREDICTIONS (2 per class)
Iteration 3: DATASETS (3 per class) → MODELS (3 per class) → PREDICTIONS (3 per class)
Iteration 4: DATASETS (4 per class) → MODELS (4 per class) → PREDICTIONS (4 per class), etc…
Combiner → PROBABILITY per class
BigML, Inc 36 Ensembles
Stacked Generalization

SOURCE → DATASET → MODEL / ENSEMBLE / LOGISTIC REGRESSION
  → BATCH PREDICTION (one per model) → EXTENDED DATASET (one per model)
  → LOGISTIC REGRESSION (meta-model trained on the extended dataset)
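The pipeline above can be sketched in miniature (hypothetical stand-in models, not BigML's API): each base model's prediction is appended to every row as a new feature, producing the "extended dataset" on which the meta-level logistic regression is then trained.

```python
# Minimal stacked-generalization sketch with two stand-in base models.
def model_a(row):            # stand-in for a decision tree
    return 1.0 if row[0] > 0.5 else 0.0

def model_b(row):            # stand-in for an ensemble
    return round(0.8 * row[1], 2)

dataset = [[0.9, 0.2], [0.1, 0.7]]

# "batch prediction" step: extend every row with the base models' outputs
extended = [row + [model_a(row), model_b(row)] for row in dataset]
# a logistic regression (the meta-model) would now be trained on `extended`
```

The meta-model learns how much to trust each base model, which is what "modeling the models" means on the summary slide.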
BigML, Inc 37 Ensembles
Which Ensemble Method?
• The one that works best! Ok, but seriously: did you evaluate?
• For "large" / "complex" datasets: use DF/RDF with a deeper node threshold; even better, use Boosting with more iterations
• For "noisy" data: Boosting may overfit; RDF preferred
• For "wide" data: randomized features (RDF) will be quicker
• For "easy" data: a single model may be fine; bonus: it also has the best interpretability!
• For classification with a "large" number of classes: Boosting will be slower
• For "general" data: DF/RDF likely better than a single model or Boosting; Boosting will be slower since the models are processed serially
BigML, Inc 38 Ensembles
Too Many Parameters?
• How many trees? How many nodes? Missing splits? Random candidates?
• Too many parameters? SMACdown!
BigML, Inc 39 Ensembles
Summary
• Models have shortcomings: ability to fit, NP-hard, etc.
• Data has shortcomings: not enough, outliers, mistakes, etc.
• Ensemble techniques can improve on single models:
  • Sampling: partitioning, Decision Tree bagging
  • Adding randomness: RDF
  • Modeling the error: Boosting
  • Modeling the models: Stacking
• Guidelines for knowing which one might work best in a given situation
BigML, Inc 2 Logistic Regressions
Modeling probabilities
Poul Petersen
CIO, BigML, Inc
BigML, Inc 3 Logistic Regressions
Logistic Regression
Potential Confusion: Logistic Regression is a classification algorithm.
• Classification implies a discrete objective. How can this be a regression?
• Why do we need another classification algorithm?
• more questions…
BigML, Inc 4 Logistic Regressions
Linear Regression
BigML, Inc 5 Logistic Regressions
Linear Regression
BigML, Inc 6 Logistic Regressions
Polynomial Regression
BigML, Inc 7 Logistic Regressions
Regression
Key Take-Away: Regression is the process of "fitting" a function to the data.
• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
• Problem: what if we want to do a classification problem, T/F or 1/0? What function can we fit to discrete data?
BigML, Inc 8 Logistic Regressions
Discrete Data Function?
BigML, Inc 9 Logistic Regressions
Discrete Data Function?
BigML, Inc 10 Logistic Regressions
Logistic Function

  f(x) = 1 / (1 + e⁻ˣ)

• Goal: as x → −∞, f(x) → 0; as x → ∞, f(x) → 1
• Looks promising, but still not "discrete"
• What about the "green" in the middle?
• Let's change the problem…
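The limiting behavior above is easy to check numerically; a minimal sketch:

```python
import math

def logistic(x):
    # f(x) = 1 / (1 + e^-x): squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# f(0) = 0.5, and the tails approach 0 and 1 but never reach them
midpoint = logistic(0.0)
left_tail = logistic(-10.0)     # close to 0
right_tail = logistic(10.0)     # close to 1
```

Note the symmetry logistic(-x) = 1 - logistic(x), which is why the two tails mirror each other around the midpoint.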
BigML, Inc 11 Logistic Regressions
Modeling Probabilities
P ≈ 0 on the left, 0 < P < 1 in the middle, P ≈ 1 on the right
BigML, Inc 12 Logistic Regressions
Logistic Regression
Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective.
Caveats:
• Assumes that the output is linearly related to the "predictors"
  • What? (hang in there…)
  • Sometimes we can "fix" this with feature engineering
• Question: how do we "fit" the logistic function to real data?
BigML, Inc 13 Logistic Regressions
Logistic Regression
• Given training data consisting of inputs x and probabilities P
• Solve for β₀ and β₁ to fit the logistic function:

  P(x) = 1 / (1 + e^−(β₀ + β₁x))

• How? The inverse of the logistic function is called the "logit":

  ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• In which case solving is now a linear regression
• But this is only one dimension, that is, one feature x…
• β₀ is the "intercept", β₁ is the "coefficient"
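The logit trick above can be verified directly (toy values, not BigML's solver): applying the logit to a logistic output recovers the linear part β₀ + β₁x, which is why the fit reduces to a linear regression.

```python
import math

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def logit(p):
    # inverse of the logistic function: ln(p / (1 - p))
    return math.log(p / (1.0 - p))

# hypothetical coefficients and input
b0, b1, x = -1.0, 2.0, 0.75
p = logistic(x, b0, b1)          # a probability in (0, 1)
recovered = logit(p)             # back on the linear scale: b0 + b1*x
```

So if the training data supplies probabilities, taking their logit turns the curve-fitting problem into an ordinary straight-line fit.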
BigML, Inc 14 Logistic Regressions
Logistic Regression
For "i" dimensions, X = [x₁, x₂, ⋯, xᵢ], we solve:

  P(X) = 1 / (1 + e^−f(X))

where:

  f(X) = β₀ + 𝞫·X = β₀ + β₁x₁ + ⋯ + βᵢxᵢ
BigML, Inc 15 Logistic Regressions
Interpreting Coefficients
• LR computes β₀ and a coefficient βⱼ for each feature xⱼ
  • negative βⱼ → negatively correlated: xⱼ↑ then P(X)↓
  • positive βⱼ → positively correlated: xⱼ↑ then P(X)↑
  • "larger" βⱼ → more impact: a small increase in xⱼ moves P(X) a lot
  • "smaller" βⱼ → less impact: even a large increase in xⱼ moves P(X) only a little
  • βⱼ "size" should not be confused with field importance
• Can include a coefficient for "missing" (if enabled):
  P(X) = β₀ + ⋯ + βⱼxⱼ + ⋯ + βⱼ₊₁[xⱼ ≡ Missing]
• Binary classification (true/false) coefficients are complementary:
  P(True) ≡ 1 − P(False)
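The interpretation rules above can be checked numerically (hypothetical coefficients): the sign of a coefficient sets the direction of the effect, and its magnitude sets how strongly a step in the feature moves the probability.

```python
import math

def prob(x, b0, b1):
    # single-feature logistic regression: P = 1 / (1 + e^-(b0 + b1*x))
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# negative coefficient: increasing x decreases P(X)
goes_down = prob(2.0, 0.0, -1.5) < prob(1.0, 0.0, -1.5)
# positive coefficient: increasing x increases P(X)
goes_up = prob(2.0, 0.0, 1.5) > prob(1.0, 0.0, 1.5)
# larger |coefficient|: the same step in x moves P(X) further
small_step = prob(1.0, 0.0, 0.1) - prob(0.0, 0.0, 0.1)
large_step = prob(1.0, 0.0, 2.0) - prob(0.0, 0.0, 2.0)
```

This is also why coefficient size alone is not field importance: the effect of βⱼ depends on the scale of xⱼ, which is what auto-scaling (next slide's parameter list) is meant to even out.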
BigML, Inc 16 Logistic Regressions
LR Demo #1
BigML, Inc 17 Logistic Regressions
LR Parameters
1. Default Numeric: replaces missing numeric values
2. Missing Numeric: adds a field for missing numerics
3. Stats: extended statistics, e.g. p-value (runs slower)
4. Bias: enables/disables the intercept term β₀
   • Don't disable this…
5. Regularization: reduces over-fitting by minimizing βⱼ
   • L1: prefers reducing individual coefficients
   • L2 (default): prefers reducing all coefficients
6. Strength "C": higher values reduce regularization
7. EPS: the minimum improvement between steps before stopping; larger values stop earlier, but quality may be lower
8. Auto-scaling: ensures that all features contribute equally
   • Don't change this unless you have a specific reason
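The EPS parameter (item 7) is worth a concrete look. The sketch below is a toy batch gradient descent for a one-feature logistic regression (illustrative only, not BigML's optimizer, and the training data is made up): training stops once the per-step improvement in the loss drops below EPS, so a larger EPS stops earlier.

```python
import math

# hypothetical overlapping 0/1 training data
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

def fit(eps, lr=0.5, max_steps=100_000):
    """Return the number of steps taken before improvement < eps."""
    b0 = b1 = 0.0
    prev_loss = float("inf")
    for step in range(max_steps):
        g0 = g1 = loss = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y                     # gradient w.r.t. intercept
            g1 += (p - y) * x               # gradient w.r.t. coefficient
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
        if prev_loss - loss < eps:          # improvement below EPS: stop
            return step
        prev_loss = loss
        b0 -= lr * g0 / len(xs)
        b1 -= lr * g1 / len(xs)
    return max_steps

steps_loose = fit(eps=1e-2)    # stops early
steps_tight = fit(eps=1e-8)    # runs longer for a closer fit
```

The trade-off in the slide falls out directly: the loose tolerance finishes in fewer steps at the cost of a less converged model.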
BigML, Inc 18 Logistic Regressions
LR Questions
Questions:
• How do we handle multiple classes?
  • A binary class (True/False) only needs to solve for one: P(True) ≡ 1 − P(False)
• What about non-numeric inputs?
  • Text/Items fields
  • Categorical fields
BigML, Inc 19 Logistic Regressions
LR: Multi Class
• Instead of a binary class, e.g. [true, false], we have multi-class, e.g. [red, green, blue, …]
• "k" classes: C = [c₁, c₂, ⋯, c_k]
• Solve one-vs-rest LR:

  ln( P(c₁) / (1 − P(c₁)) ) = β₁,₀ + 𝞫₁·X
  ln( P(c₂) / (1 − P(c₂)) ) = β₂,₀ + 𝞫₂·X
  ⋯
  ln( P(c_k) / (1 − P(c_k)) ) = β_k,₀ + 𝞫_k·X

• Result: 𝞫ⱼ for each class cⱼ
• Apply a combiner to ensure all probabilities add to 1
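The combiner step above can be sketched with hypothetical numbers: each one-vs-rest model produces an independent probability, so the raw outputs need not sum to 1; dividing by their total fixes that.

```python
# Hypothetical raw one-vs-rest outputs for a 3-class problem.
raw = {"red": 0.6, "green": 0.3, "blue": 0.3}

# normalize so the per-class probabilities sum to 1
total = sum(raw.values())
combined = {c: p / total for c, p in raw.items()}
# the predicted class is then the one with the highest combined probability
```

Here 0.6 + 0.3 + 0.3 = 1.2, so red's combined probability becomes 0.6 / 1.2 = 0.5.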
BigML, Inc 20 Logistic Regressions
LR: Field Codings
• LR expects numeric values to perform the regression
• How do we handle categorical values, or text?

One-hot encoding:

Class     color=red  color=blue  color=green  color=NULL
red       1          0           0            0
blue      0          1           0            0
green     0          0           1            0
MISSING   0          0           0            1

• Only one feature is "hot" for each class
• This is the default
BigML, Inc 21 Logistic Regressions
LR: Field Codings
Dummy encoding:
• Chooses a *reference class* (here: red)
• Requires one less degree of freedom

Class     color_1  color_2  color_3
*red*     0        0        0
blue      1        0        0
green     0        1        0
MISSING   0        0        1
BigML, Inc 22 Logistic Regressions
LR: Field Codings
Contrast encoding:
• Field values must sum to zero
• Allows comparison between classes

Class     field   "influence"
red       0.5     positive
blue      -0.25   negative
green     -0.25   negative
MISSING   0       excluded
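The three codings can be sketched side by side (hypothetical helper functions, not BigML's API), for a field with classes red / blue / green:

```python
CLASSES = ["red", "blue", "green"]

def one_hot(value):
    # one column per class plus one for MISSING; exactly one is "hot"
    cols = CLASSES + ["MISSING"]
    key = value if value in CLASSES else "MISSING"
    return [1 if key == c else 0 for c in cols]

def dummy(value, reference="red"):
    # the reference class encodes as all zeros: one less column needed
    cols = [c for c in CLASSES if c != reference] + ["MISSING"]
    key = value if value in CLASSES else "MISSING"
    return [1 if key == c else 0 for c in cols]

# contrast coding: a single numeric column whose class values sum to zero
contrast = {"red": 0.5, "blue": -0.25, "green": -0.25}
```

One-hot and dummy produce indicator columns, while contrast collapses the field into one numeric column that encodes a hypothesis about how the classes relate.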
BigML, Inc 23 Logistic Regressions
LR: Field Codings
Which one to use?
• One-hot is the default: use this unless you have a specific need
• Dummy: use when there is a control group in mind, which becomes the reference class
• Contrast: allows testing specific hypotheses about relationships
  • Ex: customers give a "rating" of bad / ok / good

  rating   Contrast Encoding
  bad      -0.66
  ok       0.33
  good     0.33
  Hypothesis: a good and an ok review have the same impact, but a bad review has a negative impact twice as great.

  rating   Contrast Encoding
  bad      -0.5
  ok       0
  good     0.5
  Hypothesis: a good and a bad review have an equal but opposite impact, while an ok rating has no impact.
BigML, Inc 24 Logistic Regressions
LR: Field Codings
Text / Items:
• Text/Items field types are handled by creating a field per text token/item and setting it to 1 or 0

Text                                     "hippo"  "safari"  "zebra"
"we saw hippos and zebras…"              1        0         1
"The best safari for seeing zebras"      0        1         1
"The Oregon coast is rainy in winter"    0        0         0
"Have you ever tried a hippo burger"     1        0         0
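The table above can be reproduced with a naive token-presence encoder (a deliberately simplistic tokenizer for illustration; real text analysis also handles stemming, stop words, and so on):

```python
TOKENS = ["hippo", "safari", "zebra"]

def encode(text):
    words = text.lower().split()
    # 1 if any word starts with the token, so "zebras" matches "zebra"
    return [1 if any(w.startswith(t) for w in words) else 0 for t in TOKENS]

row = encode("we saw hippos and zebras")   # -> [1, 0, 1]
```

Each token column then behaves like any other numeric input to the regression, with its own coefficient.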
BigML, Inc 25 Logistic Regressions
LR Demo #2
BigML, Inc 26 Logistic Regressions
Curvilinear LR
• Logistic Regression expects a linear relationship between the features and the objective
  • Remember: it's a linear regression under the hood
  • This is actually pretty common in natural datasets
  • But non-linear relationships will impact model quality
• This can be addressed by adding non-linear transformations to the features
• Knowing which transformations requires:
  • domain knowledge
  • experimentation
  • or both
BigML, Inc 27 Logistic Regressions
Curvilinear LR
Instead of

  β₀ + β₁x₁

we could add a feature

  β₀ + β₁x₁ + β₂x₂

where

  x₂ ≡ x₁²

It is possible to add any higher-order terms or other functions to match the shape of the data.
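The engineered feature above can be demonstrated with hypothetical coefficients: once x₂ = x₁² is available as an input, a purely linear decision function can carve out the curved boundary x₁² = 1, which no single linear term in x₁ could.

```python
import math

def prob(features, coeffs, intercept):
    # linear function of the (possibly engineered) features, then logistic
    z = intercept + sum(b * f for b, f in zip(coeffs, features))
    return 1.0 / (1.0 + math.exp(-z))

def prob_curved(x1):
    # features [x1, x1**2] with b1 = 0, b2 = 4, intercept = -4:
    # P crosses 0.5 exactly where x1**2 = 1
    return prob([x1, x1 * x1], [0.0, 4.0], -4.0)

inside = prob_curved(0.5)      # |x1| < 1: below 0.5
outside = prob_curved(2.0)     # |x1| > 1: above 0.5
mirror = prob_curved(-2.0)     # symmetric point, also above 0.5
```

Because the squared feature treats x₁ and −x₁ identically, both tails of the curve end up on the same side of the boundary, which is precisely what a single linear term cannot express.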
BigML, Inc 28 Logistic Regressions
LR Demo #3
BigML, Inc 29 Logistic Regressions
LR vs DT

Logistic Regression:
• Expects a "smooth" linear relationship with the predictors
• Concerned with the probability of a discrete outcome
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, might work better when less data is available

Decision Tree:
• Adapts well to ragged non-linear relationships
• No concern: classification, regression, multi-class all fine
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to the parameter axes, but given enough data will discover any shape
BigML, Inc 30 Logistic Regressions
LR Demo #4
BigML, Inc 31 Logistic Regressions
Summary
• Logistic Regression is a classification algorithm that models the probabilities of each class
• How the algorithm works and why this is important
• It expects a linear relationship between the features and the objective, and how to fix it when that fails
• Categorical encodings
• LR outputs a set of coefficients, and how to interpret them:
  • Scale relates to size of impact
  • Sign relates to direction of impact
• Guidelines for comparing to Decision Trees