Valencian Summer School in Machine Learning
3rd edition
September 14-15, 2017
BigML, Inc 2
Ensembles
Making trees unstoppable
Poul Petersen
CIO, BigML, Inc
What is an Ensemble?
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?
No Model is Perfect
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• DT/NN can model any decision boundary with enough
training data, but the solution is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…
No Data is Perfect
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model, by overfitting
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there are always mistakes in your data
Ensemble Techniques
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• We want to ensure diversity: it’s not useful to use an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms
Simple Example
Partition the data… then model each partition…
For predictions, use the model for the same partition.
Decision Forest

[Diagram: DATASET → SAMPLE 1–4 → MODEL 1–4 → PREDICTION 1–4 → COMBINER → PREDICTION]
Ensembles Demo #1
Decision Forest Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Number of models: How many trees to build
• Sampling options:
• Deterministic / Random
• Replacement:
• Allows sampling the same instance more than once
• A full-size sample with replacement effectively covers ≈ 63.21% of the distinct instances
• “Full size” samples with zero covariance (a good thing)
• At prediction time
• Combiner…
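The ≈ 63.21% figure can be checked with a quick sketch (plain Python, not BigML's implementation): sampling n instances with replacement touches about 1 − 1/e ≈ 63.2% of the distinct instances.

```python
import random

def unique_fraction(n, seed=0):
    """Sample n instances with replacement; return the fraction of
    distinct instances that appear at least once."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

# For large n this settles near 1 - 1/e ~= 0.6321
print(round(unique_fraction(100_000), 3))
```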
Quick Review

Classification (label: action)

| animal | state | … | proximity | action |
| tiger | hungry | … | close | run |
| elephant | happy | … | far | take picture |
| … | … | … | … | … |

Regression (label: min_kmh)

| animal | state | … | proximity | min_kmh |
| tiger | hungry | … | close | 70 |
| hippo | angry | … | far | 10 |
| … | … | … | … | … |
Ensemble Combiners
• Regression: Average of the predictions and expected error
• Classification:
• Plurality - majority wins.
• Confidence Weighted - majority wins but each vote is
weighted by the confidence.
• Probability Weighted - each tree votes the distribution at
its leaf node.
• K Threshold - votes the specified class only if the
required number of trees predicts it. For example, allowing
a “True” vote if and only if at least 9 out of 10 trees vote
“True”.
• Confidence Threshold - only votes the specified class if
the minimum confidence is met.
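To make the first two combiners concrete, here is a minimal sketch in plain Python (the vote format is a hypothetical (class, confidence) pair, not BigML's internal representation):

```python
from collections import defaultdict

def plurality(votes):
    """Majority wins: each tree casts one unweighted vote."""
    counts = defaultdict(int)
    for cls, _conf in votes:
        counts[cls] += 1
    return max(counts, key=counts.get)

def confidence_weighted(votes):
    """Majority wins, but each vote is weighted by its confidence."""
    weights = defaultdict(float)
    for cls, conf in votes:
        weights[cls] += conf
    return max(weights, key=weights.get)

# One very confident tree against two barely-confident ones:
votes = [("True", 0.99), ("False", 0.50), ("False", 0.45)]
print(plurality(votes))            # -> False (two votes to one)
print(confidence_weighted(votes))  # -> True (0.99 outweighs 0.50 + 0.45)
```

The example shows the two combiners can disagree: plurality counts heads, while confidence weighting lets one very sure tree outvote two unsure ones.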
Ensembles Demo #2
Outlier Example

| Diameter | Color | Shape | Fruit |
| 4 | red | round | plum |
| 5 | red | round | apple |
| 5 | red | round | apple |
| 6 | red | round | plum |
| 7 | red | round | apple |

What is a round, red 6cm fruit?

All Data: “plum”
Sample 1: “plum”
Sample 2: “apple”
Sample 3: “apple”
Combined: “apple”
Random Decision Forest

[Diagram: DATASET → SAMPLE 1–4 → MODEL 1–4 → PREDICTION 1–4 → COMBINER → PREDICTION, as for a Decision Forest]
RDF Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Decision Forest parameters still available
• Number of models, Sampling, etc.
• Random candidates:
• The number of features to consider at each split
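The random-candidates idea can be sketched in a few lines of plain Python (the feature names are hypothetical):

```python
import random

def candidate_features(features, random_candidates, rng):
    """Draw the random subset of features a single split may consider."""
    return rng.sample(features, random_candidates)

# Hypothetical feature names; each split in each tree draws its own
# candidate subset, which is what keeps the individual trees diverse.
features = ["diameter", "color", "shape", "weight", "texture"]
rng = random.Random(42)
print(candidate_features(features, 2, rng))
print(candidate_features(features, 2, rng))  # a (likely) different pair
```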
Ensembles Demo #3
Boosting

"Hey Model 1, what do you predict is the sale price of this home?"

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | LAST SALE PRICE | MODEL 1 PREDICTED SALE PRICE | ERROR |
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 360000 | 360750 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 307500 | 306875 | -625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 600000 | 587500 | -12500 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 435350 | 435350 | 0 |

"Hey Model 2, how much error do you predict Model 1 just made?"

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | ERROR | MODEL 2 PREDICTED ERROR |
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 750 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 625 | 625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 12500 | 12393.83333 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 0 | 6879.67857 |

Why stop at one iteration?
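The two questions above can be replayed numerically with the slide's values (plain Python; the trained models are represented only by their outputs):

```python
# The trained models are represented only by their outputs from the tables.
actual  = [360000, 307500, 600000, 435350]   # LAST SALE PRICE
model_1 = [360750, 306875, 587500, 435350]   # MODEL 1 predicted price

# "Hey Model 1, what do you predict?" -> its error becomes a new objective:
errors = [p - a for p, a in zip(model_1, actual)]
print(errors)  # [750, -625, -12500, 0]

# If Model 2 predicted these errors perfectly, correcting Model 1's
# predictions by them would recover the actual prices exactly:
corrected = [p - e for p, e in zip(model_1, errors)]
assert corrected == actual
```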
Boosting

[Diagram: DATASET → MODEL 1 (Iteration 1), DATASET 2 → MODEL 2 (Iteration 2), DATASET 3 → MODEL 3 (Iteration 3), DATASET 4 → MODEL 4 (Iteration 4), etc.; PREDICTION 1–4 → SUM → PREDICTION]
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
Boosting Config

[Diagram: the same boosting pipeline, with the “OUT OF BAG” SAMPLES from each iteration's dataset held out and used for the early-stopping tests]
Boosting Config
• Early holdout: tests with a portion of the dataset
• None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work.
Iterations

Boosted Ensemble #1: the early stop is reached before the iteration limit

1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50

This is OK because the early stop means the iterative improvement is small and we have "converged" before being forcibly stopped by the iteration limit.

Boosted Ensemble #2: the iteration limit is reached before the early stop

1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50

This is NOT OK because the hard limit on iterations stopped the boosting long before there were enough iterations to achieve the best quality.
Boosting Config
• Learning Rate: Controls how aggressively boosting will fit the data:
• Larger values ~ maybe quicker fit, but risk of overfitting
• You can combine sampling with Boosting!
• Samples with Replacement
• Add Randomize
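A minimal boosting loop showing the learning-rate tradeoff (plain Python; `mean_fit` is a stand-in for training a tree on the residuals, so predictions can only converge to the target mean):

```python
def boost(targets, fit, iterations, learning_rate):
    """Minimal boosting sketch: each iteration fits a model to the
    current residuals and adds a damped correction."""
    predictions = [0.0] * len(targets)
    for _ in range(iterations):
        residuals = [t - p for t, p in zip(targets, predictions)]
        correction = fit(residuals)
        predictions = [p + learning_rate * c
                       for p, c in zip(predictions, correction)]
    return predictions

# Stand-in for training a tree: just predict the mean residual.
def mean_fit(residuals):
    return [sum(residuals) / len(residuals)] * len(residuals)

targets = [10.0, 20.0, 30.0]
# learning_rate=1.0 closes the gap (toward the target mean) in one step;
# 0.5 needs several iterations to get equally close.
print(boost(targets, mean_fit, 1, 1.0))
print(boost(targets, mean_fit, 10, 0.5))
```

With a real tree as `fit`, each iteration would also capture per-row structure; the damping is what trades fitting speed against overfitting risk.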
Boosting Randomize

[Diagram: the same boosting pipeline (DATASET → MODEL 1, DATASET 2 → MODEL 2, …, PREDICTION 1–4 → SUM → PREDICTION), with randomization added to each iteration's model]
Boosting Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
Ensembles Demo #4
Wait a Second… what about classification?

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | diabetes | MODEL 1 predicted diabetes | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | TRUE | TRUE | ? |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | FALSE | TRUE | ? |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | TRUE | FALSE | ? |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | FALSE | FALSE | ? |

… we could try coding the classes as numbers:

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | diabetes | MODEL 1 predicted diabetes | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 1 | 0 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 1 | -1 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 0 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0 | 0 |
Wait a Second… but then what about multiple classes?

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color | MODEL 1 predicted favorite color | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED | BLUE | ? |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN | GREEN | ? |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE | RED | ? |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED | GREEN | ? |
Boosting Classification

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED |

Iteration 1: MODEL 1 (RED/NOT RED)
Class RED Probability: 0.9, 0.7, 0.46, 0.12
Class RED ERROR: 0.1, -0.7, 0.54, -0.12

Iteration 2: MODEL 2 (RED/NOT RED ERR), trained with the Class RED ERROR as the objective

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0.1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | -0.7 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 0.54 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | -0.12 |

PREDICTED ERROR: 0.05, -0.54, 0.32, -0.22

MODEL 1 (BLUE/NOT BLUE)
Class BLUE Probability: 0.1, 0.3, 0.54, 0.88
Class BLUE ERROR: -0.1, 0.7, -0.54, 0.12

…and repeat for each class at each iteration
Boosting Classification

[Diagram: DATASET → MODELS 1 per class → PREDICTIONS 1 per class (Iteration 1); DATASETS 2 per class → MODELS 2 per class → PREDICTIONS 2 per class (Iteration 2); likewise for Iterations 3, 4, etc.; a combiner turns the per-class predictions into a PROBABILITY per class]
Ensembles Demo #5
Stacked Generalization

[Diagram: SOURCE → DATASET → MODEL, ENSEMBLE, and LOGISTIC REGRESSION; each produces a BATCH PREDICTION, giving an EXTENDED DATASET; a final LOGISTIC REGRESSION is trained on the extended dataset]
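A sketch of the extended-dataset step (plain Python; the three base models are hypothetical stand-ins returning fixed probabilities, not trained BigML resources):

```python
# Rows of the source dataset (feature x, objective y):
rows = [
    {"x": 1.0, "y": "True"},
    {"x": 2.0, "y": "False"},
]

# Hypothetical stand-ins for the trained model, ensemble and logistic
# regression: each maps a row to a probability of the "True" class.
base_models = {
    "model":    lambda row: 0.9 if row["x"] < 1.5 else 0.2,
    "ensemble": lambda row: 0.8 if row["x"] < 1.5 else 0.3,
    "lr":       lambda row: 0.7 if row["x"] < 1.5 else 0.1,
}

# The batch predictions become new columns: the "extended dataset"
# on which the final (meta) logistic regression would be trained.
extended = [
    {**row, **{name: m(row) for name, m in base_models.items()}}
    for row in rows
]
print(extended[0])  # original features plus one prediction per base model
```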
Which Ensemble Method
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomize features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially
Too Many Parameters?
• How many trees?
• How many nodes?
• Missing splits?
• Random candidates?
• Too many parameters?
SMACdown!
Summary
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation
Logistic Regressions
Modeling probabilities
Poul Petersen
CIO, BigML, Inc
Logistic Regression

Potential Confusion: Logistic Regression is a classification algorithm.

• Classification implies a discrete objective. How can this be a regression?
• Why do we need another classification algorithm?
• more questions…
Linear Regression
Polynomial Regression
Regression

Key Take-Away: Regression is the process of "fitting" a function to the data.

• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
• Problem:
• What if we want to do a classification problem: T/F or 1/0?
• What function can we fit to discrete data?
Discrete Data Function?
Logistic Function

Goal:

f(x) = 1 / (1 + e⁻ˣ)

as x → −∞, f(x) → 0
as x → +∞, f(x) → 1

• Looks promising, but still not "discrete"
• What about the "green" in the middle?
• Let’s change the problem…
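The function and its limits, as a sketch:

```python
import math

def logistic(x):
    """f(x) = 1 / (1 + e^(-x)): squashes any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(-10))  # close to 0
print(logistic(0))    # exactly 0.5
print(logistic(10))   # close to 1
```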
Modeling Probabilities

P ≈ 0        0 < P < 1        P ≈ 1
Logistic Regression

Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective.

Caveats:
• Assumes that the output is linearly related to the "predictors"
• What? (hang in there…)
• Sometimes we can "fix" this with feature engineering
• Question: how do we "fit" the logistic function to real data?
Logistic Regression

Given training data consisting of inputs x and probabilities P, fit the logistic function:

P(x) = 1 / (1 + e^−(β₀ + β₁x))

• β₀ is the "intercept", β₁ is the "coefficient"
• Solve for β₀ and β₁ to fit the logistic function
• How? The inverse of the logistic function is called the "logit":

ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• In which case solving is now a linear regression
• But this is only one dimension, that is, one feature x…
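A numeric check of the logit relationship (β₀, β₁ and x here are made-up values):

```python
import math

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def logit(p):
    """ln(p / (1 - p)): the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Applying the logit to the logistic output recovers the linear part,
# which is why the fit reduces to a linear problem:
b0, b1, x = -1.0, 2.0, 0.75
p = logistic(x, b0, b1)
print(logit(p))  # ~0.5, i.e. b0 + b1 * x
```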
Logistic Regression

For "i" dimensions, X = [x₁, x₂, ⋯, xᵢ], we solve:

P(X) = 1 / (1 + e^−f(X))

where:

f(X) = β₀ + 𝞫·X = β₀ + β₁x₁ + ⋯ + βᵢxᵢ
Interpreting Coefficients

• LR computes β₀ and a coefficient βⱼ for each feature xⱼ
• negative βⱼ → negatively correlated: xⱼ↑ then P(X)↓
• positive βⱼ → positively correlated: xⱼ↑ then P(X)↑
• "larger" βⱼ → more impact: a small change in xⱼ produces a large change in P(X)
• "smaller" βⱼ → less impact: a large change in xⱼ produces only a small change in P(X)
• βⱼ "size" should not be confused with field importance
• Can include a coefficient for "missing" (if enabled):
f(X) = β₀ + ⋯ + βⱼxⱼ + βⱼ₊₁[xⱼ ≡ Missing] + ⋯
• Binary classification (true/false) coefficients are complementary:
P(True) ≡ 1 − P(False)
LR Demo #1
LR Parameters
1. Default Numeric: Replaces missing numeric values
2. Missing Numeric: Adds a field for missing numerics
3. Stats: Extended statistics, ex: p-value (runs slower)
4. Bias: Enables/Disables the intercept term - 𝛽₀
• Don’t disable this…
5. Regularization: Reduces over-fitting by minimizing 𝛽𝑗
• L1: prefers reducing individual coefficients
• L2 (default): prefers reducing all coefficients
6. Strength "C": Higher values reduce regularization
7. EPS: The minimum error improvement between steps before stopping
Larger values stop earlier but quality may be lower
8. Auto-scaling: Ensures that all features contribute equally
• Don’t change this unless you have a specific reason
LR Questions

Questions:
• How do we handle multiple classes?
• A binary class (True/False) only needs to solve for one: P(True) ≡ 1 − P(False)
• What about non-numeric inputs?
• Text/Items fields
• Categorical fields
LR - Multi Class

• Instead of a binary class, ex: [ true, false ], we have multi-class, ex: [ red, green, blue, … ]
• "k" classes: C = [c₁, c₂, ⋯, cₖ]
• solve a one-vs-rest LR for each class:

ln( P(c₁) / (1 − P(c₁)) ) = β₁,₀ + 𝞫₁·X
ln( P(c₂) / (1 − P(c₂)) ) = β₂,₀ + 𝞫₂·X
⋯
ln( P(cₖ) / (1 − P(cₖ)) ) = βₖ,₀ + 𝞫ₖ·X

• Result: 𝞫ⱼ for each class cⱼ
• apply a combiner to ensure all probabilities add to 1
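The one-vs-rest step plus the normalizing combiner can be sketched as follows (the linear scores are made-up values, and BigML's actual combiner may differ):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up one-vs-rest linear scores (beta_k,0 + beta_k . X) for one input:
scores = {"red": 1.2, "green": -0.4, "blue": 0.1}

# Each class gets its own logistic output; these need not sum to 1:
raw = {c: logistic(z) for c, z in scores.items()}

# A simple combiner: normalize so the probabilities add to 1.
total = sum(raw.values())
probabilities = {c: p / total for c, p in raw.items()}
print(probabilities)
```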
LR - Field Codings

• LR is expecting numeric values to perform the regression.
• How do we handle categorical values, or text?

One-hot encoding

| Class | color=red | color=blue | color=green | color=NULL |
| red | 1 | 0 | 0 | 0 |
| blue | 0 | 1 | 0 | 0 |
| green | 0 | 0 | 1 | 0 |
| MISSING | 0 | 0 | 0 | 1 |

• Only one feature is "hot" for each class
• This is the default
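A sketch of one-hot encoding for the table above (plain Python; the field naming is illustrative):

```python
def one_hot(value, classes):
    """One-hot encode a categorical value: one 0/1 field per class,
    plus a field for a missing value."""
    fields = classes + ["NULL"]
    hot = value if value in classes else "NULL"
    return {f"color={f}": int(f == hot) for f in fields}

classes = ["red", "blue", "green"]
print(one_hot("blue", classes))  # only color=blue is 1
print(one_hot(None, classes))    # only color=NULL is 1
```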
LR - Field Codings

Dummy Encoding

| Class | color_1 | color_2 | color_3 |
| *red* | 0 | 0 | 0 |
| blue | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| MISSING | 0 | 0 | 1 |

• Chooses a *reference class*, here red
• Requires one fewer degree of freedom
LR - Field Codings

Contrast Encoding

| Class | field | "influence" |
| red | 0.5 | positive |
| blue | -0.25 | negative |
| green | -0.25 | negative |
| MISSING | 0 | excluded |

• Field values must sum to zero
• Allows comparison between classes
LR - Field Codings

Which one to use?
• One-hot is the default
• Use this unless you have a specific need
• Dummy
• Use when there is a control group in mind, which becomes the reference class
• Contrast
• Allows for testing specific hypotheses about relationships
• Ex: customers give a "rating" of bad / ok / good

| rating | Contrast Encoding |
| bad | -0.66 |
| ok | 0.33 |
| good | 0.33 |

Hypothesis: a good and an ok review have the same impact, but a bad review has a negative impact twice as great.

| rating | Contrast Encoding |
| bad | -0.5 |
| ok | 0 |
| good | 0.5 |

Hypothesis: a good and a bad review have an equal but opposite impact, while an ok rating has no impact.
LR - Field Codings

Text / Items
• Text/Items field types are handled by creating a field for each text token/item and setting it to 1 or 0

| Text | "hippo" | "safari" | "zebra" |
| “we saw hippos and zebras…” | 1 | 0 | 1 |
| “The best safari for seeing zebras” | 0 | 1 | 1 |
| “The Oregon coast is rainy in winter” | 0 | 0 | 0 |
| “Have you ever tried a hippo burger” | 1 | 0 | 0 |
LR Demo #2
Curvilinear LR
• Logistic Regression is expecting a linear relationship
between the features and the objective
• Remember - it’s a linear regression under the hood
• This is actually pretty common in natural datasets
• But non-linear relationships will impact model quality
• This can be addressed by adding non-linear
transformations to the features
• Knowing which transformations to apply requires
• domain knowledge
• experimentation
• both
Curvilinear LR

Instead of

β₀ + β₁x₁

we could add a feature

β₀ + β₁x₁ + β₂x₂

where

x₂ ≡ x₁²

It is possible to add any higher-order terms or other functions to match the shape of the data.
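A sketch of the engineered feature (the coefficients β₀ = −5, β₂ = 1 are made-up values, not fitted ones):

```python
import math

def predict(features, b0, betas):
    """Logistic regression prediction for one row."""
    z = b0 + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

x1 = 3.0
engineered = [x1, x1 ** 2]   # x2 = x1^2, the added feature
# Made-up coefficients that put all the weight on the x1^2 term:
p = predict(engineered, b0=-5.0, betas=[0.0, 1.0])
print(round(p, 3))  # ~0.982: driven entirely by the quadratic term
```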
LR Demo #3
LR vs DT

Logistic Regression:
• Expects a "smooth" linear relationship with the predictors.
• LR is concerned with the probability of a discrete outcome.
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, might work better when less data is available.

Decision Tree:
• Adapts well to ragged non-linear relationships
• No concern: classification, regression, multi-class all fine.
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to the parameter axes, but given enough data will discover any shape.
LR Demo #4
Summary
• Logistic Regression is a classification algorithm that
models the probabilities of each class
• How the algorithm works and why this is important
• Expects a linear relationship between the features
and the objective, and how to fix it
• Categorical encodings
• LR outputs a set of coefficients, and how to interpret them:
• Scale relates to impact
• Sign relates to direction of impact
• Guidelines for comparing to Decision Trees
VSSML17 L2. Ensembles and Logistic Regressions

VSSML17 L2. Ensembles and Logistic Regressions

  • 1.
    Valencian Summer Schoolin Machine Learning 3rd edition September 14-15, 2017
  • 2.
    BigML, Inc 2 Ensembles Makingtrees unstoppable Poul Petersen CIO, BigML, Inc
  • 3.
    BigML, Inc 3Ensembles whatis an Ensemble? • Rather than build a single model… • Combine the output of several typically “weaker” models into a powerful ensemble… • Q1: Why is this necessary? • Q2: How do we build “weaker” models? • Q3: How do we “combine” models?
  • 4.
    BigML, Inc 4Ensembles NoModel is Perfect • A given ML algorithm may simply not be able to exactly model the “real solution” of a particular dataset. • Try to fit a line to a curve • Even if the model is very capable, the “real solution” may be elusive • DT/NN can model any decision boundary with enough training data, but the solution is NP-hard • Practical algorithms involve random processes and may arrive at different, yet equally good, “solutions” depending on the starting conditions, local optima, etc. • If that wasn’t bad enough…
  • 5.
    BigML, Inc 5Ensembles NoData is Perfect • Not enough data! • Always working with finite training data • Therefore, every “model” is an approximation of the “real solution” and there may be several good approximations. • Anomalies / Outliers • The model is trying to generalize from discrete training data. • Outliers can “skew” the model, by overfitting • Mistakes in your data • Does the model have to do everything for you? • But really, there is always mistakes in your data
  • 6.
    BigML, Inc 6Ensembles EnsemblesTechniques • Key Idea: • By combining several good “models”, the combination may be closer to the best possible “model” • we want to ensure diversity. It’s not useful to use an ensemble of 100 models that are all the same • Training Data Tricks • Build several models, each with only some of the data • Introduce randomness directly into the algorithm • Add training weights to “focus” the additional models on the mistakes made • Prediction Tricks • Model the mistakes • Model the output of several different algorithms
  • 7.
  • 8.
  • 9.
    BigML, Inc 9Ensembles SimpleExample Partition the data… then model each partition… For predictions, use the model for the same partition ?
  • 10.
    BigML, Inc 10Ensembles DecisionForest MODEL 1 DATASET SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 MODEL 2 MODEL 3 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION COMBINER
  • 11.
  • 12.
    BigML, Inc 12Ensembles DecisionForest Config • Individual tree parameters are still available • Balanced objective, Missing splits, Node Depth, etc. • Number of models: How many trees to build • Sampling options: • Deterministic / Random • Replacement: • Allows sampling the same instance more than once • Effectively the same as ≈ 63.21% • “Full size” samples with zero covariance (good thing) • At prediction time • Combiner…
  • 13.
    BigML, Inc 13Ensembles QuickReview animal state … proximity action tiger hungry … close run elephant happy … far take picture … … … … … Classification animal state … proximity min_kmh tiger hungry … close 70 hippo angry … far 10 … …. … … … Regression label
  • 14.
    BigML, Inc 14Ensembles EnsembleCombiners • Regression: Average of the predictions and expected error • Classification: • Plurality - majority wins. • Confidence Weighted - majority wins but each vote is weighted by the confidence. • Probability Weighted - each tree votes the distribution at it’s leaf node. • K Threshold - only votes if the specified class and required number of trees is met. For example, allowing a “True” vote if and only if at least 9 out of 10 trees vote “True”. • Confidence Threshold - only votes the specified class if the minimum confidence is met.
  • 15.
  • 16.
    BigML, Inc 16Ensembles OutlierExample Diameter Color Shape Fruit 4 red round plum 5 red round apple 5 red round apple 6 red round plum 7 red round apple All Data: “plum” Sample 2: “apple” Sample 3: “apple” Sample 1: “plum” }“apple” What is a round, red 6cm fruit?
  • 17.
    BigML, Inc 17Ensembles RandomDecision Forest MODEL 1 DATASET SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 MODEL 2 MODEL 3 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 SAMPLE 1 PREDICTION COMBINER
  • 18.
    BigML, Inc 18Ensembles RDFConfig • Individual tree parameters are still available • Balanced objective, Missing splits, Node Depth, etc. • Decision Forest parameters still available • Number of model, Sampling, etc • Random candidates: • The number of features to consider at each split
  • 19.
  • 20.
    BigML, Inc 20Ensembles Boosting ADDRESSBEDS BATHS SQFT LOT SIZE YEAR BUILT LATITUDE LONGITUDE LAST SALE PRICE 1522 NW Jonquil 4 3 2424 5227 1991 44.594828 -123.269328 360000 7360 NW Valley Vw 3 2 1785 25700 1979 44.643876 -123.238189 307500 4748 NW Veronica 5 3.5 4135 6098 2004 44.5929659 -123.306916 600000 411 NW 16th 3 2825 4792 1938 44.570883 -123.272113 435350 MODEL 1 PREDICTED SALE PRICE 360750 306875 587500 435350 ERROR 750 -625 -12500 0 ADDRESS BEDS BATHS SQFT LOT SIZE YEAR BUILT LATITUDE LONGITUDE ERROR 1522 NW Jonquil 4 3 2424 5227 1991 44.594828 -123.269328 750 7360 NW Valley Vw 3 2 1785 25700 1979 44.643876 -123.238189 625 4748 NW Veronica 5 3.5 4135 6098 2004 44.5929659 -123.306916 12500 411 NW 16th 3 2825 4792 1938 44.570883 -123.272113 0 MODEL 2 PREDICTED ERROR 750 625 12393.83333 6879.67857 Why  stop  at  one  iteration? "Hey Model 1, what do you predict is the sale price of this home?" "Hey Model 2, how much error do you predict Model 1 just made?"
  • 21.
    BigML, Inc 21Ensembles Boosting DATASETMODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 22.
    BigML, Inc 22Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples
  • 23.
    BigML, Inc 23Ensembles BoostingConfig “OUT OF BAG” SAMPLES DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 24.
    BigML, Inc 24Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples • Early holdout: tests with a portion of the dataset • None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work.
  • 25.
    BigML, Inc 25Ensembles Iterations Boosted  Ensemble  #1 1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50 Early Stop # Iterations 1 2 3 4 5 6 7 8 9 10 ….. 41 42 43 44 45 46 47 48 49 50 Boosted  Ensemble  #2 Early Stop# Iterations This is OK because the early stop means the iterative improvement is small and we have "converged" before being forcibly stopped by the # iterations This is NOT OK because the hard limit on iterations stopped improving the quality of the boosting long before there was enough iterations to have achieved the best quality.
  • 26.
    BigML, Inc 26Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples • Early holdout: tests with a portion of the dataset • None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work. • Learning Rate: Controls how aggressively boosting will fit the data: • Larger values ~ maybe quicker fit, but risk of overfitting • You can combine sampling with Boosting! • Samples with Replacement • Add Randomize
  • 27.
    BigML, Inc 27Ensembles BoostingRandomize DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 28.
    BigML, Inc 28Ensembles BoostingRandomize DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration  1 Iteration  2 Iteration  3 Iteration  4   etc…
  • 29.
    BigML, Inc 29Ensembles BoostingConfig • Number of iterations - similar to number of models for DF/RDF • Iterations can be limited with Early Stopping: • Early out of bag: tests with the out-of-bag samples • Early holdout: tests with a portion of the dataset • None: performs all iterations. Note: In general, it is better to use a high number of iterations and let the early stopping work. • Learning Rate: Controls how aggressively boosting will fit the data: • Larger values ~ maybe quicker fit, but risk of overfitting • You can combine sampling with Boosting! • Samples with Replacement • Add Randomize • Individual tree parameters are still available • Balanced objective, Missing splits, Node Depth, etc.
  • 30.
  • 31.
    BigML, Inc 31Ensembles Waita Second… pregnancies plasma glucose blood pressure triceps skin thickness insulin bmi diabetes pedigree age diabetes 6 148 72 35 0 33.6 0.627 50 TRUE 1 85 66 29 0 26.6 0.351 31 FALSE 8 183 64 0 0 23.3 0.672 32 TRUE 1 89 66 23 94 28.1 0.167 21 FALSE MODEL 1 predicted diabetes TRUE TRUE FALSE FALSE ERROR ? ? ? ? …  what  about  classification? pregnancies plasma glucose blood pressure triceps skin thickness insulin bmi diabetes pedigree age diabetes 6 148 72 35 0 33.6 0.627 50 1 1 85 66 29 0 26.6 0.351 31 0 8 183 64 0 0 23.3 0.672 32 1 1 89 66 23 94 28.1 0.167 21 0 MODEL 1 predicted diabetes 1 1 0 0 ERROR 0 -1 1 0 …  we  could  try
  • 32.
    BigML, Inc 32Ensembles Waita Second… pregnancies plasma glucose blood pressure triceps skin thickness insulin bmi diabetes pedigree age favorite color 6 148 72 35 0 33.6 0.627 50 RED 1 85 66 29 0 26.6 0.351 31 GREEN 8 183 64 0 0 23.3 0.672 32 BLUE 1 89 66 23 94 28.1 0.167 21 RED MODEL 1 predicted favorite color BLUE GREEN RED GREEN ERROR ? ? ? ? …  but  then  what  about  multiple  classes?
  • 33.
BigML, Inc 33 Ensembles
Boosting Classification

favorite color: RED, GREEN, BLUE, RED

Iteration 1, MODEL 1 (RED / NOT RED):
  Class RED probability: 0.9, 0.7, 0.46, 0.12
  Class RED error:       0.1, -0.7, 0.54, -0.12

Iteration 2, MODEL 2 (RED / NOT RED, fit to the error):
  Predicted error: 0.05, -0.54, 0.32, -0.22

Iteration 1, MODEL 1 (BLUE / NOT BLUE):
  Class BLUE probability: 0.1, 0.3, 0.54, 0.88
  Class BLUE error:       -0.1, 0.7, -0.54, 0.12

…and repeat for each class at each iteration.
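The per-class residual scheme above can be sketched numerically (illustrative values, not BigML's implementation): the "error" a one-vs-rest model leaves behind is the class indicator (1 or 0) minus the probability it predicted.

```python
# Toy sketch of the residuals the next boosting iteration fits:
# residual = class indicator (1/0) - predicted class probability.
rows = ["RED", "GREEN", "BLUE", "RED"]        # actual favorite color
p_red = [0.9, 0.7, 0.46, 0.12]                # hypothetical RED/NOT RED output

is_red = [1.0 if c == "RED" else 0.0 for c in rows]
red_error = [t - p for t, p in zip(is_red, p_red)]
# the next RED/NOT RED model is then trained to predict `red_error`,
# and the same is done for every other class at every iteration
```

With this definition a confident correct prediction leaves a residual near 0, while a confident wrong prediction leaves a residual near ±1, which is exactly what the next model should focus on.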
BigML, Inc 34 Ensembles
Boosting Classification

Iteration 1: DATASET → MODELS (1 per class) → PREDICTIONS (1 per class)
Iteration 2: DATASETS (2 per class) → MODELS (2 per class) → PREDICTIONS (2 per class)
Iteration 3: DATASETS (3 per class) → MODELS (3 per class) → PREDICTIONS (3 per class)
Iteration 4: DATASETS (4 per class) → MODELS (4 per class) → PREDICTIONS (4 per class), etc…
Combiner → PROBABILITY per class
BigML, Inc 36 Ensembles
Stacked Generalization

SOURCE → DATASET → MODEL / ENSEMBLE / LOGISTIC REGRESSION
  → BATCH PREDICTION (one per model) → EXTENDED DATASET (one per model)
  → LOGISTIC REGRESSION (meta-model trained on the extended dataset)
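The pipeline above can be sketched in miniature (hypothetical stand-in models, not BigML's API): each base model's prediction is appended to every row as a new feature, producing the "extended dataset" on which the meta-level logistic regression is then trained.

```python
# Minimal stacked-generalization sketch with two stand-in base models.
def model_a(row):            # stand-in for a decision tree
    return 1.0 if row[0] > 0.5 else 0.0

def model_b(row):            # stand-in for an ensemble
    return round(0.8 * row[1], 2)

dataset = [[0.9, 0.2], [0.1, 0.7]]

# "batch prediction" step: extend every row with the base models' outputs
extended = [row + [model_a(row), model_b(row)] for row in dataset]
# a logistic regression (the meta-model) would now be trained on `extended`
```

The meta-model learns how much to trust each base model, which is what "modeling the models" means on the summary slide.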
BigML, Inc 37 Ensembles
Which Ensemble Method?
• The one that works best! Ok, but seriously: did you evaluate?
• For "large" / "complex" datasets: use DF/RDF with a deeper node threshold; even better, use Boosting with more iterations
• For "noisy" data: Boosting may overfit; RDF preferred
• For "wide" data: randomized features (RDF) will be quicker
• For "easy" data: a single model may be fine; bonus: it also has the best interpretability!
• For classification with a "large" number of classes: Boosting will be slower
• For "general" data: DF/RDF likely better than a single model or Boosting; Boosting will be slower since the models are processed serially
BigML, Inc 38 Ensembles
Too Many Parameters?
• How many trees? How many nodes? Missing splits? Random candidates?
• Too many parameters? SMACdown!
BigML, Inc 39 Ensembles
Summary
• Models have shortcomings: ability to fit, NP-hard, etc.
• Data has shortcomings: not enough, outliers, mistakes, etc.
• Ensemble techniques can improve on single models:
  • Sampling: partitioning, Decision Tree bagging
  • Adding randomness: RDF
  • Modeling the error: Boosting
  • Modeling the models: Stacking
• Guidelines for knowing which one might work best in a given situation
BigML, Inc 2 Logistic Regressions
Modeling probabilities
Poul Petersen
CIO, BigML, Inc
BigML, Inc 3 Logistic Regressions
Logistic Regression
Potential Confusion: Logistic Regression is a classification algorithm.
• Classification implies a discrete objective. How can this be a regression?
• Why do we need another classification algorithm?
• more questions…
BigML, Inc 4 Logistic Regressions
Linear Regression
BigML, Inc 5 Logistic Regressions
Linear Regression
BigML, Inc 6 Logistic Regressions
Polynomial Regression
BigML, Inc 7 Logistic Regressions
Regression
Key Take-Away: Regression is the process of "fitting" a function to the data.
• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
• Problem: what if we want to do a classification problem, T/F or 1/0? What function can we fit to discrete data?
BigML, Inc 8 Logistic Regressions
Discrete Data Function?
BigML, Inc 9 Logistic Regressions
Discrete Data Function?
BigML, Inc 10 Logistic Regressions
Logistic Function

  f(x) = 1 / (1 + e⁻ˣ)

• Goal: as x → −∞, f(x) → 0; as x → ∞, f(x) → 1
• Looks promising, but still not "discrete"
• What about the "green" in the middle?
• Let's change the problem…
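The limiting behavior above is easy to check numerically; a minimal sketch:

```python
import math

def logistic(x):
    # f(x) = 1 / (1 + e^-x): squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# f(0) = 0.5, and the tails approach 0 and 1 but never reach them
midpoint = logistic(0.0)
left_tail = logistic(-10.0)     # close to 0
right_tail = logistic(10.0)     # close to 1
```

Note the symmetry logistic(-x) = 1 - logistic(x), which is why the two tails mirror each other around the midpoint.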
BigML, Inc 11 Logistic Regressions
Modeling Probabilities
P ≈ 0 on the left, 0 < P < 1 in the middle, P ≈ 1 on the right
BigML, Inc 12 Logistic Regressions
Logistic Regression
Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective.
Caveats:
• Assumes that the output is linearly related to the "predictors"
  • What? (hang in there…)
  • Sometimes we can "fix" this with feature engineering
• Question: how do we "fit" the logistic function to real data?
BigML, Inc 13 Logistic Regressions
Logistic Regression
• Given training data consisting of inputs x and probabilities P
• Solve for β₀ and β₁ to fit the logistic function:

  P(x) = 1 / (1 + e^−(β₀ + β₁x))

• How? The inverse of the logistic function is called the "logit":

  ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• In which case solving is now a linear regression
• But this is only one dimension, that is, one feature x…
• β₀ is the "intercept", β₁ is the "coefficient"
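The logit trick above can be verified directly (toy values, not BigML's solver): applying the logit to a logistic output recovers the linear part β₀ + β₁x, which is why the fit reduces to a linear regression.

```python
import math

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def logit(p):
    # inverse of the logistic function: ln(p / (1 - p))
    return math.log(p / (1.0 - p))

# hypothetical coefficients and input
b0, b1, x = -1.0, 2.0, 0.75
p = logistic(x, b0, b1)          # a probability in (0, 1)
recovered = logit(p)             # back on the linear scale: b0 + b1*x
```

So if the training data supplies probabilities, taking their logit turns the curve-fitting problem into an ordinary straight-line fit.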
BigML, Inc 14 Logistic Regressions
Logistic Regression
For "i" dimensions, X = [x₁, x₂, ⋯, xᵢ], we solve:

  P(X) = 1 / (1 + e^−f(X))

where:

  f(X) = β₀ + 𝞫·X = β₀ + β₁x₁ + ⋯ + βᵢxᵢ
BigML, Inc 15 Logistic Regressions
Interpreting Coefficients
• LR computes β₀ and a coefficient βⱼ for each feature xⱼ
  • negative βⱼ → negatively correlated: xⱼ↑ then P(X)↓
  • positive βⱼ → positively correlated: xⱼ↑ then P(X)↑
  • "larger" βⱼ → more impact: a small increase in xⱼ moves P(X) a lot
  • "smaller" βⱼ → less impact: even a large increase in xⱼ moves P(X) only a little
  • βⱼ "size" should not be confused with field importance
• Can include a coefficient for "missing" (if enabled):
  P(X) = β₀ + ⋯ + βⱼxⱼ + ⋯ + βⱼ₊₁[xⱼ ≡ Missing]
• Binary classification (true/false) coefficients are complementary:
  P(True) ≡ 1 − P(False)
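The interpretation rules above can be checked numerically (hypothetical coefficients): the sign of a coefficient sets the direction of the effect, and its magnitude sets how strongly a step in the feature moves the probability.

```python
import math

def prob(x, b0, b1):
    # single-feature logistic regression: P = 1 / (1 + e^-(b0 + b1*x))
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# negative coefficient: increasing x decreases P(X)
goes_down = prob(2.0, 0.0, -1.5) < prob(1.0, 0.0, -1.5)
# positive coefficient: increasing x increases P(X)
goes_up = prob(2.0, 0.0, 1.5) > prob(1.0, 0.0, 1.5)
# larger |coefficient|: the same step in x moves P(X) further
small_step = prob(1.0, 0.0, 0.1) - prob(0.0, 0.0, 0.1)
large_step = prob(1.0, 0.0, 2.0) - prob(0.0, 0.0, 2.0)
```

This is also why coefficient size alone is not field importance: the effect of βⱼ depends on the scale of xⱼ, which is what auto-scaling (next slide's parameter list) is meant to even out.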
BigML, Inc 16 Logistic Regressions
LR Demo #1
BigML, Inc 17 Logistic Regressions
LR Parameters
1. Default Numeric: replaces missing numeric values
2. Missing Numeric: adds a field for missing numerics
3. Stats: extended statistics, e.g. p-value (runs slower)
4. Bias: enables/disables the intercept term β₀
   • Don't disable this…
5. Regularization: reduces over-fitting by minimizing βⱼ
   • L1: prefers reducing individual coefficients
   • L2 (default): prefers reducing all coefficients
6. Strength "C": higher values reduce regularization
7. EPS: the minimum improvement between steps before stopping; larger values stop earlier, but quality may be lower
8. Auto-scaling: ensures that all features contribute equally
   • Don't change this unless you have a specific reason
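The EPS parameter (item 7) is worth a concrete look. The sketch below is a toy batch gradient descent for a one-feature logistic regression (illustrative only, not BigML's optimizer, and the training data is made up): training stops once the per-step improvement in the loss drops below EPS, so a larger EPS stops earlier.

```python
import math

# hypothetical overlapping 0/1 training data
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

def fit(eps, lr=0.5, max_steps=100_000):
    """Return the number of steps taken before improvement < eps."""
    b0 = b1 = 0.0
    prev_loss = float("inf")
    for step in range(max_steps):
        g0 = g1 = loss = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y                     # gradient w.r.t. intercept
            g1 += (p - y) * x               # gradient w.r.t. coefficient
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
        if prev_loss - loss < eps:          # improvement below EPS: stop
            return step
        prev_loss = loss
        b0 -= lr * g0 / len(xs)
        b1 -= lr * g1 / len(xs)
    return max_steps

steps_loose = fit(eps=1e-2)    # stops early
steps_tight = fit(eps=1e-8)    # runs longer for a closer fit
```

The trade-off in the slide falls out directly: the loose tolerance finishes in fewer steps at the cost of a less converged model.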
BigML, Inc 18 Logistic Regressions
LR Questions
Questions:
• How do we handle multiple classes?
  • A binary class (True/False) only needs to solve for one: P(True) ≡ 1 − P(False)
• What about non-numeric inputs?
  • Text/Items fields
  • Categorical fields
BigML, Inc 19 Logistic Regressions
LR: Multi Class
• Instead of a binary class, e.g. [true, false], we have multi-class, e.g. [red, green, blue, …]
• "k" classes: C = [c₁, c₂, ⋯, c_k]
• Solve one-vs-rest LR:

  ln( P(c₁) / (1 − P(c₁)) ) = β₁,₀ + 𝞫₁·X
  ln( P(c₂) / (1 − P(c₂)) ) = β₂,₀ + 𝞫₂·X
  ⋯
  ln( P(c_k) / (1 − P(c_k)) ) = β_k,₀ + 𝞫_k·X

• Result: 𝞫ⱼ for each class cⱼ
• Apply a combiner to ensure all probabilities add to 1
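The combiner step above can be sketched with hypothetical numbers: each one-vs-rest model produces an independent probability, so the raw outputs need not sum to 1; dividing by their total fixes that.

```python
# Hypothetical raw one-vs-rest outputs for a 3-class problem.
raw = {"red": 0.6, "green": 0.3, "blue": 0.3}

# normalize so the per-class probabilities sum to 1
total = sum(raw.values())
combined = {c: p / total for c, p in raw.items()}
# the predicted class is then the one with the highest combined probability
```

Here 0.6 + 0.3 + 0.3 = 1.2, so red's combined probability becomes 0.6 / 1.2 = 0.5.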
BigML, Inc 20 Logistic Regressions
LR: Field Codings
• LR expects numeric values to perform the regression
• How do we handle categorical values, or text?

One-hot encoding:

Class     color=red  color=blue  color=green  color=NULL
red       1          0           0            0
blue      0          1           0            0
green     0          0           1            0
MISSING   0          0           0            1

• Only one feature is "hot" for each class
• This is the default
BigML, Inc 21 Logistic Regressions
LR: Field Codings
Dummy encoding:
• Chooses a *reference class* (here: red)
• Requires one less degree of freedom

Class     color_1  color_2  color_3
*red*     0        0        0
blue      1        0        0
green     0        1        0
MISSING   0        0        1
BigML, Inc 22 Logistic Regressions
LR: Field Codings
Contrast encoding:
• Field values must sum to zero
• Allows comparison between classes

Class     field   "influence"
red       0.5     positive
blue      -0.25   negative
green     -0.25   negative
MISSING   0       excluded
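The three codings can be sketched side by side (hypothetical helper functions, not BigML's API), for a field with classes red / blue / green:

```python
CLASSES = ["red", "blue", "green"]

def one_hot(value):
    # one column per class plus one for MISSING; exactly one is "hot"
    cols = CLASSES + ["MISSING"]
    key = value if value in CLASSES else "MISSING"
    return [1 if key == c else 0 for c in cols]

def dummy(value, reference="red"):
    # the reference class encodes as all zeros: one less column needed
    cols = [c for c in CLASSES if c != reference] + ["MISSING"]
    key = value if value in CLASSES else "MISSING"
    return [1 if key == c else 0 for c in cols]

# contrast coding: a single numeric column whose class values sum to zero
contrast = {"red": 0.5, "blue": -0.25, "green": -0.25}
```

One-hot and dummy produce indicator columns, while contrast collapses the field into one numeric column that encodes a hypothesis about how the classes relate.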
BigML, Inc 23 Logistic Regressions
LR: Field Codings
Which one to use?
• One-hot is the default: use this unless you have a specific need
• Dummy: use when there is a control group in mind, which becomes the reference class
• Contrast: allows testing specific hypotheses about relationships
  • Ex: customers give a "rating" of bad / ok / good

  rating   Contrast Encoding
  bad      -0.66
  ok       0.33
  good     0.33
  Hypothesis: a good and an ok review have the same impact, but a bad review has a negative impact twice as great.

  rating   Contrast Encoding
  bad      -0.5
  ok       0
  good     0.5
  Hypothesis: a good and a bad review have an equal but opposite impact, while an ok rating has no impact.
BigML, Inc 24 Logistic Regressions
LR: Field Codings
Text / Items:
• Text/Items field types are handled by creating a field per text token/item and setting it to 1 or 0

Text                                     "hippo"  "safari"  "zebra"
"we saw hippos and zebras…"              1        0         1
"The best safari for seeing zebras"      0        1         1
"The Oregon coast is rainy in winter"    0        0         0
"Have you ever tried a hippo burger"     1        0         0
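The table above can be reproduced with a naive token-presence encoder (a deliberately simplistic tokenizer for illustration; real text analysis also handles stemming, stop words, and so on):

```python
TOKENS = ["hippo", "safari", "zebra"]

def encode(text):
    words = text.lower().split()
    # 1 if any word starts with the token, so "zebras" matches "zebra"
    return [1 if any(w.startswith(t) for w in words) else 0 for t in TOKENS]

row = encode("we saw hippos and zebras")   # -> [1, 0, 1]
```

Each token column then behaves like any other numeric input to the regression, with its own coefficient.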
BigML, Inc 25 Logistic Regressions
LR Demo #2
BigML, Inc 26 Logistic Regressions
Curvilinear LR
• Logistic Regression expects a linear relationship between the features and the objective
  • Remember: it's a linear regression under the hood
  • This is actually pretty common in natural datasets
  • But non-linear relationships will impact model quality
• This can be addressed by adding non-linear transformations to the features
• Knowing which transformations requires:
  • domain knowledge
  • experimentation
  • or both
BigML, Inc 27 Logistic Regressions
Curvilinear LR
Instead of

  β₀ + β₁x₁

we could add a feature

  β₀ + β₁x₁ + β₂x₂

where

  x₂ ≡ x₁²

It is possible to add any higher-order terms or other functions to match the shape of the data.
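The engineered feature above can be demonstrated with hypothetical coefficients: once x₂ = x₁² is available as an input, a purely linear decision function can carve out the curved boundary x₁² = 1, which no single linear term in x₁ could.

```python
import math

def prob(features, coeffs, intercept):
    # linear function of the (possibly engineered) features, then logistic
    z = intercept + sum(b * f for b, f in zip(coeffs, features))
    return 1.0 / (1.0 + math.exp(-z))

def prob_curved(x1):
    # features [x1, x1**2] with b1 = 0, b2 = 4, intercept = -4:
    # P crosses 0.5 exactly where x1**2 = 1
    return prob([x1, x1 * x1], [0.0, 4.0], -4.0)

inside = prob_curved(0.5)      # |x1| < 1: below 0.5
outside = prob_curved(2.0)     # |x1| > 1: above 0.5
mirror = prob_curved(-2.0)     # symmetric point, also above 0.5
```

Because the squared feature treats x₁ and −x₁ identically, both tails of the curve end up on the same side of the boundary, which is precisely what a single linear term cannot express.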
BigML, Inc 28 Logistic Regressions
LR Demo #3
BigML, Inc 29 Logistic Regressions
LR vs DT

Logistic Regression:
• Expects a "smooth" linear relationship with the predictors
• Concerned with the probability of a discrete outcome
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, might work better when less data is available

Decision Tree:
• Adapts well to ragged non-linear relationships
• No concern: classification, regression, multi-class all fine
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to the parameter axes, but given enough data will discover any shape
BigML, Inc 30 Logistic Regressions
LR Demo #4
BigML, Inc 31 Logistic Regressions
Summary
• Logistic Regression is a classification algorithm that models the probabilities of each class
• How the algorithm works and why this is important
• It expects a linear relationship between the features and the objective, and how to fix it when that fails
• Categorical encodings
• LR outputs a set of coefficients, and how to interpret them:
  • Scale relates to size of impact
  • Sign relates to direction of impact
• Guidelines for comparing to Decision Trees