NOVEMBER 29, 2017
BigML, Inc 2
Ensembles
Making trees unstoppable
Poul Petersen
CIO, BigML, Inc
BigML, Inc 3
What is an Ensemble?
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?
BigML, Inc 4
No Model is Perfect
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• DT/NN can model any decision boundary given enough
training data, but finding the optimal model is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…
BigML, Inc 5
No Data is Perfect
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model by causing it to overfit
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there are always mistakes in your data
BigML, Inc 6
Ensemble Techniques
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• We want to ensure diversity: it’s not useful to build an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms
BigML, Inc 7
Simple Example
BigML, Inc 8
Simple Example
BigML, Inc 9
Simple Example
Partition the data… then model each partition…
For predictions, use the model for the same partition
BigML, Inc 10
Decision Forest
DATASET → SAMPLE 1…4 → MODEL 1…4 → PREDICTION 1…4 → COMBINER → PREDICTION
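In code, this flow is plain bagging: train one tree per bootstrap sample, then combine the votes. A minimal sketch, with scikit-learn trees standing in for BigML models (that substitution, and the function names, are assumptions, not the deck's API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_decision_forest(X, y, n_models=4, seed=0):
    """One tree per bootstrap sample of the dataset."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # SAMPLE k: draw with replacement
        forest.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # MODEL k
    return forest

def combine(forest, X):
    """COMBINER: plurality vote over the per-tree predictions.
    Assumes integer-coded class labels."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```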
BigML, Inc 11
Ensembles Demo #1
BigML, Inc 12
Decision Forest Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Number of models: How many trees to build
• Sampling options:
• Deterministic / Random
• Replacement:
• Allows sampling the same instance more than once
• A 100% sample with replacement covers ≈ 63.21% of the
distinct instances, so it is effectively a 63.21% sample
(see the check below)
• “Full size” samples with zero covariance (good thing)
• At prediction time
• Combiner…
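The ≈ 63.21% figure is the expected coverage of a full-size sample taken with replacement: each instance is missed with probability (1 − 1/n)^n → 1/e, so it appears at least once with probability ≈ 1 − 1/e ≈ 0.6321. A quick simulation to check it (the simulation is mine, not from the deck):

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(42)
bootstrap = rng.integers(0, n, size=n)   # n draws with replacement
print(np.unique(bootstrap).size / n)     # ≈ 0.6321, i.e. 1 - 1/e
```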
BigML, Inc 13
Quick Review

Classification (the label is a category):

| animal | state | … | proximity | action |
| tiger | hungry | … | close | run |
| elephant | happy | … | far | take picture |
| … | … | … | … | … |

Regression (the label is a number):

| animal | state | … | proximity | min_kmh |
| tiger | hungry | … | close | 70 |
| hippo | angry | … | far | 10 |
| … | … | … | … | … |

In both cases the last column is the label.
BigML, Inc 14
Ensemble Combiners
• Regression: Average of the predictions, plus the expected error
• Classification:
• Plurality - majority wins.
• Confidence Weighted - majority wins, but each vote is
weighted by the tree’s confidence.
• Probability Weighted - each tree votes the class distribution at
its leaf node.
• K Threshold - votes the specified class only if the required
number of trees agrees. For example, allowing a “True”
vote if and only if at least 9 out of 10 trees vote “True”.
• Confidence Threshold - votes the specified class only if
the minimum confidence is met.
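A sketch of how a few of these combiners could be implemented, assuming each tree returns a (class, confidence) pair; the function names are mine, not BigML's:

```python
from collections import Counter, defaultdict

def plurality(votes):
    """votes: list of (class, confidence) pairs, one per tree."""
    return Counter(cls for cls, _ in votes).most_common(1)[0][0]

def confidence_weighted(votes):
    """Each vote counts in proportion to the tree's confidence."""
    totals = defaultdict(float)
    for cls, conf in votes:
        totals[cls] += conf
    return max(totals, key=totals.get)

def k_threshold(votes, cls, k):
    """Vote `cls` if and only if at least k trees voted for it."""
    return sum(1 for c, _ in votes if c == cls) >= k

votes = [("True", 0.9), ("True", 0.8), ("False", 0.6)]
print(plurality(votes), confidence_weighted(votes), k_threshold(votes, "True", 2))
```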
BigML, Inc 15
Ensembles Demo #2
BigML, Inc 16
Outlier Example

| Diameter | Color | Shape | Fruit |
| 4 | red | round | plum |
| 5 | red | round | apple |
| 5 | red | round | apple |
| 6 | red | round | plum |
| 7 | red | round | apple |

What is a round, red 6cm fruit?
All Data: “plum”
Sample 1: “plum”
Sample 2: “apple”
Sample 3: “apple”
Samples combined: “apple”
BigML, Inc 17
Random Decision Forest
DATASET → SAMPLE 1…4 → MODEL 1…4 → PREDICTION 1…4 → COMBINER → PREDICTION
(at each split, each model considers only a random subset of the features)
BigML, Inc 18
RDF Config
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
• Decision Forest parameters still available
• Number of models, sampling, etc.
• Random candidates:
• The number of features to consider at each split
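In scikit-learn terms (an analogy, not BigML's implementation), "random candidates" corresponds to the max_features parameter of a random forest:

```python
from sklearn.ensemble import RandomForestClassifier

rdf = RandomForestClassifier(
    n_estimators=100,     # number of models
    max_features="sqrt",  # "random candidates": features considered per split
    bootstrap=True,       # sampling with replacement, as in a Decision Forest
)
```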
BigML, Inc 19
Ensembles Demo #3
BigML, Inc 20
Boosting

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | LAST SALE PRICE |
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 360000 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 307500 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 600000 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 435350 |

MODEL 1 → PREDICTED SALE PRICE: 360750, 306875, 587500, 435350
ERROR: 750, -625, -12500, 0

The errors then become the objective for a second model:

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | ERROR |
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 12500 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 0 |

MODEL 2 → PREDICTED ERROR: 750, 625, 12393.83333, 6879.67857

Why stop at one iteration?
"Hey Model 1, what do you predict is the sale price of this home?"
"Hey Model 2, how much error do you predict Model 1 just made?"
BigML, Inc 21
Boosting
Iteration 1: DATASET → MODEL 1 → PREDICTION 1
Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2
Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3
Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4
etc…
PREDICTION = SUM of PREDICTION 1…4
BigML, Inc 22
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
BigML, Inc 23
Boosting Config
Iteration 1: DATASET → MODEL 1 → PREDICTION 1
Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2
Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3
Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4
etc…
PREDICTION = SUM of PREDICTION 1…4
“OUT OF BAG” SAMPLES held out of each iteration’s dataset are used for the early-stop test
BigML, Inc 24
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
• Early holdout: tests with a portion of the dataset
• None: performs all iterations. Note: In general, it is better to
use a high number of iterations and let the early stopping
work.
BigML, Inc 25
Iterations
Boosted Ensemble #1: the early stop triggers before the 50-iteration limit is reached.
This is OK: the early stop means the iterative improvement is small, and we have "converged" before being forcibly stopped by the iteration limit.
Boosted Ensemble #2: the 50-iteration limit is reached before the early stop can trigger.
This is NOT OK: the hard limit on iterations stopped the boosting long before there were enough iterations to achieve the best quality.
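A sketch of the early-stopping loop, here using a holdout set (the "early holdout" option; the tolerance and the function name are my choices, not BigML's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_early_stop(X, y, X_hold, y_hold, max_iterations=50, tolerance=1e-4):
    """Boost until the holdout error stops improving or the limit is hit."""
    models, residual = [], np.asarray(y, dtype=float)
    hold_sum = np.zeros(len(y_hold))
    best = np.inf
    for _ in range(max_iterations):
        m = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        models.append(m)
        residual = residual - m.predict(X)
        hold_sum = hold_sum + m.predict(X_hold)
        err = np.mean((y_hold - hold_sum) ** 2)
        if best - err < tolerance:  # improvement too small: we have converged
            break
        best = err
    return models
```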
BigML, Inc 26
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
• Early holdout: tests with a portion of the dataset
• None: performs all iterations. Note: In general, it is better to
use a high number of iterations and let the early stopping
work.
• Learning Rate: Controls how aggressively boosting will fit the data:
• Larger values may fit more quickly, but risk overfitting
(see the sketch after this list)
• You can combine sampling with Boosting!
• Samples with Replacement
• Add Randomize
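A sketch extending the boost() loop above with a learning rate and per-iteration sampling with replacement (the defaults and function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_lr(X, y, n_iterations=50, learning_rate=0.1, sample_rate=1.0):
    rng = np.random.default_rng(0)
    models, residual = [], np.asarray(y, dtype=float)
    for _ in range(n_iterations):
        # sampling with replacement, as in a Decision Forest
        idx = rng.integers(0, len(X), size=int(sample_rate * len(X)))
        m = DecisionTreeRegressor(max_depth=3).fit(X[idx], residual[idx])
        models.append(m)
        # the learning rate shrinks each iteration's contribution
        residual = residual - learning_rate * m.predict(X)
    return models

def boost_lr_predict(models, X, learning_rate=0.1):
    return learning_rate * sum(m.predict(X) for m in models)
```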
BigML, Inc 27
Boosting Randomize
Iteration 1: DATASET → MODEL 1 → PREDICTION 1
Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2
Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3
Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4
etc…
PREDICTION = SUM of PREDICTION 1…4
With Randomize, each iteration’s model considers only a random subset of the features
BigML, Inc 29
Boosting Config
• Number of iterations - similar to number of models for DF/RDF
• Iterations can be limited with Early Stopping:
• Early out of bag: tests with the out-of-bag samples
• Early holdout: tests with a portion of the dataset
• None: performs all iterations. Note: In general, it is better to
use a high number of iterations and let the early stopping
work.
• Learning Rate: Controls how aggressively boosting will fit the data:
• Larger values ~ maybe quicker fit, but risk of overfitting
• You can combine sampling with Boosting!
• Samples with Replacement
• Add Randomize
• Individual tree parameters are still available
• Balanced objective, Missing splits, Node Depth, etc.
BigML, Inc 30
Ensembles Demo #4
BigML, Inc 31
Wait a Second…

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | diabetes |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | TRUE |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | FALSE |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | TRUE |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | FALSE |

MODEL 1 → predicted diabetes: TRUE, TRUE, FALSE, FALSE
ERROR: ?, ?, ?, ?
… what about classification?

Encoding the classes as numbers:

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | diabetes |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |

MODEL 1 → predicted diabetes: 1, 1, 0, 0
ERROR: 0, -1, 1, 0
… we could try
BigML, Inc 32
Wait a Second…

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED |

MODEL 1 → predicted favorite color: BLUE, GREEN, RED, GREEN
ERROR: ?, ?, ?, ?
… but then what about multiple classes?
BigML, Inc 33
Boosting Classification

Iteration 1

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED |

MODEL 1 (RED/NOT RED) → Class RED probability: 0.9, 0.7, 0.46, 0.12
Class RED ERROR: 0.1, -0.7, 0.54, -0.12
MODEL 1 (BLUE/NOT BLUE) → Class BLUE probability: 0.1, 0.3, 0.54, 0.88
Class BLUE ERROR: -0.1, 0.7, -0.54, 0.12
…and repeat for each class at each iteration

Iteration 2

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | ERROR |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0.1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | -0.7 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 0.54 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | -0.12 |

MODEL 2 (RED/NOT RED ERR) → PREDICTED ERROR: 0.05, -0.54, 0.32, -0.22
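In rough Python, this is one residual chain per class, with the summed per-class scores normalized into probabilities (a simplified sketch of the idea; a real implementation clips and calibrates these scores):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_per_class(X, labels, classes, n_iterations=4):
    """One boosted chain per class on the 0/1 'is this class' residuals."""
    labels = np.asarray(labels)
    chains = {}
    for cls in classes:
        residual = (labels == cls).astype(float)  # RED/NOT RED, BLUE/NOT BLUE, ...
        models = []
        for _ in range(n_iterations):
            m = DecisionTreeRegressor(max_depth=3).fit(X, residual)
            models.append(m)
            residual = residual - m.predict(X)
        chains[cls] = models
    return chains

def predict_probability(chains, X):
    """Sum each chain's predictions, then normalize across classes."""
    scores = {c: sum(m.predict(X) for m in ms) for c, ms in chains.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```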
BigML, Inc 34
Boosting Classification
Iteration 1: DATASET → MODELS 1 (per class) → PREDICTIONS 1 (per class)
Iteration 2: DATASETS 2 (per class) → MODELS 2 (per class) → PREDICTIONS 2 (per class)
Iteration 3: DATASETS 3 (per class) → MODELS 3 (per class) → PREDICTIONS 3 (per class)
Iteration 4: DATASETS 4 (per class) → MODELS 4 (per class) → PREDICTIONS 4 (per class)
etc…
Combiner → PROBABILITY per class
BigML, Inc 35
Ensembles Demo #5
BigML, Inc 36
Stacked Generalization
SOURCE → DATASET
DATASET → MODEL → BATCH PREDICTION → EXTENDED DATASET
DATASET → ENSEMBLE → BATCH PREDICTION → EXTENDED DATASET
DATASET → LOGISTIC REGRESSION → BATCH PREDICTION → EXTENDED DATASET
EXTENDED DATASET → LOGISTIC REGRESSION
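A sketch of the flow above: level-0 models make batch predictions on the dataset, those predictions extend the feature set, and a logistic regression is trained on the extended dataset. As drawn, the level-0 predictions are made on the training data itself; real pipelines usually stack held-out predictions to avoid leakage. Numeric class labels are assumed so predictions can be reused as features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def stack(X, y):
    level0 = [
        DecisionTreeClassifier().fit(X, y),                 # MODEL
        RandomForestClassifier(n_estimators=50).fit(X, y),  # ENSEMBLE
        LogisticRegression(max_iter=1000).fit(X, y),        # LOGISTIC REGRESSION
    ]
    # BATCH PREDICTION columns appended -> EXTENDED DATASET
    extended = np.column_stack([X] + [m.predict(X) for m in level0])
    meta = LogisticRegression(max_iter=1000).fit(extended, y)
    return level0, meta

def stack_predict(level0, meta, X):
    extended = np.column_stack([X] + [m.predict(X) for m in level0])
    return meta.predict(extended)
```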
BigML, Inc 37
Which Ensemble Method
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomize features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially
BigML, Inc 38
Too Many Parameters?
• How many trees?
• How many nodes?
• Missing splits?
• Random candidates?
• Too many parameters?
SMACdown!
BigML, Inc 39
Summary
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation