Our Winter 2017 release brings Boosted Trees, the latest resource that helps you easily solve classification and regression problems. With this machine learning technique, each new tree concentrates on the wrong predictions of the previously grown trees, correcting and improving on the mistakes made in earlier iterations. Boosted Trees are accessible from the BigML Dashboard as well as the API. Together with Bagging and Random Decision Forests, Boosted Trees complete the set of ensemble-based strategies on the BigML platform.
Winter 2017 Release

Speaker: Poul Petersen (CIO)
Moderator: Atakan Cetinsoy (VP Predictive Applications)
Questions: enter questions into the chat box; we'll answer some via chat, others at the end of the session.
Resources: https://bigml.com/releases
Contact: info@bigml.com | Twitter: @bigmlcom
Boosted Trees

• Ensemble method
• Supervised Learning
• Classification & Regression
• Now "BigML Easy"!

[Dashboard screenshots: the ensemble creation panel with NUMBER OF MODELS set to 1, 10, and 20]
Questions…

• Novice: Wait, what is Supervised Learning again?
• Intermediate: When are ensemble methods useful, and how are Boosted Trees different from the ensemble methods already available in BigML?
• Power User: How does Boosting work, and what do all these knobs do?
Supervised Learning

Each row is an instance; each column is a feature. The LAST SALE PRICE column is the label.

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | LAST SALE PRICE |
|---|---|---|---|---|---|---|---|---|
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 360000 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 307500 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 600000 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 435350 |
| 2975 NW Stewart | 3 | 1.5 | 1327 | 9148 | 1973 | 44.597319 | -123.25452 | 245000 |
| 4113 NW Peppertree | 4 | 2 | 1876 | 7841 | 1977 | 44.598141 | -123.29689 | 366000 |
| 2450 NW 13th | 3 | 2 | 1502 | 9148 | 1964 | 44.593322 | -123.267435 | 265000 |
| 3545 NE Canterbury | 3 | 1.5 | 1172 | 10019 | 1972 | 44.603042 | -123.236332 | 245000 |
| 1245 NW Kainui | 4 | 2 | 1864 | 49658 | 1974 | 44.651826 | -123.250615 | 341500 |

Supervised Learning: there is a column of data, the label, we want to predict.
Classification & Regression

Classification (the label, "action", is categorical):

| animal | state | … | proximity | action |
|---|---|---|---|---|
| tiger | hungry | … | close | run |
| elephant | happy | … | far | take picture |
| … | … | … | … | … |

Regression (the label, "min_kmh", is numeric):

| animal | state | … | proximity | min_kmh |
|---|---|---|---|---|
| tiger | hungry | … | close | 70 |
| hippo | angry | … | far | 10 |
| … | … | … | … | … |
Evaluation

The housing dataset from the previous slide is split in two. The first five rows (1522 NW Jonquil through 2975 NW Stewart) are used to train the MODEL. The remaining four rows (4113 NW Peppertree, 2450 NW 13th, 3545 NE Canterbury, 1245 NW Kainui) are held out: the model predicts their sale prices, and the EVALUATION compares those predictions against the actual labels (366000, 265000, 245000, 341500).
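For reference, here is a minimal sketch of the same split-train-evaluate flow using the BigML Python bindings. The sampling arguments follow the BigML API documentation, and "housing.csv" is a hypothetical stand-in for your own data:

```python
from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME / BIGML_API_KEY from the environment

source = api.create_source("housing.csv")   # hypothetical file name
dataset = api.create_dataset(source)

# 80% sample for training; the complementary out-of-bag 20% for testing.
train = api.create_dataset(dataset, {"sample_rate": 0.8, "seed": "webinar"})
test = api.create_dataset(dataset, {"sample_rate": 0.8, "seed": "webinar",
                                    "out_of_bag": True})

model = api.create_model(train)
evaluation = api.create_evaluation(model, test)
api.ok(evaluation)  # wait until the evaluation has finished
```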
Why Ensembles?

[Diagram: DATASET → MODEL → PREDICTION]

Problems:
• A model can "overfit", especially if there are outliers. Consider: mansions.
• A model might "underfit", especially if there is not enough training data to discover a single good representation. Consider: no mid-size homes.
• The data may be too "complicated" for the model to represent at all. Consider: trying to fit a line to exponential pricing.
• Models might be "unlucky": algorithms typically rely on stochastic processes. This is more subtle…

Ensemble Intuition:
• Build many models, each with only some of the data, and outliers will be averaged out.
• Build many models instead of relying on one guess.
• Build many models, allowing each to focus on part of the data, thus increasing the complexity of the solution.
• Build many models to avoid unlucky initial conditions.
Decision Forest (aka "Bagging")

[Diagram: DATASET → SAMPLE 1…SAMPLE 4 → MODEL 1…MODEL 4 → PREDICTION 1…PREDICTION 4 → VOTE → PREDICTION]
Boosting

Question: instead of sampling tricks… can we incrementally improve a single model?

Intuition: the training data has the correct answer, so we know the errors exactly. Can we model the errors?

[Diagram: a DATASET with a label column goes through MODEL and BATCH PREDICTION, adding a prediction column; comparing prediction with label adds an error column.]
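As a minimal sketch of that idea for regression (illustrative only, not BigML's exact algorithm), each iteration fits a tree to the current errors, and the final prediction is the sum of the per-iteration predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_iterations=10):
    """Each new tree is fit to the errors of the running prediction."""
    models, prediction = [], np.zeros(len(y))
    for _ in range(n_iterations):
        error = y - prediction               # known exactly on training data
        tree = DecisionTreeRegressor(max_depth=3).fit(X, error)
        models.append(tree)
        prediction += tree.predict(X)        # Model 2 corrects Model 1, etc.
    return models

def predict(models, X):
    """The final prediction is the SUM of the per-iteration predictions."""
    return sum(m.predict(X) for m in models)
```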
Boosting

MODEL 1 is trained to predict the sale price; since the training data has the true labels, the error of each prediction is known exactly:

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | LAST SALE PRICE | MODEL 1 PREDICTED SALE PRICE | ERROR |
|---|---|---|---|---|---|---|---|---|---|---|
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 360000 | 360750 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | 307500 | 306875 | -625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | 600000 | 587500 | -12500 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 435350 | 435350 | 0 |

MODEL 2 is then trained on the same features, with MODEL 1's ERROR as the label:

| ADDRESS | BEDS | BATHS | SQFT | LOT SIZE | YEAR BUILT | LATITUDE | LONGITUDE | ERROR | MODEL 2 PREDICTED ERROR |
|---|---|---|---|---|---|---|---|---|---|
| 1522 NW Jonquil | 4 | 3 | 2424 | 5227 | 1991 | 44.594828 | -123.269328 | 750 | 750 |
| 7360 NW Valley Vw | 3 | 2 | 1785 | 25700 | 1979 | 44.643876 | -123.238189 | -625 | 625 |
| 4748 NW Veronica | 5 | 3.5 | 4135 | 6098 | 2004 | 44.5929659 | -123.306916 | -12500 | 12393.83333 |
| 411 NW 16th | 3 | | 2825 | 4792 | 1938 | 44.570883 | -123.272113 | 0 | 6879.67857 |

Why stop at one iteration?
• "Hey Model 1, what do you predict is the sale price of this home?"
• "Hey Model 2, how much error do you predict Model 1 just made?"
Boosting

[Diagram: Iteration 1: DATASET → MODEL 1 → PREDICTION 1; Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2; Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3; Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4; etc… The per-iteration predictions are combined by SUM into the final PREDICTION.]
Boosting Parameters

• Number of iterations: similar to the number of models for DF/RDF.
• Iterations can be limited with Early Stopping:
  • Early out of bag: tests with the out-of-bag samples.
[Diagram: the same boosting pipeline, with "OUT OF BAG" SAMPLES held back from each iteration's dataset and used to test whether boosting should stop early.]
Iterations

Boosted Ensemble #1: [timeline of iterations 1…50] the early stop triggers before the iteration limit is reached. This is OK: the early stop means the iterative improvement had become small, so the ensemble "converged" before being forcibly stopped by the number of iterations.

Boosted Ensemble #2: [timeline of iterations 1…50] the iteration limit is reached before the early stop triggers. This is NOT OK: the hard limit on iterations stopped the boosting from improving before enough iterations had run to achieve the best quality.
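A minimal sketch of that stopping rule (illustrative only; `fit_next_tree` and `holdout_error` are hypothetical helpers, and the tolerance is a made-up knob):

```python
def boost_with_early_stop(train, holdout, max_iterations=50, tolerance=1e-4):
    """Iterate up to the hard limit, but stop once improvement stalls."""
    models, best_error = [], float("inf")
    for _ in range(max_iterations):
        models.append(fit_next_tree(train, models))  # hypothetical helper
        error = holdout_error(holdout, models)       # hypothetical helper
        if best_error - error < tolerance:           # improvement too small
            return models[:-1]                       # "converged": early stop
        best_error = error
    # Hitting the hard limit without ever stalling is the "NOT OK" case
    # above: max_iterations was probably set too low.
    return models
```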
Boosting Parameters

• Number of iterations: similar to the number of models for DF/RDF.
• Iterations can be limited with Early Stopping:
  • Early out of bag: tests with the out-of-bag samples.
  • Early holdout: tests with a portion of the dataset.
  • None: performs all iterations. Note: in general, it is better to use a high number of iterations and let the early stopping work.
• Learning Rate: controls how aggressively boosting will fit the data. Larger values may fit more quickly, but risk overfitting.
• You can combine sampling with Boosting:
  • Sample with replacement.
  • Add randomization of features.
• All other tree-based options are the same:
  • weights: for unbalanced classes.
  • missing splits: to learn from missing data.
  • node threshold: increase to allow "deeper" trees.
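These parameters map directly onto the API. Below is a sketch with the BigML Python bindings; the argument names track the BigML API documentation, but treat the exact spelling as illustrative and check the docs for your binding version:

```python
from bigml.api import BigML

api = BigML()
ensemble = api.create_ensemble(dataset, {
    "boosting": {
        "iterations": 50,          # hard limit; prefer high + early stopping
        "learning_rate": 0.1,      # smaller = less aggressive fit
        "early_out_of_bag": True,  # stop early based on out-of-bag samples
    },
    "sample_rate": 0.8,            # combine sampling with boosting
    "replacement": True,           # sample with replacement
    "randomize": True,             # random feature candidates at each split
})
api.ok(ensemble)
```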
Wait a second….

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color |
|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED |

MODEL 1 predicts a favorite color for each row (BLUE, GREEN, RED, GREEN), but what is the ERROR of a categorical prediction? (?, ?, ?, ?)

… but then what about multiple classes?
Boosting Classification

Iteration 1: train one model per class to predict the probability of class membership. The per-class error is then known exactly.

MODEL 1, RED/NOT RED:

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | favorite color | Class RED Probability | Class RED ERROR |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | RED | 0.9 | 0.1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | GREEN | 0.7 | -0.7 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | BLUE | 0.46 | 0.54 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | RED | 0.12 | -0.12 |

Iteration 2: MODEL 2, RED/NOT RED ERR, is trained on the same features with the Class RED ERROR as the label:

| pregnancies | plasma glucose | blood pressure | triceps skin thickness | insulin | bmi | diabetes pedigree | age | ERROR | MODEL 2 PREDICTED ERROR |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0.1 | 0.05 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | -0.7 | -0.54 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 0.54 | 0.32 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | -0.12 | -0.22 |

Likewise, MODEL 1, BLUE/NOT BLUE, yields Class BLUE Probabilities (0.1, 0.3, 0.54, 0.88) and Class BLUE ERRORs (-0.1, 0.7, -0.54, 0.12) …and repeat for each class at each iteration.
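A minimal sketch of this per-class loop (illustrative only, not BigML's exact algorithm), again with scikit-learn trees standing in for BigML models:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_classifier(X, y, classes, n_iterations=10, learning_rate=0.1):
    """Per class, fit trees to the error between the true membership
    (1 if the row belongs to the class, else 0) and the current probability."""
    scores = {c: np.zeros(len(y)) for c in classes}
    models = {c: [] for c in classes}
    for _ in range(n_iterations):
        # Turn the current per-class scores into probabilities via softmax.
        raw = np.stack([scores[c] for c in classes])
        prob = np.exp(raw) / np.exp(raw).sum(axis=0)
        for k, c in enumerate(classes):
            error = (y == c).astype(float) - prob[k]   # known exactly
            tree = DecisionTreeRegressor(max_depth=3).fit(X, error)
            models[c].append(tree)
            scores[c] += learning_rate * tree.predict(X)
    return models
```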
Boosting Classification

[Diagram: Iteration 1: DATASET → MODELS 1, one per class → PREDICTIONS 1 per class; Iteration 2: DATASETS 2 per class → MODELS 2 per class → PREDICTIONS 2 per class; Iterations 3, 4, etc. likewise. The per-class predictions are combined into a PROBABILITY per class.]
Programmable

• API First!
• Bindings!
• WhizzML!
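As a sketch of that workflow with the BigML Python bindings (the file name and input field names are hypothetical stand-ins):

```python
from bigml.api import BigML

# source -> dataset -> boosted ensemble -> prediction
api = BigML()
source = api.create_source("housing.csv")                # hypothetical file
dataset = api.create_dataset(source)
ensemble = api.create_ensemble(dataset, {"boosting": {"iterations": 50}})
api.ok(ensemble)

prediction = api.create_prediction(
    ensemble, {"beds": 4, "baths": 2, "sqft": 1876})     # hypothetical fields
api.ok(prediction)
```

The same resources can also be scripted server-side with WhizzML.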
What to use?

• The one that works best! Ok, but seriously: did you evaluate?
• For "large" / "complex" datasets: use DF/RDF with a deeper node threshold; even better, use Boosting with more iterations.
• For "noisy" data: Boosting may overfit, so RDF is preferred.
• For "wide" data: randomized feature selection (RDF) will be quicker.
• For "easy" data: a single model may be fine. Bonus: it also has the best interpretability!
• For classification with a "large" number of classes: Boosting will be slower.
• For "general" data: DF/RDF is likely better than a single model or Boosting, and Boosting will be slower since its models are processed serially.