Introducing
Boosted Trees
BigML Winter 2017 Release
Our Winter 2017 release brings Boosted Trees, the latest resource that helps you easily solve classification and regression problems. This Machine Learning technique allows each tree model to concentrate on the wrong predictions of the previously grown tree to correct and improve on any mistakes made in those previous iterations. Boosted Trees are accessible from the BigML Dashboard as well as the API. Together with Bagging and Random Decision Forests, Boosted Trees complete the set of ensemble-based strategies on the BigML platform.

  1. Introducing Boosted Trees: BigML Winter 2017 Release
  2. Winter 2017 Release (BigML, Inc, Winter Release Webinar - March 2017)
     Speaker: POUL PETERSEN (CIO)
     Moderator: ATAKAN CETINSOY (VP Predictive Applications)
     Questions: enter them into the chat box - we'll answer some via chat, others at the end of the session
     Resources: https://bigml.com/releases
     Contact: info@bigml.com | Twitter: @bigmlcom
  3. Boosted Trees
     • Ensemble method
     • Supervised Learning
     • Classification & Regression
     • Now "BigML Easy"!
     (Screenshots: boosted ensembles with NUMBER OF MODELS set to 1, 10, and 20)
  4. Questions…
     • Novice: Wait, what is Supervised Learning again?
     • Intermediate: When are ensemble methods useful, and how are Boosted Trees different from the ensemble methods already available in BigML?
     • Power User: How does Boosting work, and what do all these knobs do?
  5. Supervised Learning
     Instances (rows) and Features (columns):
     ADDRESS             BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  LATITUDE    LONGITUDE    LAST SALE PRICE (label)
     1522 NW Jonquil     4     3      2424  5227      1991        44.594828   -123.269328  360000
     7360 NW Valley Vw   3     2      1785  25700     1979        44.643876   -123.238189  307500
     4748 NW Veronica    5     3.5    4135  6098      2004        44.5929659  -123.306916  600000
     411 NW 16th         3            2825  4792      1938        44.570883   -123.272113  435350
     2975 NW Stewart     3     1.5    1327  9148      1973        44.597319   -123.25452   245000
     4113 NW Peppertree  4     2      1876  7841      1977        44.598141   -123.29689   366000
     2450 NW 13th        3     2      1502  9148      1964        44.593322   -123.267435  265000
     3545 NE Canterbury  3     1.5    1172  10019     1972        44.603042   -123.236332  245000
     1245 NW Kainui      4     2      1864  49658     1974        44.651826   -123.250615  341500
     Supervised Learning: there is a column of data, the label, that we want to predict.
  6. Classification & Regression
     Classification (the label is a category):
     animal    state   …  proximity  action
     tiger     hungry  …  close      run
     elephant  happy   …  far        take picture
     …         …       …  …          …
     Regression (the label is a number):
     animal  state   …  proximity  min_kmh
     tiger   hungry  …  close      70
     hippo   angry   …  far        10
     …       …       …  …          …
  7. Evaluation
     Split the instances: train a MODEL on one part of the dataset, then EVALUATE it on held-out instances by comparing its predictions to the true labels.
     Training instances (features + LAST SALE PRICE label):
       1522 NW Jonquil (360000), 7360 NW Valley Vw (307500), 4748 NW Veronica (600000), 411 NW 16th (435350), 2975 NW Stewart (245000)
     Held-out instances (predict, then compare to the true labels):
       4113 NW Peppertree (366000), 2450 NW 13th (265000), 3545 NE Canterbury (245000), 1245 NW Kainui (341500)
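The train/evaluate split above can be sketched in a few lines of Python. The "model" here is a deliberately trivial stand-in that always predicts the mean training price (BigML would fit a tree); `train_mean_model` and `evaluate` are hypothetical helper names, and mean absolute error is used as the comparison metric.

```python
def train_mean_model(prices):
    # Stand-in "model": always predicts the mean of the training prices.
    mean = sum(prices) / len(prices)
    return lambda _row: mean

def evaluate(model, rows, labels):
    # Mean absolute error between predictions and the held-out labels.
    errors = [abs(model(row) - y) for row, y in zip(rows, labels)]
    return sum(errors) / len(errors)

# Training and held-out sale prices from the housing table above.
training_prices = [360000, 307500, 600000, 435350, 245000]
model = train_mean_model(training_prices)

holdout_rows = [None] * 4        # features are unused by this stand-in model
holdout_prices = [366000, 265000, 245000, 341500]
mae = evaluate(model, holdout_rows, holdout_prices)
```

A real evaluation works the same way: only the model-fitting step changes.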
  8. Demo #1
  9. Why Ensembles? (DATASET → MODEL → PREDICTION)
     Problems:
     • A model can "overfit" - especially if there are outliers. Consider: mansions.
     • A model might "underfit" - especially if there is not enough training data to discover a single good representation. Consider: no mid-size homes.
     • The data may be too "complicated" for the model to represent at all. Consider: trying to fit a line to exponential pricing.
     • Models might be "unlucky" - algorithms typically rely on stochastic processes. This is more subtle…
     Ensemble intuition:
     • Build many models, each with only some of the data, so outliers are averaged out.
     • Build many models instead of relying on one guess.
     • Build many models, allowing each to focus on part of the data, thus increasing the complexity of the solution.
     • Build many models to avoid unlucky initial conditions.
  10. Decision Forest (aka "Bagging")
      DATASET → SAMPLE 1…4 → MODEL 1…4 → PREDICTION 1…4 → PREDICTION by VOTE
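The bagging diagram above can be sketched directly: each bootstrap sample trains its own model, and the final prediction is a vote. The per-sample "model" below is a stand-in (the majority class of its sample) rather than a tree, and the samples are written out by hand; in practice each is drawn with replacement from the dataset. `majority` is a hypothetical helper name.

```python
def majority(labels):
    # Vote: the most common label; ties broken by sorted order for determinism.
    return max(sorted(set(labels)), key=labels.count)

# Four hand-written bootstrap samples of the "action" label from the
# animal table above.
samples = [
    ["run", "run", "take picture", "run"],
    ["run", "take picture", "take picture", "run"],
    ["run", "run", "run", "take picture"],
    ["take picture", "run", "run", "run"],
]
votes = [majority(sample) for sample in samples]  # one prediction per model
prediction = majority(votes)                      # PREDICTION by VOTE
```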
  11. Demo #2
  12. Random Decision Forest
      As in Bagging (DATASET → SAMPLE 1…4 → MODEL 1…4 → PREDICTION 1…4 → PREDICTION by VOTE), but each model also randomizes the features it considers ("Randomize features").
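The extra randomization can be sketched as a helper that picks the candidate features for one split. This assumes the standard random-forest trick of considering only a random feature subset at each split (matching "Randomize features" on the "What to use?" slide); the subset size of 3 and the helper name `candidate_features` are illustrative choices, not BigML's actual internals.

```python
import random

def candidate_features(features, n_candidates, rng):
    # At each split, only a random subset of features is considered.
    return rng.sample(features, n_candidates)

rng = random.Random(42)
features = ["beds", "baths", "sqft", "lot size", "year built"]
split_1 = candidate_features(features, 3, rng)  # candidates for one split
split_2 = candidate_features(features, 3, rng)  # likely a different subset
```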
  13. Demo #3
  14. Boosting
      Question: instead of sampling tricks… can we incrementally improve a single model? (MODEL → BATCH PREDICTION)
      Intuition: the training data has the correct answer, so we know the errors exactly. Can we model the errors?
      (DATASET with label → DATASET with label and prediction → DATASET with label, prediction, and error)
  15. Boosting
      Why stop at one iteration?
      • "Hey Model 1, what do you predict is the sale price of this home?"
      • "Hey Model 2, how much error do you predict Model 1 just made?"
      MODEL 1 predicts LAST SALE PRICE:
      ADDRESS            ACTUAL  PREDICTED  ERROR
      1522 NW Jonquil    360000  360750     750
      7360 NW Valley Vw  307500  306875     -625
      4748 NW Veronica   600000  587500     -12500
      411 NW 16th        435350  435350     0
      MODEL 2 is trained on the same features, with ERROR as the new label:
      ADDRESS            ERROR  PREDICTED ERROR
      1522 NW Jonquil    750    750
      7360 NW Valley Vw  625    625
      4748 NW Veronica   12500  12393.83333
      411 NW 16th        0      6879.67857
  16. Boosting
      Iteration 1: DATASET → MODEL 1 → PREDICTION 1
      Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2
      Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3
      Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4, etc…
      Final PREDICTION: the SUM of the models' predictions
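The boosting loop above can be sketched end to end: each iteration fits a new model to the current errors (residuals), and the final PREDICTION is the SUM of all the models' outputs. The weak learner here is a one-split "stump" on a single feature rather than a full BigML tree, which is enough to show the mechanics; `fit_stump` and `boost` are hypothetical helper names, and the sqft/price pairs come from the housing table on the earlier slides.

```python
def fit_stump(xs, ys):
    # Try every split point; predict the mean of each side; keep the split
    # with the lowest squared error.
    best = None
    for threshold in xs:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((y - (lmean if x <= threshold else rmean)) ** 2
                  for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, iterations):
    # Each iteration models the remaining errors of the models so far.
    models = []
    residuals = list(ys)
    for _ in range(iterations):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - stump(x) for x, r in zip(xs, residuals)]
    # Final prediction is the SUM of every model's contribution.
    return lambda x: sum(m(x) for m in models)

sqft = [1172, 1502, 1876, 2424, 2825, 4135]
price = [245000, 265000, 366000, 360000, 435350, 600000]
predictor = boost(sqft, price, iterations=10)
```

The boosted ensemble fits the training data more closely than any single stump, which is exactly why the early-stopping controls on the next slides matter.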
  18. Demo #4
  19. Boosting Parameters
      • Number of iterations - similar to number of models for DF/RDF
      • Iterations can be limited with Early Stopping:
        • Early out of bag: tests with the out-of-bag samples
  20. Boosting with "OUT OF BAG" SAMPLES
      (Same iteration diagram as slide 16, with each model trained on a sample of its dataset; the rows left out of the sample - the "out of bag" samples - are what the early out of bag stopping test uses.)
  21. Boosting Parameters
      • Number of iterations - similar to number of models for DF/RDF
      • Iterations can be limited with Early Stopping:
        • Early out of bag: tests with the out-of-bag samples
        • Early holdout: tests with a portion of the dataset
        • None: performs all iterations
        Note: in general, it is better to use a high number of iterations and let the early stopping work.
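The early-stopping rule can be sketched as a loop that keeps iterating while the held-out error improves and stops as soon as it does not, even if the iteration budget is unused. The per-iteration error values are written out by hand here to show the shape of an overfitting run; in a real run they would come from evaluating the growing ensemble on the out-of-bag or holdout rows. `early_stop` is a hypothetical helper name.

```python
def early_stop(holdout_errors, max_iterations):
    # Stop at the first iteration whose held-out error fails to improve.
    best = float("inf")
    for i, err in enumerate(holdout_errors[:max_iterations], start=1):
        if err >= best:          # no improvement: stop early
            return i - 1
        best = err
    return min(len(holdout_errors), max_iterations)

# Held-out error improves for 5 iterations, then degrades (overfitting).
errors = [0.40, 0.31, 0.25, 0.22, 0.21, 0.23, 0.26]
used = early_stop(errors, max_iterations=50)
```

With a generous `max_iterations` of 50, only 5 iterations are actually used, which is the "high number of iterations, let the early stopping work" advice in action.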
  22. Iterations
      Boosted Ensemble #1: iterations 1…50, with the early stop occurring before the iteration limit.
      This is OK: the early stop means the iterative improvement is small and we have "converged" before being forcibly stopped by the number of iterations.
      Boosted Ensemble #2: iterations 1…50, with the iteration limit reached before any early stop.
      This is NOT OK: the hard limit on iterations stopped the boosting long before enough iterations had run to achieve the best quality.
  23. Boosting Parameters
      • Number of iterations - similar to number of models for DF/RDF
      • Iterations can be limited with Early Stopping:
        • Early out of bag: tests with the out-of-bag samples
        • Early holdout: tests with a portion of the dataset
        • None: performs all iterations
        Note: in general, it is better to use a high number of iterations and let the early stopping work.
      • Learning Rate: controls how aggressively boosting will fit the data:
        • Larger values ~ maybe a quicker fit, but risk of overfitting
      • You can combine sampling with Boosting!
        • Samples with Replacement
        • Add Randomize
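The learning rate can be sketched by scaling each model's contribution before adding it to the running prediction. The per-iteration "model" here is just the best constant fit to the current residuals (a stand-in for a tree), and `boost_constant` is a hypothetical helper name; the point is only that a smaller rate fits the same data more cautiously per iteration.

```python
def boost_constant(ys, iterations, learning_rate):
    # Each "model" is the best constant fit to the current residuals;
    # its contribution is scaled by the learning rate before being added.
    prediction = 0.0
    residuals = list(ys)
    for _ in range(iterations):
        step = sum(residuals) / len(residuals)
        prediction += learning_rate * step
        residuals = [y - prediction for y in ys]
    return prediction

ys = [10.0, 10.0, 10.0]
fast = boost_constant(ys, iterations=3, learning_rate=1.0)  # fits immediately
slow = boost_constant(ys, iterations=3, learning_rate=0.1)  # approaches slowly
```

After three iterations the aggressive rate has fully fit the target while the cautious rate is still well short of it; on noisy data that caution is what guards against overfitting.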
  26. Boosting Parameters
      • Number of iterations - similar to number of models for DF/RDF
      • Iterations can be limited with Early Stopping:
        • Early out of bag: tests with the out-of-bag samples
        • Early holdout: tests with a portion of the dataset
        • None: performs all iterations
        Note: in general, it is better to use a high number of iterations and let the early stopping work.
      • Learning Rate: controls how aggressively boosting will fit the data:
        • Larger values ~ maybe a quicker fit, but risk of overfitting
      • You can combine sampling with Boosting!
        • Samples with Replacement
        • Add Randomize
      • All other tree-based options are the same:
        • weights: for unbalanced classes
        • missing splits: to learn from missing data
        • node threshold: increase to allow "deeper" trees
  27. Wait a second…. what about classification?
      (Features: pregnancies, plasma glucose, blood pressure, triceps skin thickness, insulin, bmi, diabetes pedigree, age; label: diabetes.)
      With a categorical label, what is MODEL 1's error?
      diabetes  predicted diabetes  ERROR
      TRUE      TRUE                ?
      FALSE     TRUE                ?
      TRUE      FALSE               ?
      FALSE     FALSE               ?
      … we could try encoding the classes as numbers:
      diabetes  predicted diabetes  ERROR
      1         1                   0
      0         1                   -1
      1         0                   1
      0         0                   0
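The "we could try" idea above is a two-line computation: encode the binary class as 1/0 so that prediction errors become numbers a next model could fit. The values below reproduce the slide's table.

```python
actual = [1, 0, 1, 0]      # diabetes TRUE/FALSE encoded as 1/0
predicted = [1, 1, 0, 0]   # MODEL 1's predictions, as on the slide
errors = [a - p for a, p in zip(actual, predicted)]  # numeric, modelable
```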
  28. Wait a second…. but then what about multiple classes?
      The same trick has no obvious numeric encoding when the label has more than two classes:
      favorite color  predicted favorite color  ERROR
      RED             BLUE                      ?
      GREEN           GREEN                     ?
      BLUE            RED                       ?
      RED             GREEN                     ?
  29. Boosting Classification
      Iteration 1 (one model per class, each predicting the probability of its class):
      MODEL 1 RED/NOT RED: Class RED Probability 0.9, 0.7, 0.46, 0.12 → Class RED ERROR 0.1, -0.7, 0.54, -0.12
      MODEL 1 BLUE/NOT BLUE: Class BLUE Probability 0.1, 0.3, 0.54, 0.88 → Class BLUE ERROR -0.1, 0.7, -0.54, 0.12
      …and repeat for each class.
      Iteration 2 (one model per class, trained on the per-class ERROR):
      MODEL 2 RED/NOT RED ERR: PREDICTED ERROR 0.05, -0.54, 0.32, -0.22
      …and repeat for each class at each iteration.
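The per-class trick can be sketched for one row: for each class, a model outputs a probability, and the error to be modeled next is (1 if the row actually has that class else 0) minus that probability. The iteration-1 "model" below is a stand-in that predicts each class's overall frequency rather than a tree, and `class_probability` is a hypothetical helper name.

```python
labels = ["RED", "GREEN", "BLUE", "RED"]   # favorite color column
classes = sorted(set(labels))

def class_probability(cls):
    # Stand-in "model": predict each class's overall frequency.
    return labels.count(cls) / len(labels)

# Per-class errors for one row whose actual class is RED: the RED model
# should have said 1.0, every other class's model should have said 0.0.
row_actual = "RED"
errors = {cls: (1.0 if row_actual == cls else 0.0) - class_probability(cls)
          for cls in classes}
```

Iteration 2 would then fit one model per class to these error columns, exactly as the diagram on the next slide shows.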
  30. Boosting Classification
      Iteration 1: DATASET → MODELS 1 (per class) → PREDICTIONS 1 (per class)
      Iteration 2: DATASETS 2 (per class) → MODELS 2 (per class) → PREDICTIONS 2 (per class)
      Iterations 3, 4, etc… likewise
      Final: combine the per-class PREDICTIONS into a PROBABILITY per class
  31. Demo #5
  32. Programmable: API First! Bindings! WhizzML!
  33. What to use?
      • The one that works best!
      • Ok, but seriously. Did you evaluate?
      • For "large" / "complex" datasets:
        • Use DF/RDF with a deeper node threshold
        • Even better, use Boosting with more iterations
      • For "noisy" data:
        • Boosting may overfit
        • RDF preferred
      • For "wide" data:
        • Randomize features (RDF) will be quicker
      • For "easy" data:
        • A single model may be fine
        • Bonus: it also has the best interpretability!
      • For classification with a "large" number of classes:
        • Boosting will be slower
      • For "general" data:
        • DF/RDF likely better than a single model or Boosting
        • Boosting will be slower since the models are processed serially
  34. Questions? Twitter: @bigmlcom | Mail: info@bigml.com | Docs: https://bigml.com/releases
