Model Selection and Tuning at Scale
March 2016
A presentation Peter Prettenhofer (scikit-learn core contributor) and I gave at Strata/Hadoop 2016 in San Jose, CA.


1. Model Selection and Tuning at Scale (March 2016)

2. About us
   ● Owen Zhang: Chief Product Officer @ DataRobot; former #1 ranked Data Scientist on Kaggle; former VP, Science @ AIG
   ● Peter Prettenhofer: Software Engineer @ DataRobot; scikit-learn core developer

3. Agenda
   ● Introduction
   ● Case study: Criteo 1TB
   ● Conclusion / Discussion

4. Model Selection
   ● Estimating the performance of different models in order to choose the best one.
   ● K-fold cross-validation
   ● The devil is in the details:
     ○ Partitioning
     ○ Leakage
     ○ Sample size
     ○ Stacked models require nested layers
   (diagram: data partitioned into folds 1-5, split into Train / Validation / Holdout)

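A minimal sketch of the partitioning idea above in scikit-learn, assuming a generic feature matrix X and label vector y (the toy data below is a placeholder): set a holdout aside, run K-fold cross-validation on the rest, and touch the holdout only once at the end.

```python
# Sketch: Train / Validation / Holdout partitioning with 5-fold CV.
# X and y are toy placeholders for any feature matrix and label vector.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
X, y = rng.randn(1000, 10), rng.randint(0, 2, size=1000)

# Holdout is set aside before any model selection happens.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier()
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("5-fold AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Only after the model is chosen: refit on all training folds, score the holdout once.
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_hold, y_hold))
```
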
5. Model Complexity & Overfitting

6. More data to the rescue?

7. Underfitting or Overfitting?
   http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

8. Model Tuning
   ● Optimizing the performance of a model
   ● Example: Gradient Boosted Trees
     ○ Nr of trees
     ○ Learning rate
     ○ Tree depth / nr of leaf nodes
     ○ Min leaf size
     ○ Example subsampling rate
     ○ Feature subsampling rate

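The knobs listed on this slide map roughly onto scikit-learn's GradientBoostingClassifier parameters; the values below are illustrative placeholders, not settings used in the talk.

```python
# Rough mapping of the hyperparameters above onto scikit-learn parameter names
# (the values are illustrative placeholders, not tuned settings).
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=500,     # nr of trees
    learning_rate=0.05,   # learning rate
    max_depth=6,          # tree depth (max_leaf_nodes is the leaf-count alternative)
    min_samples_leaf=50,  # min leaf size
    subsample=0.8,        # example (row) subsampling rate
    max_features=0.8,     # feature subsampling rate
)
```
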
9. Search Space

   Hyperparameter            GBRT (naive)   GBRT   RandomForest
   Nr of trees                     5          1          1
   Learning rate                   5          5          -
   Tree depth                      5          5          1
   Min leaf size                   3          3          3
   Example subsample rate          3          1          1
   Feature subsample rate          2          2          5
   Total combinations           2250        150         15

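The totals in the last row are just the product of the number of candidate values per hyperparameter; a quick sanity check of the table (the per-parameter counts are taken directly from it):

```python
# Grid size = product of the number of candidate values per hyperparameter.
from functools import reduce
from operator import mul

grids = {
    # trees, learning rate, depth, min leaf, example subsample, feature subsample
    "GBRT (naive)": [5, 5, 5, 3, 3, 2],
    "GBRT":         [1, 5, 5, 3, 1, 2],
    # trees, depth, min leaf, example subsample, feature subsample (no learning rate)
    "RandomForest": [1, 1, 3, 1, 5],
}
for name, sizes in grids.items():
    print(name, reduce(mul, sizes))   # -> 2250, 150, 15
```
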
10. Hyperparameter Optimization
   ● Grid search
   ● Random search
   ● Bayesian optimization

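Grid search and random search are both available in scikit-learn; a sketch with placeholder parameter ranges (Bayesian optimization needs an external sequential model-based optimization library and is not shown):

```python
# Sketch: grid search vs. random search over a small GBM space
# (parameter ranges are placeholders).
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.05, 0.1]},
    cv=5, scoring="roc_auc")

rand = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={"max_depth": randint(3, 10),
                         "learning_rate": uniform(0.01, 0.2)},
    n_iter=20, cv=5, scoring="roc_auc", random_state=0)

# grid.fit(X_train, y_train); rand.fit(X_train, y_train)
```
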
11. Challenges at Scale
   ● Why is learning with more data harder?
     ○ Paradox: more data would let us use more complex models, but computational constraints prevent it*
     ○ => we need more efficient ways of creating complex models!
   ● Need to account for the combined cost: model fitting + model selection / tuning
     ○ Smart hyperparameter tuning tries to decrease the number of model fits
     ○ ... we can accomplish this with fewer hyperparameters too**
   * Pedro Domingos, A few useful things to know about machine learning, 2012.
   ** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)

12. A case study: binary classification on 1TB of data
   ● Criteo click-through data
   ● Downsampled ad impression data over 24 days
   ● Fully anonymized dataset:
     ○ 1 target
     ○ 13 integer features
     ○ 26 hashed categorical features
   ● Experiment setup:
     ○ Day 0 - day 22 data for training, day 23 data for testing

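For reference, the public Criteo files are tab-separated with the label first, followed by the 13 integer and 26 hashed categorical columns; a hedged sketch of streaming one day's file with pandas (the path, column names, and chunk size are assumptions):

```python
# Sketch: stream one day of Criteo-style data in chunks with pandas.
# Path, column names, and chunk size are assumptions; the layout follows the
# slide: 1 target, 13 integer features, 26 hashed categorical features.
import pandas as pd

cols = (["label"]
        + ["i%d" % k for k in range(1, 14)]
        + ["c%d" % k for k in range(1, 27)])

reader = pd.read_csv("day_0.gz", sep="\t", names=cols, header=None,
                     compression="gzip", chunksize=1000000)
for chunk in reader:
    pass  # e.g. downsample non-events here (see the next slide)
```
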
13. Big Data?
   Data size:
   ● ~46GB/day
   ● ~180,000,000 rows/day
   However, it is very imbalanced (even after downsampling non-events):
   ● ~3.5% event rate
   Further downsampling of non-events to a balanced dataset reduces the data to ~70GB:
   ● Will fit on a single node under "optimal" conditions
   ● Loss of model accuracy is negligible in most situations
   Assuming a 0.1% raw event (click-through) rate:
   Raw data: 35TB @ 0.1% events -> 1TB @ 3.5% -> 70GB @ 50%

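Downsampling non-events to a roughly balanced set and correcting the predicted probabilities back afterwards is a standard trick; a sketch, assuming a pandas DataFrame df with a 0/1 label column and a kept fraction rate of non-events (the helper names are mine, not the talk's):

```python
# Sketch: keep all events, keep only a fraction `rate` of non-events, and later
# undo the effect of sampling on predicted probabilities.
# (df / "label" / the helper names are assumptions for illustration.)
import pandas as pd

def downsample_nonevents(df, rate, seed=0):
    events = df[df["label"] == 1]
    nonevents = df[df["label"] == 0].sample(frac=rate, random_state=seed)
    return pd.concat([events, nonevents])

def correct_probability(p_sampled, rate):
    # Dropping non-events inflates the event odds by 1/rate, so deflate them back.
    odds = p_sampled / (1.0 - p_sampled) * rate
    return odds / (1.0 + odds)
```
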
14. Where to start?
   ● 70GB (~260,000,000 data points) is still a lot of data
   ● Let's take a tiny slice of that to experiment
     ○ Take 0.25%, then 0.5%, then 1%, and do grid search on them
   (chart: time in seconds for RF, ASVM, regularized regression, GBM with counts, GBM without counts; arrow marking "better")

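One way to read this slide as code: run the same small grid search on progressively larger random slices and record how long each takes and which settings win; a self-contained toy sketch (the slice fractions come from the slide, everything else is a placeholder):

```python
# Sketch: time the same small grid search on growing random slices of the data.
# The fractions (0.25%, 0.5%, 1%) come from the slide; the data and grid are toys.
import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X, y = rng.randn(200000, 20), rng.randint(0, 2, size=200000)  # stand-in for the 70GB set

for frac in (0.0025, 0.005, 0.01):
    n = int(frac * len(y))
    idx = rng.choice(len(y), size=n, replace=False)
    grid = GridSearchCV(GradientBoostingClassifier(),
                        {"max_depth": [3, 5, 7]}, cv=3, scoring="roc_auc")
    t0 = time.time()
    grid.fit(X[idx], y[idx])
    print("%.2f%%  %.1fs  best=%s" % (frac * 100, time.time() - t0, grid.best_params_))
```
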
15. GBM is the way to go; let's go up to 10% of the data
   (chart: # of trees; legend: sample size / depth of tree / time to finish)

16. A "Fairer" Way of Comparing Models
   (chart annotation: a better model when time is the constraint)

17. Can We Extrapolate?
   (chart annotation: where we can do better than generic Bayesian optimization)

18. Tree Depth vs Data Size
   ● A natural heuristic: increment tree depth by 1 every time the data size doubles
   (chart: sample sizes 1%, 2%, 4%, 10%)
   Optimal depth = a + b * log(DataSize)

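The heuristic on this slide (add 1 to the depth each time the data doubles) is exactly a log-linear relationship, so the constants a and b can be fit from the small-sample runs and extrapolated; a sketch with made-up depths (not the talk's actual results):

```python
# Sketch: fit optimal_depth = a + b * log(data_size) on small samples and
# extrapolate to the full data set. The depths below are made up for illustration.
import numpy as np

full_size = 2.6e8                                    # ~260,000,000 rows
sizes  = np.array([0.01, 0.02, 0.04, 0.10]) * full_size
depths = np.array([6, 7, 8, 9])                      # best depth found at each size

b, a = np.polyfit(np.log(sizes), depths, deg=1)      # slope first, then intercept
print("extrapolated depth at 100%%: %.1f" % (a + b * np.log(full_size)))
```
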
19. What about VW?
   ● Highly efficient online learning algorithm
   ● Supports adaptive learning rates
   ● Inherently linear; the user needs to specify non-linear features or interactions explicitly
   ● 2-way and 3-way interactions can be generated on the fly
   ● Supports "every k" validation
   ● The only tuning REQUIRED is the specification of interactions
     ○ Thanks to progressive validation, bad interactions can be detected immediately, so no time is wasted

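VW consumes a line-based text format with named feature namespaces; interactions like the ones mentioned above are then requested at train time (e.g. VW's -q option for 2-way and --cubic for 3-way namespace interactions) rather than being materialized in the data. A sketch of emitting that format for Criteo-style rows (the namespace letters and this helper are illustrative assumptions, not the talk's actual pipeline):

```python
# Sketch: write one Criteo-style row in VW input format, with integer features in
# namespace 'i' and hashed categoricals in namespace 'c'.
# (Namespace letters and this helper are illustrative assumptions.)
def to_vw_line(label, ints, cats):
    vw_label = "+1" if label == 1 else "-1"   # VW logistic loss expects {-1, +1}
    int_part = " ".join("i%d:%s" % (k, v) for k, v in enumerate(ints, 1) if v != "")
    cat_part = " ".join("c%d_%s" % (k, v) for k, v in enumerate(cats, 1) if v != "")
    return "%s |i %s |c %s" % (vw_label, int_part, cat_part)

print(to_vw_line(1, ["5", "", "12"], ["a9f3", "0b2c"]))
# -> +1 |i i1:5 i3:12 |c c1_a9f3 c2_0b2c
# Quadratic interactions between the two namespaces would then be requested
# with `-q ic` on the vw command line.
```
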
20. Data pipeline for VW
   (diagram: training data is randomly split into chunks T1...Tm, each chunk is randomly shuffled into T1s...Tms, then the shuffled chunks are concatenated and interleaved; the test set is kept separate)
   It takes longer to prep the data than to run the model!

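A minimal in-memory sketch of the shuffle-and-interleave step in the diagram above: because VW learns online, long runs from a single chunk are undesirable, so each chunk is shuffled and the chunks are then read round-robin (a real pipeline would do this on files, not lists):

```python
# Sketch: shuffle each training chunk, then interleave the chunks round-robin so
# the online learner never sees a long run from a single chunk.
# (In-memory lists stand in for what would really be large on-disk files.)
import random
from itertools import chain, zip_longest

def shuffle_and_interleave(chunks, seed=0):
    rng = random.Random(seed)
    shuffled = []
    for chunk in chunks:
        chunk = list(chunk)
        rng.shuffle(chunk)
        shuffled.append(chunk)
    # zip_longest reads one row from each chunk in turn; drop the padding.
    return [row for row in chain.from_iterable(zip_longest(*shuffled)) if row is not None]

print(shuffle_and_interleave([["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]))
```
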
21. VW Results
   (chart: results without vs. with count + count*numeric interactions, on 1%, 10%, and 100% of the data)

22. Putting it All Together
   (chart: 1 hour and 1 day time budgets marked)

23. Do We Really "Tune/Select Model @ Scale"?
   ● What we claim we do:
     ○ Model tuning and selection on big data
   ● What we actually do:
     ○ Model tuning and selection on small data
     ○ Re-run the model and expect/hope that performance and hyperparameters extrapolate as expected
   ● If you start the model tuning/selection process with GBs (or even 100s of MBs) of data, you are doing it wrong!

24. Some Interesting Observations
   ● At least for some datasets, it is very hard for a "pure linear" model to outperform (accuracy-wise) a non-linear model, even with much more data
   ● There is meaningful structure in the hyperparameter space
   ● When time is limited (relative to data size), running "deeper" models on a smaller data sample may actually yield better results
   ● To fully exploit the data, model estimation time is usually at least proportional to n*log(n), and we need models whose number of parameters can scale with the number of data points
     ○ GBM can have as many parameters as we want
     ○ So can factorization machines
   ● For any data and any model, we run into diminishing returns as the data gets bigger and bigger

25. DataRobot Essentials
   ● April 7-8: London
   ● April 28-29: San Francisco
   ● May 17-18: Atlanta
   ● June 23-24: Boston
   datarobot.com/training
   © DataRobot, Inc. All rights reserved.
   Thanks / Questions?