Winning data science competitions

6,894 views

My BDF 2014 kick-off talk.

Published in: Data & Analytics
1. Winning Data Science Competitions
Some (hopefully) useful pointers
Owen Zhang
Data Scientist
2. A plug for myself
Current
● Chief Product Officer
Previous
● VP, Science
3. Agenda
● Structure of a Data Science Competition
● Philosophical considerations
● Sources of competitive advantage
● Some technical tips
● Applying what we learn outside of competitions
Philosophy / Strategy / Technique
4. Structure of a Data Science Competition
Build a model using Training Data to predict outcomes on Private LB Data.
● Training
● Public LB (validation) -- quick but often misleading feedback
● Private LB (holdout)
Data Science Competitions remind us that the purpose of a predictive model is to predict on data that we have NOT seen.
5. A little “philosophy”
● There are many ways to overfit
● Beware of the “multiple comparison fallacy”
○ There is a cost in “peeking at the answer”
○ Usually the first idea (if it works) is the best
“Think” more, “try” less
6. Sources of Competitive Advantage
● Discipline (once bitten, twice shy)
○ Proper validation framework
● Effort
● (Some) domain knowledge
● Feature engineering
● The “right” model structure
● Machine/statistical learning packages
● Coding/data manipulation efficiency
● Luck
7. Technical Tricks -- GBM
● My confession: I (over)use GBM
○ When in doubt, use GBM
● GBM automatically approximates
○ Non-linear transformations
○ Subtle and deep interactions
● GBM gracefully handles missing values
● GBM is invariant to monotonic transformations of features
8. Technical Tricks -- GBM needs TLC too
● Tuning parameters
○ Learning rate + number of trees
■ Usually a small learning rate + many trees works well. I target 1000 trees and tune the learning rate
○ Number of obs in leaf
■ How many obs do you need to get a good mean estimate?
○ Interaction depth
■ Don’t be afraid to use 10+; this is (roughly) the number of leaf nodes
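As a sketch of the recipe above -- fix the tree count around 1000 and tune the learning rate, with the leaf-size and leaf-count knobs set explicitly -- here is how it might look with scikit-learn's GradientBoostingRegressor. The synthetic data and the exact parameter values are illustrative assumptions, not the settings used in any particular competition.

```python
# Sketch of the GBM tuning recipe: small learning rate + many trees,
# tuning the learning rate at a fixed tree count. Data is synthetic.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best = None
for lr in (0.01, 0.03, 0.1):
    gbm = GradientBoostingRegressor(
        n_estimators=1000,       # target ~1000 trees, per the slide
        learning_rate=lr,        # the parameter being tuned
        min_samples_leaf=20,     # "number of obs in leaf"
        max_leaf_nodes=15,       # roughly the "interaction depth" knob
        random_state=0,
    )
    gbm.fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, gbm.predict(X_val))
    if best is None or mse < best[1]:
        best = (lr, mse)

print("best learning rate:", best[0])
```

In practice the validation split here would be replaced by the proper cross-validation framework the deck insists on.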
9. Technical Tricks -- when GBM needs help
● High-cardinality features
○ Convert into numerical values with preprocessing -- out-of-fold average, counts, etc.
○ Use ridge regression (or similar) and
■ use the out-of-fold prediction as input to GBM
■ or blend
○ Be brave, use N-way interactions
■ I used a 7-way interaction in the Amazon competition.
● GBM with out-of-fold treatment of high-cardinality features performs very well
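The "out-of-fold average" trick above can be sketched as follows: each row's category is replaced by the target mean computed on the other folds, so no row's encoding ever sees its own target. The column names and data are invented for illustration.

```python
# Sketch of out-of-fold mean encoding for a high-cardinality feature.
# Each row gets the target mean from the *other* folds only.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice([f"city_{i}" for i in range(50)], size=500),
    "y": rng.normal(size=500),
})

global_mean = df["y"].mean()
df["city_oof"] = np.nan
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[tr_idx].groupby("city")["y"].mean()
    # Categories unseen in the training folds fall back to the global mean.
    df.loc[df.index[val_idx], "city_oof"] = (
        df.iloc[val_idx]["city"].map(fold_means).fillna(global_mean).values
    )

# "city_oof" is now a numeric column that can be fed to GBM.
```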
10. Technical Tricks -- feature engineering in GBM
● GBM only APPROXIMATES interactions and non-linear transformations
● Strong interactions benefit from being explicitly defined
○ Especially ratios/sums/differences among features
● GBM cannot capture complex features such as “average sales in the previous period for this type of product”
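Explicitly defining a ratio or difference is as simple as adding a derived column before training; trees can then split on the interaction directly instead of approximating it. The column names here are made up for illustration.

```python
# Sketch: hand-built ratio/difference features for GBM input.
# Hypothetical columns "price" and "cost" stand in for any feature pair.
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 80.0],
                   "cost":  [60.0, 200.0, 90.0]})
df["margin"] = df["price"] - df["cost"]   # explicit difference
df["markup"] = df["price"] / df["cost"]   # explicit ratio
print(df)
```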
11. Technical Tricks -- Glmnet
● From a methodology perspective, the opposite of GBM
● Captures (log/logistic) linear relationships
● Works with a very small number of rows (a few hundred or even fewer)
● Complements GBM very well in a blend
● Needs a lot more work
○ Missing values, outliers, transformations (log?), interactions
● The sparsity assumption -- L1 vs. L2
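The slide names glmnet, the R package; scikit-learn's ElasticNet is the closest Python analogue, with l1_ratio interpolating between the L2 (ridge) and L1 (lasso, sparse) penalties the last bullet contrasts. The sketch below also shows the extra preprocessing the slide says linear models need; the data and parameter values are illustrative assumptions.

```python
# Sketch of a glmnet-style elastic-net fit, including the extra care
# linear models need (here: imputing missing values, standardizing).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)
X[rng.random(X.shape) < 0.05] = np.nan   # inject missing values

# l1_ratio=0.0 is pure ridge (L2), 1.0 is pure lasso (L1, sparse).
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    ElasticNet(alpha=0.1, l1_ratio=0.5),
)
model.fit(X, y)
coefs = model.named_steps["elasticnet"].coef_
print("nonzero coefficients:", int((coefs != 0).sum()))
```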
12. Technical Tricks -- Text mining
● tau package in R
● Python’s sklearn
● An L2 penalty is a must
● N-grams work well
● Don’t forget the “trivial features”: length of text, number of words, etc.
● Many “text-mining” competitions on Kaggle are actually dominated by structured fields -- KDD2014
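Putting those bullets together in sklearn might look like this: n-gram features stacked with the "trivial" length/word-count features, fed to an L2-penalized linear model. The toy corpus and labels are invented for illustration.

```python
# Sketch: n-grams + trivial text features + an L2-penalized linear model.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product, works well", "terrible, broke after a day",
         "works great", "awful product", "well made, great value",
         "broke immediately, terrible"]
y = np.array([1, 0, 1, 0, 1, 0])

ngrams = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
# "Trivial features": text length and word count, stacked alongside.
trivial = csr_matrix([[len(t), len(t.split())] for t in texts], dtype=float)
X = hstack([ngrams, trivial])

clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```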
13. Technical Tricks -- Blending
● “All models are wrong, but some are useful” (George Box)
○ The hope is that they are wrong in different ways
● When in doubt, use an average blender
● Beware of the temptation to overfit the public leaderboard
○ Use public LB + training CV
● The strongest individual model does not necessarily make the best blend
○ Sometimes intentionally built weak models are good blending candidates -- Liberty Mutual competition
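A minimal sketch of the "average blender": average the out-of-fold predictions of two models from different families (here a GBM and a logistic regression standing in for glmnet), and compare each against the blend on training CV rather than the public leaderboard. The data and model choices are illustrative assumptions.

```python
# Sketch of an average blender over out-of-fold predictions
# from two deliberately different model families.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

gbm_oof = cross_val_predict(GradientBoostingClassifier(random_state=0),
                            X, y, cv=5, method="predict_proba")[:, 1]
glm_oof = cross_val_predict(LogisticRegression(max_iter=1000),
                            X, y, cv=5, method="predict_proba")[:, 1]

blend = (gbm_oof + glm_oof) / 2.0   # "when in doubt, use average"
for name, p in [("gbm", gbm_oof), ("linear", glm_oof), ("blend", blend)]:
    print(name, round(roc_auc_score(y, p), 4))
```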
14. Technical Tricks -- blending continued
● Try to build “diverse” models
○ Different tools -- GBM, Glmnet, RF, SVM, etc.
○ Different model specifications -- linear, lognormal, Poisson, 2-stage, etc.
○ Different subsets of features
○ Subsampled observations
○ Weighted/unweighted
○ …
● But do not try “blindly” -- think more, try less
15. Apply what we learn outside of competitions
● Competitions give us really good models, but we also need to
○ Select the right problem and structure it correctly
○ Find good (or at least useful) data
○ Make sure models are used the right way
Competitions help us
● Understand how much “signal” exists in the data
● Identify flaws in the data or the data-creation process
● Build generalizable models
● Broaden our technical horizons
● …
16. Acknowledgement
● My fellow competitors and data scientists at large
○ You have taught me (almost) everything
● Xavier Conort -- my colleague at DataRobot
○ Thanks for the collaboration and inspiration for some of the material
● Kaggle
○ Thanks for all the fun we have competing!
