SlideShare a Scribd company logo
Winning Data Science
Competitions
Some (hopefully) useful pointers
Owen Zhang
Data Scientist
A plug for myself
Current
● Chief Product Officer
Previous
● VP, Science
Agenda
● Structure of a Data Science Competition
● Philosophical considerations
● Sources of competitive advantage
● Some technical tips
● Two cases -- Amazon Allstate
● Apply what we learn out of competitions
Technique
Strategy
Philosophy
Data Science Competitions remind us that the purpose of a
predictive model is to predict on data that we have NOT seen.
Training Public LB
(validation)
Private LB
(holdout)
Build model using Training Data to
predict outcomes on Private LB Data
Structure of a Data Science Competition
Quick but often misleading feedback
A little “philosophy”
● There are many ways to overfit
● Beware of “multiple comparison fallacy”
○ There is a cost in “peeking at the answer”,
○ Usually the first idea (if it works) is the best
“Think” more, “try” less
Sources of Competitive Advantage (the Secret)
● Discipline (once bitten twice shy)
○ Proper validation framework
● Effort
● (Some) Domain knowledge
● Feature engineering
● The “right” model structure
● Machine/statistical learning packages
● Coding/data manipulation efficiency
● Luck
Be Disciplined
+
Work Hard
+
Learn from
everyone
+
Luck
Technical Tricks -- GBM
● My confession: I (over)use GBM
○ When in doubt, use GBM
● GBM automatically approximate
○ Non-linear transformations
○ Subtle and deep interactions
● GBM gracefully treats missing values
● GBM is invariant to monotonic transformation of
features
Technical Tricks -- GBM Tuning
● Tuning parameters
○ Learning rate + number of trees
■ Usually small learning rate + many trees work
well. I target 1000 trees and tune learning rate
○ Number of obs in leaf
■ How many obs you need to get a good mean
estimate?
○ Interaction depth
■ Don’t be afraid to use 10+, this is (roughly) the
number of leaf nodes
Technical Tricks -- when GBM needs help
● High cardinality features
○ These are very commonly encounterd -- zip code, injury type,
ICD9, text, etc.
○ Convert into numerical with preprocessing -- out-of-fold average,
counts, etc.
○ Use Ridge regression (or similar) and
■ use out-of-fold prediction as input to GBM
■ or blend
○ Be brave, use N-way interactions
■ I used 7-way interaction in the Amazon competition.
● GBM with out-of-fold treatment of high-cardinality feature performs
very well
Technical Tricks -- Stacking
● Basic idea -- use one model’s output as the next model’s input
● It is NOT a good idea to use in sample prediction for stacking
○ The problem is over-fitting
○ The more “over-fit” prediction1 is , the more weight it will get in
Model 2
Text Features
Model 2
GBM
Prediction 1
Model 1
Ridge
Regression
Final
Prediction
Num Features
Technical Tricks -- Stacking -- OOS / CV
● Use out of sample predictions
○ Take half of the training data to build model 1
○ Apply model 1 on the rest of the training data,
use the output as input to model 2
● Use cross-validation partitioning when data limited
○ Partition training data into K partitions
○ For each of the K partition, compute “prediction
1” by building a model with OTHER partitions
Technical Tricks -- feature engineering in GBM
● GBM only APPROXIMATE interactions and non-
linear transformations
● Strong interactions benefit from being explicitly
defined
○ Especially ratios/sums/differences among
features
● GBM cannot capture complex features such as
“average sales in the previous period for this type of
product”
Technical Tricks -- Glmnet
● From a methodology perspective, the opposite of
GBM
● Captures (log/logistic) linear relationship
● Work with very small # of rows (a few hundred or
even less)
● Complements GBM very well in a blend
● Need a lot of more work
○ missing values, outliers, transformations (log?),
interactions
● The sparsity assumption -- L1 vs L2
Technical Tricks -- Text mining
● tau package in R
● Python’s sklearn
● L2 penalty a must
● N-grams work well.
● Don’t forget the “trivial features”: length of text,
number of words, etc.
● Many “text-mining” competitions on kaggle are
actually dominated by structured fields -- KDD2014
Technical Tricks -- Blending
● All models are wrong, but some are useful (George
Box)
○ The hope is that they are wrong in different ways
● When in doubt, use average blender
● Beware of temptation to overfit public leaderboard
○ Use public LB + training CV
● The strongest individual model does not necessarily
make the best blend
○ Sometimes intentionally built weak models are good blending
candidates -- Liberty Mutual Competition
Technical Tricks -- blending continued
● Try to build “diverse” models
○ Different tools -- GBM, Glmnet, RF, SVM, etc.
○ Different model specifications -- Linear,
lognormal, poisson, 2 stage, etc.
○ Different subsets of features
○ Subsampled observations
○ Weighted/unweighted
○ …
● But, do not “peek at answers” (at least not too much)
Apply what we learn outside of competitions
● Competitions give us really good models, but we also need to
○ Select the right problem and structure it correctly
○ Find good (at least useful) data
○ Make sure models are used the right way
Competitions help us
● Understand how much “signal” exists in the data
● Identify flaws in data or data creation process
● Build generalizable models
● Broaden our technical horizon
● …
Case 1 -- Amazon User Access competition
● One of the most popular competitions on Kaggle to date
○ 1687 teams
● Use anonymized features to predict if employee access
request would be granted or denied
● All categorical features
○ Resource ID / Mgr ID / User ID / Dept ID …
○ Many features have high cardinality
● But I want to use GBM
Case 1 -- Amazon User Access competition
● Encode categorical features using observation counts
○ This is even available for holdout data!
● Encode categorical features using average response
○ Average all but one (example on next slide)
○ Add noise to the training features
● Build different kind of trees + ENET
○ GBM + ERT + ENET + RF + GBM2 + ERT2
● I didn't know VW (or similar), otherwise might have got better
results.
● https://github.com/owenzhang/Kaggle-AmazonChallenge2013
Case 1 -- Amazon User Access competition
“Leave-one-out” encoding of categorical features:
Split User ID Y mean(Y) random Exp_UID
Training A1 0 .667 1.05 0.70035
Training A1 1 .333 .97 0.32301
Training A1 1 .333 .98 0.32634
Training A1 0 .667 1.02 0.68034
Test A1 - .5 1 .5
Test A1 - .5 1 .5
Training A2 0
Case 2 -- Allstate User Purchase Option Prediction
● Predict final purchased product options based on earlier
transactions.
○ 7 correlated targets
● This turns out to be very difficult because:
○ The evaluation criteria is all-or-nothing: all 7 predictions
need to be correct
○ The baseline “last quoted” is very hard to beat.
■ Last quoted 53.269%
■ #3 (me) : 53.713% (+0.444%)
■ #1 solution 53.743% (+0.474%)
● Key challenges -- capture correlation, and not to lose to
baseline
Case 2 -- Allstate User Purchase Option Prediction
● Dependency -- Chained models
○ First build stand-alone model for F
○ Then model for G, given F
○ F => G => B => A => C => E => D
○ “Free models” first, “dependent” model later
○ In training time, use actual data
○ In prediction time, use most likely predicted value
● Not to lose to baseline -- 2 stage models
○ One model to predict which one to use: chained prediction,
or baseline
Useful Resources
● http://www.kaggle.com/competitions
● http://www.kaggle.com/forums
● http://statweb.stanford.edu/~tibs/ElemStatLearn/
● http://scikit-learn.org/
● http://cran.r-project.org/
● https://github.com/JohnLangford/vowpal_wabbit/wiki
● ….

More Related Content

What's hot

Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data science
Akira Shibata
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
odsc
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
HJ van Veen
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Vivian S. Zhang
 
Xgboost
XgboostXgboost
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
Winning data science competitions
Winning data science competitionsWinning data science competitions
Winning data science competitions
Owen Zhang
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
odsc
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
Sri Ambati
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
Jeong-Yoon Lee
 
Xgboost
XgboostXgboost
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
halifaxchester
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Gabriel Moreira
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
Joonyoung Yi
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
Shuai Zhang
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
Daniel Hen
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Eugene Yan Ziyou
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax audit
Michael BENESTY
 

What's hot (20)

Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data science
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Xgboost
XgboostXgboost
Xgboost
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Winning data science competitions
Winning data science competitionsWinning data science competitions
Winning data science competitions
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
Xgboost
XgboostXgboost
Xgboost
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax audit
 

Viewers also liked

Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
SKPETRO – Fundamental Analysis FY16
SKPETRO – Fundamental Analysis FY16SKPETRO – Fundamental Analysis FY16
SKPETRO – Fundamental Analysis FY16
lcchong76
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Vivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
Vivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Vivian S. Zhang
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Vivian S. Zhang
 
LC's Forex Trading System
LC's Forex Trading SystemLC's Forex Trading System
LC's Forex Trading System
lcchong76
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
Vivian S. Zhang
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
Vivian S. Zhang
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
Vivian S. Zhang
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
Vivian S. Zhang
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 
MAYBANK – Fundamental Analysis FY15
MAYBANK – Fundamental Analysis FY15MAYBANK – Fundamental Analysis FY15
MAYBANK – Fundamental Analysis FY15
lcchong76
 

Viewers also liked (14)

Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
SKPETRO – Fundamental Analysis FY16
SKPETRO – Fundamental Analysis FY16SKPETRO – Fundamental Analysis FY16
SKPETRO – Fundamental Analysis FY16
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 
LC's Forex Trading System
LC's Forex Trading SystemLC's Forex Trading System
LC's Forex Trading System
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
MAYBANK – Fundamental Analysis FY15
MAYBANK – Fundamental Analysis FY15MAYBANK – Fundamental Analysis FY15
MAYBANK – Fundamental Analysis FY15
 

Similar to Winning data science competitions, presented by Owen Zhang

Winning Data Science Competitions (Owen Zhang) - 2014 Boston Data Festival
Winning Data Science Competitions (Owen Zhang)  - 2014 Boston Data FestivalWinning Data Science Competitions (Owen Zhang)  - 2014 Boston Data Festival
Winning Data Science Competitions (Owen Zhang) - 2014 Boston Data Festival
freshdatabos
 
Beat the Benchmark.
Beat the Benchmark.Beat the Benchmark.
Beat the Benchmark.
Pruthuvi Maheshakya Wijewardena
 
Beat the Benchmark.
Beat the Benchmark.Beat the Benchmark.
Beat the Benchmark.
Pruthuvi Maheshakya Wijewardena
 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 Sessions
BigML, Inc
 
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestDA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
Berker Kozan
 
Bimbo Final Project Presentation
Bimbo Final Project PresentationBimbo Final Project Presentation
Bimbo Final Project PresentationCan Köklü
 
VSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsVSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 Sessions
BigML, Inc
 
VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
BigML, Inc
 
CD in Machine Learning Systems
CD in Machine Learning SystemsCD in Machine Learning Systems
CD in Machine Learning Systems
Thoughtworks
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
Rebecca Bilbro
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
Dataconomy Media
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
Awantik Das
 
VSSML18. OptiML and Fusions
VSSML18. OptiML and FusionsVSSML18. OptiML and Fusions
VSSML18. OptiML and Fusions
BigML, Inc
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Xavier Amatriain
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
Lex Toumbourou
 
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
Dataconomy Media
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
zekeLabs Technologies
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDC
gdgsurrey
 
Smartphone Activity Prediction
Smartphone Activity PredictionSmartphone Activity Prediction
Smartphone Activity PredictionTriskelion_Kaggle
 

Similar to Winning data science competitions, presented by Owen Zhang (20)

Winning Data Science Competitions (Owen Zhang) - 2014 Boston Data Festival
Winning Data Science Competitions (Owen Zhang)  - 2014 Boston Data FestivalWinning Data Science Competitions (Owen Zhang)  - 2014 Boston Data Festival
Winning Data Science Competitions (Owen Zhang) - 2014 Boston Data Festival
 
Beat the Benchmark.
Beat the Benchmark.Beat the Benchmark.
Beat the Benchmark.
 
Beat the Benchmark.
Beat the Benchmark.Beat the Benchmark.
Beat the Benchmark.
 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 Sessions
 
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle ContestDA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest
 
Bimbo Final Project Presentation
Bimbo Final Project PresentationBimbo Final Project Presentation
Bimbo Final Project Presentation
 
VSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsVSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 Sessions
 
VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
 
CD in Machine Learning Systems
CD in Machine Learning SystemsCD in Machine Learning Systems
CD in Machine Learning Systems
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
VSSML18. OptiML and Fusions
VSSML18. OptiML and FusionsVSSML18. OptiML and Fusions
VSSML18. OptiML and Fusions
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDC
 
Smartphone Activity Prediction
Smartphone Activity PredictionSmartphone Activity Prediction
Smartphone Activity Prediction
 

More from Vivian S. Zhang

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
Vivian S. Zhang
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
Vivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
Vivian S. Zhang
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Vivian S. Zhang
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
Vivian S. Zhang
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Vivian S. Zhang
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Vivian S. Zhang
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Vivian S. Zhang
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Vivian S. Zhang
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Vivian S. Zhang
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Vivian S. Zhang
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Vivian S. Zhang
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
Vivian S. Zhang
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
Vivian S. Zhang
 

More from Vivian S. Zhang (16)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
 

Recently uploaded

Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
Vivekanand Anglo Vedic Academy
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
Col Mukteshwar Prasad
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
PedroFerreira53928
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 

Recently uploaded (20)

Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 

Winning data science competitions, presented by Owen Zhang

  • 1. Winning Data Science Competitions Some (hopefully) useful pointers Owen Zhang Data Scientist
  • 2. A plug for myself Current ● Chief Product Officer Previous ● VP, Science
  • 3. Agenda ● Structure of a Data Science Competition ● Philosophical considerations ● Sources of competitive advantage ● Some technical tips ● Two cases -- Amazon Allstate ● Apply what we learn out of competitions Technique Strategy Philosophy
  • 4. Data Science Competitions remind us that the purpose of a predictive model is to predict on data that we have NOT seen. Training Public LB (validation) Private LB (holdout) Build model using Training Data to predict outcomes on Private LB Data Structure of a Data Science Competition Quick but often misleading feedback
  • 5. A little “philosophy” ● There are many ways to overfit ● Beware of “multiple comparison fallacy” ○ There is a cost in “peeking at the answer”, ○ Usually the first idea (if it works) is the best “Think” more, “try” less
  • 6. Sources of Competitive Advantage (the Secret) ● Discipline (once bitten twice shy) ○ Proper validation framework ● Effort ● (Some) Domain knowledge ● Feature engineering ● The “right” model structure ● Machine/statistical learning packages ● Coding/data manipulation efficiency ● Luck Be Disciplined + Work Hard + Learn from everyone + Luck
  • 7. Technical Tricks -- GBM ● My confession: I (over)use GBM ○ When in doubt, use GBM ● GBM automatically approximate ○ Non-linear transformations ○ Subtle and deep interactions ● GBM gracefully treats missing values ● GBM is invariant to monotonic transformation of features
  • 8. Technical Tricks -- GBM Tuning ● Tuning parameters ○ Learning rate + number of trees ■ Usually small learning rate + many trees work well. I target 1000 trees and tune learning rate ○ Number of obs in leaf ■ How many obs you need to get a good mean estimate? ○ Interaction depth ■ Don’t be afraid to use 10+, this is (roughly) the number of leaf nodes
  • 9. Technical Tricks -- when GBM needs help ● High cardinality features ○ These are very commonly encounterd -- zip code, injury type, ICD9, text, etc. ○ Convert into numerical with preprocessing -- out-of-fold average, counts, etc. ○ Use Ridge regression (or similar) and ■ use out-of-fold prediction as input to GBM ■ or blend ○ Be brave, use N-way interactions ■ I used 7-way interaction in the Amazon competition. ● GBM with out-of-fold treatment of high-cardinality feature performs very well
  • 10. Technical Tricks -- Stacking ● Basic idea -- use one model’s output as the next model’s input ● It is NOT a good idea to use in sample prediction for stacking ○ The problem is over-fitting ○ The more “over-fit” prediction1 is , the more weight it will get in Model 2 Text Features Model 2 GBM Prediction 1 Model 1 Ridge Regression Final Prediction Num Features
  • 11. Technical Tricks -- Stacking -- OOS / CV ● Use out of sample predictions ○ Take half of the training data to build model 1 ○ Apply model 1 on the rest of the training data, use the output as input to model 2 ● Use cross-validation partitioning when data limited ○ Partition training data into K partitions ○ For each of the K partition, compute “prediction 1” by building a model with OTHER partitions
  • 12. Technical Tricks -- feature engineering in GBM ● GBM only APPROXIMATE interactions and non- linear transformations ● Strong interactions benefit from being explicitly defined ○ Especially ratios/sums/differences among features ● GBM cannot capture complex features such as “average sales in the previous period for this type of product”
  • 13. Technical Tricks -- Glmnet ● From a methodology perspective, the opposite of GBM ● Captures (log/logistic) linear relationship ● Work with very small # of rows (a few hundred or even less) ● Complements GBM very well in a blend ● Need a lot of more work ○ missing values, outliers, transformations (log?), interactions ● The sparsity assumption -- L1 vs L2
  • 14. Technical Tricks -- Text mining ● tau package in R ● Python’s sklearn ● L2 penalty a must ● N-grams work well. ● Don’t forget the “trivial features”: length of text, number of words, etc. ● Many “text-mining” competitions on kaggle are actually dominated by structured fields -- KDD2014
  • 15. Technical Tricks -- Blending ● All models are wrong, but some are useful (George Box) ○ The hope is that they are wrong in different ways ● When in doubt, use average blender ● Beware of temptation to overfit public leaderboard ○ Use public LB + training CV ● The strongest individual model does not necessarily make the best blend ○ Sometimes intentionally built weak models are good blending candidates -- Liberty Mutual Competition
  • 16. Technical Tricks -- blending continued ● Try to build “diverse” models ○ Different tools -- GBM, Glmnet, RF, SVM, etc. ○ Different model specifications -- Linear, lognormal, poisson, 2 stage, etc. ○ Different subsets of features ○ Subsampled observations ○ Weighted/unweighted ○ … ● But, do not “peek at answers” (at least not too much)
  • 17. Apply what we learn outside of competitions ● Competitions give us really good models, but we also need to ○ Select the right problem and structure it correctly ○ Find good (at least useful) data ○ Make sure models are used the right way Competitions help us ● Understand how much “signal” exists in the data ● Identify flaws in data or data creation process ● Build generalizable models ● Broaden our technical horizon ● …
  • 18. Case 1 -- Amazon User Access competition ● One of the most popular competitions on Kaggle to date ○ 1687 teams ● Use anonymized features to predict if employee access request would be granted or denied ● All categorical features ○ Resource ID / Mgr ID / User ID / Dept ID … ○ Many features have high cardinality ● But I want to use GBM
  • 19. Case 1 -- Amazon User Access competition ● Encode categorical features using observation counts ○ This is even available for holdout data! ● Encode categorical features using average response ○ Average all but one (example on next slide) ○ Add noise to the training features ● Build different kind of trees + ENET ○ GBM + ERT + ENET + RF + GBM2 + ERT2 ● I didn't know VW (or similar), otherwise might have got better results. ● https://github.com/owenzhang/Kaggle-AmazonChallenge2013
  • 20. Case 1 -- Amazon User Access competition “Leave-one-out” encoding of categorical features: Split User ID Y mean(Y) random Exp_UID Training A1 0 .667 1.05 0.70035 Training A1 1 .333 .97 0.32301 Training A1 1 .333 .98 0.32634 Training A1 0 .667 1.02 0.68034 Test A1 - .5 1 .5 Test A1 - .5 1 .5 Training A2 0
  • 21. Case 2 -- Allstate User Purchase Option Prediction ● Predict final purchased product options based on earlier transactions. ○ 7 correlated targets ● This turns out to be very difficult because: ○ The evaluation criteria is all-or-nothing: all 7 predictions need to be correct ○ The baseline “last quoted” is very hard to beat. ■ Last quoted 53.269% ■ #3 (me) : 53.713% (+0.444%) ■ #1 solution 53.743% (+0.474%) ● Key challenges -- capture correlation, and not to lose to baseline
  • 22. Case 2 -- Allstate User Purchase Option Prediction ● Dependency -- Chained models ○ First build stand-alone model for F ○ Then model for G, given F ○ F => G => B => A => C => E => D ○ “Free models” first, “dependent” model later ○ In training time, use actual data ○ In prediction time, use most likely predicted value ● Not to lose to baseline -- 2 stage models ○ One model to predict which one to use: chained prediction, or baseline
  • 23. Useful Resources ● http://www.kaggle.com/competitions ● http://www.kaggle.com/forums ● http://statweb.stanford.edu/~tibs/ElemStatLearn/ ● http://scikit-learn.org/ ● http://cran.r-project.org/ ● https://github.com/JohnLangford/vowpal_wabbit/wiki ● ….