Data Science Competition

2. 25. 2017
The 27th Annual KSEA South-Western Regional Conference
Jeong-Yoon Lee, Ph.D.

Chief Data Scientist, Conversion Logic
Ph.D. in Computer Science, USC
M.S. in Electrical Engineering, USC
B.S. in Electrical Engineering, SNU
KDD Cup Winner 2012 & 2015
Top 10, Kaggle 2015

Why Compete
• For fun
• For experience
• For learning
• For networking

Fun
• Competing with others
• Incremental improvement

Data Science Competitions
• KDD Cup - since 1997
• Netflix Prize - 2006 - 2009
• Kaggle - since 2010

Competition Structure
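• Training Data - Feature: provided. Label: provided.
• Test Data - Feature: provided. Label: predicted in the Submission, which is scored as a Public LB Score and a Private LB Score (LB = leaderboard).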
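The public/private leaderboard idea can be sketched in a few lines. This is a minimal illustration only, assuming the organizer holds the test labels and scores each submission on two disjoint subsets of the test set. The function names, the 30% public fraction, and the accuracy metric are all hypothetical; real competitions choose their own split and metric.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the hidden labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def score_submission(hidden_labels, submission, public_fraction=0.3, seed=0):
    """Score one submission on a fixed public/private split of the test set.

    public_fraction and seed are illustrative assumptions.
    """
    ids = sorted(hidden_labels)
    rng = random.Random(seed)  # fixed seed: every submission sees the same split
    rng.shuffle(ids)
    n_public = int(len(ids) * public_fraction)
    public_ids, private_ids = ids[:n_public], ids[n_public:]

    def score(subset):
        return accuracy([hidden_labels[i] for i in subset],
                        [submission[i] for i in subset])

    return score(public_ids), score(private_ids)


# Toy test set: the labels are known to the organizer only.
hidden_labels = {i: i % 2 for i in range(10)}
# A naive submission that predicts 1 for every test row.
submission = {i: 1 for i in range(10)}
public_score, private_score = score_submission(hidden_labels, submission)
```

Competitors see only the public score while the competition runs, so a model tuned too closely to it can drop on the private score.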

Kaggle
• 250+ competitions since 2010
• 500K+ users
• 50K+ competitors
• $3MM+ in prize money paid out

Misconceptions on Competitions
• No ETL
• No EDA
• Not worth it
• Not for production

No ETL? - Deloitte Western Australia Rental Prices

No ETL? - Outbrain Click Prediction
2B page views. 16.9MM clicks. 700MM users. 560 sites

No ETL? - YouTube-8M Video Understanding Challenge
1.7TB feature-level data. 31GB video-level data.

No EDA?
• Most competitions provide actual labels - typical EDA
• Anonymized data - more creative EDA
o People decode age, states, time intervals, income, etc.

Not worth it?
• Performance matters
• You walk easier when you can run

Not for Production?
• Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB

Ensemble Pipeline at Conversion Logic

Best Practices
• Feature Engineering
• Algorithms
• Cross Validation
• Ensemble

Feature Engineering
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
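A few of the numerical and categorical transforms listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the code used in any competition; the helper names and toy values are hypothetical, and in practice libraries such as Scikit-Learn provide equivalent transformers.

```python
import math
import statistics


def log1p_transform(values):
    """Log(1 + x): compresses large values and is defined at x = 0."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Normalization to zero mean and unit variance (assumes values are not all equal)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def binarize(values, threshold):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot_encode(categories):
    """One-hot-encode: one 0/1 column per distinct category level."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


prices = [0, 9, 99, 999]
logged = log1p_transform(prices)
scaled = standardize(prices)
flags = binarize(prices, threshold=50)
levels, encoded = one_hot_encode(["red", "blue", "red"])
```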

Algorithms
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
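As an illustration of the Factorization Machine row, the standard second-order FM scoring function can be written directly. This is a sketch of the textbook model only, not libFM's code or the KDD Cup 2012 solution; the function names and toy parameters are hypothetical, and training (learning w0, w, and V) is not shown.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score.

    x:  feature values (length n)
    w0: global bias
    w:  linear weights (length n)
    V:  n latent vectors, each of length k

    Uses the identity
      sum_{i<j} <V[i], V[j]> x[i] x[j]
        = 0.5 * sum_f ((sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2)
    so the pairwise term costs O(k * n) instead of O(k * n^2).
    """
    n, k = len(x), len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same model with the pairwise term written out explicitly, for checking."""
    n = len(x)
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


# Toy parameters (hypothetical values, not learned from data).
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]
fast = fm_predict(x, w0, w, V)
naive = fm_predict_naive(x, w0, w, V)
```

A field-aware FM differs in that each feature keeps one latent vector per field of the other feature in the pair; that variant is not shown here.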

Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
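A stratified five-fold split like the one described above might look as follows. This is a minimal standard-library sketch with a hypothetical helper name and toy labels; Scikit-Learn's StratifiedKFold is the usual tool in practice.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_folds=5, seed=42):
    """Assign sample indices to folds so each fold keeps roughly the same
    size and label mix as the full training set."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal each class out round-robin so it is spread evenly over folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Toy labels: 100 samples with a 20% positive rate.
labels = [1] * 20 + [0] * 80
folds = stratified_kfold(labels)

for k, valid_idx in enumerate(folds):
    # Train on the other four folds, validate on fold k.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    positive_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
```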

Ensemble
* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
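The ensemble diagram from this slide is not recoverable from the text, so the sketch below shows only one of the simplest ensemble types, a weighted average of model predictions. It is an assumption for illustration, not necessarily the method pictured; the function name, model names, and prediction values are all hypothetical.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model prediction lists (equal weights by default)."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)  # one row per sample, one value per model
    ]


# Hypothetical predicted probabilities from three models on three samples.
gbm_pred = [0.9, 0.2, 0.6]
nn_pred = [0.7, 0.4, 0.8]
lr_pred = [0.8, 0.3, 0.1]

average = blend([gbm_pred, nn_pred, lr_pred])
weighted = blend([gbm_pred, nn_pred, lr_pred], weights=[2, 1, 1])
```

Blending weights are usually chosen on held-out or cross-validated predictions rather than on the training fit, so the ensemble does not simply favor the most overfit model.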

Why Compete
• For fun
• For experience
• For learning
• For networking

One Last Thing
Google: 20K applications per week
Conversion Logic: 200 applications per week
