Winning Data Science Competitions

Presented by Jeong-Yoon Lee at Microsoft on 3/29/2017.

Published in: Data & Analytics

  1. 1. Winning Data Science Competitions. 3/29/2017. Jeong-Yoon Lee, Ph.D.
  2. 2. Jeong-Yoon Lee, Ph.D.: Chief Data Scientist, Conversion Logic; 70+ competitions; 6-time prize winner (KDD Cup 2012 & 2015); 8 top-10 finishes (Deloitte, AARP, Liberty Mutual); Top 10, Kaggle 2015; father of 4 boys
  3. 3. About Conversion Logic: Advanced Marketing Attribution for Diverse Customers
  4. 4. Why Data Science Competition
  5. 5. Why Compete: for fun, for experience, for learning, for networking
  6. 6. Fun: competing with others, continuous improvement
  7. 7. Experience
  8. 8. Learning
  9. 9. Learning
  10. 10. Networking
  11. 11.
  12. 12. Data Science Competitions
  13. 13. Data Science Competitions: since 1997; 2006 - 2009; since 2010
  14. 14. Competition Structure: Training Data (Feature, Label) and Test Data (Feature) are provided; a Submission receives a Public LB Score and a Private LB Score
  15. 15. Kaggle: 250+ competitions since 2010, 900K users, 50K+ competitors, $3MM+ in prizes paid out
  16. 16. Kaggle
  17. 17. Kaggle
  18. 18. Misconceptions on Competitions
  19. 19. Misconceptions on Competitions: No ETL, No EDA, Not worth it, Not for production
  20. 20. No ETL? - Deloitte Western Australia Rental Prices
  21. 21. No ETL? - Outbrain Click Prediction: 2B page views, 16.9MM clicks, 700MM users, 560 sites
  22. 22. No ETL? - YouTube-8M Video Understanding Challenge: 1.7TB frame-level data, 31GB video-level data
  23. 23. No ETL?
  24. 24. No EDA?
      o Most competitions provide actual labels - typical EDA
      o Anonymized data - more creative EDA: people decode age, states, time intervals, income, etc.
  25. 25. No EDA? Anonymized data - more creative EDA
  26. 26. Not worth it? Performance matters. You walk easier when you can run.
  27. 27. Not for Production? Kaggle Kernel
      o Max execution time: 10 minutes
      o Max file output: 500MB
      o Memory limit: 8GB
  28. 28. Ensemble Pipeline at Conversion Logic
  29. 29. Best Practices
  30. 30. Best Practices: Feature Engineering, Diverse Algorithms, Cross Validation, Ensemble, Collaboration
  31. 31. Feature Engineering
      Types | Note
      Numerical | Log, Log2(1 + x), Box-Cox, Normalization, Binning
      Categorical | One-hot-encoding, Label-encoding, Count, Weight-of-Evidence
      Text | Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram
      Timeseries/Sensor data | Descriptive Statistics, Derivatives, FFT, MFCC, ERP
      Network Graph | Degree, Closeness, Betweenness, PageRank
      Numerical/Timeseries | Convert to categorical features using RF/GBM
      Dimensionality Reduction | PCA, SVD, Autoencoder, Hashing Trick
      Interaction | Addition/subtraction/multiplication/division, Hashing Trick
      * More comprehensive overview on feature engineering by HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
  32. 32. Diverse Algorithms
      Algorithm | Tool | Note
      Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
      Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
      Extremely Randomized Trees | Scikit-Learn |
      Neural Networks/Deep Learning | Keras, MXNet, Torch, CNTK | Blends well with GBM. Best at image and speech recognition competitions
      Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
      Support Vector Machine | Scikit-Learn |
      FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
      Factorization Machine | libFM, fastFM | Winning solution for KDD Cup 2012
      Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
  33. 33. Cross Validation: Training data are split into five folds, with the sample size and dropout rate preserved in each fold (stratified).
  34. 34. Ensemble - Stacking. * For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
  35. 35. KDDCup 2015 Solution
  36. 36. Collaboration
  37. 37. Collaboration – Git Repo + S3/Dropbox
  38. 38. Collaboration – Common Validation
  39. 39. Collaboration – Internal Leaderboard
  40. 40. Why Competition: for fun, for experiences, for learning, for networking. Best Practices: Feature Engineering, Diverse Algorithms, Cross Validation, Ensemble, Collaboration
  41. 41. Things That Help
      o Keep competition journals and repos, both during and after competitions
      o Build and improve the automated pipeline and library for competitions
        • https://github.com/jeongyoonlee/Kaggler
        • https://gitlab.com/jeongyoonlee/allstate-claims-severity/tree/master
        • http://kaggler.com/kagglers-toolbox-setup/
      o Be humble, and ready to try and learn something new
      o Make a commitment and work on competitions on a regular basis, no matter what
  42. 42. Resources
      o No Free Hunch by Kaggle
      o Winning Tips on Machine Learning Competitions by Marios Michailidis (KazAnova)
      o Feature Engineering, mlwave.com by HJ van Veen (Triskelion)
      o fastml.com by Zygmunt Zając (Foxtrot)
      o kaggler.com, facebook.com/Kaggler by Jeong-Yoon Lee @ CL and Hang Li @ Hulu
      o Tianqi Chen @ UW: won KDDCup 2012, DSB 2015; author of XGBoost, MXNet
      o Gilberto Titericz Junior in San Francisco: #1 at Kaggle
  43. 43. Active Competitions: Kaggle (6 Featured, 1 Job competitions), KDD Cup 2017, RecSys Challenge 2017, CIKM AnalytiCup 2017
  44. 44. Thank You
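
Slide 31 lists transformations by feature type. As a rough illustration only, the sketch below implements four of them (log(1 + x), one-hot, label, and count encoding) in plain Python. The function names and the sample values are hypothetical and do not come from the talk; a real pipeline would more likely rely on pandas or scikit-learn.

```python
import math
from collections import Counter


def log_transform(values):
    """Numerical: log(1 + x) compresses a long right tail (requires x > -1)."""
    return [math.log1p(v) for v in values]


def one_hot(values):
    """Categorical: one binary column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def label_encode(values):
    """Categorical: map each category to an integer id, in sorted order."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def count_encode(values):
    """Categorical: replace each category with how often it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


# Hypothetical sample columns, for illustration only.
prices = [0.0, 9.0, 99.0]
cities = ["perth", "albany", "perth"]

log_prices = log_transform(prices)
city_one_hot = one_hot(cities)
city_labels = label_encode(cities)
city_counts = count_encode(cities)
```

Count encoding keeps a high-cardinality column as a single numeric feature, whereas one-hot encoding adds one column per category, which is one reason both appear in the slide's list.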

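
Slides 33 and 34 describe stratified five-fold cross validation and a stacking ensemble. The following is a minimal sketch of how the two can fit together, assuming scikit-learn and NumPy are installed. The synthetic data, the choice of base models, and all variable names are assumptions made for illustration; this is not the pipeline shown in the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced binary data (roughly 20% positives) standing in
# for a competition data set. Purely hypothetical.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, y_train = X[:300], y[:300]
X_test = X[300:]

# One shared, stratified 5-fold split: every fold keeps about the same
# size and positive rate, and every base model is validated on it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [
    float(y_train[valid_idx].mean()) for _, valid_idx in cv.split(X_train, y_train)
]

base_models = [
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

oof_columns, test_columns = [], []
for model in base_models:
    # Out-of-fold predictions: each training row is scored by a model
    # that did not see that row during fitting.
    oof = cross_val_predict(
        model, X_train, y_train, cv=cv, method="predict_proba"
    )[:, 1]
    oof_columns.append(oof)
    # Refit on all training rows to produce test-set predictions.
    model.fit(X_train, y_train)
    test_columns.append(model.predict_proba(X_test)[:, 1])

# Base-model predictions become the features of the second-stage model.
stage1_train = np.column_stack(oof_columns)
stage1_test = np.column_stack(test_columns)

meta_model = LogisticRegression()
meta_model.fit(stage1_train, y_train)
stacked_pred = meta_model.predict_proba(stage1_test)[:, 1]
```

Training the second-stage model on out-of-fold predictions rather than in-sample predictions avoids leaking the training labels into its inputs. Reusing one fixed split across models is also what makes validation scores comparable between teammates.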