
Ben Hamner, Co-founder and CTO, Kaggle at MLconf SF - 11/13/15


Lessons learned from Running Hundreds of Kaggle Competitions: At Kaggle, we've run hundreds of machine learning competitions and seen over 80,000 data scientists make submissions. One thing is clear: winning competitions isn't random. We've learned that certain tools and methodologies work consistently well on different types of problems. Many participants make common mistakes (such as overfitting) that should be actively avoided. Similarly, competition hosts have their own set of pitfalls (such as data leakage).

In this talk, I'll share what goes into a winning competition toolkit along with some war stories on what to avoid. Additionally, I’ll share what we’re seeing on the collaborative side of competitions. Our community is showing an increasing amount of collaboration in developing machine learning models and analytic solutions. I'll showcase examples of this and discuss how these types of collaboration will improve how data science is learned and applied.


  1. Lessons from ML Competitions. Ben Hamner, ben.hamner@kaggle.com, November 13, 2015. (Photo by mikebaird, www.flickr.com/photos/mikebaird)
  2. Kaggle runs machine learning competitions
  3. We release challenging machine learning problems to our community of 410,000 data scientists
  4. [Chart: monthly submission counts, Sep 2010 through Sep 2015] Our community makes 100k submissions per month on these competitions
  5. Examples of Machine Learning Competitions
  6. Automatically grading student-written essays: 21,000+ essays, 197 entrants, 155 teams, 2,499 submissions over 80 days, $100,000 in prizes, human-level performance. www.kaggle.com/c/asap-aes
  7. Predicting a compound's toxicity given its molecular structure: 796 entrants, 703 teams, 8,841 submissions over 91 days, $20,000 in prizes, 25.6% improvement over the previous accuracy benchmark. www.kaggle.com/c/BioResponse
  8. Personalizing web search results: 167,000,000+ logs, 261 entrants, 194 teams, 3,570 submissions over 91 days, $9,000 in prizes. www.kaggle.com/c/yandex-personalized-web-search-challenge
  9. Detecting diabetic retinopathy: 88,000+ retina images, 854 entrants, 661 teams, 6,999 submissions over 160 days, $100,000 in prizes, 85% agreement with a human rater (quadratic weighted kappa). www.kaggle.com/c/diabetic-retinopathy-detection
  10. How do machine learning competitions work?
  11. We take a dataset with a target variable, something we're trying to predict. The example: predicting the sale price of a home. [Table: 20 home listings with SalePrice (the target), SquareFeet, Type, LotAcres, Beds, Baths]
  12. Split the data into two sets, a training set and a test set. The test-set SalePrice values are held back as the solution ("ground truth").
  13. Our community gets everything but the solution on the test set. [Same table, with the test rows' SalePrice shown as ???]
  14. Competition participants use the training set to learn the relation between the data and the target.
  15. Competition participants apply their models to make predictions on the test set. [Same table, plus a Submission column of predicted prices for the test rows]
  16. Kaggle compares the submission to the ground truth. [Same table, plus a Delta column: predicted minus actual price for each test row]
  17. Kaggle calculates two scores, one for the public leaderboard and one for the private leaderboard. [In the example: mean error of $14k on the public leaderboard, $15k on the private leaderboard]
  18. The participant immediately sees their public score on the public leaderboard
  19. Participants explore the problem and iterate on their models to improve them
  20. At the end, the participant with the best score on the private leaderboard wins
  21. Competition leaderboards
  22. The leaderboard is a powerful mechanism to drive competition
  23. The leaderboard is objective and meritocratic
  24. The leaderboard encourages leapfrogging
  25. The leaderboard encourages iterative improvements over many submissions
  26. This causes the competition to approach the frontier of what's possible given the data
  27. Many competitions quickly approach a frontier; the most challenging ones take longer
  28. Some applied ML research looks like competitions running over years instead of months. www.kaggle.com/c/BioResponse/leaderboard and yann.lecun.com/exdb/mnist/
  29. One long-running research competition is ImageNet (not hosted on Kaggle). www.image-net.org
  30. We see a similar progression in ImageNet performance over time as we do in Kaggle competitions. www.image-net.org
  31. Can we do better than competition results?
  32. Looking holistically across all the competitions
  33. At Kaggle, we've run hundreds of public machine learning competitions
  34. And over 600 in-class competitions for university students
  35. These competitions have generated over 2,000,000 submissions from around the world
  36. Most of the competitions we've run have involved supervised classification or regression
  37. Doing well in competitions
  38. Set up your environment to enable rapid iteration and experimentation across the whole pipeline: data preprocessing; identifying and handling data oddities; extracting and selecting features; training models; evaluating and visualizing results
  39. As an example, here's a dashboard one user created to evaluate Diabetic Retinopathy models: http://jeffreydf.github.io/diabetic-retinopathy-detection/
  40. Successful users invest time, thought, and creativity in problem structure and feature extraction
  41. Random forests and GBMs work very well for many common classification and regression tasks (Verikas et al. 2011)
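As an illustration of that point (not from the talk), here is how one might try both model families in scikit-learn on a synthetic tabular task; nothing here is specific to any Kaggle competition.

```python
# A minimal sketch: random forests and gradient boosting as strong first models
# for a generic tabular classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

for model in (RandomForestClassifier(n_estimators=500, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{type(model).__name__}: mean CV AUC = {scores.mean():.3f}")
```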
  42. Deep learning has been very effective in the computer vision competitions we've hosted. Caffe, Theano, Torch7, and Keras are four popular open-source libraries that facilitate this.
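For illustration, a minimal convolutional network of the kind these libraries make easy to express, written against the present-day Keras Sequential API rather than the 2015-era APIs named on the slide; the input shape and class count are placeholders.

```python
# A sketch of a small image classifier, assuming illustrative 64x64 RGB inputs
# and 5 output classes (e.g. retinopathy severity grades).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```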
  43. XGBoost and Keras: two ML libraries with great power-to-effort ratios.
      Competition          | Type           | Winning ML algorithm
      Liberty Mutual       | Regression     | XGBoost
      Caterpillar Tubes    | Regression     | Keras + XGBoost + Reg. Forest
      Diabetic Retinopathy | Image          | SparseConvNet + RF
      Avito                | CTR            | XGBoost
      Taxi Trajectory 2    | Geostats       | Classic neural net
      Grasp and Lift       | EEG            | Keras + XGBoost + other CNN
      Otto Group           | Classification | Stacked ensemble of 35 models
      Facebook IV          | Classification | sklearn GBM
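A minimal XGBoost sketch for a tabular regression task of the kind listed above; the hyperparameters are illustrative defaults, not any winner's settings.

```python
# A sketch using XGBoost's scikit-learn wrapper on synthetic regression data.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=50, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

# Final validation RMSE from the recorded evaluation history.
print("validation RMSE:", model.evals_result()["validation_0"]["rmse"][-1])
```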
  45. The Boruta feature selection algorithm is robust and reliable. It is a wrapper method around random forest and its calculated variable importance: it iteratively trains RFs and runs statistical tests to identify each feature as important or not important. It is widely used in competition-winning models to select a small subset of features for training more complex models. Available via library(Boruta) in R.
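The reference implementation named on the slide is the R package; as a hedged Python sketch, the community BorutaPy port (assuming `pip install boruta`) follows the same recipe.

```python
# A sketch of Boruta-style feature selection via the community BorutaPy port.
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
selector = BorutaPy(rf, n_estimators="auto", random_state=0)
selector.fit(X, y)  # iteratively trains RFs and tests features against shadow copies

print("confirmed important feature indices:", np.where(selector.support_)[0])
```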
  46. Model ensembling usually yields modest but meaningful performance gains
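A minimal sketch of the simplest form of ensembling, averaging predictions from a few diverse models; whether the blend beats the best single model depends on the data, so treat this as illustrative only.

```python
# A sketch comparing individual models to a simple averaged ensemble.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=40, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0),
          Ridge()]
preds = [m.fit(X_train, y_train).predict(X_test) for m in models]

for m, p in zip(models, preds):
    print(f"{type(m).__name__:26s} MAE: {mean_absolute_error(y_test, p):.2f}")
print(f"{'Averaged ensemble':26s} MAE: "
      f"{mean_absolute_error(y_test, np.mean(preds, axis=0)):.2f}")
```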
  47. Data leakage is our (and our users') #1 challenge. (Image: http://www.navy.mil/view_image.asp?id=12495)
  48. We've also seen some things that competitions aren't effective at
  49. Competitions don't typically yield simple and theoretically elegant solutions (one exception: Factorization Machines in KDD Cup 2012)
  50. Competitions don't typically yield production code. http://ora-00001.blogspot.ru/2011/07/mythbusters-stored-procedures-edition.html
  51. Competitions don't always yield computationally efficient solutions: they reward performance without computational or complexity constraints
  52. Competitions tend to be highly effective at
  53. Optimizing a quantifiable evaluation metric by exploring an enormously broad range of approaches
  54. Fairly and consistently evaluating a variety of approaches on the same problem. Implementation details matter, which can make it tough to reproduce results in settings where the data and/or code are not open source. "A quick, simple way to apply machine learning successfully? In your domain, find the stupid baseline that new methods consistently claim to beat. Implement that stupid baseline."
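One hedged way to act on that quote is scikit-learn's Dummy estimators, which implement exactly the "stupid baseline": majority-class prediction for classification, mean prediction for regression.

```python
# A sketch of the baseline floor any real model must beat; on a 90/10
# class-imbalanced task, the majority-class baseline already scores ~0.9.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
baseline = DummyClassifier(strategy="most_frequent")
print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
```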
  55. Identifying data quality and leakage issues. Check that the ID column isn't informative; time series are tricky. "Deemed 'one of the top ten data mining mistakes', leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from." (S. Kaufman et al., "Leakage in Data Mining: Formulation, Detection, and Avoidance"). A data-quality example from essay scoring: "This essay got good marks, but as far as I can tell, it's gibberish." (human scores: 5/5, 4/5)
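As a sketch of the slide's "check that the ID column isn't informative" test (my construction, not the talk's code): train a model on the ID alone and compare it to a dummy baseline on a deliberately leaky toy dataset; a large gap signals leakage.

```python
# A sketch of an ID-leakage check on a toy dataset where IDs were assigned
# after sorting rows by the target, so the ID alone predicts the label.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = np.array([0] * 500 + [1] * 500)          # target, sorted
ids = np.arange(1000).reshape(-1, 1)         # "ID" column as the only feature

leak_score = cross_val_score(RandomForestClassifier(random_state=0), ids, y, cv=5).mean()
base_score = cross_val_score(DummyClassifier(strategy="most_frequent"), ids, y, cv=5).mean()
print(f"ID-only model: {leak_score:.2f} vs baseline: {base_score:.2f}")  # big gap => leakage
```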
  56. Exposing a specific domain problem to many new communities around the world
  57. Where Kaggle's going
  58. Kaggle's mission is to help the world learn from data. http://data-arts.appspot.com/globe/
  59. We're building a public platform for collaborating on data and analytics results: people, data, and code
  60. An early alpha version of this is released as Kaggle Scripts
  61. It enables users to immediately access R/Python/Julia environments with data preloaded
  62. Everything created on Kaggle Scripts is published as soon as it's run. www.kaggle.com/scripts
  63. Reproducing and building on another's work is simply a click away
  64. We're starting to enable users to do this on non-competition datasets
  65. Soon, any user will be able to publish data through Kaggle for analysis
  66. Thank you! Head to www.kaggle.com/scripts to check out code, visualizations, and results from our community
