Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

How to get started in Kaggle competition


Published on

How to get started in Kaggle competition for data scientists.

Published in: Data & Analytics

How to get started in Kaggle competition

  1. 1. How to get started in Merja Kajava September 1, 2015 Helsinki Data Analytics and Science Meetup
  2. 2. is a competition platform for (aspiring) data scientists
  3. 3. Why participate in Kaggle The data Competition steps Tips
  4. 4. Why participate in Kaggle competition?
  5. 5. 1 Learn from the best. Forums Scripts Solutions from prize winners
  6. 6. 2 Work with cool datasets. Flights in GE Flights Quest Driver telematic analysis Amazon employee access rights
  7. 7. +1 You can also win money
  8. 8. What kinds of competitions Kaggle has?
  9. 9. Public In-class Private
  10. 10. What languages can you use? Any open-source language (sometimes also sponsor’s proprietary languages) Gnu Octave (no Matlab)
  11. 11. What is the data like?
  12. 12. Data comes from companies and non-profit organizations
  13. 13. Data sizes vary Zip ~1 MB Zip ~6 GB
  14. 14. Data comes in all shapes Customer data Log files Timeseries HTML pages Images Documents
  15. 15. How does Kaggle competition work?
  16. 16. Competition flow Duration typically 4 to 8 weeks Max. 5 entries per day
  17. 17. test Model train submission.csvPredict Build prediction model Calculate CV to cross-validate
  18. 18. Evaluate submission Typical evaluations Area under the ROC curve Normalized Gini coefficient RMSLE …
  19. 19. Public leaderboard Private leaderboard ~10-30% of test data submission.csv Submit entry
  20. 20. Choose two entries for final Practical choice Best entry in public leaderboard + Best CV from local entries
  21. 21. Tips
  22. 22. Look at data. Visualize it. Source
  23. 23. Focus on feature engineering Feature selection Feature construction Dates Locations Categories Segmentation Statistics
  24. 24. Build different models 3 target variables 4 cities = Build 12 models Source
  25. 25. Try different algorithms Random forest Vowpal Wabbit GBM Xgboost
  26. 26. Build ensembles Average of submissions Weighted average of submissions Ranked average of submissions Stacked generalization Blending
  27. 27. Keep track of your submissions Submission id
  28. 28. Next steps
  29. 29. Start competing Create Kaggle account Choose competition Go for it!
  30. 30. Useful links Kaggle Blog Kaggle Competitions: Where to begin Kaggle Feature Engineering engineer-features-and-how-to-get-good-at-it/ Kaggle Ensembling Guide