stories behind kaggle competitions
wendy kan, data scientist
wendy@kaggle.com
@wendykan
5/19/2015 @
kaggle runs public machine learning
competitions
we worked with clients/hosts on various
types of problems and data of different sizes
my job as a data scientist at kaggle
“data science is not just
kaggle competitions”
whyyyy???
machine learning processes
● Business Problem
● Collect Data
● Transform Data
● Dataset Splitting
● Evaluation Metric
● Feature Extraction
● Feature Selection
● Model Training
● Model Ensembling
● Methodology Selection
● Production System
● Ongoing Optimization
not every problem can be
turned into a kaggle
competition
size matters! where bigger is
better (most of the time)
data cleaning/formatting:
● easy to make a quick submission
● boosts participation
● (too) clean data kills creativity
data privacy/anonymization
metric: how do you measure
success?
● Classification - AUC / Logarithmic Loss / Accuracy
● Regression - RMSE/MAE
● Ranking - MAP/NDCG
● Other / Custom
https://www.kaggle.com/wiki/Metrics
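
To make a few of the metrics listed above concrete, here is a minimal sketch in plain Python. The function names are illustrative and are not taken from the talk or from Kaggle's own scoring code; the AUC version is the simple pairwise definition, which is fine for small examples but slow on large data.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, y_score):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

RMSE punishes one large miss more than MAE does, and log loss punishes confident wrong probabilities, so the same set of predictions can rank differently depending on which metric is chosen.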
the design of a competition shapes how
people are going to solve a problem
Splitting dataset
● training/test
● public/private
Time series data
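
The splits named above can be sketched as follows. This is an illustration under assumptions, not Kaggle's actual procedure: the `timestamp` field, the 30% public fraction, and both function names are hypothetical. For time series data the split is by time, so that the training set never contains rows from the period being predicted; the test rows are then divided into a public part and a private part.

```python
import random

def time_split(rows, cutoff, time_key="timestamp"):
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

def public_private_split(test_ids, public_fraction=0.3, seed=0):
    """Randomly assign test ids to a public and a private subset."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    ids = list(test_ids)
    rng.shuffle(ids)
    n_public = int(len(ids) * public_fraction)
    return set(ids[:n_public]), set(ids[n_public:])
```

Usage: `train, test = time_split(rows, cutoff=7)` followed by `public, private = public_private_split(r["id"] for r in test)`, assuming each row is a dict with `timestamp` and `id` keys.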
data leakage
“Deemed ‘one of the top ten data mining mistakes’, leakage is
essentially the introduction of information about the data mining target,
which should not be legitimately available to mine from”
“the concept of identifying and harnessing leakage has been openly
addressed as one of three key aspects for winning data mining
competitions”
“Leakage in Data Mining: formulation, detection, and avoidance” S Kaufman et al
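
One rough way to screen for leakage is to score each feature on its own against the target and look closely at anything that predicts it almost perfectly, such as a row id that happens to be sorted by label. The sketch below is a simple heuristic under assumptions and is not the method from the paper quoted above; the function names and the 0.95 threshold are hypothetical.

```python
def single_feature_auc(values, labels):
    """Pairwise AUC of one numeric feature against binary labels (ties count half)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_suspicious_features(columns, labels, threshold=0.95):
    """Return features whose solo AUC is at least `threshold` in either direction."""
    flagged = {}
    for name, values in columns.items():
        score = single_feature_auc(values, labels)
        strength = max(score, 1.0 - score)  # a perfectly inverted feature leaks too
        if strength >= threshold:
            flagged[name] = strength
    return flagged
```

A flagged feature is only a candidate for review: it may be a legitimately strong predictor, and leakage that needs several features combined will not show up in a one-feature check like this.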
do you have thousands of
people reviewing your
performance at work 24/7?
I do.
1. people make mistakes.
honesty is the best policy.
2. crowdsourcing is powerful.
anything that can go wrong
will go wrong.
Stories Behind Kaggle Competitions with Wendy Kan from Kaggle


Editor's Notes

  • #4 diabetic-retinopathy-detection
  • #12 anything you need to do to your own company data to persuade your boss to release it to 100,000s of people