How to get into Kaggle?
Philipp Singer & Dmitry Gordeev
Vienna Data Science Meetup, Vienna,
Dec 5th 2019
Who we are
● Philipp
○ Data scientist at UNIQA
○ PhD in CS at TU Graz
○ Profound experience in ML research and applications
○ Kaggle competition master currently ranked 36th
● Dmitry
○ Data scientist at UNIQA
○ Master’s degree in data mining
○ In-depth experience with ML applications in financial institutions
○ Kaggle competition grandmaster currently ranked 34th
● Competing successfully together on Kaggle for 1 year: The Zoo
2
What is Kaggle?
● “Your home for Data Science”
○ Online community of data scientists and machine learners
○ Founded in 2010
○ Acquired by Google in 2017
● Data science competitions
● Share notebooks, datasets, and discussions
● Courses and tutorials
● Free notebook infrastructure with CPUs and GPUs
3
How big is Kaggle?
● The most popular ML competition platform
● The largest ML community
125 000+ users
350 completed competitions
up to 10 000 users per competition
Prize fund usually $20 000 - $100 000
4
Kaggle survey results
5
Kaggle survey results
6
Kaggle survey results
7
Kaggle survey results
8
Competitions on Kaggle
● Usually hosted by companies or research institutes
● Main goal: prediction
● Wide range of different types of competitions
○ Different types of domains (e.g., financial, medical, sports, …)
○ Different types of data (e.g., tabular, NLP, image, video, time-series, …)
○ Different types of objectives (e.g., classification, regression, segmentation, …)
○ Different goals of competitions (featured, research, playground, in-class)
● Built-in progression system with medals and ranks
● Top spots usually receive prize money
9
Competition medals
10
User ranking + titles
11
How competitions usually work
https://mc.ai/pseudo-labeling/
12
The Zoo
● Started competing under the team name “The Zoo” exactly one year ago
● Little prior experience on Kaggle
● Participated in 7 competitions
● Strategy: diversify types of competitions for learning purposes
13
Our Journey
14
Quora
Develop models that identify and flag insincere questions.
1 306 122 labelled questions
6.2% insincere questions
4 037 teams
2 hours to fit and predict
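With only 6.2% of the questions labelled insincere, plain accuracy says little about model quality. The sketch below is illustrative only: it uses the two figures from this slide and a hypothetical baseline that never flags a question. Comparing against recall and F1 is an assumption made here for illustration; the slide does not name the competition metric.

```python
# Illustrative only: both numbers are taken from the slide.
n_questions = 1_306_122
insincere_rate = 0.062

n_insincere = round(n_questions * insincere_rate)
n_sincere = n_questions - n_insincere

# Hypothetical baseline: predict "sincere" for every question.
true_pos = 0
false_pos = 0
false_neg = n_insincere
true_neg = n_sincere

accuracy = (true_pos + true_neg) / n_questions
recall = true_pos / (true_pos + false_neg)
# F1 written in count form: 2*TP / (2*TP + FP + FN)
f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The baseline looks strong on accuracy while finding no insincere question at all, which is why imbalanced problems like this one are usually judged on a metric that focuses on the rare class.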
15
Quora - sincere/insincere
How can I become a data scientist?
How come Trump is so stupid?
Is it possible for a vegan who does crossfit to go 10 minutes without telling someone about it?
Everytime I slap myself in the face, it hurts. How can I prevent this?
16
Quora - solution
17
Quora - final standings
18
Santander
Identify which customers will make a specific transaction in the future
200 000 transactions
8 802 teams
2 months duration
19
Santander - the mysterious data
20
Santander - solution
21
Santander - final standings
22
LANL Earthquake Prediction
Predict the time remaining before laboratory earthquakes occur from real-time seismic data.
629 145 480 data points
4 200 training segments
4 540 teams
30 minutes to fit and predict
23
LANL - the physics
24
LANL - solution
● Derived a handful of features from the data capturing peaks and volatility of the acoustic signal
● Combination (ensemble) of two state-of-the-art modeling approaches
○ Gradient Boosting Regression Trees
○ Neural Network (Deep Learning)
● Novel statistical data adjustment to account for different earthquake cycles
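To make "features capturing peaks and volatility" more concrete, here is a minimal sketch of how one acoustic segment could be summarised. The function name `segment_features` and the four statistics are hypothetical choices for illustration; the slide does not list the features the team actually used.

```python
import random
import statistics

def segment_features(signal):
    """Summarise one acoustic segment with a few peak and volatility statistics."""
    abs_sorted = sorted(abs(x) for x in signal)
    # Position of the 95th percentile within the sorted absolute values.
    q95_index = min(len(abs_sorted) - 1, int(0.95 * len(abs_sorted)))
    abs_diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return {
        "std": statistics.pstdev(signal),                # overall volatility
        "abs_max": abs_sorted[-1],                       # largest peak
        "abs_q95": abs_sorted[q95_index],                # size of typical large peaks
        "mean_abs_diff": statistics.fmean(abs_diffs),    # step-to-step roughness
    }

# Synthetic stand-ins for a calm and an agitated segment (not competition data).
random.seed(0)
quiet = [random.gauss(0, 1) for _ in range(1000)]
noisy = [random.gauss(0, 5) for _ in range(1000)]
print(segment_features(quiet))
print(segment_features(noisy))
```

Each segment becomes one row of such features, which can then be fed to models like the gradient boosting trees and neural networks mentioned above.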
25
LANL - final standings
26
APTOS Blindness Detection
Detect diabetic retinopathy to stop blindness before it's too late!
Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
3 662 retina images
0 - 4 retinopathy levels
2 943 teams
15 000 evaluation images
27
APTOS
https://www.eyeops.com/contents/our-services/eye-diseases/diabetic-retinopathy; https://www.vequill.com/how-to-cure-temporary-blindness/
28
APTOS - solution
● Careful image pre-processing to remove any kind of bias (e.g., device)
● Combination of several of the current best deep neural networks
● Models are pre-trained on a large collection of image data (ImageNet + extra retina images)
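One simple way to combine several networks is to average their predicted class probabilities and pick the most likely retinopathy level. The sketch below assumes plain unweighted averaging and uses made-up outputs from three hypothetical models; the slide does not say how the team actually blended its models.

```python
def average_probabilities(model_outputs):
    """Average per-class probabilities from several models for one image."""
    n_models = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [sum(m[c] for m in model_outputs) / n_models for c in range(n_classes)]

def predict_level(model_outputs):
    """Return the retinopathy level (0-4) with the highest averaged probability."""
    avg = average_probabilities(model_outputs)
    return max(range(len(avg)), key=avg.__getitem__)

# Made-up outputs of three hypothetical models for a single retina image.
outputs = [
    [0.10, 0.20, 0.50, 0.15, 0.05],
    [0.05, 0.40, 0.35, 0.15, 0.05],
    [0.10, 0.25, 0.45, 0.10, 0.10],
]
print(average_probabilities(outputs))
print(predict_level(outputs))
```

Averaging tends to help most when the individual models make different mistakes, which is one reason for mixing several architectures.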
29
APTOS - final standings
30
Quiz
Data: Atomic elements (H for hydrogen, C for carbon, etc.) and their X, Y, Z Cartesian coordinates.
Task: Develop an algorithm that can predict the magnetic interaction between two atoms in a molecule.
● Did I have relevant experience to enter this competition?
31
Why should you start on Kaggle?
● Doing is the best way to learn
● Get in touch with data and use cases outside your main domain
● Keep up-to-date with state-of-the-art methods
● Learn from others
● Measure yourself and know where you stand
● Hardware and software are provided by Kaggle
32
Easy start
33
How can you start on Kaggle?
● Don’t be afraid! Just do it!
● Overcome self-handicapping behavior
● You gain points regardless of the result
● “Getting started” competitions
● Pick a competition that sounds exciting to you; don’t be afraid to pick one where you have no prior experience
● Research similar previous competitions and read solutions
● Follow published notebooks and discussions
34
Learn from the community
35
How to approach a competition?
● Choose a programming language (usually Python or R)
● Understand the problem setting, get a feeling for the data and the metric
● Exploratory Data Analysis (EDA)
● Implement a basic script / notebook from scratch doing training and prediction OR just fork someone’s model ;-)
● Think hard about a robust CV setup
● Keep up-to-date on discussions and developments in the competition
● Experiment a lot and iterate quickly
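A robust CV (cross-validation) setup is what makes quick experiments comparable. The sketch below shows one basic form, shuffled K-fold, using only the standard library. The helper names (`k_fold_indices`, `cross_validate`) and the toy mean-predicting "model" are assumptions for illustration; a real competition may need a grouped, stratified, or time-based split instead.

```python
import random
import statistics

def k_fold_indices(n_samples, n_folds, seed=42):
    """Shuffle the sample indices once and split them into disjoint validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(xs, ys, fit, predict, metric, n_folds=5):
    """Return one validation score per fold; every sample is validated exactly once."""
    scores = []
    for valid_idx in k_fold_indices(len(xs), n_folds):
        valid = set(valid_idx)
        train_idx = [i for i in range(len(xs)) if i not in valid]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in valid_idx]
        scores.append(metric([ys[i] for i in valid_idx], preds))
    return scores

# Toy stand-ins: a "model" that always predicts the training mean, scored by MAE.
def fit_mean(train_x, train_y):
    return statistics.fmean(train_y)

def predict_mean(model, x):
    return model

def mae(y_true, y_pred):
    return statistics.fmean([abs(t - p) for t, p in zip(y_true, y_pred)])

xs = list(range(20))
ys = [2.0 * x for x in xs]
scores = cross_validate(xs, ys, fit_mean, predict_mean, mae, n_folds=5)
print(scores, statistics.fmean(scores))
```

Keeping the seed fixed means every experiment is scored on the same folds, so a change in the mean score reflects the model change rather than a different split.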
36
Try more, fail fast
Baseline model
Final model
37
Thanks!
Get in touch with us! We are open to any inquiries.
me@philippsinger.com
dott1718@gmail.com
@ph_singer @dott1718
38
