How To Win Any
Machine Learning Competition
by Kaggle Grandmaster
Pavel Pleskov
Bio
Education
●
MS in Math from MSU (2010)
●
MA in Econ from NES (2012)
Work
●
Financial Consultant (1 year)
●
Quantitative Researcher (2 years)
●
HFT Fund Partner (2 years)
Content
●
What is Kaggle and why it is important
●
Why participate in competitions
●
How to choose a platform
●
Winning tricks and hacks
●
What’s next
●
Q&A
What is Kaggle
●
Founded in April 2010,
acquired by Google in
March 2017
●
World's largest
community of data
scientists and machine
learners
●
1 million registered users,
93K ranked users
Why Kaggle is unique
●
Kaggle's community has
thousands of public
datasets and code
snippets (called
"kernels" on Kaggle)
Kaggle as a forum
●
Many of the researchers
publish papers in peer-reviewed
journals based
on their performance in
Kaggle competitions.
Why business is so interested
IEEE's Signal Processing Society
●
Sony NEX-7
●
Motorola Moto X
●
Motorola Nexus 6
●
Motorola DROID MAXX
●
LG Nexus 5x
●
Apple iPhone 6
●
Apple iPhone 4s
●
HTC One M7
●
Samsung Galaxy S4
●
Samsung Galaxy Note 3
Toxic Comment Classification
●
toxic
●
severe_toxic
●
obscene
●
threat
●
insult
●
identity_hate
AdTracking Fraud Detection
●
ip
●
app
●
device
●
os
●
channel
●
click_time
●
attributed_time
●
is_attributed
Why participate in competitions: pros
●
Rapid knowledge and experience growth - 10x
faster than Coursera
●
Practical business tasks
●
Portfolio improvement
●
Money
Why participate in competitions: cons
●
Hard to sell achievements
●
Far from real work
●
Takes a lot of time
●
Harsh competition
●
It’s addictive
●
Cheating
How to choose a platform
How to choose a contest
●
Check the rules for eligibility
(Crimea, Cuba, Iran, Syria, North
Korea, Sudan)
●
The harder to enter – the better
(large data sets, complex
registration, etc)
●
Fewer participants – less
competition (aim for fewer than
500 people)
●
Look at the prize sizes
How not to choose a contest
Red flags:
●
Easy registration
●
Anonymized features
●
Small/no private data set
●
Non-automated/too many submissions
●
Binary classification with small AUC
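For readers unfamiliar with the metric in the last red flag, here is a minimal sketch of how ROC AUC can be computed by hand. The function name `roc_auc` and the toy data are illustrative assumptions, not part of the talk. Reading "small AUC" as "scores close to 0.5, where a model barely beats random guessing" is also an interpretation rather than something the slide states.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive example
    receives a higher score than a random negative one (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical data): three of the four
# positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic and only meant to show the definition; a rank-based implementation would be the usual choice on real competition data.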
Winning tricks and hacks: ods.ai
●
Slack channel
with 20K+
Russian
speaking data
scientists
Winning tricks and hacks: fast.ai
Jeremy Howard
●
Born in London, moved to Melbourne
●
8 years in consulting, 3 successful startups
●
#1 at Kaggle in 2010-2011, President of
Kaggle until the end of 2013
●
Was earning $200,000 at McKinsey while still a
19-year-old student
●
The youngest Engagement Manager worldwide
at AT Kearney
●
Created a new global practice which is now
referred to as Big Data
●
Developed a new system for learning
Chinese, learned it in 1 year
Winning tricks and hacks: lectures
●
How to Win a Data Science Competition:
Learn from Top Kagglers
●
ML Yandex Training by
Stanislav Semenov
Winning tricks and hacks: software
Programming language OS Deep learning framework
Winning tricks and hacks: hardware
●
4x Nvidia GTX 1080 Ti
GPUs
●
32-thread AMD
Ryzen Threadripper
1950X CPU
●
64 PCI-E lanes,
X399 motherboard
●
128 GB RAM
●
3 TB M.2 SSD
●
Full-sized tower
Winning tricks and hacks: teamwork
Winning tricks and hacks: leakages
Should not be confused with
a data breach
Data leaks are not useful in
production
Examples:
●
Meta info for images (size, date
of creation, name)
●
Looking into the future for time
series
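The "looking into the future" example can be made concrete with a validation split. The sketch below assumes rows are dictionaries with a timestamp field (the name `click_time` is borrowed from the AdTracking slide; the helper `time_ordered_split` is hypothetical). It keeps every validation row strictly later in the ordering than the training rows, so the model is never evaluated on data that precedes what it was trained on.

```python
def time_ordered_split(rows, time_key, valid_fraction=0.2):
    """Split rows into (train, valid) so that validation rows come
    after all training rows in time order. A random shuffle would
    instead mix future rows into training, which is a leak."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_valid = max(1, round(len(ordered) * valid_fraction))
    cut = len(ordered) - n_valid
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: integer timestamps in arbitrary order.
rows = [{"click_time": t} for t in [5, 1, 4, 2, 3]]
train, valid = time_ordered_split(rows, "click_time", valid_fraction=0.2)
```

The same idea applies to feature engineering: any aggregate used for a row should be computed only from rows with an earlier timestamp.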
Winning tricks and hacks: stacking
Examples:
●
kaz-Anova StackNet
(Marios Michailidis)
●
Giba presentation
(Gilberto Titericz)
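As a rough illustration of the idea behind stacking (not the StackNet API and not the method from either presentation), the sketch below builds out-of-fold predictions from two toy base models and fits a one-parameter blend on top of them. All function names and the toy data are assumptions made for this example.

```python
def kfold_indices(n, k):
    """Assign index i to fold i % k."""
    return [list(range(start, n, k)) for start in range(k)]


def fit_mean(xs, ys):
    """Base model 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base model 2: one-dimensional least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def out_of_fold_predictions(fit, xs, ys, k=3):
    """Predict each point with a model that never saw it in training."""
    oof = [None] * len(xs)
    for fold in kfold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        for i in fold:
            oof[i] = model(xs[i])
    return oof


def fit_blend_weight(oof_a, oof_b, ys):
    """Meta-model: weight w on model A in w*a + (1-w)*b,
    chosen by least squares on the out-of-fold predictions."""
    num = sum((y - b) * (a - b) for a, b, y in zip(oof_a, oof_b, ys))
    den = sum((a - b) ** 2 for a, b in zip(oof_a, oof_b))
    return num / den if den else 0.5


# Hypothetical toy data that is exactly linear.
xs = list(range(9))
ys = [2 * x + 1 for x in xs]
oof_mean = out_of_fold_predictions(fit_mean, xs, ys)
oof_line = out_of_fold_predictions(fit_line, xs, ys)
weight_on_mean = fit_blend_weight(oof_mean, oof_line, ys)
```

Training the meta-model on out-of-fold predictions, rather than on predictions for rows the base models were fitted on, is what keeps the second level from simply rewarding the most overfitted base model.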
Winning tricks and hacks: DL/ML
●
TTA (test time augmentation)
●
Cross-validation (folds)
●
Data cleaning (outliers)
●
Looking at the data/errors
●
Parameter tuning (hyperopt)
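Test time augmentation from the list above can be sketched in a few lines: the same input is transformed several ways, the model scores each version, and the scores are averaged. The toy "model" `left_heavy_score` and the helper names below are hypothetical stand-ins. A real pipeline would call a trained network, with augmentations chosen to match those used in training.

```python
def identity(image):
    return image


def hflip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def left_heavy_score(image):
    """Toy stand-in for a trained model: the fraction of total
    pixel mass that lies in the left half of the image."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    total = sum(sum(row) for row in image)
    return left / total if total else 0.0


def predict_with_tta(model, image, augmentations):
    """Average the model's predictions over augmented copies."""
    predictions = [model(augment(image)) for augment in augmentations]
    return sum(predictions) / len(predictions)


# Hypothetical 2x2 image with all mass in the left column.
image = [[1, 0], [1, 0]]
single = left_heavy_score(image)
averaged = predict_with_tta(left_heavy_score, image, [identity, hflip])
```

The toy model is deliberately sensitive to flipping, so the example shows how averaging smooths out a prediction that depends on orientation.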
What’s next
•
Write a blog post (Facebook, Twitter,
Medium, LinkedIn)
•
Make a video
•
Share your code on GitHub
•
Upgrade your computer using prize
money
•
Choose the next competition
(yes, it’s addictive!)
Q&A
THANKS!
