Kaggle competitions, new friends, new skills and new opportunities
1. Kaggle Competitions, New Friends, New
Skills and New Opportunities
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulus
2. About Me
• 2005 - 2015
• Water Engineer
o Consultant for Utilities
o EngD Research
• 2015 - Present
• Data Scientist
o Virgin Media
o Domino Data Lab
o H2O.ai
2
3. About This Talk
• What happened
o Things I did since I
started participating in
Kaggle competitions.
o New opportunities –
results of new skills and
friends.
3
4. First MOOC Experience
• One of the first Massive
Open Online Courses.
o Met some new friends.
o Decided to collaborate
for fun.
o “How about Kaggle?”
o “What is Kaggle?”
4
5. First Kaggle Experience
• First time in my life
o Supervised learning
• Random Forest
• Support Vector Machine
• Neural Networks
o Train, Validate & Predict.
o “Is it black magic?”
5
6. First Kaggle Experience
• Problems
o “Hey Joe, you are a nice
guy but we can’t work
together.”
o “You love MATLAB so
much. You even call
yourself @matlabulous
on twitter!”
o “We prefer R/Python.”
• Results
o I kept using MATLAB
o Lone wolf
o No collaboration
6
7. Identifying Skills Gap
• Obvious skills gap:
o Open-source
Programming langauges
o Machine learning
techniques
o Collaboration
• Kind of related
o Data visualisation
o Handling large datasets
o Explaining results
• That competition was a good wake up call.
7
8. From MATLAB to R/Python
MATLAB Python R
Neural Networks ✔️ ✔️ ✔️
Random Forest ✔️ ✔️ ✔️
SVM ✔️ ✔️ ✔️
Other Machine
Learning Libraries
Toolboxes
(commercial + open
source)
Scikit-learn and
many more
CRAN, GitHub
(A LOT!)
Data Visualisation I wasn’t good at it
anyway …
Matplotlib
(plus a lot more
since then)
ggplot2 (WOW!)
(plus a lot more
since then)
8
9. What can people do with R?
9
James Cheshire, UCL
Link
Paul Butler, Facebook
Link
10. Filling the Skills Gap
• More MOOC
o Machine Learning
• Andrew Ng (Coursera)
o Data Analysis
• Jeff Leek (Coursera)
• R
o Intro to Programming
• Dave Evans (Udacity)
• Python
• Things I also picked up:
o Linux (Ubuntu)
o Git
o Cloud computing
o HTML / CSS
10
11. Learning from other Kagglers
• Continuous learning
o Kaggle’s forums and blogs.
o New tools and tricks.
o Many things you cannot
learn from school.
o I am standing on the
shoulders of many
Kagglers.
11
12. Side Project 1 – Crime Data Viz
shiny::runGitHub("rApps", "woobe",
subdir = "crimemap")
12
13. Other Side Projects
• Colour Palette
o github.com/
woobe/rPlotter
13
• World Cup 2014 Correct
Score Prediction
o ML vs. my friends
o 10 out of 64 (15.6%)
o Friends’ avg. = 4 (6.3%)
o github.com/
woobe/wc2014
14. Open Up Myself
• Before Kaggle/MOOC
o I was drawing a circle
around myself.
o Fear of change.
o Domain-specific problem
solving.
• After Kaggle/MOOC
o Data-driven approach.
o Not a subject matter
expert? No worries
o Free to try new tools, to
learn and to create.
14
15. New Opportunities
• LondonR
o First presentation
outside water industry /
academia.
o Very positive feedback.
o Led to other projects.
o bit.ly
/londonr_crimemap
15
16. New Opportunities
• useR! 2014 (UCLA)
o Presented a poster.
o Met new friends.
o Life-changing event.
o github.com/
woobe/useR_2014
16
19. More Opportunities
• First blog post about
H2O
o Things to try after useR!
– Part1: Deep Learning
with H2O
19
20. More Opportunities
• Blog post about Domino
and H2O
o I did it for fun. I did not
have any expectation.
o It helped attract
customers to both
Domino and H2O.
20
22. London Kagglers Assemble
• London Kaggle Meetup
o Sep 2015
o I met my Kaggle buddy
Mickael Le Gal
o He is a product data
scientist at Tictrac
22
23. London Kagglers Assemble
• Rossmann Store Sales
o We got stuck at top 10% for a long
period.
o Mickael had a breakthrough in feature
engineering with 48 hours to go.
o I re-trained all models and completed
model stacking just a few hours before
the deadline (thanks to Domino Data
Lab).
o Top 2% finish (our best result so far).
23
25. Summary of Benefits
• Direct
o Identify data science
skills gap.
o Learn quickly from the
community.
o Expand your network.
o Prepare yourself for real-
life data challenges.
• Indirect
o You also learn non-ML
skills along the way.
o You learn to build small
data products (e.g.
graph, web app, REST
API) and help others gain
insight.
25
26. Strata Hadoop London Talks
• “Model stacking and
parameters tuning.”
o Wednesday 1 June
o 7pm
o LondonR at Strata
o Free event
• “Generalised Low Rank
Models.”
o Thursday 2 June
o 2pm
o Strata Hadoop main
conference event
26
27. Big Thank You
• London Kaggle Meetup
Organisers
• Mickael
• Marios
• You (Kagglers)
• Prof. Dragan Savic
(supervisor at uni. of
Exeter)
• Mango Solutions
• Virgin Media
• Domino Data Lab
• H2O.ai
27
28. Any Questions?
• Contact
o joe@h2o.ai
o @matlabulous
o github.com/woobe
• Links (All Slides)
o github.com/h2oai/h2o-
meetups
• H2O in London
o Coming soon!
• Meetups
• Office
o We’re hiring!
o www.h2o.ai/careers
28