kaggle_meet_up

Journey to #1
It's not the destination… it's the journey!
What is Kaggle?
• The world's biggest predictive modelling competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from an image.
– Predict store/product/area sales.
– Predict marketing response.
Inspired by horse races…!
• At the University of Southampton, an entrepreneur talked to us about how he was able to predict horse races with regression!
Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• Became more passionate!
• Picked up programming skills
Built KazAnova
• Generated a couple of algorithms and data techniques and decided to make them public so that others could benefit from them.
• Released it at http://www.kazanovaforanalytics.com/
• Named it after ANOVA (statistics) and KAZANI, my mom's last name.
Joined dunnhumby… and Kaggle!
• Joined dunnhumby's science team
• They had already hosted 2 Kaggle contests!
• Was curious about Kaggle.
• Joined a few contests and learned a lot.
• The community was very open to sharing and collaboration.
Learned to do image classification!
Cat or dog?
…and sound classification!
…and text classification and sentiment!
Identify the writer: who wrote 'To be, or not to be'? Shakespeare or Molière?
Detect sentiment: 'The burger is not bad'. A negative bigram, but a positive comment.
3 Years of modelling competitions
• Over 75 competitions
• Participated with 35 different teams
• 21 top 10 finishes
• 8 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
What's next
• Data science within dunnhumby
• PhD (UCL) on recommender systems
• Kaggling for fun!
Amazon.com - Employee Access Challenge
• Link: https://www.kaggle.com/c/amazon-employee-access
• Objective: Predict whether an employee will require special access (like manual access transactions).
• Lessons learned:
1. Logistic regression can be great when combined with regularization to deal with high dimensionality (e.g. many variables/features); see the sketch below.
2. Keeping the data in sparse format speeds things up a lot.
3. Sharing is caring! Great participation and a positive attitude towards helping others; lots of help from the forum. Kaggle is the way to learn and improve!
4. Scikit-learn + Python is great!
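A minimal sketch of lessons 1 and 2, assuming scikit-learn: one-hot encode the categorical IDs into a sparse matrix and fit a regularized logistic regression on it. The toy data below is illustrative, not the competition dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the competition data: each column is a categorical ID.
X_raw = np.array([[101, 7], [102, 7], [101, 9], [103, 9]])
y = np.array([1, 0, 1, 0])

encoder = OneHotEncoder(handle_unknown="ignore")   # produces a sparse matrix
X = encoder.fit_transform(X_raw)                   # data stays in sparse format

# L2 regularization keeps the many one-hot columns under control.
model = LogisticRegression(C=1.0, penalty="l2", solver="liblinear")
print(cross_val_score(model, X, y, cv=2, scoring="roc_auc"))
```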
RecSys Challenge 2013: Yelp business rating prediction
• Link: https://www.kaggle.com/c/yelp-recsys-2013
• Objective: Predict what rating a customer will give to a business
• Lessons learned:
1. Factorization machines, and specifically libFM (http://www.libfm.org/), are great for summarizing the relationship between a customer and a business, as well as for combining many other factors.
2. Basic data manipulation (like joins, merges, aggregations) as well as feature engineering is important; see the sketch below.
3. Simpler/linear models did well for this task.
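A minimal sketch of lesson 2 with pandas: join the user and business tables onto the review table and build a simple aggregate feature. The column names are illustrative, not the actual Yelp schema.

```python
import pandas as pd

reviews = pd.DataFrame({"user_id": [1, 1, 2], "business_id": [10, 11, 10], "stars": [5, 3, 4]})
users = pd.DataFrame({"user_id": [1, 2], "review_count": [25, 3]})
businesses = pd.DataFrame({"business_id": [10, 11], "avg_stars": [4.2, 3.1]})

# Joins/merges bring user and business attributes onto each review row.
data = reviews.merge(users, on="user_id").merge(businesses, on="business_id")

# A basic aggregate feature: the user's mean rating across their reviews.
data["user_mean_stars"] = data.groupby("user_id")["stars"].transform("mean")
print(data)
```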
Cause-effect pairs
• Link: https://www.kaggle.com/c/cause-effect-pairs
• Objective: "Correlation does not mean causation." Given 2 series of numbers, find which one is causing the other!
• Lessons learned:
1. In general, the series causing the other has a higher chance of predicting it well with a nonlinear model, given some noise; see the sketch below.
2. Gradient boosting machines can be great for this task.
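A minimal sketch of lesson 1, assuming scikit-learn: fit a nonlinear model in both directions (A predicting B and B predicting A) and compare cross-validated errors; the direction that predicts better is a weak hint about the causal direction. The synthetic data is illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
a = rng.uniform(-2, 2, 500)
b = np.sin(3 * a) + 0.1 * rng.normal(size=500)   # here A causes B

def cv_error(x, y):
    # Mean squared error of a gradient boosting regressor under 5-fold CV.
    model = GradientBoostingRegressor(random_state=0)
    scores = cross_val_score(model, x.reshape(-1, 1), y,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

print("error A -> B:", cv_error(a, b))   # expected to be the lower one
print("error B -> A:", cv_error(b, a))
```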
StumbleUpon Evergreen Classification Challenge
• Link: https://www.kaggle.com/c/stumbleupon
• Objective: Build a classifier to categorize webpages as evergreen (of timeless quality) or non-evergreen
• Lessons learned:
1. Some overfitting again (the CV process was not right yet). Better safe than sorry from now on!
2. Impressive how tf-idf gives such a good classification from the contents of the webpage as text.
3. Dimensionality reduction with singular value decomposition on sparse data (in a way that creates 'topics') is very powerful too; see the sketch below.
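A minimal sketch of lessons 2 and 3, assuming scikit-learn: tf-idf on the page text, truncated SVD to form a few 'topic' components, and a linear classifier on top. The documents and labels are toy examples.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

docs = ["classic chocolate cake recipe",
        "today's football scores and results",
        "how to fold a paper crane",
        "breaking election news tonight"]
labels = [1, 0, 1, 0]   # 1 = evergreen, 0 = non-evergreen

pipeline = make_pipeline(
    TfidfVectorizer(),                              # sparse tf-idf weighted bag of words
    TruncatedSVD(n_components=2, random_state=0),   # SVD on sparse data -> 'topics'
    LogisticRegression(),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["easy bread recipe for beginners"]))
```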
Multi-label Bird Species Classification - NIPS 2013
• Link: https://www.kaggle.com/c/multilabel-bird-species-classification-nips2013
• Objective: Identify which of 87 classes of birds and amphibians are present in 1,000 continuous wild sound recordings
• Lessons learned:
1. Converting the sound clips to numbers using Mel Frequency Cepstral Coefficients (MFCC) and then creating some basic aggregate features from them was more than enough to get a good score; see the sketch below.
2. This was a good tutorial: http://practicalcryptography.com/miscellaneous/machine-l
3. Meta-modelling gave a good boost, i.e. using some models' predictions as features for new models.
4. I can make predictions in a field I know literally nothing about!
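A minimal sketch of lesson 1, assuming the librosa package (not necessarily what was used in the competition): compute MFCCs for a clip and summarize each coefficient with simple aggregates, giving one fixed-length feature vector per recording.

```python
import numpy as np
import librosa

def clip_features(path):
    signal, sr = librosa.load(path, sr=None)                  # read the audio clip
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    # Aggregate over time: one mean and one std per coefficient -> 26 features.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# wav_paths is a hypothetical list of recording file paths:
# features = np.vstack([clip_features(p) for p in wav_paths])
```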
March Machine Learning Mania
• Link: https://www.kaggle.com/c/march-machine-learning-mania
• Objective: Predict the 2014 NCAA Tournament
• Lessons learned:
1. Combining pleasure with data = double pleasure! (I am a huge NBA fan!) It was also my first top-10 finish!
2. Trust the rating agencies: they do a great job and they have more data than you!
3. Simple models worked well.
The Allen AI Science Challenge
• Link: https://www.kaggle.com/c/the-allen-ai-science-challenge
• Objective: Build a model that predicts the right answer in an 8th-grade science examination test
• Lessons learned:
1. Lucene (http://www.docjar.com/html/api/org/apache/lucene/benchmark/utils/ExtractWikip) was very efficient at indexing Wikipedia and calculating distances between questions and answers.
2. Gensim word2vec (https://radimrehurek.com/gensim/models/word2vec.html) helped by representing each word with a sequence of numbers; see the sketch below.
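A minimal sketch of lesson 2, assuming a recent gensim: train word2vec on a toy corpus, average the word vectors of a question and of a candidate answer, and score the pair by cosine similarity. The corpus and the scoring scheme are illustrative, not the competition pipeline.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["plants", "use", "sunlight", "for", "photosynthesis"],
             ["water", "boils", "at", "one", "hundred", "degrees"],
             ["the", "earth", "orbits", "the", "sun"]]
model = Word2Vec(sentences, vector_size=20, min_count=1, seed=0)

def avg_vector(words):
    # Represent a piece of text as the average of its known word vectors.
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0)

q = avg_vector(["plants", "photosynthesis"])
a = avg_vector(["sunlight", "water"])
print(np.dot(q, a) / (np.linalg.norm(q) * np.linalg.norm(a)))   # cosine similarity
```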
Higgs Boson Machine Learning Challenge
• Link: https://www.kaggle.com/c/higgs-boson
• Objective: Use data from the ATLAS experiment (collected by the Large Hadron Collider) to identify the Higgs boson
• Lessons learned:
1. XGBoost! Wow! (https://github.com/dmlc/xgboost) Extreme gradient boosting. I knew this tool was going to make a huge impact in the future: multithreaded, sparse data, great accuracy, many objective functions. See the sketch below.
2. Deep learning showing some teeth.
3. RGF (http://stat.rutgers.edu/home/tzhang/software/rgf/) was good.
4. Physics knowledge was probably useful.
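A minimal sketch of lesson 1 using XGBoost's scikit-learn wrapper on synthetic data; the parameters are illustrative, not the competition settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the ATLAS features.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05,
                      n_jobs=-1)   # multithreaded; also accepts sparse input
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```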
Driver Telematics Analysis
• Link: https://www.kaggle.com/c/axa-driver-telematics-analysis
• Objective: Use telematics data to identify a driver signature
• Lessons learned:
1. Geospatial stats were useful.
2. Extracting features like average speed or acceleration was critical; see the sketch below.
3. Treating this as a supervised problem seemed to help.
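A minimal sketch of lesson 2: given a trip as (x, y) positions sampled once per second, derive speed and acceleration series and summarize them into a handful of features. The toy trip is random; real trips came from the competition files.

```python
import numpy as np

rng = np.random.default_rng(0)
trip = np.cumsum(rng.random((100, 2)), axis=0)           # toy trip: 100 (x, y) points

speed = np.linalg.norm(np.diff(trip, axis=0), axis=1)    # distance covered per second
accel = np.diff(speed)                                   # change in speed per second

features = [speed.mean(), speed.max(), np.percentile(speed, 90),
            accel.mean(), accel.std()]
print(features)
```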
Microsoft Malware Classification Challenge (BIG 2015)
• Link: https://www.kaggle.com/c/malware-classification
• Objective: Classify viruses based on file contents
• Report: http://blog.kaggle.com/2015/05/11/microsoft-malware-winners-intervie/
• Lessons learned:
1. Treating this problem as NLP (with bytes being the words) worked really well, as certain sequences of bytes were much more indicative of certain viruses; see the sketch below.
2. Information from different compression techniques was also indicative of the virus.
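A minimal sketch of lesson 1, assuming scikit-learn: treat each file's hex dump as text and count byte n-grams, exactly as one would count word n-grams in NLP. The byte strings below are toy examples, not real malware dumps.

```python
from sklearn.feature_extraction.text import CountVectorizer

hex_dumps = ["4d 5a 90 00 03 00",     # one "document" per file, bytes as words
             "4d 5a 50 00 02 00",
             "7f 45 4c 46 01 01"]

vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2))
X = vectorizer.fit_transform(hex_dumps)   # sparse matrix of byte n-gram counts
print(vectorizer.get_feature_names_out()[:5])
```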
Otto Group Product Classification Challenge
• Link: https://www.kaggle.com/c/otto-group-product-classification-challenge
• Objective: Classify products into the correct category with anonymized features
• Lessons learned:
1. Deep learning was very good for this task; see the sketch below.
2. Lasagne (Theano-based: http://lasagne.readthedocs.org/en/latest/) was a very good tool.
3. Multi-level meta-modelling gave a boost.
4. Pretty much every common model family contributed!
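A minimal sketch of lesson 1 using Keras (the talk mentions Lasagne; this is just an equivalent small feed-forward net on random stand-in data, not the competition model). Otto had 93 anonymized features and 9 classes.

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 93).astype("float32")                     # 93 features, as in Otto
y = keras.utils.to_categorical(np.random.randint(0, 9, 1000), 9)   # 9 product classes

model = keras.Sequential([
    keras.layers.Input(shape=(93,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.5),                     # dropout against overfitting
    keras.layers.Dense(9, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, y, epochs=2, batch_size=128, verbose=0)
```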
Click-Through Rate Prediction
• Link: https://www.kaggle.com/c/avazu-ctr-prediction
• Objective: Predict whether a mobile ad will be clicked
• Lessons learned:
1. Follow The Regularized Leader (FTRL), which uses the hashing trick, was extremely efficient at making good predictions using less than 1 MB of RAM on 40+ million data rows with thousands of distinct categories; see the sketch below.
2. Same old tricks (WOE, algorithms on sparse data, meta stacking).
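A minimal sketch of lesson 1's ingredients, assuming a recent scikit-learn: the hashing trick keeps memory constant no matter how many distinct categories show up, and an online linear learner (SGD with log loss here, standing in for a true FTRL implementation) consumes the rows in small batches.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2 ** 20, input_type="string")   # fixed-size hashed space
model = SGDClassifier(loss="log_loss")                            # online logistic regression

batches = [
    (["site=abc", "device=ios", "hour=14"], 1),     # category=value tokens for one ad view
    (["site=xyz", "device=android", "hour=02"], 0),
]
for tokens, label in batches:
    X = hasher.transform([tokens])                  # sparse 1 x 2**20 row
    model.partial_fit(X, [label], classes=[0, 1])   # incremental update, low memory
```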
Truly Native?
• Link: https://www.kaggle.com/c/dato-native
• Objective: Predict which web pages served by StumbleUpon are sponsored
• Report: http://blog.kaggle.com/2015/12/03/dato-winners-interview-/
• Lessons learned:
1. Modelling with a trained corpus of over 4-grams was vital for winning.
2. Meta fully connected modelling, level 4 (StackNet).
3. Used many different input formats (zipped data or not).
4. Generating over 40 models allowed for greater generalization.
Homesite Quote Conversion
• Link: https://www.kaggle.com/c/homesite-quote-conversion
• Objective: Which customers will purchase a quoted insurance plan?
• Lessons learned:
1. Generating a large pool of (500) models was really useful in exploiting AUC to the maximum.
2. Feature engineering with XGBfi and with noise imputation.
3. Exploring about 4-way interactions.
4. Retraining already-trained models.
5. Dynamic collaboration is best!
So… what wins competitions?
In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Experience
• Ensembling
• Luck
Different models I have experimented with, vol. 1
● Logistic/linear/discriminant regression: Fast, scalable, comprehensible, solid under high dimensionality, can be memory-light. Best when relationships are linear or all features are categorical. Good for text classification too.
● Random forests: Probably the best one-off overall algorithm out there (in my experience). Fast, scalable, memory-medium. Best when all features are numeric/continuous and there are strong non-linear relationships. Does not cope well with high dimensionality.
● Gradient boosting (trees): Less memory-intense than forests (as the individual predictors tend to be weaker). Fast, semi-scalable, memory-medium. Good wherever forests are good.
● Neural nets (a.k.a. deep learning): Good for tasks humans are good at: image recognition, sound recognition. Good with categorical variables too (as they replicate on-and-off signals). Medium speed, scalable, memory-light. Generally good for linear and non-linear tasks. May take a long time to train depending on the structure. Many parameters to tune. Very prone to over- and under-fitting.
Different models I have experimented with, vol. 2
● Support Vector Machines (SVMs): Medium speed, not scalable, memory-intense. Still good at capturing linear and non-linear relationships. Holding the kernel matrix takes too much memory; not advisable for datasets bigger than about 20k rows.
● K Nearest Neighbours: Slow (depending on the size), not easily scalable, memory-heavy. Good when deciding whether a case is good or bad really comes down to how similar it looks to specific other individuals. Also good when there are many target variables, as the similarity measures remain the same across different observations. Good for text classification too.
● Naïve Bayes: Quick, scalable, memory-OK. Good for quick classifications on big datasets. Not particularly predictive.
● Factorization machines: A good gateway between linear and non-linear problems; they stand between regressions, KNNs and neural networks. Memory-medium, semi-scalable, medium speed. Good for predicting the rating a customer will assign to a product.
Tools vol 1
● Languages: Python, R, Java
● Liblinear for linear models: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
● LibSVM for support vector machines: www.csie.ntu.edu.tw/~cjlin/libsvm/
● Scikit-learn in Python for text classification, random forests and gradient boosting machines: scikit-learn.org/stable/
● XGBoost for fast, scalable gradient boosting: https://github.com/tqchen/xgboost
● Vowpal Wabbit (hunch.net/~vw/) for fast, memory-efficient linear models
● Encog for neural nets: http://www.heatonresearch.com/encog
● H2O in R for many models
Tools vol 2
● LibFM: www.libfm.org
● Weka in Java (has everything): http://www.cs.waikato.ac.nz/ml/weka/
● GraphChi for factorizations: https://github.com/GraphChi
● GraphLab for lots of stuff: https://dato.com/products/create/open_source.html
● Cxxnet: One of the best implementations of convolutional neural nets out there. Difficult to install and requires a GPU with an NVIDIA graphics card. https://github.com/antinucleon/cxxnet
● RankLib: The best Java library out there for ranking algorithms (e.g. ranking products for customers) that supports optimization functions like NDCG. people.cs.umass.edu/~vdang/ranklib.html
● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne) for nets.
Ensemble
● A key part (in winning competitions at least) is combining the various models made.
● Remember, even a crappy model can be useful to some small extent.
● Possible ways to ensemble (see the sketch below for the first two):
– Simple average: (model 1 prediction + model 2 prediction) / 2
– Average ranks for AUC (simple average after converting predictions to ranks)
– Manually tune weights with cross-validation
– Use a weighted geometric mean
– Use meta-modelling (also called stacked generalization, or stacking)
● Check GitHub for a complete example of these methods using the Amazon competition hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
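A minimal sketch of the simple average and the rank average (the latter is handy for AUC, which only depends on the ordering of predictions). The prediction vectors are toy values.

```python
import numpy as np
from scipy.stats import rankdata

pred1 = np.array([0.10, 0.40, 0.35, 0.90])   # model 1 probabilities
pred2 = np.array([0.20, 0.55, 0.30, 0.80])   # model 2 probabilities

simple_avg = (pred1 + pred2) / 2
rank_avg = (rankdata(pred1) + rankdata(pred2)) / 2   # AUC only cares about the ranking

print(simple_avg)
print(rank_avg)
```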
Where to go next
● Coursera: https://www.coursera.org/course/ml (Andrew Ng's class)
● Kaggle.com: many competitions for learning. For instance: http://www.kaggle.com/c/titanic-gettingStarted. Look for the "knowledge" flag.
● Very good slides from the University of Utah: www.cs.utah.edu/~piyush/teaching/cs5350.html
● clopinet.com/challenges/: many past predictive modelling competitions with tutorials.
● Wikipedia. Not to be underestimated. Still the best (collective) source of information out there.
Big Thanks to my Kaggle buddies for the #1
A Data Science Hero
• Me: "Don't get stressed."
• Lucas: "I want to. I want to win." (20/04/2016, about the Santander competition)
• He passed away 4 days later (24/04/2016), after battling cancer for 2.5 years.
• Find Lucas' winning solutions (and post-competition threads) and learn from the best!