Starting Data Science with Kaggle.com
Nathaniel Shimoni
25/6/2017
Talk outline
• What is Kaggle?
• Why is Kaggle so great? The everyone-wins approach
• Kaggle tiers & top Kagglers
• Frequently used terms and the main rules
• The benefits of starting with Kaggle
• A common Kaggle data science process
What is Kaggle?
• An online platform that runs data science competitions
• Declares itself to be the home of data science
• Has over 1M registered users & over 60k active users
• One of the most vibrant communities for data scientists
• A great place to meet other “data people”
• A great place to learn and test your data & modeling skills
Why is Kaggle so great? (the everyone-wins approach)
• Competitors – receive prizes, knowledge, exposure, and a portfolio showcase
• The community – rapid development & adoption of highly performing platforms
• Kaggle – receives money from competition sponsors, influence on the community, and knowledge of platform & algorithm trends
• Sponsors – have data & a business task but no data scientists; receive state-of-the-art models quickly and without hiring data scientists
My Kaggle profile
Kaggle tiers
• Novice – a new Kaggle user
• Contributor – participated in one or more competitions, ran a kernel, and is active in the forums
• Expert – 2 top-25% finishes
• Master – 2 top-10% finishes & 1 top-10 (places) finish
• Grandmaster – 5 top-10 finishes & 1 solo top-10 finish
Top Kagglers
Frequently used terms
• Leaderboard (public & private) – the competition data is split into training data and testing data. The training data becomes available once you approve the rules. The testing data is split between a public LB, used for ranking submissions throughout the competition, and a private LB, used for final scoring (the only score that truly matters). The public LB can serve as an additional validation frame, but it can also be a source of overfitting.
Frequently used terms
• Leakage – the introduction of information about the target that is not a legitimate predictor (usually by a mistake in the data preparation process)
• Team merger – 2 or more participants joining up to compete together
Frequently used terms
• LB shuffle – the re-ranking that occurs at the end of the competition (upon moving from the public to the private LB)
Main rules for Kaggle competitions
• One account per user
• No private sharing outside teams
(public sharing is usually allowed and endorsed)
• Limited number of entries per day & per competition
• Winning solutions must be written in open-source code
• Winners must hand in well-documented source code in order to be eligible for the prize
• Each team usually selects 2 submissions for final evaluation
Why start with Kaggle?
• Project-based learning – learn by doing
• Solve real-world challenges
• Great supporting community
• Benchmark solutions & shared code samples
• Clear business objective and modeling task
• Develop a work portfolio and rank yourself against other competitors (and get recognition)
• Compete against state-of-the-art solutions
• Learn (a lot!!!) when the competition ends
Why start with Kaggle?
• The ability to team up with others:
◦ learn from better Kagglers
◦ learn how to collaborate effectively
◦ merge different solutions to achieve a score boost
◦ meet new, exciting people
• Answer the questions of others – you only truly learn something when you teach it to someone else
• The ability to apply new ideas at work with little effort
• Varied areas of activity (verticals)
Why start with Kaggle?
• The ability to follow many experts, each of whom specializes in a particular area (a sample from my list):
◦ Ensemble learning – Mathias Müller
◦ Feature extraction – Darius Barušauskas
◦ Validation – Gert Jacobusse
◦ Super-fast draft modeling – ZFTurbo (identity unknown)
◦ Inspiration (no minimal age for data science) – Mikel Bober-Irizar
Common Kaggle Data Science process
[Process diagram: exploratory data analysis (EDA) → data cleaning & augmentation, including adding external data (not always allowed, yet good practice to consider when possible) → feature engineering → single models → diverse single models, with the correct validation method set → ensemble learning → final prediction. The diagram allocates roughly 20%, 40%, 30%, and 10% of total time to EDA, data cleaning & feature generation, modeling, and ensemble learning respectively.]
Data cleaning
• Impute missing values (mean, median, most common value, or a separate prediction task)
• Remove zero-variance features
• Remove duplicated features
• Outlier removal – use with caution, it can be harmful; at the cleaning stage we only remove clearly irrelevant values (e.g. a negative price)
• NA encoding / imputing
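For illustration, a minimal pandas sketch of these cleaning steps; the frame and column names below are made up, not part of the original slides:

```python
import pandas as pd

# Hypothetical training frame; column names are for illustration only.
train = pd.DataFrame({
    "price":     [12.0, None, 15.5, -3.0, 15.5],
    "rooms":     [3, 2, None, 4, 3],
    "const":     [1, 1, 1, 1, 1],        # zero-variance feature
    "rooms_dup": [3, 2, None, 4, 3],     # duplicated feature
})

# Cleaning stage: drop clearly irrelevant values (e.g. a negative price).
train = train[train["price"].isna() | (train["price"] >= 0)]

# Remove zero-variance and duplicated features.
train = train.loc[:, train.nunique(dropna=False) > 1]
train = train.loc[:, ~train.T.duplicated()]

# Impute remaining missing values (median here; mean or most frequent also work).
train = train.fillna(train.median())
```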
Data augmentation & external data
• External data sources:
◦ OpenStreetMap
◦ weather measurement data
◦ online calendars
• APIs
• Scraping (using Scrapy / Beautiful Soup)
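A minimal scraping sketch with requests and Beautiful Soup; the URL and table layout below are hypothetical, and a site's terms of use should always be checked before scraping:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL of an online calendar page; replace with a real, permitted source.
url = "https://example.com/holidays/2017"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")

# Collect (date, holiday name) pairs from a hypothetical table layout.
holidays = []
for row in soup.select("table.holidays tr"):
    cells = [c.get_text(strip=True) for c in row.find_all("td")]
    if len(cells) >= 2:
        holidays.append((cells[0], cells[1]))

print(holidays[:5])
```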
Feature engineering
• Rescaling / standardization of existing features
• Data transformations: TF-IDF, log1p, min-max scaling, binning of numeric features
• Turning categorical features into numeric ones (label encoding / one-hot encoding)
• Creating count features
• Parsing textual features to get more generalizable features
• The hashing trick
• Extracting date/time features, e.g. day of week, month, year, day of month, isHoliday, etc.
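A small pandas sketch of a few of these transforms (log1p, count features, one-hot encoding, and date parts); the frame and column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; column names are for illustration only.
df = pd.DataFrame({
    "price":   [120.0, 80.0, 230.0],
    "city":    ["TLV", "NYC", "TLV"],
    "sold_at": pd.to_datetime(["2017-06-25", "2017-01-02", "2017-03-15"]),
})

# Log transform of a skewed numeric feature.
df["price_log1p"] = np.log1p(df["price"])

# Count feature: how often each category appears.
df["city_count"] = df.groupby("city")["city"].transform("count")

# One-hot encoding of a categorical feature.
df = pd.get_dummies(df, columns=["city"])

# Date/time features.
df["day_of_week"] = df["sold_at"].dt.dayofweek
df["month"] = df["sold_at"].dt.month
df["year"] = df["sold_at"].dt.year
```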
Feature selection
• Remove near-zero-variance features
• Use feature importances and eliminate the least important features
• Recursive Feature Elimination (RFE)
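A minimal scikit-learn sketch combining a variance threshold with recursive feature elimination; the data is random and only illustrates the API:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold

# Random data just to illustrate the API (100 samples, 20 features).
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
y = rng.randint(0, 2, size=100)

# Drop near-zero-variance features first.
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)

# Recursive feature elimination driven by tree-based feature importances.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10)
X_selected = selector.fit_transform(X_reduced, y)
print(X_selected.shape)
```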
Hyper-parameter optimization
• Grid search CV (exhaustive, rarely better than the alternatives)
• Random search CV
• Hyperopt
• Bayesian optimization
* Hyper-parameter tuning will usually improve results, but not as much as the other activities
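A minimal random-search sketch with scikit-learn's RandomizedSearchCV; the data is random and the parameter ranges are illustrative, not tuned recommendations:

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = rng.randint(0, 2, size=200)

# Illustrative search space.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 6),
    "learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10, cv=5, scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```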
Validation
• Train/test split
• Shuffle split
• K-fold – the most commonly used
• Time-based separation
• Group K-fold
• Leave one group out
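A minimal sketch of two of these schemes (plain K-fold and group K-fold) with scikit-learn; the groups array is hypothetical, standing in for something like a customer id:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical group ids

# Plain K-fold: shuffled row-wise split.
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit on X[train_idx], evaluate on X[valid_idx]

# Group K-fold: rows from the same group never appear in both train and validation,
# which helps avoid leakage when rows within a group are related.
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    pass
```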
Ensemble learning
• Simple/weighted average of the previous best models
• Bagging of the same type of model (i.e. different random seeds, different hyper-parameters)
• Majority vote
• Using out-of-fold predictions as meta-features, a.k.a. stacking
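A minimal averaging/blending sketch; the prediction values, output file name, and weights are made up for illustration (weights are usually chosen from validation scores):

```python
import pandas as pd

# Two hypothetical sets of previous predictions for the same ids.
sub1 = pd.DataFrame({"id": [1, 2, 3], "prediction": [0.20, 0.70, 0.90]})  # e.g. a GBM submission
sub2 = pd.DataFrame({"id": [1, 2, 3], "prediction": [0.30, 0.60, 0.85]})  # e.g. a NN submission

blend = sub1[["id"]].copy()

# Simple average of the two submissions.
blend["prediction"] = (sub1["prediction"] + sub2["prediction"]) / 2

# Weighted average, favoring the stronger model.
blend["prediction"] = 0.7 * sub1["prediction"] + 0.3 * sub2["prediction"]

blend.to_csv("submission_blend.csv", index=False)
```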
Out-of-fold predictions – a.k.a. meta-features
[Diagram: the training data is divided into 4 folds (fold1–fold4). A model is trained on 3 folds and used to predict both the fourth fold – producing out-of-fold predictions oof 1–4 – and the testing data; the 4 sets of test predictions are then averaged.]
Out-of-fold predictions – a.k.a. meta-features
[Diagram: the same out-of-fold procedure repeated for several base models – e.g. Model 1: kNN, Model 2: a neural network, Model 3: GBM – each producing its own out-of-fold predictions (oof 1–4) and averaged test predictions, alongside the training labels.]
After training several models using this method (3 different models in this example), we can train a new model on the newly formed meta-features.
* Note that we can train the meta-model either on these new features alone or on the new features together with our original training data.
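A minimal sketch of building out-of-fold meta-features and a simple stacking meta-model with scikit-learn; the data is random and only two base models are used to keep it short:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(200, 10), rng.randint(0, 2, size=200)
X_test = rng.rand(50, 10)

base_models = [KNeighborsClassifier(), GradientBoostingClassifier(random_state=0)]
kf = KFold(n_splits=4, shuffle=True, random_state=0)

oof = np.zeros((len(X), len(base_models)))            # out-of-fold meta-features
test_meta = np.zeros((len(X_test), len(base_models)))  # averaged test predictions

for m, model in enumerate(base_models):
    test_fold_preds = []
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        # Predict the held-out fold: these rows were never seen during this fit.
        oof[valid_idx, m] = model.predict_proba(X[valid_idx])[:, 1]
        test_fold_preds.append(model.predict_proba(X_test)[:, 1])
    # Average the test predictions made by each fold's model.
    test_meta[:, m] = np.mean(test_fold_preds, axis=0)

# Train the meta-model on the out-of-fold predictions
# (the original features could be concatenated in as well).
meta_model = LogisticRegression().fit(oof, y)
final_prediction = meta_model.predict_proba(test_meta)[:, 1]
```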
Disadvantages of Kaggle
• A large focus on modeling relative to the rest of the steps in the process
• Little weight given to runtime and scalability
• Little reasoning behind the choice of a specific evaluation metric
• Competing for the last few percentage points isn’t always valuable
• The “click and submit” phenomenon
Additional reading resources
• MOOCs:
◦ Machine Learning – Stanford (Coursera)
◦ Data Science track – Johns Hopkins (Coursera)
◦ Udacity deep learning course
• Documentation:
◦ scikit-learn documentation
◦ Keras documentation
◦ R caret package documentation
Links to sources
This presentation draws heavily from the following sources:
• Mark Peng’s presentation “Tips for participating Kaggle challenges”
• Darius Barušauskas’s presentation “Tips and tricks to win Kaggle data science competitions”
• Kaggle discussion forums and blog
Questions?