An Intro to Kaggle
By Lex Toumbourou
Senior Consultant at Thoughtworks
Part 1: Kaggle Overview
What is Kaggle?
● Founded in 2010 in Australia
● Acquired by Google in 2017
● Host of data science
competitions
● Largest data science
community at 536,000
registered users
Why Kaggle?
● Good resource for turning
theoretical skills into practical
skills
● Learn from other data scientists
● Gain reputation
Getting started with competitions
● What problems are you interested
in solving?
● What computational budget do you
have?
● Is the competition a good match for
your level?
Competition evaluation and rules
● What is the goal of the
competition?
● How is it evaluated?
○ Accuracy
○ Log Loss
○ Root mean squared error
○ Area under the ROC curve
○ F1 score
○ (many more)
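A few of the metrics listed above can be sketched directly in NumPy to show what each one measures. This is a minimal illustration, not Kaggle's scoring code: the function names are made up for this example, and the log loss and F1 versions assume binary labels encoded as 0 and 1.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Log loss for binary labels; probabilities are clipped to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def f1(y_true, y_pred):
    """F1 score for binary labels: harmonic mean of precision and recall."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y == 1) & (p == 1))  # true positives
    fp = np.sum((y == 0) & (p == 1))  # false positives
    fn = np.sum((y == 1) & (p == 0))  # false negatives
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(binary_log_loss([1, 0], [0.9, 0.2]))
print(rmse([100.0, 200.0], [110.0, 190.0]))
print(f1([1, 0, 1, 1], [1, 0, 0, 1]))
```

Accuracy and F1 score hard labels, log loss scores predicted probabilities, and RMSE scores continuous predictions, so the metric determines what the submission file should contain.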
Datasets
● 3 main files (typical names):
○ train.csv
○ test.csv
○ sample_submission.csv
● Important to read data documentation
● Kaggle CLI useful for downloading datasets
on headless computers:
kaggle competitions download -c house-prices-advanced-regression-techniques
Loading dataset (useful Pandas one-liner)
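The code shown on this slide is not preserved in the transcript. A plausible sketch of such a one-liner uses `pandas.read_csv`. The column names (`Id`, `SaleDate`, `LotArea`, `SalePrice`) and the in-memory CSV are assumptions made so the example runs anywhere; in practice you would pass the path to the downloaded training file.

```python
import io

import pandas as pd

# Stand-in for the training file so the example is self-contained.
# The columns here are hypothetical, not taken from a real competition.
csv_text = io.StringIO(
    "Id,SaleDate,LotArea,SalePrice\n"
    "1,2010-02-01,8450,208500\n"
    "2,2010-05-01,9600,181500\n"
)

# The one-liner: use the id column as the index and parse dates on load.
train = pd.read_csv(csv_text, index_col="Id", parse_dates=["SaleDate"])

print(train.dtypes)
```

Parsing dates at load time saves a separate conversion step before date features are extracted.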
Leaderboard
● Split into public and private
leaderboard.
● Be careful not to overfit to the
public leaderboard (the visible
part of the test set).
● Equal scores = earliest
submission wins.
Submissions
● Predictions provided as a CSV with row id
and prediction value(s)
● Some predictions are scored for the public
leaderboard, the rest for the private one.
● Usually limited to 5 submissions per day.
● At competition conclusion, pick 2
submissions to use on private
leaderboard.
Generating submission one-liner
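The slide's code is not preserved in the transcript. A minimal sketch of writing a submission with pandas follows. The id values, the `SalePrice` column name, and the predictions are placeholders; the real column names should be copied from the competition's sample submission file.

```python
import io

import numpy as np
import pandas as pd

# Placeholder row ids and model predictions (hypothetical values).
test_ids = [1461, 1462, 1463]
preds = np.array([120000.0, 155000.5, 183250.0])

# One row per test example: the row id plus the predicted value.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": preds})

# Writing to an in-memory buffer keeps the example self-contained;
# in practice, pass a file name such as "submission.csv" instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_out = buffer.getvalue()

print(csv_out)
```

Passing `index=False` matters: without it pandas writes an extra unnamed index column, which will not match the expected submission format.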
Kernels
● Kaggle-provided computers, including
GPUs
● Allows for sharing results with others.
● Scripts allow you to submit predictions
directly after running code.
Discussion forums
● Lots of useful insights.
● Competition winners will usually
have read the forums in full.
Part 2: Getting Started
Tools
● Usually Python or R
● Jupyter Notebooks (interactive
development)
● Numpy (linear algebra)
● Pandas (structured data)
● Matplotlib
● Scikit-learn (models and ML tools)
● PyTorch or Tensorflow/Keras (neural
networks)
Model selection
● Dependent on problem
● Tree-based (RandomForests, XGBoost,
LightGBM) - good starting point for
structured data
● Linear Models (SVM, Logistic Reg) - still
useful for certain problems.
● Neural Networks (CNN, RNN) - image,
text and speech data, sometimes
structured
Choosing a validation method
● Train / val split
● Cross-validation
● Out-of-bag error
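The three validation methods above can be compared side by side with scikit-learn. This is a sketch on synthetic data from `make_classification`; the model, split size, and fold count are arbitrary choices for illustration, not recommendations from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for a competition training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Train / validation split: hold out 20% of rows for scoring.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
model.fit(X_train, y_train)
holdout_acc = model.score(X_val, y_val)

# 2. Cross-validation: 5 folds, each used once as the validation set.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5
)

# 3. Out-of-bag error: each tree is scored on the rows left out of its
#    bootstrap sample, so no separate validation set is needed.
oob_acc = model.oob_score_

print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
print(f"out-of-bag accuracy: {oob_acc:.3f}")
```

Out-of-bag scoring only applies to bagged models such as random forests, while the other two methods work with any estimator.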
Fast iteration
● Run experiments on a subset of
your data.
● Good validation strategy.
● Save complex model stacking
and ensembling until after you’ve
maximized feature engineering.
Preparing data
● Model dependent
● Careful feature preparation and
engineering usually quite
important.
● 4 main column types:
continuous, ordinal, categorical
and date
Image by Tobias Fischer
Continuous (aka numeric) features
● Scaling recommended (non-tree models)
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
● Outlier cleaning (non-tree models)
Winsorization: clip values beyond the 1st and 99th percentiles
log(x)
● Data imputation (fill in missing values)
df['SomeValue_isna'] = df.SomeValue.isna()
df['SomeValue'] = df.SomeValue.fillna(df.SomeValue.median())
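The steps on this slide can be chained on a single column. The sketch below uses plain pandas and NumPy on a made-up `SomeValue` column; the min-max formula at the end is the same rescaling that `MinMaxScaler` applies with its default 0 to 1 range. The helper column names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one extreme outlier.
df = pd.DataFrame({"SomeValue": [1.0, 2.0, np.nan, 4.0, 1000.0]})

# Imputation: record which rows were missing BEFORE filling them,
# otherwise the indicator column would be all False.
df["SomeValue_isna"] = df["SomeValue"].isna()
df["SomeValue"] = df["SomeValue"].fillna(df["SomeValue"].median())

# Winsorization: clip values outside the 1st and 99th percentiles.
low, high = df["SomeValue"].quantile([0.01, 0.99])
df["SomeValue_clipped"] = df["SomeValue"].clip(lower=low, upper=high)

# Log transform: log1p computes log(1 + x), which also handles zeros.
df["SomeValue_log"] = np.log1p(df["SomeValue"])

# Min-max scaling to the range [0, 1].
col = df["SomeValue_log"]
df["SomeValue_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)
```

In a real competition the median, percentiles, and scaling bounds should be computed on the training data and then reused on the test data.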
Categorical features
● Ensure order of ordinal columns
df['Rating'] = df.Rating.cat.set_categories([1, 2, 3], ordered=True)
● One-hot encode non-ordinal columns
dummies = pd.get_dummies(df[cat_columns], dummy_na=True)
df = pd.concat([df, dummies], axis=1)
https://datascience.stackexchange.com/questions/30215/what-is-one-hot-encoding-in-tensorflow
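Both categorical treatments can be tried on a small frame. This is a sketch with made-up `Rating` and `Colour` columns; it builds the ordered categorical with `pd.Categorical`, which does not require the column to be categorical already.

```python
import pandas as pd

df = pd.DataFrame(
    {"Rating": [3, 1, 2, 1], "Colour": ["red", None, "blue", "red"]}
)

# Ordinal column: declare the category order explicitly so that
# comparisons and integer codes follow 1 < 2 < 3.
df["Rating"] = pd.Categorical(df["Rating"], categories=[1, 2, 3], ordered=True)
rating_codes = df["Rating"].cat.codes  # 0, 1, 2 in category order

# Non-ordinal column: one-hot encode, with an extra column for missing values.
cat_columns = ["Colour"]
dummies = pd.get_dummies(df[cat_columns], dummy_na=True)

# Replace the raw column with its dummy columns.
df = pd.concat([df.drop(columns=cat_columns), dummies], axis=1)

print(df)
```

Dropping the original column after encoding avoids handing the model both the raw strings and their dummy columns.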
Date time features
● Lots of information in a single date:
○ Day of week
○ Day of month
○ Is it a weekend?
○ Is it a public holiday?
● Lots of handy methods in the dt attribute of a Pandas column, which can be added as new columns
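The date features listed above map onto the pandas `dt` accessor roughly as follows. The `SaleDate` column and the one-entry holiday list are hypothetical; a real holiday calendar depends on the country and has to come from an external source.

```python
import pandas as pd

df = pd.DataFrame(
    {"SaleDate": pd.to_datetime(["2018-06-01", "2018-06-02", "2018-12-25"])}
)

# Day of week as an integer, where Monday is 0 and Sunday is 6.
df["day_of_week"] = df["SaleDate"].dt.dayofweek
# Day of month, 1 to 31.
df["day_of_month"] = df["SaleDate"].dt.day
# Saturday (5) and Sunday (6) count as the weekend.
df["is_weekend"] = df["SaleDate"].dt.dayofweek >= 5

# Hypothetical holiday list for illustration only.
holidays = pd.to_datetime(["2018-12-25"])
df["is_holiday"] = df["SaleDate"].isin(holidays)

print(df)
```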
Image by Charisse Kenion
Feature engineering
● Combining columns (adding
values together, multiplying,
dividing, etc.)
● Adding additional data sources*
○ Things nearby to house
○ Weather on the day
○ Etc etc
* Ensure competition allows it
● Discover Feature Engineering -
great article
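Combining columns and joining an extra data source might look like the sketch below. Every column name and value here is invented for illustration, and the external table stands in for whatever additional data the competition rules permit.

```python
import pandas as pd

houses = pd.DataFrame(
    {
        "Id": [1, 2],
        "FirstFloorSF": [800, 1000],
        "SecondFloorSF": [400, 0],
        "LotArea": [6000, 8000],
        "Suburb": ["A", "B"],
    }
)

# Combining columns: a sum and a ratio of existing features.
houses["TotalSF"] = houses["FirstFloorSF"] + houses["SecondFloorSF"]
houses["BuiltRatio"] = houses["TotalSF"] / houses["LotArea"]

# Additional data source (hypothetical): amenities near each suburb.
nearby = pd.DataFrame({"Suburb": ["A", "B"], "SchoolsNearby": [3, 1]})

# A left join keeps every house even if its suburb has no external data.
houses = houses.merge(nearby, on="Suburb", how="left")

print(houses)
```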
Image by Chester Alvarez
Hyperparameter (aka settings) tuning
● Hyperparam = parameter that is set
before training, not learned by the model.
● Manually (try some values
and see what happens)
● Automated
○ RandomizedSearchCV
(sklearn)
○ GridSearchCV (sklearn)
○ Hyperopt
○ Spearmint
○ Lots more...
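Of the automated options above, `RandomizedSearchCV` is a convenient one to try first because it is part of scikit-learn. The sketch below runs it on synthetic data; the model, the search space, and the number of iterations are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for a competition training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for each hyperparameter (illustrative only).
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 5],
}

# Sample 5 random combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

`GridSearchCV` takes a `param_grid` instead and tries every combination, which is exhaustive but grows quickly with the number of hyperparameters.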
Stacking / ensembling (aka combining models)
● Most winning solutions are a
combination of models.
● Averaging predictions of multiple
models
● “Meta models”: a model trained on
predictions of multiple models.
http://www.chioka.in/stacking-blending-and-stacked-generalization/
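Both approaches, averaging and a meta model, can be sketched with two base models in scikit-learn. The data is synthetic and the model choices are arbitrary. The meta model is trained on out-of-fold predictions so that it never sees predictions made on rows a base model was fitted on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold predictions on the training rows: one column per base model.
oof = np.column_stack(
    [
        cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
        for m in base_models
    ]
)

# Refit each base model on all training rows and predict the validation rows.
val_preds = np.column_stack(
    [m.fit(X_train, y_train).predict_proba(X_val)[:, 1] for m in base_models]
)

# Option 1: average the predicted probabilities of the base models.
averaged = val_preds.mean(axis=1)

# Option 2: a meta model trained on the base models' out-of-fold predictions.
meta_model = LogisticRegression().fit(oof, y_train)
stacked = meta_model.predict_proba(val_preds)[:, 1]

print("averaged accuracy:", np.mean((averaged > 0.5) == y_val))
print("stacked accuracy: ", np.mean((stacked > 0.5) == y_val))
```

scikit-learn also provides `StackingClassifier`, which wraps the same out-of-fold procedure in a single estimator.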
Fin.
