Kaggle and data science

Akira Shibata, Chief Data Scientist
Kaggle and Data Science
Japan, 2018
© DataRobot, Inc. All rights reserved.
Sergey Yurgenson
Director, Advanced Data Science Services
Kaggle Grandmaster
Kaggle
● Kaggle is a platform for data science competitions
● It was created by Anthony Goldbloom in 2010 in Australia and later moved to San Francisco
● In March 2017 it was acquired by Google
● Many other start-ups are now trying to replicate the idea, but Kaggle is still the best-known name in the data science community
● As of now, Kaggle has hosted more than 280 competitions and has more than 1 million members from more than 190 countries
Kaggle competitions
● Most Kaggle competitions are predictive modeling competitions
● Participants are provided with training data to train their models and test data with unknown targets
● Participants compute predictions for the test data and submit them to the Kaggle platform (a sketch of this workflow follows below)
● The accuracy of the predictions is evaluated using a predefined objective metric, and the result is reported back to participants
● The performance of every participant is publicly visible, so participants can compare the quality of their models with those of others
● Many competitions have monetary prizes for top finishers
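As an illustration, here is a minimal sketch of that workflow for a typical tabular competition. The file names (train.csv, test.csv, submission.csv), the column names ("id", "target"), the AUC metric, and the model choice are placeholders; every competition defines its own data layout and evaluation metric.

```python
# Minimal sketch of a typical Kaggle submission workflow (placeholder file/column names).
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")   # labeled training data
test = pd.read_csv("test.csv")     # test data with unknown targets

# Assume purely numeric features for brevity.
features = [c for c in train.columns if c not in ("id", "target")]
X, y = train[features], train["target"]

model = HistGradientBoostingClassifier()

# Local cross-validation estimate of the (assumed) competition metric, here AUC.
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"local CV AUC: {cv_auc:.4f}")

# Fit on all training data, predict the unknown test targets, and write the
# submission file that gets uploaded to the Kaggle platform for scoring.
model.fit(X, y)
pd.DataFrame({
    "id": test["id"],
    "target": model.predict_proba(test[features])[:, 1],
}).to_csv("submission.csv", index=False)
```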
Kaggle competitions
Kaggle ranking
● Based on competition performance, Kaggle ranks members using points and awards titles for finishing near the top of competitions
● For example, to earn the Master title a member needs one gold medal and two silver medals. For competitions with 1,000 participants, that means finishing once in the top 10 and twice in the top 50.
Kaggle ranking
Kaggle and Data Science
Why do you dislike Kaggle ?
● Kaggle competitions do not have much in common with real Data Science
○ The problems are already well formulated with metrics predefined. In an industry setting there is ambiguity, and knowing what to solve is one of the key steps towards a solution.
○ Data in most cases is already provided and is relatively clean.
○ The goal is leaderboard driven rather than understanding driven. Winning a competition, not understanding why an approach works, is the top priority. Results may not be trustworthy.
○ There is a chance of overfitting to the test data with repeated submissions.
○ In most cases the solution is an ensemble of algorithms and not “productionizable”.
https://www.quora.com/Why-do-you-dislike-Kaggle
True or False ?
● “The problems are already well formulated with metrics predefined. In an
industry setting there is ambiguity, and knowing what to solve is one of the
key steps towards a solution.”
https://www.quora.com/Why-do-you-dislike-Kaggle
Problem is well formulated
Mostly true, however...
● The need for predefined criteria is an inherent property of any competition.
● In the real world, not all data scientists are free to select and reformulate the problem. Many problems come already defined, with specific success criteria attached.
● We learn many subjects and skills by solving predefined problems and exercises: we learn math and physics by solving textbook problems that are already formulated for us. By solving such problems we also learn how to formulate problems ourselves and which approach suits a particular data science situation.
● We also have to admit that evaluating the business value of solving a problem is completely out of scope for Kaggle competitions, while business value analysis and problem prioritization are an important part of many real-life data science projects.
True or False ?
● “Data in most cases is already provided and is relatively clean.”
https://www.quora.com/Why-do-you-dislike-Kaggle
Data is clean
Half true
● In many competitions the datasets:
○ Are very big
○ Span multiple tables
○ Contain duplicated and mislabeled records
○ Combine structured and unstructured data
● Some competitions encourage participants to search for additional sources of data
● Many competitions contain data leaks
● Often feature names and meanings are not provided, making the problem even more difficult than in the real world
● Data may be intentionally distorted to conform to data privacy laws
Data is clean
● Complex data structure
● Big datasets
● No meaningful feature names
Data is clean
● Kaggle competitions teach unique data manipulation skills:
○ Dealing with data under hardware limitations: efficient code, smart sampling, clever encoding… (see the sketch below)
○ Using EDA to uncover the meaning of data without relying on feature names or other provided information
○ Discovering data leaks through data analysis
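One minimal sketch of the first of these skills, assuming the data is a pandas DataFrame: downcasting numeric columns and converting low-cardinality strings to categoricals so that a large competition dataset fits into limited RAM. The thresholds and the usage line are illustrative, not a universal recipe.

```python
# Sketch: reduce DataFrame memory usage by downcasting dtypes,
# a common trick for handling large Kaggle datasets on limited hardware.
import pandas as pd

def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    before = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        # Note: downcasting floats can lose precision; acceptable for many features.
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes(include=["object"]).columns:
        # Low-cardinality strings are much smaller as pandas categoricals.
        if df[col].nunique() < 0.5 * len(df):
            df[col] = df[col].astype("category")
    after = df.memory_usage(deep=True).sum() / 1024**2
    print(f"{before:.1f} MB -> {after:.1f} MB")
    return df

# Usage (file name is a placeholder): df = reduce_memory(pd.read_csv("train.csv"))
```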
True or False ?
● The goal is leaderboard driven rather than understanding driven. Winning a competition, not understanding why an approach works, is the top priority. Results may not be trustworthy.
https://www.quora.com/Why-do-you-dislike-Kaggle
No understanding
True, but maybe not that important
● This assumes that a model we cannot understand is less valuable than a model we can understand
○ A model is not necessarily used for knowledge discovery
○ In real life we often use and rely on things we do not completely understand
○ If something we do not understand cannot be trustworthy, how do we ever trust other people?
○ Even a complex machine learning model may be a simplification of an even more complex real system
No understanding
● This criticism ignores recent research on model interpretability
○ Feature importance
○ Reason codes
○ Partial dependence plots
○ Surrogate models
○ Neuron activation visualization
○ ...
● These methods allow us to analyze and understand the behaviour of models as complicated as GBMs and neural networks (see the sketch below)
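For example, a minimal sketch using scikit-learn's model-inspection tools on a synthetic dataset; the dataset and the choice of gradient boosting model are placeholders.

```python
# Sketch: inspecting a "black box" gradient boosting model with permutation
# feature importance and partial dependence plots (scikit-learn).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = HistGradientBoostingClassifier().fit(X_train, y_train)

# Which features does the model actually rely on?
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)
ranked = result.importances_mean.argsort()[::-1]
for i in ranked[:3]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")

# How do predictions respond to the two most important features?
PartialDependenceDisplay.from_estimator(model, X_valid, features=[int(i) for i in ranked[:2]])
plt.show()
```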
No understanding ?
True or False ?
● There is a chance of overfitting to the test data with repeated submissions.
https://www.quora.com/Why-do-you-dislike-Kaggle
Overfitting
False
● This reflects a complete misunderstanding of how Kaggle works
○ Test data in a Kaggle competition is split into two parts: public and private
○ During the competition, models are evaluated only on the public part of the test set
○ Final results are based only on the private part of the test set
○ Thus the final model evaluation is based on completely unseen data
● One of the first lessons all competition participants learn very quickly:
○ Do not overfit to the public leaderboard
○ Create a training/validation partition that reflects the test data as closely as possible, including seasonality effects and data drift (see the sketch below)
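A minimal sketch of that last point for a competition where the test period follows the training period in time. The file name, the "date" column, and the 80/20 split are placeholders; the idea is simply that local validation should mimic the hidden public/private test data rather than being a random shuffle.

```python
# Sketch: a time-based train/validation split that mimics how the hidden
# (public/private) test set relates to the training data.
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["date"]).sort_values("date")  # placeholder names

cutoff = df["date"].iloc[int(len(df) * 0.8)]   # hold out the last ~20% of the timeline
train_part = df[df["date"] < cutoff]
valid_part = df[df["date"] >= cutoff]

# Fit on train_part, score on valid_part with the competition metric, and trust this
# local score more than the public leaderboard when the two disagree.
print(len(train_part), len(valid_part))
```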
True or False ?
● In most cases the solution is an ensemble of algorithms and not
“productionizable”.
https://www.quora.com/Why-do-you-dislike-Kaggle
Difficult to put in production
Half true, half false
● Yes, in most cases the top models are complicated ensembles
● They are difficult to put in production if one deploys each model separately, one by one
● They are easy to put in production with a platform built to handle many models and blenders (see the toy sketch below)
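As a toy illustration of that last point (not a description of any particular platform's mechanism), an ensemble can be shipped as a single artifact when the blending logic lives inside one estimator object. The models and data here are arbitrary placeholders.

```python
# Sketch: a blended model saved and reloaded as one deployable artifact.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

blend = VotingClassifier(
    estimators=[
        ("gbm", HistGradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average predicted probabilities across the three models
).fit(X, y)

joblib.dump(blend, "blend.joblib")       # one artifact to ship
served = joblib.load("blend.joblib")     # what a serving process would load
print(served.predict_proba(X[:3]))
```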
True or False ?
● Sometimes, a 0.01 difference in AUC can be the difference between 1st place and 294th place (out of 626). Those marginal gains take significant time and effort that may not be worthwhile in the face of other projects and priorities
https://www.quora.com/How-similar-are-Kaggle-competitions-to-what-data-scientists-do
Marginal gain is not valuable
Not always true
● We ourselves often advise clients on the balance between time spent and model performance
● However, in the investment world a 0.01 difference in AUC can mean a difference of millions of dollars of gain or loss
● The competitive aspect of a data science problem with small margins drives innovation:
○ New preprocessing steps
○ New feature engineering ideas
○ Continuous testing of new algorithms and implementations (GBM → XGBoost → LightGBM → CatBoost); see the comparison sketch below
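For instance, a minimal sketch of benchmarking two of those implementations side by side under identical cross-validation. It assumes the xgboost and lightgbm packages are installed; the synthetic data and hyperparameters are placeholders.

```python
# Sketch: continuously testing newer GBM implementations against each other.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

for name, model in [
    ("XGBoost", XGBClassifier(n_estimators=300, learning_rate=0.05)),
    ("LightGBM", LGBMClassifier(n_estimators=300, learning_rate=0.05)),
]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean CV AUC = {auc:.4f}")
```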
Kaggle and Data Science
● “Kaggle competitions cover a decent amount of what a data scientist does.
The two big missing pieces are:
○ 1. taking a business problem and specifying it as a data science problem
(which includes pulling the data and structuring it so that it addresses that
business problem).
○ 2. putting models into production.”
Anthony Goldbloom
Kaggle and Data Science
● Kaggle is a competition
● “Real” Data Science is ... also a competition
Kaggle to “real life” Data Science
● DataRobot - created by top Kagglers
● Owen Zhang, Product Advisor (highest rank: #1)
● Xavier Conort, Chief Data Scientist (highest rank: #1)
● Sergey Yurgenson, Director, AI Services (highest rank: #1)
● Jeremy Achin, CEO & Co-Founder (highest rank: #20)
● Tom de Godoy, CTO & Co-Founder (highest rank: #20)
● Amanda Schierz, Data Scientist (highest rank: #24)
DataRobot automatically replicates the steps seasoned data scientists take. This allows
non-technical business users to create accurate predictive models and data scientists to add
to their existing tool set.
Kaggle and Data Science