
Before Kaggle : from a business goal to a Machine Learning problem

Dataiku

Many think that a data science project is like a Kaggle competition. There are, however, big differences in the approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected performance in production. This is a presentation by Pierre Gutierrez (data scientist at Dataiku).

Before Kaggle
From a business goal to a ML problem
Pierre Gutierrez
@prrgutierrez
•  Data Science competitions platform
(There are others : DataScience.net in France)
•  332,000 Data Scientists
•  today : 192 competitions, 18 active
+ 516 In class, 12 active
•  Prestigious clients : Axa, Cern, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle?
•  Prize pool?
•  325,000 $ to make on August 31st
•  Good luck with that !
•  Not a good hourly wage
Understand :
•  Lots of datasets on nearly every data science topic
•  Lots of winning solutions, tips and tricks, etc.
•  Lots of “beat the benchmark” posts for beginners
I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join ?
Most of the time:
•  You have a train set with labels and a test set without labels.
•  You need to learn a model using the train features and predict the test set labels
•  Your prediction is evaluated using a specific metric
•  The best prediction wins
What is a Data Science Competition ?
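The third bullet, scoring a prediction with a specific metric, can be sketched in plain Python. This is a minimal illustration and is not taken from the talk: the labels and the two sets of predicted probabilities (`model_a`, `model_b`) are made-up toy values, and the helper functions are simplified versions of the usual definitions of log loss, AUC and F1.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall after thresholding."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels and predictions (assumptions, not from the talk).
y_true = [1, 1, 0, 0]
model_a = [0.40, 0.30, 0.20, 0.10]  # perfect ranking, all scores below 0.5
model_b = [0.90, 0.45, 0.60, 0.10]  # one ranking mistake, bolder scores

for name, probs in [("model_a", model_a), ("model_b", model_b)]:
    print(name,
          "AUC=%.3f" % auc(y_true, probs),
          "F1=%.3f" % f1_score(y_true, probs),
          "logloss=%.3f" % log_loss(y_true, probs))
```

On this toy data, `model_a` ranks every positive above every negative but never crosses the 0.5 threshold, so it does best on AUC and worst on F1, while `model_b` does better on both F1 and log loss. Which prediction "wins" depends on the metric.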
Why AUC? F1 score? Log loss?
Could that depend on my train/test split?
Where do they come from? Do you always have some?
Why is the split this way? Random? Time?
What you don’t learn on Kaggle (or in class?):
•  How to turn a business question into an ML problem.
•  How to manage/create labels. (proxy / missing…)
•  How to evaluate a model:
•  How to choose your metric
•  How to design your train/test split
•  How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions:
•  How to design your cross validation scheme (and not overfit)
•  How to create relevant features
•  Hacks and tricks (leak exploitation ☺)
What is a Data Science Competition?
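The point about designing the train/test split can be illustrated with a small sketch comparing a random split with a time-based one. Everything here is an assumption made for illustration: the toy rows (one per day, with an arbitrary label), the function names, and the 20% test fraction are hypothetical and do not come from the talk.

```python
import random
from datetime import date, timedelta

# Hypothetical toy dataset: one (day, label) row per day over 100 days.
start = date(2016, 1, 1)
rows = [(start + timedelta(days=i), i % 2) for i in range(100)]

def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction regardless of date."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def time_split(rows, test_fraction=0.2):
    """Hold out the most recent rows, so training only sees the past."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]

def leaks_future(train, test):
    """True if some training row is dated after the earliest test row."""
    return max(r[0] for r in train) > min(r[0] for r in test)

rand_train, rand_test = random_split(rows)
time_train, time_test = time_split(rows)

print("random split trains on the future:", leaks_future(rand_train, rand_test))
print("time split trains on the future:  ", leaks_future(time_train, time_test))
```

If the model will be used in production to predict future events, the time-based split is usually the closer imitation of that setting; a random split lets the model train on rows dated after some of the rows it is tested on.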

Recommended

Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015 Dataiku
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 
H2O World - Machine Learning for non-data scientists
H2O World - Machine Learning for non-data scientistsH2O World - Machine Learning for non-data scientists
H2O World - Machine Learning for non-data scientistsSri Ambati
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHDataiku
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabeDataiku
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...Dataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 

More Related Content

What's hot

Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunDataiku
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisCourse 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisBetacowork
 
Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Valdas Maksimavičius
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)Betacowork
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 
H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaSri Ambati
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019DataKitchen
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamGreg Goltsov
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2Cdiscount
 

What's hot (20)

Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for Fun
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisCourse 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
 
Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 
H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral Bajaria
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2
 
DataHub
DataHubDataHub
DataHub
 

Viewers also liked

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
How to get started in Kaggle competition
How to get started in Kaggle competitionHow to get started in Kaggle competition
How to get started in Kaggle competitionMerja Kajava
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessJoshua Drake
 
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...M Khurram Shahzad
 
121206 3-dirty-words-webinar
121206 3-dirty-words-webinar121206 3-dirty-words-webinar
121206 3-dirty-words-webinarLeanne Smith
 
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and ScoreBudgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and ScoreSpringboard
 
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...courageasia
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
 

Viewers also liked (13)

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
How to get started in Kaggle competition
How to get started in Kaggle competitionHow to get started in Kaggle competition
How to get started in Kaggle competition
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own Business
 
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
 
10 ways to boost your company sales
10 ways to boost your company sales10 ways to boost your company sales
10 ways to boost your company sales
 
Budgeting 101 Fall Institute 2011 Final
Budgeting 101 Fall Institute 2011 FinalBudgeting 101 Fall Institute 2011 Final
Budgeting 101 Fall Institute 2011 Final
 
Restaurant Profitability 101: Budgeting
Restaurant Profitability 101: BudgetingRestaurant Profitability 101: Budgeting
Restaurant Profitability 101: Budgeting
 
121206 3-dirty-words-webinar
121206 3-dirty-words-webinar121206 3-dirty-words-webinar
121206 3-dirty-words-webinar
 
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and ScoreBudgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
 
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
 
B101 slideshow
B101 slideshowB101 slideshow
B101 slideshow
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
 

Similar to Before Kaggle : from a business goal to a Machine Learning problem

Churn prediction data modeling
Churn prediction data modelingChurn prediction data modeling
Churn prediction data modelingPierre Gutierrez
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
An Overview of automated testing (1)
An Overview of automated testing (1)An Overview of automated testing (1)
An Overview of automated testing (1)Rodrigo Lopes
 
From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productBruce Kuo
 
DutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveDutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveBigML, Inc
 
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013Why-What-How Consulting, LLC
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDatabricks
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Toolkits and tips for UX analytics CRO by Craig Sullivan
Toolkits and tips for UX analytics CRO by Craig SullivanToolkits and tips for UX analytics CRO by Craig Sullivan
Toolkits and tips for UX analytics CRO by Craig SullivanUXPA UK
 
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CROUXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CROCraig Sullivan
 
How to Apply Machine Learning by Lyft Senior Product Manager
How to Apply Machine Learning by Lyft Senior Product ManagerHow to Apply Machine Learning by Lyft Senior Product Manager
How to Apply Machine Learning by Lyft Senior Product ManagerProduct School
 
Agile Estimating and Planning
Agile Estimating and PlanningAgile Estimating and Planning
Agile Estimating and PlanningMojammel Haque
 
Machine Learning & Predictive Maintenance
Machine Learning &  Predictive MaintenanceMachine Learning &  Predictive Maintenance
Machine Learning & Predictive MaintenanceArnab Biswas
 
BMDSE v1 - Data Scientist Deck
BMDSE v1 - Data Scientist DeckBMDSE v1 - Data Scientist Deck
BMDSE v1 - Data Scientist DeckSasha Lazarevic
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureIvo Andreev
 
Unit 1 introduction to simulation
Unit 1 introduction to simulationUnit 1 introduction to simulation
Unit 1 introduction to simulationDevaKumari Vijay
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NETDev Raj Gautam
 
Data-driven product management
Data-driven product managementData-driven product management
Data-driven product managementArseny Kravchenko
 

Similar to Before Kaggle : from a business goal to a Machine Learning problem (20)

Churn prediction data modeling
Churn prediction data modelingChurn prediction data modeling
Churn prediction data modeling
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
An Overview of automated testing (1)
An Overview of automated testing (1)An Overview of automated testing (1)
An Overview of automated testing (1)
 
From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning product
 
DutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveDutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business Perspective
 
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Real timeanalyticsl oreal
Real timeanalyticsl orealReal timeanalyticsl oreal
Real timeanalyticsl oreal
 
Toolkits and tips for UX analytics CRO by Craig Sullivan
Toolkits and tips for UX analytics CRO by Craig SullivanToolkits and tips for UX analytics CRO by Craig Sullivan
Toolkits and tips for UX analytics CRO by Craig Sullivan
 
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CROUXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
 
How to Apply Machine Learning by Lyft Senior Product Manager
How to Apply Machine Learning by Lyft Senior Product ManagerHow to Apply Machine Learning by Lyft Senior Product Manager
How to Apply Machine Learning by Lyft Senior Product Manager
 
Agile Estimating and Planning
Agile Estimating and PlanningAgile Estimating and Planning
Agile Estimating and Planning
 
Machine Learning & Predictive Maintenance
Machine Learning &  Predictive MaintenanceMachine Learning &  Predictive Maintenance
Machine Learning & Predictive Maintenance
 
When Should I Use Simulation?
When Should I Use Simulation?When Should I Use Simulation?
When Should I Use Simulation?
 
BMDSE v1 - Data Scientist Deck
BMDSE v1 - Data Scientist DeckBMDSE v1 - Data Scientist Deck
BMDSE v1 - Data Scientist Deck
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
Unit 1 introduction to simulation
Unit 1 introduction to simulationUnit 1 introduction to simulation
Unit 1 introduction to simulation
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
 
Data-driven product management
Data-driven product managementData-driven product management
Data-driven product management
 

More from Dataiku

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Dataiku
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data CircleDataiku
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thDataiku
 
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku
 
Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku
 

More from Dataiku (17)

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 

Before Kaggle : from a business goal to a Machine Learning problem

  • 1. Before Kaggle From a business goal to a ML problem Pierre Gutierrez @prrgutierrez
  • 2. •  Data Science competitions platform (There are others : DataScience.net in France) •  332,000 Data Scientists •  today : 192 competitions, 18 active + 516 In class, 12 active •  Prestigious clients : Axa, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex… What is Kaggle ?
  • 3. •  Prize pool? •  325,000 $ to make on August 31st •  Good luck with that ! •  Not a good hourly wage •  today : 192 competitions, 18 active Understand : •  Lots of datasets about approximately every DS topic •  Lots of winner solutions, tips and tricks, etc… •  Lots of “beat the benchmark” for beginners I discovered/tested there : GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ... Why should I join ?
  • 4. Most of the time: •  You have a train set with labels and a test set without labels. •  You need to learn a model using the train features and predict the test set labels •  Your prediction is evaluated using a specific metric •  The best prediction wins What is a Data Science Competition ?
  • 5. Most of the time: •  You have a train set with labels and a test set without labels. •  You need to learn a model using the train features and predict the test set labels •  Your prediction is evaluated using a specific metric •  The best prediction wins What is a Data Science Competition? Why AUC? F1 score? Log loss? Could that depend on my train/test split? Where do they come from? Do you always have some? Why is the split this way? Random? Time?
  • 6. What you don’t learn on Kaggle (or in class?): •  How to model a business question into a ML problem. •  How to manage/create labels. (proxy / missing…) •  How to evaluate a model: •  How to choose your metric •  How to design your train/test split •  How to account for this in feature engineering Understanding this actually helps you in Kaggle competitions : •  How to design your cross validation scheme (and not overfit) •  How to create relevant features •  Hacks and tricks (leak exploitation :) ) What is a Data Science Competition?
  • 8. Christophe Bourguignat DS cheat sheet @chris_bour     Today  
  • 9. •  Introduction •  Labels? •  Train and test split? •  Feature Engineering? •  Evaluation Metric? Introduction
  • 10. •  Introduction •  Labels? •  Train and test split? •  Feature Engineering? •  Evaluation Metric? Introduction The newcomer disillusion The production bad surprise The business obfuscation
  • 11. •  Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities,…) •  (More than) Occasional Kaggle competitor •  Twitter @prrgutierrez Who I am
  • 14. •  Fraud is everywhere : e-business, telco, Medicare,… •  Easily defined as a classification problem •  Is the target well defined ? •  E-business : yes, with a lag •  Elsewhere : checks are needed, labels are expensive Fraud Detection
  • 15. •  Wikipedia: “Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period of time” = Customer leaving Churn
  • 16. •  Subscription models: •  Telco •  E-gaming (WoW) •  Ex : Coyote -> 1 year subscription -> you know when someone leaves •  Non subscription models: •  E-Business (Amazon, Price Minister, Vente Privée) •  E-gaming (Candy Crush, free MMORPG) -> you approximate someone leaving Candy Crush: days / weeks MMORPG: 2 months (holidays) Price Minister: months Two types of Churn
  • 17. •  Predict if a vehicle / machine / part is going to fail •  Classification Problem: •  Given a future horizon and a failure type, will this happen for a given vehicle ? -> 2 parameters describe the target •  Varying the target a lot -> spurious correlations •  Just choose it from the exact business need Predictive Maintenance
  • 18. •  Target is “will like” or “will buy” •  Target is often a proxy of the real interest (implicit feedback) Recommender System
  • 19. •  Can you model the problem as a ML problem? •  Ex : predictive maintenance •  Ask the right question from a business point of view. Not what you know how to do. •  Is your target a proxy? •  Recommendation system •  May need bandit algorithm •  Is it easy to get labels? •  Ex : Fraud detection •  Can be expensive •  Mechanical Turk can be the answer Summary on Labels
  • 20. •  Random Split •  Just like in school Train / test split •  When and why ? -> When each line is independent from the rest (not that common !) : image, document classification, sentiment analysis (“but aha is the new lol”) -> When you want to quickly iterate / benchmark: “is it even possible?” -> When you want to sell something to your boss
  • 21. •  Column / group based Ex : Caterpillar challenge •  Predict a price •  for each tube id •  Tube id in train and test are different Objective : being able to generalize to other tubes! Train / test split
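A group-based split like the one described for the Caterpillar challenge can be sketched as follows. This is a minimal illustration using only the Python standard library; the `group_split` helper, the field names and the toy rows are assumptions made for the example, not part of the challenge data.

```python
import random

def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that no group value appears in both train and test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical data: three price quotes for each of eight tube ids.
rows = [{"tube_id": f"T{i % 8}", "quantity": q, "price": 10.0 / q}
        for i, q in enumerate([1, 2, 5, 10] * 6)]
train, test = group_split(rows, "tube_id")
```

Because whole tubes are held out, the test score estimates how well the model generalizes to tubes it has never seen, which is the stated objective.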
  • 22. •  Time based •  Simply separate train and test on a time variable •  When and Why? -> When you want a model that “predicts the future” -> When things evolve with time! (most problems!) -> Examples : Ad click prediction, Churn prediction, E-business Fraud detection, Predictive maintenance,… Train / test split
  • 23. •  No subscription example •  Target : 4 months without buying •  Features ? Train / test split : Churn example
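A time-based split can be as simple as cutting on a date. The sketch below assumes each row carries a date field; the `time_split` helper and the toy click events are hypothetical.

```python
from datetime import date

def time_split(rows, time_key, cutoff):
    """Train on everything strictly before the cutoff, test on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

# Hypothetical data: one event on the first day of each month of 2015.
events = [{"day": date(2015, m, 1), "clicked": m % 2} for m in range(1, 13)]
train, test = time_split(events, "day", cutoff=date(2015, 10, 1))
```

Every test row is later than every train row, so the evaluation mimics predicting the future from the past.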
  • 24. Ex : Train and predict scheme [Timeline diagram: time axis with T – 4 months and T = present time] Data before T – 4 months is used for feature generation. Data between T – 4 months and T is used for target creation : activity during the last 4 months. Train model using features and target. Use model to predict future churn.
  • 25. Ex : Train, Evaluation and Predict Scheme [Timeline diagram: time axis with T – 8 months, T – 4 months and T = present time] Training : data before T – 8 months is used for feature generation; data between T – 8 months and T – 4 months is used for target creation : activity during the last 4 months. Validation set : data before T – 4 months is used for feature generation; data between T – 4 months and T is used for target creation : activity during the last 4 months. Evaluate on the target of the validation set. Use model to predict future churn.
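The train / evaluation / predict scheme above can be sketched as one function applied at three different cutoff dates. This is only an illustration: the `build_rows` helper, the two features and the toy purchase history are assumptions, and "4 months" is approximated as 120 days.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=120)  # assumption: "4 months" approximated as 120 days

def build_rows(purchases, cutoff, with_target=True):
    """purchases: dict customer -> list of purchase dates.
    Features use only data before the cutoff; the target uses the
    HORIZON-long window starting at the cutoff."""
    rows = []
    for customer, dates in purchases.items():
        past = [d for d in dates if d < cutoff]
        if not past:
            continue  # customer not yet known at the cutoff
        row = {
            "customer": customer,
            "n_purchases": len(past),
            "days_since_last": (cutoff - max(past)).days,
        }
        if with_target:
            bought_again = any(cutoff <= d < cutoff + HORIZON for d in dates)
            row["churned"] = not bought_again
        rows.append(row)
    return rows

T = date(2016, 1, 1)  # hypothetical "present time"
purchases = {
    "alice": [date(2015, 2, 10), date(2015, 6, 1), date(2015, 11, 20)],
    "bob": [date(2015, 3, 5)],
    "carol": [date(2015, 10, 15)],
}
train_rows = build_rows(purchases, T - 2 * HORIZON)          # target window: T-8 to T-4 months
valid_rows = build_rows(purchases, T - HORIZON)              # target window: T-4 months to T
predict_rows = build_rows(purchases, T, with_target=False)   # target is still unknown
```

Using the same function for all three cutoffs keeps the feature definitions identical between training, validation and production scoring.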
  • 26. •  More complex design •  Graph sampling (fraud rings ? ) •  Random sampling in client / machine life •  Mix of column based and time based … •  The rule : 1)  What is the problem ? 2)  To what would I like to generalize my model ? Future ? Other individuals ? … 3)  => Train / Test split Train / test split
  • 27. •  Predictive Maintenance problem •  Objective : predict failure in next 3 days. •  Metric is proportional to accuracy (and 0.57 is the best score !) •  Link to data : https://www.phmsociety.org/events/conference/phm/14/data-challenge EX PHM Society (Fail example)
  • 31. •  How to design the evaluation scheme? •  What is the probability that an asset fails in the next 3 days from now? -> classification problem -> Time based split -> but how do I create a train and a test? •  Choose a date and evaluate what happens 3 days later? -> problem : not enough failures happening •  Choose several dates for each asset? -> beware of asset over-fitting •  In the challenge : random selection of (asset, date) in the future + over sampling of failures. EX PHM Society
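One way to turn "will this asset fail in the next 3 days?" into labelled examples is to sample (asset, day) pairs and look ahead over the horizon. The sketch below is an assumption-laden illustration, not the challenge's actual procedure: days are plain integers, and the helper names and toy failure log are hypothetical.

```python
def label_within_horizon(failure_days, day, horizon=3):
    """1 if the asset fails in (day, day + horizon], else 0."""
    return int(any(day < f <= day + horizon for f in failure_days))

def make_examples(failures, sample_days, horizon=3):
    """failures: dict asset -> list of failure days (integers)."""
    return [
        {"asset": a, "day": d,
         "fails_soon": label_within_horizon(failures[a], d, horizon)}
        for a in failures for d in sample_days
    ]

# Hypothetical failure log for three assets.
failures = {"A": [10, 40], "B": [], "C": [25]}
examples = make_examples(failures, sample_days=[8, 20, 23, 38])

cutoff = 22
# Train only on examples whose 3-day label window closes before the cutoff.
train = [e for e in examples if e["day"] + 3 <= cutoff]
# Test on examples sampled after the cutoff.
test = [e for e in examples if e["day"] > cutoff]
```

Examples whose label window straddles the cutoff (day 20 here) are dropped from both sets, so no test-period information leaks into a training label.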
  • 32. •  Basic Feature engineering EX PHM Society
  • 33. •  Random Sampling EX PHM Society This is decent! « With some more work I could have a model that beats randomness enough to be useful »
  • 34. •  Time based split EX PHM Society Wait, what?
  • 35. •  TIME LEAK EX PHM Society
  • 36. •  TIME LEAK EX PHM Society Tree cuts
  • 37. •  Beware of the distribution of your features! •  Is there a time dependency? •  Ex : count, sum, … that will only increase with time •  -> Calculate count and sum rescaled by time / in moving windows instead. •  Can be found in Churn, Fraud detection, Ad click prediction,… •  A categorical variable dependency? •  Ex : email flag in fraud detection •  Is there a Network dependency? •  Ex : Fraud / Bot detection (network features can be useful but leaky) Feature Engineering
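The difference between a raw cumulative count and its time-rescaled or windowed versions can be seen on a toy example. The `count_features` helper and the event schedule below are hypothetical; days are plain integers.

```python
def count_features(event_days, now, first_seen, window=30):
    """Compare a raw cumulative count with two time-stable alternatives."""
    total = sum(1 for d in event_days if d <= now)
    age = max(1, now - first_seen)
    recent = sum(1 for d in event_days if now - window < d <= now)
    return {
        "total_count": total,          # only grows with time: leaks the date
        "count_per_day": total / age,  # rescaled by the entity's age
        "count_last_window": recent,   # moving window of fixed length
    }

# Hypothetical entity with one event every 10 days.
events = list(range(0, 300, 10))
early = count_features(events, now=100, first_seen=0)
late = count_features(events, now=290, first_seen=0)
```

The behaviour is identical at both dates, yet `total_count` nearly triples, while the rescaled and windowed features stay almost unchanged. A tree model can split on the raw count to recover the date.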
  • 38. •  Final trick : -  Stack train and test and add is_test boolean -  Try to predict is_test -  Check if the model is able to predict -  If so : -  check the feature importance -  Remove / modify feature and iterate Feature Engineering
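A lightweight variant of this trick can be sketched without any ML library. Instead of fitting a full model on `is_test`, the code below scores each feature alone by how well it separates train from test, using a rank-based AUC. This is a simplification of the procedure on the slide, and the helper names and toy data are assumptions.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def drift_report(train, test, features):
    """Stack train and test, add is_test, and score each feature on its own."""
    stacked = train + test
    is_test = [0] * len(train) + [1] * len(test)
    report = {}
    for f in features:
        a = auc([row[f] for row in stacked], is_test)
        report[f] = max(a, 1 - a)  # the direction of the shift does not matter
    return report

# Hypothetical data: the count keeps growing over time, the temperature does not.
train = [{"cumulative_count": i, "temperature": 20 + (i % 5)} for i in range(50)]
test = [{"cumulative_count": 50 + i, "temperature": 20 + (i % 5)} for i in range(50)]
report = drift_report(train, test, ["cumulative_count", "temperature"])
```

A value close to 1.0 flags a feature that gives away whether a row is from train or test, here `cumulative_count`. A value close to 0.5 means the feature is distributed alike in both sets.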
  • 39. •  Final trick: •  Back to PHM example: Feature Engineering Huge time leak !
  • 40. •  “Threshold dependent” •  Accuracy •  Precision and Recall •  F1 score •  “Threshold independent” •  AUC •  Log Loss •  Others (Mean average precision)… Evaluation metric : Classification
  • 41. •  “Threshold dependent” •  Accuracy •  Precision and Recall •  F1 score •  “Threshold independent” •  AUC •  Log Loss •  Others (Mean average precision)… •  Customs Evaluation metric : Classification Not good if unbalanced target When you have an order problem When you are going stochastic When you need to stick to business Accuracy alternative
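The remark that accuracy is not a good metric on an unbalanced target can be checked on a toy example. The sketch below uses plain-Python metric implementations written for this illustration; the 2% positive rate and the two sets of scores are made up.

```python
import math

def accuracy(scores, labels, threshold=0.5):
    """Threshold dependent: share of correct decisions at a given cut."""
    return sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels)) / len(labels)

def auc(scores, labels):
    """Threshold independent: chance that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(scores, labels, eps=1e-15):
    """Threshold independent: penalizes badly calibrated probabilities."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1 - eps)
        total -= math.log(s) if y == 1 else math.log(1 - s)
    return total / len(labels)

labels = [1] * 2 + [0] * 98          # 2% positives
constant = [0.02] * 100              # same score for everyone
ranker = [0.4] * 2 + [0.01] * 98     # ranks every positive above every negative
```

Both sets of scores reach 98% accuracy at the 0.5 threshold, yet only `ranker` orders the positives first: its AUC is 1.0 against 0.5 for `constant`, and its log loss is lower.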
  • 42. •  Custom metrics •  Cost based •  Ex Fraud: •  Mean loss of 50 $ / fraud (FN) •  Mean loss of 20 $ / wrongly cancelled transaction (FP) •  F1 score often used in papers •  in practice, you often have a business cost Evaluation metric : Classification [Confusion matrix: TP, FN, TN, FP]
  • 43. •  Custom metrics •  Fraud Example 1: •  “I have fraudsters on my e-business website” •  I generate a score for each transaction •  I handle this by manually reviewing transactions with a score higher than a threshold •  I have 1 person who does this full time and is able to deal with 100 transactions / day •  The rest is automatically accepted -> AUC is not bad -> Recall within the top 100 transactions / day -> Total money blocked within the top 100 transactions / day In practice AUC is more stable… But the money metric can also be used for communication. Evaluation metric : Classification
  • 44. •  Custom metrics •  Fraud Example 2: •  “I have fraudsters on my e-business website” •  I generate a score for each transaction •  I handle this automatically by blocking all transactions with a score higher than a threshold -> AUC is not bad… But it doesn’t give a threshold value. -> F1–Score? -> Cost based is better Evaluation metric : Classification
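The cost-based metric from this slide can be written directly from the two figures it gives (50 $ per missed fraud, 20 $ per wrongly cancelled transaction). The helper names, scores and labels below are hypothetical; the sketch blocks every transaction whose score is at or above the threshold.

```python
COST_FN = 50.0  # mean loss per missed fraud (from the slide)
COST_FP = 20.0  # mean loss per wrongly cancelled transaction (from the slide)

def total_cost(scores, labels, threshold):
    """Business cost when blocking every transaction with score >= threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return COST_FN * fn + COST_FP * fp

def best_threshold(scores, labels):
    """Try every observed score (plus 'block nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t))

# Hypothetical model scores and true fraud labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
t = best_threshold(scores, labels)
```

On this toy data the cheapest threshold is 0.35, which catches all three frauds at the price of two wrongly blocked transactions (40 $ in total).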
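The two capacity-based metrics of Fraud Example 1 can be sketched as follows. The `review_queue_metrics` helper and the toy transactions are assumptions; the default capacity of 100 comes from the slide, and the example uses a capacity of 3 to stay readable.

```python
def review_queue_metrics(transactions, capacity=100):
    """transactions: one day of dicts with 'score', 'is_fraud' and 'amount'.
    Only the `capacity` highest-scored transactions are reviewed manually."""
    ranked = sorted(transactions, key=lambda t: t["score"], reverse=True)
    reviewed = ranked[:capacity]
    total_fraud = sum(t["is_fraud"] for t in transactions)
    caught = sum(t["is_fraud"] for t in reviewed)
    recall = caught / total_fraud if total_fraud else 0.0
    money_blocked = sum(t["amount"] for t in reviewed if t["is_fraud"])
    return recall, money_blocked

# Hypothetical day of transactions.
day = [
    {"score": 0.9, "is_fraud": 1, "amount": 120.0},
    {"score": 0.8, "is_fraud": 0, "amount": 30.0},
    {"score": 0.7, "is_fraud": 1, "amount": 60.0},
    {"score": 0.4, "is_fraud": 0, "amount": 15.0},
    {"score": 0.2, "is_fraud": 1, "amount": 200.0},
]
recall, money = review_queue_metrics(day, capacity=3)
```

Here two of the three frauds land in the review queue, but the largest one (200 $) is missed, which is why the money metric can tell a different story from recall.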
  • 45. •  My cheat sheet Evaluation metric : Classification
      Metric | Optimized by ML model ? | Threshold dependent | Application example
      Accuracy | YES | YES | image classification, NLP …
      F1-score | NO | YES | ? Papers ?
      AUC | NO | NO | fraud detection, churn, healthcare …
      Log-Loss | YES | NO | ad click prediction
      Custom metric | NO | ? | all ?
  • 46. •  Business Question dictates Evaluation Scheme! •  test set design •  evaluation metric •  Indirectly impacts feature engineering •  Indirectly impacts label quality •  Think (not too much) before coding •  Don’t try to optimize the wrong problem! Conclusion
  • 47. Thank you for your attention!