How to crack ML
Competitions
LIVE on Aug 11th, 2019
AppliedAICourse.com
ML Competitions
1. Kaggle.com
2. KDD Cup
3. Company specific competitions/hackathons.
KDD Cup 2009
Why did we choose this competition?
Problem:
“Predict, from customer data provided by the French Telecom company Orange,
the propensity of customers to switch providers (churn), buy new products or
services (appetency), or buy upgrades or add-ons (up-selling)”
Winner’s solution: http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
Dataset
Size: 100,000 data-points in total
Train-Test: Split randomly into equally sized training and test sets.
# Variables: 15,000 variables were made available for prediction.
Categorical vs Numerical: 260 of these variables were categorical.
Missing values: Most of the categorical variables and 333 of the continuous
variables had missing values.
Interpretability: To maintain the confidentiality of customers, all variables were
scrambled.
Challenges
Slow vs Fast Challenge
● Fast challenge: 5 days
● Slow challenge: 30 days, subset of 230 variables, 40 of which were categorical
Metric: AUC averaged across the 3 prediction tasks
Feedback on submission: AUC on a random 10% of the test set.
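The metric above can be sketched in a few lines. This is only an illustration: the task names are taken from the problem statement, while the labels and scores are made-up toy values, and `mean_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_auc(y_true_by_task, scores_by_task):
    """Return (AUC averaged over tasks, per-task AUCs)."""
    aucs = {
        task: roc_auc_score(y_true_by_task[task], scores_by_task[task])
        for task in y_true_by_task
    }
    return float(np.mean(list(aucs.values()))), aucs


# Toy labels and scores (not competition data), one entry per prediction task.
y_true = {
    "churn": [0, 0, 1, 1],
    "appetency": [0, 1, 0, 1],
    "upselling": [1, 0, 0, 1],
}
scores = {
    "churn": [0.1, 0.4, 0.35, 0.8],
    "appetency": [0.2, 0.9, 0.1, 0.7],
    "upselling": [0.6, 0.3, 0.5, 0.2],
}

overall, per_task = mean_auc(y_true, scores)
print(per_task, overall)
```

Because the three AUCs are averaged with equal weight, a weak model on one task pulls the overall score down as much as a weak model on any other.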
Preprocessing, Cleaning
Missing Values
● Categorical: special additional categorical value.
● Numerical: mean imputation, isMissing feature
Categorical to numerical:
● one-hot encoding of only the top 10 values per feature.
Feature Normalization
Remove features with constant value for all data points. [13436 features left]
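A minimal pandas sketch of the cleaning steps listed above, assuming the data sits in a DataFrame. The function name `preprocess`, the `MISSING` label, and the column naming scheme are illustrative choices, not details from the winning solution; feature normalization is left out.

```python
import numpy as np
import pandas as pd


def preprocess(df, top_k=10):
    """Impute, one-hot encode the top_k categories, drop constant columns."""
    out = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            # Numerical: isMissing indicator plus mean imputation.
            if s.isna().any():
                out[f"{col}_isMissing"] = s.isna().astype(int)
            out[col] = s.fillna(s.mean())
        else:
            # Categorical: missing becomes its own category value.
            s = s.astype(object).where(s.notna(), "MISSING")
            # One-hot encode only the top_k most frequent values.
            for value in s.value_counts().index[:top_k]:
                out[f"{col}={value}"] = (s == value).astype(int)
    X = pd.DataFrame(out, index=df.index)
    # Remove features that take a single value on every row.
    return X.loc[:, X.nunique() > 1]


# Toy frame (hypothetical column names and values).
df = pd.DataFrame({
    "Var1": [1.0, np.nan, 3.0, 5.0],
    "Var2": ["a", None, "a", "b"],
    "Var3": [7, 7, 7, 7],
})
X = preprocess(df)
print(X)
```

Rare category values beyond the top 10 end up as all-zero rows in the one-hot block, which keeps the feature count bounded even for high-cardinality variables.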
Experimental Setup:
10-fold CV
Ignore the feedback (from the 10% test sample) except for
sanity checks.
Avoid overfitting at all costs.
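A rough sketch of this setup with scikit-learn: model quality is judged by 10-fold cross-validated AUC on the training data rather than by leaderboard feedback. The synthetic data, the logistic regression placeholder, and the use of stratified folds are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for one of the three tasks.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0
)

# 10 folds; stratification keeps the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_aucs = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

print("AUC per fold:", np.round(fold_aucs, 3))
print("mean AUC: %.3f, std: %.3f" % (fold_aucs.mean(), fold_aucs.std()))
```

The spread across folds gives a sense of how much of a score difference between two models is just noise.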
Library of Base Models
Overall Strategy: Ensemble models;
Random Forests: with many combinations of params
GBDT: with many combinations of params
Logistic Regression (L1, L2), SVMs, k-NN, Naive Bayes, Co-Clustering
500-1000 individual models for each of the three problems.
Calibration of each model: Platt Scaling using a logistic function.
KITCHEN SINK APPROACH
Best Individual models: GBDTs or RF
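The calibration step can be illustrated as follows: Platt scaling fits a one-dimensional logistic function that maps a model's raw scores to probabilities. This is a simplified sketch on synthetic scores; `fit_platt` is a hypothetical helper, and the label smoothing from Platt's original formulation is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_platt(scores, y):
    """Fit p = 1 / (1 + exp(-(a * score + b))) on held-out scores."""
    lr = LogisticRegression(C=1e6, max_iter=1000)  # large C: almost no penalty
    lr.fit(np.asarray(scores, dtype=float).reshape(-1, 1), y)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    return lambda s: 1.0 / (1.0 + np.exp(-(a * np.asarray(s, dtype=float) + b)))


# Synthetic raw scores from an uncalibrated model (assumption for the demo).
rng = np.random.default_rng(0)
raw = rng.normal(size=2000)
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(3.0 * raw - 1.0)))).astype(int)

calibrate = fit_platt(raw, y)
probs = calibrate(raw)
print(probs[:5])
```

Putting every model's output on a common probability scale matters when predictions from very different model families are averaged in an ensemble.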
Ensemble Selection
1. Initialize the ensemble with the set of N classifiers that have the best single-model
performance on a held-out set.
2. Add more models one-by-one (like Feature Selection) as long as they
improve the overall performance, even by a bit.
Results better than other competitors on the Fast challenge
No feature engineering or human expertise till now.
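The two steps of ensemble selection can be sketched as a greedy loop over held-out predictions. This is an illustration, not the winners' code: the ensemble prediction is assumed to be a plain average, a model may be added more than once, and `ensemble_selection`, `n_init`, and `max_size` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def ensemble_selection(preds, y, n_init=2, max_size=50):
    """preds maps model name -> held-out predictions (NumPy arrays)."""
    single = {name: roc_auc_score(y, p) for name, p in preds.items()}
    # Step 1: start from the n_init best individual models.
    chosen = sorted(single, key=single.get, reverse=True)[:n_init]
    total = sum(np.asarray(preds[name], dtype=float) for name in chosen)
    best = roc_auc_score(y, total / len(chosen))
    # Step 2: keep adding the model that helps most, while it still helps.
    while len(chosen) < max_size:
        trial = {
            name: roc_auc_score(y, (total + preds[name]) / (len(chosen) + 1))
            for name in preds
        }
        candidate = max(trial, key=trial.get)
        if trial[candidate] <= best:
            break
        chosen.append(candidate)
        total = total + preds[candidate]
        best = trial[candidate]
    return chosen, best


# Toy library: five models with increasing noise (synthetic data).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
preds = {f"model_{k}": y + rng.normal(scale=1.0 + 0.5 * k, size=300) for k in range(5)}

chosen, best = ensemble_selection(preds, y, n_init=1)
print(chosen, round(best, 3))
```

Since selection is driven by the held-out score, the held-out set itself can be overfit when the library holds hundreds of models, which is one reason to combine this with cross-validation.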
Feature engineering to improve the models
1. Binning using Decision Trees
[L1-regularized LR became the best model using these features]
2. Explicit Feature Construction
[The churn rate among rows with value 0 was up to twice the churn
rate among rows with any other numeric value]
Single feature AUC=0.62
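Both ideas on this slide can be sketched on a single numeric column. The example assumes that the bin of a row is the leaf it lands in when a small decision tree is fit on that one feature, and that the explicit feature is a simple "value is 0" indicator; the data, `tree_bins`, and the tree settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tree_bins(x, y, max_leaf_nodes=8):
    """Supervised binning: fit a tree on one feature, return leaf ids as bins."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    tree = DecisionTreeClassifier(
        max_leaf_nodes=max_leaf_nodes, min_samples_leaf=20, random_state=0
    )
    tree.fit(x, y)
    return tree.apply(x)  # index of the leaf each row falls into


# Synthetic column where rows with value 0 churn about twice as often.
rng = np.random.default_rng(0)
x = np.where(rng.random(1000) < 0.3, 0.0, rng.exponential(5.0, 1000))
y = (rng.random(1000) < np.where(x == 0, 0.2, 0.1)).astype(int)

bins = tree_bins(x, y)            # 1. binning using a decision tree
is_zero = (x == 0).astype(int)    # 2. explicit constructed feature

print("bins used:", np.unique(bins).size)
print("churn rate if 0: %.3f, otherwise: %.3f"
      % (y[is_zero == 1].mean(), y[is_zero == 0].mean()))
```

One-hot encoding the bin ids turns a non-linear dependence on the raw value into something a linear model can use.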
Feature engineering (contd)
3. Tree-based features using 2 features at a time.
(The binning above used DTs built on only one feature.)
4. Co-Clustering for missing values
Matrix-factorization-like approaches
Bi-Clustering
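Item 3 above might look like the following sketch: a shallow tree is fit on each pair of columns and its out-of-fold predicted probability becomes a new feature. The out-of-fold step, the tree depth, and the helper name `pairwise_tree_features` are assumptions, and the data is synthetic. With thousands of columns the pairs would have to be preselected.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def pairwise_tree_features(X, y, max_depth=3, cv=5):
    """One new feature per column pair: out-of-fold tree probability."""
    X = np.asarray(X, dtype=float)
    columns = []
    for i, j in combinations(range(X.shape[1]), 2):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        # Out-of-fold predictions avoid leaking each row's own label.
        proba = cross_val_predict(
            tree, X[:, [i, j]], y, cv=cv, method="predict_proba"
        )
        columns.append(proba[:, 1])
    return np.column_stack(columns)


# Synthetic data whose label depends on an interaction of columns 0 and 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = ((X_demo[:, 0] * X_demo[:, 1]) > 0).astype(int)

pair_feats = pairwise_tree_features(X_demo, y_demo)
print(pair_feats.shape)
```

A tree over two features can pick up interactions that single-feature binning cannot represent.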
FS1: Given features + cleaning preprocessing
FS2: FS1 + features binned using DTs
FS3: FS2 + all other above feature engineering methods
