3. KDD Cup 2009
Why did we choose this competition?
Problem:
“Predict, from customer data provided by the French Telecom company Orange,
the propensity of customers to switch providers (churn), buy new products or
services (appetency), or buy upgrades or add-ons (up-selling)”
Winner’s solution: http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
4. Dataset
Size: 100,000 data-points in total
Train-Test: Split randomly into equally sized training and test sets.
# Variables: 15,000 variables were made available for prediction.
Categorical vs Numerical: 260 of the variables were categorical; the rest were numerical.
Missing values: Most of the categorical variables and 333 of the continuous
variables had missing values.
Interpretability: To maintain the confidentiality of customers, all variables were
scrambled.
5. Challenges
Slow vs Fast Challenge
● Fast challenge: 5 days
● Slow: 30 days, subset of 230 variables, 40 of which were categorical
Metric: AUC averaged across the 3 prediction tasks
Feedback on submission: AUC on a random 10% of the test set.
6. Preprocessing, Cleaning
Missing Values
● Categorical: treat "missing" as a special additional category.
● Numerical: mean imputation, isMissing feature
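A minimal sketch of the numeric handling above (mean imputation plus an isMissing indicator); the winners' actual code is not public, so the function name is illustrative:

```python
import numpy as np

def impute_numeric(col):
    """Mean-impute a numeric column and emit an isMissing indicator.

    Sketch of the preprocessing described above: NaNs are replaced by
    the column mean of the observed values, and a 0/1 flag records
    which entries were originally missing.
    """
    col = np.asarray(col, dtype=float)
    missing = np.isnan(col)
    mean = col[~missing].mean() if (~missing).any() else 0.0
    filled = np.where(missing, mean, col)
    return filled, missing.astype(int)
```

Categorical columns are simpler: map missing to a dedicated extra category before encoding.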
Categorical to numerical:
● one-hot encoding of only the top 10 values per feature.
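The top-10 one-hot trick keeps the encoded width bounded even for very high-cardinality features; everything outside the top k collapses to the all-zeros row. A minimal sketch (illustrative name, stdlib only):

```python
from collections import Counter

def one_hot_top_k(values, k=10):
    """One-hot encode only the k most frequent categories.

    Any other value (including 'missing' treated as its own category)
    maps to the all-zeros vector, so rare levels share one bucket.
    """
    top = [v for v, _ in Counter(values).most_common(k)]
    index = {v: i for i, v in enumerate(top)}
    rows = []
    for v in values:
        row = [0] * len(top)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return rows, top

rows, top = one_hot_top_k(["a", "a", "b", "b", "c"], k=2)
```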
Feature Normalization
Remove features with constant value for all data points. [13436 features left]
8. Library of Base Models
Overall Strategy: Ensemble models;
Random Forests: with many combinations of params
GBDT: with many combinations of params
Logistic Regression (L1, L2) , SVMs, k-NN, Naive Bayes , Co-Clustering
500-1000 individual models for each of the three problems.
Calibration of each model: Platt Scaling using Logistic function.
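Platt scaling fits a logistic function p = sigmoid(a*s + b) mapping each model's raw scores s to calibrated probabilities, using held-out labels. A minimal numpy sketch fit by gradient descent on log-loss (real implementations, e.g. scikit-learn's sigmoid calibration, use Newton's method and Platt's label smoothing):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, iters=2000):
    """Fit p = sigmoid(a*s + b) to held-out (score, label) pairs.

    Returns a function mapping raw scores to calibrated probabilities.
    """
    s = np.asarray(scores, float)
    y = np.asarray(labels, float)
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        g = p - y                    # gradient of log-loss w.r.t. the logit
        a -= lr * np.mean(g * s)
        b -= lr * np.mean(g)
    return lambda x: 1.0 / (1.0 + np.exp(-(a * np.asarray(x, float) + b)))
```

Calibrating every base model onto a common probability scale is what makes simple averaging of their outputs meaningful in the ensemble step.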
KITCHEN SINK APPROACH
10. Ensemble Selection
1. Initialize the ensemble with the N classifiers that have the best single-model
performance on a held-out set.
2. Add more models one-by-one (like Feature Selection) as long as they
improve the overall performance, even by a bit.
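The two steps above are greedy forward selection over the model library (Caruana et al.'s ensemble selection): start from the best single models, then repeatedly add whichever model most improves the held-out metric of the averaged ensemble, stopping when nothing helps. A minimal sketch (names illustrative):

```python
import numpy as np

def ensemble_selection(preds, y, metric, n_init=3, max_rounds=20):
    """Greedy forward selection of models into an averaged ensemble.

    preds:  dict mapping model name -> held-out prediction vector
    metric: callable(y_true, y_pred) -> score, higher is better
    Selection is with replacement, so a strong model can be added twice.
    """
    scores = {m: metric(y, p) for m, p in preds.items()}
    chosen = sorted(scores, key=scores.get, reverse=True)[:n_init]
    best = metric(y, np.mean([preds[m] for m in chosen], axis=0))
    for _ in range(max_rounds):
        cand, cand_score = None, best
        for m, p in preds.items():
            avg = np.mean([preds[c] for c in chosen] + [p], axis=0)
            sc = metric(y, avg)
            if sc > cand_score:
                cand, cand_score = m, sc
        if cand is None:           # no model improves the ensemble: stop
            break
        chosen.append(cand)
        best = cand_score
    return chosen, best
```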
Results were better than all other competitors on the Fast challenge.
No feature engineering or human expertise used up to this point.
11. Feature engineering to improve the models
1. Binning using Decision Trees
[L1-regularized LR became the best model using these features]
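Decision-tree binning is supervised discretization: a depth-limited tree fit on one feature picks the cut points, and the resulting bin ids become the new feature. A self-contained sketch using Gini impurity (the winners used standard decision trees; this 1-D version is illustrative):

```python
import numpy as np

def dt_bin(x, y, depth=2, min_leaf=1):
    """Return cut points for one numeric feature via a depth-limited
    1-D decision tree: recursively choose the split that minimizes
    weighted Gini impurity. Bin ids: np.searchsorted(cuts, x)."""
    x, y = np.asarray(x, float), np.asarray(y, float)

    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = labels.mean()
        return 2 * p * (1 - p)

    def best_split(xs, ys):
        best_t, best_imp = None, gini(ys)
        for t in np.unique(xs)[:-1]:
            left, right = ys[xs <= t], ys[xs > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
            if imp < best_imp:
                best_t, best_imp = t, imp
        return best_t

    cuts = []

    def recurse(mask, d):
        if d == 0 or mask.sum() < 2 * min_leaf:
            return
        t = best_split(x[mask], y[mask])
        if t is None:
            return
        cuts.append(t)
        recurse(mask & (x <= t), d - 1)
        recurse(mask & (x > t), d - 1)

    recurse(np.ones(len(x), bool), depth)
    return sorted(cuts)
```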
2. Explicit Feature Construction
[The churn-positive rate for rows where a numeric feature equaled 0 was up to
twice the positive rate for all other values]
Single feature AUC=0.62
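The observation above suggests a simple constructed feature: a binary indicator for "value is 0". A trivial sketch (illustrative name):

```python
import numpy as np

def zero_indicator(col):
    """Binary 'value == 0' feature, motivated by the observation that
    zero values carried a much higher churn-positive rate."""
    col = np.asarray(col, float)
    return (col == 0).astype(int)
```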
12. Feature engineering (contd)
3. Tree-based features using 2 features at a time.
(The binning above used decision trees on one feature at a time.)
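The pairwise extension fits a shallow tree on each pair of columns and uses the leaf index each row lands in as a new categorical feature. A sketch assuming scikit-learn's `DecisionTreeClassifier` and its `apply` method (function name is illustrative):

```python
from itertools import combinations

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pair_leaf_features(X, y, max_depth=3):
    """For each pair of columns, fit a shallow decision tree and use
    the leaf index of each row (tree.apply) as a new feature."""
    feats = {}
    for i, j in combinations(range(X.shape[1]), 2):
        cols = X[:, [i, j]]
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(cols, y)
        feats[(i, j)] = tree.apply(cols)   # leaf index per row
    return feats
```

With 15,000 input variables, trying all pairs is infeasible; in practice one would restrict this to pairs of promising features.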
4. Co-Clustering for missing values
Matrix-factorization-like approaches
Bi-Clustering
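One matrix-factorization-style way to fill missing values, as a stand-in for the co-clustering approach named above: mean-fill, reconstruct with a truncated SVD, copy the reconstruction into the missing cells, and iterate. A minimal numpy sketch (illustrative, not the competitors' actual method):

```python
import numpy as np

def lowrank_impute(X, rank=2, iters=200):
    """Fill NaN entries by iterated low-rank approximation:
    mean-fill, truncated-SVD reconstruct, overwrite missing cells,
    repeat until the completion is consistent with a rank-k model."""
    X = np.array(X, float)
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X[miss] = np.take(col_mean, np.where(miss)[1])   # initial fill
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X[miss] = approx[miss]
    return X
```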
13. FS1: Given features + cleaning/preprocessing
FS2: FS1 + binned using DT features
FS3: FS2 + all other above feature engineering methods