3. KDD Cup 2009
Why did we choose this competition?
Problem:
“Predict, from customer data provided by the French Telecom company Orange,
the propensity of customers to switch providers (churn), buy new products or
services (appetency), or buy upgrades or add-ons (up-selling)”
Winner’s solution: http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
4. Dataset
Size: 100,000 data-points in total
Train-Test: Split randomly into equally sized training and test sets.
# Variables: 15,000 variables were made available for prediction.
Categorical vs Numerical: 260 of the variables were categorical; the rest were numerical.
Missing values: Most of the categorical variables and 333 of the continuous
variables had missing values.
Interpretability: To maintain the confidentiality of customers, all variables were
scrambled.
5. Challenges
Slow vs Fast Challenge
● Fast challenge: 5 days
● Slow: 30 days, subset of 230 variables, 40 of which were categorical
Metric: AUC averaged across the 3 prediction tasks
Feedback on submission: AUC on a random 10% of the test set.
6. Preprocessing, Cleaning
Missing Values
● Categorical: treat "missing" as a special additional category.
● Numerical: mean imputation, isMissing feature
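A minimal sketch of the numeric handling above (mean imputation plus an isMissing indicator); the winners' actual code is not public, so the function name is illustrative:

```python
import numpy as np

def impute_numeric(col):
    """Mean-impute a numeric column and emit an isMissing indicator.

    Sketch of the preprocessing described above: NaNs are replaced by
    the column mean of the observed values, and a 0/1 flag records
    which entries were originally missing.
    """
    col = np.asarray(col, dtype=float)
    missing = np.isnan(col)
    mean = col[~missing].mean() if (~missing).any() else 0.0
    filled = np.where(missing, mean, col)
    return filled, missing.astype(int)
```

Categorical columns are simpler: map missing to a dedicated extra category before encoding.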
Categorical to numerical:
● one-hot encoding of only the top 10 values per feature.
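The top-10 one-hot trick keeps the encoded width bounded even for very high-cardinality features; everything outside the top k collapses to the all-zeros row. A minimal sketch (illustrative name, stdlib only):

```python
from collections import Counter

def one_hot_top_k(values, k=10):
    """One-hot encode only the k most frequent categories.

    Any other value (including 'missing' treated as its own category)
    maps to the all-zeros vector, so rare levels share one bucket.
    """
    top = [v for v, _ in Counter(values).most_common(k)]
    index = {v: i for i, v in enumerate(top)}
    rows = []
    for v in values:
        row = [0] * len(top)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return rows, top

rows, top = one_hot_top_k(["a", "a", "b", "b", "c"], k=2)
```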
Feature Normalization
Remove features with constant value for all data points. [13436 features left]
8. Library of Base Models
Overall Strategy: Ensemble models;
Random Forests: with many combinations of params
GBDT: with many combinations of params
Logistic Regression (L1, L2) , SVMs, k-NN, Naive Bayes , Co-Clustering
500-1000 individual models for each of the three problems.
Calibration of each model: Platt Scaling using Logistic function.
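Platt scaling fits a logistic function p = sigmoid(a*s + b) mapping each model's raw scores s to calibrated probabilities, using held-out labels. A minimal numpy sketch fit by gradient descent on log-loss (real implementations, e.g. scikit-learn's sigmoid calibration, use Newton's method and Platt's label smoothing):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, iters=2000):
    """Fit p = sigmoid(a*s + b) to held-out (score, label) pairs.

    Returns a function mapping raw scores to calibrated probabilities.
    """
    s = np.asarray(scores, float)
    y = np.asarray(labels, float)
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        g = p - y                    # gradient of log-loss w.r.t. the logit
        a -= lr * np.mean(g * s)
        b -= lr * np.mean(g)
    return lambda x: 1.0 / (1.0 + np.exp(-(a * np.asarray(x, float) + b)))
```

Calibrating every base model onto a common probability scale is what makes simple averaging of their outputs meaningful in the ensemble step.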
KITCHEN SINK APPROACH
10. Ensemble Selection
1. Initialize the ensemble with the N classifiers that have the best single-model
performance on a held-out set.
2. Add more models one-by-one (like Feature Selection) as long as they
improve the overall performance, even by a bit.
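The two steps above are greedy forward selection over the model library (Caruana et al.'s ensemble selection): start from the best single models, then repeatedly add whichever model most improves the held-out metric of the averaged ensemble, stopping when nothing helps. A minimal sketch (names illustrative):

```python
import numpy as np

def ensemble_selection(preds, y, metric, n_init=3, max_rounds=20):
    """Greedy forward selection of models into an averaged ensemble.

    preds:  dict mapping model name -> held-out prediction vector
    metric: callable(y_true, y_pred) -> score, higher is better
    Selection is with replacement, so a strong model can be added twice.
    """
    scores = {m: metric(y, p) for m, p in preds.items()}
    chosen = sorted(scores, key=scores.get, reverse=True)[:n_init]
    best = metric(y, np.mean([preds[m] for m in chosen], axis=0))
    for _ in range(max_rounds):
        cand, cand_score = None, best
        for m, p in preds.items():
            avg = np.mean([preds[c] for c in chosen] + [p], axis=0)
            sc = metric(y, avg)
            if sc > cand_score:
                cand, cand_score = m, sc
        if cand is None:           # no model improves the ensemble: stop
            break
        chosen.append(cand)
        best = cand_score
    return chosen, best
```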
Results were better than all other competitors on the Fast challenge.
No feature engineering or human expertise used up to this point.
11. Feature engineering to improve the models
1. Binning using Decision Trees
[L1-regularized LR became the best model using these features]
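Decision-tree binning is supervised discretization: a depth-limited tree fit on one feature picks the cut points, and the resulting bin ids become the new feature. A self-contained sketch using Gini impurity (the winners used standard decision trees; this 1-D version is illustrative):

```python
import numpy as np

def dt_bin(x, y, depth=2, min_leaf=1):
    """Return cut points for one numeric feature via a depth-limited
    1-D decision tree: recursively choose the split that minimizes
    weighted Gini impurity. Bin ids: np.searchsorted(cuts, x)."""
    x, y = np.asarray(x, float), np.asarray(y, float)

    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = labels.mean()
        return 2 * p * (1 - p)

    def best_split(xs, ys):
        best_t, best_imp = None, gini(ys)
        for t in np.unique(xs)[:-1]:
            left, right = ys[xs <= t], ys[xs > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
            if imp < best_imp:
                best_t, best_imp = t, imp
        return best_t

    cuts = []

    def recurse(mask, d):
        if d == 0 or mask.sum() < 2 * min_leaf:
            return
        t = best_split(x[mask], y[mask])
        if t is None:
            return
        cuts.append(t)
        recurse(mask & (x <= t), d - 1)
        recurse(mask & (x > t), d - 1)

    recurse(np.ones(len(x), bool), depth)
    return sorted(cuts)
```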
2. Explicit Feature Construction
[The churn-positive rate for rows where a numeric feature equaled 0 was up to
twice the positive rate for all other values]
Single feature AUC=0.62
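The observation above suggests a simple constructed feature: a binary indicator for "value is 0". A trivial sketch (illustrative name):

```python
import numpy as np

def zero_indicator(col):
    """Binary 'value == 0' feature, motivated by the observation that
    zero values carried a much higher churn-positive rate."""
    col = np.asarray(col, float)
    return (col == 0).astype(int)
```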
12. Feature engineering (contd)
3. Tree-based features using 2 features at a time.
(The binning above used decision trees on one feature at a time.)
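The pairwise extension fits a shallow tree on each pair of columns and uses the leaf index each row lands in as a new categorical feature. A sketch assuming scikit-learn's `DecisionTreeClassifier` and its `apply` method (function name is illustrative):

```python
from itertools import combinations

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pair_leaf_features(X, y, max_depth=3):
    """For each pair of columns, fit a shallow decision tree and use
    the leaf index of each row (tree.apply) as a new feature."""
    feats = {}
    for i, j in combinations(range(X.shape[1]), 2):
        cols = X[:, [i, j]]
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(cols, y)
        feats[(i, j)] = tree.apply(cols)   # leaf index per row
    return feats
```

With 15,000 input variables, trying all pairs is infeasible; in practice one would restrict this to pairs of promising features.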
4. Co-Clustering for missing values
Matrix-factorization-like approaches
Bi-Clustering
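One matrix-factorization-style way to fill missing values, as a stand-in for the co-clustering approach named above: mean-fill, reconstruct with a truncated SVD, copy the reconstruction into the missing cells, and iterate. A minimal numpy sketch (illustrative, not the competitors' actual method):

```python
import numpy as np

def lowrank_impute(X, rank=2, iters=200):
    """Fill NaN entries by iterated low-rank approximation:
    mean-fill, truncated-SVD reconstruct, overwrite missing cells,
    repeat until the completion is consistent with a rank-k model."""
    X = np.array(X, float)
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X[miss] = np.take(col_mean, np.where(miss)[1])   # initial fill
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X[miss] = approx[miss]
    return X
```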
13. FS1: Given features + cleaning/preprocessing
FS2: FS1 + binned using DT features
FS3: FS2 + all other above feature engineering methods