Presentation at the SAS Analytics Conference 2014
Predictive analytics has been applied to solve a wide range of real-world problems. Nevertheless, current state-of-the-art predictive analytics models are not well aligned with business needs, because they do not incorporate the real financial costs and benefits during the training and evaluation phases. In churn modeling, for example, a standard model does not yield the best results once performance is measured by the investment per subscriber in a loyalty campaign and the financial impact of failing to detect a churner versus wrongly flagging a non-churner. This presentation shows how a cost-sensitive modeling approach leads to better results in terms of both profitability and predictive power, and how the same idea applies to many other business challenges.
This document summarizes a presentation on maximizing profit from customer churn prediction models using cost-sensitive machine learning techniques. It discusses how traditional evaluation measures like accuracy do not account for different costs of prediction errors. It then covers cost-sensitive approaches like cost-proportionate sampling, Bayes minimum risk, and cost-sensitive decision trees. The results show these cost-sensitive methods improve savings over traditional models and sampling approaches when the business costs are incorporated into the predictive modeling.
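As a concrete illustration of this evaluation idea (a minimal sketch, not the presentation's own code), the snippet below computes an example-dependent "savings" score: the model's total cost compared against the cheaper of the two trivial policies (flag everyone or flag no one). The function names and the four per-example cost vectors are illustrative assumptions; the costcla library linked below ships a comparable savings metric.

```python
import numpy as np

def total_cost(y_true, y_pred, cost_fp, cost_fn, cost_tp, cost_tn):
    """Sum example-dependent costs; each cost_* argument is a per-example vector."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sum(y_true * (y_pred * np.asarray(cost_tp) + (1 - y_pred) * np.asarray(cost_fn))
                  + (1 - y_true) * (y_pred * np.asarray(cost_fp) + (1 - y_pred) * np.asarray(cost_tn)))

def savings_score(y_true, y_pred, cost_fp, cost_fn, cost_tp, cost_tn):
    """Fraction of cost saved relative to the cheaper trivial policy (all 0s or all 1s)."""
    zeros, ones = np.zeros(len(y_true), dtype=int), np.ones(len(y_true), dtype=int)
    cost_base = min(total_cost(y_true, zeros, cost_fp, cost_fn, cost_tp, cost_tn),
                    total_cost(y_true, ones, cost_fp, cost_fn, cost_tp, cost_tn))
    return (cost_base - total_cost(y_true, y_pred, cost_fp, cost_fn, cost_tp, cost_tn)) / cost_base
```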
Slides of the paper http://arxiv.org/abs/1505.04637
Source code is available at https://github.com/albahnsen/CostSensitiveClassification/blob/master/costcla/models/cost_tree.py#L15
Abstract:
Several real-world classification problems are example-dependent cost-sensitive in nature, where the costs due to misclassification vary between examples and not only between classes. However, standard classification methods do not take these costs into account and assume a constant cost of misclassification errors. Previous work has proposed methods that incorporate the financial costs into the training of different algorithms, with the example-dependent cost-sensitive decision tree algorithm yielding the highest savings. In this paper we propose a new framework of ensembles of example-dependent cost-sensitive decision trees. The framework consists of creating different example-dependent cost-sensitive decision trees on random subsamples of the training set and then combining them using three different combination approaches. Moreover, we propose two new cost-sensitive combination approaches: cost-sensitive weighted voting and cost-sensitive stacking, the latter based on the cost-sensitive logistic regression method. Finally, using five databases from four real-world applications, namely credit card fraud detection, churn modeling, credit scoring, and direct marketing, we evaluate the proposed method against state-of-the-art example-dependent cost-sensitive techniques: cost-proportionate sampling, Bayes minimum risk, and cost-sensitive decision trees. The results show that the proposed algorithms produce better results for all databases, in the sense of higher savings.
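To make the weighted-voting combination concrete, here is a rough sketch under stated simplifications: plain scikit-learn trees trained with cost-proportionate sample weights stand in for the paper's cost-based splitting criterion, and each tree is weighted by its savings on the examples left out of its subsample. All names are illustrative assumptions; the actual implementation lives in the costcla repository linked above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_cs_weighted_voting(X, y, costs, n_trees=10, frac=0.5, seed=0):
    """Train trees on random subsamples; weight each by its out-of-sample savings.
    `costs` is a single per-example misclassification cost vector (NumPy arrays)."""
    rng, n = np.random.default_rng(seed), len(y)
    trees, weights = [], []
    for _ in range(n_trees):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        oob = np.setdiff1d(np.arange(n), idx)
        tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(
            X[idx], y[idx], sample_weight=costs[idx])
        miss = tree.predict(X[oob]) != y[oob]
        cost_model = np.sum(costs[oob] * miss)
        # baseline: the cheaper trivial policy on the held-out examples
        cost_base = min(np.sum(costs[oob][y[oob] == 1]), np.sum(costs[oob][y[oob] == 0]))
        weights.append(max((cost_base - cost_model) / max(cost_base, 1e-9), 1e-9))
        trees.append(tree)
    return trees, np.asarray(weights) / np.sum(weights)

def predict_cs_weighted_voting(trees, weights, X):
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return (weights @ votes >= 0.5).astype(int)      # weighted majority vote
```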
Slides from my PhD defense
Example-Dependent Cost-Sensitive Classification
Applications in Financial Risk Modeling and Marketing Analytics
https://github.com/albahnsen/phd-thesis
This document discusses causal inference techniques for machine learning, including:
- Correlation does not imply causation, and observational data can be biased by confounding variables. Randomization and counterfactual modeling are introduced as alternatives.
- Inverse propensity scoring is presented as a method for estimating treatment effects from observational data by reweighting samples based on their propensity to receive treatment (a minimal sketch follows this list).
- Instrumental variable regression is discussed as another technique, using variables that influence the treatment but not the outcome except through treatment. Scalable methods for instrumental variable regression on large datasets are proposed.
- Challenges with weak instruments are noted: when instruments are weak, instrumental variable estimates can become even more biased than purely correlational models.
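A minimal sketch of the inverse-propensity-weighting estimator mentioned above, assuming NumPy arrays and a logistic-regression propensity model (both are illustrative choices, not the document's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """IPW estimate of the average treatment effect.
    X: covariates, t: binary treatment indicator, y: observed outcome."""
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    e = np.clip(e, 0.01, 0.99)  # clip extreme propensities to stabilise the weights
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```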
Beyond Churn Prediction: An Introduction to Uplift Modeling, by Pierre Gutierrez
These slides are from a talk I gave at the PAPIs conference in Boston in 2016. The main subject is uplift modelling. Starting from a churn model approach for an e-gaming company, we introduce when to apply uplift methods, how to mathematically model them, and finally, how to evaluate them.
I tried to bridge the gap between causal inference theory and uplift theory, especially concerning how to properly cross-validate the results. The notation used is the one from uplift modelling.
These slides are from a talk I gave at Google Campus Madrid for the Machine Learning Meetup. The main subject is uplift modelling. Starting from a churn model approach for an e-gaming company, we introduce when to apply uplift methods, how to mathematically model them, and finally, how to evaluate them.
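The talks' exact models are not reproduced here, but a common baseline they build on is the "two-model" uplift approach: fit one outcome model on the treated group and one on the control group, then score the difference. A minimal sketch, with all names and the random-forest choice as assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def two_model_uplift(X_tr, y_tr, treated_tr, X_new):
    """Uplift estimate: P(outcome | treatment) - P(outcome | control). NumPy inputs."""
    m_t = RandomForestClassifier(n_estimators=200, random_state=0).fit(
        X_tr[treated_tr == 1], y_tr[treated_tr == 1])
    m_c = RandomForestClassifier(n_estimators=200, random_state=0).fit(
        X_tr[treated_tr == 0], y_tr[treated_tr == 0])
    return m_t.predict_proba(X_new)[:, 1] - m_c.predict_proba(X_new)[:, 1]
```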
Naive Bayes is a classification algorithm suitable for both binary and multiclass classification. It performs well with categorical input variables compared to numerical ones, and is useful for making predictions and forecasting based on historical data.
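For instance, a minimal scikit-learn sketch of Naive Bayes on categorical inputs (the toy data and labels are invented for illustration):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X = [["basic", "monthly"], ["premium", "annual"], ["basic", "annual"], ["premium", "monthly"]]
y = [1, 0, 0, 1]  # toy labels, e.g. churned vs. stayed

enc = OrdinalEncoder()  # CategoricalNB expects integer-coded categories
model = CategoricalNB().fit(enc.fit_transform(X), y)
print(model.predict(enc.transform([["basic", "monthly"]])))
```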
Random Forest classification is a machine learning technique that aggregates the outcomes of many decision tree classifiers to improve the precision of the result. It models the relationship between a categorical target variable and one or more independent variables.
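A minimal scikit-learn sketch with synthetic data (all parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```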
This paper proposes using a "shrinkage" estimator as an alternative to the traditional sample covariance matrix for portfolio optimization. The shrinkage estimator combines the sample covariance matrix with a structured "shrinkage target" using a shrinkage constant to minimize distance from the true covariance matrix. The paper finds this shrinkage estimator significantly increases the realized information ratio of active portfolio managers compared to the sample covariance matrix. An empirical study on historical stock return data confirms the shrinkage method leads to higher ex post information ratios in portfolio optimization. However, the shrinkage target assumes identical pairwise correlations that may not fully reflect market characteristics.
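The shrinkage idea is Sigma_shrunk = delta * F + (1 - delta) * S, where S is the sample covariance, F the structured target, and delta the shrinkage constant. scikit-learn ships a Ledoit-Wolf estimator; note that it shrinks toward a scaled identity rather than the constant-correlation target discussed in the paper, so this is only an illustrative stand-in:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
returns = rng.normal(size=(60, 25))   # toy data: 60 periods x 25 assets

lw = LedoitWolf().fit(returns)
print("shrinkage constant delta:", lw.shrinkage_)
sigma = lw.covariance_                # shrunk covariance, usable in portfolio optimization
```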
Prediction of customer propensity to churn - Telecom Industry, by Pranov Mishra
- A logistic regression model was found to best predict customer churn with the highest AUC and accuracy.
- The top variables increasing churn risk were credit class, handset price, average monthly calls, billing adjustments, household subscribers, call waiting ranges, and dropped/blocked calls.
- Cost and billing variables like charges and usage were significant, validating an independent survey.
- A lift chart showed that targeting the highest-risk 30% of customers could identify 33% of potential churners. The model allows prioritizing retention efforts on the 20% riskiest customers (a minimal lift computation is sketched below).
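A minimal sketch of the computation behind such a lift chart (names are illustrative):

```python
import numpy as np

def churners_captured(y_true, scores, frac=0.3):
    """Share of all churners found in the top `frac` of customers by predicted risk."""
    y_true = np.asarray(y_true)
    top = np.argsort(scores)[::-1][: int(frac * len(y_true))]
    return y_true[top].sum() / y_true.sum()
```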
The document provides an introduction to decision trees. It defines decision trees as visual representations of choices, consequences, probabilities, and opportunities that break down complicated situations into easier to understand scenarios. The document outlines the steps to create a decision tree, including drawing the diagram, using quantitative data like payoffs and probabilities, and calculating expected values. An example decision tree is provided about a factory owner deciding whether to expand. The key aspects of decision trees are summarized, including how they can help structure sequential decision problems and encourage clear thinking.
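The expected-value arithmetic at each chance node is simple; a toy version of the expand-or-not decision (numbers invented for illustration):

```python
# Expected value of each branch = sum over states of payoff * probability.
p_high = 0.6  # assumed probability of high demand
ev_expand = p_high * 500_000 + (1 - p_high) * -200_000   # = 220,000
ev_stay   = p_high * 150_000 + (1 - p_high) * 100_000    # = 130,000
print("expand" if ev_expand > ev_stay else "do not expand")
```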
BigData Republic teamed up with VodafoneZiggo and hosted a meetup on churn prediction.
Telecom companies like VodafoneZiggo have long benefited from the fine art/science of predicting churn. Currently, in the booming age of subscription-based business models (e.g. Netflix, Spotify, HelloFresh), the importance of predicting churn has become widespread. During this event, VodafoneZiggo shared some of its wisdom with the public, after which BDR Data Scientist Tom de Ruijter presented an overview of the modeling tools at hand, covering both classical and novel approaches. Finally, the participants engaged in a hands-on session showcasing the implementation of different approaches.
PART 1 — Churn Prediction in Practice by Florian Maas
At VodafoneZiggo we are incredibly excited about Advanced Analytics and the enormous potential for progress and innovation. In our state of the art open source platform we store the tremendous amount of data that is generated every single second in our mobile and fixed networks. This means that we have a vast body of rich information, which if unlocked, can lead to something very special. As a company with a primarily subscription-based service model, churn plays a vital role in the daily business. Not only is the churn rate a good indicator of customer (dis)satisfaction, it is also one out of two factors that determines the steady-state level of active customers. During this talk, we will show how data science provides added value in the process of churn prevention at VodafoneZiggo. We will talk about the data and the modeling approach we use, and the pitfalls and shortcomings that we have encountered while building the model. We will also briefly discuss potential improvements to the current approach, which brings us to talk #2.
PART 2 — The Churn Prediction Toolbox by Tom de Ruijter
The second talk will show you the fine intricacies of predicting churn through different approaches. We’ll start off with an overview of different modeling strategies for describing the problem of churn, both in terms of a classification problem as well as a regression problem. Secondly, Tom will give you insights in how you evaluate a churn model in a way such that business stakeholders know how to act upon the model results. Finally, we’ll work towards the hands-on session demonstrating different model approaches for churn prediction, ranging from classical time series prediction to recurrent neural networks.
Python and the Holy Grail of Causal Inference - Dennis Ramondt, Huib Keemink (PyData)
The document discusses various challenges and methods for causal inference from observational data. It begins with two use cases - estimating the savings from installing heat pumps and the profit increase from placing beer coolers in stores. Both experiments fail standard assumptions as the test and control groups are statistically different. The document then covers methods for estimating average treatment effects such as propensity score matching and regression adjustment. It also discusses estimating individual treatment effects using techniques like honest forests and counterfactual regression that learn balanced representations of the data. The goal is to remove bias from differences between treated and untreated groups to infer valid causal effects.
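As one concrete example of the average-treatment-effect methods mentioned, here is a minimal propensity-score-matching sketch (1-nearest-neighbour on the propensity, NumPy inputs; the function and model choices are illustrative assumptions, not the talk's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_att(X, t, y):
    """Average treatment effect on the treated via 1-NN propensity matching."""
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(e[control].reshape(-1, 1))
    _, match = nn.kneighbors(e[treated].reshape(-1, 1))
    return np.mean(y[treated] - y[control][match.ravel()])
```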
Hierarchical clustering is a process by which objects are classified into a number of groups so that objects in different groups are as dissimilar as possible, while objects within the same group are as similar as possible. This technique can help an enterprise organize data into groups to identify similarities and, equally important, dissimilarities, so the business can target pricing, products, services, marketing messages, and more.
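A minimal agglomerative-clustering sketch with SciPy (synthetic data; Ward linkage is one common choice):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
Z = linkage(X, method="ward")                     # bottom-up merge tree (dendrogram data)
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 groups
```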
This document presents an example decision problem to demonstrate decision tree analysis. It describes three potential decisions - expand, maintain status quo, or sell now - under two possible future states, good or poor foreign competitive conditions. It then outlines the steps to analyze the problem: 1) determine the best decision without probabilities using various criteria, 2) determine the best decision with probabilities using expected value and opportunity loss, 3) compute the expected value of perfect information, and 4) develop a decision tree showing expected values at each node.
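Steps 2 and 3 reduce to a few lines of arithmetic. A sketch with invented payoffs (each decision maps to its payoff under good and poor conditions) showing the expected-value choice and the expected value of perfect information (EVPI):

```python
# payoff table: decision -> (payoff if good conditions, payoff if poor conditions)
payoffs = {"expand": (800_000, 500_000),
           "status quo": (1_300_000, -150_000),
           "sell now": (320_000, 320_000)}
p_good, p_poor = 0.7, 0.3  # assumed state probabilities

ev = {d: p_good * g + p_poor * q for d, (g, q) in payoffs.items()}
best = max(ev, key=ev.get)                                # status quo: 865,000
ev_perfect = (p_good * max(g for g, _ in payoffs.values())
              + p_poor * max(q for _, q in payoffs.values()))
print(best, ev[best], "EVPI:", ev_perfect - ev[best])     # EVPI = 195,000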
As part of our team's enrollment in the Data Science Super Specialization course at UpX Academy, we submitted several projects for our final assessments, one of which was a Telecom Churn Analysis Model.
The input data was provided by UpX Academy and the language we used is R. The project's main objectives were:
-> To predict customer churn.
-> To highlight the main variables/factors influencing customer churn.
-> To use various ML algorithms to build prediction models and evaluate their accuracy and performance.
-> To find the best model for our business case and provide an executive summary.
To address this business problem, we followed a thorough approach, starting with a detailed exploratory data analysis consisting of various box plots, bar plots, etc.
We then built as many classification models as fit our business case (logistic regression, kNN, decision trees, random forest, SVM), also touched on a Cox proportional hazards survival model, and afterwards applied various tuning techniques to boost each model's performance.
As we are all still learning these concepts, please feel free to provide feedback on our work. Any suggestions are most welcome. :)
Thanks!!
The document describes an advanced analytics project using SAS to optimize a collection agency's dialing strategy. The objectives were to 1) identify characteristics of consumers likely to pay debts, 2) determine the best times to call consumers, and 3) allocate resources to maximize ROI. Models were built using variables like age, marital status, number of debts, TU score, and occupation to predict likelihood of payment and phone answering. The results identified important variables and improved segmentation to save 20% of calls while achieving targets.
The importance of this type of research in the telecom market lies in helping companies increase profit.
It has become clear that predicting churn is one of the most important ways for telecom companies to protect their income.
Hence, this research aimed to build a system that predicts customer churn in a telecom company.
These prediction models need to achieve high AUC values. To train and test the model, the sample data is divided into 70% for training and 30% for testing.
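A minimal sketch of that protocol: a stratified 70/30 split on imbalanced synthetic data, scored by AUC (all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.85], random_state=0)  # churn-like imbalance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```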
Decision tree analysis as a statistical tool. The deck provides an understanding of decision analysis, with practical applications and limited theory. It will be useful for MBA students.
Predictive modeling is a process used in predictive analytics to create statistical models that can forecast future outcomes based on historical data. Predictive analytics uses techniques from data mining, statistics, modeling, machine learning and AI to analyze current data to predict future events. The predictive modeling process involves collecting data, creating a model, testing and validating the model, and evaluating model performance. Predictive models are commonly used to predict customer behavior, risk levels, and other business outcomes.
Predictive analytics uses past data to forecast future outcomes. The document discusses various predictive analytics techniques including simple forecasting methods, decision trees, and regression. Simple forecasting techniques like moving averages are easiest to implement but lack explanatory power, while decision trees and regression provide more accurate predictions at an individual level but require more complex deployment. The key is selecting the right technique based on the problem, data, and ability to implement predictive models in real-world applications.
Predictive Model for Loan Approval Process using SAS 9.3_M1, by Akanksha Jain
This is a predictive model that uses logistic regression to statistically support better loan approval decisions for a German bank. It uses a historical credit dataset with 1000 data points and 20 variables.
Tool used:
SAS 9.3_M1
Steps involved (a rough Python analogue follows the list):
- Data Quality check using Correlations and VIF Tests
- Analysis of different Variable Selection Methods such as Forward, Backward and Stepwise
- Variable Selection on the basis of Parameter Estimates and Odds Ratio
- Outlier Analysis to identify the outliers and improve the model
- Final Model Selection Decision based on ROC curve, Percent Concordant, PROC Rank and Hosmer Lemeshow Test
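Here is a sketch of the forward-selection and ROC steps, with scikit-learn standing in for SAS (synthetic data and all parameters are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# forward stepwise variable selection, analogous to SELECTION=FORWARD in PROC LOGISTIC
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=6, direction="forward").fit(X_tr, y_tr)
keep = sfs.get_support()
model = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)
print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te[:, keep])[:, 1]))
```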
Automation of IT Ticket Automation using NLP and Deep Learning, by Pranov Mishra
Overview of the problem solved: IT relies on the incident management process to ensure business operations are never impacted. In many IT organizations, assigning incidents to the appropriate IT groups is still a manual process. Manual assignment is time-consuming, requires human effort, is prone to human error, and wastes resources when incidents are misrouted. It also increases response and resolution times, which degrades user satisfaction and customer service.
Solution: multiple sequential deep learning models with GloVe embeddings were tried and their results compared to arrive at the best model. The two best models are highlighted below through their results.
1. A bidirectional LSTM trained on the dataset achieved 71% accuracy and 71% precision.
2. Accuracy and precision improved further to 73% and 76%, respectively, with an ensemble of 7 Bi-LSTMs.
I built an NLP-based deep learning model to solve this problem; the code is linked below, and a minimal architecture sketch follows.
https://github.com/Pranov1984/Application-of-NLP-in-Automated-Classification-of-ticket-routing
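A minimal Keras sketch of the bidirectional-LSTM classifier described above (sizes are invented; in the real project the embedding layer would be initialised with GloVe weights):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, EMB_DIM, N_CLASSES = 20_000, 100, 50    # illustrative sizes

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, EMB_DIM),           # load pretrained GloVe vectors here
    layers.Bidirectional(layers.LSTM(64)),      # reads the ticket text in both directions
    layers.Dropout(0.3),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```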
"Multilayer perceptron (MLP) is a technique of feed
forward artificial neural network using back
propagation learning method to classify the target
variable used for supervised learning. It consists of multiple layers and non-linear activation allowing it to distinguish data that is not linearly separable."
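A minimal scikit-learn MLP on a dataset that is deliberately not linearly separable, illustrating the point about non-linear activations (all parameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)  # two interleaved half-moons
clf = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))  # well above a linear model's ceiling
```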
Presentation on Modern Data Science
Data scientists are in high demand, and there is simply not enough talent to fill the jobs. Why? Because the sexiest job of the 21st century requires a broad, multidisciplinary mix of skills at the intersection of mathematics, statistics, computer science, communication, and business. Finding a data scientist is hard. Finding people who understand who a data scientist is, is equally hard.
Watch the video (in Spanish) here: https://www.youtube.com/watch?v=R3jeBHLLiiM
Analytics: Competing in the Information Age
In recent years the world has entered the information age; the evolution of technology and the growth of social networks have allowed companies to gather more information about their customers' behavior. Additionally, systems have become more efficient and cheaper, giving companies the opportunity to store large amounts of data. However, all this stored information by itself generates no added value for companies; the question, then, is how to profit from the collected data and make better-informed decisions. The answer is analytics.
Analytics is the use of methods and tools to understand information and make more accurate decisions. It allows companies to predict behavior, identify potential customers, create intelligent segmentations, target advertising campaigns efficiently, identify risks, and anticipate market changes. Through analytics, companies differentiate themselves from their competitors and better understand their customers' needs.
Alejandro Correa Bahnsen
An industrial engineer with a master's degree in industrial engineering from the Universidad de los Andes, and a PhD candidate in artificial intelligence at the University of Luxembourg. He currently works at Cetrel, the largest credit card processor in Europe, developing an intelligent fraud prevention system.
He has taught analytics and econometrics at the University of Luxembourg and the Universidad de los Andes, respectively, and has spoken on analytics at SAS Analytics (Orlando, Las Vegas, London), SAS Global Forum (Orlando, San Francisco), the IEEE International Conference on Data Mining (Vancouver, Brussels), the IEEE International Conference on Machine Learning and Applications (Miami), and the European Conference on Data Analysis (Luxembourg). He is the founder of the Data Science Luxembourg community and an organizer of the IEEE Data Mining Case Studies workshop (Dallas).
Credit card fraud is a growing problem that affects cardholders around the world. Fraud detection has been an interesting topic in machine learning. Nevertheless, current state-of-the-art credit card fraud detection algorithms fail to include the real costs of credit card fraud as a measure for evaluating algorithms. In this paper, a new comparison measure that realistically represents the monetary gains and losses due to fraud detection is proposed. Moreover, using the proposed cost measure, a cost-sensitive method based on Bayes minimum risk is presented. This method is compared with state-of-the-art algorithms and shows improvements of up to 23% measured by cost. The results of this paper are based on real-life transactional data provided by a large European card processing company.
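The Bayes minimum risk decision rule amounts to comparing expected costs rather than thresholding at 0.5. A minimal sketch under a simplified cost setup (flagging costs a fixed administrative amount; missing a fraud costs the transaction amount; both values are illustrative, not the paper's calibration):

```python
def bayes_minimum_risk(p_fraud, amount, cost_admin=5.0):
    """Flag the transaction iff the expected cost of flagging is below
    the expected cost of letting it through."""
    risk_flag = cost_admin          # investigation cost, paid whether or not it is fraud
    risk_pass = p_fraud * amount    # expected loss if the transaction goes through
    return risk_pass > risk_flag

print(bayes_minimum_risk(p_fraud=0.04, amount=500.0))  # True: 0.04 * 500 = 20 > 5
```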
Online fraud costs the global economy more than $400 billion, with more than 800 million personal records stolen in 2013 alone. Increasingly, fraud has diversified to different digital channels, including mobile and online payments, creating new challenges as innovative fraud patterns emerge. Hence it is still a challenge to find effective methods to mitigate fraud. Existing solutions include simple if-then rules and classical machine learning algorithms.
From an academic perspective, credit card fraud detection is a standard classification problem, in which historical transaction data is used to predict future frauds. However, practical aspects make the problem more complex. Indeed, existing comparison measures lack a realistic representation of monetary gains and losses, which is necessary for effective fraud detection. Moreover, there is an enormous volume of transactions of which only a tiny fraction are frauds, which implies a huge class imbalance. Additionally, a real fraud detection system is required to give a response in milliseconds, a criterion that needs to be taken into account in the modeling process for the system to be successfully implemented. To address these problems, this presentation compares two recently proposed algorithms: Bayes minimum risk and the example-dependent cost-sensitive decision tree. These methods are compared with state-of-the-art algorithms and show significant improvements measured by financial savings.
Fraud Analytics: Fraud detection and prevention in the Big Data era
During 2012, credit card fraud reached 11.3 billion dollars, an increase of almost 15% compared to 2011, which shows the problem fraud represents not only for financial institutions but also for society. Traditionally, fraud prevention consisted of physically protecting the infrastructure; however, with ever more payment methods and channels, financial information has become increasingly susceptible to theft. The next option for preventing and controlling fraud is to determine whether a transaction is being made by the customer in accordance with their historical behavior patterns. This is the focus of Fraud Analytics.
This presentation shows how Fraud Analytics can determine the probability that a transaction was or was not made by the customer, using customers' purchase information, their interactions with the financial institution, and social network analysis. Additionally, the results of commonly used decision rules and advanced artificial intelligence models will be discussed and compared.
-------------------------------------
Alejandro Correa Bahnsen
-------------------------------------
An industrial engineer with a master's degree in industrial engineering from the Universidad de los Andes, and a PhD candidate in artificial intelligence at the University of Luxembourg. He currently works at SIX, one of the largest credit card processors in Europe, developing an intelligent fraud prevention system.
He has taught analytics and econometrics at the University of Luxembourg and the Universidad de los Andes, respectively, and has spoken on analytics at SAS Analytics (Orlando, Las Vegas, London, Frankfurt), SAS Global Forum (Orlando, San Francisco), the IEEE International Conference on Data Mining (Vancouver, Brussels, Dallas), the IEEE International Conference on Machine Learning and Applications (Miami, Detroit), and the European Conference on Data Analysis (Luxembourg). He is the founder of the Data Science Luxembourg community and an organizer of the IEEE Data Mining Case Studies workshop (Dallas).
As the technical skills and costs associated with deploying phishing attacks decrease, we are witnessing an unprecedented level of scams, pushing the need for better methods to proactively detect phishing threats. In this work, we explored the use of URLs as input for machine learning models applied to phishing site prediction. We compared a feature-engineering approach followed by a random forest classifier against a novel method based on recurrent neural networks. We determined that the recurrent neural network approach provides an accuracy rate of 98.7%, without any need for manual feature creation, beating the random forest method by 5%. This makes it a scalable and fast-acting proactive detection system that does not require full content analysis.
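Architecturally, the recurrent approach consumes the raw URL as a character sequence, with no hand-crafted features. A minimal Keras sketch (layer sizes are invented; the paper describes the actual architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB = 100  # size of the character vocabulary (illustrative)

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, 32),            # learned character embeddings
    layers.LSTM(128),                       # reads the URL character by character
    layers.Dense(1, activation="sigmoid"),  # P(phishing)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```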
Slides of my Pycon 2017 short talk "Demystifying machine learning using lime"
Jupyter Notebook with code in https://github.com/albahnsen/Talk_Demystifying_Machine_Learning
This is an overview of Tangerine Lab's capabilities and services. Tangerine Lab is a customer experience design agency that employs design thinking and service design methods to create enchanting customer experiences for brands. Our designs are enabled by emerging technologies and fueled by quantitative insights and customer analytics.
A Case Study in Predictive Modeling: How One Firm Achieved Dramatic Results w..., by Senturus
Case study on how a large U.S. telecom company used predictive analytics to increase its effectiveness and bottom line. View the webinar video recording and download this deck: http://www.senturus.com/resource-video/predictive-analytics-case-study-demo/?rId=3504.
We also demystify predictive analytics and define predictive models, including how to develop and apply them to create measurable Return on Investment (ROI). This session includes a demonstration of IBM SPSS Modeler.
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
The successful analytics organization - Epsilon and Transamerica, LIMRA Data ..., by Epsilon Marketing
Epsilon and Transamerica recently co-presented The successful analytics organization at the LIMRA Big Data Analytics Conference. The session was well attended and thought-provoking.
In paid search, the majority of campaign processes can be automated with software, scripts, and apps. But which processes are better trusted to machines, and which processes require skilled human analysis for high ROI? Find out here.
Replay here: http://www.roirevolution.com/promos/replay-webinar-mastering-PPC-automation.php
This document provides an overview of predictive analytics for non-programmers. It defines predictive analytics as extracting data from existing datasets to identify trends and patterns which are then used to predict future outcomes. The document discusses why predictive analytics is useful for improving decision making and gaining a competitive advantage. It also outlines common predictive analytic models like classification, clustering, time series forecasting and associative rule mining. Finally, the document lists some popular tools for predictive analytics and concludes with a brief demo.
How to Use Data for Product Decisions, by a YouTube Product Manager (Product School)
With millions of new data points every moment, how are Product Managers expected to make sense of it all? This talk outlined the steps required to distill and synthesize data to drive actionable product decisions. The most effective Product Managers are those who know their data: they can justify product priority and roadmap changes, calibrate resource asks and manage their own time more effectively. This lecture equipped the audience with the tools necessary to draw insight from unstructured data using Google’s cloud analytics suite.
BDAS-2017 | Maximizing a churn campaign’s profitability with cost sensitive m..., by Big-Data-Summit
Customer churn prediction models try to predict the probability that a customer will defect from the company, analyzing their historical behavior and socio-economic information. This tool makes it possible to maximize the results of retention campaigns. Current state-of-the-art classification algorithms are not well aligned with business objectives, in the sense that the models do not include the real financial costs and benefits during the training and evaluation stages. In this presentation, a new cost-sensitive methodology for customer churn prediction is shown. We first propose a new financial measure to evaluate the effectiveness of a churn campaign, taking into account the available portfolio of offers, their individual financial cost, and the probability of offer acceptance depending on the customer's profile. Then, using a real-world churn dataset, we compare different classification algorithms and measure their effectiveness based on both their predictive power and their cost optimization. The results show that using a cost-sensitive approach yields an increase in cost savings of up to 26.4%.
The metrics that matter: using scalability metrics for project planning of a d..., by Mary Chan
Have you expanded your organization across multiple locations, or are you a client that utilizes external partners for outsourcing services? Both come with a "cost savings" challenge, where cost-savings analysis is often heavily scrutinized. However, in the grand scheme of your organization, is it a metric that really matters? See actual analytics from multiple game projects and why cost savings isn't as important a metric when making informed decisions about project planning for scalable and distributed development. It's all about the Metrics that Matter.
This complete deck can be used to present to your team. It has PPT slides on various topics highlighting the core areas of your business needs. The deck focuses on an analytics roadmap for developing a management platform and automation framework for technological businesses, and has professionally designed templates with suitable visuals and appropriate content. It consists of twelve slides in total. All the slides are completely customizable for your convenience: you can change the colour, text, and font size, and add or delete content as needed. Get access to this professionally designed presentation by clicking the download button below. https://bit.ly/2H0jHXR
This presentation given by Think Big's senior data scientist Eliano Marques at Digital Natives conference in Berlin, Germany (November 2015), details how to go from experimentation to productionization for a predictive maintenance use case.
"From Insights to Production with Big Data Analytics", Eliano Marques, Senior...Dataconomy Media
"From Insights to Production with Big Data Analytics", Eliano Marques, Senior Data Scientist at ThinkBig, a Teradata Company
YouTube Link: https://www.youtube.com/watch?v=caTyh1KflsI
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the author:
Eliano is an analytics professional who combines data science skills with leadership, vision, creativity, project management, and team building/management, acquired through academia, personal research, leading internal data science/advanced modelling teams, and providing consulting services to customers in several industries (including manufacturing, utilities, telcos, financial services, hospitality, and sports).
Marketing analytics
PREDICTIVE ANALYTICS AND DATA SCIENCE CONFERENCE (MAY 27-28)
Surat Teerakapibal, Ph.D.
Lecturer, Department of Marketing
Program Director, Doctor of Philosophy Program in Business Administration
Value Chain Analysis Framework PowerPoint Presentation Slides, by SlideTeam
Use value chain analysis framework PowerPoint Presentation Slides to deliver best of the products to the consumers. SlideTeam presents you logistics management PPT presentation so that you follow step by step guidance to identify firm’s primary and support activities that add value to its final product. Incorporate value chain analysis framework PPT slideshow to help your organization identify key areas of improvement to create value for the customers. This professionally designed value chain analysis complete presentation deck covers topic like value chain analysis framework, primary activities, support activities, competitive advantage type, cost and differentiation advantage, 3 step value chain analysis process, and more. This PPT presentation has touched all the aspects of the value chain for you to conduct this process easily. Include these content- ready presentation to increase the efficiency and deliver maximum value to the customers. This is an extensive process and this ready-to-use value chain complete presentation explains the framework explicitly. Get on blogs with our Value Chain Analysis Framework PowerPoint Presentation Slides. Everyone will have a comment.
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ..., by Sagar Deogirkar
Comparing the performance of state-of-the-art deep learning and machine learning algorithms, using TF-IDF vectors, for sentiment analysis on the airline Twitter dataset.
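A minimal sketch of the machine-learning side of such a comparison: a TF-IDF pipeline with a linear classifier (the tweets, labels, and model choice are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["flight delayed again, terrible service", "crew was friendly and helpful"]
labels = ["negative", "positive"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["worst airline experience ever"]))
```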
Organizations are looking to maximize the value and return on their investments in SAP solutions. Value realization is Accenture's approach to helping clients define, manage, and capture business value from transformation projects. It provides ongoing processes to ensure projects are oriented towards business impacts and delivering targeted performance improvements. Accenture has tools like the Business Optimization Seeker for SAP to assess post-implementation opportunities for additional value through areas like inventory, purchasing, sales, and finance optimizations.
Paradigms of trading strategies formulation, by QuantInsti
The webinar aims to look at trading strategies from different perspectives, giving the audience the metrics to formulate and evaluate a strategy based on the paradigms that suit their trading style. We have often seen that when the same strategy is used by two different traders, the results are quite different. What causes this difference is the theme of this webinar.
This Workshop Teaches Business Leaders How To Implement AI Technologies To Serve Customers Better Than Anybody Else.
AGENDA
Introduction to Artificial Intelligence
Extracting Value & Delivering Value
Predictive & Preventive maintenance
Marine market, Jet engines
How to prepare & implement AI Playbook
The document discusses the importance of defining an analytics strategy for startups. It explains that analytics can help startups learn faster through feedback loops, provide more clarity on customer behavior, and build consensus on future actions. The document then outlines keys to a great analytics strategy, including tightly integrating analytics with business strategy, using an iterative process, and defining measurable hypotheses. It recommends focusing analytics on segmentation/cohort analysis, retention, funnels, revenue tracking, and marketing effectiveness. Case studies on Airbnb, Khan Academy, and Jawbone demonstrate how analytics provided insights to optimize processes and better understand customer needs.
Similar to Maximizing a churn campaign’s profitability with cost-sensitive predictive analytics
This document discusses using machine learning and deep learning techniques to detect malicious URLs and TLS certificates. It describes building recurrent neural network models to classify URLs and certificates based on their content. The models were able to achieve over 98% accuracy on URL classification and can be used to detect phishing URLs and malicious TLS certificates that aim to mimic legitimate ones. A demo of these techniques is also mentioned. The goal is to develop more robust detection of malicious AI that is trying to simulate legitimate behavior.
In this work we describe how threat actors may use AI algorithms to bypass AI phishing detection systems. We analyzed more than a million phishing URLs to understand the different strategies that threat actors use to create phishing URLs. Assuming the role of an attacker, we simulate how different threat actors may leverage Deep Neural Networks to enhance their effectiveness rate. Using Long Short-Term Memory Networks, we created DeepPhish, an algorithm that learns to create better phishing attacks. By training the DeepPhish algorithm for two different threat actors, they were able to increase their effectiveness from 0.69% to 20.9%, and 4.91% to 36.28%, respectively.
This document discusses how artificial intelligence could be used by hackers to improve cyberattacks like phishing. The author details an experiment where threat actors were able to improve their phishing attacks using AI, making them 3000% more successful. The document argues that AI enhances attackers' efficiency and that companies need multi-layered AI and machine learning detection systems as well as deep adversarial learning to monitor threats and fight against adversary AI. It maintains that companies must take action now as AI amplifies the power of attacks like phishing, malware, and weakening authentication controls.
Most people think a successful data product requires just three things: data, the right algorithm, and good execution. But as anyone who’s tried to create one knows, an effective product requires much more. In this talk, Dr. Correa Bahnsen will share his successes—and failures—in building data products for information security, and why an isolated data science team is a recipe for failure.
Worldwide, billions of euros are lost every year due to credit card fraud. Increasingly, fraud has diversified to different digital channels, including mobile and online payments, creating new challenges as innovative new fraud patterns emerge. Hence, it remains challenging to find effective methods of mitigating fraud. Existing solutions include simple if-then rules and classical machine learning algorithms. Credit card fraud is by definition an example-dependent, cost-sensitive classification problem, in which the costs due to misclassification vary between examples and not only within classes, i.e., misclassifying a fraudulent transaction may have a financial impact ranging from a few to thousands of euros. In this paper, we propose an extension to the cost-sensitive decision tree algorithm: creating an ensemble of such trees and combining them using a stacking approach with a cost-sensitive logistic regression. We compare our method with standard machine learning algorithms and state-of-the-art cost-sensitive classification methods using a real credit card fraud dataset provided by a large European card processing company. The results show that our method achieves savings of up to 73.3%, more than 2 percentage points above a single cost-sensitive decision tree.
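A rough sketch of the stacking combination described above, under the same simplification used earlier (cost-proportionate sample weights approximating the cost-based trees, and a cost-weighted logistic regression as the meta-learner; all names are illustrative assumptions, the real implementation is in the costcla repository):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_cs_stacking(X, y, costs, n_trees=10, frac=0.5, seed=0):
    """Base trees on random subsamples, combined by a cost-weighted logistic regression."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
        trees.append(DecisionTreeClassifier(max_depth=5, random_state=0)
                     .fit(X[idx], y[idx], sample_weight=costs[idx]))
    meta_X = np.column_stack([t.predict(X) for t in trees])  # base-tree votes as features
    meta = LogisticRegression().fit(meta_X, y, sample_weight=costs)
    return trees, meta

def predict_cs_stacking(trees, meta, X):
    return meta.predict(np.column_stack([t.predict(X) for t in trees]))
```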