1. MIS 637 Final Project
Predicting Churners in a Telecom Company
By
Rahul Bhatia
Student ID: 10398302
2. ABSTRACT
• "Churn Rate" is a business term describing the rate at which
customers leave or cease paying for a product or service. It's a
critical figure in many businesses, as it's often the case that
acquiring new customers is a lot more costly than retaining existing
ones (in some cases, 5 to 20 times more expensive).
• Understanding what keeps customers engaged is therefore incredibly
valuable, as it is a logical foundation from which to develop retention
strategies and roll out operational practices aimed at keeping customers
from walking out the door. Consequently, there is growing interest among
companies in developing better churn-detection techniques, leading many
to look to data mining and machine learning for new and creative
approaches.
4. Business Understanding
Problem Statement:
For this project, I obtained a historical customer dataset from a telecom
(mobile) company that wants to predict whether its customers will churn.
The objective of this project is to build a model, learned from this
historical data, that identifies likely churners.
Objective:
The classification goal is to derive rules and predict whether a
customer will churn (target variable: Churn) using the KNN and C4.5
algorithms, and to compare the accuracies of the two models.
Accomplishments:
Using this model, we can improve churn-prediction efficiency by
identifying the main variables that drive churn, and obtain a more
reliable estimate of which potential churners should be contacted first.
5. Data Understanding
Data Source: This dataset was used in the yhat blog post “Predicting
customer churn with scikit-learn” by Eric Chiang.
Dataset details:
• The data is straightforward. Each row represents a subscribing
telephone customer. Each column contains customer attributes such
as phone number, call minutes used during different times of day,
charges incurred for services, account length, and whether or not the
customer is still a customer. The original dataset contains 3333 rows,
with 1 dependent variable and 20 independent variables.
9. Data Preparation
Data Cleaning and Transformations:
Handle missing values and identify outliers:
No missing values or outliers were found in the original data.
Normalization:
Z-score normalization was performed on the input variables Account Length,
Number of Voice Mail Messages, Total Day Minutes, Total Day Calls, Total
Evening Minutes, Total Evening Calls, Total Night Minutes, Total Night Calls,
Total International Minutes, Total International Calls, and Customer Service Calls.
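As an illustration of the z-score step, a minimal sketch (the sample values for Total Day Minutes are made up for illustration; the modeling tool performs this transformation internally):

```python
import numpy as np

def z_score(column):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

# Hypothetical "Total Day Minutes" values, for illustration only.
day_minutes = [180.0, 200.0, 160.0, 220.0, 240.0]
scaled = z_score(day_minutes)
# The scaled column has mean 0 and standard deviation 1.
```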
10. Data Preparation
• Attribute Selection:
The attributes State, Area Code, and Phone Number were dropped from
the model, as these columns are not needed for churn prediction.
The attributes Total Day Charge, Total Evening Charge, Total Night
Charge, and Total International Charge were also dropped from the
model, as high correlation was found between them and Total Day
Minutes, Total Evening Minutes, Total Night Minutes, and Total
International Minutes respectively.
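The minutes/charge redundancy can be verified with a correlation coefficient. A minimal sketch with made-up numbers (the $0.17/minute rate and the 0.95 threshold are assumptions for illustration):

```python
import numpy as np

# Toy columns: charge is a fixed per-minute rate times minutes,
# so the two columns carry the same information.
day_minutes = np.array([110.0, 162.5, 243.4, 299.0, 166.7, 218.2])
day_charge = day_minutes * 0.17  # hypothetical rate

corr = np.corrcoef(day_minutes, day_charge)[0, 1]
# With |corr| above a threshold such as 0.95, keep one column, drop the other.
drop_charge_column = abs(corr) > 0.95
```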
16. Data Preparation
Data Division:
After data cleaning, the dataset of 3333 records is divided into two sets.
Training set: 80% of the data (2666 records), used to develop the model.
Testing set: 20% of the data (667 records), used to evaluate the model.
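The 80/20 division can be sketched as a stand-alone shuffle-and-cut (the deck used a modeling tool for this; the sketch below reproduces the same 2666/667 record counts):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle the rows and split them into training and testing sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

records = list(range(3333))  # stand-ins for the 3333 customer records
train, test = train_test_split(records)
# len(train) == 2666, len(test) == 667
```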
17. Modeling
Which algorithm?
The target variable is categorical (true/false), not continuous, so classification
is the right choice.
Classification predicts categorical class labels: it builds a model from the
training set and the values of a classifying attribute, and uses it to classify
new data.
18. Modeling
K-Nearest Neighbors (KNN) algorithm:
The output is a class membership. An object is classified by a majority vote of
its neighbors, being assigned to the class most common among its k nearest
neighbors (k is a positive integer, typically small). If k = 1, the object is
simply assigned to the class of its single nearest neighbor.
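The majority-vote idea can be sketched in a few lines (the points and labels below are made up; this is an illustration, not the tool's implementation):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two toy clusters: non-churners near (1, 1), churners near (8, 8).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["false", "false", "false", "true", "true", "true"]
```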
C4.5 algorithm: An extension of the ID3 algorithm. C4.5 recursively visits each
decision node, selecting the optimal split, until no further splits are possible.
It uses the concept of information gain, or entropy reduction, to select the
optimal split.
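The entropy-reduction calculation at the heart of the split selection can be sketched as follows (C4.5 actually normalizes this into a gain ratio; the sketch shows the underlying information gain on a made-up perfectly separable node):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label distribution, in bits."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["true"] * 4 + ["false"] * 4          # 50/50 node: entropy = 1 bit
gain = information_gain(parent, ["true"] * 4, ["false"] * 4)  # perfect split
```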
19. Modeling
Software:
SPSS Modeler 17.0 is a data mining and text analytics application built by IBM.
It is an extensive predictive analytics platform designed to build predictive
models, conduct analytic tasks, and bring predictive intelligence to decisions
through a range of advanced algorithms and techniques.
27. C4.5 Test Dataset on Training Data Model
94.9% accuracy
28. Evaluation
The C4.5 algorithm (94.9%) is preferred over the K-nearest neighbors algorithm
(87.1%), as its model accuracy is higher.
C4.5 Algorithm:
Coincidence Matrix
The matrix shows high accuracy when predicting “false” but low accuracy when
predicting “true”. A model can yield misleading results when the dataset is
unbalanced: in this project we have 558 “false” and 109 “true” records, so the
classifier could easily be biased into classifying all samples as “false”.
However, we can still use this model to predict “true”, as the lift and gains
charts show.
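The imbalance point is worth quantifying: on the 558/109 test split described above, even a degenerate classifier that predicts “false” for everyone scores high accuracy, which is why the coincidence matrix and the lift/gains charts matter. A quick check:

```python
# Class counts from the test set described above.
n_false, n_true = 558, 109
total = n_false + n_true

# A classifier that always predicts "false" is still right for every
# "false" record, so its accuracy is simply the majority-class share.
all_false_accuracy = n_false / total  # about 0.84 despite catching no churners
```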
29. Evaluation
Lift is a measure of the effectiveness of a predictive model, calculated as the
ratio between the results obtained with and without the model.
If we contact 10% of customers, using no model we should reach 10% of the
positive churners; using the given model we should reach 60% of them, making
the model 6 times as effective at that depth.
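The lift figure quoted above is just a ratio of the two percentages, which can be written out directly:

```python
def lift_at(frac_contacted, frac_churners_captured):
    """Lift = fraction of churners captured / fraction of customers contacted."""
    return frac_churners_captured / frac_contacted

# Figures from the lift chart: contacting the top 10% of customers
# ranked by the model captures 60% of the churners.
lift = lift_at(0.10, 0.60)  # 6x better than contacting customers at random
```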
30. Evaluation
Gains Chart
The y-axis shows the percentage of the total possible positive churners
(“true”); the x-axis shows the percentage of customers contacted.
Using this model, we need to contact only 50% of customers to capture 90%
of the “true” churners.
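A gains curve of this kind is built by ranking customers by model score and accumulating the share of churners captured. A minimal sketch with made-up scores and labels:

```python
def gains_curve(scores, labels):
    """Cumulative share of positives captured as customers are contacted
    in order of decreasing model score. labels: 1 = churner, 0 = not."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    captured, curve = 0, []
    for i in order:
        captured += labels[i]
        curve.append(captured / total_pos)
    return curve

# Hypothetical churn scores and true labels for six customers.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = gains_curve(scores, labels)
```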
32. Evaluation
Conclusion:
We can conclude that Day Minutes, Number of Customer Service Calls,
International Plan, Evening Minutes, Number of International Calls, and Voice
Mail Plan are the most important variables in predicting churners.
33. Deployment
• Predicting churn is particularly important for businesses with subscription
models, such as cell phone, cable, or merchant credit card processing plans.
• Since the model achieved high predictive performance, it can be used to
predict churners in any telecom company, helping the company prevent its
customers from churning by improving on the most important variables
discussed earlier, and also saving campaign costs.
34. References
Data source:
https://raw.githubusercontent.com/EricChiang/churn/master/data/churn.csv
Software:
http://www-01.ibm.com/software/analytics/spss/
Other references:
http://blog.yhathq.com/posts/predicting-customer-churn-with-sklearn.html