- A logistic regression model was found to best predict customer churn with the highest AUC and accuracy.
- The top variables increasing churn risk were credit class, handset price, average monthly calls, billing adjustments, household subscribers, call waiting ranges, and dropped/blocked calls.
- Cost and billing variables like charges and usage were significant, validating an independent survey.
- A lift chart showed targeting the highest risk 30% of customers could identify 33% of potential churners. The model allows prioritizing retention efforts on the 20% riskiest customers.
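The gains calculation behind such a lift chart can be sketched in a few lines. The data below is simulated and the function name is illustrative, not taken from the original analysis:

```python
import numpy as np

def cumulative_gains(y_true, y_score, fraction):
    """Share of all churners captured when contacting the top `fraction`
    of customers ranked by predicted churn risk."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score))   # riskiest customers first
    n_target = int(len(order) * fraction)
    return y_true[order[:n_target]].sum() / y_true.sum()

# Toy data: roughly 24% churners and a noisy-but-informative risk score.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.24, size=5000)
score = 0.35 * y + rng.random(5000)
print(f"top 30% of customers captures {cumulative_gains(y, score, 0.30):.0%} of churners")
```

A curve of this quantity over fractions from 0 to 1 is exactly the cumulative gains chart the summary describes.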
I performed this analysis in SAS on a dataset of 5,000 records, using CART and logistic regression to build a predictive model that identifies customers who are likely to shift to a competitor's network.
Slides from the presentation of this NYC meetup : http://www.meetup.com/Data-Modeling/events/224554990/
I talked about how to model churn before even thinking about the machine learning model.
As part of our team's enrollment in the Data Science Super Specialization course at UpX Academy, we submitted several projects for our final assessments; one of them was this Telecom Churn Analysis model.
The input data was provided by UpX Academy, and the language we used is R. The project's main objectives were:
-> To predict customer churn.
-> To highlight the main variables/factors influencing customer churn.
-> To use various ML algorithms to build prediction models and evaluate their accuracy and performance.
-> To find the best model for our business case and provide an executive summary.
To address this business problem, we followed a thorough approach, starting with a detailed exploratory data analysis (box plots, bar plots, etc.).
We then built as many classification models as fit our business case (logistic regression, kNN, decision trees, random forest, SVM) and also touched on a Cox proportional hazards survival model. For every model we then tried to boost performance by applying various tuning techniques.
As we are all still learning these concepts and starting out, please feel free to provide feedback on our work. Any suggestions are most welcome. :)
Thanks!!
BigData Republic teamed up with VodafoneZiggo and hosted a meetup on churn prediction.
Telecom companies like VodafoneZiggo have long benefited from the fine art/science of predicting churn. In the booming age of subscription-based business models (e.g. Netflix, Spotify, HelloFresh), the importance of predicting churn has become widespread. During this event, VodafoneZiggo shared some of its wisdom with the public, after which BDR Data Scientist Tom de Ruijter presented an overview of the modeling tools at hand, both classical and novel. Finally, the participants engaged in a hands-on session showcasing the implementation of different approaches.
PART 1 — Churn Prediction in Practice by Florian Maas
At VodafoneZiggo we are incredibly excited about Advanced Analytics and the enormous potential for progress and innovation. In our state of the art open source platform we store the tremendous amount of data that is generated every single second in our mobile and fixed networks. This means that we have a vast body of rich information, which if unlocked, can lead to something very special. As a company with a primarily subscription-based service model, churn plays a vital role in the daily business. Not only is the churn rate a good indicator of customer (dis)satisfaction, it is also one out of two factors that determines the steady-state level of active customers. During this talk, we will show how data science provides added value in the process of churn prevention at VodafoneZiggo. We will talk about the data and the modeling approach we use, and the pitfalls and shortcomings that we have encountered while building the model. We will also briefly discuss potential improvements to the current approach, which brings us to talk #2.
PART 2 — The Churn Prediction Toolbox by Tom de Ruijter
The second talk will show you the fine intricacies of predicting churn through different approaches. We’ll start off with an overview of different modeling strategies for describing the problem of churn, both in terms of a classification problem as well as a regression problem. Secondly, Tom will give you insights in how you evaluate a churn model in a way such that business stakeholders know how to act upon the model results. Finally, we’ll work towards the hands-on session demonstrating different model approaches for churn prediction, ranging from classical time series prediction to recurrent neural networks.
Customer churn has become a big issue in many banks because it costs far more to acquire a new customer than to retain an existing one. With a customer churn prediction model, possible churners can be identified, and the bank can act to prevent them from leaving. To set up such a model in an Icelandic bank, a few things have to be considered: how a churner is defined, and which variables and methods to use. We propose that a churner for this Icelandic bank be defined as a customer who has not been active for the last three months, based on the bank's definition of an active customer. Behavioral and demographic variables should be used as model inputs, with either a decision tree or logistic regression as the technique.
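The proposed three-month inactivity definition translates directly into a labeling rule. The sketch below assumes a hypothetical table with `customer_id` and `last_active` columns; the names are illustrative, not the bank's actual schema:

```python
from datetime import date, timedelta

import pandas as pd

# Hypothetical activity log: one row per customer with their last
# transaction date (column names are illustrative).
activity = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_active": [date(2024, 6, 1), date(2024, 1, 15), date(2024, 5, 20)],
})

def flag_churners(df, as_of, months_inactive=3):
    """Label a customer a churner if not active in the last `months_inactive` months."""
    cutoff = as_of - timedelta(days=30 * months_inactive)
    return df.assign(churner=pd.to_datetime(df["last_active"]) < pd.Timestamp(cutoff))

labeled = flag_churners(activity, as_of=date(2024, 6, 30))
print(labeled)
```

These labels then become the response variable for the decision tree or logistic regression mentioned above.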
This presentation introduces big data and explains how to generate actionable insights using analytics techniques. The deck explains general steps involved in a typical analytics project and provides a brief overview of the most commonly used predictive analytics methods and their business applications.
Vijay Adamapure is a Data Science Enthusiast with extensive experience in the field of data mining, predictive modeling and machine learning. He has worked on numerous analytics projects ranging from healthcare, business analytics, renewable energy to IoT.
Vijay presented these slides during the Internet of Everything Meetup event 'Predictive Analytics - An Overview' that took place on Jan. 9, 2015 in Mumbai. To join the Meetup group, register here: http://bit.ly/1A7T0A1
Research of this type matters in the telecom market because it helps companies protect profit: predicting churn is widely recognized as one of the most important ways for telecom companies to preserve revenue. Hence, this research aimed to build a system that predicts customer churn in a telecom company.
These prediction models need to achieve high AUC values. To train and test the model, the sample data is divided into 70% for training and 30% for testing.
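A minimal sketch of this 70/30 split and AUC evaluation, using simulated data in place of the study's sample (scikit-learn names are real; the data is not):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data, with a ~24% positive class.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.76, 0.24], random_state=42)

# 70% train / 30% test, stratified so both sets keep the churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")
```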
Churn in the Telecommunications Industry by skewdlogix
Strategic Business Analysis Capstone Project Telecommunications Churn Management
Churn is a significant problem that costs telecommunications companies billions of dollars in lost revenue. Now that the market is mature, the only way for a company to grow is to take its competitors' customers. This issue, combined with the greater choice consumers have gained, means that any adverse touch point with a consumer can result in a lost customer.
Large amounts of heterogeneous medical data have become available in various healthcare organizations (payers, providers, pharmaceuticals). Those data could be an enabling resource for deriving insights for improving care delivery and reducing waste. The enormity and complexity of these datasets present great challenges in analyses and subsequent applications to a practical clinical environment. More details are available here http://dmkd.cs.wayne.edu/TUTORIAL/Healthcare/
Customer churn prediction for a telecom data set by Kuldeep Mahani
Customer churn prediction and relevant recommendations based on analysis of the DSN telecom data. Random forest and logistic regression were applied to predict customer churn.
Customer churn occurs when customers or subscribers stop doing business with a company or service.
Also known as customer attrition, customer churn is a critical metric because it is much less expensive to retain existing customers than to acquire new ones; earning business from new customers means working leads all the way through the sales funnel, consuming marketing and sales resources throughout the process.
Business analytics (BA) is used to gain insights that inform business decisions and can be used to automate and optimize business processes. Data-driven companies treat their data as a corporate asset and leverage it for a competitive advantage. Successful business analytics depends on data quality, skilled analysts who understand the technologies and the business, and an organizational commitment to data-driven decision-making.
Business analytics examples
Business analytics techniques break down into two main areas. The first is basic business intelligence. This involves examining historical data to get a sense of how a business department, team or staff member performed over a particular time. This is a mature practice that most enterprises are fairly accomplished at using.
Data Mining on Customer Churn Classification by Kaushik Rajan
Implemented multiple classifiers to classify if a customer will leave or stay with the company based on multiple independent variables.
Tools used:
> RStudio for Exploratory data analysis, Data Pre-processing and building the models
> Tableau and RStudio for Visualization
> LATEX for documentation
Machine learning models used:
> Random Forest
> C5.0
> Decision tree
> Neural Network
> K-Nearest Neighbour
> Naive Bayes
> Support Vector Machine
Methodology: CRISP-DM
Customer churn classification using machine learning techniques by SindhujanDhayalan
Advanced data mining project on classifying customer churn using machine learning algorithms such as random forest, C5.0, decision trees, KNN, ANN, and SVM. The CRISP-DM approach was followed. Accuracy, error rate, precision, recall, F1, and ROC curves were generated using R, and the most efficient model was found by comparing these values.
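The project itself was done in R; an equivalent sketch in Python of the confusion-matrix-derived metrics it reports (function name illustrative) might look like this:

```python
import numpy as np

def classification_report_basic(y_true, y_pred):
    """Accuracy, error rate, precision, recall and F1 from a confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "precision": precision, "recall": recall, "f1": f1}

# Tiny worked example: 2 true positives, 2 true negatives, 1 FP, 1 FN.
m = classification_report_basic([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(m)
```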
The data set used in this project is available on Kaggle and contains nineteen columns (independent variables) describing the characteristics of the clients of a fictional telecommunications corporation. The Churn column (response variable) indicates whether the customer departed within the last month. The "No" class contains clients who did not leave the company last month, while the "Yes" class contains clients who decided to end their relationship with the company. The objective of the analysis is to understand the relation between customer characteristics and churn.
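A first step toward that objective is simply tabulating the churn rate by a characteristic. The column names below mirror the Kaggle telco-churn schema but are assumptions here, and the rows are made up for illustration:

```python
import pandas as pd

# Hypothetical slice of a telco dataset (values invented for the sketch).
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year",
                 "One year", "Month-to-month", "Two year"],
    "Churn":    ["Yes", "Yes", "No", "No", "No", "No"],
})

# Churn rate per contract type: a first look at how a characteristic
# relates to the response variable.
rate = (df["Churn"].eq("Yes")
          .groupby(df["Contract"])
          .mean()
          .sort_values(ascending=False))
print(rate)
```

The same grouping applied to each of the nineteen columns gives a quick ranking of candidate churn drivers before any model is fit.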
Dive into insurance churn prediction with this data analysis project presented by the Boston Institute of Analytics. Students analyze historical data and customer demographics, identify predictive indicators, and develop churn prediction models, offering a comprehensive exploration of the factors that drive insurance churn, together with actionable recommendations derived from the analysis. To learn more about the institute's data science and artificial intelligence programs, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
More on https://highlyscalable.wordpress.com/
Data Mining Problems in Retail is an analytical report that studies how retailers can make sense of their
data by adopting advanced data analysis and optimization techniques that enable automated decision
making in the area of marketing and pricing. The report analyzes dozens of practical case studies and
research reports and presents a systematic view on the problem.
We hope that this article will be useful for data scientists, marketing specialists, and business analysts
who are looking beyond the basic statistical and data mining techniques to build comprehensive
data-driven business optimization processes and solutions.
Customer churn is a burning problem for telecom companies. In this project, we simulate one such case, working with data on postpaid customers with contracts. The data covers customer usage behavior, contract details, and payment details, and indicates which customers canceled their service. Based on this historical data, we need to build a model that predicts whether a customer will cancel their service in the future.
To identify the segment of customers who have a higher tendency to default if they are offered a personal loan
To leverage the existing Two-Wheeler Loan (TW) customer base to cross-sell the Personal Loan product
Many customers switch or unsubscribe (churn) from their telecom providers for a variety of reasons, ranging from unsatisfactory service and better pricing from competitors to customers moving to different cities. Telecom companies are therefore interested in analyzing the patterns of customers who churn and using that analysis to determine which customers are likely to unsubscribe in the future. One such company is Telco Systems, which is interested in identifying the precise patterns of its churning customers and has provided the customer data for this project.
Reduction in customer complaints - Mortgage Industry by Pranov Mishra
The project analyzes customer complaints/inquiries received by a US-based mortgage (loan) servicing company.
The goal is to build a predictive model using the identified significant contributors and to recommend changes that will lead to:
1. Reduced rework
2. Reduced operational cost
3. Improved customer satisfaction
4. Improved company preparedness to respond to customers
Three models were built: logistic regression, random forest, and gradient boosting. Accuracy, AUC (area under the curve), sensitivity, and specificity all improved markedly as model complexity increased from simple to complex.
Logistic regression did not generalize well to the non-linear data, so that model suffered from both bias and variance. Random forest is itself an ensemble technique and greatly reduces variance; gradient boosting, with its sequential learning, helps reduce bias. The results from random forest and gradient boosting did not differ by much. This is consistent with the bias-variance trade-off: flexible, complex models do well on non-linear data, while inflexible simple models have high bias (and can also have high variance).
Additionally, a lift chart was built, showing a cumulative lift of 133% in the first four deciles.
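As an illustration of the bias-variance point above, the sketch below fits the same three model families on simulated non-linear data (the complaints data itself is not public); on data like this the ensembles typically outperform the linear model:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Non-linear toy data stands in for the (private) complaints dataset.
X, y = make_moons(n_samples=2000, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
print(aucs)
```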
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... by John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
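A minimal sketch of such an automated validation check, with illustrative rules and column names (not tied to any particular pipeline):

```python
import pandas as pd

def validate(df):
    """Run simple data-quality rules on a batch and return the failures."""
    problems = []
    if df["customer_id"].duplicated().any():
        problems.append("duplicate customer_id")
    if df["monthly_charge"].lt(0).any():
        problems.append("negative monthly_charge")
    if df["email"].isna().any():
        problems.append("missing email")
    return problems

# A batch that violates all three rules, caught before it flows downstream.
batch = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "monthly_charge": [39.9, -5.0, 20.0],
    "email": ["a@x.com", None, "c@x.com"],
})
print(validate(batch))
```

In practice such rules would run automatically at ingestion, with failures blocked or quarantined rather than merely printed.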
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables rank calculation in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the input graph must have no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
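As a rough companion to the abstract, a monolithic power-iteration PageRank with dead-end (dangling-vertex) handling can be sketched as follows; Levelwise PageRank instead requires such dead ends to be absent from the input (a dense NumPy stand-in, not the report's implementation):

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10, max_iter=200):
    """Monolithic power-iteration PageRank on a dense adjacency matrix.
    Dead ends (rows with no out-links) are handled by spreading their
    rank uniformly over all vertices each iteration."""
    n = adj.shape[0]
    out = adj.sum(axis=1)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        dangling = r[out == 0].sum() / n              # redistribute dead-end rank
        contrib = np.where(out > 0, r / np.maximum(out, 1), 0.0)
        r_new = (1 - d) / n + d * (adj.T @ contrib + dangling)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# Three-node chain 0 -> 1 -> 2 with a dead end at node 2.
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], float)
r = pagerank(A)
print(r)
```

The uniform redistribution keeps the ranks summing to 1 despite the dead end, which is the property the loop-based handling strategy in the title must also preserve.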
Adjusting primitives for graph: SHORT REPORT / NOTES by Subhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list-based graph representation used by graph algorithms such as PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
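The storage-type experiment above contrasts float with bfloat16 in compiled code; as a rough Python analogue of the precision side of that trade-off, naive sequential accumulation in float32 versus float64 shows how the narrower storage type loses accuracy on an element sum:

```python
import numpy as np

def naive_sum(values, dtype):
    """Sequential accumulation in the given storage type (no pairwise tricks)."""
    acc = dtype(0.0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return float(acc)

data = [0.1] * 100_000          # exact sum is 10,000
err32 = abs(naive_sum(data, np.float32) - 10_000)
err64 = abs(naive_sum(data, np.float64) - 10_000)
print(f"float32 error: {err32:.4f}, float64 error: {err64:.2e}")
```

The wider the accumulator relative to the storage type, the smaller the error; the report's benchmarks measure the performance side of the same choice.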
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computation and can thus reduce iteration time. Road networks often contain chains that can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Prediction of customer propensity to churn - Telecom Industry
1. Pranov Shobhan Mishra
Prediction of customer propensity to churn
TELECOM INDUSTRY
Refer to github link for the code
https://github.com/Pranov1984/Prediction-of-customer-propensity-to-churn
2. Executive Summary
Overview:
At Company X, there is concern about the increased churn being experienced across the telecom industry. Customer retention and average revenue per user (ARPU) have become strong areas of focus, especially for Company X, whose churn rate has been relatively high.
Problem Statement:
Currently the effort to retain customers is largely reactive: an attempt is made only when a subscriber calls in to close the account. The management team is keen to take more initiative on this front and pursue a targeted, proactive strategy.
Goal Statement:
The aim of this project is to analyse the data and provide the company with insights into customer behaviour that will help devise targeted retention strategies. The specific goals are given below:
Identify the top variables driving the likelihood of churn.
An independent industry survey suggested that "cost and billing", "network and service quality" and "data usage connectivity issues" are key influencing factors of churn; validate whether this holds true for Company X.
Build a predictive model to identify the customers with the highest probability of churn, which the company can use for a proactive retention strategy.
Build a lift chart so that the company can optimise its efforts by targeting most of the potential churners with the least contact effort. Here, targeting 30% of the total customer pool captures 33% of the potential churn candidates.
3. Executive Summary Continued ...
Extensive experimentation with more than 10 different models was done to identify the model that best predicts customer behaviour, so that the company can take proactive steps to retain customers wherever possible.
The data was slightly imbalanced (majority class : minority class = 76 : 24), so an appropriate evaluation metric had to be chosen. A combination of the F1 score (the harmonic mean of precision and recall) and the area under the ROC curve (AUC) was used to select the best model.
The models tried to arrive at the best were:
Simple models such as logistic regression and discriminant analysis with different classification thresholds
Random forest after balancing the dataset using the Synthetic Minority Oversampling Technique (SMOTE)
An ensemble of five individual models, predicting the output by averaging the individual output probabilities
The XGBoost algorithm
Note: A majority-voting ensemble and a stacked ensemble were also tried, but they did not give good results and are hence not included in the project report.
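SMOTE balances the classes by interpolating between each minority-class point and one of its nearest minority-class neighbours. A minimal base-R sketch of that interpolation step (the project used a packaged implementation; `smote_once` and the choice of `k` here are illustrative):

```r
# Generate one synthetic point per minority-class row by interpolating
# towards a randomly chosen one of its k nearest minority neighbours.
smote_once <- function(minority, k = 3) {
  n <- nrow(minority)
  d <- as.matrix(dist(minority))              # pairwise Euclidean distances
  synthetic <- t(vapply(seq_len(n), function(i) {
    nbrs <- order(d[i, ])[2:(k + 1)]          # k nearest neighbours (skip self)
    j <- sample(nbrs, 1)
    gap <- runif(1)                           # random interpolation factor in [0, 1)
    minority[i, ] + gap * (minority[j, ] - minority[i, ])
  }, numeric(ncol(minority))))
  synthetic                                   # matrix of synthetic minority rows
}
```

Each synthetic row is a convex combination of two real minority points, so it always lies within the observed range of every feature.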
The key insights from the logistic regression model, which incidentally provided the best metrics, indicate that the variables impacting customer behaviour can be broadly classified as:
minutes of usage (both duration and number of calls),
overage charges,
network quality (for both voice and data),
number of subscribers in the family,
handset price, and
credit class.
A gains chart was prepared, giving a cumulative lift of 110% in the first three deciles. The customers with the highest probability of churn were identified; the company can use these customer details to proactively work with them and retain them.
4. Approach
Remove Variables: drop variables with more than 30% missing values.
Feature Selection: identify significant variables using Boruta.
Missing Value Imputation: apply kNN to the dataset with the significant variables.
Outlier Treatment: winsorize and transform variables.
Model Building:
Logistic Regression
Linear Discriminant Analysis
Random Forest
Ensemble Models – averaging the output from each individual model
Ensemble Models – majority voting
Stacked Ensemble Models
Boosting – XGBoost
7. Significant Variables, Missing Value Imputation & Outlier Treatment
The above variables were removed from the original dataset, and a Boruta model was then built for feature selection. The model identified the significant variables, which were used to further subset the original dataset.
The significant variables identified are given below.
8. Correlation Plot highlighting the correlated numeric variables
Highly correlated variables were considered for removal to reduce multicollinearity.
9. Significant Variables, Missing Value Imputation & Outlier Treatment
After further subsetting the original dataset by selecting the significant variables, 50 variables remain, 18 of them still having missing values.
The missing values were imputed using the kNN algorithm (library VIM).
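The project used VIM's kNN for imputation; as a rough illustration of the idea, here is a minimal base-R sketch that fills one numeric column with the mean of its k nearest complete rows (`knn_impute` and its arguments are illustrative names, not the project's code):

```r
# Impute missing values in one numeric column using the k nearest rows,
# with distance measured on the standardized remaining numeric columns.
knn_impute <- function(df, target, k = 5) {
  predictors <- scale(df[, setdiff(names(df), target), drop = FALSE])
  miss <- which(is.na(df[[target]]))
  obs  <- which(!is.na(df[[target]]))
  for (i in miss) {
    # Euclidean distance from row i to every row with an observed target
    d <- sqrt(rowSums((predictors[obs, , drop = FALSE] -
                       matrix(predictors[i, ], nrow = length(obs),
                              ncol = ncol(predictors), byrow = TRUE))^2))
    nbrs <- obs[order(d)][seq_len(k)]
    df[[target]][i] <- mean(df[[target]][nbrs])   # impute with neighbour mean
  }
  df
}
```

Standardizing the predictors first keeps variables on large scales from dominating the distance, which VIM's implementation also handles internally.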
After that, some of the highly correlated variables were removed.
Most of the numeric variables had outliers; the boxplots produced when the code is run help visualize the variables with outliers.
The outliers were treated by winsorizing the variables (library DescTools). Some variables were further log-transformed; see the R code for details.
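Winsorizing clamps each tail of a variable to a chosen percentile instead of dropping the outlying rows. A minimal base-R sketch of the idea (the project used DescTools; the 5th/95th percentile caps here are an illustrative choice):

```r
# Clamp the tails of a numeric vector to the given percentile caps.
winsorize <- function(x, probs = c(0.05, 0.95)) {
  caps <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, caps[1]), caps[2])   # values beyond a cap are set to the cap
}

winsorize(c(1, 2, 3, 4, 100))       # the extreme value 100 is pulled in
```

Unlike trimming, winsorizing keeps every row, so no information about the rest of the record is lost.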
10. Data Visualization
Similar trend for customers who churn and who do not: a higher number of customers with low usage, and a high mean usage.
For customers who churn, the trend generally decreases after a value of 30; for customers who do not churn, it decreases after close to 50.
Similar pattern in rev_Range for both customer behaviours.
Similar pattern, but differences in densities are noticeable at comparable values.
11. Data Visualization
Similar trend for both customer behaviours.
Similar pattern in rev_Range for both customer behaviours.
Similar pattern, but differences in densities are noticeable at comparable values.
12. Data Visualization – Using Transformed Variables
Distinct differentiation noticed in customer behaviour across models and months of usage.
Similarly for call drops, as seen below.
Distinct differentiation achieved for both variables after transformation into two classes based on churn percentage.
14. Model Comparison – the previously built models and an Ensemble Model
An ensemble of five individual models; the final prediction was made by averaging the predictions from the models.
15. Model Comparison with Ensemble Modelling
Five individual models were built, and the final prediction was made by combining their predictions.
16. Model Comparison
Ten models were tried to find the best among them; the evaluation metric comparison is shown above.
LG_26 appears to give the best results, with the highest AUC, the second-highest accuracy, and relatively good sensitivity and specificity.
LG_26 is a logistic regression model with the classification cut-off threshold set at 26%.
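Scoring at a 26% cut-off and comparing models on the F1 score can be sketched as follows (`classify_at` and `f1_score` are illustrative names; the project's actual metric code may differ):

```r
# Convert predicted churn probabilities to classes at a chosen threshold.
classify_at <- function(prob, threshold = 0.26) as.integer(prob > threshold)

# F1 score: the harmonic mean of precision and recall on the positive class.
f1_score <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
```

Lowering the threshold below the default 0.5 trades precision for recall, which suits an imbalanced churn problem where missing a churner is costlier than an extra retention call.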
17. Variables whose odds ratios imply more than a 50% probability of changing the customer's decision for every 1-unit change in the respective independent variable
The top ten factors for customer churn are:
1. crcscod (credit class code)
2. hnd_price (current handset price)
3. avgqty (average monthly number of calls over the life of the customer)
4. adjmou (billing-adjusted total minutes of use over the life of the customer)
5. Uniqsubs (number of unique subscribers in the household)
6. callwait_range (range of the number of call-waiting calls)
7. Datovr_Range (range of data overage revenue)
8. Drop_blk_Mean (mean number of dropped or blocked calls)
9. Drop_vce_Range (range of the number of dropped/failed voice calls)
10. rev_range (range of revenue, i.e. charge amount)
18. Validation of the applicability of the independent survey findings for the telecom industry (refer to the executive summary)
Shown below is a summary of the logistic model with each variable's significance status.
Variables affecting cost and billing are highly significant; adjmou has one of the top five odds ratios.
Mean total monthly recurring charge (totmrc_Mean), revenue (charge amount) range (rev_Range) and billing adjustments (adjmou) are found to be highly significant, suggesting that cost and billing influence customer behaviour.
Similarly, among the network and service quality variables, drop_blk_Mean (mean number of dropped and blocked calls) is highly significant, drop_vce_Range is significant, and callwait_mean is significant at alpha = 10%.
Datovr_Range (range of data overage revenue) is not found to be significant, but its odds ratio above 1 indicates that a 1-unit change in its value has more than a 50% chance of moving the customer from one level to the other, suggesting it still deserves attention. Additionally, the intercept is significant; it captures the effects of the categorical-variable levels dropped by the model.
19. Recommend rate plan migration as a proactive retention strategy?
The analysis below suggests the answer is "Yes".
mou_Mean (minutes of usage) is one of the highly significant variables, as seen on the previous slide, so it makes sense to work proactively with customers to increase their MOU so that they are retained for a longer period.
The boxplot below also suggests that customers who are retained tend to have higher MOU.
Additionally, mouR_Factor, a derived variable of mou_Range, is found to be highly significant.
Change in MOU is also highly significant; Change_mF is a derived variable of change_mou.
Complementing the above, ovrmou_Mean is also a highly significant variable with an odds ratio above 1. Its positive coefficient estimate indicates that an increase in overage increases churn.
It would help if the company could work with customers and, based on their usage, migrate them to optimal rate plans to avoid overage charges.
20. Use of the model for prioritisation of customers for proactive retention campaigns
Gains – Lift Chart
The lift achieved helps reach churn candidates while contacting a much smaller share of the company's total customer pool. The highest gain comes from the first three deciles, which capture about 33% of the customers who are likely to terminate their services. In other words, by selecting 30% of the entire customer database, the company covers 33% of the customers likely to leave. This is much better than randomly calling customers in the absence of a model, which might yield perhaps a 15% hit rate among all potential churn candidates.
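The cumulative gains computation behind such a lift chart can be sketched in base R (the project used the gains package, as in the code below; `gains_table` is an illustrative name):

```r
# Rank customers by predicted churn probability, split into deciles, and
# compute the cumulative share of actual churners captured plus the lift.
gains_table <- function(actual, prob, groups = 10) {
  ord <- order(prob, decreasing = TRUE)               # riskiest customers first
  actual <- actual[ord]
  decile <- ceiling(seq_along(actual) * groups / length(actual))
  cum_capture <- cumsum(tapply(actual, decile, sum)) / sum(actual)
  lift <- cum_capture / (seq_len(groups) / groups)    # vs. random targeting
  data.frame(decile = seq_len(groups),
             pct_churners_captured = 100 * cum_capture,
             lift = lift)
}
```

A lift of 1 at a decile means targeting that share of customers performs no better than random contact; the chart in the report corresponds to reading off the cumulative capture at the third decile.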
21. Identification of the customers with the highest probability of terminating services
The 20% of customers who need to be proactively engaged to prevent churn were identified.
They are the customers whose probability of churn is greater than 32.24% and at most 84.7%. The code to get the customer details is given below:
library(gains); library(dplyr)
# Build the decile gains table from the final logistic model (LGMF) scores
gains(as.numeric(Telecom_Winsor$churn), predict(LGMF, type = "response", newdata = Telecom_Winsor[, -42]), groups = 10)
Telecom_Winsor$Cust_ID <- mydata$Customer_ID
Telecom_Winsor$prob <- predict(LGMF, type = "response", newdata = Telecom_Winsor[, -42])
quantile(Telecom_Winsor$prob, prob = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1))
# Keep customers whose churn probability falls in the targeted band
targeted <- Telecom_Winsor %>% filter(prob > 0.3224491 & prob <= 0.8470540) %>% dplyr::select(Cust_ID)