Implemented multiple classifiers to classify if a customer will leave or stay with the company based on multiple independent variables.
Tools used:
> RStudio for Exploratory data analysis, Data Pre-processing and building the models
> Tableau and RStudio for Visualization
> LATEX for documentation
Machine learning models used:
> Random Forest
> C5.0
> Decision tree
> Neural Network
> K-Nearest Neighbour
> Naive Bayes
> Support Vector Machine
Methodology: CRISP-DM
Telecommunication Analysis (3 use-cases) with IBM watson analyticssheetal sharma
The purpose of this study is, with the help of Watson Analytics examine why customers are not used the connection of Bits Telecom Company, which factors are influence the churn. Also see the cross selling and up-selling, also focus on profitability and investment and find out the way for better results.
As part of our team's enrollment for Data Science Super Specialization course under UpX Academy, we submitted many projects for our final assessments, one of them was Telecom Churn Analysis Model.
The input data was provided by UpX academy and language we used is R. As part of the project, our main objective was :-
-> To predict Customer Churn.
-> To Highlight the main variables/factors influencing Customer Churn.
-> To Use various ML algorithms to build prediction models, evaluate the accuracy and performance of these models.
-> Finding out the best model for our business case & providing executive Summary.
To address the mentioned business problem, we tried to follow a thorough approach. We did a detailed level Exploratory Data Analysis which consists of various Box Plots, Bar Plots etc..
Further we tried our best to build as many Classification models possible which fits our business case (Logistic Regression/kNN/Decision Trees/Random Forest/SVM) and also tried to touch Cox Hazard Survival analysis Model. Later for every model we tried to boost their performances by applying various performance tuning techniques.
As we all are still into our learning mode w.r.t these concepts & starting new, please feel free to provide feedback on our work. Any suggestions are most welcome... :)
Thanks!!
CHURN ANALYSIS AND PLAN RECOMMENDATION FOR TELECOM OPERATORSJournal For Research
With increasing number of mobile operators, user is entitled with unlimited freedom to switch from one mobile operator to another if he is not satisfied with service or pricing. This trend is not good for operators as they lose their revenue because of customer switch. To solve it, operators are looking for machine learning tools which can predict well in advance which customer may churn, so that they can predict any alternative plans to satisfy and retain them. In this paper, we design a hybrid machine learning classifier to predict if the customer will churn based on the CDR parameters and we also propose a rule engine to suggest best plans.
Telecommunication Analysis (3 use-cases) with IBM watson analyticssheetal sharma
The purpose of this study is, with the help of Watson Analytics examine why customers are not used the connection of Bits Telecom Company, which factors are influence the churn. Also see the cross selling and up-selling, also focus on profitability and investment and find out the way for better results.
As part of our team's enrollment for Data Science Super Specialization course under UpX Academy, we submitted many projects for our final assessments, one of them was Telecom Churn Analysis Model.
The input data was provided by UpX academy and language we used is R. As part of the project, our main objective was :-
-> To predict Customer Churn.
-> To Highlight the main variables/factors influencing Customer Churn.
-> To Use various ML algorithms to build prediction models, evaluate the accuracy and performance of these models.
-> Finding out the best model for our business case & providing executive Summary.
To address the mentioned business problem, we tried to follow a thorough approach. We did a detailed level Exploratory Data Analysis which consists of various Box Plots, Bar Plots etc..
Further we tried our best to build as many Classification models possible which fits our business case (Logistic Regression/kNN/Decision Trees/Random Forest/SVM) and also tried to touch Cox Hazard Survival analysis Model. Later for every model we tried to boost their performances by applying various performance tuning techniques.
As we all are still into our learning mode w.r.t these concepts & starting new, please feel free to provide feedback on our work. Any suggestions are most welcome... :)
Thanks!!
CHURN ANALYSIS AND PLAN RECOMMENDATION FOR TELECOM OPERATORSJournal For Research
With increasing number of mobile operators, user is entitled with unlimited freedom to switch from one mobile operator to another if he is not satisfied with service or pricing. This trend is not good for operators as they lose their revenue because of customer switch. To solve it, operators are looking for machine learning tools which can predict well in advance which customer may churn, so that they can predict any alternative plans to satisfy and retain them. In this paper, we design a hybrid machine learning classifier to predict if the customer will churn based on the CDR parameters and we also propose a rule engine to suggest best plans.
The importance of this type of research in the telecom market is to help companies make more profit.
It has become known that predicting churn is one of the most important sources of income to Telecom companies.
Hence, this research aimed to build a system that predicts the churn of customers i telecom company.
These prediction models need to achieve high AUC values. To test and train the model, the sample data is divided into 70% for training and 30% for testing.
Customer segmentation for a mobile telecommunications company based on servic...Shohin Aheleroff
Competition between the mobile operators is becoming more based on subscriber’s behavior. In order to improve mobile operator’s competitiveness and customer value, several data mining technologies can be used.Most telecommunications carriers cluster their mobile customers by billing system data. This paper discusses how to cluster mobile customers based on their call detail records and analyze their consumer behaviors.
I have done this analysis using SAS on a dataset with 5000 records. I have used CART and Logistic regression to build a predictive model to identify customers which are likely to shift to competitors network.
BigData Republic teamed up with VodafoneZiggo and hosted an meetup on churn prediction.
Telecom companies like VodafoneZiggo have long benefited from the fine art/science of predicting churn. Currently, in the booming age of subscription based business models (e.g. Netflix, Spotify, HelloFresh), the importance of predicting churn has become widespread. During this event, VodafoneZiggo shared some of its wisdom with the public, after which BDR Data Scientist Tom de Ruijter presented an overview of the modeling tools at hand, both classical, as well as novel approaches. Finally, the participants engaged in a hands-on session showcasing the implementation of different approaches.
PART 1 — Churn Prediction in Practice by Florian Maas
At VodafoneZiggo we are incredibly excited about Advanced Analytics and the enormous potential for progress and innovation. In our state of the art open source platform we store the tremendous amount of data that is generated every single second in our mobile and fixed networks. This means that we have a vast body of rich information, which if unlocked, can lead to something very special. As a company with a primarily subscription-based service model, churn plays a vital role in the daily business. Not only is the churn rate a good indicator of customer (dis)satisfaction, it is also one out of two factors that determines the steady-state level of active customers. During this talk, we will show how data science provides added value in the process of churn prevention at VodafoneZiggo. We will talk about the data and the modeling approach we use, and the pitfalls and shortcomings that we have encountered while building the model. We will also briefly discuss potential improvements to the current approach, which brings us to talk #2.
PART 2 — The Churn Prediction Toolbox by Tom de Ruijter
The second talk will show you the fine intricacies of predicting churn through different approaches. We’ll start off with an overview of different modeling strategies for describing the problem of churn, both in terms of a classification problem as well as a regression problem. Secondly, Tom will give you insights in how you evaluate a churn model in a way such that business stakeholders know how to act upon the model results. Finally, we’ll work towards the hands-on session demonstrating different model approaches for churn prediction, ranging from classical time series prediction to recurrent neural networks.
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRYIJITCA Journal
Churn customer, one of the most important issues in customer relationship management and marketing is especially in industries such as telecommunications, the financial and insurance. In recent decades much
research has been done in this area. In this research, the index set for the reasons set reason churn customers for our customers is of particular importance. In this study we are intended to provide a formula for the index churn customers, the better to understand the reasons for customers to provide churn. Therefore, in order to evaluate the formula provided through six Classification methods (Decision tree QUEST, Decision tree C5.0, Decision tree CHAID, Decision trees CART, Bayesian network, Neural network) to evaluate the formula will be involved with individual indicators
Projection pursuit Random Forest using discriminant feature analysis model fo...IJECEIAES
A major and demand issue in the telecommunications industry is the prediction of churn customers. Churn describes the customer who attrites from the current provider to competitors searching for better service offers. Companies from the Telco sector frequently have customer relationship management offices it is the main objective in how to win back defecting clients because preserve long-term customers can be much more beneficial than gain newly recruited customers. Researchers and practitioners are paying great attention to developing a robust customer churn prediction model, especially in the telecommunication business by proposed numerous machine learning approaches. Many approaches of Classification are established, but the most effective in recent times is a tree-based method. The main contribution of this research is to predict churners/non-churners in the Telecom sector based on project pursuit Random Forest (PPForest) that uses discriminant feature analysis as a novelty extension of the conventional Random Forest for learning oblique Project Pursuit tree (PPtree). The proposed methodology leverages the advantage of two discriminant analysis methods to calculate the project index used in the construction of PPtree. The first method used Support Vector Machines (SVM) while, the second method used Linear Discriminant Analysis (LDA) to achieve linear splitting of variables during oblique PPtree construction to produce individual classifiers that are robust and more diverse than classical Random Forest. It is found that the proposed methods enjoy the best performance measurements e.g. Accuracy, hit rate, ROC curve, Lift, H-measure, AUC. Moreover, PPForest based on LDA delivers effective evaluators in the prediction model.
Churn in the Telecommunications Industryskewdlogix
Strategic Business Analysis Capstone Project Telecommunications Churn Management
Churn is a significant problem that costs telecommunications companies billions of dollars through lost revenue. Now that the market is more mature, the only way for a company to grow is to take their competitors customers. This issue
combined with the greater choice that consumers have gained means that any adverse touch point with a consumer can result in a lost customer.
Telecommunication Analysis(3 use-cases) with IBM cognos insightsheetal sharma
The purpose of this study is, with the help of IBM Cognos Insight analyze why customers are not used the connection of Bits Telecom Company, which factors are influence the churn. Also see the cross selling and up-selling, also focus on profitability and investment and find out the way for better results.
Many customers often switch or unsubscribe (churn) from their telecom providers for a variety of reasons. These could range from unsatisfactory service, better pricing from competitors, customers moving to different cities etc. Therefore, telecom companies are interested in analyzing the patterns for customers who churn from their services and use the resultant analysis to determine in the future which customers are more likely to unsubscribe from their services. One such company is Telco Systems. Telco Systems is interested in identifying the precise patterns for their churning customers and have provided the customer data for this project.
Proposed ranking for point of sales using data mining for telecom operatorsijdms
This study helps telecom companies in making decisions that optimize its sales points to reduce costs, also
to identify profitable customers and churn ones. This study builds two research models; physical model for
continuous mining of database where ever it resides i.e., as we have On Line Analytic Processing (OLAP)
we must have On Line Data Mining (OLDM), and logical model using Technology Acceptance Model.
Previous Studies showed that using basic information of customers, call details and customer service
related data, a model can effectively achieve accurate prediction data.
This research gives a new definition and classification for telecommunication services from the data
mining point of view. Then this research proposed a formula for total rank a shop and each term of this
formula gives a sub rank. The proposed example shows that even a shop with lower numbers of population
and visitors, it still has higher rank.
This research suggested that telecom operators has to concentrate more on their e-shopping and epayment
as it is more cost effective and use data from shops for marketing issues. Some assumptions made
in this study need to be validated using surveys, also proposed ranking should be applied on live database.
INTEGRATION OF MACHINE LEARNING TECHNIQUES TO EVALUATE DYNAMIC CUSTOMER SEGME...IJDKP
The telecommunications industry is highly competitive, which means that the mobile providers need a
business intelligence model that can be used to achieve an optimal level of churners, as well as a minimal
level of cost in marketing activities. Machine learning applications can be used to provide guidance on
marketing strategies. Furthermore, data mining techniques can be used in the process of customer
segmentation. The purpose of this paper is to provide a detailed analysis of the C.5 algorithm, within naive
Bayesian modelling for the task of segmenting telecommunication customers behavioural profiling
according to their billing and socio-demographic aspects. Results have been experimentally implemented.
Predicting churn with filter-based techniques and deep learningIJECEIAES
Customer churn prediction is of utmost importance in the telecommunications industry. Retaining customers through effective churn prevention strategies proves to be more cost-efficient. In this study, attribute selection analysis and deep learning are integrated to develop a customer churn prediction model to improve performance while reducing feature dimensions. The study includes the analysis of customer data attributes, exploratory data analysis, and data preprocessing for data quality enhancement. Next, significant features are selected using two attribute selection techniques, which are chi-square and analysis of variance (ANOVA). The selected features are fed into an artificial neural network (ANN) model for analysis and prediction. To enhance prediction performance and stability, a learning rate scheduler is deployed. Implementing the learning rate scheduler in the model can help prevent overfitting and enhance convergence speed. By dynamically adjusting the learning rate during the training process, the scheduler ensures that the model optimally adapts to the data while avoiding overfitting. The proposed model is evaluated using the Cell2Cell telecom database, and the results demonstrate that the proposed model exhibits a promising performance, showcasing its potential as an effective churn prediction solution in the telecommunications industry.
The importance of this type of research in the telecom market is to help companies make more profit.
It has become known that predicting churn is one of the most important sources of income to Telecom companies.
Hence, this research aimed to build a system that predicts the churn of customers i telecom company.
These prediction models need to achieve high AUC values. To test and train the model, the sample data is divided into 70% for training and 30% for testing.
Customer segmentation for a mobile telecommunications company based on servic...Shohin Aheleroff
Competition between the mobile operators is becoming more based on subscriber’s behavior. In order to improve mobile operator’s competitiveness and customer value, several data mining technologies can be used.Most telecommunications carriers cluster their mobile customers by billing system data. This paper discusses how to cluster mobile customers based on their call detail records and analyze their consumer behaviors.
I have done this analysis using SAS on a dataset with 5000 records. I have used CART and Logistic regression to build a predictive model to identify customers which are likely to shift to competitors network.
BigData Republic teamed up with VodafoneZiggo and hosted an meetup on churn prediction.
Telecom companies like VodafoneZiggo have long benefited from the fine art/science of predicting churn. Currently, in the booming age of subscription based business models (e.g. Netflix, Spotify, HelloFresh), the importance of predicting churn has become widespread. During this event, VodafoneZiggo shared some of its wisdom with the public, after which BDR Data Scientist Tom de Ruijter presented an overview of the modeling tools at hand, both classical, as well as novel approaches. Finally, the participants engaged in a hands-on session showcasing the implementation of different approaches.
PART 1 — Churn Prediction in Practice by Florian Maas
At VodafoneZiggo we are incredibly excited about Advanced Analytics and the enormous potential for progress and innovation. In our state of the art open source platform we store the tremendous amount of data that is generated every single second in our mobile and fixed networks. This means that we have a vast body of rich information, which if unlocked, can lead to something very special. As a company with a primarily subscription-based service model, churn plays a vital role in the daily business. Not only is the churn rate a good indicator of customer (dis)satisfaction, it is also one out of two factors that determines the steady-state level of active customers. During this talk, we will show how data science provides added value in the process of churn prevention at VodafoneZiggo. We will talk about the data and the modeling approach we use, and the pitfalls and shortcomings that we have encountered while building the model. We will also briefly discuss potential improvements to the current approach, which brings us to talk #2.
PART 2 — The Churn Prediction Toolbox by Tom de Ruijter
The second talk will show you the fine intricacies of predicting churn through different approaches. We’ll start off with an overview of different modeling strategies for describing the problem of churn, both in terms of a classification problem as well as a regression problem. Secondly, Tom will give you insights in how you evaluate a churn model in a way such that business stakeholders know how to act upon the model results. Finally, we’ll work towards the hands-on session demonstrating different model approaches for churn prediction, ranging from classical time series prediction to recurrent neural networks.
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRYIJITCA Journal
Churn customer, one of the most important issues in customer relationship management and marketing is especially in industries such as telecommunications, the financial and insurance. In recent decades much
research has been done in this area. In this research, the index set for the reasons set reason churn customers for our customers is of particular importance. In this study we are intended to provide a formula for the index churn customers, the better to understand the reasons for customers to provide churn. Therefore, in order to evaluate the formula provided through six Classification methods (Decision tree QUEST, Decision tree C5.0, Decision tree CHAID, Decision trees CART, Bayesian network, Neural network) to evaluate the formula will be involved with individual indicators
Projection pursuit Random Forest using discriminant feature analysis model fo...IJECEIAES
A major and demand issue in the telecommunications industry is the prediction of churn customers. Churn describes the customer who attrites from the current provider to competitors searching for better service offers. Companies from the Telco sector frequently have customer relationship management offices it is the main objective in how to win back defecting clients because preserve long-term customers can be much more beneficial than gain newly recruited customers. Researchers and practitioners are paying great attention to developing a robust customer churn prediction model, especially in the telecommunication business by proposed numerous machine learning approaches. Many approaches of Classification are established, but the most effective in recent times is a tree-based method. The main contribution of this research is to predict churners/non-churners in the Telecom sector based on project pursuit Random Forest (PPForest) that uses discriminant feature analysis as a novelty extension of the conventional Random Forest for learning oblique Project Pursuit tree (PPtree). The proposed methodology leverages the advantage of two discriminant analysis methods to calculate the project index used in the construction of PPtree. The first method used Support Vector Machines (SVM) while, the second method used Linear Discriminant Analysis (LDA) to achieve linear splitting of variables during oblique PPtree construction to produce individual classifiers that are robust and more diverse than classical Random Forest. It is found that the proposed methods enjoy the best performance measurements e.g. Accuracy, hit rate, ROC curve, Lift, H-measure, AUC. Moreover, PPForest based on LDA delivers effective evaluators in the prediction model.
Churn in the Telecommunications Industryskewdlogix
Strategic Business Analysis Capstone Project Telecommunications Churn Management
Churn is a significant problem that costs telecommunications companies billions of dollars through lost revenue. Now that the market is more mature, the only way for a company to grow is to take their competitors customers. This issue
combined with the greater choice that consumers have gained means that any adverse touch point with a consumer can result in a lost customer.
Telecommunication Analysis(3 use-cases) with IBM cognos insightsheetal sharma
The purpose of this study is, with the help of IBM Cognos Insight analyze why customers are not used the connection of Bits Telecom Company, which factors are influence the churn. Also see the cross selling and up-selling, also focus on profitability and investment and find out the way for better results.
Many customers often switch or unsubscribe (churn) from their telecom providers for a variety of reasons. These could range from unsatisfactory service, better pricing from competitors, customers moving to different cities etc. Therefore, telecom companies are interested in analyzing the patterns for customers who churn from their services and use the resultant analysis to determine in the future which customers are more likely to unsubscribe from their services. One such company is Telco Systems. Telco Systems is interested in identifying the precise patterns for their churning customers and have provided the customer data for this project.
Proposed ranking for point of sales using data mining for telecom operatorsijdms
This study helps telecom companies in making decisions that optimize its sales points to reduce costs, also
to identify profitable customers and churn ones. This study builds two research models; physical model for
continuous mining of database where ever it resides i.e., as we have On Line Analytic Processing (OLAP)
we must have On Line Data Mining (OLDM), and logical model using Technology Acceptance Model.
Previous Studies showed that using basic information of customers, call details and customer service
related data, a model can effectively achieve accurate prediction data.
This research gives a new definition and classification for telecommunication services from the data
mining point of view. Then this research proposed a formula for total rank a shop and each term of this
formula gives a sub rank. The proposed example shows that even a shop with lower numbers of population
and visitors, it still has higher rank.
This research suggested that telecom operators has to concentrate more on their e-shopping and epayment
as it is more cost effective and use data from shops for marketing issues. Some assumptions made
in this study need to be validated using surveys, also proposed ranking should be applied on live database.
INTEGRATION OF MACHINE LEARNING TECHNIQUES TO EVALUATE DYNAMIC CUSTOMER SEGME...IJDKP
The telecommunications industry is highly competitive, which means that the mobile providers need a
business intelligence model that can be used to achieve an optimal level of churners, as well as a minimal
level of cost in marketing activities. Machine learning applications can be used to provide guidance on
marketing strategies. Furthermore, data mining techniques can be used in the process of customer
segmentation. The purpose of this paper is to provide a detailed analysis of the C.5 algorithm, within naive
Bayesian modelling for the task of segmenting telecommunication customers behavioural profiling
according to their billing and socio-demographic aspects. Results have been experimentally implemented.
Predicting churn with filter-based techniques and deep learningIJECEIAES
Customer churn prediction is of utmost importance in the telecommunications industry. Retaining customers through effective churn prevention strategies proves to be more cost-efficient. In this study, attribute selection analysis and deep learning are integrated to develop a customer churn prediction model to improve performance while reducing feature dimensions. The study includes the analysis of customer data attributes, exploratory data analysis, and data preprocessing for data quality enhancement. Next, significant features are selected using two attribute selection techniques, which are chi-square and analysis of variance (ANOVA). The selected features are fed into an artificial neural network (ANN) model for analysis and prediction. To enhance prediction performance and stability, a learning rate scheduler is deployed. Implementing the learning rate scheduler in the model can help prevent overfitting and enhance convergence speed. By dynamically adjusting the learning rate during the training process, the scheduler ensures that the model optimally adapts to the data while avoiding overfitting. The proposed model is evaluated using the Cell2Cell telecom database, and the results demonstrate that the proposed model exhibits a promising performance, showcasing its potential as an effective churn prediction solution in the telecommunications industry.
An efficient enhanced k-means clustering algorithm for best offer prediction...IJECEIAES
Telecom companies usually offer several rate plans or bundles to satisfy the customers’ different needs. Finding and recommending the best offer that perfectly matches the customer’s needs is crucial in maintaining customer loyalty and the company’s revenue in the long run. This paper presents an effective method of detecting a group of customers who have the potential to upgrade their telecom package. The used data is an actual dataset extracted from call detail records (CDRs) of a telecom operator. The method utilizes an enhanced k-means clustering model based on customer profiling. The results show that the proposed k-means-based clustering algorithm more effectively identifies potential customers willing to upgrade to a higher tier package compared to the traditional k-means algorithm. Our results showed that our proposed clustering model accuracy was over 90%, while the traditional k-means accuracy was under 70%.
The data set used in this project is available in the Kaggle and contains nineteen columns (independent variables) that indicate the characteristics of the clients of a fictional telecommunications corporation. The Churn column (response variable) indicates whether the customer departed within the last month or not. The class No includes the clients that did not leave the company last month, while the class YES contains the clients that decided to terminate their relations with the company. The objective of the analysis is to obtain the relation between the customer’s characteristics and the churn.
Trying to find an edge over your telecom competitors? At PNA, our Data Analytics services can help you find the patterns unseen to human eyes. Stop trying to find edges and start getting ahead of your competition.
Customer churn occurs when customers or subscribers stop doing business with a company or service.
Also known as customer attrition, customer churn is a critical metric because it is much less expensive to retain existing customers than it is to acquire new customers – earning business from new customer’s means working leads all the way through the sales funnel, utilizing your marketing and sales resources throughout the process.
Predicting reaction based on customer's transaction using machine learning a...IJECEIAES
Banking advertisements are important because they help target specific customers on subscribing to their packages or other deals by giving their current customers more fixed-term deposit offers. This is done through promotional advertisements on the Internet or media pages, and this task is the responsibility of the shopping department. In order to build a relationship with them, offer them the best deals, and be appropriate for the client with the company's assurance to recover these deposits, many banks or telecommunications firms store the data of their customers. The Portuguese bank increases its sales by establishing a relationship with its customers. This study proposes creating a prediction model using machine learning algorithms, to see how the customer reacts to subscribe to those fixed-term deposits or offers made with the aid of their past record. This classification is binary, i.e., the prediction of whether or not a customer will embrace these offers. Four classifiers that include k-nearest neighbor (k-NN) algorithm, decision tree, naive Bayes, and support vector machines (SVM) were used, and the best result was obtained from the classifier decision tree with an accuracy of 91% and the other classifier SVM with an accuracy of 89%.
Analysis of Crime Big Data using MapReduceKaushik Rajan
Analyzed Crime Big data of Washington DC to solve the following business queries:
> Which hour has the highest crime count?
> Which shift has the highest crime count?
> Year wise crime count
> Hour wise crime count
> Crime count by an offense
> Average of Shift wise crime count
The data was initially stored in MySql which was then moved to HDFS using SQOOP, from where 4 MapReduce operations are doing using JAVA in Eclipse IDE. The outputs of the queries are then moved to HBase using SQOOP. Two more MapReduce operations are done using PIG, the output of which is also moved to HBase using SQOOP. All the outputs were then moved to the local system and are visualized using RStudio and Tableau.
Tools used:
> MySQL, HDFS and HBase to store the data
> SCOOP to move the data from one database to another
> JAVA (Eclipse IDE) and PIG to run the MapReduce queries
> RStudio for data pre-processing and visualization
> Tableau for visualization
> LATEX for Documentation
Infographic follows Andy Kirk's Three Principles of Good Visualization Design.
The infographic contains the information on both the negatives and positives of the smartphone boom in the USA.
8 graphs have been created in total,
1) Number of smartphone users in US – year wise
2) Evolution of smartphone – 1993 to 2012
3) Most preferred brand of smartphone
4) Number of internet users – State wise
5) Correlation plot between smartphone users and internet users in New Hampshire
6) Correlation plot between smartphone users, internet users and GDP of US – Year wise
7) Timeframe in which smartphones are being used – Before sleeping and after waking up
8) Smartphone which emits the highest radiation
Business magazine-style report on World Suicide Rate Analysis.
Andy Kirk's The Three Principles of Good Visualization Design was followed to create the report.
Tools used: RStudio, Tableau and Canva
Graphs plotted:
1) Map Chart is used to show the amount of suicide in each country
2) A horizontal bar graph is used to further compare the differences in suicide
counts in each country
3) A single line graph is used to show the amount of suicides committed each year
during the period 1985-2016
4) Stacked 100% Area graph is used to compare the suicide counts among
different age groups, namely 5-14 years,15-24 years, 25-34 years, 35-54 years,
55-74 years
5) Side by side bar chart is used to compare the number of suicide counts
among different age groups, sex-wise
6) A pie chart is used to see the composition of causes of deaths in the US in the year
2017
7) Bubble Chart is used to check out the methods by which people commit
suicide and to check which one causes the maximum death
8) A line graph is used to compare the Suicide count and happiness index for the years 2006 to 2015
9) The correlation matrix is used to find the correlation between different elements.
10) Side by side line graph is used to compare gender wise suicide
percentage in the US
11) Treemaps are used to see the composition of the number of suicides
among the states of US
12) Side by side area graph is used to see the availability of different drugs in the US
over time
Data Warehousing and Business Intelligence Project on Smart Agriculture and M...Kaushik Rajan
Implemented a Data Warehouse on smart agriculture to solve various Business Intelligence queries. Integrated multiple datasets from 3 different data sources including both structured and unstructured data.
Tools used:
> SQL Server Integration Services for ETL
> SQL Server Management Services for Database
> SQL Server Analysis Services for building the Schema
> Tableau and PowerBI for Visualization
> R for data preprocessing
> LATEX for documentation
Video Presentation: https://www.youtube.com/watch?v=0oIlLQcyPdM
Performance Analysis of HBASE and MONGODBKaushik Rajan
Comparison of different NoSQL databases,
namely, HBase and MongoDB at different workloads using Yahoo Cloud Serving Benchmarking (YCSB)
Tools used
> HBase, MongoDB, Shell Scripting, YCSB, Hadoop Environment
> Tableau for Visualization
> LATEX for documentation
Multiple Regression and Logistic RegressionKaushik Rajan
1) Multiple Regression to predict Life Expectancy using independent variables Lifeexpectancymale, Lifeexpectancyfemale, Adultswhosmoke, Bingedrinkingadults, Healthyeatingadults and Physicallyactiveadults.
2) Binomial Logistic Regression to predict the Gender (0 - Male, 1 - Female) with the help of independent variables such as LifeExpectancy, Smokingadults, DrinkingAdults, Physicallyactiveadults and Healthyeatingadults.
Tools used:
> RStudio for Data pre-processing and exploratory data analysis
> SPSS for building the models
> LATEX for documentation
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
1. Customer churn classification using machine learning techniques and
comparing various classification methods
Adhithya Vattiam Ganesan, Kaushik Rajan, Sindhujan Dhayalan, Sushmit Kajalekar
X18105203, X17165849, X17170265, X18107001
School of computing
National college of Ireland
Abstract—Abstract: The telecommunication sector is rapidly
developing every day with the service providers offering a wide
range of services and benefits to the customers for various
reasons, but they also have a customer churn rate, i.e. the
number of customers choosing to leave the services offered by
one telecom service provider to another service provider. This
directly affects the business of the company leading to huge
loss. CRISP-Data Mining model is used to plan, understand and
implement machine learning methods to predict the impact of the
factors affecting customer retention and if a customer will stay or
leave the company. To do the customer churn prediction, dataset
from Kaggle is taken. Initially, the factors which has the highest
correlation with the customer churn variable is determined. The
customer dataset is taken and trained, then multiple algorithms
such as Random Forest, Decision tree algorithms (CART & C5.0),
Nave Bayes, KNN & ANN are applied to classify if the customer
will leave or stay with the company. All the used models are
evaluated and compared based on accuracy and ROC curve.
the impact of various factors leading to customer churn rate
and compared the models for churn prediction accuracy. C5.0
approach was found to be the best fit for churn prediction with
easy implementation and best fit for the problem.
Index Terms—Random forest, C5.0, KNN, ANN, Naive Bayes,
Decision tree, SVM.
I. INTRODUCTION
Telecommunication sector is at its peak with the
advancement of technologies, they offer a wide range
of services from internet, mobile services and cable
subscriptions all together as a bundle or one can select the
services separately based on preference. This sector has a lot
of players and each of them with the intention of trying to
control the market share provide offers to lure the customers
to change to their service providers for better services and
offers. This movement of customers from one service provider
to another service provider is known as customer churn rate.
This movement of customers affects the service providers in
many ways, first this tarnishes the reputation of the company
in the market creating a bad image of the service provided.
Then secondly this leads to a business and financial loss. The
service providers can cope up with business loss to some
extent but once the reputation of the company is gone then it
becomes irrevocable, which no organisation would want to
face. Most of the telecommunication companies keep losing
customers to competitors from time to time. The customers
that stopped using the company’s product or service during
a certain time frame is known as customer churn [1]. On
multiple instances it is observed that retaining present clients
is a whole lot profitable and cheaper than attracting an entirely
new client. Hence, customer churn prediction is turning into
the top concerns that many organizations dedicate their time
and assets to deal with it.
The service provider analyses the reason for a customer to
discontinue their services to improve their business strategies,
it is estimated that the cost of acquiring a new customer
is 6-7 times more expensive than the cost of retaining
an existing customer. Hence the service providers spend
time analysing the reasons for churn, this may be due to
dissatisfaction, cheaper or more affordable services provided
by the competitors or better sales and marketing strategies
employed by the competitors. Each of which indicating the
areas where the company falls behind its competitor in the
market [2].
In this paper we analyse the most commonly applied
machine learning techniques for customer churn prediction
by applying Decision Tree algorithms, K-Nearest Neighbour
algorithm, Random Forest algorithm and Artificial
neural network to predict the customer churn rate using
telecommunication customer data and then compare the
models based on their prediction accuracy, error rate and the
ease of performing the classification. The objective of the
paper is to find the best model for Telecom customer churn
prediction.
II. LITERATURE REVIEW
The classification problem associated with customer churn
prediction system can be fixed by using decision tree method.
The information theory is used to create information nodes,
branches as per the various values of different aspects[3]. It
examines the statistics of capabilities from large instances,
calculate the entropy of particular feature, and locate the
maximum essential function for category. For over 13000
records in a customer churn data, the decision tree method
showed 88.53% accuracy for prediction[4].
The imbalance between customer churn data and predicting
applicable solutions on it often results in poor performance
of learning and also it is difficult for machine learning
algorithms[5]. The other classifier is c5.0 which gives good
2. performance over the large data set. It performs constantly
when variety of variables is large. The boosting technique is
carried out in this set of rules to enhance the precision. The
other purpose is that the C5.0 selection tree can overcome the
scatter feature of the telecommunication information. This
algorithm solves few problems such as low rate of churn,
scatter problem. It provides sampling method which removes
noise, redundancy and performs separate clustering. Precision
ratio is 80% and prediction results are satisfactory enough to
retain the potential churn customers[6].
In case of the company which is recently established
or newly adopted some different technology or has lost
the historical information associated with the customers,
the traditional churn prediction systems wont be useful.
In such cases JIT(Just-In-Time) approach is more practical
alternative. This CCP process categories customers as churn
and non-churn according to the probability of a customer
activity in a nearer future. The four important aspect of this
approach are behaviour of customer, the loyalty of customer
with the company, customers personal details and some
external factors. This model follows a Nave Bayes approach
which calculates posteriori probabilities[7].
Another approach is based on information processing mech-
anism. It is primarily based on a large series of different neural
networks. The structure of these network is determined by the
inter connection between them. These networks are trained to
achieve desirable outcome. Their maximum excellent charac-
teristic is their capability to research automatically from the
information so that one can provide a method for predictions.
Hence these models are used to build efficient, reliable and
high accuracy customer churn prediction system[8].
Support Vector Machines are used for minimizing the risks.
It uses the kernel method to draft the data to a 2D or 3D
space. The data is divided in linear manner so that further
classification can be done. In this paper, an approach for churn
prediction by using SVM is discussed. To gain more accurate
results and performing, a grid search manner along with the
SVM was used. The SVM for customer churn prediction
method by using SVM provided 98.7% accuracy[9].
The analysis of the imbalanced data for the churn prediction
based on various predictive model with respect to KNN
classifier. The ratio of non-churn customers compared to churn
clients is only 2 percent, therefore lowering the imbalance
of the instructions before enforcing the models may be very
critical. Another important issue raised due to ratio imbalance
is over sampling resulting in redundancy. This is handled
by the algorithm SMOTE (Synthetic Minority Oversampling
Technique). In the crucial aspect of model selection, the
implementation of various model like AdaBoost, Extra Trees,
kNN, Neural Network and XGBoost[10] is compared. The
study is carried out based on the different resources provided
by one of the internet service providers of leading Telecom
Company. Usage based on numerous statistical assessments;
121 most relevant functions had been decided on to use for
training. these attributes were accumulated over 365 days
and examined the seasonal effect. The non-parametric lazy
learning algorithm used to carry out the classification is KNN
known as k-Nearest Neighbour. The algorithm takes all the
nearer cases in training set into the consideration using mea-
sures like Euclidian distance. This classifier is to understand
statistical analysis of customer churn and pattern recognition.
XGBoost feature importance score is used to understand the
top number of customer variable and best fit model. In the sim-
ilar way, maximum appropriate top functions for AdaBoost,
more trees, KNN and Neural network version are determined
by evaluating their performances. The important aspect of this
study is feature engineering on modelling phase and providing
methodology which can handle imbalanced information in
churn prediction system[11].
Customer churn analysis, is an important factor to be
considered while analysing company profits and revenues.
The only challenge in case of churn analysis is low rate of
churned customers in entire data which causes imbalance data
information and data sets. The paper put forth two approaches
to analyse the imbalanced data. One is sampling approach and
another one is known as cost-sensitive approach. The motive
of this analysis is to address the imbalanced records in the
dataset used in this study, so that it can produce greater per-
formance in churn prediction model. The dataset is processed
by the sampling approach and WRF classifier classification
used on the data produces inadequate results hence sampling
techniques are used to minimize the unbalancing between the
classes churn and non-churn. SMOTE[12] is used to increase
the probability of fetching the data from churn class. The two
methods mentioned earlier to reduce imbalanced data between
the classes, uses many classifiers to solve this issue such as,
logistic regression, linear classification, nave Bayes, decision
tree, multi-layer perception neural networks, support vector
machine, data mining evolutionary algorithm and random
forest. The paper proposes the classification algorithm along
with WRF and sampling which is used to generate multiple
decision trees from the training data and provide the churn
result with precision. The paper tests proposed model and
data with 10 cross validations to obtain result with maximum
precision. Which proves this model provides the churn analysis
with reliable accuracy and can assist in reducing potential
customer churn[5].
III. METHODOLOGY
CRISP DM Model has been used for this Data Mining
project. Cross-Industry Standard Process for Data Mining is a
6-phase constructed approach used for data mining project.
A. Business Understanding
This is the initial stage of the Data Mining project. In-
formation regarding the customer churn and details on the
determined outputs are decided.
Business Questions:
1) Which variables have the highest impact on the customer
retention?
3. Fig. 1. CRISP DM Model
2) Which model performs the best at classifying if the
customer will leave or stay?
B. Data Understanding
For this project, Customer churn dataset created on 2018-02-
23 is sourced Kaggle1
. The dataset consists of 21 attributes and
details with the datatype of each variable is given in figure2.
Fig. 2. Dataset information
C. Data Preparation
Data Preparation and pre-processing is the most important
stage of any data mining project. In this stage, the dataset is
checked for outliers, missing values and also if the attributes
are in proper datatype. Following steps were done in order to
convert the data to the stage where it is suitable for model
implementation:
1https://www.kaggle.com/blastchar/telco-customer-churn/metadata
1) The datatypes of few attributes were changed. The
categorical variables were converted into factors.
2) The dataset was checked for outliers. Interquartile range
rule was used and the range value is set based on the Q1
and Q3. The IQR rule was used on the tenure attribute.
3) Missing values were replaced with the mean of the
attribute instead of removing the row.
4) CustomerID was removed as it was not used for predic-
tion.
5) The variables were normalized using Z-score normaliza-
tion. The normalization was done using scale() function
in R.
Fig. 3. Boxplot of Tenure
Fig. 4. Boxplot of Totalcharges
D. Modelling and Evaluation of the models
After the data was cleaned and converted as per the re-
quirement, we split the dataset into training and test data in
order to implement the models. The models learn from the
training set and implement on the test set. The models that are
implemented are Random forest, C5.0, K-Nearest Neighbour,
Decision Tree and Artificial neural network.
Random Forest: Random forest is an ensemble of decision
tree which can be used for regression and classification. The
4. main advantage of random forest is that it does not produce
overfit results and also produces great accuracy. We have used
random forest to classify if the customer will churn or not.
The confusion matrix is given in figure5.
Fig. 5. Random forest model Results- Rcode
The Random forest produced results with 73.08% accuracy
and with the error rate of 0.27. The area under the curve value
is 0.63 which falls in the D category,The ROC of the model
is displayed in figure6.
Fig. 6. ROC of Random forest model
Business question 1:
Random forest to find the importance the variables in
classifying the customer churn variable:
We use random forest to determine the importance of each
of the variable and to find which variable has the highest
correlation with the customer churn variable.
From the above plot we can infer that the variables Tenure,
Total Charges, contract and monthly charges are the most
important variables which are used to classify the customer
churn.
Fig. 7. Parameters contributing the most to accuracy of the model
C5.0: C5.0 is an updated version of C4.5 algorithm which
can produce 2 kinds of models, namely, decision tree and rule
set. The C5.0 can also find the attribute relevance. We have im-
plemented the C5.0 algorithm by using Tenure, Total Charges,
Contract and monthly charges as independent variables. The
model is evaluated based on the value of accuracy, error rate,
precision, recall and AUC value.
Fig. 8. C5.0 Results - Rcode
The accuracy was found to be 77.05% with error rate of
0.22. The values of precision and recall were found to be
0.81, 0.88 and the F1 value is 0.84.AUC value was found to
be 0.79 which falls under C category.Model of C5.0 and ROC
curve is displayed in figure9 and figure10 respectively.
NAIVE BAYES Naive Bayes algorithm is based on the
Bayes theorem of probability distribution. Nave Bayes was
implemented and the accuracy was found to be 69.90. The
confusion matrix is displayed in figure11.
The error rate was 0.30 and precision, recall and F1 values
are 0.88, 0.67 and 0.76 respectively.The value of area under
5. Fig. 9. C5.0 Model
Fig. 10. C5.0 Model ROC curve
Fig. 11. Naive Bayes Results RCode
the curve was found to be 0.79 which falls under C category.
The ROC curve of the model is displayed in figure12.
Decision Tree: Decision tree is a graphical representation
of all the possible solution to a decision. The main advantage
Fig. 12. ROC Curve of Naive Bayes model
of a decision tree is that it can be easily explained. We have
implemented the Decision tree algorithm and got the following
results. The confusion matrix is given below in the figure13.
The accuracy, error rate, precision, recall and F value were
Fig. 13. Decision Tree Results Rcode
found to be 77.96, 0.22, 0.82, 0.89 and 0.85 respectively.The
area under the curve value was found to be 0.79 which falls
under the C category.The Model of decision tree and ROC
curve is displayed in figure14 and figure15.
K-Nearest Neighbour: It is a supervised machine learning
classification model. The main advantages if KNN are it is
simple and effective and does not have any assumption. The
disadvantages of KNN are that it requires large amount of
memory and does not work well if the dataset has missing
values. We have implemented the KNN algorithm with K
= 1,5,10 and 30. The results obtained show the accuracy
increases with increase in K value. Figure16, Figure17 and
Figure18 show Rcode implementation of KNN model for K
value= 1, 5 and 30 respectively.
We can see that the KNN produces highest accuracy of
78.30% when the K value is 30.Comparison of Performance
6. Fig. 14. Decision Tree Model
Fig. 15. Decision Tree ROC curve
Fig. 16. KNN Model with K = 1 - Rcode
measure when values of K are 1,5,10 and 30 are displayed in
figure19.
Support vector machine (SVM): SVM is a supervised
method which can be used for classification and regression.
Fig. 17. KNN Model with K = 5 - Rcode
Fig. 18. KNN Model with K = 30 - Rcode
Fig. 19. Comparison of KNN model with different K values
SVM is good at handling non-linear and complex dataset. The
only disadvantage of SVM is that it is a Blackbox method. We
have implemented SVM and obtained the following results.
The Confusion matrix is displayed in figure20.
The accuracy, error rate, precision, recall and F value were
found to be 78.13%, 0.21, 0.83, 0.87 and 0.85 respectively.
Artificial Neural Network: ANN is a supervised algorithm
which is typically used for regression and classification. The
main disadvantages are it is difficult to understand as it is a
black box method and also takes long time for processing.
ANN was implemented and the results displayed in figure21
obtained. Accuracy of ANN was found to be 79%
The Model of ANN and ROC Curve is displayed in figure22
and figure23 respectively.
7. Fig. 20. SVM model Results - RCode
Fig. 21. ANN - Results
Fig. 22. ANN - Model
Business question 2:
Which model produce better results?
• Accuracy of Random Forest - 73.08%
• Accuracy of C5.0 - 77.05%
Fig. 23. ANN - ROC Curve
• Accuracy of Decision Tree 77.96%
• Accuracy of Naive Bayes - 69.90%
• Accuracy of KNN - 78.30%
• Accuracy of SVM - 78.13%
• Accuracy of ANN - 79.15%
ANN produces better accuracy compared to other
models. Model comparison of Random Forest,Decision
tree(CART),C5.0 and Naive Bayes is displayed in figure24
Fig. 24. ROC Curve comparison of RF,CART,C5.0 and NB
E. Deployment
This is the final phase of a CRISP DM Model where the
models are deployed in real life scenarios.
IV. CONCLUSION AND FUTURE WORK
In this project, we have classified the customer churn by
using machine learning algorithms such as random forest,
C5.0, Decision tree, KNN, ANN and SVM. We have also
found the importance of each variable in contribution to the
classification of customer churn variable. In future, we can
work with a larger dataset and also use various other attributes
to do the same task of classifying the customer churn.
8. REFERENCES
[1] S. Amaresan, “What is customer churn?” 2019. [Online]. Available:
https://blog.hubspot.com/service/what-is-customer-churn
[2] M. Karanovic, M. Popovac, S. Sladojevic, M. Arsenovic, and D. Ste-
fanovic, “Telecommunication services churn prediction-deep learning
approach,” in 2018 26th Telecommunications Forum (TELFOR). IEEE,
2018, pp. 420–425.
[3] H. Ma, M. Qin, and J. Wang, “Analysis of the business customer churn
based on decision tree method,” in 2009 9th International Conference
on Electronic Measurement & Instruments. IEEE, 2009, pp. 4–818.
[4] F. Guo and H.-L. Qin, “The analysis of customer churns in e-commerce
based on decision tree,” in 2015 International Conference on Computer
Science and Applications (CSA). IEEE, 2015, pp. 199–203.
[5] V. Effendy, Z. A. Baizal et al., “Handling imbalanced data in customer
churn prediction using combined sampling and weighted random forest,”
in 2014 2nd International Conference on Information and Communica-
tion Technology (ICoICT). IEEE, 2014, pp. 325–330.
[6] H. Li, D. Yang, L. Yang, X. Lin et al., “Supervised massive data
analysis for telecommunication customer churn prediction,” in 2016
IEEE International Conferences on Big Data and Cloud Computing
(BDCloud), Social Computing and Networking (SocialCom), Sustainable
Computing and Communications (SustainCom)(BDCloud-SocialCom-
SustainCom). IEEE, 2016, pp. 163–169.
[7] A. Amin, B. Shah, A. M. Khattak, T. Baker, S. Anwar et al., “Just-in-
time customer churn prediction: With and without data transformation,”
in 2018 IEEE Congress on Evolutionary Computation (CEC). IEEE,
2018, pp. 1–6.
[8] S. H. Dolatabadi and F. Keynia, “Designing of customer and employee
churn prediction model based on data mining method and neural
predictor,” in 2017 2nd International Conference on Computer and
Communication Systems (ICCCS). IEEE, 2017, pp. 74–77.
[9] A. Rodan, H. Faris, J. Alsakran, and O. Al-Kadi, “A support vector ma-
chine approach for churn prediction in telecom industry,” International
journal on information, vol. 17, no. 8, pp. 3961–3970, 2014.
[10] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”
in Proceedings of the 22nd acm sigkdd international conference on
knowledge discovery and data mining. ACM, 2016, pp. 785–794.
[11] D. Do, P. Huynh, P. Vo, and T. Vu, “Customer churn prediction in an
internet service provider,” in 2017 IEEE International Conference on
Big Data (Big Data). IEEE, 2017, pp. 3928–3933.
[12] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote:
synthetic minority over-sampling technique,” Journal of artificial intel-
ligence research, vol. 16, pp. 321–357, 2002.