report

Predicting Customer Churn at QWE Inc.
Group 11:
Qiang Gong
Jiaxuan Han
Meghan Hickey
Xiangheng Ma
Yawen Yang

ID      Days Since Last Login 0-1    Probability of Churn
354     -1                           4.49%
672     2                            4.83%
5203    5                            5.19%
Executive Summary
In this paper, we estimate the probability that each of QWE Inc.’s customers will churn by the end of February 2012. The report
uses a mix of modelling methods to classify customers by their likelihood of churning. Based on logistic regression analysis,
we found that Customer Happiness Index in December, CHI change from November to December, and the difference
in Days Since Last Login between December and November are the three most important factors QWE can use to predict
a customer’s probability of terminating the contract. We explain the application of this method and then look
at how a decision tree approach further validates the findings from logistic regression. To fully demonstrate our
findings, we use customers 354, 672 and 5203 as illustrations throughout the paper and discuss the specific applications
of the different modeling techniques to each of these three cases. We also provide a list of the 10 customers
with the highest churn probability so that the company can take a more proactive approach to retaining them.
Analysis:
Logistic Regression Analysis
Single Variable Approach
For QWE, Inc., the goal is to predict whether a customer will
terminate their contract based on certain factors and to assess
the importance of those factors, so that QWE has an accurate
model for future reference. The first step of our analysis was
to run logistic regression, as the method is suited to
analyzing a dataset in which one or more variables determine
an outcome that is dichotomous (here, churn or no churn).
To find the best predictor, we first determined which
variables are significantly related to the outcome.
We evaluated the variables by looking at their p-values for
significance and their standardized coefficient values, and
were able to filter out the following variables, which are not
significant at all: Age (how long they have been a customer
with QWE), Support Case 0-1 (the difference between
service requests from Nov. to Dec.), SP 0-1 (the difference
between seriousness of cases reported from Nov. to Dec.),
Blog Articles 0-1 (the difference between blog articles posted
from Nov. to Dec.), and Views 0-1 (the difference between views
from Nov. to Dec.).
To illustrate this finding, we calculated the probability of
churn of customers 354, 672 and 5203 based on the change
of Days Since Last Login between Nov and Dec. Their
probabilities of churn are listed below.
We can see that the probability of churn is positively
associated with the difference in Days Since Last Login
between Dec and Nov. The smaller the difference, that is,
the more active a customer becomes, the lower the
probability that the customer will churn. We also noticed
that the effect is weak: even when a customer becomes more
active, namely, when the difference in Days Since Last Login
between Nov and Dec decreases, the probability of churn
does not decrease noticeably. All three probabilities are
low (about 4.5% to 5.2%).
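
As a rough check on the single-variable results, the log-odds implied by the three reported probabilities can be computed. If the model has the usual one-variable logistic form, logit(p) = b0 + b1 * x, the log-odds should be roughly linear in Days Since Last Login 0-1. This is only a sketch: it uses the three values reported above, and the fitted coefficients themselves are not given in this report, so the slope it prints is an implied approximation rather than the SPSS output.

```python
import math

# Values copied from the report's single-variable table:
# (Days Since Last Login 0-1, predicted probability of churn)
reported = [(-1, 0.0449), (2, 0.0483), (5, 0.0519)]


def log_odds(p):
    """Logit transform: the quantity a logistic regression models linearly."""
    return math.log(p / (1.0 - p))


points = [(x, log_odds(p)) for x, p in reported]

# Slope of the log-odds between consecutive customers. Under a one-variable
# logistic model both slopes should be close to the same coefficient b1.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]

for x, y in points:
    print(f"x = {x:>2}  log-odds = {y:.3f}")
print("implied slope per unit change:", [round(s, 4) for s in slopes])
```

Both implied slopes are small and positive (a few hundredths per unit), which matches the observation that the probability rises only slightly as the login gap widens.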
Then we compared the absolute values of the standardized
coefficients of the remaining variables. The higher the
absolute value, the bigger the impact the variable has on
the predicted churn probability. We found that “Days Since
Last Login 0-1” has the biggest absolute value. Therefore, we
conclude that “Days Since Last Login 0-1” is the most
impactful single predictor. This makes sense intuitively
because the variable indicates the change in recency between
Nov and Dec, which tells us whether a customer has become
more active or not.
• USE these predictions of probability to help QWE improve
their business
• FIND the best predictor of the probability of
churn
• ENABLE QWE to use the wealth of data they possess to
identify customers who are most likely to leave

Then we used a Receiver Operating Characteristic (ROC)
curve to evaluate the performance of the single-variable
classifier (please refer to the term explanation below). We
started with the variable Days Since Last Login 0-1 to see
whether the logistic regression model with this single
variable is accurate enough. After feeding the model
customer behavior data, we generated probabilities of churn
for all customers in the database. Then we ran ROC analysis
in SPSS and came up with the following graph:
If the entire ROC curve lies below the benchmark
diagonal, the model performs poorly because it is
worse than random guessing. In the case of QWE, one
part of the curve does go under the benchmark.
Moreover, the AUC of 0.589 is only slightly larger
than 0.5, which is the AUC of the diagonal. Combining these
two facts, we conclude that the predictive model using
Days Since Last Login 0-1 alone is not sensitive enough and
does not predict outcomes very accurately.
In fact, when performing the ROC analysis for the
remaining variables, we found that all of the AUCs are
similarly low (slightly higher than 0.5). Hence, using a
single variable to predict customer churn probability may
not give us the best result. We think it is necessary to
devise a better model that looks at multiple factors at once
so that QWE can more accurately track the behavior of its
customers and understand what it means.
Three Factors Approach
Based on the previous logistic regression analysis, we selected
the six variables with the highest significance (CHI Month
0, CHI 0-1, Support Cases Month 0, Support Priority 0-1,
Logins 0-1 and Days Since Last Login 0-1) to be included in
a logistic regression model. However, we found that three of
them had no significant impact on the predicted
probability, so we decided to analyze only the three that
did. With this method, we
determined Customer Happiness Index in December,
CHI change from November to December, and the
difference in Days Since Last Login between December
and November to be the three best factors for our
prediction modelling because they contribute the most
to the churn probability.
To illustrate, we used this updated model with three key
variables instead of just one to calculate the churn
probability for customers 354, 672 and 5203. Their
probabilities of churn are 4.73%, 3.59%, and 4.46%
respectively, which are pretty low.
The y-axis is TPR and the x-axis is FPR. (SPSS names
them Sensitivity and 1 - Specificity respectively, but
they are the same quantities.) The diagonal line
represents a benchmark ROC curve obtained by simple
guessing (i.e., flipping a coin) to predict a positive or
negative outcome.
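
To make the axes and the diagonal benchmark concrete, the following is a minimal sketch of how an ROC curve and its AUC can be computed by sweeping the cutoff over the predicted probabilities. It is not the SPSS procedure used for this report; the function names `roc_points` and `auc` and the toy labels and scores are illustrative assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the cutoff over all scores.

    labels: 1 for customers who churned, 0 for those who stayed.
    scores: predicted churn probabilities.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Highest scores first: lowering the cutoff admits one score group at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Customers with tied scores cross the cutoff together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical toy data, not taken from the QWE dataset.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.22, 0.18, 0.16, 0.09, 0.07, 0.05, 0.04, 0.03]

curve = roc_points(labels, scores)
print("AUC of toy model:", round(auc(curve), 3))

# A model that gives every customer the same score traces the diagonal.
flat = roc_points(labels, [0.5] * len(labels))
print("AUC of constant scores:", auc(flat))
```

The constant-score case illustrates why 0.5 is the reference value: a model that cannot rank customers at all traces the diagonal.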
Term: ROC
ROC is one of the most commonly used methods for measuring
whether a classification is effective, and it has important
advantages. First, it gives the true positive rate (TPR) and
false positive rate (FPR) across all possible cutoff points
rather than at just one specific cutoff point. Second, the
area under the curve (AUC) is a useful metric that summarizes
the overall performance of the model in a single number.
Term: TPR and FPR
TPR denotes the percentage of customers we correctly
predicted would churn out of all customers who actually
churned, while FPR denotes the percentage of customers we
predicted would churn but who did not, out of all customers
who did not churn in reality.
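
The definitions above, together with the precision and accuracy figures reported for the decision tree, can all be expressed in terms of confusion-matrix counts. The sketch below shows the arithmetic; the function name `classification_metrics` and the counts are hypothetical and are not the counts behind the figures in this report.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, accuracy, TPR and FPR from confusion-matrix counts.

    tp: predicted churn, actually churned
    fp: predicted churn, actually stayed
    tn: predicted stay, actually stayed
    fn: predicted stay, actually churned
    """
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts for illustration only (churn is the rare class).
metrics = classification_metrics(tp=40, fp=160, tn=1740, fn=60)
for name, value in metrics.items():
    print(f"{name:<9} {value:.2%}")
```

When churners are rare, accuracy can be high while precision stays low, because correctly predicted non-churners dominate the total.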

ID                           354      672      5203
CHI Month 0                  139      148      37
CHI 0-1                      -29      1        32
Days Since Last Login 0-1    -1       2        5
Probability of Churn         4.73%    3.59%    4.46%
Based on the churn probabilities from the updated
logistic regression model, we created a list of the ten
customers with the highest likelihood of leaving the
company. The listing below gives the IDs and corresponding
churn probabilities for these ten risky customers. Note that in
our dataset the predicted probability ranges from 0% to 22.4%.
Although none of these probabilities approaches 90% or 100%,
the highest ones are large enough to indicate elevated churn
risk relative to the rest of the customer base. Being able to
identify these risky customers is a huge opportunity for QWE
because it allows the company to understand the specific
forces that lead to churn. With that information, QWE can
address the problem early, knowing that a customer is likely
to terminate their contract before that customer has even
decided to do so.
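
Once every customer has a predicted probability, building such a list is a simple ranking step. A minimal sketch follows, assuming the predictions are held in a dictionary keyed by customer ID; the helper name `top_risk` and the IDs and probabilities are hypothetical, not values from the QWE dataset.

```python
import heapq


def top_risk(churn_probability, n=10):
    """Return the n (customer_id, probability) pairs with the highest predicted churn."""
    return heapq.nlargest(n, churn_probability.items(), key=lambda item: item[1])


# Hypothetical predictions for illustration only.
predicted = {101: 0.031, 102: 0.224, 103: 0.087, 104: 0.152, 105: 0.046}

for customer_id, probability in top_risk(predicted, n=3):
    print(f"customer {customer_id}: {probability:.1%}")
```

Because the ranking depends only on the order of the probabilities, it remains useful even when the absolute values are low.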
The upgraded logistic regression model also showed a
clear improvement on the ROC curve. The AUC of 0.634 is
higher than that of any single variable, and every part
of the curve lies above the diagonal. We find this method
to be more suitable for QWE’s prediction needs. It is
not one factor but a combination of factors that leads
to churn, and the model that predicts it must reflect
that nuance.
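[Figure: 10 Risky Customers. Possibility of churn by customer ID.]
Customer IDs as extracted: 1971, 2076, 1287, 3671, 1929, 4245, 1236, 1616, 2546, 109
Possibility of churn, highest to lowest: 22%, 21%, 19%, 18%, 17%, 16%, 16%, 16%, 16%, 16%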

Precision 19.25%
Accuracy 87.96%
TPR 42.72%
FPR 9.61%
Decision Tree Analysis
In order to give QWE the most thorough recommendations possible, we further modeled the data with a second
approach. Decision trees are a visual way to predict whether a customer will churn, generating
a clear, specific path of rules that can easily be understood. The following image is the decision tree we
obtained from R using the QWE case data. After manually going through this decision process for customers 354, 672,
and 5203 one by one, we predict that these three customers won’t leave.
This method is useful for showing which variables are key influencers of churn and for partitioning the data at meaningful
break points. The higher the position of a variable (the node) in the tree, the more important the
variable. Both the decision tree and logistic regression pick Days Since Last Login as the best predictor. Furthermore,
we evaluated the performance of the decision tree method using different accuracy metrics. Although the TPR shows only
average performance, the Accuracy (88%) is much higher than that of logistic regression. However, this is in the nature of
decision trees: they tend to fit the training dataset as closely as possible, so the training accuracy can even reach
100%. That is to say, when given new customers, the model may do a poor job of prediction. Moreover, a decision tree is
extremely sensitive to small changes in the dataset: the structure of the tree changes correspondingly. In practice, such
changes are likely because some customers may edit their profiles and change some information. In contrast to these
two downsides of the decision tree approach, logistic regression can be tailored to particular business circumstances. In
this case, a different cutoff point can be set depending on how the manager weighs the cost of losing a customer against
the cost of retaining a customer. In conclusion, we recommend that QWE Inc. adopt the logistic regression approach.
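
The cutoff choice described above can be sketched as a small cost calculation. This is an illustration under simplifying assumptions that are not part of the report: every customer scored at or above the cutoff receives a retention offer at a fixed cost and then stays, and every churner below the cutoff is lost at a fixed cost. The function names, cost figures, labels and scores are all hypothetical.

```python
def total_cost(labels, scores, cutoff, cost_lost_customer, cost_retention_offer):
    """Cost of contacting every customer scored at or above the cutoff.

    Simplifying assumptions (hypothetical): a contacted customer always stays,
    and an uncontacted churner is always lost.
    """
    cost = 0.0
    for label, score in zip(labels, scores):
        if score >= cutoff:
            cost += cost_retention_offer  # offer sent, whether needed or not
        elif label == 1:
            cost += cost_lost_customer    # churner missed by the cutoff
    return cost


def best_cutoff(labels, scores, cost_lost_customer, cost_retention_offer):
    """Pick the cutoff with the lowest total cost among the observed scores."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf means contact nobody
    return min(
        candidates,
        key=lambda c: total_cost(
            labels, scores, c, cost_lost_customer, cost_retention_offer
        ),
    )


# Hypothetical toy data, not taken from the QWE dataset.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.22, 0.18, 0.16, 0.09, 0.07, 0.05, 0.04, 0.03]

print("lost customer is expensive:", best_cutoff(labels, scores, 500, 50))
print("lost customer is cheap:    ", best_cutoff(labels, scores, 60, 50))
```

In this toy example, the more a lost customer costs relative to a retention offer, the lower the chosen cutoff, so more customers are contacted.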

Factors                      Change in this factor    How possibility of churn will be affected
CHI Month 0
CHI 0-1
Days Since Last Login 0-1
Recommendation:
Based on our analysis, we recommend that QWE treat
Customer Happiness Index in December, CHI
change from November to December, and the difference in
Days Since Last Login as the three most important drivers
for predicting churn. Focusing on these variables will
allow QWE to concentrate on the customers who are in the
highest danger of churning and to identify points at which
its business might fail and these customers might
leave. This knowledge can be applied to strategy in all
areas of the business: marketing, product management,
etc. The models we created will help QWE tighten up
its business and better understand its customers and
their behavior. Specific examples of strategy include the
creation of a customer service outreach program in which
QWE targets the ten highest-risk customers and sends
service representatives to engage with them and offer
them incentives to stay with the company.
Through logistic regression, we found a specific
association between these three factors and the possibility
of churn:
Using the knowledge about these three priority variables, we
have devised the following recommendations for QWE in
terms of business operation:
Enhance user experience to increase Customer
Happiness Index. To achieve this goal, QWE can take actions
such as making the user interface friendlier and improving
loading speed.
Increase user engagement and interaction to
improve customer login recency. It’s critical to maintain
users’ level of activity on the platform. There is a clear
relationship between churn and how active a customer is on
the site, in terms of both content creation and simple
volume of activity. For example, QWE can use better calls
to action to encourage traffic. In addition, making the
service more mobile-friendly would help increase frequency
of use as well.
