Churn Prediction on customer data

Problem
´ Explore the dataset to determine its quality and specify any data quality
issues identified. Apply the pre-processing and visualisation techniques used
in completing the Day 2 group work.
´ Describe how you dealt with the data quality issues encountered.
´ Modify the R code provided to fit a logistic regression model to predict
churn.
´ Identify the variables that contribute most to predicting churn.
´ What business insights can be derived from the analysis?

Preprocessing Data
Data cleaning:
´ Dropped four unwanted columns.
´ Converted categorical variables to numerical variables.
´ No NULL records were found.
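
The three cleaning steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy rows, the column names and the choice of `customerID` as a dropped column are hypothetical, since the slides do not name the four columns that were removed.

```python
# Toy rows standing in for the customer data; column names and values are
# illustrative assumptions, not the actual dataset.
rows = [
    {"customerID": "A1", "gender": "Female", "Partner": "Yes", "tenure": 12, "Churn": "No"},
    {"customerID": "B2", "gender": "Male", "Partner": "No", "tenure": 3, "Churn": "Yes"},
    {"customerID": "C3", "gender": "Male", "Partner": "Yes", "tenure": 40, "Churn": "No"},
]

# Hypothetical: the slides do not say which columns were dropped.
DROP_COLUMNS = {"customerID"}


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def count_missing(rows):
    """Count None or empty-string values per column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts.setdefault(key, 0)
            if value is None or value == "":
                counts[key] += 1
    return counts


def encode_categoricals(rows):
    """Map each distinct string value in a column to an integer code."""
    encoded = [dict(row) for row in rows]
    mappings = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        if all(isinstance(v, str) for v in values):
            # Codes follow the sorted order of the labels.
            mapping = {label: code for code, label in enumerate(sorted(set(values)))}
            mappings[col] = mapping
            for row in encoded:
                row[col] = mapping[row[col]]
    return encoded, mappings


cleaned = drop_columns(rows, DROP_COLUMNS)
missing = count_missing(cleaned)
encoded, mappings = encode_categoricals(cleaned)

print(missing)
print(mappings)
print(encoded[0])
```

Integer codes are adequate for two-level variables such as Yes/No. For variables with more than two levels, dummy variables are usually the safer choice, because integer codes impose an ordering on the categories.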
Exploratory Data Analysis:
´ Plots of the numerical variables show a normal distribution.

´ The gender count comparison for churn = 0 and churn = 1 shows that the male
and female counts were each around 2,500 for churn = 0, compared with fewer
than 1,000 each for churn = 1. This suggests that gender can be a good
predictor variable for predicting churn.
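
Raw counts per churn class also reflect how many customers fall into each class overall, so one way to check a categorical predictor is to compare the churn rate within each category. The sketch below does this for a handful of made-up (gender, churn) pairs; the values are illustrative only and are not taken from the dataset.

```python
from collections import Counter

# Illustrative (gender, churn) pairs; made-up values, not the real data.
records = [
    ("Female", 0), ("Female", 0), ("Female", 1),
    ("Male", 0), ("Male", 0), ("Male", 1),
]


def churn_rate_by_group(records):
    """Return the share of churn = 1 within each group."""
    totals = Counter(group for group, _ in records)
    churned = Counter(group for group, churn in records if churn == 1)
    return {group: churned[group] / totals[group] for group in totals}


counts = Counter(records)              # count per (gender, churn) cell
rates = churn_rate_by_group(records)   # churn rate per gender

print(counts)
print(rates)
```

A category whose churn rate differs clearly from the others carries information about churn; similar rates across categories suggest a weak predictor.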
[Figure: Gender count, churn-0 vs churn-1. x-axis: churn (0, 1); y-axis: count; series: Female, Male.]

Gender & Partner Count
[Figure, two panels: count by churn (0, 1). Panel 1: gender (Female, Male). Panel 2: Partner (No, Yes).]

Dependent & PhoneService Count
[Figure, two panels: count by churn (0, 1). Panel 1: Dependent (No, Yes). Panel 2: PhoneService (No, Yes).]

MultipleLines & InternetService count
[Figure, two panels: count by churn (0, 1). Panel 1: MultipleLines (No, No phone service, Yes). Panel 2: InternetService (DSL, Fiber optic, No).]

OnlineSecurity & OnlineBackup
[Figure, two panels: count by churn (0, 1). Panel 1: OnlineSecurity (No, No internet service, Yes). Panel 2: OnlineBackup (No, No internet service, Yes).]

Payment Method and PaperlessBilling count
[Figure, two panels: count by churn (0, 1). Panel 1: payment method (Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check). Panel 2: PaperlessBilling (No, Yes).]

Contract & StreamingMovie Count
[Figure, two panels: count by churn (0, 1). Panel 1: Contract (Month-to-month, One year, Two year). Panel 2: StreamingMovie (No, No internet service, Yes).]

StreamingTV & TechSupport Count
[Figure, two panels: count by churn (0, 1). Panel 1: StreamingTV (No, No internet service, Yes). Panel 2: TechSupport (No, No internet service, Yes).]

Logistic Regression Model
´ A binary LR model works with a categorical (qualitative) dependent variable
that can take two values, e.g. Yes/No.
´ Multinomial LR models handle dependent variables with three or more possible
categories.
´ LR models estimate the probability that an event occurs by maximizing a
log-likelihood function, rather than by the ordinary least squares method used
by linear regression models.
´ A binary logistic regression model estimates the probability of occurrence of
the dependent variable Y, which takes a dichotomous form (0/1).
´ $z = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$
´ $z$ is known as the logit; $\alpha, \beta_1, \beta_2, \ldots, \beta_n$ are the
estimated parameters for the explanatory variables $X_1, X_2, \ldots, X_n$.
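
As a small worked example of the logit formula above, the following sketch computes z for one observation. The parameter values and predictor values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def logit_score(alpha, betas, xs):
    """Return z = alpha + sum(beta_j * x_j) for one observation."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    return alpha + sum(b * x for b, x in zip(betas, xs))


# Hypothetical parameters and one observation with two predictors:
# z = -1.0 + 0.5 * 2.0 + 2.0 * 0.25
z = logit_score(alpha=-1.0, betas=[0.5, 2.0], xs=[2.0, 0.25])
print(z)
```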

LR model
´ Binary logistic regression defines the logit $z_i$ as the natural logarithm of
the odds, such that:
´ $\ln\left(\frac{p_i}{1 - p_i}\right) = z_i$
´ $p_i = \frac{1}{1 + e^{-z_i}}$, or $p = f(z)$
´ Substitute categorical variables with dummy variables, then maximize the
log-likelihood function, whose term for observation $i$ is
$ll_i = y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i)$.
´ Starting from $\alpha = \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_n = 0$,
calculate $z_i$, $p_i$ and $ll_i$.
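
The starting calculation can be sketched as follows, assuming a tiny made-up dataset with two dummy-coded predictors. With every parameter at zero, each z is 0, each p is 0.5, and each log-likelihood term is ln(0.5).

```python
import math


def sigmoid(z):
    """p = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood_terms(alpha, betas, X, y):
    """Return y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i) for every observation."""
    terms = []
    for xs, yi in zip(X, y):
        z = alpha + sum(b * x for b, x in zip(betas, xs))
        p = sigmoid(z)
        terms.append(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return terms


# Made-up data: two dummy-coded predictors, four observations.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

# All parameters at zero, as in the starting step.
terms = log_likelihood_terms(0.0, [0.0, 0.0], X, y)
total = sum(terms)

print(terms[0])  # equals ln(0.5)
print(total)     # equals 4 * ln(0.5)
```

The sum of these terms is the quantity that is maximized when the parameters are estimated.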

Logistic Regression
´ Cells M3, M5, M7 and M9 contain the respective parameter estimates; use Solver
to maximize the sum of $ll_i$ over the data file, recalculating $z_i$, $p_i$ and
$ll_i$.
´ Unlike traditional regression models estimated by ordinary least squares,
there is no R^2 (percentage of variance explained by the predictor variables).
A more adequate criterion for choosing the best model is the ROC (receiver
operating characteristic) curve.
´ The $\chi^2$ test is used to verify the significance of the model; its
hypotheses are:
´ H0: $\beta_1 = \beta_2 = \cdots = \beta_n = 0$
´ H1: there is at least one $\beta_j \neq 0$
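
The sketch below imitates the maximization step and the significance test on made-up data with one predictor. Two things here are assumptions rather than the method used in the slides: plain gradient ascent stands in for the spreadsheet Solver, and the test is written in its common likelihood-ratio form, comparing the fitted model with an intercept-only model. The p-value formula used is valid only for one degree of freedom.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(alpha, beta, xs, ys):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(alpha + beta * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def null_log_likelihood(ys):
    """Log-likelihood of the intercept-only model, where p is the mean of y."""
    p = sum(ys) / len(ys)
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)


def fit(xs, ys, learning_rate=0.1, iterations=20000):
    """Maximize the mean log-likelihood by gradient ascent, starting at zero."""
    alpha, beta = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - sigmoid(alpha + beta * x) for x, y in zip(xs, ys)]
        alpha += learning_rate * sum(residuals) / n
        beta += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return alpha, beta


# Made-up, non-separable data with one predictor.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

alpha, beta = fit(xs, ys)
ll_fit = log_likelihood(alpha, beta, xs, ys)
ll_null = null_log_likelihood(ys)

# Likelihood-ratio chi-square statistic with 1 degree of freedom (one beta).
lr_stat = 2.0 * (ll_fit - ll_null)
# For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(alpha, beta)
print(lr_stat, p_value)
```

A small p-value leads to rejecting H0, that is, to concluding that at least one coefficient differs from zero.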

Logistic Regression
´ Confusion matrix based on a cutoff of 0.5, giving the number of FP, TP, FN and
TN.
´ OME (overall model efficiency) = (TP + TN) / total observations
´ Sensitivity = % of hits, for a given cutoff, among observations that are in
fact events, i.e. TP / (TP + FN).
´ Specificity = % of hits, for a given cutoff, among observations that are not
events, i.e. TN / (TN + FP).
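
These definitions can be illustrated with a short sketch. The labels and predicted probabilities are made up, and treating a probability equal to the cutoff as a predicted event is an assumption of this example.

```python
def confusion_counts(y_true, probs, cutoff=0.5):
    """Return (TP, FP, TN, FN) for a given probability cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= cutoff else 0  # assumption: ties count as events
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]

tp, fp, tn, fn = confusion_counts(y_true, probs, cutoff=0.5)
ome = (tp + tn) / (tp + fp + tn + fn)   # overall model efficiency
sensitivity = tp / (tp + fn)            # hits among actual events
specificity = tn / (tn + fp)            # hits among actual non-events

print(tp, fp, tn, fn)
print(ome, sensitivity, specificity)
```

Changing the cutoff trades sensitivity against specificity, which is the trade-off the ROC curve traces out.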

LR Results
´ Overall Accuracy: 0.80
´ Avg_AUC: 0.84
´ Avg_F1: 0.58
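
For reference, a single F1 score and a single AUC can be computed as in the sketch below, using made-up labels and scores. How the averaged figures above were obtained (for example over folds or over classes) is not stated in the slides, so the averaging step is left out.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)


def auc(y_true, scores):
    """AUC as the share of (event, non-event) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1 = f1_score(y_true, y_pred)
area = auc(y_true, scores)

print(f1)
print(area)
```

AUC does not depend on a cutoff, while F1 does, which is one reason a model can show a fairly high AUC alongside a more modest F1.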
