Statistical Analysis, Pre-processing, Exploratory Analysis, Pivot Tables, charts and graphical representation of Information to improve business decision making.
Churn Prediction on customer data
1. Problem
´ Explore the dataset to determine its quality and specify any data quality
issues identified. Apply the pre-processing and visualisation techniques used
in completing the Day 2 Group work.
´ Describe how you dealt with the data quality issues encountered.
´ Modify the R code provided to fit a logistic regression model to predict
churn
´ Identify the variables that contribute most to predicting churn
´ What business insights can be derived from the analysis?
2. Preprocessing Data
Data cleaning:
´ Dropped four unwanted columns.
´ Converted categorical variables to numerical variables.
´ No NULL records were found.
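The cleaning steps above can be sketched in plain Python. This is only an illustration: the toy rows, the choice of `customerID` as a dropped column, and the helper names are assumptions, since the slides do not say which four columns were removed or how the encoding was done.

```python
# Toy rows standing in for the churn data; column names and values are
# illustrative assumptions, not the real file.
rows = [
    {"customerID": "A1", "gender": "Female", "Contract": "Month-to-month", "Churn": "Yes"},
    {"customerID": "B2", "gender": "Male", "Contract": "Two year", "Churn": "No"},
    {"customerID": "C3", "gender": "Male", "Contract": "One year", "Churn": "No"},
]

UNWANTED = {"customerID"}  # hypothetical: the slides do not name the dropped columns


def drop_columns(rows, unwanted):
    """Return copies of the rows without the unwanted columns."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def count_missing(rows):
    """Count cells that are None or empty strings."""
    return sum(1 for row in rows for v in row.values() if v is None or v == "")


def encode_categoricals(rows):
    """Map each distinct value in a column to an integer code (sorted order)."""
    columns = list(rows[0].keys())
    codes = {
        col: {value: i for i, value in enumerate(sorted({row[col] for row in rows}))}
        for col in columns
    }
    encoded = [{col: codes[col][row[col]] for col in columns} for row in rows]
    return encoded, codes


cleaned = drop_columns(rows, UNWANTED)
missing = count_missing(cleaned)
encoded, codes = encode_categoricals(cleaned)
print(missing)
print(encoded[0])
```

Integer codes impose an arbitrary order on nominal categories such as the contract type, so for modelling it is usually safer to expand them into 0/1 dummy variables.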
Exploratory Data Analysis:
´ Graphs of the numerical variables show a normal distribution.
3. Gender count comparison
´ Comparing gender counts for churn = 0 and churn = 1, the male and female
counts were both high, at around 2500, for churn = 0, compared with fewer than
1000 for churn = 1. This suggests gender can be a good predictor variable for
churn.
[Figure: Gender count, churn-0 vs churn-1. Bar chart of counts by churn (0, 1); series: Female, Male.]
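The counts behind a chart like this can be tabulated directly. The sketch below uses made-up (gender, churn) pairs, not the real data, and also computes the churn rate within each gender as a complementary view of the same counts; `churn_rate` is a hypothetical helper.

```python
from collections import Counter

# Illustrative (gender, churn) pairs; the real counts come from the data file.
records = [
    ("Female", 0), ("Male", 0), ("Female", 0), ("Male", 0),
    ("Female", 1), ("Male", 1),
]

counts = Counter(records)  # keys are (gender, churn) pairs


def churn_rate(gender):
    """Share of customers of this gender with churn = 1."""
    churned = counts[(gender, 1)]
    total = churned + counts[(gender, 0)]
    return churned / total


for gender in ("Female", "Male"):
    print(gender, counts[(gender, 0)], counts[(gender, 1)], round(churn_rate(gender), 3))
```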
8. Payment Method and PaperlessBilling count
[Figure: PaymentMethod count by churn (0, 1); series: Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check.]
[Figure: PaperlessBilling count by churn (0, 1); series: No, Yes.]
9. Contract & StreamingMovie Count
[Figure: Contract count by churn (0, 1); series: Month-to-month, One year, Two year.]
[Figure: StreamingMovie count by churn (0, 1); series: No, No internet service, Yes.]
10. StreamingTV & TechSupport Count
[Figure: StreamingTV count by churn (0, 1); series: No, No internet service, Yes.]
[Figure: TechSupport count by churn (0, 1); series: No, No internet service, Yes.]
11. Logistic Regression Model
´ A binary LR model applies when the dependent variable is categorical
(qualitative) and can take two values, e.g. Yes/No.
´ Multinomial LR models handle dependent variables with three or more possible
categories.
´ LR models estimate the probability of an event by maximizing a log-likelihood
function, not by the least squares method used by linear regression models.
´ A binary logistic regression model estimates the probability of occurrence of
the dependent variable Y, which is dichotomous (0/1).
´ z = α + β1X1 + β2X2 + ⋯ + βnXn
´ z is known as the logit; α, β1, β2, …, βn are the estimated parameters for the
explanatory variables X1, X2, …, Xn.
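As a small worked example of the logit, the sketch below evaluates z for a single observation. The parameter values and the two explanatory variables are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def logit_score(alpha, betas, xs):
    """Linear predictor z = alpha + sum(beta_j * x_j)."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    return alpha + sum(b * x for b, x in zip(betas, xs))


# Hypothetical parameter values and one observation with two explanatory variables.
alpha = -1.0
betas = [0.5, 2.0]
xs = [3.0, 1.0]
z = logit_score(alpha, betas, xs)
print(z)  # -1.0 + 0.5*3.0 + 2.0*1.0 = 2.5
```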
12. LR model
´ Binary logistic regression defines the logit z as the natural logarithm of the
odds, such that:
´ ln(p_i / (1 − p_i)) = z_i
´ p_i = 1 / (1 + e^(−z_i)), i.e. p = f(z)
´ Substitute categorical variables with dummy variables, then maximize the
log-likelihood function, the sum over all observations of
ll_i = y_i·ln(p_i) + (1 − y_i)·ln(1 − p_i).
Starting from α = β1 = β2 = β3 = … = βn = 0, calculate z_i, p_i and ll_i.
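The starting point of that calculation can be sketched as follows. The contract values and churn labels are made up for illustration, and the single dummy variable (1 for a month-to-month contract, 0 otherwise) is an assumed encoding, not necessarily the one used in the original workbook.

```python
import math


def sigmoid(z):
    """p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(alpha, betas, X, y):
    """Sum of ll_i = y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i) over all observations."""
    total = 0.0
    for xs, yi in zip(X, y):
        z = alpha + sum(b * x for b, x in zip(betas, xs))
        p = sigmoid(z)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


# Hypothetical data: one dummy variable (1 = month-to-month contract, 0 = otherwise).
contracts = ["Month-to-month", "Two year", "Month-to-month", "One year"]
X = [[1.0 if c == "Month-to-month" else 0.0] for c in contracts]
y = [1, 0, 1, 0]

# With all parameters at zero, every z_i is 0 and every p_i is 0.5.
ll_start = log_likelihood(0.0, [0.0], X, y)
print(ll_start)
```

With every parameter at zero the log-likelihood equals the number of observations times ln(0.5), which gives a baseline that any fitted model should improve on.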
13. Logistic Regression
´ Cells M3, M5, M7 and M9 contain the respective estimated parameters. Use Solver
to maximize the sum of ll_i, computed from z_i and p_i for each row of the data
file.
´ Unlike traditional regression models estimated by least squares, there is no
R^2 (percentage of variance explained by the predictor variables). A more
adequate criterion for choosing the best model is the ROC (receiver operating
characteristic) curve.
´ A χ² test is used to verify the significance of the model. Its hypotheses are:
´ H0: β1 = β2 = … = βn = 0
´ H1: there is at least one βi ≠ 0
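The estimation and the significance test can be sketched outside a spreadsheet. The example below swaps Solver for plain gradient ascent and assumes the χ² test meant here is the likelihood-ratio form, 2·(LL_model − LL_null), with one degree of freedom because there is a single β. The data, learning rate and step count are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(alpha, beta, xs, ys):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p) for a one-predictor model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(alpha + beta * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, use_x=True, rate=0.1, steps=5000):
    """Maximize the log-likelihood by gradient ascent from alpha = beta = 0."""
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - sigmoid(alpha + beta * x) for x, y in zip(xs, ys)]
        alpha += rate * sum(residuals)  # d(LL)/d(alpha) = sum(y_i - p_i)
        if use_x:
            # d(LL)/d(beta) = sum((y_i - p_i) * x_i)
            beta += rate * sum(r * x for r, x in zip(residuals, xs))
    return alpha, beta


# Made-up data: one 0/1 predictor and a 0/1 outcome.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

alpha, beta = fit(xs, ys)                # full model
alpha0, _ = fit(xs, ys, use_x=False)     # null model: intercept only (H0)
ll_model = log_likelihood(alpha, beta, xs, ys)
ll_null = log_likelihood(alpha0, 0.0, xs, ys)
chi2 = 2.0 * (ll_model - ll_null)
# For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))
print(round(alpha, 4), round(beta, 4), round(chi2, 4), round(p_value, 4))
```

A small p-value would lead to rejecting H0. With more than one β the degrees of freedom change and the tail probability needs a general chi-square distribution function rather than the one-line `erfc` shortcut.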
14. Logistic Regression
´ Confusion matrix based on cutoff = 0.5, giving the number of FP, TP, FN and TN.
´ OME (overall model efficiency) = (TP + TN) / total number of observations.
´ Sensitivity = % of hits, for a given cutoff, among observations that are in
fact events: TP / (TP + FN).
´ Specificity = % of hits, for a given cutoff, among observations that are not
events: TN / (TN + FP).
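These measures can be computed from predicted probabilities as in the sketch below. The probabilities and actual labels are made up, and treating a probability equal to the cutoff as a predicted event is an assumed convention.

```python
def confusion_counts(probabilities, actuals, cutoff=0.5):
    """Return (TP, FP, TN, FN), classifying p >= cutoff as an event."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, actuals):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up predicted probabilities and actual churn labels.
probabilities = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4, 0.05, 0.35]
actuals = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(probabilities, actuals)
ome = (tp + tn) / (tp + fp + tn + fn)   # overall model efficiency
sensitivity = tp / (tp + fn)            # hits among actual events
specificity = tn / (tn + fp)            # hits among actual non-events
print(tp, fp, tn, fn)
print(ome, sensitivity, round(specificity, 4))
```

Lowering the cutoff classifies more observations as events, which tends to raise sensitivity and lower specificity. Sweeping the cutoff across its range traces out the ROC curve.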