Andrew Rogala STAT 481 Student Project
Research Goal and Data:
The goal of the analysis is to develop a statistical model to predict the default status (Yes, No) of
credit card customers given the predictors income, balance, and student status (Yes, No).
Additionally, it would be advantageous to take the viewpoint of a credit card company and
produce a model with a high true positive rate, a low false positive rate, and a high detection rate.
To accomplish this, a low probability threshold is needed. The data used for this analysis is the
Default data set in the ISLR library. It is a simulated data set containing information on ten
thousand customers.
Response Variable:
The response variable is default status (Yes or No).
Predictor Variables:
The predictor variables are annual income (in dollars), the average balance (in dollars) that the
customer has remaining on their credit card after making their monthly payment, and student
status (Yes or No).
Statistical Methods:
Multiple logistic regression and the validation set approach will be used. The validation set
approach will gauge how well the model will perform on a new set of data.
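As a minimal sketch of the validation set approach, the base R code below splits a data frame into two random halves. The helper name `split_half` and the small `toy` data frame are assumptions made for illustration; they are not part of the original analysis, which uses its own split of the Default data.

```r
# Hypothetical helper: split a data frame into two equal-sized random halves.
split_half <- function(df, seed = 1) {
  set.seed(seed)
  n <- nrow(df)
  train_idx <- sample(n, size = floor(n / 2))  # rows used for fitting
  list(training = df[train_idx, , drop = FALSE],
       holdout  = df[-train_idx, , drop = FALSE])
}

# Simulated stand-in data (not the Default data set).
toy <- data.frame(id = 1:10, x = seq(0.5, 5, by = 0.5))
halves <- split_half(toy, seed = 23)
nrow(halves$training)  # 5
nrow(halves$holdout)   # 5
```

The model is fit on `training` only, and every reported error rate is computed on `holdout`, so the estimate reflects data the model has not seen.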
Summary Statistics for the Default data set:
 default      student      balance             income
 No : 9667    No : 7056    Min.    :    0.0    Min.    :   772
 Yes:  333    Yes: 2944    1st Qu. :  481.7    1st Qu. : 21340
                           Median  :  823.6    Median  : 34553
                           Mean    :  835.4    Mean    : 33517
                           3rd Qu. : 1166.3    3rd Qu. : 43808
                           Max.    : 2654.3    Max.    : 73554
P(default = Yes) = 333/10,000 = 0.0333
P(default = No) = 9667/10,000 = 0.9667
                  (student)
                  No      Yes
(default)  No     6850    2817
           Yes     206     127
P(default = Yes given student = Yes) = 127/(2817+127) = 0.04313859
P(default = Yes given student = No) = 206/(6850+206) = 0.02919501
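The conditional probabilities above can be reproduced from the table counts alone. The sketch below is base R only; the object names `counts` and `cond` are illustrative assumptions.

```r
# Counts copied from the default-by-student table above.
counts <- matrix(c(6850, 206, 2817, 127), nrow = 2,
                 dimnames = list(default = c("No", "Yes"),
                                 student = c("No", "Yes")))

# Column-wise proportions: P(default | student status).
cond <- prop.table(counts, margin = 2)
cond["Yes", "Yes"]  # P(default = Yes given student = Yes)
cond["Yes", "No"]   # P(default = Yes given student = No)

# Marginal default rate across all 10,000 customers.
sum(counts["Yes", ]) / sum(counts)
```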
[Figure: students in red, non-students in blue. Plot taken from “An Introduction to Statistical Learning”, page 137.]
Box Plots:
From the left box plot below, it appears that individuals who defaulted tended to have much
higher credit card balances. This suggests a strong association between the predictor balance
and the response default. From the right box plot below, it appears that individuals who
defaulted tended to have a slightly lower median income. Thus, the predictor income is only
weakly associated with the response default. Next, I will draw box plots of balance by student
status and of income by student status to see whether there is any collinearity between these
predictor variables.
[Figure: two box plots by Default (No, Yes). Left: Credit Card Balance. Right: Income.]
Analysis of the left box plot below shows that students tend to have slightly higher credit card
balances than non-students; thus student and balance are correlated. Analysis of the right
box plot below shows that students tend to have much lower incomes than non-students; thus
student and income are correlated. Because of the collinearity between these predictor
variables, I will initially leave the student variable out of the logistic regression and fit a
model for default with only income and balance as predictors. Later I will fit a model with all
three predictors and see which model produces better results.
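The 1st fit model on the training data is:
\[
\log\left(\frac{p(\text{default})}{1 - p(\text{default})}\right) = -11.27 + 0.00001788\,(\text{income}) + 0.005538\,(\text{balance})
\]
\[
p(\text{default}) = \frac{1}{1 + e^{-\left(-11.27 + 0.00001788\,(\text{income}) + 0.005538\,(\text{balance})\right)}}
\]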
P-value for income = 0.0117
P-value for balance < 2 × 10^-16
Thus, at the 0.05 level, both income and balance are significant predictors of default.
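To make the fitted equation concrete, the sketch below evaluates it for one customer using the rounded coefficients reported for the first model. The helper name `p_default1` and the example income and balance are assumptions chosen for illustration, not values taken from the data.

```r
# Coefficients as reported for the first fitted model.
b0 <- -11.27; b_income <- 0.00001788; b_balance <- 0.005538

# Hypothetical helper: predicted probability of default for model 1.
p_default1 <- function(income, balance) {
  eta <- b0 + b_income * income + b_balance * balance  # log odds
  1 / (1 + exp(-eta))                                  # same as plogis(eta)
}

# Illustrative customer (values chosen for the example, not from the data).
p <- p_default1(income = 40000, balance = 1500)
p
p > 0.03540053  # compare with the probability threshold reported for model 1
```

Because the coefficients are rounded, the result will differ slightly from what `predict()` on the fitted `glm` object would return.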
[Figure: two box plots by Student Status (No, Yes). Left: Credit Card Balance. Right: Income.]
Area under the curve = 0.9493
Probability threshold = 0.03540053
Confusion Matrix and Statistics
                    (Observed)
                    No      Yes
(Predicted)  No     4188     17
             Yes     644    151
% Correctly Classified = Accuracy = 0.8678
True Positive Rate = Sensitivity = 0.8988 = P(predict default | default)
True Negative Rate = Specificity = 0.8667 = P(predict no default | no default)
False Positive Rate = (1 - Specificity) = 0.1333
Prevalence = 0.0336
Detection Rate = 151/5000 = 0.0302
Detection Prevalence = (644+151)/5000 = 0.1590
%Misclassified = Test Error = (644+17)/5000 = 0.1322 = 13.22%
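The rates listed above all follow from the four cells of the confusion matrix. The sketch below recomputes them in base R; the helper name `classification_rates` and its argument names are assumptions for illustration, with "Yes" (default) treated as the positive class.

```r
# Hypothetical helper: summary rates from 2x2 confusion-matrix counts,
# treating "Yes" (default) as the positive class.
classification_rates <- function(tn, fn, fp, tp) {
  n <- tn + fn + fp + tp
  c(accuracy             = (tp + tn) / n,
    sensitivity          = tp / (tp + fn),
    specificity          = tn / (tn + fp),
    false_positive_rate  = fp / (tn + fp),
    prevalence           = (tp + fn) / n,
    detection_rate       = tp / n,
    detection_prevalence = (tp + fp) / n,
    test_error           = (fp + fn) / n)
}

# First model on the holdout set: tn = predicted No/observed No, and so on.
m1 <- classification_rates(tn = 4188, fn = 17, fp = 644, tp = 151)
round(m1, 4)
```

The same helper applies to any other 2x2 table of holdout counts.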
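The 2nd fit model on the training data is:
\[
\log\left(\frac{p(\text{default})}{1 - p(\text{default})}\right) = -10.64 + 0.000001296\,(\text{income}) + 0.005615\,(\text{balance}) - 0.5947\,(\text{student})
\]
\[
p(\text{default}) = \frac{1}{1 + e^{-\left(-10.64 + 0.000001296\,(\text{income}) + 0.005615\,(\text{balance}) - 0.5947\,(\text{student})\right)}}
\]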
[Figure: ROC Curve 1st model. Horizontal axis: False positive rate. Vertical axis: True positive rate.]
P-value for income = 0.9112
P-value for balance < 2 × 10^-16
P-value for student = 0.0708
Thus, at the 0.05 level, balance is the only significant predictor of default in this model. However,
with a larger alpha such as 0.10, student would be considered a significant predictor as well.
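These p-values are two-sided Wald tests: each estimate is divided by its standard error and compared with a standard normal distribution. The sketch below recomputes them from the estimates and standard errors printed in the R output for the second model; the vector names are illustrative assumptions.

```r
# Estimates and standard errors as reported for the second model.
est <- c(balance = 5.615e-03, income = 1.296e-06, studentYes = -5.947e-01)
se  <- c(balance = 3.213e-04, income = 1.161e-05, studentYes = 3.292e-01)

z <- est / se               # Wald z statistics
p <- 2 * pnorm(-abs(z))     # two-sided p-values under a standard normal
round(z, 3)
signif(p, 3)
```

Small differences from the printed values are expected because the inputs are rounded.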
Area under the curve = 0.9503
Probability threshold = 0.03197311
Confusion Matrix and Statistics
                    (Observed)
                    No      Yes
(Predicted)  No     4160     16
             Yes     672    152
% Correctly Classified = Accuracy = 0.8624
True Positive Rate = Sensitivity = 0.9048 = P(predict default | default)
True Negative Rate = Specificity = 0.8609 = P(predict no default | no default)
False Positive Rate = (1 - Specificity) = 0.1391
Prevalence = 0.0336
Detection Rate = 152/5000 = 0.0304
Detection Prevalence = (672+152)/5000 = 0.1648
%Misclassified = Test Error = (672+16)/5000 = 0.1376 = 13.76%
[Figure: ROC Curve 2nd model. Horizontal axis: False positive rate. Vertical axis: True positive rate.]
Interpretation of Results:
Overall, without taking income and balance into consideration, students have a higher
probability of default (0.0431) than non-students (0.0292). Thus, if nothing is known about a
customer’s income or credit card balance, students are a riskier population. However, the plot
taken from “An Introduction to Statistical Learning” (students in red, non-students in blue)
shows that students with the same balance as non-students have a lower default rate.
The correlation between student and balance explains this paradox. Students tend to have higher
credit card balances than non-students (see the box plot of balance by student status), and
customers with higher balances are more likely to default (see the box plot of balance by
default status). Even though the student population is more likely to have higher credit card
balances, which tend to be associated with higher default rates, an individual student can still
have a lower probability of default than a non-student with the same income and balance. The
conclusion is that if no information is given about a customer’s balance and income, students
are riskier; however, a student is less risky than a non-student with the same balance and income.
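The last point can be checked directly from the second model's reported coefficients. The sketch below compares a student and a non-student with identical income and balance; the helper name `p_default2` and the example values are assumptions for illustration.

```r
# Coefficients as reported for the second fitted model.
b <- c(intercept = -10.64, income = 0.000001296,
       balance = 0.005615, student = -0.5947)

# Hypothetical helper: predicted probability of default for model 2.
# `student` is 1 for a student and 0 for a non-student.
p_default2 <- function(income, balance, student) {
  eta <- b[["intercept"]] + b[["income"]] * income +
         b[["balance"]] * balance + b[["student"]] * student
  plogis(eta)  # inverse logit
}

# Same illustrative balance and income, differing only in student status.
p_student     <- p_default2(income = 20000, balance = 1500, student = 1)
p_non_student <- p_default2(income = 20000, balance = 1500, student = 0)
c(student = p_student, non_student = p_non_student)
p_student < p_non_student  # TRUE
```

Because the student coefficient is negative, the comparison comes out the same way for any shared income and balance.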
The first model has a slightly smaller test error of 13.22%, as opposed to the second
model’s test error of 13.76%. In addition, the first model produced smaller p-values for its
predictors. However, with the addition of the student variable, the second model provides
additional information, namely the effect of student status at a given balance and income,
which justifies using this model as the main method for predicting default.
The second model has an area under the ROC curve of 0.9503, suggesting a good fit. It
also has a high true positive rate (0.9048), a reasonable false positive rate (0.1391), a
detection rate of 0.0304 (close to the prevalence of 0.0336, its maximum possible value), and
a test error of 13.76%. Also, out of 5000 predictions only 16 were predicted to be No and
observed to be Yes. In theory, if a credit card company acted on every predicted default, the
defaults it missed would amount to 16/5000 = 0.32% of customers, as compared to the observed
default rate of 3.36%.
Choosing a small probability threshold achieved a high true positive rate; however,
doing so causes the test error to increase. This sacrifice is worthwhile from the viewpoint of
a credit card company trying to reduce its default rate. The threshold can be changed to adapt
the model to the specific needs of its users.
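The trade-off can be illustrated with a small simulation. The sketch below uses simulated scores, not the Default data, and the helper `rates_at` is a hypothetical name; it compares a 0.5 threshold with a low one and then picks the threshold that maximizes Youden's J (true positive rate minus false positive rate), the criterion named in the R code for this report.

```r
# Simulated stand-in scores (NOT the Default data): 5000 cases, about 3% positives.
set.seed(1)
n <- 5000
y <- rbinom(n, size = 1, prob = 0.03)
# Positives tend to receive higher predicted probabilities than negatives.
score <- plogis(rnorm(n, mean = ifelse(y == 1, -1, -5), sd = 1.5))

# Hypothetical helper: rates obtained when predicting "Yes" above a threshold.
rates_at <- function(threshold) {
  pred <- score > threshold
  c(threshold = threshold,
    tpr   = sum(pred & y == 1) / sum(y == 1),
    fpr   = sum(pred & y == 0) / sum(y == 0),
    error = mean(pred != (y == 1)))
}

# A low threshold can only raise (never lower) both the TPR and the FPR.
rbind(rates_at(0.5), rates_at(0.03))

# Threshold on a coarse grid that maximizes Youden's J = TPR - FPR.
grid <- seq(0.01, 0.99, by = 0.01)
J <- sapply(grid, function(t) { r <- rates_at(t); r[["tpr"]] - r[["fpr"]] })
best_threshold <- grid[which.max(J)]
best_threshold
```

With rare positives, the J-maximizing threshold tends to sit well below 0.5, which is why a low threshold suits a lender that cares most about catching defaults.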
An interpretation of the second model is as follows. A one-unit (one dollar) increase in
balance is associated with an increase in the log odds of default of 0.005615 when all other
predictors are held constant. A one-unit (one dollar) increase in income is associated with an
increase in the log odds of default of 0.000001296 when all other predictors are held constant.
Finally, being a student is associated with a decrease in the log odds of default of 0.5947
when all other predictors are held constant.
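Log odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below does this for the second model's reported coefficients; the step sizes of $100 and $10,000 are illustrative choices, not part of the original analysis.

```r
# Log-odds coefficients as reported for the second model.
coefs <- c(balance = 0.005615, income = 0.000001296, student = -0.5947)

# Odds ratio for a one-unit change in each predictor.
odds_ratio_per_unit <- exp(coefs)

# Odds ratios for larger, easier-to-read changes (step sizes are illustrative).
or_balance_100  <- exp(coefs[["balance"]] * 100)    # +$100 of balance
or_income_10000 <- exp(coefs[["income"]] * 10000)   # +$10,000 of income
or_student      <- exp(coefs[["student"]])          # student vs non-student

round(c(balance_100 = or_balance_100,
        income_10000 = or_income_10000,
        student = or_student), 3)
```

An odds ratio above 1 means higher odds of default, and one below 1 means lower odds, with the other predictors held fixed.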
R Code and R Output:
> library(pROC)
> library(ROCR)
> library(mgcv)
> library(caret)
> library(e1071)
> library(ISLR)
> attach(Default)
> fix(Default)
> dim(Default)
[1] 10000 4
> ?Default
> summary(Default)
 default    student       balance           income     
 No :9667   No :7056   Min.   :   0.0   Min.   :  772  
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
                       Median : 823.6   Median :34553  
                       Mean   : 835.4   Mean   :33517  
                       3rd Qu.:1166.3   3rd Qu.:43808  
                       Max.   :2654.3   Max.   :73554  
> #P(Default = Yes)
> 333/10000
[1] 0.0333
> #P(Default = No)
> 9667/10000
[1] 0.9667
> #Some Conditional Probabilities
> table(Default$default,Default$student)
        No  Yes
  No  6850 2817
  Yes  206  127
> #P(default = Yes given student = Yes)
> 127/(2817+127)
[1] 0.04313859
> #P(default = Yes given student = No)
> 206/(6850+206)
[1] 0.02919501
> #Box Plots
> par(mfrow=c(1,2))
> plot(default, balance, xlab="Default", ylab="Credit Card Balance", col="red")
> plot(default, income, xlab="Default", ylab="Income", col="green")
> par(mfrow=c(1,2))
> plot(student,balance,xlab="Student Status",ylab="Credit Card Balance", col="red")
> plot(student,income,xlab="Student Status",ylab="Income", col="green")
> #Training and HoldOut Sets
> set.seed(23)
> ReSampleData = Default[sample(nrow(Default)),]
> Data.Set.Splits = cut(seq(1,nrow(ReSampleData)),breaks=2,labels=FALSE)
> tIndexes = which(Data.Set.Splits!=1,arr.ind=TRUE)
> Training.Set = ReSampleData[tIndexes, ]
> fix(Training.Set)
> HoldOut.Set = ReSampleData[-tIndexes,]
> fix(HoldOut.Set)
> #fit the 1st logistic regression on training data
> default.glm.training = glm(default~income + balance, family=binomial(link="logit"),data=Training.Set)
> summary(default.glm.training)
Call:
glm(formula = default ~ income + balance, family = binomial(link = "logit"),
data = Training.Set)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4201 -0.1489 -0.0604 -0.0231 3.6961
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.127e+01  6.000e-01 -18.778   <2e-16 ***
income       1.788e-05  7.088e-06   2.522   0.0117 *
balance      5.538e-03  3.162e-04  17.516   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1450.21 on 4999 degrees of freedom
Residual deviance: 797.18 on 4997 degrees of freedom
AIC: 803.18
Number of Fisher Scoring iterations: 8
> #predicts probabilities for holdout set values using the training set model
> HoldOut.Set$predict.default.glm.hold=predict(default.glm.training, type="response",newdata=data.frame(HoldOut.Set))
> fix(HoldOut.Set)
> # Plot the ROC curve
> perf.AUC.glm = performance(prediction(HoldOut.Set$predict.default.glm.hold,HoldOut.Set$default),"tpr","fpr")
> par(mfrow=c(1,1))
> plot(perf.AUC.glm,col="blue",lwd=3,main="ROC Curve 1st model")
> # Estimate of AUC ROC
> ROC.glm.hold=roc(HoldOut.Set$default,HoldOut.Set$predict.default.glm.hold,percent=FALSE,plot=FALSE,ci=TRUE)
> AUC.glm.hold=ROC.glm.hold$auc
> AUC.glm.hold.lb=ROC.glm.hold$ci[1]
> AUC.glm.hold.ub=ROC.glm.hold$ci[3]
> AUC.glm.hold
Area under the curve: 0.9493
> AUC.glm.hold.lb
[1] 0.935706
> AUC.glm.hold.ub
[1] 0.9629286
> #Probability Threshold
> thresh.glm.hold.youden=coords(ROC.glm.hold, x="best", input="threshold", best.method="youden")
> thresh.glm.hold=thresh.glm.hold.youden[1]
> specif.glm.hold=thresh.glm.hold.youden[2]
> sensit.glm.hold=thresh.glm.hold.youden[3]
> thresh.glm.hold
threshold
0.03540053
> specif.glm.hold
specificity
0.8667219
> sensit.glm.hold
sensitivity
0.8988095
> #Confusion Matrix and Statistics
> glm.pred.hold=rep("No",nrow(HoldOut.Set))
> glm.pred.hold[HoldOut.Set$predict.default.glm.hold>thresh.glm.hold]="Yes"
> xtab.glm.hold=table(glm.pred.hold,HoldOut.Set$default)
> xtab.glm.hold
glm.pred.hold   No  Yes
          No  4188   17
          Yes  644  151
> confusionMatrix(xtab.glm.hold,positive="Yes")
Confusion Matrix and Statistics
glm.pred.hold   No  Yes
          No  4188   17
          Yes  644  151
Accuracy : 0.8678
95% CI : (0.8581, 0.8771)
No Information Rate : 0.9664
P-Value [Acc > NIR] : 1
Kappa : 0.2733
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8988
Specificity : 0.8667
Pos Pred Value : 0.1899
Neg Pred Value : 0.9960
Prevalence : 0.0336
Detection Rate : 0.0302
Detection Prevalence : 0.1590
Balanced Accuracy : 0.8828
'Positive' Class : Yes
> #fit the 2nd logistic regression on training data
> default.glm.training2 = glm(default~balance+income+student, family=binomial(link="logit"),data=Training.Set)
> summary(default.glm.training2)
Call:
glm(formula = default ~ balance + income + student, family = binomial(link = "logit"),
data = Training.Set)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4127 -0.1455 -0.0595 -0.0226 3.7186
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.064e+01  6.846e-01 -15.536   <2e-16 ***
balance      5.615e-03  3.213e-04  17.476   <2e-16 ***
income       1.296e-06  1.161e-05   0.112   0.9112
studentYes  -5.947e-01  3.292e-01  -1.807   0.0708 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1450.21 on 4999 degrees of freedom
Residual deviance: 793.95 on 4996 degrees of freedom
AIC: 801.95
Number of Fisher Scoring iterations: 8
> #predicts probabilities for holdout set values using the training set model
> HoldOut.Set$predict.default.glm.hold2=predict(default.glm.training2, type="response",newdata=data.frame(HoldOut.Set))
> fix(HoldOut.Set)
> # Plot the ROC curve
> perf.AUC.glm2 = performance(prediction(HoldOut.Set$predict.default.glm.hold2,HoldOut.Set$default),"tpr","fpr")
> par(mfrow=c(1,1))
> plot(perf.AUC.glm2,col="blue",lwd=3,main="ROC Curve 2nd model")
> #Estimate of AUC ROC
> ROC.glm.hold2=roc(HoldOut.Set$default,HoldOut.Set$predict.default.glm.hold2,percent=FALSE,plot=FALSE,ci=TRUE)
> AUC.glm.hold2=ROC.glm.hold2$auc
> AUC.glm.hold2.lb=ROC.glm.hold2$ci[1]
> AUC.glm.hold2.ub=ROC.glm.hold2$ci[3]
> AUC.glm.hold2
Area under the curve: 0.9503
> AUC.glm.hold2.lb
[1] 0.9369902
> AUC.glm.hold2.ub
[1] 0.9635711
> #Probability Threshold
> thresh.glm.hold2.youden=coords(ROC.glm.hold2, x="best", input="threshold", best.method="youden")
> thresh.glm.hold2=thresh.glm.hold2.youden[1]
> specif.glm.hold2=thresh.glm.hold2.youden[2]
> sensit.glm.hold2=thresh.glm.hold2.youden[3]
> thresh.glm.hold2
threshold
0.03197311
> specif.glm.hold2
specificity
0.8609272
> sensit.glm.hold2
sensitivity
0.9047619
> #Confusion Matrix and Statistics
> glm.pred.hold2=rep("No",nrow(HoldOut.Set))
> glm.pred.hold2[HoldOut.Set$predict.default.glm.hold2>thresh.glm.hold2]="Yes"
> xtab.glm.hold2=table(glm.pred.hold2,HoldOut.Set$default)
> xtab.glm.hold2
glm.pred.hold2   No  Yes
           No  4160   16
           Yes  672  152
> confusionMatrix(xtab.glm.hold2,positive="Yes")
Confusion Matrix and Statistics
glm.pred.hold2   No  Yes
           No  4160   16
           Yes  672  152
Accuracy : 0.8624
95% CI : (0.8525, 0.8718)
No Information Rate : 0.9664
P-Value [Acc > NIR] : 1
Kappa : 0.2654
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9048
Specificity : 0.8609
Pos Pred Value : 0.1845
Neg Pred Value : 0.9962
Prevalence : 0.0336
Detection Rate : 0.0304
Detection Prevalence : 0.1648
Balanced Accuracy : 0.8828
'Positive' Class : Yes

Stat_AMBA_600_Problem Set3
 
Findings, Conclusions, & RecommendationsReport Writing
Findings, Conclusions, & RecommendationsReport WritingFindings, Conclusions, & RecommendationsReport Writing
Findings, Conclusions, & RecommendationsReport Writing
 
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACROBOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
 
Insurance Optimization
Insurance OptimizationInsurance Optimization
Insurance Optimization
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
 
Lab TA_______________ID_________________Name__________________.docx
 Lab TA_______________ID_________________Name__________________.docx Lab TA_______________ID_________________Name__________________.docx
Lab TA_______________ID_________________Name__________________.docx
 
IIM Rohtak Case Study Competition
IIM Rohtak Case Study Competition IIM Rohtak Case Study Competition
IIM Rohtak Case Study Competition
 
Case Study on Placement Solution to AS Business School (Biswadeep Ghosh Hazra...
Case Study on Placement Solution to AS Business School (Biswadeep Ghosh Hazra...Case Study on Placement Solution to AS Business School (Biswadeep Ghosh Hazra...
Case Study on Placement Solution to AS Business School (Biswadeep Ghosh Hazra...
 
Linear Regression Model-The National Veterans' Organization
Linear Regression Model-The National Veterans' OrganizationLinear Regression Model-The National Veterans' Organization
Linear Regression Model-The National Veterans' Organization
 
Some study materials
Some study materialsSome study materials
Some study materials
 
Question 2 of 231.0 PointsA company operates four machines dur.docx
Question 2 of 231.0 PointsA company operates four machines dur.docxQuestion 2 of 231.0 PointsA company operates four machines dur.docx
Question 2 of 231.0 PointsA company operates four machines dur.docx
 
Mining Credit Card Defults
Mining Credit Card DefultsMining Credit Card Defults
Mining Credit Card Defults
 
Classification methods and assessment
Classification methods and assessmentClassification methods and assessment
Classification methods and assessment
 

CreditCardDefaultModel

  • 1. Andrew Rogala, STAT 481 Student Project

    Research Goal and Data:
    The goal of the analysis is to develop a statistical model to predict the default status (Yes, No) of credit card customers given the predictors income, balance, and student status (Yes, No). Additionally, it would be advantageous to take the viewpoint of a credit card company and produce a model with a high true positive rate, a low false positive rate, and a high detection rate. To accomplish this, a low probability threshold is needed. The data used for this analysis is the Default data set in the ISLR library. It is a simulated data set containing information on ten thousand customers.

    Response Variable:
    The response variable is default status, Yes or No.

    Predictor Variables:
    The predictor variables are annual income in dollars; average balance, in dollars, that the customer has remaining on their credit card after making their monthly payment; and student status, Yes or No.

    Statistical Methods:
    Multiple logistic regression and the validation set approach will be used. The validation set approach will gauge how well the model will perform on a new set of data.

    Summary Statistics for the Default data set:
    default      student      balance            income
    No : 9667    No : 7056    Min.    :    0.0   Min.    :   772
    Yes:  333    Yes: 2944    1st Qu. :  481.7   1st Qu. : 21340
                              Median  :  823.6   Median  : 34553
                              Mean    :  835.4   Mean    : 33517
                              3rd Qu. : 1166.3   3rd Qu. : 43808
                              Max.    : 2654.3   Max.    : 73554

    P(default = Yes) = 333/10,000 = 0.0333
    P(default = No) = 9667/10,000 = 0.9667

                      (student)
                      No      Yes
    (default) No      6850    2817
              Yes      206     127

    P(default = Yes given student = Yes) = 127/(2817+127) = 0.04313859
    P(default = Yes given student = No) = 206/(6850+206) = 0.02919501
  • 2. [Figure: plot taken from “An Introduction to Statistical Learning”, page 137. Student = red, Non-Student = blue.]

    Box Plots:
    From the left box plot below it appears that those individuals who defaulted tended to have much higher credit card balances. This solid relationship between the predictor variable balance and the response variable default suggests there is a strong correlation between the two of them. From the right box plot below it appears that those individuals who defaulted tended to have a slightly lower median income. Thus, the predictor income is slightly correlated with the response default. Next, I will plot box plots for student and balance as well as student and income to see if there is any collinearity between these predictor variables.

    [Figure: box plots of Credit Card Balance (left) and Income (right) by Default (No, Yes).]
  • 3. Analysis of the left boxplot below shows that students tend to have slightly higher credit card balances than non-students, and thus student and balance are correlated. Analysis of the right boxplot below shows that students tend to have much lower incomes than non-students; thus the student and income variables are correlated. Due to the collinearity between these predictor variables I will initially leave the student variable out of the logistic regression and just fit a model for default with income and balance as predictors. Later I will add all three and see which model produces better results.

    [Figure: box plots of Credit Card Balance (left) and Income (right) by Student Status (No, Yes).]

    The 1st fit model on the training data is:

    \[ \log\left(\frac{p(\text{default})}{1 - p(\text{default})}\right) = -11.27 + 0.00001788\,(\text{income}) + 0.005538\,(\text{balance}) \]

    \[ P(\text{default}) = \frac{1}{1 + e^{-(-11.27 + 0.00001788\,(\text{income}) + 0.005538\,(\text{balance}))}} \]

    P-value for income = 0.0117
    P-value for balance < 2 x 10^-16
    Thus both income and balance are significant predictors of default.
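To make the first fitted equation concrete, the following sketch plugs a single customer into it using base R only. The coefficients are the rounded values reported above; the customer (income of 40,000 dollars, balance of 1,500 dollars) is a hypothetical example, not a row from the Default data set, and the threshold is the one reported for the first model.

```r
# Rounded coefficients of the 1st model, as reported in the text.
b0        <- -11.27
b_income  <- 0.00001788
b_balance <- 0.005538

# Hypothetical customer (an assumption for illustration only).
income  <- 40000
balance <- 1500

# Linear predictor = log odds of default.
log_odds <- b0 + b_income * income + b_balance * balance

# plogis() is the inverse logit: 1 / (1 + exp(-x)).
p_default <- plogis(log_odds)

# Compare with the probability threshold reported for the 1st model.
threshold <- 0.03540053
flagged <- p_default > threshold

print(c(log_odds = log_odds, p_default = p_default))
print(flagged)
```

Because the reported threshold is so low, a customer with a predicted probability well under one half can still be classified as a likely defaulter.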
  • 4. Area under the curve = 0.9493
    Probability threshold = 0.03540053

    Confusion Matrix and Statistics
                        (Observed)
                        No      Yes
    (Predicted) No      4188     17
                Yes      644    151

    % Correctly Classified = Accuracy = 0.8678
    True Positive Rate = Sensitivity = 0.8988 = P(predict default | default)
    True Negative Rate = Specificity = 0.8667 = P(predict not default | not default)
    False Positive Rate = (1 - Specificity) = 0.1333
    Prevalence = 0.0336
    Detection Rate = 151/5000 = 0.0302
    Detection Prevalence = (644+151)/5000 = 0.1590
    % Misclassified = Test Error = (644+17)/5000 = 0.1322 = 13.22%

    [Figure: ROC Curve 1st model. x-axis: False positive rate; y-axis: True positive rate.]

    The 2nd fit model on the training data is:

    \[ \log\left(\frac{p(\text{default})}{1 - p(\text{default})}\right) = -10.64 + 0.000001296\,(\text{income}) + 0.005615\,(\text{balance}) - 0.5947\,(\text{student}) \]

    \[ P(\text{default}) = \frac{1}{1 + e^{-(-10.64 + 0.000001296\,(\text{income}) + 0.005615\,(\text{balance}) - 0.5947\,(\text{student}))}} \]
  • 5. P-value for income = 0.9112
    P-value for balance < 2 x 10^-16
    P-value for student = 0.0708
    Thus balance is the only significant predictor of default in this model. However, depending on the choice of alpha, student may be considered a significant predictor as well.

    Area under the curve = 0.9503
    Probability threshold = 0.03197311

    Confusion Matrix and Statistics
                        (Observed)
                        No      Yes
    (Predicted) No      4160     16
                Yes      672    152

    % Correctly Classified = Accuracy = 0.8624
    True Positive Rate = Sensitivity = 0.9048 = P(predict default | default)
    True Negative Rate = Specificity = 0.8609 = P(predict not default | not default)
    False Positive Rate = (1 - Specificity) = 0.1391
    Prevalence = 0.0336
    Detection Rate = 152/5000 = 0.0304
    Detection Prevalence = (672+152)/5000 = 0.1648
    % Misclassified = Test Error = (672+16)/5000 = 0.1376 = 13.76%

    [Figure: ROC Curve 2nd model. x-axis: False positive rate; y-axis: True positive rate.]
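The statistics listed for both models follow directly from the four cells of each confusion matrix. As a rough check, the sketch below recomputes them in base R from the counts reported above; the helper name `rates` and the object names `cm1` and `cm2` are illustrative choices, not part of the original analysis.

```r
# Hold-out confusion matrices from the report (rows = predicted, columns = observed).
# matrix() fills column by column, so the first two counts are the "Observed No" column.
labs <- list(Predicted = c("No", "Yes"), Observed = c("No", "Yes"))
cm1 <- matrix(c(4188, 644, 17, 151), nrow = 2, dimnames = labs)  # 1st model
cm2 <- matrix(c(4160, 672, 16, 152), nrow = 2, dimnames = labs)  # 2nd model

# Illustrative helper: derive the reported rates from one confusion matrix.
rates <- function(cm) {
  tp <- cm["Yes", "Yes"]  # predicted default, observed default
  fn <- cm["No",  "Yes"]  # predicted no default, observed default
  fp <- cm["Yes", "No"]   # predicted default, observed no default
  tn <- cm["No",  "No"]   # predicted no default, observed no default
  n  <- sum(cm)
  c(accuracy             = (tp + tn) / n,
    sensitivity          = tp / (tp + fn),
    specificity          = tn / (tn + fp),
    false_positive_rate  = fp / (fp + tn),
    prevalence           = (tp + fn) / n,
    detection_rate       = tp / n,
    detection_prevalence = (tp + fp) / n,
    test_error           = (fp + fn) / n)
}

print(round(rbind(model1 = rates(cm1), model2 = rates(cm2)), 4))
```

Laying the two models side by side this way makes the trade-off visible: the second model gains one true positive at the cost of 28 additional false positives.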
  • 6. Interpretation of Results:
    Overall, without taking income and balance into consideration, students have a higher probability of default (0.0431) than non-students (0.0292). Thus, if nothing is known about a customer’s income or credit card balance, students are a riskier population. However, by examining the graph on page two it is clear that students (red) with the same balance as non-students (blue) have a lower default rate than non-students. The correlation between student and balance explains this paradox. Students tend to have higher credit card balances than non-students (see box plot on page three), and it is known that customers with higher balances are more likely to default (see box plot on page two). Even though the student population is more likely to have higher credit card balances, which tend to be associated with higher default rates, it is still possible for an individual student to have a lower probability of default than a non-student given that they have the same income and balance. The conclusion is that if no information is given about a customer’s balance and income, students are riskier; however, a student is less risky than a non-student with the same balance and income.

    The first model has a slightly smaller test error of 13.22%, as opposed to the second model’s test error of 13.76%. In addition, the first model produced better p-values for its predictors. However, with the addition of the student variable, the second model provides more information, which justifies using it as the main model for predicting default. The second model has an area under the ROC curve of 0.9503, suggesting a good fit. It also has a high true positive rate (0.9048), a reasonable false positive rate (0.1391), a high detection rate (0.0304), and a test error of 13.76%. Also, out of 5000 predictions only 16 were predicted to be No and observed to be Yes.

    In theory, a credit card company using this model and declining every customer predicted to default could reduce its default rate to 16/5000 = 0.32%, as compared to the observed default rate of 3.36%. By choosing a small probability threshold a high true positive rate was achieved; however, doing this does cause the test error to increase. This is a worthwhile sacrifice from the viewpoint of a credit card company trying to reduce its default rate. The threshold can be changed to fit the specific needs of users.

    An interpretation of the second model is as follows. A one dollar increase in balance is associated with an increase in the log odds of default of 0.005615 units when holding all other predictors constant. A one dollar increase in income is associated with an increase in the log odds of default of 0.000001296 units when holding all other predictors constant. Finally, being a student is associated with a decrease in the log odds of default of 0.5947 units when holding all other predictors constant.
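Log odds are hard to read directly, so it can help to exponentiate the coefficients and view them as odds ratios. The sketch below does this in base R for the rounded coefficients of the second model quoted above; rescaling balance to a 100 dollar increase is an illustrative choice, not something done in the original analysis.

```r
# Rounded coefficients of the 2nd model, as reported in the text.
coefs <- c(balance = 0.005615, income = 0.000001296, studentYes = -0.5947)

# exp(coefficient) = multiplicative change in the odds of default
# for a one unit increase in that predictor, other predictors held constant.
odds_ratios <- exp(coefs)
print(odds_ratios)

# A one dollar change in balance is tiny, so also show a 100 dollar increase
# (an illustrative rescaling).
or_balance_100 <- exp(100 * coefs[["balance"]])
print(or_balance_100)
```

Under these coefficients, a student's odds of default are a little over half those of a non-student with the same balance and income, which matches the direction of the interpretation above.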
  • 7. R Code and R Output:

    > library(pROC)
    > library(ROCR)
    > library(mgcv)
    > library(caret)
    > library(e1071)
    > library(ISLR)
    > attach(Default)
    > fix(Default)
    > dim(Default)
    [1] 10000     4
    > ?Default
    > summary(Default)
     default    student       balance           income
     No :9667   No :7056   Min.   :   0.0   Min.   :  772
     Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
                           Median : 823.6   Median :34553
                           Mean   : 835.4   Mean   :33517
                           3rd Qu.:1166.3   3rd Qu.:43808
                           Max.   :2654.3   Max.   :73554
    > #P(Default = Yes)
    > 333/10000
    [1] 0.0333
    > #P(Default = No)
    > 9667/10000
    [1] 0.9667
    > #Some Conditional Probabilities
    > table(Default$default,Default$student)

            No  Yes
      No  6850 2817
      Yes  206  127
    > #P(default = Yes given student = Yes)
    > 127/(2817+127)
    [1] 0.04313859
    > #P(default = Yes given student = No)
    > 206/(6850+206)
    [1] 0.02919501
    > #Box Plots
    > par(mfrow=c(1,2))
    > plot(default, balance, xlab="Default", ylab="Credit Card Balance", col="red")
    > plot(default, income, xlab="Default", ylab="Income", col="green")
    > par(mfrow=c(1,2))
  • 8.
    > plot(student,balance,xlab="Student Status",ylab="Credit Card Balance", col="red")
    > plot(student,income,xlab="Student Status",ylab="Income", col="green")
    > #Training and HoldOut Sets
    > set.seed(23)
    > ReSampleData = Default[sample(nrow(Default)),]
    > Data.Set.Splits = cut(seq(1,nrow(ReSampleData)),breaks=2,labels=FALSE)
    > tIndexes = which(Data.Set.Splits!=1,arr.ind=TRUE)
    > Training.Set = ReSampleData[tIndexes, ]
    > fix(Training.Set)
    > HoldOut.Set = ReSampleData[-tIndexes,]
    > fix(HoldOut.Set)
    > #fit the 1st logistic regression on training data
    > default.glm.training = glm(default~income + balance, family=binomial(link="logit"),data=Training.Set)
    > summary(default.glm.training)

    Call:
    glm(formula = default ~ income + balance, family = binomial(link = "logit"),
        data = Training.Set)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.4201  -0.1489  -0.0604  -0.0231   3.6961

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.127e+01  6.000e-01 -18.778   <2e-16 ***
    income       1.788e-05  7.088e-06   2.522   0.0117 *
    balance      5.538e-03  3.162e-04  17.516   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1450.21  on 4999  degrees of freedom
    Residual deviance:  797.18  on 4997  degrees of freedom
    AIC: 803.18

    Number of Fisher Scoring iterations: 8

    > #predicts probabilities for holdout set values using the training set model
    > HoldOut.Set$predict.default.glm.hold=predict(default.glm.training, type="response",newdata=data.frame(HoldOut.Set))
    > fix(HoldOut.Set)
  • 9.
    > # Plot the ROC curve
    > perf.AUC.glm = performance(prediction(HoldOut.Set$predict.default.glm.hold,HoldOut.Set$default),"tpr","fpr")
    > par(mfrow=c(1,1))
    > plot(perf.AUC.glm,col="blue",lwd=3,main="ROC Curve 1st model")
    > # Estimate of AUC ROC
    > ROC.glm.hold=roc(HoldOut.Set$default,HoldOut.Set$predict.default.glm.hold,percent=FALSE,plot=FALSE,ci=TRUE)
    > AUC.glm.hold=ROC.glm.hold$auc
    > AUC.glm.hold.lb=ROC.glm.hold$ci[1]
    > AUC.glm.hold.ub=ROC.glm.hold$ci[3]
    > AUC.glm.hold
    Area under the curve: 0.9493
    > AUC.glm.hold.lb
    [1] 0.935706
    > AUC.glm.hold.ub
    [1] 0.9629286
    > #Probability Threshold
    > thresh.glm.hold.youden=coords(ROC.glm.hold, x="best", input="threshold", best.method="youden")
    > thresh.glm.hold=thresh.glm.hold.youden[1]
    > specif.glm.hold=thresh.glm.hold.youden[2]
    > sensit.glm.hold=thresh.glm.hold.youden[3]
    > thresh.glm.hold
     threshold
    0.03540053
    > specif.glm.hold
    specificity
      0.8667219
    > sensit.glm.hold
    sensitivity
      0.8988095
    > #Confusion Matrix and Statistics
    > glm.pred.hold=rep("No",nrow(HoldOut.Set))
    > glm.pred.hold[HoldOut.Set$predict.default.glm.hold>thresh.glm.hold]="Yes"
    > xtab.glm.hold=table(glm.pred.hold,HoldOut.Set$default)
    > xtab.glm.hold

    glm.pred.hold   No  Yes
              No  4188   17
              Yes  644  151
    > confusionMatrix(xtab.glm.hold,positive="Yes")
    Confusion Matrix and Statistics

    glm.pred.hold   No  Yes
              No  4188   17
              Yes  644  151
  • 10.
                   Accuracy : 0.8678
                     95% CI : (0.8581, 0.8771)
        No Information Rate : 0.9664
        P-Value [Acc > NIR] : 1

                      Kappa : 0.2733
     Mcnemar's Test P-Value : <2e-16

                Sensitivity : 0.8988
                Specificity : 0.8667
             Pos Pred Value : 0.1899
             Neg Pred Value : 0.9960
                 Prevalence : 0.0336
             Detection Rate : 0.0302
       Detection Prevalence : 0.1590
          Balanced Accuracy : 0.8828

           'Positive' Class : Yes

    > #fit the 2nd logistic regression on training data
    > default.glm.training2 = glm(default~balance+income+student, family=binomial(link="logit"),data=Training.Set)
    > summary(default.glm.training2)

    Call:
    glm(formula = default ~ balance + income + student, family = binomial(link = "logit"),
        data = Training.Set)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.4127  -0.1455  -0.0595  -0.0226   3.7186

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.064e+01  6.846e-01 -15.536   <2e-16 ***
    balance      5.615e-03  3.213e-04  17.476   <2e-16 ***
    income       1.296e-06  1.161e-05   0.112   0.9112
    studentYes  -5.947e-01  3.292e-01  -1.807   0.0708 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1450.21  on 4999  degrees of freedom
    Residual deviance:  793.95  on 4996  degrees of freedom
  • 11.
    AIC: 801.95

    Number of Fisher Scoring iterations: 8

    > #predicts probabilities for holdout set values using the training set model
    > HoldOut.Set$predict.default.glm.hold2=predict(default.glm.training2, type="response",newdata=data.frame(HoldOut.Set))
    > fix(HoldOut.Set)
    > # Plot the ROC curve
    > perf.AUC.glm2 = performance(prediction(HoldOut.Set$predict.default.glm.hold2,HoldOut.Set$default),"tpr","fpr")
    > par(mfrow=c(1,1))
    > plot(perf.AUC.glm2,col="blue",lwd=3,main="ROC Curve 2nd model")
    > #Estimate of AUC ROC
    > ROC.glm.hold2=roc(HoldOut.Set$default,HoldOut.Set$predict.default.glm.hold2,percent=FALSE,plot=FALSE,ci=TRUE)
    > AUC.glm.hold2=ROC.glm.hold2$auc
    > AUC.glm.hold2.lb=ROC.glm.hold2$ci[1]
    > AUC.glm.hold2.ub=ROC.glm.hold2$ci[3]
    > AUC.glm.hold2
    Area under the curve: 0.9503
    > AUC.glm.hold2.lb
    [1] 0.9369902
    > AUC.glm.hold2.ub
    [1] 0.9635711
    > #Probability Threshold
    > thresh.glm.hold2.youden=coords(ROC.glm.hold2, x="best", input="threshold", best.method="youden")
    > thresh.glm.hold2=thresh.glm.hold2.youden[1]
    > specif.glm.hold2=thresh.glm.hold2.youden[2]
    > sensit.glm.hold2=thresh.glm.hold2.youden[3]
    > thresh.glm.hold2
     threshold
    0.03197311
    > specif.glm.hold2
    specificity
      0.8609272
    > sensit.glm.hold2
    sensitivity
      0.9047619
    > #Confusion Matrix and Statistics
    > glm.pred.hold2=rep("No",nrow(HoldOut.Set))
    > glm.pred.hold2[HoldOut.Set$predict.default.glm.hold2>thresh.glm.hold2]="Yes"
    > xtab.glm.hold2=table(glm.pred.hold2,HoldOut.Set$default)
  • 12.
    > xtab.glm.hold2

    glm.pred.hold2   No  Yes
               No  4160   16
               Yes  672  152
    > confusionMatrix(xtab.glm.hold2,positive="Yes")
    Confusion Matrix and Statistics

    glm.pred.hold2   No  Yes
               No  4160   16
               Yes  672  152

                   Accuracy : 0.8624
                     95% CI : (0.8525, 0.8718)
        No Information Rate : 0.9664
        P-Value [Acc > NIR] : 1

                      Kappa : 0.2654
     Mcnemar's Test P-Value : <2e-16

                Sensitivity : 0.9048
                Specificity : 0.8609
             Pos Pred Value : 0.1845
             Neg Pred Value : 0.9962
                 Prevalence : 0.0336
             Detection Rate : 0.0304
       Detection Prevalence : 0.1648
          Balanced Accuracy : 0.8828

           'Positive' Class : Yes
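The probability thresholds in the output above were chosen with `best.method="youden"`, which picks the cut-off maximizing Youden's J = sensitivity + specificity - 1. The base R sketch below illustrates that idea on a tiny made-up set of scores and labels; the function name `youden_threshold` and the toy data are assumptions for illustration, and the candidate cut-offs here are simply the observed scores, which is a simplification and may differ from how pROC enumerates thresholds.

```r
# Illustrative helper: choose the cut-off that maximizes Youden's J.
# A case is predicted positive when its score is strictly above the cut-off,
# mirroring the "> thresh" rule used in the analysis.
youden_threshold <- function(scores, labels, positive = "Yes") {
  candidates <- sort(unique(scores))
  j <- sapply(candidates, function(t) {
    pred_pos    <- scores > t
    sensitivity <- mean(pred_pos[labels == positive])   # true positive rate
    specificity <- mean(!pred_pos[labels != positive])  # true negative rate
    sensitivity + specificity - 1
  })
  candidates[which.max(j)]
}

# Toy data (made up): predicted probabilities and observed default status.
toy_scores <- c(0.01, 0.02, 0.03, 0.05, 0.20, 0.40, 0.90)
toy_labels <- c("No", "No", "No", "Yes", "No", "Yes", "Yes")

best <- youden_threshold(toy_scores, toy_labels)
print(best)
```

Because defaults are rare in the Default data, a criterion of this kind tends to select a low cut-off, consistent with the thresholds of roughly 0.03 reported for both models.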