Harsha Sinha (16125018)
Kriti Doneria (16125022)
Prakhar Barole (16125028)
MBA652A
Course Instructor: Dr. Devlina Chatterjee
April 2017

ACKNOWLEDGEMENTS
On completion of this project, we would like to thank our faculty, Dr. Devlina Chatterjee for
giving us the opportunity to pursue the project as a part of the curriculum and also being a
constant source of support throughout the project.
We would also like to thank our classmates and friends, who helped us in the
conceptualization of the problem statement.
Lastly, we thank all the researchers, bloggers and people from the community at large for
providing us a starting point for our project through their documentation, research and
articles.

OBJECTIVE
To explore, qualitatively and quantitatively, the risks associated with giving out credit for personal and
commercial purposes, and to model the risk factor using a widely used machine learning classification
method: logistic regression.

INTRODUCTION
Credit risk modelling tries to answer the question:
Assuming past behavior is predictive of future behavior, what is the probability that a
debtor will not repay the debt-holder?
The analysis of credit risk is of utmost importance for financial institutions. Historically, it was done by
taking into account the net assets a borrower had and whether they were enough to cover the debt. Being
manual in nature, it was prone to human biases and corruption. In the past two decades, technology has
transformed and automated the process, making it easier to deal with the volume of debtors (for banks)
as well as the variety of debt.
A milestone has been the development of the CIBIL score in India.

CIBIL SCORE
A credit score, or CIBIL score, is a three-digit numeric summary of your credit history. The score is
derived using the credit history found in the CIR (Credit Information Report). A CIR is an individual's
credit payment history across loan types and credit institutions over a period of time. The minimum CIBIL
score for a personal loan is generally 750. Anything above this would mean that the applicant is
creditworthy and applications are processed without hassle. In general, credit scores range from 300 to 900.

METHODOLOGY
Model used: Standard logistic heteroskedastic robust regression model.

LOGISTIC REGRESSION
Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial
distribution.
• Y ~ Binomial(n, p)
• n independent trials
• p = probability of success on each trial
• Y = number of successes out of n trials
• (e.g., Y = number of heads)

REGRESSION EQUATION
$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}$$
• p is the probability of default
• x_i is the explanatory factor i
• β_i is the regression coefficient of the explanatory factor i
• n is the number of explanatory variables
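As a rough illustration of the regression equation, the sketch below plugs the Model 1 coefficients from the R output in the appendix into the logistic function. The applicant values (a loan of 10,000 at a 10% interest rate, an income of 50,000 and an age of 30) are hypothetical and chosen only for the example.

```r
# Coefficients of Model 1, copied from the summary(result) output in the appendix
beta <- c(intercept = -3.265, loan_amnt = 1.762e-07, int_rate = 1.517e-01,
          annual_inc = -6.935e-06, age = -5.271e-03)

# Hypothetical applicant (assumed values, not taken from the dataset);
# the leading 1 multiplies the intercept
x <- c(1, loan_amnt = 10000, int_rate = 10, annual_inc = 50000, age = 30)

eta <- sum(beta * x)                    # linear predictor b0 + b1*x1 + ... + bn*xn
p_manual <- exp(eta) / (1 + exp(eta))   # the regression equation written out
p_plogis <- plogis(eta)                 # the same logistic function from base R

print(round(c(eta = eta, p = p_manual), 4))
```

Under these assumed inputs the predicted probability of default comes out a little under 10%, and `plogis()` gives the same value as the hand-written formula.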
The reasons why logistic regression is better suited to credit risk analysis are:
1. The independent variables (credit type and duration, income, etc.) are largely categorical in
nature. Categories make better predictors in this analysis than actual values.
2. The end result has to be a probability or percentage (for example, person A is x% likely to default
on the given credit). A linear regression model cannot guarantee this, since its predicted values
can fall anywhere on the number line.
3. The variability of the dependent variable (Y) is not constant. The variance of a binomial
distribution is npq (with q = 1 - p), which changes with p, whereas the linear regression model
inherently assumes normally distributed errors with constant variance.
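Points 2 and 3 can be checked with a few lines of arithmetic. This is only a sketch: the probabilities, intercept and slope below are assumed values for illustration, not estimates from the dataset.

```r
# Variance of a single binary outcome (n = 1) for a few default probabilities
p <- c(0.1, 0.25, 0.5, 0.9)
bernoulli_var <- p * (1 - p)          # npq with n = 1 and q = 1 - p
print(data.frame(p = p, variance = bernoulli_var))

# A straight line fitted to 0/1 data is not bounded: with an assumed intercept
# of 0.1 and slope of 0.1 per unit of x, the "probability" exceeds 1 once x > 9
linear_pred <- 0.1 + 0.1 * c(0, 5, 12)
print(linear_pred)
```

The variance is largest at p = 0.5 and shrinks towards 0 and 1, so it cannot be treated as constant across borrowers.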

ASSUMPTIONS IN LOGISTIC REGRESSION
• Absence of perfect multicollinearity
• No outliers
• Independence of errors
• Ratio of cases to variables: using discrete variables requires that there are enough responses in
every given category
• Not many missing values

TOOLS, TECHNOLOGIES AND DATASET

TOOLS AND TECHNOLOGIES
R scripting language, RStudio IDE for Windows.

DATASET
The dataset is a bank's record of the status of loan defaults together with the profile of each customer.
It contains information such as age, annual income, home ownership and grade of employment, which affect
the loan paying capacity of the customer.

DATASET DESCRIPTION
This data is taken from
1. Contains 29092 rows and 8 columns.
2. Contains 2043 rows with missing data.
3. The columns are:
   loan_status: 0 if successful, 1 if defaulted
   loan_amnt: total amount of loan taken
   int_rate: interest rate
   grade: grade of employment
   emp_length: duration of employment
   home_ownership: type of ownership of house
   annual_inc: annual income
   age: age of loan taker
4. Of these columns, loan_status is a binary variable; loan_amnt, int_rate, annual_inc and age are
numeric continuous variables; grade and home_ownership are categorical variables with 7 and 4
categories respectively.
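A quick way to confirm the column types and count the rows with missing data is sketched below. The data frame is a toy stand-in with the same column names; all of its values, including the factor levels, are made up for illustration.

```r
# Toy data frame with the same column names as the dataset (values are made up)
loans <- data.frame(
  loan_status    = c(0, 0, 1, 0, 1, 0),
  loan_amnt      = c(5000, 2400, 10000, 5000, 3000, 12000),
  int_rate       = c(10.65, NA, 13.49, NA, 12.69, 7.90),
  grade          = factor(c("B", "C", "C", "A", "B", "A")),
  emp_length     = c(10, 25, 13, NA, 1, 3),
  home_ownership = factor(c("RENT", "RENT", "RENT", "RENT", "OWN", "MORTGAGE")),
  annual_inc     = c(24000, 12252, 49200, 36000, 48000, 75000),
  age            = c(33, 31, 24, 39, 24, 28)
)

str(loans)                                     # column types: numeric vs factor
na_per_column <- colSums(is.na(loans))         # missing values in each column
rows_with_na  <- sum(!complete.cases(loans))   # rows with at least one missing value
print(na_per_column)
print(rows_with_na)
```

On the real data, the same two calls applied to `data1` would show which columns carry the missing values that `glm()` later drops.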

MODELLING PROCESS, SELECTION AND FINE-TUNING

PROCESS
By including and excluding some independent variables, three logistic regression models were built.
The dataset was divided into a training set (75%) and a testing set (25%). The objective of modelling was
to minimize the residual deviance on the testing data, using the respective coefficients computed from the
training data.
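A 75/25 split of this kind can be sketched by sampling row indices. The toy data frame, its size and the seed are assumptions made only so that the example is self-contained and repeatable.

```r
set.seed(42)                          # assumed seed, only to make the split repeatable
n <- 1000                             # hypothetical number of rows
toy <- data.frame(id = seq_len(n), loan_status = rbinom(n, 1, 0.1))

train_idx <- sample(n, floor(0.75 * n))   # row numbers drawn without replacement
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]                # every row not chosen for training

print(c(train = nrow(train), test = nrow(test)))
```

Sampling indices rather than the data frame itself keeps the two sets disjoint and makes it easy to reproduce the split later.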

SELECTION
Model selection was done on the basis of lowest AIC (Akaike information criterion), median deviance
residual closest to zero and highest number of significant variables at the 5% significance level
(95% confidence). Model 3 did well on all three parameters.
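For a binary logistic regression the AIC equals the residual deviance plus twice the number of estimated coefficients, so the reported AIC values can be reproduced from the summary outputs in the appendix. The sketch below does that arithmetic; the coefficient counts include the intercept.

```r
# Residual deviance and number of estimated coefficients (intercept included),
# both read off the three summary() outputs in the appendix
residual_deviance <- c(model1 = 13226, model2 = 13219, model3 = 12637)
n_coefficients    <- c(model1 = 5,     model2 = 8,     model3 = 15)

aic <- residual_deviance + 2 * n_coefficients
print(aic)
print(names(which.min(aic)))
```

One caveat worth keeping in mind: Model 3 was fitted on fewer rows (2617 observations dropped for missingness against 2043 for the other two), and AIC values are strictly comparable only between models fitted to the same observations.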

FINE TUNING
The results obtained for the test dataset were decimal values (predicted probabilities). To make them
categorical, different cut-off limits were tried, and an accuracy of 77.4% was reached. To avoid
over-fitting and limit the potential loss of profit, cut-offs were not increased beyond this limit.
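
OBSERVATIONS

SELECTING THE MODEL
Independent variables:
  Model 1: loan_amnt, int_rate, annual_inc, age
  Model 2: loan_amnt, int_rate, annual_inc, age, home_ownership
  Model 3: loan_amnt, int_rate, grade, emp_length, home_ownership, annual_inc, age

                                                   MODEL 1    MODEL 2    MODEL 3
Statistically significant terms
(p < 0.05, intercept included)                     3          4          10
Median deviance residual                           -0.4331    -0.4321    -0.4312
AIC                                                13236      13235      12667

So, the third model is better than the other two.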

SELECTING THE CUT-OFF
Setting the cutoff at .x means that an applicant is classified as a defaulter when the predicted
probability of default on the given credit exceeds x%.
A confusion matrix is a table used to describe the performance of a classification model on a set of test
data for which the true values are known.
Its general structure is: rows give the actual loan status and columns give the predicted loan status.
The accuracy of a model is computed as (true positives + true negatives) / number of rows in test data.
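The cut-off rule and the accuracy formula can be sketched on a handful of made-up predictions. The probabilities and true labels below are hypothetical; only the `ifelse()` and `table()` pattern mirrors the R code in the appendix.

```r
pred   <- c(0.05, 0.12, 0.18, 0.22, 0.31)   # hypothetical predicted default probabilities
actual <- c(0, 0, 1, 0, 1)                  # hypothetical true loan_status values

predicted_class <- ifelse(pred > 0.15, 1, 0)   # 1 = predicted default at a .15 cutoff
confmat  <- table(actual = actual, predicted = predicted_class)
accuracy <- sum(diag(confmat)) / sum(confmat)  # correct predictions / all predictions

print(confmat)
print(accuracy)
```

In this toy case one good loan is wrongly flagged as a default, so four of the five predictions are correct.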
confmat1 (cutoff = .15)
          predicted
actual       0      1
     0    4494   1173
     1     446    256
Accuracy: 65.31%

confmat2 (cutoff = .20)
          predicted
actual       0      1
     0    5363    304
     1     614     88
Accuracy: 74.94%

confmat3 (cutoff = .25)
          predicted
actual       0      1
     0    5605     62
     1     674     28
Accuracy: 77.45%
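The reported accuracies can be re-derived from the three matrices. The sketch below assumes the test set holds 25% of the 29092 rows (7273 rows); under that assumption the reported figures correspond to dividing by all test rows, whereas each matrix sums to only 6369 classified rows, presumably because test rows with missing values receive no prediction. It also shows the share of actual defaulters flagged at each cutoff.

```r
# Confusion matrices as reported above (rows: actual status, columns: predicted status)
make_confmat <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
}
confmats <- list(
  cutoff_0.15 = make_confmat(c(4494, 1173, 446, 256)),
  cutoff_0.20 = make_confmat(c(5363, 304, 614, 88)),
  cutoff_0.25 = make_confmat(c(5605, 62, 674, 28))
)

n_test <- 7273   # assumption: 25% of the 29092 rows, i.e. nrow(testdata)

correct         <- sapply(confmats, function(m) sum(diag(m)))
acc_all_rows    <- correct / n_test                  # denominator that matches the report
acc_classified  <- correct / sapply(confmats, sum)   # only rows that received a prediction
defaults_caught <- sapply(confmats, function(m) m["1", "1"] / sum(m["1", ]))

print(round(rbind(acc_all_rows, acc_classified, defaults_caught), 4))
```

Raising the cutoff improves accuracy mainly by predicting "no default" more often, and the last row shows that the share of true defaulters caught falls sharply as a result.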

QUALITATIVE ANALYSIS OF THE RESULTS

DIRECT AND INVERSE VARIATIONS
The coefficients of the following are positive:
loan_amnt, int_rate, gradeB, gradeC, gradeD, gradeE, gradeF, gradeG, emp_length,
home_ownershipOTHER
This means the probability of defaulting on the given credit varies directly with these factors, i.e. the
higher the value, the higher the risk of default. Common sense suggests the same.
For Other types of home ownership (other than home or rent, like a demolished/mortgaged home), the
probability of defaulting increases.
The following have negative coefficients:
home_ownershipOWN, home_ownershipRENT, annual_inc, age
This means that the probability of defaulting decreases as annual income and age increase, and is lower
for the OWN and RENT categories than for the baseline home ownership category.
Intuitively too, it makes sense.
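The size of these effects is easier to read as odds ratios, obtained by exponentiating the coefficients. The sketch below uses a few Model 3 estimates copied from the appendix output; the 10,000 income increment is an assumed value chosen for illustration.

```r
# Selected Model 3 coefficients, copied from summary(result2) in the appendix
coefs <- c(int_rate = 8.519e-02, gradeB = 3.390e-01, gradeG = 1.192e+00,
           home_ownershipOWN = -1.740e-01, annual_inc = -6.929e-06)

odds_ratio <- exp(coefs)   # multiplicative change in the odds of default per unit increase
print(round(odds_ratio, 4))

# Effect on the odds of a 10,000 increase in annual income (assumed increment)
income_effect <- exp(coefs[["annual_inc"]] * 10000)
print(round(income_effect, 4))
```

An odds ratio above 1 raises the odds of default and one below 1 lowers them; the grade ratios are relative to the omitted baseline grade.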

LEVEL OF SIGNIFICANCE
Variables with at least one star in the coefficients table are significant at the 5% level. A positive
coefficient means that the higher the value of that variable, the higher the risk of default, and vice
versa. The significance levels are determined using standard z-tests (Wald tests).
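The z value and p-value in the coefficients table follow directly from the estimate and its standard error. As a sketch, the calculation for int_rate in Model 3 (values copied from the appendix output) is:

```r
estimate  <- 8.519e-02   # int_rate estimate in Model 3
std_error <- 2.314e-02   # its standard error

z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))   # two-sided p-value from the standard normal

print(round(z_value, 3))
print(signif(p_value, 3))
```

Small differences from the printed table in the last digit are expected, because the inputs here are the rounded values shown in the output.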

LIMITATIONS OF THE MODEL
Reject inference: The data given by banks covers only applicants whose loans were approved, since
rejected applicants have no repayment record. It is therefore inherently biased and isn't a true
representation of a client who comes through the door. Stratified sampling can help take care of this.
Omitted variable bias: This can never be fully eliminated from any type of regression, because of the
uncertainties in the real world.
Logistic regression does not assume a linear relationship between the explanatory variables and the
response itself, but it does assume the absence of a high degree of interaction between the explanatory
variables.
Over-fitting: Logistic regression sometimes tends to over-fit the sample, appearing to be more confident
than it really is. In this case it is fine, but in other cases it might be undesirable.

CONCLUSION
• Three logit models were used to predict the loan status, and the model with the least residual deviance
and lowest AIC was selected. The first model had an Akaike information criterion score of 13236, the
second a score of 13235 and the third a score of 12667, which is a significant improvement over the other
two models. Hence the third model was selected.
• Different cut-offs were used to decide whether the loan should be granted or not. A cut-off of .15 gave
an accuracy of 65.31%, a cut-off of .20 gave 74.94% and a cut-off of .25 gave 77.45%. Hence the most
accurate cut-off was chosen. The decision to set a cutoff is arbitrary and a higher cut-off increases the
risk, so a level of .25 was decided to be optimum. The area under the curve (AUC) also gives a measure of
accuracy, which came out to be approximately 64%.
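The report does not show how the area under the curve was obtained. One way to compute it in base R, assuming it refers to the area under the ROC curve, is the rank (Mann-Whitney) formula sketched below; the four outcomes and scores are hypothetical.

```r
# Area under the ROC curve via the rank (Mann-Whitney) formula, base R only
auc <- function(actual, score) {
  r  <- rank(score)                  # average ranks handle tied scores
  n1 <- sum(actual == 1)             # number of defaulters
  n0 <- sum(actual == 0)             # number of non-defaulters
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

actual <- c(0, 0, 1, 1)              # hypothetical true outcomes
score  <- c(0.10, 0.40, 0.35, 0.80)  # hypothetical predicted probabilities
print(auc(actual, score))
```

The result is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, which does not depend on any particular cutoff.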
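
APPENDIX

R CODE
data1 <- readRDS("Loandata.rds")   # reading data
head(data1)                        # viewing the first few rows of the dataset
# sample() applied to the data frame itself would draw columns, so split on row indices
train_idx <- sample(nrow(data1), floor(0.75 * nrow(data1)))
traindata <- data1[train_idx, ]    # preparing training data (75%)
testdata <- data1[-train_idx, ]    # preparing test data (remaining 25%)
# model 1 with loan amount, interest rate, annual income, age
result <- glm(loan_status ~ loan_amnt + int_rate + annual_inc + age, family = "binomial", data = traindata)
summary(result)
# model 2 with loan amount, interest rate, annual income, age and home ownership
result1 <- glm(loan_status ~ loan_amnt + int_rate + annual_inc + age + home_ownership, family = "binomial", data = traindata)
summary(result1)
# model 3 with loan amount, interest rate, grade, employment length, annual income, age, home ownership
result2 <- glm(loan_status ~ ., family = "binomial", data = traindata)
summary(result2)   # least residual deviance
# predicting the result on test data
pred1 <- predict(result, testdata, type = "response")
pred2 <- predict(result1, testdata, type = "response")
pred <- predict(result2, testdata, type = "response")
# varying the cut-off on the model with least residual deviance:
# a predicted probability above the cut-off is classed as default (1, declined), else 0 (accepted)
cutoff1 <- ifelse(pred > .15, 1, 0)
cutoff2 <- ifelse(pred > .2, 1, 0)
cutoff3 <- ifelse(pred > .25, 1, 0)
# confusion matrices (rows: actual, columns: predicted) to show Type 1 and 2 errors
(confmat1 <- table(testdata$loan_status, cutoff1))
(confmat2 <- table(testdata$loan_status, cutoff2))
(confmat3 <- table(testdata$loan_status, cutoff3))
# checking accuracy at the different cut-offs
(logit1 <- sum(diag(confmat1)) / nrow(testdata))
(logit2 <- sum(diag(confmat2)) / nrow(testdata))
(logit3 <- sum(diag(confmat3)) / nrow(testdata))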

R CODE OUTPUT
summary(result)
Call:
glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc +
    age, family = "binomial", data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0794  -0.5334  -0.4331  -0.3421   3.7236

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.265e+00  1.400e-01 -23.318   <2e-16 ***
loan_amnt    1.762e-07  4.127e-06   0.043    0.966
int_rate     1.517e-01  7.257e-03  20.902   <2e-16 ***
annual_inc  -6.935e-06  7.700e-07  -9.005   <2e-16 ***
age         -5.271e-03  3.843e-03  -1.372    0.170
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13800 on 19775 degrees of freedom
Residual deviance: 13226 on 19771 degrees of freedom
  (2043 observations deleted due to missingness)
AIC: 13236

Number of Fisher Scoring iterations: 5

summary(result1)
Call:
glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc +
    age + home_ownership, family = "binomial", data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0816  -0.5339  -0.4321  -0.3420   3.7963

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.217e+00  1.442e-01 -22.311   <2e-16 ***
loan_amnt           -2.785e-08  4.133e-06  -0.007   0.9946
int_rate             1.527e-01  7.329e-03  20.837   <2e-16 ***
annual_inc          -7.265e-06  8.070e-07  -9.002   <2e-16 ***
age                 -5.120e-03  3.843e-03  -1.332   0.1828
home_ownershipOTHER  6.196e-01  3.072e-01   2.017   0.0437 *
home_ownershipOWN   -1.487e-01  9.310e-02  -1.597   0.1103
home_ownershipRENT  -6.259e-02  5.185e-02  -1.207   0.2274
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13800 on 19775 degrees of freedom
Residual deviance: 13219 on 19768 degrees of freedom
  (2043 observations deleted due to missingness)
AIC: 13235

Number of Fisher Scoring iterations: 5

summary(result2)
Call:
glm(formula = loan_status ~ ., family = "binomial", data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0905  -0.5315  -0.4312  -0.3321   3.7253

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -2.830e+00  2.166e-01 -13.066  < 2e-16 ***
loan_amnt            2.691e-07  4.230e-06   0.064 0.949276
int_rate             8.519e-02  2.314e-02   3.681 0.000232 ***
gradeB               3.390e-01  1.092e-01   3.104 0.001909 **
gradeC               5.366e-01  1.581e-01   3.394 0.000688 ***
gradeD               6.203e-01  2.010e-01   3.086 0.002031 **
gradeE               7.253e-01  2.507e-01   2.893 0.003819 **
gradeF               9.959e-01  3.345e-01   2.977 0.002911 **
gradeG               1.192e+00  4.401e-01   2.707 0.006783 **
emp_length           3.406e-03  3.718e-03   0.916 0.359671
home_ownershipOTHER  6.501e-01  3.085e-01   2.107 0.035129 *
home_ownershipOWN   -1.740e-01  9.798e-02  -1.776 0.075728 .
home_ownershipRENT  -5.825e-02  5.383e-02  -1.082 0.279175
annual_inc          -6.929e-06  8.191e-07  -8.460  < 2e-16 ***
age                 -6.457e-03  3.963e-03  -1.629 0.103211
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13214 on 19201 degrees of freedom
Residual deviance: 12637 on 19187 degrees of freedom
  (2617 observations deleted due to missingness)
AIC: 12667

Number of Fisher Scoring iterations: 5

We then set different cutoffs to decide which loan applications are to be denied and which are to be
accepted. Here, cutoff1 = .15, cutoff2 = .20 and cutoff3 = .25.

confmat1 (cutoff1)
        0     1
  0  4494  1173
  1   446   256

confmat2 (cutoff2)
        0     1
  0  5363   304
  1   614    88

confmat3 (cutoff3)
        0     1
  0  5605    62
  1   674    28

Accuracy at the different cutoffs (logit1, logit2, logit3):
[1] 0.6531005
[1] 0.7494844
[1] 0.7745085


Credit risk modelling using logistic regression in R

  • 2. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 ACKNOWLEDGEMENTS On completion of this project, we would like to thank our faculty, Dr. Devlina Chatterjee for giving us the opportunity to pursue the project as a part of the curriculum and also being a constant source of support throughout the project. We would also like to thank our classmates and friends, who helped us in the conceptualization of the problem statement. Lastly, we thank all the researchers, bloggers and people from the community at large for providing us a starting point for our project through their documentation, research and articles.
  • 3. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 TABLE OF CONTENTS ACKNOWLEDGEMENTS.................................................................................................................................1 OBJECTIVE.....................................................................................................................................................3 INTRODUCTION.............................................................................................................................................3 CIBIL SCORE...............................................................................................................................................3 METHODOLOGY ............................................................................................................................................4 LOGISTIC REGRESSION..............................................................................................................................4 REGRESSION EQUATION .......................................................................................................................4 ASSUMPTIONS IN LOGISTIC REGRESSION.............................................................................................5 TOOLS, TECHNOLOGIES AND DATASET:........................................................................................................5 TOOLS AND TECHNOLOGIES .....................................................................................................................5 DATASET....................................................................................................................................................5 DATASET DESCRIPTION.........................................................................................................................5 MODELLING PROCESS, SELECTION AND FINE-TUNING.................................................................................6 
PROCESS....................................................................................................................................................6 SELECTION.................................................................................................................................................6 FINE TUNING.............................................................................................................................................7 OBSERVATIONS.............................................................................................................................................7 SELECTING THE MODEL.............................................................................................................................7 SELECTING THE CUT-OFF...........................................................................................................................7 QUALITATIVE ANALYSIS OF THE RESULTS.....................................................................................................8 DIRECT AND INVERSE VARIATIONS...........................................................................................................8 LEVEL OF SIGNIFICANCE............................................................................................................................9 LIMITATIONS OF THE MODEL .......................................................................................................................9 Reject Inference....................................................................................................................................9 Omitted Variable bias ...........................................................................................................................9 Over fitting............................................................................................................................................9 
CONCLUSION.................................................................................................................................................9 REFERENCES................................................................................................................................................10 APPENDIX....................................................................................................................................................11 R CODE ....................................................................................................................................................11 R CODE OUTPUT......................................................................................................................................12
  • 4. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 OBJECTIVE To explore qualitatively and quantitatively the risks associated with giving out credit for personal and commercial purposes, and to model the risk factor using a widely used machine learning classification method; Logistic Regression. INTRODUCTION Credit risk modelling tries to answer the question: Assuming past behavior is predictive of future behavior, what is the probability that a debtor will not repay the debt-holder? The analysis of credit risk is of utmost importance for financial institutions. Historically, it was done by taking into account the net assets a borrower had and if it was enough to cover the debt. Being manual in nature, it was prone to human biases and corruption. In the past two decades, technology has transformed and automate the process, making it easier to deal with the volume of debtors (for banks) as well as variety of debt. A milestone has been the development of CIBIL score in India. CIBIL SCORE A Credit Score or the CIBIL Score is a three-digit numeric summary of your credit history. The score is derived using the credit history found in the CIR. A CIR is an individual's credit payment history across loan types and credit institutions over a period of time. The minimum CIBIL score for a personal loan is generally 750. Anything above this would mean that the applicant is creditworthy and applications are processed without hassle. In general credit scores range from 300 to 900.
  • 5. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 METHODOLOGY Model used: Standard logistic heteroskedastic robust regression model. LOGISTIC REGRESSION Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial distribution.  Y ~ Binomial(n, p)  n independent trials  p = probability of success on each trial  Y = number of successes out of n trials  (e.g., Y= number of heads) REGRESSION EQUATION P= exp (𝛽0 + 𝛽1 ∙ 𝑥1 + ⋯ + 𝛽𝑛 ∙ ) /1 + exp(𝛽0 + 𝛽1 ∙ 𝑥1 + ⋯ + 𝛽𝑛 ∙ 𝑥𝑛 )  p is the probability of default  xi is the explanatory factor i  βi is the regression coefficient of the explanatory factor i  n is the number of explanatory variables The reasons why Logistic regression is better suited to credit risk analysis are: 1. The independent variable (credit type and duration, income etc) are categorical in nature. Categories make better predictors in this analysis than actual value. 2. The end result has to be in probability or percentage (like person A is x% likely to default on the given credit), which is not possible for linear regression model since its values vary between both ends of the number line.
  • 6. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 3. The variability of the dependent variable (Y) is not constant, as in the case of a normal distribution. Variance of a binomial distribution is given by npq, while it’s the standard deviation constant for a normal distribution inherent assumed in linear regression model. ASSUMPTIONS IN LOGISTIC REGRESSION  Absence of perfect multicollinearity  No outliers  Independence of errors  Ratio of cases to variables – using discrete variables requires that there are enough responses in every given category  Not many missing variables TOOLS, TECHNOLOGIES AND DATASET: TOOLS AND TECHNOLOGIES R scripting Language, RStudio IDE for Windows. DATASET The dataset is taken as bank’s record about the status of loan defaults and the profile of customers. The dataset contains information like age, annual income, home ownership, grade of employee that affect the loan paying capacity of the customer. DATASET DESCRIPTION This data is taken from 1. Contains 29092 rows and 8 columns. 2. Contains 2043 rows with missing data. 3. The columns are namely: loan_status: 0 if successful, 1 if defaulted loan_amnt: total amount of loan taken int_rate: interest rate grade: grade of employment emp_length: duration of employement home_ownership: type of ownership of house annual_inc: annual income age: age of loan taker. 4. In the columns, loan_Status is binary variable, loan_amount, int_rate, annual_inc and age are all numeric continuous variables, while grade and home ownership are categorical variables with 7 and 4 categories respectively.
  • 7. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 MODELLING PROCESS, SELECTION AND FINE-TUNING PROCESS By including and excluding some independent variables, three logistic regression models were built. The dataset was divided into Training (75%) and Testing (25%) set. The objective of modelling was to minimize the residual deviance on the testing data, using respective co-efficient computed using training data. SELECTION Model selection was done on the basis of lowest AIC (Akaike information criterion), lowest median residual Deviance and highest number of significant variables at a confidence of 99.95% and above. Model 3 did well on all the three parameters.
  • 8. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 FINE TUNING The result obtained for the test dataset were decimal values. To make it categorical, values with different cut off limits were used and an accuracy of 77.4% was reached. To avoid over-fitting and save potential loss of profit, cut-offs were not increased beyond this limit. OBSERVATIONS SELECTING THE MODEL MODEL 1 MODEL 2 MODEL 3 INDEPENDENT VARIABLES loan_amnt int_rate annual_inc age loan_amnt int_rate annual_inc age home_ownership loan_amnt int_rate gradeB gradeC gradeD gradeE gradeF gradeG emp_length home_ow nershipOTHER home_ownershipOWN home_ownershipRENT NUMBER OF STATISTICALLY SIGNIFICANT INDEPENDENT VARIABLES (At-least .05%) 3 4 10 MEDIAN DEVIANCE RESIDUALS -0.4331 -0.4321 -0.4312 AIC 13236 13235 12667 So, the third model is better than the other two. SELECTING THE CUT-OFF Setting the cutoff at .x means that there is a probability of x% that a person will default on the given credit. A confusion matrix is a table used to describe the performance of a classification model on a set of test data for which the true values are known. Its general structure is:
  • 9. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 The accuracy of a model is computed as True positives+ True negatives/number of rows in test data. confmat1 #.15 cutoff1 0 1 0 4494 1173 1 446 256 Accuracy: 65.31% confmat2 #.20 cutoff2 0 1 0 5363 304 1 614 88 Accuracy: 74.94% confmat3 #.25 cutoff3 0 1 0 5605 62 1 674 28 Accuracy: 77.45% QUALITATIVE ANALYSIS OF THE RESULTS DIRECT AND INVERSE VARIATIONS The co-efficient of the following are positive: loan_amnt, int_rate, gradeB, gradeC, gradeD, gradeE , gradeF , gradeG, emp_length, home_ownershipOTHER This means the probability of defaulting on the given credit varies directly with these factors ie more the value, more the risk of losing credit. Common sense suggests the same.
  • 10. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 For Other types of home ownership (other than home or rent, like a demolished/mortgaged home), the probability of defaulting increases. And the following have negative co-efficient: home_ownershipOWN, home_ownershipRENT, annual_inc, age This means that the probability of defaulting is inversely proportional to the factors mentioned above. Intuitively too, it makes perfect sense. LEVEL OF SIGNIFICANCE Variables having at-least one star in the coefficients table are significant. Positive coefficient means higher the value of that variable, higher the risk of default, and vice versa. The significance levels are determined using standard Z tests. LIMITATIONS OF THE MODEL Reject Inference The data given by banks is inherently biased towards the rejected applications, and hence isn’t a true representation of a client who comes through the door. Stratified sampling can help take care of this. Omitted Variable bias can never be fully eliminated from any type of regression. This is because of the uncertainties in the real world. In logistic regression no assumptions are made about the linear distribution and absence of high degree of interaction between the explanatory variables. Over fitting: Logistic regression sometimes tend to over-fit the sample, appearing to be more confident than it really is. In this case, it is fine but in other cases, it might be undesirable. CONCLUSION  Three logit models were used to predict the loan status, the model with the least residual error was selected. Different cut off gave different accuracy. The first model had a Akaike information criterion score of 13236, while second model has score of 13235 and the third model has a score of 12667 w hich has a significant improvement from other two models. Hence the most precise model was selec ted. 
 Different cut off were used to decide if the loan should be granted to be or not and cut off of .15 gav e accuracy of 65.31% while cut off of .20 gave accuracy of 74.94% and cut off of .25 gave accuracy of 77.45%. Hence most accurate model was chosen. The decision to set a cutoff is arbitrary and higher cut off increases the risk so a level of .25 was decided to be optimum. The area under the curve also gives a measure of accuracy, which came out to be 64% approx.
  • 11. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 REFERENCES [1.] [2.] [3.] [4.] [5.]
  • 12. MBA652A Course Instructor: Dr. Devlina Chatterjee April 2017 APPENDIX R CODE data1<- readRDS("Loandata.rds") #reading data head(data1) #reading the first few lines off the dataset traindata<- sample(data1,0.75*nrow(data1))#preparing training data testdata<-sample(data1,-.75*nrow(data1))#preparing test data #model 1 with loan amount, interest rate, annual income, age result<- glm(formula=loan_status~loan_amnt+int_rate+annual_inc+age,family="binomial",data=traindata) summary(result) #model 2 with loan amount, interest amount, annual income, age and home ownership result1<- glm(formula=loan_status~loan_amnt+int_rate+annual_inc+age+home_ownership,family="binomial",da ta=traindata) summary(result1) #model 3 with loan amount, interest rate, grade, employment length, annual income, age, home ownership result2<-glm(loan_status~.,family="binomial",data=traindata) summary (result2) #Least residual deviance #predicting the result on test data pred1<-predict(result,testdata,type="response") pred2<-predict(result1,testdata,type="response") pred<-predict(result2,testdata,type="response") #Varying cut off for the best predictor on the model with least residual deviance #at if value below .15 then it is declined else excepted cutoff1<-ifelse(pred>.15,1,0) #at if value below .2 then it is declined else excepted cutoff2<-ifelse(pred>.2,1,0) #at if value below .25 then it is declined else excepted cutoff3<-ifelse(pred>.25,1,0) #confusion matrix to show Type 1 and 2 errors confmat1<-table(testdata$loan_status,cutoff1) confmat1 confmat2<-table(testdata$loan_status,cutoff2) confmat2
confmat3 <- table(testdata$loan_status, cutoff3)
confmat3

# checking accuracy at the different cutoffs
logit1 <- sum(diag(confmat1)) / nrow(testdata)
logit1
logit2 <- sum(diag(confmat2)) / nrow(testdata)
logit2
logit3 <- sum(diag(confmat3)) / nrow(testdata)
logit3

R CODE OUTPUT

summary(result)

Call:
glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc + age,
    family = "binomial", data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0794  -0.5334  -0.4331  -0.3421   3.7236

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.265e+00  1.400e-01 -23.318   <2e-16 ***
loan_amnt    1.762e-07  4.127e-06   0.043    0.966
int_rate     1.517e-01  7.257e-03  20.902   <2e-16 ***
annual_inc  -6.935e-06  7.700e-07  -9.005   <2e-16 ***
age         -5.271e-03  3.843e-03  -1.372    0.170
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13800  on 19775  degrees of freedom
Residual deviance: 13226  on 19771  degrees of freedom
  (2043 observations deleted due to missingness)
AIC: 13236

Number of Fisher Scoring iterations: 5

summary(result1)

Call:
glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc + age + home_ownership,
    family = "binomial", data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0816  -0.5339  -0.4321  -0.3420   3.7963

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.217e+00  1.442e-01 -22.311   <2e-16 ***
loan_amnt           -2.785e-08  4.133e-06  -0.007   0.9946
int_rate             1.527e-01  7.329e-03  20.837   <2e-16 ***
annual_inc          -7.265e-06  8.070e-07  -9.002   <2e-16 ***
age                 -5.120e-03  3.843e-03  -1.332   0.1828
home_ownershipOTHER  6.196e-01  3.072e-01   2.017   0.0437 *
home_ownershipOWN   -1.487e-01  9.310e-02  -1.597   0.1103
home_ownershipRENT  -6.259e-02  5.185e-02  -1.207   0.2274
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13800  on 19775  degrees of freedom
Residual deviance: 13219  on 19768  degrees of freedom
  (2043 observations deleted due to missingness)
AIC: 13235

Number of Fisher Scoring iterations: 5

summary(result2)

Call:
glm(formula = loan_status ~ ., family = "binomial", data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0905  -0.5315  -0.4312  -0.3321   3.7253

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -2.830e+00  2.166e-01 -13.066  < 2e-16 ***
loan_amnt            2.691e-07  4.230e-06   0.064 0.949276
int_rate             8.519e-02  2.314e-02   3.681 0.000232 ***
gradeB               3.390e-01  1.092e-01   3.104 0.001909 **
gradeC               5.366e-01  1.581e-01   3.394 0.000688 ***
gradeD               6.203e-01  2.010e-01   3.086 0.002031 **
gradeE               7.253e-01  2.507e-01   2.893 0.003819 **
gradeF               9.959e-01  3.345e-01   2.977 0.002911 **
gradeG               1.192e+00  4.401e-01   2.707 0.006783 **
emp_length           3.406e-03  3.718e-03   0.916 0.359671
home_ownershipOTHER  6.501e-01  3.085e-01   2.107 0.035129 *
home_ownershipOWN   -1.740e-01  9.798e-02  -1.776 0.075728 .
home_ownershipRENT  -5.825e-02  5.383e-02  -1.082 0.279175
annual_inc          -6.929e-06  8.191e-07  -8.460  < 2e-16 ***
age                 -6.457e-03  3.963e-03  -1.629 0.103211
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13214  on 19201  degrees of freedom
Residual deviance: 12637  on 19187  degrees of freedom
  (2617 observations deleted due to missingness)
AIC: 12667

Number of Fisher Scoring iterations: 5
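
Models 1 and 2 are nested and were fitted on the same number of observations, so their residual deviances can be compared with a likelihood-ratio test. The sketch below does this from the rounded deviances reported above, so the p-value is only approximate; the variable names are chosen for this example. Model 3 was fitted on fewer rows (2617 rather than 2043 observations were dropped, presumably because the additional predictors contain missing values), so its lower deviance is not directly comparable with the other two.

```r
# Residual deviances and degrees of freedom as reported in the summaries
# (rounded values, so the result is approximate)
dev_model1 <- 13226; df_model1 <- 19771
dev_model2 <- 13219; df_model2 <- 19768

# Likelihood-ratio statistic: drop in deviance from adding home_ownership
lr_stat <- dev_model1 - dev_model2
lr_df <- df_model1 - df_model2

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

With the fitted objects at hand, anova(result, result1, test = "Chisq") performs the same comparison on the unrounded deviances.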

We then set different cutoffs to decide which loan applications should be denied and which should be accepted.

confmat1
   cutoff1
       0    1
  0 4494 1173
  1  446  256

confmat2
   cutoff2
       0    1
  0 5363  304
  1  614   88

confmat3
   cutoff3
       0    1
  0 5605   62
  1  674   28

Here, cutoff1 = .15, cutoff2 = .20 and cutoff3 = .25.

Accuracy at the different cutoffs was:

logit1
[1] 0.6531005
logit2
[1] 0.7494844
logit3
[1] 0.7745085
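
Accuracy alone hides how the two kinds of error trade off as the cutoff changes. As a minimal sketch, the code below rebuilds the three confusion matrices from the counts reported above and computes accuracy, sensitivity (share of actual 1s predicted as 1) and specificity (share of actual 0s predicted as 0). The helpers make_confmat and metrics are names introduced for this example. Accuracy here is taken over the 6369 rows that appear in each matrix; the reported logit values are consistent with a larger denominator of about 7273 test rows, which would be the case if rows with missing predictions were counted by nrow(testdata) but dropped by table().

```r
# Rebuild a 2x2 confusion matrix (rows = actual, columns = predicted)
make_confmat <- function(tn, fp, fn, tp) {
  matrix(c(tn, fn, fp, tp), nrow = 2,
         dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
}

# Counts copied from confmat1, confmat2 and confmat3
cm_015 <- make_confmat(tn = 4494, fp = 1173, fn = 446, tp = 256)
cm_020 <- make_confmat(tn = 5363, fp = 304, fn = 614, tp = 88)
cm_025 <- make_confmat(tn = 5605, fp = 62, fn = 674, tp = 28)

# Accuracy, sensitivity and specificity for one confusion matrix
metrics <- function(cm) {
  c(accuracy = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

# One column per cutoff, one row per metric
results <- sapply(list("0.15" = cm_015, "0.20" = cm_020, "0.25" = cm_025),
                  metrics)
print(round(results, 4))
```

On these counts, accuracy over the scored rows rises from about 74.6% to about 88.4% as the cutoff moves from .15 to .25, while sensitivity falls from about 36% to about 4%, so the gain comes largely from classing fewer loans as 1.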