This document describes using logistic regression to model credit risk. It discusses CIBIL scores, the methodology of logistic regression including the regression equation and assumptions. It details the tools, technologies and dataset used which contains loan applicant information. The modeling process is described including variable selection, fine tuning the model, and observations around selecting the best model and cut-off value. Limitations of the model and conclusions are also summarized.
Credit risk modelling using logistic regression in R
CREDIT RISK MODELLING USING LOGISTIC REGRESSION
STATISTICAL METHODS FOR BUSINESS ANALYTICS PROJECT REPORT
By:
Harsha Sinha (16125018)
Kriti Doneria (16125022)
Prakhar Barole (16125028)
MBA652A | Course Instructor: Dr. Devlina Chatterjee | April 2017
ACKNOWLEDGEMENTS
On completion of this project, we would like to thank our faculty, Dr. Devlina Chatterjee, for
giving us the opportunity to pursue this project as part of the curriculum and for being a
constant source of support throughout.
We would also like to thank our classmates and friends, who helped us conceptualize the
problem statement.
Lastly, we thank the researchers, bloggers and the community at large whose documentation,
research and articles gave us a starting point for our project.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
OBJECTIVE
INTRODUCTION
CIBIL SCORE
METHODOLOGY
LOGISTIC REGRESSION
REGRESSION EQUATION
ASSUMPTIONS IN LOGISTIC REGRESSION
TOOLS, TECHNOLOGIES AND DATASET
TOOLS AND TECHNOLOGIES
DATASET
DATASET DESCRIPTION
MODELLING PROCESS, SELECTION AND FINE-TUNING
PROCESS
SELECTION
FINE TUNING
OBSERVATIONS
SELECTING THE MODEL
SELECTING THE CUT-OFF
QUALITATIVE ANALYSIS OF THE RESULTS
DIRECT AND INVERSE VARIATIONS
LEVEL OF SIGNIFICANCE
LIMITATIONS OF THE MODEL
Reject Inference
Omitted Variable Bias
Overfitting
CONCLUSION
REFERENCES
APPENDIX
R CODE
R CODE OUTPUT
OBJECTIVE
To explore, qualitatively and quantitatively, the risks associated with extending credit for personal and
commercial purposes, and to model the risk factor using a widely used machine learning classification
method: logistic regression.
INTRODUCTION
Credit risk modelling tries to answer the question:
Assuming past behavior is predictive of future behavior, what is the probability that a
debtor will not repay the debt-holder?
The analysis of credit risk is of utmost importance for financial institutions. Historically, it was done by
checking whether a borrower's net assets were enough to cover the debt. Being manual in nature, this
process was prone to human biases and corruption. In the past two decades, technology has
transformed and automated the process, making it easier to handle both the volume of debtors (for
banks) and the variety of debt.
A milestone has been the development of CIBIL score in India.
CIBIL SCORE
A Credit Score, or CIBIL Score, is a three-digit numeric summary of your credit history. The score is
derived from the Credit Information Report (CIR): an individual's credit payment history across loan
types and credit institutions over a period of time. In general, credit scores range from 300 to 900. The
minimum CIBIL score for a personal loan is generally 750; anything above this means the applicant is
considered creditworthy, and applications are processed without hassle.
METHODOLOGY
Model used: a standard logistic regression model (with heteroskedasticity-robust inference).
LOGISTIC REGRESSION
Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial
distribution.
Y ~ Binomial(n, p), where
n = number of independent trials
p = probability of success on each trial
Y = number of successes out of n trials (e.g., Y = number of heads)
REGRESSION EQUATION
p = exp(β0 + β1·x1 + ⋯ + βn·xn) / (1 + exp(β0 + β1·x1 + ⋯ + βn·xn))
p is the probability of default
xi is the explanatory factor i
βi is the regression coefficient of the explanatory factor i
n is the number of explanatory variables
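The equation above is the standard logistic (sigmoid) transform of a linear predictor. A minimal Python sketch (illustrative only; the report's own code, in the appendix, is in R):

```python
import math

def default_probability(beta0, betas, xs):
    """Logistic regression probability: p = exp(z) / (1 + exp(z)),
    where z = beta0 + sum of beta_i * x_i."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(z) / (1.0 + math.exp(z))

# With all coefficients zero, z = 0 and p = 0.5
print(default_probability(0.0, [0.0], [1.0]))  # 0.5
```

Because of the exponentials, the output is always strictly between 0 and 1, which is exactly what a default probability requires.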
The reasons why logistic regression is better suited to credit risk analysis are:
1. The independent variables (credit type and duration, income, etc.) are largely categorical in
nature, and categories make better predictors in this analysis than raw values.
2. The end result has to be a probability or percentage (e.g., person A is x% likely to default
on the given credit), which a linear regression model cannot guarantee, since its predictions
can take any value on the real line.
3. The variability of the dependent variable (Y) is not constant. The variance of a binomial
distribution is np(1 − p), which changes with p, whereas the linear regression model
inherently assumes a constant (normal) error variance.
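To see why that variance is not constant, note that Var(Y) = np(1 − p) peaks at p = 0.5 and shrinks toward the extremes. A quick illustrative check:

```python
def binomial_variance(n, p):
    # Var(Y) = n * p * (1 - p): largest at p = 0.5, shrinking toward 0 and 1
    return n * p * (1 - p)

for p in (0.1, 0.5, 0.9):
    print(p, round(binomial_variance(100, p), 4))
```

For n = 100 this gives 9, 25, and 9: a borrower pool with default probability near 0.5 is far noisier than one near 0.1 or 0.9, violating the homoskedasticity assumed by linear regression.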
ASSUMPTIONS IN LOGISTIC REGRESSION
Absence of perfect multicollinearity
No influential outliers
Independence of errors
Adequate ratio of cases to variables: using discrete variables requires enough responses in
every given category
Not many missing values
TOOLS, TECHNOLOGIES AND DATASET:
TOOLS AND TECHNOLOGIES
R scripting Language, RStudio IDE for Windows.
DATASET
The dataset is a bank's record of loan default status and customer profiles. It contains information such
as age, annual income, home ownership, and employment grade, all of which affect a customer's
capacity to repay a loan.
DATASET DESCRIPTION
This data is taken from https://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html
1. Contains 29092 rows and 8 columns.
2. Contains 2043 rows with missing data.
3. The columns are:
loan_status: 0 if repaid successfully, 1 if defaulted
loan_amnt: total amount of the loan taken
int_rate: interest rate
grade: grade of employment
emp_length: duration of employment
home_ownership: type of home ownership
annual_inc: annual income
age: age of the borrower
4. loan_status is a binary variable; loan_amnt, int_rate, annual_inc, and age are numeric
continuous variables; grade and home_ownership are categorical variables with 7 and 4
categories respectively.
MODELLING PROCESS, SELECTION AND FINE-TUNING
PROCESS
Three logistic regression models were built by including and excluding independent variables.
The dataset was divided into training (75%) and testing (25%) sets. The objective of modelling was to
minimize the residual deviance on the testing data, using coefficients estimated from the training
data.
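Residual deviance is -2 times the Bernoulli log-likelihood of the observed outcomes under the predicted probabilities. A hypothetical Python sketch of the computation (the report obtains it via R's glm):

```python
import math

def residual_deviance(y_true, p_hat):
    """Deviance = -2 * sum of Bernoulli log-likelihoods."""
    ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
             for y, p in zip(y_true, p_hat))
    return -2.0 * ll

# An uninformative model predicting p = 0.5 for 4 observations
# contributes 2*ln(2) per row, i.e. about 5.545 in total
print(round(residual_deviance([0, 1, 0, 1], [0.5] * 4), 3))  # 5.545
```

Lower deviance means the fitted probabilities sit closer to the observed 0/1 outcomes, which is why it serves as the selection objective here.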
SELECTION
Model selection was done on the basis of the lowest AIC (Akaike information criterion), the lowest
median residual deviance, and the highest number of variables significant at the 0.05 level or better.
Model 3 did well on all three parameters.
FINE TUNING
The predictions obtained on the test dataset were probabilities. To make them categorical, different
cut-off limits were applied, and an accuracy of 77.45% was reached. To avoid over-fitting and the
potential loss of profit, the cut-off was not increased beyond this limit.
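The cut-off step mirrors the `ifelse(pred > cutoff, 1, 0)` calls in the appendix R code; an equivalent Python sketch:

```python
def classify(probs, cutoff):
    # Predicted default probability above the cut-off -> classified as default (1)
    return [1 if p > cutoff else 0 for p in probs]

preds = [0.10, 0.18, 0.22, 0.40]
print(classify(preds, 0.15))  # [0, 1, 1, 1]
print(classify(preds, 0.25))  # [0, 0, 0, 1]
```

Raising the cut-off flags fewer applicants as likely defaulters, trading missed defaults against rejected good customers.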
OBSERVATIONS
SELECTING THE MODEL
                              MODEL 1         MODEL 2          MODEL 3
Independent variables         loan_amnt       loan_amnt        loan_amnt
                              int_rate        int_rate         int_rate
                              annual_inc      annual_inc       grade (B to G)
                              age             age              emp_length
                                              home_ownership   home_ownership
                                                               annual_inc
                                                               age
Statistically significant
independent variables
(p < .05)                     3               4                10
Median deviance residuals     -0.4331         -0.4321          -0.4312
AIC                           13236           13235            12667
So, the third model is better than the other two.
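The AIC values can be cross-checked against the appendix outputs: AIC = residual deviance + 2k, where k is the number of estimated parameters (5, 8, and 15 coefficients for the three models, per the summaries). An illustrative Python check:

```python
def aic(residual_deviance, n_params):
    # AIC = residual deviance + 2 * number of estimated parameters
    return residual_deviance + 2 * n_params

print(aic(13226, 5))   # 13236 (model 1)
print(aic(13219, 8))   # 13235 (model 2)
print(aic(12637, 15))  # 12667 (model 3)
```

Model 3's lower AIC shows its extra parameters are more than paid for by the drop in deviance.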
SELECTING THE CUT-OFF
Setting the cut-off at x means that an applicant whose predicted probability of default exceeds x is
classified as a likely defaulter.
A confusion matrix is a table used to describe the performance of a classification model on a set of test
data for which the true values are known.
Its general structure is:

             Predicted: 0      Predicted: 1
Actual: 0    True negatives    False positives
Actual: 1    False negatives   True positives
The accuracy of a model is computed as (true positives + true negatives) / number of rows in the test data.
confmat1 #.15
cutoff1
0 1
0 4494 1173
1 446 256
Accuracy: 65.31%
confmat2 #.20
cutoff2
0 1
0 5363 304
1 614 88
Accuracy: 74.94%
confmat3 #.25
cutoff3
0 1
0 5605 62
1 674 28
Accuracy: 77.45%
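These accuracies follow directly from the matrices above, assuming a test set of 7273 rows (25% of 29092; rows with missing values stay in the denominator, as in the appendix code). An illustrative Python check:

```python
def accuracy(conf, n_rows):
    # (true negatives + true positives) / total test rows
    return (conf[0][0] + conf[1][1]) / n_rows

n = 7273  # 0.25 * 29092
print(round(accuracy([[4494, 1173], [446, 256]], n), 4))  # 0.6531 (cut-off .15)
print(round(accuracy([[5363, 304], [614, 88]], n), 4))    # 0.7495 (cut-off .20)
print(round(accuracy([[5605, 62], [674, 28]], n), 4))     # 0.7745 (cut-off .25)
```

Note how the higher cut-offs gain accuracy mostly by predicting "no default" more often, at the cost of catching fewer true defaulters (256 down to 28).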
QUALITATIVE ANALYSIS OF THE RESULTS
DIRECT AND INVERSE VARIATIONS
The coefficients of the following are positive:
loan_amnt, int_rate, gradeB, gradeC, gradeD, gradeE, gradeF, gradeG, emp_length,
home_ownershipOTHER
This means the probability of defaulting on the given credit varies directly with these factors, i.e., the
higher the value, the higher the risk of default. Common sense suggests the same.
For Other types of home ownership (other than home or rent, like a demolished/mortgaged home), the
probability of defaulting increases.
And the following have negative coefficients:
home_ownershipOWN, home_ownershipRENT, annual_inc, age
This means that the probability of defaulting varies inversely with these factors, which is also
intuitive.
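The sign interpretation can be made quantitative through odds ratios, exp(β). For example, using model 3's coefficients from the appendix output (int_rate 0.08519, home_ownershipOWN -0.1740), an illustrative sketch:

```python
import math

def odds_ratio(coef):
    # exp(beta): multiplicative change in the odds of default per unit increase
    return math.exp(coef)

print(round(odds_ratio(0.08519), 3))  # 1.089: each 1-point rise in interest
                                      # rate multiplies default odds by ~1.09
print(round(odds_ratio(-0.1740), 3))  # 0.84: owning a home lowers default odds
```

An odds ratio above 1 corresponds to a positive coefficient (direct variation), below 1 to a negative one (inverse variation).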
LEVEL OF SIGNIFICANCE
Variables with at least one star in the coefficients table are significant. A positive coefficient means the
higher the value of that variable, the higher the risk of default, and vice versa. The significance levels
are determined using standard z-tests.
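Each z value in the summary outputs is simply the coefficient estimate divided by its standard error; for example, int_rate in model 1 (estimate 1.517e-01, std. error 7.257e-03). An illustrative check:

```python
def z_value(estimate, std_error):
    # Wald z statistic: coefficient estimate / standard error
    return estimate / std_error

print(round(z_value(1.517e-01, 7.257e-03), 1))  # 20.9, matching the summary
```

A |z| above roughly 1.96 corresponds to significance at the 0.05 level under the normal approximation.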
LIMITATIONS OF THE MODEL
Reject Inference: Bank data covers only accepted applications; rejected applicants never enter the
sample, so it is not a true representation of every client who comes through the door. Stratified
sampling can help take care of this.
Omitted Variable Bias: This can never be fully eliminated from any type of regression, because of the
uncertainties in the real world. In addition, the model makes no allowance for non-linear effects or
high-degree interactions between the explanatory variables.
Over-fitting: Logistic regression sometimes tends to over-fit the sample, appearing more confident
than it really is. In this case it is acceptable, but in other settings it might be undesirable.
CONCLUSION
Three logit models were used to predict loan status, and the model with the least residual deviance
was selected. The first model had an Akaike information criterion (AIC) score of 13236, the second
13235, and the third 12667, a significant improvement over the other two; hence the most precise
model was selected.
Different cut-offs were then used to decide whether a loan should be granted: a cut-off of .15 gave an
accuracy of 65.31%, .20 gave 74.94%, and .25 gave 77.45%, so the most accurate setting was chosen.
The choice of cut-off is ultimately arbitrary, and since a higher cut-off increases the risk, a level of .25
was judged optimum. The area under the ROC curve, another measure of accuracy, came out to be
approximately 64%.
REFERENCES
[1] www.wikihow.com/Check-Your-Credit-Score-Online-in-India
[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1065119/
[3] https://www2.deloitte.com/
[4] Hackerearth.com
[5] Analyticsvidya.com
APPENDIX
R CODE
data1 <- readRDS("Loandata.rds")  # reading data
head(data1)  # inspecting the first few rows of the dataset

# preparing training (75%) and test (25%) data
# note: sample row indices; calling sample() on the data frame itself
# would sample columns rather than rows
idx <- sample(nrow(data1), 0.75 * nrow(data1))
traindata <- data1[idx, ]
testdata <- data1[-idx, ]

# model 1 with loan amount, interest rate, annual income, age
result <- glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc + age,
              family = "binomial", data = traindata)
summary(result)

# model 2 with loan amount, interest rate, annual income, age and home ownership
result1 <- glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc + age +
                 home_ownership, family = "binomial", data = traindata)
summary(result1)

# model 3 with loan amount, interest rate, grade, employment length,
# home ownership, annual income, age
result2 <- glm(loan_status ~ ., family = "binomial", data = traindata)
summary(result2)  # least residual deviance

# predicting on the test data
pred1 <- predict(result, testdata, type = "response")
pred2 <- predict(result1, testdata, type = "response")
pred <- predict(result2, testdata, type = "response")

# varying the cut-off for the model with least residual deviance:
# predictions above the cut-off are classified as defaults (1), else accepted (0)
cutoff1 <- ifelse(pred > .15, 1, 0)
cutoff2 <- ifelse(pred > .2, 1, 0)
cutoff3 <- ifelse(pred > .25, 1, 0)

# confusion matrices to show Type 1 and Type 2 errors
confmat1 <- table(testdata$loan_status, cutoff1)
confmat1
confmat2 <- table(testdata$loan_status, cutoff2)
confmat2
confmat3 <- table(testdata$loan_status, cutoff3)
confmat3

# checking accuracy at the different cut-offs
logit1 <- sum(diag(confmat1)) / nrow(testdata)
logit1
logit2 <- sum(diag(confmat2)) / nrow(testdata)
logit2
logit3 <- sum(diag(confmat3)) / nrow(testdata)
logit3
R CODE OUTPUT
summary(result)
Call:
glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc +
age, family = "binomial", data = traindata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0794 -0.5334 -0.4331 -0.3421 3.7236
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.265e+00 1.400e-01 -23.318 <2e-16 ***
loan_amnt 1.762e-07 4.127e-06 0.043 0.966
int_rate 1.517e-01 7.257e-03 20.902 <2e-16 ***
annual_inc -6.935e-06 7.700e-07 -9.005 <2e-16 ***
age -5.271e-03 3.843e-03 -1.372 0.170
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13800 on 19775 degrees of freedom
Residual deviance: 13226 on 19771 degrees of freedom
(2043 observations deleted due to missingness)
AIC: 13236
Number of Fisher Scoring iterations: 5
summary(result1)
Call:
glm(formula = loan_status ~ loan_amnt + int_rate + annual_inc +
age + home_ownership, family = "binomial", data = traindata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0816 -0.5339 -0.4321 -0.3420 3.7963
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.217e+00 1.442e-01 -22.311 <2e-16 ***
loan_amnt -2.785e-08 4.133e-06 -0.007 0.9946
int_rate 1.527e-01 7.329e-03 20.837 <2e-16 ***
annual_inc -7.265e-06 8.070e-07 -9.002 <2e-16 ***
age -5.120e-03 3.843e-03 -1.332 0.1828
home_ownershipOTHER 6.196e-01 3.072e-01 2.017 0.0437 *
home_ownershipOWN -1.487e-01 9.310e-02 -1.597 0.1103
home_ownershipRENT -6.259e-02 5.185e-02 -1.207 0.2274
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13800 on 19775 degrees of freedom
Residual deviance: 13219 on 19768 degrees of freedom
(2043 observations deleted due to missingness)
AIC: 13235
Number of Fisher Scoring iterations: 5
summary(result2)
Call:
glm(formula = loan_status ~ ., family = "binomial", data = traindata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0905 -0.5315 -0.4312 -0.3321 3.7253
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.830e+00 2.166e-01 -13.066 < 2e-16 ***
loan_amnt 2.691e-07 4.230e-06 0.064 0.949276
int_rate 8.519e-02 2.314e-02 3.681 0.000232 ***
gradeB 3.390e-01 1.092e-01 3.104 0.001909 **
gradeC 5.366e-01 1.581e-01 3.394 0.000688 ***
gradeD 6.203e-01 2.010e-01 3.086 0.002031 **
gradeE 7.253e-01 2.507e-01 2.893 0.003819 **
gradeF 9.959e-01 3.345e-01 2.977 0.002911 **
gradeG 1.192e+00 4.401e-01 2.707 0.006783 **
emp_length 3.406e-03 3.718e-03 0.916 0.359671
home_ownershipOTHER 6.501e-01 3.085e-01 2.107 0.035129 *
home_ownershipOWN -1.740e-01 9.798e-02 -1.776 0.075728 .
home_ownershipRENT -5.825e-02 5.383e-02 -1.082 0.279175
annual_inc -6.929e-06 8.191e-07 -8.460 < 2e-16 ***
age -6.457e-03 3.963e-03 -1.629 0.103211
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13214 on 19201 degrees of freedom
Residual deviance: 12637 on 19187 degrees of freedom
(2617 observations deleted due to missingness)
AIC: 12667
Number of Fisher Scoring iterations: 5
Different cut-offs were then set to determine which loan applications should be denied and which accepted:
confmat1
cutoff1
0 1
0 4494 1173
1 446 256
confmat2
cutoff2
0 1
0 5363 304
1 614 88
confmat3
cutoff3
0 1
0 5605 62
1 674 28
Here, cutoff1 = .15, cutoff2 = .20 and cutoff3 = .25.
The accuracies at the different cut-offs were:
logit1
[1] 0.6531005
logit2
[1] 0.7494844
logit3
[1] 0.7745085