This document summarizes a seminar presentation on using logistic regression, artificial neural networks, and support vector machines to predict whether small business loans should be approved or denied, based on a dataset from the U.S. Small Business Administration. The three models were fitted to the data to assess loan risk. The artificial neural network had the lowest misclassification rate at 31.96%, followed by logistic regression at 32.15%, while the support vector machine had the highest at 35.58%. The logistic regression model was then used to decide whether two sample loan applications should be approved or denied.
1. Yashwantrao Chavan Institute of Science, Satara.
Department of Statistics
M.Sc. II 2018-2019
Seminar on
“Should This Loan be Approved or Denied?”: A Large
Dataset with Class Assignment Guidelines
Min Li, Amy Mickel, and Stanley Taylor
Presented by
Patil Pooja Rajaram
Roll No. 115
3. Introduction:
In this article, a large and rich dataset from the U.S.
Small Business Administration (SBA) and an accompanying
assignment designed to teach statistics as an investigative
process of decision making are presented. Guidelines for the
assignment titled “Should This Loan Be Approved or Denied?,”
along with a subset of the larger dataset, are provided.
For this case-study assignment, students assume the role of
loan officer at a bank and are asked to approve or deny a loan by
assessing its risk of default using logistic regression. The dataset
accompanying this article is a real dataset from the U.S. Small
Business Administration (SBA).
4. Methodology:
By analysing real data, students experience statistics as an
investigative process of decision making, for the student is required
to answer the following question: As a representative of the bank,
should I grant a loan to a particular small business (Company X)?
Why or why not? The student makes this decision by assessing a
loan’s risk.
The assessment is accomplished by estimating the loan’s
default probability through analyzing this historical dataset and then
classifying the loan into one of two categories:
(a) higher risk—likely to default on the loan (i.e., be charged
off/failure to pay in full) or
(b) lower risk—likely to pay off the loan in full.
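The step of estimating a default probability can be sketched in a few lines. This is a minimal illustration in Python (the presentation itself used R), assuming the probability comes from a logistic model; the function name and the example linear-predictor values are hypothetical and are not taken from the fitted model.

```python
import math


def default_probability(linear_predictor):
    """Map a linear predictor (b0 + b1*x1 + ...) to a probability in (0, 1)
    using the logistic function, as a logistic regression model does."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical linear-predictor values, for illustration only.
for eta in (-2.0, 0.0, 2.0):
    print(f"linear predictor {eta:+.1f} -> probability {default_probability(eta):.3f}")
```

A larger linear predictor gives a higher estimated probability, which is then compared with a cut-off to place the loan in category (a) or (b).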
5. Background and Description of Datasets:
The U.S. SBA was founded in 1953 on the principle of
promoting and assisting small enterprises in the U.S. credit market.
SBA acts much like an insurance provider to reduce the risk for a
bank by taking on some of the risk through guaranteeing a portion
of the loan.
Two datasets are provided:
(a) “National SBA” dataset (named SBAnational.csv) from the
U.S. SBA which includes historical data from 1987 through 2014
(899,164 observations) and
(b) “SBA Case” dataset (named SBAcase.csv) which is used in
the assignment described in this paper (2,102 observations).
The “SBA Case” dataset is a subset of the “National SBA.”
The variable name, the data type, and a brief description of each
variable are provided for the 27 variables in the two datasets. For the
“SBA Case” dataset, an additional eight variables were generated by
the authors as part of the assignment.
6. PROCEDURE:
The steps involved in the investigative process of analysing
these data to make an informed decision as to whether a loan
should be approved or denied are:
Step 1: Identifying indicators of potential risk
Step 2: Understanding the case study
Step 3: Building the model, creating decision rules, and validating
the logistic regression model
Step 4: Using the model to make decisions
7. STATISTICAL TOOLS USED FOR ANALYSIS:
The statistical analysis was carried out in R using the
following tools:
1] Logistic regression
2] Artificial neural network (ANN)
3] Support vector machine (SVM)
8. Step 1: Identifying Explanatory Variables (Indicators or
Predictors) of Potential Risk
1) Location (State)
2) Industry
3) Gross Disbursement
4) New versus Established Businesses
5) Loans Backed by Real Estate
6) Economic Recession
7) SBA’s Guaranteed Portion of Approved Loan
Step 2: Understanding the Case Study and Dataset:
As loan officers for Bank of America, students have
received loan applications from two small businesses:
Carmichael Realty (a commercial real estate agency) and SV
Consulting (a real estate consulting firm). As loan officers,
they need to determine whether to grant or deny these
two loan applications and explain “why or
why not.” To make this decision, they need to assess each loan’s
risk by calculating the estimated probability of default using
logistic regression.
10. Step 4: Using the Model to Make Decisions:
Table 1:
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)    0.61170      0.09462     6.465   1.02e-10
Real Estate    2.12822      0.34500     6.169   6.89e-10
Portion        0.55722      0.10058     5.540   3.03e-08
Recession     -0.50412      0.24121    -2.090   0.0366
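The columns of Table 1 are related in the standard way for a Wald test: the z value is the estimate divided by its standard error, and Pr(>|z|) is the two-sided tail probability of a standard normal. The sketch below checks this for the Recession row; it is written in Python rather than the R used in the presentation, and the function name is hypothetical.

```python
import math


def wald_z_and_p(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal Z, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Recession row of Table 1.
z, p_value = wald_z_and_p(-0.50412, 0.24121)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```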
Classification   State of nature: Reality
                 Loans charged off   Loans paid in full   Total
Higher risk                     31                   14      45
Lower risk                     324                  682    1006
Total                          355                  696    1051
Model Accuracy = 0.6784015 = 67.84 %
Misclassification rate = 0.3215985 = 32.15 %
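The accuracy and misclassification rate follow directly from the classification table: correct classifications are the loans classified as higher risk that were charged off plus the loans classified as lower risk that were paid in full. A minimal Python sketch (the function name is hypothetical; the counts are those of the logistic regression table above):

```python
def accuracy_and_misclassification(table):
    """table[i][j] = number of loans classified as row i whose real outcome is column j.
    Rows: higher risk, lower risk. Columns: charged off, paid in full."""
    total = sum(sum(row) for row in table)
    correct = table[0][0] + table[1][1]
    accuracy = correct / total
    return accuracy, 1.0 - accuracy


logistic_table = [[31, 14], [324, 682]]
accuracy, misclassification = accuracy_and_misclassification(logistic_table)
print(f"accuracy = {accuracy:.7f}, misclassification = {misclassification:.7f}")
```

The same function applies unchanged to the classification tables of the other two models.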
11. The final model with the risk indicators in Table 1 is used to
estimate the probability of default for the two loan applications. The
estimated probability of default is 0.05 for Carmichael Realty (Loan 1)
and 0.55 for SV Consulting (Loan 2). Applying the decision rules with a
cut-off probability of 0.5, Loan 1 is classified as “lower risk” and
should be approved, and Loan 2 is classified as “higher risk” and
should be denied.
Loan No.  Name               Date     Loan Amount  SBA Portion  Real Estate  Est. Prob. of Default  Approve
1         Carmichael Realty  Current  $1,000,000   $750,000     Yes          0.05                   Yes
2         SV Consulting      Current  $100,000     $40,000      No           0.55                   No
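The decision rule applied to the two loans can be sketched as follows. This is an illustrative Python snippet, not the presenter's R code; the function name is hypothetical, and treating a probability exactly equal to the cut-off as lower risk is an assumption, since the slides do not say how ties are handled.

```python
CUTOFF = 0.5  # cut-off probability stated in the slides


def classify_loan(prob_default, cutoff=CUTOFF):
    """Return (risk class, approve?) for an estimated probability of default.
    Assumption: a probability exactly at the cut-off counts as lower risk."""
    if prob_default > cutoff:
        return "higher risk", False
    return "lower risk", True


# Estimated probabilities of default reported for the two applications.
loans = {"Carmichael Realty": 0.05, "SV Consulting": 0.55}
for name, prob in loans.items():
    risk, approve = classify_loan(prob)
    print(f"{name}: {risk}, approve = {approve}")
```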
12. Artificial neural network:
Classification   State of nature: Reality
                 Loans charged off   Loans paid in full   Total
Higher risk                     31                   12      43
Lower risk                     324                  684    1008
Total                          355                  696    1051
Model Accuracy = 0.6803045 = 68.03 %
Misclassification rate = 0.3196955 = 31.96 %
13. Support vector machine:
Classification   State of nature: Reality
                 Loans charged off   Loans paid in full   Total
Higher risk                     20                   39      59
Lower risk                     335                  657     992
Total                          355                  696    1051
Model Accuracy = 0.6441484 = 64.41 %
Misclassification rate = 0.3558516 = 35.58 %
15. Conclusion:
Model                 Accuracy   Misclassification rate
Logistic regression   67.84 %    32.15 %
ANN                   68.03 %    31.96 %
SVM                   64.41 %    35.58 %
The misclassification rate for the support vector machine was
found to be higher than those for logistic regression and the
neural network.
Logistic regression is equivalent to a neural network with no
hidden nodes.
If the objective is to separate loans that are likely to be paid
in full from loans that are likely to default, without needing
the predicted probability of default,
then neural networks and SVM are good choices.
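The equivalence between logistic regression and a network without hidden nodes can be checked numerically. The sketch below assumes the network's output node uses a sigmoid activation; the inputs, weights, and function names are hypothetical and are not the fitted values from the presentation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(x, weights, intercept):
    """P(y = 1 | x) under a logistic regression model."""
    return sigmoid(intercept + sum(w * xi for w, xi in zip(weights, x)))


def network_forward(x, layers):
    """Forward pass of a fully connected network with sigmoid activations.
    `layers` is a list of (weight_matrix, bias_vector) pairs; with a single
    pair there is no hidden layer, only the output node."""
    activations = list(x)
    for weight_matrix, biases in layers:
        activations = [
            sigmoid(b + sum(w * a for w, a in zip(row, activations)))
            for row, b in zip(weight_matrix, biases)
        ]
    return activations


# Hypothetical inputs and parameters, for illustration only.
x = [1.0, 0.75, 0.0]
weights = [0.4, -1.2, 0.3]
intercept = 0.1

p_logit = logistic_regression(x, weights, intercept)
p_net = network_forward(x, [([weights], [intercept])])[0]
print(p_logit, p_net)
```

With one output node and no hidden layer, the network's weights and bias play the role of the regression coefficients and intercept, so both functions return the same probability.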
16. References:
Journal of Statistics Education (Taylor and Francis Group)
Introduction to Linear Regression Analysis: Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining
Data Mining: Concepts and Techniques: Micheline Kamber, Jiawei Han, Jian Pei