1) The document describes a project analyzing FICO credit score data using linear regression to determine which demographic variables influence an individual's score.
2) Through variable selection, the final model included average months in file, maximum delinquency in last 12 months, maximum delinquency ever, and net fraction revolving burden as predictors of FICO score.
3) While the model performed reasonably well, some assumptions like independent errors and homoscedasticity were violated, so the results should be interpreted cautiously despite consistency between training and test sets.
MISY 267: BUSINESS ANALYTICS FINAL PROJECT
FICO Credit Risk Data
MAY 9TH, 2016
UNIVERSITY OF DELAWARE
Daniel Amato, Kyle Zaino, Chirag Dhamecha, Erik Caputo, Joey Czechowicz
2. MISY267: Business Analytics 1
Goals
Our problem is that we are uncertain what variables influence a person’s FICO Score and
so our goal is to ascertain what demographic variables determine a given person’s FICO
score. We will use a variable selection method to determine which variables to include or
exclude from the model and we will run a linear regression model because we are
predicting a continuous variable as opposed to a binary variable. Our team will test the
model assumptions and check for any violations within the model. If any violations are
found, we will make suggestions on how to address them. This model will then be able to
assist a manager or a financial firm that is attempting to determine if an applicant is too
risky to be given a loan.
Methodologies
Data Available & Variables of Interest
Once we eliminated the variables with the terms “Application” and “Trades” in their
names, we were left with the following variables: risk flag (paid as negotiated flag), interest
revenue accumulated, sampling weight, debt to income ratio, number of borrowers,
geographic region, prior or current bank relationship, collateral value, loan amount
requested, loan to value ratio, FICO score, average months in file, months since most recent
delinquency, maximum delinquency and public records in the last twelve months,
maximum delinquency ever, number of inquiries in last 6 months, number of inquiries in
last 6 months excluding 7 days, net fraction revolving burden, and net fraction installment
burden.
Variable Selection: Backward Stepwise Selection
We used the Backward Stepwise variable selection method where we started running the
model with all variables available, and removed one variable at a time, choosing to remove
the variables that contributed the least to the model. Since we are working with FICO scores
of individual persons, we will choose a model based on the best prediction accuracy and
not R-squared value. Therefore, we want a model that yields the smallest errors out-of-
sample. This led us to remove all of the variables from our model except for average
months in file, maximum delinquency within the last 12 months, maximum delinquency ever,
and net fraction revolving burden. Our final model has an R-squared value of
approximately 71%, and the aforementioned four variables. Additionally, we looked for
interactions between our remaining variables, but found that there were not any interactions
that added value to our model. Lastly, we tested non-linear relationships by adding
transformations of variables to our model. For example, adding the square root of the ‘Max
Delinquency in 12 months’ variable increased our R-squared by about 2%. However, none
of the variable transformations we tested substantially increased the R-squared of our
model, so they were not included, in order to avoid the risk of overfitting. If a variable
added less than 5% to R-squared, we decided not to include it in our model.
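The backward elimination described above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure or software the team actually used: the helper names (`fit_ols`, `holdout_mae`, `backward_select`), the stopping rule (stop as soon as dropping any predictor makes the hold-out error worse), and the synthetic data are all assumptions made for the example.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def holdout_mae(cols, train, test):
    """Fit on `train` using only columns `cols`; return mean absolute error on `test`."""
    def pick(data):
        return [[row[c] for c in cols] for row, _ in data]
    beta = fit_ols(pick(train), [t for _, t in train])
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in pick(test)]
    return sum(abs(p - t) for p, (_, t) in zip(preds, test)) / len(test)

def backward_select(n_features, train, test):
    """Drop one predictor at a time while the hold-out error does not get worse."""
    kept = list(range(n_features))
    best = holdout_mae(kept, train, test)
    while len(kept) > 1:
        trials = [(holdout_mae([c for c in kept if c != d], train, test), d) for d in kept]
        err, drop = min(trials)
        if err > best:
            break
        best, kept = err, [c for c in kept if c != drop]
    return kept, best

# Synthetic data (made up): columns 0 and 1 drive y, column 2 is pure noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = [rng.uniform(0, 10) for _ in range(3)]
    data.append((x, 5.0 + 2.0 * x[0] - 3.0 * x[1] + rng.gauss(0, 1)))
train, test = data[:210], data[210:]
kept, best = backward_select(3, train, test)
print(kept, round(best, 2))
```

In practice the hold-out set used for selection would ideally be separate from the final test set, so that the reported test error is not influenced by the selection itself.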
Proposed Model:
Division of Data: 70% of the data is part of our training set and 30% is part of our
test set
Random Selection Method: The observations of our data occur over different
subjects so we will use randomly selected observations as our test set.
Proposed Linear Regression Model: FICO Score = 590.41 + 0.39*Average
Months In File + 12.71*Max Delinquency Last 12 Months + 6.28*Max
Delinquency Ever –0.95*Net Fraction Revolving Burden
Response Variable Interpretation: An increase in FICO score decreases the risk
of the individual/applicant.
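The 70/30 random division described above might look like the sketch below. The function name `train_test_split`, the seed value, and the stand-in `records` list are hypothetical; the report does not say which tool performed the split.

```python
import random

def train_test_split(rows, test_fraction=0.30, seed=267):
    """Shuffle a copy of the rows and cut it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(1000))  # stand-in for applicant records
train, test = train_test_split(records)
print(len(train), len(test))
```

Random selection is appropriate here because each observation is a different person; for data ordered in time, one would normally hold out the most recent observations instead.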
● Multicollinearity: Multicollinearity is when a predictor variable shares a
relationship with another predictor variable to the point where we cannot determine
how much of an impact increasing one variable has on the model. We looked at the
correlation matrix to check for multicollinearity in order to make sure we can
interpret our variables directly and found that some of our variables have a
relationship above .25 or below -.25. There was a correlation of 0.61 between max
delinquency in the last 12 months and max delinquency ever. Although the
correlation is larger than 0.25, we don’t want to remove any of these variables as
they contribute a lot to the model on their own. However, interpretation of these
variables requires careful evaluation. Therefore, multicollinearity is an issue in this
model because not all variables can be interpreted directly.
● Statistical Significance: All of our variables have stars in the output, therefore they
are all statistically significant predictors of FICO scores.
● Intercept Interpretation: When the average months in file is 0, there have been
0 delinquencies in the last 12 months and ever, and the net fraction
revolving burden (outstanding balance/credit) is 0, the FICO Score on average is
590.41. The intercept makes sense to us intuitively, because the FICO score is
positive as it should be. For a more detailed interpretation of the intercept, we will
consult with the data manager.
● Average Months in File: When the average months in file increases by 1 month,
the FICO score increases on average by 0.39 points. This makes sense to us
intuitively, because the longer an applicant's credit file has existed, the less risky
the applicant probably is. However, this coefficient is not economically significant,
because a 0.39 point increase is minuscule. We nevertheless decided not to remove it
from our model because it contributes 5.6% to our R-squared.
● Maximum Delinquency: When the Maximum Delinquency within last 12 Months
increases by 10, the FICO score increases on average by (12.71 * 10) 127.1 points.
This does not make sense to us intuitively, because an individual with a large
maximum delinquency is likely to be risky. The coefficient is economically
significant, because an increase of approximately 130 points has a substantial impact
on an individual's FICO score, which will determine their eligibility to receive a loan.
● Maximum Delinquency Ever: When max delinquency ever increases by 10, the
FICO score increases on average by (6.28 *10) 62.8 points. This does not make
sense to us intuitively, because the larger the max delinquency (failure to pay
outstanding debt) ever is, the riskier the applicant probably is. This coefficient is
economically significant, because a roughly 63 point increase can be a deciding factor when
determining an individual’s eligibility to receive a loan.
● Net Fraction Revolving Burden: When the net fraction revolving burden
increases by 1 unit, meaning your balance in relation to your credit increases, the
FICO score decreases on average by 0.95 points. Intuitively this is sensible because
the larger the net fraction revolving burden (outstanding balance/credit) is, the
riskier the applicant probably is. This coefficient is not economically significant,
because a 0.95 point decrease is close to 0. However, we don’t want to remove it from
our model because it contributes 23.32% to R-squared.
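To make the coefficient interpretations above concrete, the sketch below plugs the reported coefficients into the regression equation. The coefficients come from the proposed model; the dictionary keys, the function names, and the example applicant's values are made up for illustration.

```python
# Coefficients copied from the proposed linear regression model.
INTERCEPT = 590.41
COEFFICIENTS = {
    "avg_months_in_file": 0.39,
    "max_delinq_last_12_months": 12.71,
    "max_delinq_ever": 6.28,
    "net_fraction_revolving_burden": -0.95,
}

def predict_fico(applicant):
    """Fitted FICO score for a dict of predictor values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS)

def effect_of_change(name, delta):
    """Change in the fitted score when one predictor moves by `delta`, others held fixed."""
    return COEFFICIENTS[name] * delta

# Hypothetical applicant (values are illustrative, not taken from the data).
applicant = {
    "avg_months_in_file": 100,
    "max_delinq_last_12_months": 6,
    "max_delinq_ever": 6,
    "net_fraction_revolving_burden": 30,
}
print(round(predict_fico(applicant), 2))                            # 714.85
print(round(effect_of_change("max_delinq_last_12_months", 10), 1))  # 127.1
print(round(effect_of_change("max_delinq_ever", 10), 1))            # 62.8
```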
Tests of Model Assumptions
Assumption 1: Exclude Unnecessary Variables & Good Linear Model
Assumption 1 is a formality and so we will assume that our model is a good linear model.
Assumption 1 passes automatically because of this formality.
Assumption 2: No Perfect Multicollinearity
There is no perfect multicollinearity in our model. Based on the correlation matrix none
of the correlation results were equal to 1 or -1.
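A correlation check of this kind can be sketched in a few lines. The column names and values below are invented toy data, and the `flag_pairs` helper is hypothetical; only the 0.25 threshold and the "no correlation equal to 1 or -1" criterion come from the report.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def flag_pairs(columns, threshold=0.25):
    """List predictor pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, r))
    return flagged

# Toy data, made up for illustration only.
columns = {
    "max_delinq_12m": [1, 2, 3, 4, 5],
    "max_delinq_ever": [2, 3, 3, 5, 6],
    "avg_months_in_file": [80, 60, 95, 70, 85],
}
flagged = flag_pairs(columns)
# Perfect multicollinearity would show up as a correlation of exactly 1 or -1.
perfect = [pair for pair in flagged if abs(abs(pair[2]) - 1.0) < 1e-12]
print(flagged, perfect)
```

Pairwise correlations only detect perfect relationships between two predictors; a predictor that is an exact combination of several others would need a different check.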
Assumption 3: Independent Errors
Based on the Residual Chart shown below, the errors appear independent because there is
no pattern in the graph. Intuitively, one person’s FICO score should not be related to
another person’s FICO score, so this makes sense. However, we should take note of the
small gap between 0 and 2000 on the Index (x-axis). The ACF plot shown below tells a
different story: the first lag crosses the dotted line threshold, so strictly speaking
we fail the assumption of independent errors. The first lag does not cross the line
substantially, however, so we can choose to view it as insignificant. In that case, all
other lags are insignificant as well.
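The ACF check can be sketched as below, assuming the dotted line in the plot is the usual approximate 95% band of plus or minus 1.96 divided by the square root of the sample size. The function names and the alternating toy series are illustrative assumptions, not the team's residuals.

```python
from math import sqrt

def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    num = sum((residuals[t] - mean) * (residuals[t - lag] - mean) for t in range(lag, n))
    return num / denom

def significant_lags(residuals, max_lag=10):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = 1.96 / sqrt(len(residuals))
    return [k for k in range(1, max_lag + 1) if abs(autocorrelation(residuals, k)) > bound]

# Toy residuals that alternate in sign, so neighbouring errors are strongly related.
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
print(round(autocorrelation(alternating, 1), 2))  # -0.99
print(significant_lags(alternating, max_lag=3))   # [1, 2, 3]
```

With a large sample the band is narrow, so a lag can cross the line while its autocorrelation is still small in absolute terms, which is the situation described above.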
Assumption 4: No Heteroskedasticity
Based on the Fitted Vs Residual plot shown below, the variance of the residuals is not
constant across the fitted FICO scores. Therefore, we fail the assumption of no
heteroskedasticity. We would address the violation in one of the following ways:
● Ignoring
● Bootstrapping
● Or controlling the correlation by including the lagged values as predictors in the
model (this remedy targets the correlated errors noted under Assumption 3 rather
than unequal variances)
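Bootstrapping, one of the remedies listed above, can be illustrated with a pairs bootstrap, which resamples whole observations and so does not rely on constant error variance. The sketch is deliberately simplified to a single predictor; the function names, the number of resamples, and the synthetic heteroskedastic data are assumptions for the example and do not reproduce the team's four-variable model.

```python
import random
from statistics import stdev

def slope(xs, ys):
    """Least-squares slope of y on a single predictor x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def bootstrap_slope_se(xs, ys, n_boot=500, seed=0):
    """Standard error of the slope from resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return stdev(estimates)

# Synthetic data (made up): the noise grows with x, i.e. the errors are heteroskedastic.
rng = random.Random(1)
xs = [rng.uniform(0, 300) for _ in range(200)]
ys = [600 + 0.4 * x + rng.gauss(0, 0.1 * x) for x in xs]

estimate = slope(xs, ys)
se = bootstrap_slope_se(xs, ys)
print(round(estimate, 3), round(se, 3))
```

The bootstrap leaves the coefficient estimates unchanged; what it changes is the standard errors, and therefore the significance tests, which are the part affected by unequal variances.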
Assumption 5: Normal Distribution of Errors
According to the histogram of our residuals, the distribution of the errors looks
approximately normal. According to the QQ plot shown below, the points in the middle
fall close to a straight line, but towards the upper and lower ends they deviate
dramatically from a linear path. Therefore, it appears that on average our model is good at
predicting FICO Scores that are within the middle range, but is poor at predicting FICO
Scores at the upper and lower extremes. The average error for the training data set
is 29.4 and the average error for the testing data set is 29.5. This illustrates that our model
is consistent since both numbers are similar. Additionally, this is a good model because our
average error is approximately 30 points for someone’s FICO Score, which is not a
substantial amount when dealing with numbers that are in the hundreds.
Removal of Outliers
We determined that outliers should not be an issue for this model. Since a FICO score has
a defined range, we do not believe that any FICO Score would be unusual to see. Also,
we want our model to be able to predict perfect credit scores as well as very low credit
scores, so these values should be included in our model.
Prediction In & Out of Samples
The mean absolute error is 23.1 for the training set and 23.9 for the test set. This
indicates that our model is very consistent: it performs almost exactly as well on data
it hasn’t seen before as on data that it has seen. A roughly 23 point average error is
very small given the range of a FICO score, which means our model is also accurate.
Because the model is both consistent and accurate, we can assume that we created a
useful model.
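The mean absolute error used in this comparison is simply the average size of the prediction errors, ignoring their sign. A minimal sketch follows, with made-up scores standing in for the real training and test predictions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted scores."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative numbers only, not the project's data.
train_actual, train_pred = [700, 650, 720, 610], [690, 670, 715, 630]
test_actual, test_pred = [680, 740, 600], [695, 725, 620]

train_mae = mean_absolute_error(train_actual, train_pred)
test_mae = mean_absolute_error(test_actual, test_pred)
print(train_mae, round(test_mae, 2))
```

A test error close to the training error suggests the model is not overfitting; a test error much larger than the training error would suggest the opposite.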
Conclusion
Based on the analysis we conducted, we would use the model to predict an individual’s
FICO Score, but would exercise caution since some assumptions were violated. A
manager or financial firm can use this model to assess an individual’s FICO score
and use that information to approve or deny a loan. Lastly, we emphasize that this model
has issues that need to be addressed through methods such as bootstrapping in order to
correct for the violations of the no-heteroskedasticity and independent-errors assumptions.