MISY 267: BUSINESS
ANALYTICS FINAL
PROJECT
FICO Credit Risk Data
MAY 9TH, 2016
UNIVERSITY OF DELAWARE
Daniel Amato, Kyle Zaino, Chirag Dhamecha, Erik Caputo, Joey Czechowicz
MISY267: Business Analytics 1
Goals
We are uncertain which variables influence a person’s FICO Score, so our goal is to
ascertain which demographic variables determine a given person’s FICO score. We will use a
variable selection method to decide which variables to include in or exclude from the
model, and we will fit a linear regression model because we are predicting a continuous
variable rather than a binary one. Our team will test the model assumptions and check for
violations; if any are found, we will suggest how to address them. The model can then
assist a manager or a financial firm that is trying to determine whether an applicant is
too risky to be given a loan.
Methodologies
Data Available & Variables of Interest
Once we eliminated the variables with the terms “Application” and “Trades” in their
names, we were left with the following variables: risk flag (paid as negotiated flag), interest
revenue accumulated, sampling weight, debt to income ratio, number of borrowers,
geographic region, prior or current bank relationship, collateral value, loan amount
requested, loan to value ratio, FICO score, average months in file, months since most recent
delinquency, maximum delinquency and public records in the last twelve months,
maximum delinquency ever, number of inquiries in last 6 months, number of inquiries in
last 6 months excluding 7 days, net fraction revolving burden, and net fraction installment
burden.
Variable Selection: Backward Stepwise Selection
We used the Backward Stepwise variable selection method: we started with a model
containing all available variables and removed one variable at a time, each time removing
the variable that contributed the least to the model. Since we are working with FICO scores
of individual persons, we chose the model based on prediction accuracy rather than
R-squared value; that is, we want the model that yields the smallest out-of-sample errors.
This led us to remove all of the variables from our model except for average months in
file, maximum delinquency within the last 12 months, maximum delinquency ever, and net
fraction revolving burden. Our final model, containing these four variables, has an
R-squared value of approximately 71%. Additionally, we looked for interactions between our
remaining variables, but found none that added value to our model. Lastly, we tested
non-linear relationships by trying transformations of the variables within our model. For
example, adding the square root of the ‘Max Delinquency in 12 months’ variable increased
our R-squared by only about 2%. Because none of the transformations we tested increased
R-squared substantially, we left them out of the model to avoid the risk of overfitting.
Our rule was to exclude any term that added less than 5% to R-squared.
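
The selection procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the project: it assumes NumPy, uses synthetic data in place of the FICO data set, scores each candidate model by mean absolute error on held-out rows, and all function names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def holdout_mae(X_train, y_train, X_hold, y_hold, cols):
    """Fit on the training rows using only `cols`; return MAE on the held-out rows."""
    coef = fit_ols(X_train[:, cols], y_train)
    predictions = coef[0] + X_hold[:, cols] @ coef[1:]
    return float(np.mean(np.abs(y_hold - predictions)))


def backward_select(X_train, y_train, X_hold, y_hold):
    """Drop one predictor at a time for as long as held-out MAE does not get worse."""
    kept = list(range(X_train.shape[1]))
    best = holdout_mae(X_train, y_train, X_hold, y_hold, kept)
    while len(kept) > 1:
        # Try removing each remaining column and keep the removal that hurts least.
        trials = [
            (holdout_mae(X_train, y_train, X_hold, y_hold,
                         [c for c in kept if c != drop]), drop)
            for drop in kept
        ]
        mae, drop = min(trials)
        if mae > best:
            break  # every possible removal makes the held-out error worse
        kept.remove(drop)
        best = mae
    return kept, best


# Synthetic stand-in data: only columns 0 and 1 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 600 + 30 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=10, size=500)

kept, mae = backward_select(X[:350], y[:350], X[350:], y[350:])
print("kept columns:", kept, "held-out MAE:", round(mae, 2))
```

Scoring on held-out rows rather than on R-squared mirrors the stated preference for small out-of-sample errors; in practice a separate validation split would keep the final test set untouched.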
Proposed Model:
● Division of Data: 70% of the data is part of our training set and 30% is part of our
test set
● Random Selection Method: The observations in our data come from different
subjects, so we use randomly selected observations as our test set.
● Proposed Linear Regression Model: FICO Score = 590.41 + 0.39*Average
Months In File + 12.71*Max Delinquency Last 12 Months + 6.28*Max
Delinquency Ever - 0.95*Net Fraction Revolving Burden
● Response Variable Interpretation: A higher FICO score indicates a lower-risk
individual/applicant.
● Multicollinearity: Multicollinearity occurs when a predictor variable is related to
another predictor variable so strongly that we cannot determine how much of an
impact increasing one variable alone has on the response. We checked the
correlation matrix so that we would know which variables can be interpreted
directly, and found that some pairs of variables have a correlation above 0.25 or
below -0.25. In particular, there was a correlation of 0.61 between max delinquency
in the last 12 months and max delinquency ever. Although this exceeds 0.25, we do
not want to remove either variable because each contributes a lot to the model on
its own. Multicollinearity is therefore an issue in this model: not all variables can
be interpreted directly, and the correlated ones require more careful evaluation.
● Statistical Significance: All of our variables are marked with significance stars in
the regression output, so they are all statistically significant predictors of FICO
scores.
● Intercept Interpretation: When the application has been in the file for 0 months,
there have been 0 delinquencies in the last 12 months and ever, and the net fraction
revolving burden (outstanding balance/credit) is 0, the FICO Score is on average
590.41. The intercept makes sense to us intuitively, because the FICO score is
positive as it should be. For a more detailed interpretation of the intercept, we will
consult with the data manager.
● Average Months in File: When the average months in file increases by 1 month,
the FICO score increases on average by 0.39 points. This makes sense to us
intuitively, because the longer an applicant has been in the file, the less risky the
applicant probably is. This coefficient is not economically significant, because a
0.39 point increase is minuscule. Nevertheless, we decided not to remove it from
our model because it contributes 5.6% to our R-squared.
● Maximum Delinquency: When the Maximum Delinquency within the last 12 Months
increases by 10, the FICO score increases on average by (12.71 * 10) 127.1 points.
This does not make sense to us intuitively, because an individual with a large
maximum delinquency is likely to be risky. The coefficient is economically
significant, because an increase of roughly 127 points has a substantial impact on
an individual's FICO score, which will determine their eligibility to receive a loan.
● Maximum Delinquency Ever: When max delinquency ever increases by 10, the
FICO score increases on average by (6.28 * 10) 62.8 points. This does not make
sense to us intuitively, because the larger the max delinquency (failure to pay
outstanding debt) ever is, the riskier the applicant probably is. This coefficient is
economically significant, because a roughly 63 point increase can be a deciding
factor when determining an individual’s eligibility to receive a loan.
● Net Fraction Revolving Burden: When the net fraction revolving burden
increases by 1 unit, meaning your balance in relation to your credit increases, the
FICO score decreases on average by 0.95 points. Intuitively this is sensible because
the larger the net fraction revolving burden (outstanding balance/credit) is, the
more risky the applicant probably is. This coefficient is not economically
significant, because a 0.95 point decrease is close to 0. However, we don’t want to
remove it from our model because it contributes 23.32% to R-squared.
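
The coefficient interpretations above can be checked by evaluating the proposed equation directly. The sketch below copies the coefficients from the proposed model; the applicant's input values and the dictionary key names are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the proposed linear regression model in this report.
INTERCEPT = 590.41
COEFFICIENTS = {
    "avg_months_in_file": 0.39,
    "max_delinquency_last_12_months": 12.71,
    "max_delinquency_ever": 6.28,
    "net_fraction_revolving_burden": -0.95,
}


def predict_fico(applicant):
    """Evaluate the fitted linear equation for one applicant (a dict of predictor values)."""
    return INTERCEPT + sum(
        coefficient * applicant[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical applicant; these input values are illustrative, not taken from the data.
applicant = {
    "avg_months_in_file": 80,
    "max_delinquency_last_12_months": 6,
    "max_delinquency_ever": 6,
    "net_fraction_revolving_burden": 30,
}
baseline = predict_fico(applicant)

# Effect of a 10-unit increase in one predictor, holding the others fixed.
shifted = dict(applicant, max_delinquency_last_12_months=16)
print(round(baseline, 2), round(predict_fico(shifted) - baseline, 2))
```

Because the model is linear, the change in the predicted score depends only on the size of the change in the predictor, not on the applicant's other values.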
Tests of Model Assumptions
Assumption 1: Exclude Unnecessary Variables & Good Linear Model
Assumption 1 is a formality and so we will assume that our model is a good linear model.
Assumption 1 passes automatically because of this formality.
Assumption 2: No Perfect Multicollinearity
There is no perfect multicollinearity in our model. Based on the correlation matrix none
of the correlation results were equal to 1 or -1.
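
A correlation-matrix check of this kind can be sketched as below. It is an illustration under assumptions: it uses NumPy and synthetic predictors rather than the project data, the 0.25 cut-off is the one used in this report, and the function name is hypothetical.

```python
import numpy as np


def flag_correlated_pairs(X, names, threshold=0.25):
    """Return (name_a, name_b, r) for every predictor pair with |r| above the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are the variables
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged


# Synthetic stand-in predictors: "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + 0.5 * rng.normal(size=1000)
c = rng.normal(size=1000)
X = np.column_stack([a, b, c])

flagged = flag_correlated_pairs(X, ["a", "b", "c"])
# Perfect multicollinearity would show up as a correlation of exactly 1 or -1.
perfect = [pair for pair in flagged if np.isclose(abs(pair[2]), 1.0)]
print(flagged, perfect)
```

Pairwise correlations only detect perfect collinearity between two variables; a predictor that is an exact combination of several others would need a different check.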
Assumption 3: Independent Errors
The Residual Chart shown below has no visible pattern, which supports the assumption of
independent errors. Intuitively this makes sense, since one person’s FICO score should not
be related to another person’s FICO score. However, we should take note of the small gap
between 0 and 2000 on the Index (x-axis). The ACF plot shown below is less favorable: the
first lag crosses the dotted-line threshold, which on its face is a failure of the
assumption of independent errors. Because the first lag crosses the line only slightly, we
can choose to view it as insignificant, in which case all other lags are insignificant as
well.
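
The ACF reading above can be made concrete with a small calculation. This sketch assumes the dotted line follows the common convention of plus or minus 1.96 divided by the square root of the sample size; the residuals are a toy series, not the project's residuals, and the function names are hypothetical.

```python
import math


def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denominator = sum(c * c for c in centered)
    numerator = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return numerator / denominator


def significance_band(n):
    """Approximate 95% band for the autocorrelations of independent errors."""
    return 1.96 / math.sqrt(n)


# Deliberately dependent toy residuals: each value flips the sign of the previous one.
residuals = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
lag1 = autocorrelation(residuals, 1)
print(lag1, abs(lag1) > significance_band(len(residuals)))
```

A lag whose autocorrelation sits just outside the band, as described for the first lag here, is weak evidence of dependence compared with the strongly dependent toy series.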
Assumption 4: No Heteroskedasticity
Based on the Fitted Vs Residual plot shown below, the variance of the residuals is not
constant across the fitted FICO scores. Therefore, we fail the assumption of no
heteroskedasticity. We would address the violation in one of the following ways:
● Ignoring
● Bootstrapping
● Or, for the correlated errors noted under Assumption 3, controlling the correlation
by including the lagged values as predictors in the model
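
Of the remedies listed, bootstrapping is the easiest to sketch. The example below is one possible version (a pairs bootstrap that resamples whole rows), assuming NumPy and synthetic heteroskedastic data; it is not the procedure used for the project, and the function names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def bootstrap_standard_errors(X, y, n_boot=500, seed=0):
    """Resample rows with replacement, refit, and return the spread of each coefficient."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1] + 1))
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)  # row indices drawn with replacement
        draws[b] = fit_ols(X[rows], y[rows])
    return draws.std(axis=0, ddof=1)


# Synthetic stand-in data whose noise grows with the predictor (heteroskedastic).
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=400)
y = 600 + 5 * x + rng.normal(scale=x, size=400)
X = x.reshape(-1, 1)

print(bootstrap_standard_errors(X, y))
```

The bootstrap leaves the coefficient estimates unchanged; what it replaces is the standard errors, which are the part of the regression output that unequal error variance makes unreliable.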
Assumption 5: Normal Distribution of Errors
According to the histogram of our residuals, the distribution of the errors looks
approximately normal. In the QQ plot shown below, the points in the middle lie close to a
straight line, but towards the upper and lower ends they deviate dramatically from a linear
path. It therefore appears that our model is good at predicting FICO Scores in the middle
range but poor at predicting FICO Scores at the upper and lower extremes. The average error
for the training data set is 29.4 and the average error for the testing data set is 29.5.
Because the two numbers are similar, our model is consistent. Additionally, this is a good
model because an average error of approximately 30 points is not a substantial amount when
dealing with FICO Scores, which are in the hundreds.
Removal of Outliers
We determined that outliers should not be an issue for this model. Since a FICO score has
a defined range, we do not believe that any FICO Score would be unusual to see. Also,
we want our model to be able to predict perfect credit scores as well as very low credit
scores, so these values should be included in our model.
Prediction In & Out of Samples
The mean absolute error is 23.1 for the training set and 23.9 for the test set. This
indicates that our model is very consistent: it performs almost exactly as well on data it
hasn’t seen before as on data that it has seen. A roughly 23 point average error is very
small given the range of a FICO score, which means our model is a good model. Because our
model has both consistency and goodness, we can assume that we created a useful model.
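
The random 70/30 split and the mean absolute error used in this comparison can be sketched with the standard library alone. The function names, the seed and the example scores are hypothetical; they only show how the quantities are calculated.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=0):
    """Randomly assign row indices to a training set and a test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_rows * train_fraction))
    return indices[:cut], indices[cut:]


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


train_rows, test_rows = split_indices(100)
print(len(train_rows), len(test_rows))

# Hypothetical scores and predictions, only to show the calculation.
print(mean_absolute_error([700, 650, 720], [690, 675, 721]))
```

Computing the same metric on the training rows and on the test rows gives the two numbers being compared; a test error far above the training error would point to overfitting.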
Conclusion
Based on the analysis we conducted, we would use the model to predict an individual’s
FICO Score, but would exercise caution since some assumptions were violated. A manager
or financial firm can use this model to assess an individual’s FICO score and use that
information to approve or deny a loan. Lastly, we emphasize that the violations of the
no-heteroskedasticity and independent-errors assumptions still need to be addressed
through methods such as bootstrapping.

More Related Content

What's hot

Slide unemployment
Slide unemploymentSlide unemployment
Slide unemploymentbloglendu
 
Bangladesh labor low 2015
Bangladesh labor low 2015Bangladesh labor low 2015
Bangladesh labor low 2015Masudul Hasan
 
Equal Remuneration Act,1976
Equal Remuneration Act,1976Equal Remuneration Act,1976
Equal Remuneration Act,1976Shekhar Singh
 
Basic Computer Works for BBA students
Basic Computer Works for BBA studentsBasic Computer Works for BBA students
Basic Computer Works for BBA studentsAnkit Gupta
 
Final powerpoint presentation, prof3 a sphiwe dladla-201221896
Final powerpoint presentation, prof3 a sphiwe dladla-201221896Final powerpoint presentation, prof3 a sphiwe dladla-201221896
Final powerpoint presentation, prof3 a sphiwe dladla-201221896Sphiwe Dladla
 
What is word processing software
What is word processing softwareWhat is word processing software
What is word processing softwareOmar Jacalne
 
Advanced Excel &Basic Excel Training
Advanced Excel &Basic Excel TrainingAdvanced Excel &Basic Excel Training
Advanced Excel &Basic Excel Trainingaarkex
 

What's hot (9)

Slide unemployment
Slide unemploymentSlide unemployment
Slide unemployment
 
Bangladesh labor low 2015
Bangladesh labor low 2015Bangladesh labor low 2015
Bangladesh labor low 2015
 
Equal Remuneration Act,1976
Equal Remuneration Act,1976Equal Remuneration Act,1976
Equal Remuneration Act,1976
 
Mail merge
Mail mergeMail merge
Mail merge
 
Basic Computer Works for BBA students
Basic Computer Works for BBA studentsBasic Computer Works for BBA students
Basic Computer Works for BBA students
 
Final powerpoint presentation, prof3 a sphiwe dladla-201221896
Final powerpoint presentation, prof3 a sphiwe dladla-201221896Final powerpoint presentation, prof3 a sphiwe dladla-201221896
Final powerpoint presentation, prof3 a sphiwe dladla-201221896
 
EPQ essay
EPQ essayEPQ essay
EPQ essay
 
What is word processing software
What is word processing softwareWhat is word processing software
What is word processing software
 
Advanced Excel &Basic Excel Training
Advanced Excel &Basic Excel TrainingAdvanced Excel &Basic Excel Training
Advanced Excel &Basic Excel Training
 

Similar to FICO Credit Risk Data

Market Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.pptMarket Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.pptEdu4Sure
 
Multivariate data analysis regression, cluster and factor analysis on spss
Multivariate data analysis   regression, cluster and factor analysis on spssMultivariate data analysis   regression, cluster and factor analysis on spss
Multivariate data analysis regression, cluster and factor analysis on spssAditya Banerjee
 
Neural Network Model
Neural Network ModelNeural Network Model
Neural Network ModelEric Esajian
 
Data_Analysis_LendingClub_InterestRate
Data_Analysis_LendingClub_InterestRateData_Analysis_LendingClub_InterestRate
Data_Analysis_LendingClub_InterestRateKaren Yang
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationAsadJaved304231
 
GAP Statistical Analysis Report
GAP Statistical Analysis ReportGAP Statistical Analysis Report
GAP Statistical Analysis ReportAlexandra Nolan
 
Reduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage IndustryReduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage IndustryPranov Mishra
 
Case2_Best_Model_Final
Case2_Best_Model_FinalCase2_Best_Model_Final
Case2_Best_Model_FinalEric Esajian
 
statistical measurement project present
statistical measurement project presentstatistical measurement project present
statistical measurement project presentKexinZhang22
 
Applications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationshipApplications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationshipRithish Kumar
 

Similar to FICO Credit Risk Data (20)

FICO Credit Risk Data
FICO Credit Risk DataFICO Credit Risk Data
FICO Credit Risk Data
 
Logistic regression sage
Logistic regression sageLogistic regression sage
Logistic regression sage
 
Econometrics
EconometricsEconometrics
Econometrics
 
Market Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.pptMarket Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.ppt
 
report
reportreport
report
 
Hy2208 final
Hy2208 finalHy2208 final
Hy2208 final
 
Hy2208 Final
Hy2208 FinalHy2208 Final
Hy2208 Final
 
Multivariate data analysis regression, cluster and factor analysis on spss
Multivariate data analysis   regression, cluster and factor analysis on spssMultivariate data analysis   regression, cluster and factor analysis on spss
Multivariate data analysis regression, cluster and factor analysis on spss
 
Neural Network Model
Neural Network ModelNeural Network Model
Neural Network Model
 
Data_Analysis_LendingClub_InterestRate
Data_Analysis_LendingClub_InterestRateData_Analysis_LendingClub_InterestRate
Data_Analysis_LendingClub_InterestRate
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical information
 
GAP Statistical Analysis Report
GAP Statistical Analysis ReportGAP Statistical Analysis Report
GAP Statistical Analysis Report
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Pm 6
Pm 6Pm 6
Pm 6
 
Reduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage IndustryReduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage Industry
 
Case2_Best_Model_Final
Case2_Best_Model_FinalCase2_Best_Model_Final
Case2_Best_Model_Final
 
statistical measurement project present
statistical measurement project presentstatistical measurement project present
statistical measurement project present
 
Qt unit i
Qt unit   iQt unit   i
Qt unit i
 
Applications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationshipApplications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationship
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 

Recently uploaded

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 

Recently uploaded (20)

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 

FICO Credit Risk Data

  • 1. MISY 267: BUSINESS ANALYTICS FINAL PROJECT FICO Credit Risk Data MAY 9TH , 2016 UNIVERSITY OF DELAWARE Daniel Amato, Kyle Zaino, Chirag Dhamecha, Erik Caputo, Joey Czechowicz
  • 2. MISY267: Business Analytics 1 Goals Our problem is that we are uncertain what variables influence a person’s FICO Score and so our goal is to ascertain what demographic variables determine a given person’s FICO score. We will use a variable selection method to determine which variables to include or exclude from the model and we will run a linear regression model because we are predicting a continuous variable as opposed to a binary variable. Our team will test the model assumptions and check for any violations within the model. If any violations are found, we will make suggestions on how to address them. This model will then be able to assist a manager or a financial firm that is attempting to determine if an applicant is too risky to be given a loan. Methodologies Data Available & Variables of Interest Once we eliminated the variables with the terms “Application” and “Trades” in their names, we were left with the following variables: risk flag (paid as negotiated flag), interest revenue accumulated, sampling weight, debt to income ratio, number of borrowers, geographic region, prior or current bank relationship, collateral value, loan amount requested, loan to value ratio, FICO score, average months in file, months since most recent delinquency, maximum delinquency and public records in the last twelve months, maximums delinquency ever, number of inquiries in last 6 months, number of inquiries in last 6 months excluding 7 days, net fraction revolving burden, and net fraction installment burden. Variable Selection: Backward Stepwise Selection We used the Backward Stepwise variable selection method where we started running the model with all variables available, and removed one variable at a time, choosing to remove the variables that contributed the least to the model. Since we are working with FICO scores of individual persons, we will choose a model based on the best prediction accuracy and not R-squared value. 
Therefore, we want a model that yields the smallest errors out-of- sample. This led us to remove all of the variables from our model except for average months in file, maximum delinquency with the last 12 months, maximum delinquency ever, and net fraction revolving burden. Our final model has an R-squared value of approximately 71%, and the aforementioned four variables. Additionally, we looked for interactions between our remaining variables, but found that there were not any interactions that added value to our model. Lastly, we tested non-linear relationships, where we tested transformations of variables within our model. We added the square root of the ‘Max Delinquency in 12 months’ variable into our model, which increased our R2 by about 2%. However, none of the variable transformations we tested substantially increased the R- squared of our model and so they were not included in the model to avoid the risk of overfitting. If a variable added an R-squared value of less than 5% we decided not to include it in our model.
  • 3. MISY267: Business Analytics 2 Proposed Model:  Division of Data: 70% of the data is part our training set and 30% is part of our test set  Random Selection Method: The observations of our data occur over different subjects so we will use randomly selected observations as our test set.  Proposed Linear Regression Model: FICO Score = 590.41 + 0.39*Average Months In File + 12.71*Max Delinquency Last 12 Months + 6.28*Max Delinquency Ever –0.95*Net Fraction Revolving Burden  Response Variable Interpretation: An increase in FICO score decreases the risk of the individual/applicant. ● Multicollinearity: Multicollinearity is when a predictor variable shares a relationship with another predictor variable to the point where we cannot determine how much of an impact increasing one variable has on the model. We looked at the correlation matrix to check for multicollinearity in order to make sure we can interpret our variables directly and found that some of our variables have a relationship above .25 or below -.25. There was a correlation of 0.61 between max delinquency in the last 12 months and max delinquency ever. Although the correlation is larger than 0.25, we don’t want to remove any of these variables as they contribute a lot to the model on their own. However, interpretation of these variables requires careful evaluation. Therefore, multicollinearity is an issue in this model because all variables cannot be interpreted directly and some require more careful evaluation. ● Statistical Significance: All of our variables have stars in the output, therefore they are all statistically significant predictors of FICO scores. ● Intercept Interpretation: When the application has been in the file for 0 months, there has been 0 delinquencies in the last 12 months and ever and the net fraction revolving burden (outstanding balance/credit) is 0, the FICO Score on average is 590.41. 
The intercept makes sense to us intuitively, because the FICO score is positive as it should be. For a more detailed interpretation of the intercept, we will consult with the data manager. ● Average Months in File: When the average months in file increases by 1 month, the FICO score increases on average increases by 0.39 points. This makes sense to us intuitively, because the longer a file is being investigated, the riskier the applicant probably is. However, this coefficient is not economically significant, because a 0.39 point increase is miniscule. However, we decided not to remove it from our model because it contributes 5.6% to our R2 . ● Maximum Delinquency: When the Maximum Delinquency within last 12 Months is increased by 10, our FICO score is increased by approximately 127.1. This does not make sense because if the individual has a large maximum delinquency it is likely that the individual is risky. Since the FICO score increases by approximately 130 points it is economically significant because this has a substantial impact on an individual's FICO score, which will determine their eligibility to receive a loan.
● Maximum Delinquency Ever: When max delinquency ever increases by 10, the FICO score increases on average by (6.28*10) 62.8 points. This does not make sense to us intuitively, because the larger the max delinquency ever (failure to pay outstanding debt) is, the riskier the applicant probably is. This coefficient is economically significant, because a roughly 63 point increase can be a deciding factor when determining an individual’s eligibility to receive a loan.
● Net Fraction Revolving Burden: When the net fraction revolving burden increases by 1 unit, meaning the balance in relation to the available credit increases, the FICO score decreases on average by 0.95 points. Intuitively this is sensible, because the larger the net fraction revolving burden (outstanding balance/credit) is, the riskier the applicant probably is. This coefficient is not economically significant, because a 0.95 point decrease is close to 0. However, we do not want to remove it from our model because it contributes 23.32% to R².
Tests of Model Assumptions
Assumption 1: Exclude Unnecessary Variables & Good Linear Model
Assumption 1 is a formality, so we will assume that our model is a good linear model. Assumption 1 passes automatically because of this formality.
Assumption 2: No Perfect Multicollinearity
There is no perfect multicollinearity in our model. Based on the correlation matrix, none of the correlations were equal to 1 or -1.
Assumption 3: Independent Errors
Based on the Residual Chart shown below, we pass the assumption of independent errors because there is no pattern in the graph. Intuitively, one person’s FICO score should not be related to another person’s FICO score, so this makes sense. However, we should take note of the small gap between 0 and 2000 on the Index (x-axis).
The ACF plot shown below, however, points the other way: the first lag crosses the dotted threshold line, which taken strictly means we fail the assumption of independent errors. Because the first lag crosses the line only slightly, we can choose to treat it as insignificant. No other lag crosses the threshold, so under that reading all lags are insignificant.
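The ACF check described above can be sketched in a few lines. This is only an illustration, assuming the dotted line on the plot is the conventional approximate 95% band of ±1.96/√n; the residuals generated here are hypothetical random noise, not the project's data, and the function names are made up for this example.

```python
import math
import random


def autocorrelation(residuals, lag):
    """Sample autocorrelation of a residual series at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    numer = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean)
        for t in range(lag, n)
    )
    return numer / denom


def significance_band(n, z=1.96):
    """Approximate 95% band for the ACF of independent errors (assumed)."""
    return z / math.sqrt(n)


def flagged_lags(residuals, max_lag=10):
    """Lags whose autocorrelation falls outside the band."""
    band = significance_band(len(residuals))
    return [
        lag for lag in range(1, max_lag + 1)
        if abs(autocorrelation(residuals, lag)) > band
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical residuals: independent noise, NOT the project's residuals.
    residuals = [rng.gauss(0, 30) for _ in range(2000)]
    print("band:", round(significance_band(len(residuals)), 4))
    print("lag 1:", round(autocorrelation(residuals, 1), 4))
    print("flagged lags:", flagged_lags(residuals))
```

Comparing the lag-1 value with the band shows how far past the threshold it is, which is the judgment the paragraph above makes by eye.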
Assumption 4: No Heteroskedasticity
Based on the Fitted vs. Residual plot shown below, the variance of the residuals is not constant across fitted FICO scores. Therefore, we fail the assumption of no heteroskedasticity. Since we fail this assumption, we would address the violation in one of the following ways:
● Ignoring
● Bootstrapping
● Controlling the correlation by including the lagged values as predictors in the model
Assumption 5: Normal Distribution of Errors
According to the histogram of our residuals, the distribution of the errors looks approximately normal. According to the QQ plot shown below, the points in the middle follow a straight line, but towards the upper and lower ends they deviate dramatically from a linear path. Therefore, it appears that our model is good at predicting FICO Scores in the middle range, but poor at predicting FICO Scores at the upper and lower extremes. The average error for the training data set is 29.4 and the average error for the testing data set is 29.5. This illustrates that our model is consistent, since both numbers are similar. Additionally, this is a good model because our average error is approximately 30 points on someone’s FICO Score, which is not a substantial amount when dealing with numbers that are in the hundreds.
Removal of Outliers
We determined that outliers should not be an issue for this model. Since a FICO score has a defined range, we do not believe that any FICO Score would be unusual to see. Also, we want our model to be able to predict perfect credit scores as well as very low credit scores, so these values should be included in our model.
Prediction In & Out of Samples
The mean absolute error is 23.1 for the training set and 23.9 for the test set. This indicates that our model is very consistent: it performs almost exactly as well on data it has not seen before as on data it has seen. A roughly 23 point average error is very small given the range of a FICO score, which means our model is a good model. Therefore, our model has both consistency and goodness, so we can assume that we created a useful model.
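The in-sample versus out-of-sample comparison above can be illustrated with a small sketch. The coefficients are the ones reported in the proposed model; everything else is an assumption made for this example: the dictionary keys, the helper names, and the synthetic applicants, which are randomly generated and are not the FICO data set. The sketch only evaluates the fixed coefficients on a random 70/30 split; it does not refit the model on the training set.

```python
import random

# Coefficients as reported in the proposed model; key names are hypothetical.
INTERCEPT = 590.41
COEFFICIENTS = {
    "avg_months_in_file": 0.39,
    "max_delq_last_12_months": 12.71,
    "max_delq_ever": 6.28,
    "net_fraction_revolving_burden": -0.95,
}


def predict_fico(applicant):
    """Predicted FICO score from the reported linear model."""
    return INTERCEPT + sum(
        coef * applicant[name] for name, coef in COEFFICIENTS.items()
    )


def mean_absolute_error(rows, target="fico"):
    """Average absolute gap between observed and predicted scores."""
    return sum(abs(row[target] - predict_fico(row)) for row in rows) / len(rows)


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Random split into training and test rows (70/30 by default)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    rng = random.Random(1)
    rows = []
    for _ in range(1000):
        # Synthetic applicant: predictor ranges and noise level are assumed.
        row = {
            "avg_months_in_file": rng.uniform(10, 200),
            "max_delq_last_12_months": rng.randint(0, 9),
            "max_delq_ever": rng.randint(0, 9),
            "net_fraction_revolving_burden": rng.uniform(0, 100),
        }
        row["fico"] = predict_fico(row) + rng.gauss(0, 30)
        rows.append(row)
    train, test = train_test_split(rows)
    print("train MAE:", round(mean_absolute_error(train), 1))
    print("test MAE:", round(mean_absolute_error(test), 1))
```

Similar training and test errors are what the text calls consistency; a test error much larger than the training error would instead suggest overfitting.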
Conclusion
Based on the analysis we conducted, we would use the model to predict an individual’s FICO Score, but would exercise caution since some assumptions were violated. A manager or financial firm can use this model to assess an individual’s FICO score and use that information to approve or deny a loan. Lastly, we emphasize that this model has issues that need to be addressed, through methods such as bootstrapping, in order to correct the violations of the no-heteroskedasticity and independent-errors assumptions.
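As a rough illustration of the bootstrapping remedy mentioned above, the sketch below resamples observations with replacement and refits a regression each time to obtain a percentile interval for a slope. It is deliberately simplified to a single predictor, and the data are synthetic with noise that grows with the predictor; the function names and all numbers are assumptions made for this example, not the project's procedure or results.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_interval(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval for the slope from resampled (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_resamples):
        # Resample whole observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # slope is undefined when every x is identical
        slopes.append(ols_slope(bx, by))
    slopes.sort()
    lower = slopes[int((alpha / 2) * len(slopes))]
    upper = slopes[min(len(slopes) - 1, int((1 - alpha / 2) * len(slopes)))]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(2)
    # Synthetic heteroskedastic data: the noise spread grows with x.
    xs = [rng.uniform(0, 100) for _ in range(500)]
    ys = [700 - 0.95 * x + rng.gauss(0, 5 + 0.3 * x) for x in xs]
    print("slope:", round(ols_slope(xs, ys), 3))
    print("95% interval:", bootstrap_slope_interval(xs, ys))
```

Resampling whole observations does not rely on the errors having equal variance, which is why it is a common response to heteroskedasticity. It still treats observations as independent, so it would not by itself address correlated errors.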