Loan predicting web service
1. 1
© Copyright 2011 EMC Corporation. All rights reserved. EMC Restricted Confidential
Israel Chavez
Ngadhnjim Halilaj
Anusha Kodali
Marcos Quezada
Jyoti Shrestha
Sarat Tadi
April 28, 2016
EMC Education Services
Data Science & Big Data Analytics
2. 2
Project Goals
• Create a model that will allow FPC to provide a loan predicting service to
its customers.
• Identify the necessary attributes that will enable the model to give a better
prediction.
• Test the Marketing Department threshold suggestions.
• Advise FPC on the suggestions it could offer its customers to increase
their chances of getting a loan.
3. 3
Situation
• FPC wants to expand the set of services offered to its customers by creating
an online site for loan advice.
• Provide a fast and reliable planning platform for customers to manage their
personal finances.
• Attract potential customers who want to know their eligibility for loans, thus
increasing FPC business.
4. 4
Executive Summary
Logistic Regression and Decision Tree are both somewhat effective at
predicting the loan outcome
• Logistic Regression
– Precision: 0.786
– Recall: 0.984
• Decision Tree
– Precision: 0.784
– Recall: 0.984
5. 5
Approach - Discovery
• Used the 2010 housing loan database assembled under the Home Mortgage Disclosure Act (HMDA).
• Filtered data based on:
– Owner-occupied
– 1-4 Family
– Action Type (loan originated, application approved but not accepted,
application denied, application withdrawn)
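The three filters above can be sketched as a simple predicate over loan records. This is a minimal illustration only: the field names (`occupancy`, `property_type`, `action`) and the string labels are hypothetical stand-ins, not the coded columns of the actual HMDA file.

```python
# Hypothetical field names and labels, chosen for readability.
# The real HMDA file uses its own coded columns.
KEPT_ACTIONS = {
    "originated",
    "approved_not_accepted",
    "denied",
    "withdrawn",
}


def keep_record(rec):
    """Return True when a record passes the three filters listed on the slide."""
    return (
        rec["occupancy"] == "owner_occupied"
        and rec["property_type"] == "1-4_family"
        and rec["action"] in KEPT_ACTIONS
    )


# Made-up sample records, one per filter outcome.
records = [
    {"occupancy": "owner_occupied", "property_type": "1-4_family", "action": "originated"},
    {"occupancy": "not_owner_occupied", "property_type": "1-4_family", "action": "denied"},
    {"occupancy": "owner_occupied", "property_type": "multifamily", "action": "originated"},
    {"occupancy": "owner_occupied", "property_type": "1-4_family", "action": "other"},
    {"occupancy": "owner_occupied", "property_type": "1-4_family", "action": "denied"},
]

filtered = [r for r in records if keep_record(r)]
print(len(filtered), "of", len(records), "records kept")
```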
6. 6
Data Preparation
• Data Conditioning:
– Variables were converted to factors, incomplete records were removed, and the data set was created.
– Releveled variables to set the reference level for a possible logistic regression.
– Tested correlation between numeric variables with a correlation matrix.
– Reduced the data set to “Originated” and “Denied” loans.
• Data Visualization:
– Reviewed the data to check distribution and noise.
– Two sources of noise:
▪ Home Improvement Loans
▪ Loan amounts > $400K
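As a rough sketch of the conditioning steps, the snippet below drops incomplete records, keeps only “Originated” and “Denied” loans, and computes a Pearson correlation matrix over the numeric variables. The column names and sample values are invented for illustration, and using Pearson correlation is an assumption; the slides do not say which coefficient was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up rows; None marks a missing value.
rows = [
    {"income": 50.0, "loan_amount": 150.0, "action": "Originated"},
    {"income": 80.0, "loan_amount": 260.0, "action": "Denied"},
    {"income": None, "loan_amount": 100.0, "action": "Originated"},
    {"income": 65.0, "loan_amount": 195.0, "action": "Withdrawn"},
    {"income": 30.0, "loan_amount": 90.0, "action": "Originated"},
]

# Step 1: remove incomplete records.
complete = [r for r in rows if all(v is not None for v in r.values())]

# Step 2: reduce to the two outcomes used for modeling.
modeling = [r for r in complete if r["action"] in {"Originated", "Denied"}]

# Step 3: correlation matrix over the numeric columns.
numeric = ["income", "loan_amount"]
corr = {
    (a, b): pearson([r[a] for r in modeling], [r[b] for r in modeling])
    for a in numeric
    for b in numeric
}
```

Highly correlated pairs in such a matrix are candidates for dropping one of the two variables before fitting a regression.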
7. 7
Approach - Model Planning
• Model Selection:
– Two methods:
▪ Logistic Regression
▪ Classification Tree
• Regression:
– The 0.5 and 0.75 thresholds suggested by the Marketing Department were
used.
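One way to apply the two Marketing Department thresholds is to sort each predicted approval probability into a low, medium, or high band. The sketch below assumes the model outputs a probability between 0 and 1; the function name and band labels are illustrative.

```python
LOW_CUTOFF = 0.5    # below this: low likelihood of approval
HIGH_CUTOFF = 0.75  # at or above this: high likelihood of approval


def bin_probability(p):
    """Map a predicted approval probability to 'low', 'medium', or 'high'."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p >= HIGH_CUTOFF:
        return "high"
    if p >= LOW_CUTOFF:
        return "medium"
    return "low"


for p in (0.30, 0.50, 0.74, 0.75, 0.95):
    print(p, bin_probability(p))
```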
8. 8
Approach - Model Planning
• Variable Selection:
– Created a Small Set for testing purposes, with three possibilities:
▪ Absence of personal data
▪ Absence of county data
▪ Absence of personal and county data
• Developed two Full models:
– Model 1: included everything that the example script suggested.
– Model 2: included only the variables that we chose to build the model
with.
• Pseudo-R² was used to check how well each model fits the data (an analogue of explained variance).
• ROC & AUC were used to check the performance of our model.
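The slides do not say which pseudo-R² was used. As an assumption, the sketch below shows McFadden's version, which compares the log-likelihood of the fitted model with that of a null model that always predicts the overall approval rate. The labels and probabilities are made up.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 labels under predicted probabilities."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def mcfadden_r2(y_true, p_pred):
    """McFadden pseudo-R2: 1 - LL(model) / LL(null model)."""
    base_rate = sum(y_true) / len(y_true)
    if base_rate in (0.0, 1.0):
        raise ValueError("both classes must be present")
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null


# Made-up labels (1 = approved) and predicted probabilities.
y = [1, 1, 0, 1, 0, 1]
p = [0.9, 0.8, 0.2, 0.7, 0.3, 0.9]
r2 = mcfadden_r2(y, p)
```

A value near 0 means the model does no better than the base rate; larger values indicate a better fit.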
9. 9
Approach - Model Building
• Created a Holdout set with 25% of the data to test the models.
• Logistic Regression:
– Categorized the holdout data in three bins:
▪ Low threshold (< 50%),
▪ Medium threshold (50% to < 75%),
▪ High threshold (>= 75%).
• To further test the Regression model, we experimented with a binary
classification: Loan Rejected / Loan Approved
– First prediction: threshold 0.5.
– Second prediction: threshold 0.7.
• Decision Tree:
– Used binary classification: Loan Rejected / Loan Approved
• A confusion matrix was developed to compare both methods.
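The holdout split and the thresholded binary classification can be sketched as follows. The shuffling approach, the seed, and the helper names are assumptions made for illustration; the slides do not describe how the 25% was drawn.

```python
import random


def split_holdout(rows, holdout_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and return (training, holdout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def classify(probabilities, threshold):
    """True means 'Loan Approved', False means 'Loan Rejected'."""
    return [p >= threshold for p in probabilities]


# Row identifiers stand in for real loan records here.
train, holdout = split_holdout(range(1000))
print(len(train), len(holdout))

# The same predicted probabilities classified at the two thresholds.
probs = [0.40, 0.50, 0.72]
print(classify(probs, 0.5))
print(classify(probs, 0.7))
```

Raising the threshold can only move predictions from approved to rejected, so it trades recall for precision.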
10. 10
Approach - Model Results and Accuracy
• The model developed with Logistic Regression with threshold 0.5 has
predictive power at least as good as the Decision Tree model

Logistic Regression (Threshold = 0.5)
                Predicted FALSE   Predicted TRUE
Actual FALSE              2,452           23,657
Actual TRUE               1,385           87,383

Decision Tree Model
                Predicted FALSE   Predicted TRUE
Actual FALSE              2,082           24,027
Actual TRUE               1,349           87,419

             Logistic Regression model   Decision Tree model
Accuracy                         0.780                 0.779
Precision                        0.786                 0.784
Recall                           0.984                 0.984
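For reference, the three metrics follow directly from the confusion matrix counts. The sketch below recomputes them from the counts reported above; the recomputed values agree with the reported figures to within roughly 0.002, with the small differences presumably due to rounding or truncation.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted approvals that were approved
        "recall": tp / (tp + fn),        # share of actual approvals that were predicted
    }


# Counts taken from the two confusion matrices on this slide.
logistic = metrics(tn=2452, fp=23657, fn=1385, tp=87383)
tree = metrics(tn=2082, fp=24027, fn=1349, tp=87419)

for name, m in (("Logistic Regression", logistic), ("Decision Tree", tree)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both models predict TRUE for the large majority of applications, which explains the very high recall alongside a precision close to the share of approved loans in the holdout set.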
11. 11
Logistic Regression Prediction
12. 12
Decision Tree Visualization
A Decision Tree model is a useful benchmark for the predictive power of a
Logistic Regression model
13. 13
Model Description
• Overview of Basic Methodology: Predict the likelihood of a person getting a loan
from FPC.
• Model: Logistic Regression and Decision Tree.
• Dependent variable: “Approved”, indicating whether the loan application was approved.
• Scope:
– 662,997 total observations for year 2010 extracted from the housing loan
database that was assembled by federal agencies pursuant to the Home
Mortgage Disclosure Act (HMDA).
• After thorough cleaning, 550,336 observations remained for
modeling.
• Sampling:
– Small set: 10% of the data.
– Holdout set: 25% of the data.
14. 14
Data distribution visualization
Visualizing each variable's distribution, and how close it is to normal, helps
show how good a predictor it is
15. 15
Data distribution visualization
Removing unwanted noise from the data increases the predictive
power of the model
16. 16
ROC/AUC
The ROC curves of the reduced models lie just inside the full model curve;
essentially they are the same model.
Full model: AUC 0.70
Personal data removed: AUC 0.69
Personal data and county removed: AUC 0.68
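As a reminder of what the AUC figures measure, the sketch below computes AUC as the probability that a randomly chosen approved loan receives a higher score than a randomly chosen denied loan, counting ties as half. This pairwise form is equivalent to the area under the ROC curve; the labels and scores are made up and are not the project's data.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = approved) and model scores.
y = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.2]
print(round(auc(y, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic, so a rank-based implementation would be preferable on a data set of this size.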
17. 17
Recommendations
• The data available for analysis is somewhat effective for prediction.
• Logistic Regression and Classification Tree yield similar results.
• Logistic Regression should be used, considering the web app response time
requirement.
• The model provides an estimate, not an assurance, that a specific customer
will or will not get a loan.
• Sensitive personal information has little effect on the model's performance.
• County information has little effect on the model's performance.
• High income increases the chances of getting a loan.
• A higher % of minority population in the customer's tract reduces the chances of getting
a loan (we do not recommend showing this finding on the website).
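To illustrate why scoring with a logistic regression is cheap enough for a web service, the sketch below computes an approval probability as a weighted sum passed through the logistic function. The intercept, coefficients, and feature names are hypothetical values chosen for illustration; they are not the fitted values from this project.

```python
import math

# Hypothetical coefficients for illustration only,
# not the fitted values from the project.
INTERCEPT = -0.5
COEFFICIENTS = {
    "income_thousands": 0.02,         # positive: higher income raises the estimate
    "loan_amount_thousands": -0.004,  # negative: larger loans lower the estimate
}


def approval_probability(features):
    """Estimated approval probability: logistic function of a weighted sum."""
    z = INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-z))


p = approval_probability({"income_thousands": 90, "loan_amount_thousands": 200})
print(round(p, 3))
```

The result is an estimate of likelihood, not an assurance, so a site built on it would need to present it that way.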