1. Executive Summary
1.1 Problem Description
This report presents an analysis of data on the rate spread of mortgage applications, adapted from
the Federal Financial Institutions Examination Council (FFIEC). Each row represents an HMDA-reported
loan application. The data set contains 21 variables, each capturing a specific characteristic
of the loan application that helps predict the rate spread. The data set covers a single
year and contains 200,000 rows, each representing a loan application.
1.2 Analysis
Data exploration for this data set included calculating summary and descriptive statistics, creating
visualizations, and computing correlations between the 21 features in order to identify predictors
of the label, rate spread. After data exploration and data preparation, a predictive regression
model was used to predict the rate spread for the mortgage applicants from the features that were
created.
1.3 Key Findings
After data exploration, the following conclusions were made. The data set includes 21 features, of
which the features most helpful in predicting the rate spread for the mortgage applicants are listed
at the end of this section.
• This is a regression problem because the label being predicted, Loan Rate Spread, is a numerical value.
• There are 13 categorical features and 8 numerical features.
• Missing values, null values, and certain redundant features were dealt with to obtain a better coefficient of determination (COD).
• The best methods found for a regression problem in which a large portion of the data set consists of categorical variables were Decision Forest Regression and Boosted Decision Tree Regression.
• Both models were run, and cross-validation was used to check whether the models generalize well.
• The final Boosted Decision Tree model yielded a COD of 0.7691, which rounded to 0.77 for the competition.
• Boosted Decision Tree Regression was the algorithm of choice, as it yielded the best coefficient of determination.
• Feature engineering was done through two methods:
– Filter-based feature selection
– Permutation feature importance
• The most important features were found using permutation feature importance; in this case, 13 features.
• Hyperparameter tuning was used to get the most out of the parameters, which yielded a higher COD.
• The 13 features are listed below:
• msa_md - metropolitan statistical area/division
• state_code - code for the US state
• lender - lender that approved/denied the loan
• loan_amount - size of the requested loan in dollars
• loan_type - type of loan: Conventional, FHA-insured, VA-guaranteed, FSA/RHS
• property_type - type of property for the loan: One to four-family, Manufactured housing, Multifamily
• loan_purpose - purpose of the requested loan: Home purchase, Home improvement, Refinancing
• occupancy - applicant's primary dwelling status for the loan request: Owner-occupied as a principal dwelling, Not owner-occupied, Not applicable
• preapproval - status of pre-approval for the loan: Requested, Not requested, Not applicable
• applicant_income - applicant's income
• applicant_ethnicity - ethnicity of the applicant: Hispanic or Latino, Not Hispanic or Latino, Information not provided, Not applicable, No co-applicant
• applicant_race - race of the applicant: American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, Information not provided, Not applicable, No co-applicant
• minority_population_pct - minority percentage of the population
• ffiecmedian_family_income - median family income in the population tract
2. Data Exploration
The first step in data exploration is calculating summary and descriptive statistics:
2.1 Numerical Feature Statistics
Summary statistics for the minimum, maximum, mean, median, standard deviation, and distinct count
were calculated for the numeric columns, and the results, taken from 216 observations, are shown here:
Figure 2.1: Feature Statistics
Since Rate Spread is of interest in this analysis, it was noted that the mean and median of this value
are not far apart, and the standard deviation indicates that there isn't considerable variance in
the rate spread across applicants. A histogram of the Rate Spread column shows that the rate
spread values are right-skewed; in other words, most applicants have a rate spread at the lower end
of the loan rate spread range.
Figure 2.2: Rate Spread
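The right-skew noted above can be verified numerically. The sketch below is illustrative only: the small hypothetical sample stands in for the real Rate Spread column.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical right-skewed sample standing in for the Rate Spread column
df = pd.DataFrame({'rate_spread': [1, 1, 1, 1.5, 1.5, 2, 2, 3, 5, 9]})

# A mean pulled above the median and a positive skew statistic both
# indicate a right-skewed distribution
print('mean  :', df['rate_spread'].mean())
print('median:', df['rate_spread'].median())
print('skew  :', df['rate_spread'].skew())

# Histogram of the same column, as in Figure 2.2
df['rate_spread'].plot.hist(bins=5, title='Rate Spread')
plt.xlabel('rate spread')
```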
2.2 Categorical Features
• loan_type: Conventional, FHA-insured, VA-guaranteed, FSA/RHS
• property_type: one to four-family, manufactured housing, multifamily
• loan_purpose: home purchase, home improvement, refinancing
• occupancy: principal dwelling, not owner-occupied, not applicable
• preapproval: requested, not requested, not applicable
• msa_md: metropolitan statistical area/division
• state_code: code for the US state
• county_code: indicates the county in the US
• applicant_ethnicity: Hispanic/Latino, not Hispanic/Latino, information not given, not applicable, no co-applicant
• applicant_race: American Indian or Alaska Native, Asian, Black/African American, Pacific Islander, White, information not given, not applicable, no co-applicant
• applicant_sex: male, female, no information, not applicable
• lender: lender that approved/denied the loan
Bar charts were created to show the frequency of these features, and they indicate the following insights:
• Pre-approval: More loans had a pre-approval request than not.
• Loan types: Conventional loans were the most frequent, followed by FHA-insured loans.
• Occupancy: Most applicants applied for property that is their principal dwelling.
• Property types: Most of the loans are for property that is a one to four-family dwelling.
• Loan purpose: The highest number of loans is for home purchase, followed by home improvement.
• Co-applicant: Most applications don't have a co-applicant.
• Ethnicity: The highest number of applicants is Hispanic/Latino.
• Race: The highest number of applicants are American Indian or Alaska Native.
• Sex: The highest number of applicants are male, followed by female.
• States: No information can be derived due to the large number of categories.
• County: No information can be derived due to the large number of categories.
• Lenders: No information can be derived due to the large number of categories.
• Metropolitan Statistical Area: No information can be derived due to the large number of categories.
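Frequency bar charts of this kind take a few lines of pandas. The sample below uses hypothetical loan_type codes (1 = Conventional, 2 = FHA-insured, 3 = VA-guaranteed, 4 = FSA/RHS) purely for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical loan_type sample; the real column has 200,000 rows
loan_type = pd.Series([1, 1, 1, 2, 2, 3, 1, 2, 4, 1])

# value_counts sorts by frequency, so the first entry is the modal class
counts = loan_type.value_counts()
print(counts)

counts.plot.bar(title='loan_type frequency')
plt.xlabel('loan type code')
plt.ylabel('count')
```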
(a) County in US (b) Metropolitan Statistical Area
(c) States in US (d) Lender
3. Correlation and Apparent Relationships
After data exploration and individual feature analysis, establishing correlations and relationships
between the features and the label Rate Spread is crucial for finding key insights.
3.1 Numerical Feature Relationships
(a) Histograms of the numeric features
In order to find correlations and relationships between the label and the features, a correlation table
and scatter-plot matrix were initially generated to compare the numeric features with one another
and with the label. The key features are shown below:
(a) Numeric Scatter-plot matrix with log Rate Spread
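A scatter-plot matrix of this kind can be sketched as follows; the features and label here are synthetic stand-ins, not the HMDA data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for two numeric features and the label
df = pd.DataFrame({'loan_amount': rng.uniform(50, 500, 200),
                   'income': rng.uniform(20, 200, 200)})
df['rate_spread'] = 5 - 0.005 * df['loan_amount'] + rng.normal(0, 0.2, 200)

# Pairwise correlations with the label, as in the correlation table
print(df.corr()['rate_spread'])

# Scatter-plot matrix comparing the features with one another and the label
pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal='hist')
```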
3.2 Numerical Feature Correlations in Logarithmic Scale
Most of the features don't show much apparent relationship or linearity with the label. In an
attempt to improve the fit of the features to Rate Spread, the logarithm of the rate spread was
calculated. A correlation table of all the features against the label may give more insight on this
matter.
(a) Numeric Scatter-plot matrix with log Rate Spread
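Taking the log of the label to reduce the right skew can be sketched like this, on synthetic values (np.log is safe here because every value is at least 1):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed rate spread values
rate_spread = pd.Series([1, 1, 1.5, 1.5, 2, 2, 3, 5, 9, 12])

# The log compresses the long right tail, lowering the skew statistic
log_spread = np.log(rate_spread)
print('skew before:', rate_spread.skew())
print('skew after :', log_spread.skew())
```

This compression of the right tail is why the log of rate spread was used for the scatter-plot matrix and box plots.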
The correlation between the numeric columns was calculated and tabulated as follows. These
correlations validate the plots by showing the negative correlation of loan amount, income,
population, minority population, and median family income with Rate Spread.
Figure 3.5: Correlation Heat Map and Table of Numeric Features
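A heat map like Figure 3.5 can be produced with seaborn; the frame below is random data used only to show the call pattern.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Random stand-in columns named after features from the report
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=['loan_amount', 'income', 'rate_spread'])

# Correlation matrix rendered as an annotated heat map
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heat map of numeric features')
```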
3.3 Categorical Feature Relationships
After exploring the relationships between the numeric features, an attempt was made to discern any
relationships between the categorical features and rate spread. The following box plots show the
categorical columns that seem to exhibit a relationship with the log of rate spread.
Figure 3.6: Box plots for Loan purpose
Figure 3.7: Box plots for Loan type
Figure 3.8: Box plots for Property Type
Figure 3.9: Box plots for Pre-approval
Figure 3.10: Box plots for Occupancy type
Figure 3.11: Box plots for Co-Applicant
The box plots show clearer differences in the median and range of rate spread across the
categorical features, as well as in the distribution and outliers. For example:
• Loan type: The highest spread in loan rate was observed for the conventional loan type, with FHA-insured following and a large number of outliers.
• Loan purpose: Home purchasing and refinancing had lower loan rates and rate spreads than home improvement, which ranged from 1-7.5.
• Property type: Manufactured housing had the highest spread in loan rate, followed by multifamily dwellings, with one to four-family being the lowest in rate spread and rates offered.
• Pre-approval: Requested or not requested had a lower rate spread than Not applicable.
• Occupancy: Property occupied as a principal dwelling had the lowest rate spread compared to the others.
• Co-applicant: The rate spread is the same regardless of the presence/absence of a co-applicant.
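Box plots of this kind take one seaborn call per categorical column. The data below is a small hypothetical sample of log rate spread by loan purpose, not the real data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sample: log rate spread grouped by loan purpose
df = pd.DataFrame({
    'loan_purpose': ['home purchase'] * 4 + ['home improvement'] * 4
                    + ['refinancing'] * 4,
    'log_rate_spread': [0.1, 0.2, 0.3, 0.4,   # home purchase
                        0.9, 1.2, 1.5, 2.0,   # home improvement
                        0.2, 0.3, 0.4, 0.5],  # refinancing
})

# One box per category, as in Figures 3.6-3.11
sns.boxplot(x='loan_purpose', y='log_rate_spread', data=df)
plt.title('Log rate spread by loan purpose')

# The group medians quantify what the boxes show
medians = df.groupby('loan_purpose')['log_rate_spread'].median()
print(medians)
```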
4. Data Preparation
Based on the analysis of the loan rate spread data, a regression model was used to predict the loan
rate spread for the applicants. Based on the relationships identified when analyzing the data, a
Boosted Decision Tree was created to predict the loan rate spread.
The data given has 21 variables:
i. 12 categorical features
ii. 8 numeric features
iii. 1 Boolean feature
iv. 1 label
Since the rate spread label being predicted is a number, a regression model is needed to predict it.
The categorical features pose a big problem since they are encoded as a number for each class. The
data types are mismatched with what the values represent in the data set, and hence need to be
changed.
4.1 Platform chosen for Data Cleaning: Python
4.1.1 Steps taken for Data Preparation:
• Numbers used to encode classes are converted to string variables for easier modelling with a regression model.
• There are missing values in the columns msa_md, state_code, and county_code, as well as null values in the entire data set.
• Rows with missing values are removed, since they are a small percentage of the data set and, because these columns are categorical and represent distinct areas/states/counties, replacing them with values such as the mean, median, or mode would mislead the data set.
• Null values for numeric features are replaced with the median; since this is a highly right-skewed data set, replacing them with the mean or mode would be misleading because the outliers represent a large portion of the data (in this case, this is the type of data provided).
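The preparation steps above can be sketched on a toy frame, with -1 marking a missing category as in the real columns:

```python
import numpy as np
import pandas as pd

# Toy fragment of the loan data; -1 marks a missing categorical value
df = pd.DataFrame({'state_code': [5, -1, 12, 7],
                   'income': [45.0, 70.0, np.nan, 60.0],
                   'rate_spread': [1.5, 2.0, 1.0, 3.0]})

# Categorical: treat -1 as missing and drop the row (small share of the data)
df['state_code'] = df['state_code'].replace(-1, np.nan)
df = df.dropna(subset=['state_code'])

# Numeric: fill remaining nulls with the median (robust to the right skew)
df['income'] = df['income'].fillna(df['income'].median())

# Encode the category as a string so models treat it as a class, not a number
df['state_code'] = df['state_code'].astype(int).astype(str)
print(df)
```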
5. Machine Learning Model
5.1 Platform chosen for Predictive Modelling: Azure ML Studio
• The data set is added to Azure ML Studio and inspected closely for any problems.
• The Clean Missing Data module is used for replacing/removing missing values: median replacement for numeric features, and removal of rows with missing values for categorical features.
• Since classes were converted to string variables, the data types changed, so any other data to be used in modelling needs to be corrected.
– The Edit Metadata module is used for these changes.
– row_id is cleared, since it is only an identification number and contributes nothing to predictive modelling.
– Categorical variables are selected, marked as categorical, and passed unchanged, since these will be used for modelling.
– Rate spread is a float in this data set and is the label being predicted, so it is chosen as the label.
– Co-applicant is a Boolean feature and is marked as such.
• The data set has a large number of categorical values as well as some numeric ones, and not every feature is important, as observed in the data analysis prior to data preparation.
– Correlation with rate spread is apparent for only a few variables.
– Hence, Filter-Based Feature Selection is used to identify the columns that have the greatest predictive power for the input data set.
• The Feature Selection module is used to choose the right features and so potentially improve the accuracy and efficiency of the model's predictions.
• The feature selection methods used are:
– Mutual Information score
– Chi-squared
• Once the features have been scored, the features with the highest scores were used: 12 features in this case.
• The Split Data module is used for a 70-30 split of the data at first.
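Outside Azure, the same filter-based idea is available in scikit-learn. A minimal sketch with synthetic features, using mutual information as the scoring function:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
# Synthetic features: column 0 drives the target, column 1 is pure noise
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Mutual information scores each feature's dependence on the label;
# high-scoring features are kept, low-scoring ones dropped
scores = mutual_info_regression(X, y, random_state=0)
print(scores)
```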
• Boosted Decision Tree Regression is the machine learning model used, for the following reasons:
– A large number of categorical features are present for this regression problem, and boosted decision tree regression handles them well.
– This module creates an ensemble of regression trees using boosting.
– Boosting a decision tree ensemble tends to improve accuracy, with some small risk of less coverage.
– Since this machine learning algorithm has no problem dealing with categorical features, and the boosting method improves accuracy without potentially misleading results, it was chosen.
– In order to maximize the results of this model, the Tune Model Hyperparameters module was used: an entire grid sweep, with the coefficient of determination as the regression metric.
– This is comparable to XGBoost (extreme gradient boosting), which has yielded a high coefficient of determination in this competition.
– The model is then trained, scored, and evaluated to see the performance.
– The split in this case was later increased to 90-10, since boosted trees make decisions based on the leaves of the decision trees; the more information the model had, the better it predicted in this case.
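A grid sweep like the Tune Model Hyperparameters module can be approximated in scikit-learn with GridSearchCV, scored by R². Everything here (data, parameter grid, estimator) is illustrative, not the report's exact configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic regression data standing in for the loan features
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Exhaustive sweep over a small parameter grid, scored by the
# coefficient of determination (R^2) with 3-fold cross-validation
grid = GridSearchCV(GradientBoostingRegressor(random_state=0),
                    param_grid={'n_estimators': [50, 100],
                                'max_depth': [2, 3]},
                    scoring='r2', cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```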
• Another method was used for feature engineering: Permutation Feature Importance.
– Permutation feature importance doesn't measure the association between a feature and a target value; instead, it captures how much influence each feature has on predictions from the model.
– This yielded different columns compared to filter-based feature selection.
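scikit-learn exposes the same idea as permutation_importance. The sketch below uses a synthetic set where only the first feature matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Only feature 0 influences the target; feature 1 is noise
X = rng.normal(size=(300, 2))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops:
# a large drop means the model relies heavily on that feature
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```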
• Once all these steps were completed, the final data set is downloaded and used in another experiment for prediction.
– With hyperparameter tuning, the algorithm is XGBoost, and with the best features according to my understanding, the new data set is trained and evaluated.
– A significant boost in the coefficient of determination is observed.
– The model is converted to a web service for model deployment.
– Once the model is deployed, Excel is used via request-response for predicting on the test data set, which has been cleaned and prepared the same way as the training data set.
• The predictions for rate_spread, along with row_id, are added to a CSV file and submitted to obtain the coefficient of determination against a private test set, checking for overfitting as well as how well the model predicts.
(a) Updated model with Feature Engineering and Hypertuning
(b) Predictive Model in Azure
(a) Azure ML plugin Excel prediction
(b) World Ranking Competition
6. Machine Learning Model using Python in Jupyter Notebooks
6.1 Python Code Used
import pandas as pd
import numpy as np
import numpy.random as nr
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import scipy.stats as ss
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train = pd.read_csv('train_values.csv')
label = pd.read_csv('train_labels.csv')
test = pd.read_csv('test_values.csv')

# Join features and labels on the row identifier
df = pd.merge(train, label, how='inner', on='row_id')
df.shape
df.columns
df.select_dtypes(include=['float64']).describe().transpose()

df.columns = ['row_id', 'loan_type', 'property_type', 'loan_purpose', 'occupancy',
              'loan_amount', 'preapproval', 'msa_md', 'state_code', 'county_code',
              'ethnicity', 'race', 'sex', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam',
              'lender', 'co_applicant']

# -1 encodes a missing value in these columns; replace it with NaN
column = {'msa_md': -1, 'state_code': -1, 'county_code': -1}
df.replace(column, np.nan, inplace=True)
df.isna().sum()

# mode() returns a Series, so take its first element before filling
mode_statecode = df['state_code'].mode()[0]
mode_income = df['income'].mode()[0]
mode_pop = df['population'].mode()[0]
mode_minpop = df['min_pop'].mode()[0]
mode_mdfamincome = df['medianfam_income'].mode()[0]
median_tractmsamd = df['tract_msamd'].median()
mode_owneroccup = df['owner_occup'].mode()[0]
mode_oneto4fam = df['oneto4_fam'].mode()[0]

# Fill each column with its own statistic
df['state_code'] = df['state_code'].fillna(mode_statecode)
df['income'] = df['income'].fillna(mode_income)
df['population'] = df['population'].fillna(mode_pop)
df['min_pop'] = df['min_pop'].fillna(mode_minpop)
df['medianfam_income'] = df['medianfam_income'].fillna(mode_mdfamincome)
df['tract_msamd'] = df['tract_msamd'].fillna(median_tractmsamd)
df['owner_occup'] = df['owner_occup'].fillna(mode_owneroccup)
df['oneto4_fam'] = df['oneto4_fam'].fillna(mode_oneto4fam)

df.to_csv('pythontest.csv', index=False)
import pandas as pd
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import sklearn.metrics as sklm
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

df = final = pd.read_csv('pythontrain.csv')
test = pd.read_csv('pythontest.csv')
df
df.shape
df.columns
df.isna().sum()
test.head()

# The numerically encoded categorical columns are treated as objects
categorical_cols = ['loan_type', 'property_type', 'loan_purpose', 'occupancy',
                    'preapproval', 'msa_md', 'state_code', 'county_code',
                    'ethnicity', 'race', 'sex', 'lender']
df[categorical_cols] = df[categorical_cols].astype(object)
df.dtypes

numeric = df[['loan_amount', 'income', 'population', 'min_pop', 'medianfam_income',
              'tract_msamd', 'owner_occup', 'oneto4_fam']]
categorical = df[categorical_cols]
log_num = np.log(numeric)

def plot_scatter(df, cols, col_y='ratespread'):
    sns.set_style("darkgrid")
    for col in cols:
        fig = plt.figure(figsize=(7, 6))
        ax = fig.gca()
        df.plot.scatter(x=col, y=col_y, ax=ax)
        ax.set_title('Scatter plot of ' + col_y + ' vs. ' + col)
        ax.set_xlabel(col)
        ax.set_ylabel(col_y)
        plt.show()

plot_scatter(df, numeric.columns)
X = df.drop('ratespread', axis='columns')
y = df.ratespread
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train.shape, X_test.shape
X_train.describe().transpose()

# Keep only the columns selected during feature engineering
training = [x for x in X_train.columns if x not in
            ['row_id', 'county_code', 'ethnicity', 'sex', 'population',
             'tract_msamd', 'owner_occup', 'oneto4_fam', 'co_applicant',
             'ratespread']]

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')

large = 22; med = 16; small = 12
params = {'axes.titlesize': med,
          'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline
import random

scaler = StandardScaler()
scaler.fit(X_train[training])

import xgboost as xgb
xgb_mod = xgb.XGBRegressor(max_depth=128, learning_rate=0.01, n_estimators=1000)
eval_set = [(X_test[training], y_test)]
xgb_mod.fit(X_train[training], y_train, eval_set=eval_set, verbose=False)
prediction0 = xgb_mod.predict(X_train[training])
print('XGB train mse: {}'.format(mean_squared_error(y_train, prediction0)))
prediction0 = xgb_mod.predict(X_test[training])
print('XGB test mse: {}'.format(mean_squared_error(y_test, prediction0)))

freg = RandomForestRegressor(n_estimators=2000, max_depth=32, criterion="mse",
                             random_state=1234)
freg.fit(X_train[training], y_train)
prediction = freg.predict(X_train[training])
print('random forest train mse: {}'.format(mean_squared_error(y_train, prediction)))
prediction = freg.predict(X_test[training])
print('random forest test mse: {}'.format(mean_squared_error(y_test, prediction)))

prediction1 = []
for model in [freg]:
    prediction1.append(pd.Series(model.predict(test[training])))
final = pd.concat(prediction1, axis=1).mean(axis=1)
temp1 = pd.concat([test.row_id, final], axis=1)
temp1.columns = ['row_id', 'ratespread']
temp1.head()
temp1.to_csv('prediction.csv', index=False)

importance = pd.Series(freg.feature_importances_)
importance.index = training
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6),
                    color=['darkblue', 'red', 'gold', 'pink', 'purple', 'darkcyan'])
7. Conclusion and Recommendation
This analysis concludes that predictions of interest rate spread can be confidently made from the
information collected on loan applications. In particular, loan amount, applicant income, loan
type, property type, and loan purpose have a significant effect in determining interest rate spread.
The trained model has proven to be effective in predicting rate spread from the given information on
loan applications. The coefficient of determination (R squared) was 0.83 during training, and the
model achieved an R squared of 0.77 when tested against the test data in the competition. Although
there is a drop in R squared, it is not significant, and it can be concluded that the model generalizes
well. It is apparent from the joint-plot regression line that the predicted and actual values show a
healthy correlation. The scatter plot shows the variance in the predicted values and provides an
intuitive sense of the error spread. Deeper domain knowledge is required for a better understanding
of the data. As a way forward, expert consultation can be sought to guide better feature engineering.
At the same time, as more data is collected, the model must be retrained with the new data. Further
feature engineering and model retraining will have a significant impact on the model's future
predictive capabilities. The model can be deployed as a web service, and it will need to be
administered and monitored.
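The R squared values compared in this conclusion are the standard scikit-learn metric; a tiny check with hypothetical actual and predicted rate spreads:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual vs. predicted rate spreads
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# R^2 = 1 - SS_res / SS_tot; 1.0 would be a perfect fit
print(round(r2_score(y_true, y_pred), 3))  # prints 0.989
```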
Figure 7.1: Jointplot of predicted label vs training label