Microsoft Professional Program: Data Science
DAT102x Capstone Project
Predicting Mortgage Rates from Government Data
Mashfiq E Shahriar
November 2019
1. Executive Summary
1.1 Problem Description
This document presents an analysis and predictive modelling of data concerning mortgage loan
applications and interest rate spread. The goal was to predict the rate spread of mortgage
applications from the given data set, which is adapted from data published by the Federal Financial
Institutions Examination Council (FFIEC). Link to Problem. The value to be predicted was a
float, making this a regression problem. To measure the accuracy of the regression, the metric known
as the "coefficient of determination" or "R squared" was used. R squared is the proportion of the
variance in the dependent variable that is explained by the predictive model; the closer the R squared
value is to 1, the higher the accuracy of the predictive model. R squared can be calculated by the formula:
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
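As a quick illustration of the metric, the formula can be verified in a few lines of Python; the values below are hypothetical and serve only to show that the manual computation matches scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted rate spreads
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

# Manual computation following the formula above
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                        # manual R squared
print(r2_score(y_true, y_pred))                   # scikit-learn gives the same value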
1.2 Analysis Process
The initial data set given for the problem contained 200,000 cases of mortgage loan applications.
There were 21 variables, i.e. features, in the data associated with the rate spread, i.e. the target
label. To tackle the problem:
Data exploration was carried out, involving descriptive statistics and distribution plots for the
numerical features, followed by pair-plot and heat-map visualizations to reveal potential relationships
among them. Bar charts were used to identify relationships in the categorical variables.
Data preparation was done to pre-process the data, i.e. clean missing data, change features to
appropriate data types, and transform categorical values represented by numbers into strings. Further
wrangling involved detection and replacement of outliers, normalization of numeric features, and
conversion of categorical features to Boolean columns, also known as 'one-hot encoding'. Feature
selection was performed by analyzing features with Pearson correlation, Kendall correlation, mutual
information, and Chi squared.
A machine learning model was created by selecting the best algorithm among several regressors,
based on R squared. The chosen model was tuned for hyper-parameters, cross-validated, and deployed
for predicting the test data.
1.3 Key Findings
The most important features for predictive modelling of rate spread turned out to be:
‘loan_amount’, ‘loan_type’, ‘property_type’, ‘preapproval’, ‘loan_purpose’, ‘ffiecmedian_family_income’,
‘applicant_income’, ‘minority_population_pct’.
The boosted decision tree regression algorithm performed best in training the model, achieving
the highest R squared compared to linear regression and decision forest regression. The trained
model was observed to generalize well when cross-validated, showing a very small standard
deviation in R squared across folds. Upon deployment, the model predicted rate spread for the
test data. When compared to the actual rate spread withheld in the machine learning competition,
an R squared of 0.77 was achieved. The model was ranked #12 in the Machine Learning Competition
out of 700+ competitors. Link to Competition
2. Data Exploration
2.1 Dataset
Data exploration started with visualization of the data frame. The original dataset can be seen
below, containing 23 columns: 21 feature columns, 1 label column and 1 ID column.
Figure 2.1: Dataset
Numerical Features:
• loan_amount: Size of the requested loan in thousands of dollars.
• applicant_income: In thousands of dollars.
• population: Total population in tract.
• minority_population_pct: Percentage of minority population to total population for tract.
• ffiecmedian_family_income: FFIEC median family income in dollars for the MSA/MD in which
the tract is located.
• tract_to_msa_md_income_pct: Percentage of tract median family income compared to
MSA/MD median family income.
• number_of_owner-occupied_units: Number of dwellings, including individual condominiums,
that are lived in by the owner.
• number_of_1_to_4_family_units: Dwellings that are built to house fewer than 5 families.
Categorical Features:
• loan_type: Indicates whether the loan granted, applied for, or purchased was conventional,
government-guaranteed, or government-insured.
• property_type: Indicates whether the loan or application was for a one-to-four-family dwelling
(other than manufactured housing), manufactured housing, or multifamily dwelling.
• loan_purpose: Indicates whether the purpose of the loan or application was for home purchase,
home improvement, or refinancing.
• occupancy: Indicates whether the property to which the loan application relates will be the
owner’s principal dwelling.
• preapproval: Indicates whether the application or loan involved a request for a pre-approval of
a home purchase loan.
• msa_md: Metropolitan Statistical Area/Metropolitan Division.
• state_code: U.S. state.
• county_code: County.
• applicant_ethnicity: Ethnicity of the applicant.
• applicant_race: Race of the applicant.
• applicant_sex: Sex of the applicant.
• lender: The authority in approving or denying this loan.
• co_applicant: Indicates whether there is a co-applicant (often a spouse) or not.
2.2 Numerical Feature Relationships
Summary statistics for minimum, maximum, mean, standard deviation, and distinct count were
calculated for numeric columns, and the results from 200,000 observations are shown in Figure 2.2.
Figure 2.2: Summary Statistics
Figure 2.3: Distribution of Training Label
Kernel density estimate plots of the training label and important numerical features were generated to
study their distributions. The training label in Figure 2.3 appears right-skewed with a high peak;
therefore, high positive values for skewness and kurtosis can be assumed. A high kurtosis value
indicates a likelihood of outliers, which can also be seen visually. The majority of the data lie below
10, with a few sparse positive outliers, as portrayed by the rug on the x-axis. The discontinuity
of the rug also shows that it is a distribution of discrete values. A short numeric check of these
statistics is sketched below.
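As a sketch of how skewness and kurtosis can be checked numerically; the file name follows the competition data, but the label column name is an assumption:

import pandas as pd
from scipy import stats

# Assumed label column name; adjust to the actual training label file
label = pd.read_csv('train_labels.csv')['rate_spread']

print('skewness:', stats.skew(label))      # > 0 indicates right skew
print('kurtosis:', stats.kurtosis(label))  # large values indicate heavy tails / outliers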
Figure 2.4: Distribution of Numerical Features — (a) Applicant Income, (b) Loan Amount,
(c) Median Family Income, (d) Minority Population Percentage
Applicant income in Figure 2.4a appears right-skewed with a high peak; therefore, high positive
values for skewness and kurtosis can be assumed here as well. The high kurtosis indicates a
likelihood of outliers, which is visually apparent. The majority of the data lie below 200 (x-axis in
thousands), with quite a few positive outliers; outliers tend to be more extreme and sparser beyond
2,000.
Loan amount in Figure 2.4b appears to have a very similar distribution to applicant income.
Loan amount and applicant income are positively correlated, as shown in the scatter plot matrix
later in the report.
Median family incomes of the metropolitan statistical areas in Figure 2.4c appear to be normally
distributed, with the mean around 65,000. There are a few low-income areas below 40,000, as
shown by the stacks in the rug on the x-axis.
The minority population percentage distribution in Figure 2.4d is right-skewed. The mean of the
distribution is 34 percent, higher than what is generally witnessed in most areas. The higher
mean is due to positive skewness: some areas have as much as 100 percent minority population. The
mode of the distribution is around 5 percent, the demographic with the highest frequency.
The following scatter plot matrix in Figure 2.5 was generated to visualize the relationships of the
numerical features with each other and the training label.
Figure 2.5: Scatter-Plot Matrix for Numeric Features
Figure 2.6a below shows the correlations between the log-transformed numerical features and
the log-transformed rate spread. This is followed by an annotated heat map in Figure 2.6b that shows
the correlations among the features and the rate spread. Log values were used to accentuate the
correlations. A sketch of how such a heat map can be produced follows the figure.
Figure 2.6: Correlation Coefficients — (a) Correlations of Numerical Features with Training Label,
(b) Heatmap of Correlations
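A minimal sketch of producing such an annotated heat map with seaborn, assuming log_num holds the log-transformed numeric features together with the label (the variable name follows the appendix code):

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation matrix of the log-transformed features
corr = log_num.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlations of log-transformed features')
plt.show()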
2.3 Categorical Feature Frequency
Figure 2.7 below shows the frequencies of the categorical features.
Figure 2.7: Bar Charts
3. Data Preparation
3.1 Data Pre-processing
The initial data set had missing values in eight of the numerical features. During the data exploration
stage of the project it was discovered that the features with missing values had right-skewed
distributions; therefore, the missing values were replaced by the median values rather than the mean.
Missing values were imputed instead of dropping the rows in order not to lose training data points.
The initial data set also had columns stored as inappropriate data types: many of the categorical
columns were typed as integers or floats. Columns were changed to the appropriate data types
as shown in Figure 3.1. After correcting the data types, the numbers used to represent the unique
categories in the categorical columns were converted to descriptive strings. This not only makes the
data set visually readable but also eliminates problems that may arise during the predictive modelling
process. The data set after these changes is shown in Figure 3.2, and a condensed pandas sketch of
these steps follows the figure.
Figure 3.1: Feature Data Types
Figure 3.2: Data Set After Changing to Descriptive Strings
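A condensed sketch of these pre-processing steps in pandas; the file and column names follow the original data set, and the loan_type code-to-string mapping is illustrative of the approach rather than the exact mapping used:

import pandas as pd
import numpy as np

df = pd.read_csv('train_values.csv')

# -1 is the missing-value sentinel in these columns; replace it with NaN
df.replace({'msa_md': -1, 'state_code': -1, 'county_code': -1}, np.nan, inplace=True)

# Median imputation for the right-skewed numeric features
for col in ['applicant_income', 'population', 'minority_population_pct']:
    df[col] = df[col].fillna(df[col].median())

# Convert numeric category codes to descriptive strings, then to a categorical dtype
df['loan_type'] = df['loan_type'].map(
    {1: 'Conventional', 2: 'FHA-insured', 3: 'VA-guaranteed', 4: 'FSA/RHS'}
).astype('category')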
3.2 Data Wrangling
The following four processes were carried out in Azure Machine Learning Studio:
• Detection of extreme outliers: the 'Clip Values' module was used to detect values higher than the
99th percentile, which were replaced by a system-generated threshold value.
• Normalization of numeric features: the 'Normalize Data' module was used to rescale numeric
features to a standard range. Z-score rescaling was used to scale and constrain the numeric
features, so that model training is not dominated by numeric features with larger values.
Normalization did not appear to have any effect on tree-based algorithms and was therefore
skipped for the ensemble tree algorithms.
• Converting categorical features to Boolean columns, also known as one-hot encoding: the 'Convert
to Indicator Values' module was used. One-hot encoding likewise did not appear to have any effect
on tree-based algorithms and was skipped for the ensemble tree algorithms.
• Feature selection: the 'Filter Based Feature Selection' module was used in Azure ML. Pearson
correlation and Kendall correlation were used only on the numeric features. Mutual information
and Chi squared were applied to all features, as these two methods can handle both numeric
and categorical features. The 8 features with the most predictive power for rate spread using
Pearson correlation and mutual information are shown below in Figure 3.3; an illustrative
Python equivalent of these wrangling steps is sketched after the figure.
Figure 3.3: Feature Selection
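Outside Azure ML, the same wrangling and filter-based selection steps can be sketched with pandas and scikit-learn. This is an illustrative outline, assuming df holds the cleaned training set with a numeric label column rate_spread, not the exact behaviour of the Azure modules:

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Clip extreme outliers at the 99th percentile (mirrors 'Clip Values')
upper = df['loan_amount'].quantile(0.99)
df['loan_amount'] = df['loan_amount'].clip(upper=upper)

# Z-score normalization (mirrors 'Normalize Data')
df['loan_amount'] = (df['loan_amount'] - df['loan_amount'].mean()) / df['loan_amount'].std()

# One-hot encoding of categorical features (mirrors 'Convert to Indicator Values')
encoded = pd.get_dummies(df, columns=['loan_type', 'property_type', 'loan_purpose'])

# Filter-based selection: rank numeric features by mutual information with the label
X = encoded.select_dtypes(include='number').drop(columns=['rate_spread'])
y = encoded['rate_spread']
mi = mutual_info_regression(X.fillna(0), y)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False).head(8))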
4. Machine Learning Model
4.1 Choosing Machine Learning Algorithm
The problem at hand is the prediction of continuous values, and both training inputs and training
labels are available; therefore, it is supervised learning of the regression type. Azure Machine
Learning Studio was used for training the model.
Factors to consider when choosing a machine learning algorithm:
• Problem Category: Specific algorithms are meant to solve specific problems, and every algorithm
has its own inductive bias. It has already been established that our problem is supervised
regression.
• Size of Data Set: Here it boils down to the number of features. Certain data sets have a
large number of features compared to the number of data points, which becomes a bias/variance
issue: low-bias/high-variance models tend to overfit on data sets with a large number of features,
while high-bias/low-variance models may not overfit but will be less accurate. Our data set has
ample data points compared to the number of features.
• Linearity: Some algorithms depend on linearity in the data set. For example, linear regression
assumes data trends follow a straight line. In our case there is a large number of nominal
variables, so the data set lacks linearity.
• Accuracy Required: Sometimes only an approximation is needed, and at other times higher
accuracy is required. Some algorithms are not very accurate but have the advantage that they do
not overfit. In our case the metric of measurement is R squared; therefore, accuracy is of utmost
importance.
• Training Time: Training time and computational power requirements differ for each algorithm.
Models such as neural network regression need long training times. Larger data sets also need
longer training times, and higher accuracy tends to require longer training.
Taking the above points into account, three algorithms were chosen. Their R squared values on the
data set were as follows:
1. Boosted Decision Tree Regression: 0.74
2. Decision Forest Regression: 0.68
3. Linear Regression: 0.41
Therefore, Boosted Decision Tree Regression was chosen for further tuning of the model to enhance
accuracy. Azure's boosted tree for regression uses gradient boosting:
1. Gradient boosting starts by making a first leaf, which is the average of the label values.
2. Pseudo-residuals are then calculated for each data point by subtracting the average label value,
i.e. the first leaf, from the actual label values.
3. A decision tree is then created to predict the pseudo-residuals instead of the label. The tree's
predicted residual is multiplied by the learning rate and added to the first leaf, resulting in an
updated prediction.
4. The updated predictions are used to calculate a new set of pseudo-residuals, and another tree is
created to predict them. The new predicted residuals are multiplied by the learning rate and added
to the previous trees. This process is repeated to reduce the residuals and converge the predictions
toward the target values. Gradient boosting is therefore a sequential ensemble method: it fits a
sequence of weak learners into a strong model, and the predictions are combined through a weighted
sum (scaled by the learning rate) to make the final prediction. A toy sketch of this loop is shown
below.
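The procedure can be illustrated with a short, self-contained sketch on synthetic data; this is a toy reimplementation of the idea for squared loss, not the Azure module itself:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())      # step 1: first leaf = average label

for _ in range(100):
    residuals = y - prediction              # step 2: pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                  # step 3: weak learner fits the residuals
    prediction += learning_rate * tree.predict(X)  # step 4: scaled, additive update

print('training MSE after boosting:', np.mean((y - prediction) ** 2))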
4.2 Tuning Hyper-parameters and Cross-Validation
Hyper-parameters for the algorithm:
• Maximum number of leaves per tree
• Minimum number of samples per leaf node
• Learning rate
• Number of trees constructed
Refer to Figure 4.1 below, which shows the workflow in Azure ML Studio used to train the model.
Two models were trained, one with tuned hyper-parameters and one without. The model with
tuned hyper-parameters had better accuracy than the model with default parameters, as shown
by the metrics in Figure 4.2. An analogous search in scikit-learn is sketched after Figure 4.1.
Figure 4.1: Azure ML Tuning Hyperparameters Workflow
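An analogous search outside Azure can be sketched with scikit-learn's gradient boosting; the parameter ranges below are illustrative, covering the four hyper-parameters listed above, and the data is assumed to be prepared as in the appendix:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative ranges for the four tunable settings listed above
param_dist = {
    'max_leaf_nodes': [8, 32, 128],       # maximum number of leaves per tree
    'min_samples_leaf': [1, 10, 50],      # minimum number of samples per leaf node
    'learning_rate': [0.025, 0.1, 0.4],   # learning rate
    'n_estimators': [100, 200, 500],      # number of trees constructed
}

search = RandomizedSearchCV(GradientBoostingRegressor(), param_dist,
                            n_iter=20, scoring='r2', cv=3, random_state=0)
search.fit(X_train[training], y_train)
print(search.best_params_)
print('best cross-validated R squared:', search.best_score_)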
Figure 4.2: Tuning Hyperparameters Results
After retrieving the best hyper-parameter values from the 'Tune Model Hyperparameters' module,
further model training was carried out. Cross-validation was performed to observe the model's
behaviour across 10 folds; refer to Figure 4.3 below for the Azure ML workflow. The model appeared
to generalise well, as seen in Figure 4.4: the R squared values across the folds were close, with a
small standard deviation of 0.0024, and the mean R squared of the 10 folds was 0.7945. An
equivalent check in scikit-learn is sketched after Figure 4.3.
Figure 4.3: Azure ML Cross Validation Workflow
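A sketch of the same 10-fold check with scikit-learn's cross_val_score, assuming the tuned estimator and prepared data from the search sketched above:

import numpy as np
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation of the tuned model
scores = cross_val_score(search.best_estimator_, X_train[training], y_train,
                         scoring='r2', cv=10)
print('R squared per fold:', np.round(scores, 4))
print('mean: {:.4f}, std: {:.4f}'.format(scores.mean(), scores.std()))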
4.3 Joint Plot of Predicted Rate Spread
Final model training was done with 75% of the training data and tested with the remaining 25%.
Refer to Figure 4.5 for the Azure ML workflow. The root mean squared error (RMSE) was 0.73.
It is apparent from the joint plot in Figure 4.6 that the model predicted the rate spread as
continuous values, whereas the training labels were discrete. The predicted values have a
right-skewed distribution, compared to the multiple peaks in the training label distribution. The
fitted regression line shows a linear relationship between the predicted values and the training
label. It can also be inferred from the fading scatter points that the model does a better job of
predicting high and low values of rate spread; the model is more error-prone for the middle values
(3, 4, 5). A seaborn sketch of such a joint plot follows Figure 4.5.
Figure 4.5: Azure ML Final Model Workflow
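The joint plot itself can be reproduced with seaborn; a sketch assuming y_test holds the held-out labels and y_pred the model's predictions on them (both names are illustrative):

import seaborn as sns

# Joint plot with a fitted regression line; transparency shows point density fading
g = sns.jointplot(x=y_test, y=y_pred, kind='reg',
                  joint_kws={'scatter_kws': {'alpha': 0.2}})
g.set_axis_labels('actual rate spread', 'predicted rate spread')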
5. Conclusions & Way Forward
This analysis concludes that predictions of interest rate spread can be confidently made from the
information collected on loan applications. In particular, loan amount, loan type, property type,
preapproval requirement, and loan purpose have a significant effect in determining interest rate spread.
The trained model has proven sufficiently effective at predicting rate spread from the given
information on loan applications. Future work can involve acquiring more domain knowledge to
continue feature engineering and data collection. The model predicted rate spread as continuous
float values from a training label that was discrete; further understanding of whether rate spread
must be discrete or continuous is needed, and expert knowledge can be consulted to provide further
understanding of the domain.
The trained model can be deployed as a web service and used for prediction in a data pipeline.
Over time, as more data are collected, the new data should be used to retrain and update the existing
model. Retraining the model periodically with new data and enhanced features will increase its
predictive capabilities.
Resources utilised for this analysis were: Azure ML Studio, Jupyter Notebook, Anaconda, Python 3,
and the packages scikit-learn, SciPy, NumPy, pandas, Matplotlib, and Seaborn.
6. Appendix
6.1 Machine Learning with Python in Jupyter Notebook
import pandas as pd
import numpy as np

# Load training values, training labels and test values, then join on row_id
train = pd.read_csv('train_values.csv')
label = pd.read_csv('train_labels.csv')
test = pd.read_csv('test_values.csv')
df = pd.merge(train, label, how='inner', on='row_id')

df.shape
df.columns
df.select_dtypes(include=['float64']).describe().transpose()

# Rename columns to shorter working names
df.columns = ['row_id', 'loan_type', 'property_type', 'loan_purpose', 'occupancy',
              'loan_amount', 'preapproval', 'msa_md', 'state_code', 'county_code',
              'ethnicity', 'race', 'sex', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam',
              'lender', 'co_applicant']

# -1 is the missing-value sentinel in these columns; replace it with NaN
df.replace({'msa_md': -1, 'state_code': -1, 'county_code': -1}, np.nan, inplace=True)
df.isna().sum()

# Impute missing values: mode for the categorical state_code,
# median for the right-skewed numeric features
df['state_code'] = df['state_code'].fillna(df['state_code'].mode()[0])
for col in ['income', 'population', 'min_pop', 'medianfam_income',
            'tract_msamd', 'owner_occup', 'oneto4_fam']:
    df[col] = df[col].fillna(df[col].median())

# Save the cleaned training set; the same steps applied to the test set
# produce 'pythontest.csv'
df.to_csv('pythontrain.csv', index=False)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
%matplotlib inline

# Reload the cleaned training and test sets
df = pd.read_csv('pythontrain.csv')
test = pd.read_csv('pythontest.csv')
df.shape
df.columns
df.isna().sum()
test.head()

# Cast the numerically coded categorical columns to object dtype
categorical_cols = ['loan_type', 'property_type', 'loan_purpose', 'occupancy',
                    'preapproval', 'msa_md', 'state_code', 'county_code',
                    'ethnicity', 'race', 'sex', 'lender']
for col in categorical_cols:
    df[col] = df[col].astype(object)
df.dtypes

# Separate numeric and categorical features; log-transform the numeric
# features to reduce their right skew
numeric = df[['loan_amount', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam']]
categorical = df[categorical_cols]
log_num = np.log(numeric)

def plot_scatter(df, cols, col_y='ratespread'):
    """Scatter plot of each column in cols against the training label."""
    sns.set_style('darkgrid')
    for col in cols:
        fig = plt.figure(figsize=(7, 6))
        ax = fig.gca()
        df.plot.scatter(x=col, y=col_y, ax=ax)
        ax.set_title('Scatter plot of ' + col_y + ' vs. ' + col)
        ax.set_xlabel(col)
        ax.set_ylabel(col_y)
        plt.show()

plot_scatter(df, log_num.columns)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Split features and label, holding out 10% for validation
X = df.drop('ratespread', axis='columns')
y = df.ratespread
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train.shape, X_test.shape
X_train.describe().transpose()

# Keep only the selected features; drop the ID and low-importance columns
training = [x for x in X_train.columns if x not in
            ['row_id', 'county_code', 'ethnicity', 'sex', 'population',
             'tract_msamd', 'owner_occup', 'oneto4_fam', 'co_applicant',
             'ratespread']]

# Plot styling
large, med, small = 22, 16, 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style('white')

# A scaler is fitted here for completeness; tree-based models do not need it
scaler = StandardScaler()
scaler.fit(X_train[training])

# Train a random forest regressor and report train/test MSE
freg = RandomForestRegressor(n_estimators=2000, max_depth=32,
                             criterion='mse', random_state=1234)
freg.fit(X_train[training], y_train)
prediction = freg.predict(X_train[training])
print('random forest train mse: {}'.format(mean_squared_error(y_train, prediction)))
prediction = freg.predict(X_test[training])
print('random forest test mse: {}'.format(mean_squared_error(y_test, prediction)))

# Predict the competition test set and write the submission file
prediction1 = [pd.Series(model.predict(test[training])) for model in [freg]]
final = pd.concat(prediction1, axis=1).mean(axis=1)
temp1 = pd.concat([test.row_id, final], axis=1)
temp1.columns = ['row_id', 'ratespread']
temp1.head()
temp1.to_csv('prediction.csv', index=False)

# Plot feature importances in descending order
importance = pd.Series(freg.feature_importances_, index=training)
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6),
                    color=['darkblue', 'red', 'gold', 'pink', 'purple', 'darkcyan'])