DAT102x - Microsoft Professional Capstone: Data Science
Predicting Mortgage Rates From
Government Data
Mehnaz Newaz
November 2019
Contents
1 Executive Summary 3
1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Key Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Data Exploration 5
2.1 Numerical Feature Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Categorical Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Correlation and Apparent Relationships 9
3.1 Numerical Feature Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Numerical feature correlations in logarithmic scale . . . . . . . . . . . . . . . . . . . 11
3.3 Categorical Feature Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Data Preparation 15
4.1 Platform chosen for Data Cleaning: Python . . . . . . . . . . . . . . . . . . . . . . . 15
5 Machine Learning Model 16
5.1 Platform chosen for Predictive Modelling using Azure ML Studio . . . . . . . . . . . 16
6 Machine Learning Model using Python in Jupyter Notebooks 24
6.1 Python Codes Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Conclusion and Recommendation 27
List of Figures
2.1 Feature Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Rate Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.5 Correlation Heat Map and Table of Numeric Features . . . . . . . . . . . . . . . . . 12
3.6 Box plots for Loan purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.7 Box plots for Loan type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.8 Box plots for Property Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.9 Box plots for Pre-approval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.10 Box plots for Occupancy type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.11 Box plots for Co-Applicant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1 Categorical features convert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Filter based Feature Selection Modelling . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3 Permutation Feature importance Modelling . . . . . . . . . . . . . . . . . . . . . . . 19
7.1 Jointplot of predicted label vs training label . . . . . . . . . . . . . . . . . . . . . . . 27
1. Executive Summary
1.1 Problem Description
This report presents an analysis of data on the rate spread of mortgage applications, adapted from
the Federal Financial Institutions Examination Council (FFIEC). Each row represents an HMDA-reported
loan application. The data set contains 21 variables; each variable captures a specific
characteristic of the loan application that helps predict the rate spread for mortgage applications.
The data set covers one particular year and contains 200,000 rows, each representing a loan application.
1.2 Analysis
Data exploration for this data set included calculating summary and descriptive statistics, creating
visualizations, and computing correlations between the 21 features and the label, rate spread.
After data exploration and data preparation, a regression model was used to predict the rate spread
for the mortgage applicants from the features that were prepared.
1.3 Key Findings
After data exploration, the following conclusions were made. The data set includes 21 features, and
the features most helpful in predicting the rate spread for mortgage applicants are listed at the
end of this section.
• This is a regression problem because the label being predicted, the loan rate spread, is a numerical value.
• There are 13 categorical variables and 8 numerical features.
• Missing values, null values, and certain redundant features were dealt with to obtain a better coefficient of determination (COD).
• The best methods found for a regression problem in which a large portion of the features are categorical were Decision Forest Regression and Boosted Decision Tree Regression.
• Both models were run, and cross-validation was used to check whether the model generalizes well.
• The final Boosted Decision Tree version yielded a COD of 0.7691, rounded to 0.77 for the competition.
• Boosted Decision Tree Regression was the algorithm of choice as it yielded the best coefficient of determination.
• Feature engineering was carried out using two methods:
– Filter-based feature selection
– Permutation feature importance
• The most important features were found using permutation feature importance; in this case there were 13 of them.
• Hyperparameter tuning was used to get the most out of the model parameters and yielded a higher COD.
• The 13 features are listed below:
• msa_md - metropolitan statistical area/division
• state_code - code for US states
• lender - lender that approved/denied the loan
• loan_amount - requested loan size in dollars
• loan_type - type of loan: Conventional, FHA-insured, VA-guaranteed, FSA/RHS
• property_type - type of property for the loan: One-to-four-family, Manufactured housing, Multifamily
• loan_purpose - purpose of the requested loan: Home purchase, Home improvement, Refinancing
• occupancy - applicant's primary dwelling status for the requested loan: Owner-occupied as a principal dwelling, Not owner-occupied, Not applicable
• preapproval - pre-approval status of the loan: Requested, Not requested, Not applicable
• applicant_income - applicant's income
• applicant_ethnicity - ethnicity of the applicant: Hispanic or Latino, Not Hispanic or Latino, Information not provided, Not applicable, No co-applicant
• applicant_race - applicant's race: American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, Information not provided, Not applicable, No co-applicant
• minority_population_pct - minority percentage in the population
• ffiecmedian_family_income - median family income in the population tract
2. Data Exploration
The first step in data exploration is to compute summary and descriptive statistics:
2.1 Numerical Feature Statistics
Summary statistics for minimum, maximum, mean, median, standard deviation, and distinct count
were calculated for numeric columns, and the results taken from 216 observations are shown here:
Figure 2.1: Feature Statistics
Since the rate spread is the quantity of interest in this analysis, it was noted that its mean and
median are close and that the standard deviation indicates there is not considerable variance in
the rate spread across applicants. A histogram of the Rate Spread column shows that the rate
spread values are right-skewed; in other words, most applicants have a rate spread at the lower end
of the loan rate spread range.
Figure 2.2: Rate Spread
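A minimal sketch of how these statistics and the histogram can be reproduced in Python, assuming the competition file names used later in Section 6.1 and a label column named rate_spread (the exact column names in the raw files are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Merge the training values with their labels (file names as in Section 6.1)
train = pd.read_csv('train_values.csv')
labels = pd.read_csv('train_labels.csv')
df = pd.merge(train, labels, how='inner', on='row_id')

# Minimum, maximum, mean, median, standard deviation for the numeric columns
print(df.describe().transpose()[['min', 'max', 'mean', '50%', 'std']])
# Distinct counts per column
print(df.nunique())

# Histogram of the label shows the right skew
df['rate_spread'].plot.hist(bins=50)
plt.xlabel('rate_spread')
plt.ylabel('count')
plt.show()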
2.2 Categorical Features:
• loan_type: conventional, FHA-insured, VA-guaranteed, FSA/RHS
• property_type: one-to-four family, manufactured housing, multifamily
• loan_purpose: home purchase, home improvement, refinancing
• occupancy: principal dwelling, not owner occupied, not applicable
• preapproval: requested, not requested, not applicable
• msa_md - metropolitan statistical area/division
• state_code - code for US states
• county_code - indicates the county in the US
• applicant_ethnicity: Hispanic/Latino, not Hispanic/Latino, information not given, not applicable, no co-applicant
• applicant_race: American native, Asian, Black/African American, Pacific Islander, White, information not given, not applicable, no co-applicant
• applicant_sex: male, female, no information, not applicable
• lender: lenders which approved/denied loans
Bar charts were created to show the frequency of these features (a plotting sketch appears after the figures below) and indicated the following insights:
• Pre-approval: Pre-approval was requested for more loans than not.
• Loan types: Conventional loans were the most common, followed by FHA-insured loans.
• Occupancy: Most applicants applied for a property that is their principal dwelling.
• Property types: Most loans are for one-to-four-family dwellings.
• Loan purpose: The largest number of loans is for home purchase, followed by home improvement.
• Co-applicant: Most applications do not have a co-applicant.
• Ethnicity: Hispanic/Latino applicants were the most numerous.
• Race: American Native applicants were the most numerous.
• Sex: Most applicants were male, followed by female.
• States: No clear insight can be derived due to the large number of categories.
• County: No clear insight can be derived due to the large number of categories.
• Lenders: No clear insight can be derived due to the large number of categories.
• Metropolitan Statistical Area: No clear insight can be derived due to the large number of categories.
(a) Bar charts for Categorical variables
(a) County in US (b) Metropolitan Statistical Area
(c) States in US (d) Lender
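A minimal sketch of how the frequency bar charts referenced above can be produced, assuming the prepared file and shortened column names used in Section 6.1 (both are assumptions); only columns with a manageable number of categories are plotted:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('pythontrain.csv')  # prepared training set from Section 4

cat_cols = ['loan_type', 'property_type', 'loan_purpose', 'occupancy',
            'preapproval', 'ethnicity', 'race', 'sex', 'co_applicant']

for col in cat_cols:
    # Frequency of each category
    df[col].value_counts().plot.bar(title='Frequency of ' + col, figsize=(6, 4))
    plt.ylabel('count')
    plt.tight_layout()
    plt.show()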
3. Correlation and Apparent Relationships
After data exploration and individual feature analysis, establishing correlations and relationships
between the features and the label, Rate Spread, is crucial to finding key insights.
3.1 Numerical Feature Relationships
(a) Histogram of Numeric Feature
In order to find correlations and relationships between the label and the features, a correlation
table and a scatter-plot matrix were generated to compare the numeric features with one another
and with the label. The key features are shown below:
(a) Numeric Scatter-plot matrix with log Rate Spread
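A sketch of one way to build such a scatter-plot matrix, again assuming the prepared file and shortened column names from Section 6.1; seaborn's pairplot stands in for the matrix shown in the figure:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('pythontrain.csv')
numeric_cols = ['loan_amount', 'income', 'population', 'min_pop',
                'medianfam_income', 'ratespread']

# Pairwise scatter plots with histograms on the diagonal
sns.pairplot(df[numeric_cols].dropna(), diag_kind='hist',
             plot_kws={'alpha': 0.2, 's': 5})
plt.show()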
3.2 Numerical feature correlations in logarithmic scale
Most of the features do not show an apparent relationship with the label or much linearity. In an
attempt to improve the fit of the features to Rate Spread, the logarithm of the rate spread was
calculated. A correlation table of all the features against the log-transformed label may give more
insight on this matter.
(a) Numeric Scatter-plot matrix with log Rate Spread
The correlation between the numeric columns was calculated and tabulated as follows. These correlations
validate the plots by showing the negative correlation of loan amount, income, population,
minority population, and median family income with Rate Spread.
Figure 3.5: Correlation Heat Map and Table of Numeric Features
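A sketch of the log transform and the correlation heat map, under the same naming assumptions as above; the log is applied only to the strictly positive label:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('pythontrain.csv')
numeric_cols = ['loan_amount', 'income', 'population', 'min_pop',
                'medianfam_income', 'ratespread']

# Log-transform the label to reduce its right skew
df['log_ratespread'] = np.log(df['ratespread'])

# Correlation table and heat map of the numeric features against the (log) label
corr = df[numeric_cols + ['log_ratespread']].corr()
print(corr['log_ratespread'].sort_values())
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()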
3.3 Categorical Feature Relationships
After exploring the relationships between the numeric features, an attempt was made to discern any
relationships between the categorical features and the rate spread. The following box plots show the
categorical columns that seem to exhibit a relationship with the log of the rate spread.
Figure 3.6: Box plots for Loan purpose
Figure 3.7: Box plots for Loan type
Figure 3.8: Box plots for Property Type
Figure 3.9: Box plots for Pre-approval
Figure 3.10: Box plots for Occupancy type
Figure 3.11: Box plots for Co-Applicant
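A sketch of how box plots like these can be drawn, assuming the prepared file and column names from Section 6.1:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('pythontrain.csv')
df['log_ratespread'] = np.log(df['ratespread'])

# One box plot per categorical feature against the log of the label
for col in ['loan_purpose', 'loan_type', 'property_type',
            'preapproval', 'occupancy', 'co_applicant']:
    sns.boxplot(x=col, y='log_ratespread', data=df)
    plt.title('log rate spread by ' + col)
    plt.show()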
The box plots show clearer differences in the median and range of the rate spread across the
categorical features, as well as in the distribution and outliers. For example:
• Loan type: The highest spread in loan rate was observed for conventional loans, with FHA-insured
loans following and showing a large number of outliers.
• Loan purpose: Home purchase and refinancing had lower loan rates and rate spread than
home improvement, which ranged from 1 to 7.5.
• Property type: Manufactured housing had the highest spread in loan rate, followed by multifamily
dwellings, with one-to-four-family properties showing the lowest rate spread and rates offered.
• Pre-approval: Requested or not requested had a lower rate spread than Not applicable.
• Occupancy: Property occupied as a principal dwelling had the lowest rate spread compared to the
others.
• Co-applicant: The rate spread is much the same regardless of the presence or absence of a co-applicant.
4. Data Preparation
Based on the analysis of the loan rate spread data, a regression model was used to predict the loan
rate spread for the applicants. Based on the relationships identified when analyzing the data, a
Boosted Decision Tree was created to predict the loan rate spread.
The given data has 21 features plus the label:
i 12 categorical features
ii 8 numeric features
iii 1 Boolean feature
iv 1 label (rate spread)
Since the label being predicted, the rate spread, is a number, a regression model is needed. The
categorical features pose a problem because each class is encoded as a number, so the stored data
types do not match what the values represent; the types therefore need to be changed.
4.1 Platform chosen for Data Cleaning: Python
4.1.1 Steps taken for Data Preparation:
• The numbers used to encode classes are converted to string variables so the regression model
treats them as categories.
• There are missing values in the columns msa_md, state_code, and county_code, as well as null
values elsewhere in the data set.
• Rows with missing categorical values are removed: they are a small percentage of the data set,
and since these columns are categorical and represent distinct areas/states/counties, replacing
them with values such as the mean, median, or mode would mislead the model.
• Null values in numeric features are replaced with the median: the data set is highly right-skewed,
so replacing them with the mean or mode would be misleading, since outliers make up a large portion
of the data as provided. (A pandas sketch of these steps follows this list.)
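A pandas sketch of these preparation steps, assuming the competition file names used in Section 6.1 and that missing categorical codes are stored as -1 (as in the Section 6.1 code); the report's own cleaning code, which differs in some details, appears there in full:

import numpy as np
import pandas as pd

df = pd.read_csv('train_values.csv').merge(pd.read_csv('train_labels.csv'), on='row_id')

# -1 marks a missing code in these categorical columns
cat_cols = ['msa_md', 'state_code', 'county_code']
df[cat_cols] = df[cat_cols].replace(-1, np.nan)

# Drop the (small share of) rows with missing categorical codes
df = df.dropna(subset=cat_cols)

# Class codes are stored as numbers; cast them to strings so they are treated as categories
df[cat_cols] = df[cat_cols].astype(int).astype(str)

# Remaining numeric nulls are replaced with the column median (the data are right-skewed)
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())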
5. Machine Learning Model
5.1 Platform chosen for Predictive Modelling using Azure ML
Studio
• The data set is added to Azure ML Studio and inspected for any problems.
• The Clean Missing Data module is used to replace or remove missing values: median replacement
for numeric features and removal of rows with missing values in categorical features.
• Since class codes were converted to string variables, the data types changed, so any other data
types to be used in modelling need to be corrected.
– The Edit Metadata module is used for these changes.
– row_id is cleared as a feature since it is only an identification number and contributes
nothing to predictive modelling.
– Categorical variables are selected, marked as categorical, and passed through unchanged, since
these will be used for modelling.
– Rate spread is a float in this data set and is the value being predicted, so it is chosen as the
label.
– Co-applicant is a Boolean feature and is chosen as such.
• The data set has a large number of categorical values as well as some numeric ones, and not
every feature is important, as observed in the data analysis prior to data preparation.
– Correlation with rate spread is apparent for only a few variables.
– Hence, Filter-Based Feature Selection is used to identify the columns with the greatest
predictive power for the input data set.
• The Feature Selection module is used to choose the right features and thereby potentially improve
the accuracy and efficiency of the model's predictions.
• The feature selection methods used are:
– Mutual information score
– Chi-squared
• Once the features have been scored, the features with the highest scores were used; 12 features
in this case.
• The Split Data module is used for a 70-30 split of the data at first.
• Boosted Decision Tree Regression is the machine learning model used, for the following reasons:
– A large number of categorical features are present for this regression problem, and boosted
decision tree regression handles them well.
– This module creates an ensemble of regression trees using boosting.
– Boosting in a decision tree ensemble tends to improve accuracy with some small risk of
less coverage.
– Since this algorithm has no problems dealing with categorical features, and the boosting method
improves accuracy without potentially misleading results, it was chosen.
– To maximize the results of this model, the Tune Model Hyperparameters module was used: an
entire grid sweep with the coefficient of determination as the regression metric (a rough
scikit-learn sketch of such a sweep appears after this list).
– This is comparable to XGBoost (extreme gradient boosting), which has yielded a high coefficient
of determination in this competition.
– The data set is then trained, scored, and evaluated to see the performance.
– The split was later increased to 90-10, since boosted trees make decisions based on the number
of leaves and trees; the more training information the model had, the better it predicted in
this case.
• Another method was used for feature engineering:
– Permutation feature importance.
– Permutation feature importance does not measure the association between a feature and a target
value; instead it captures how much influence each feature has on the model's predictions.
– This yielded different columns compared to filter-based feature selection.
• Once all these steps were completed, the final data set was downloaded and used in another
experiment for prediction.
– With hyperparameter tuning applied to the boosted tree and the best features selected, the new
data set is trained and evaluated.
– A significant increase in the coefficient of determination is observed.
– The model is converted to a web service and deployed.
– Once the model is deployed, Excel with the Azure ML request-response add-in is used to predict
on the test data set, which has been cleaned and prepared in the same way as the training data set.
• The predicted rate_spread, along with row_id, is written to a CSV file and submitted to obtain
the coefficient of determination against a private test set, checking both for overfitting and for
how well the model predicts.
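The Azure modules above have no direct code equivalent in this report, but a rough scikit-learn analogue of the boosted-tree grid sweep mentioned in the hyper-parameter bullet might look like the following sketch. File and column names follow Section 6.1, the parameter grid is illustrative rather than the grid Azure actually sweeps, and the categorical codes are treated as plain integers here (unlike Azure ML Studio, which handles them as categorical):

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv('pythontrain.csv')
features = [c for c in df.columns if c not in ['row_id', 'ratespread']]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['ratespread'], test_size=0.1, random_state=0)

# Grid sweep over a boosted regression tree, scored by the coefficient of determination
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={'n_estimators': [200, 500],
                'learning_rate': [0.05, 0.1],
                'max_depth': [3, 6]},
    scoring='r2',
    cv=3)
grid.fit(X_train, y_train)

print('best parameters:', grid.best_params_)
print('cross-validated R^2:', grid.best_score_)
print('held-out R^2:', grid.score(X_test, y_test))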
Figure 5.1: Categorical features convert
Figure 5.2: Filter based Feature Selection Modelling
Figure 5.3: Permutation Feature importance Modelling
(a) Correlation Coefficients (b) Feature Importance
(a) Hypertuning Parameters with Feature Engineering
(a) Updated model with Feature Engineering and Hypertuning
(b) Predictive Model Azure
(a) Azure ML plugin Excel prediction
(b) World Ranking Competition
6. Machine Learning Model using Python in Jupyter
Notebooks
6.1 Python Codes Used
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
%matplotlib inline

# Load the training values, the training labels, and the test values
train = pd.read_csv('train_values.csv')
label = pd.read_csv('train_labels.csv')
test = pd.read_csv('test_values.csv')
df = pd.merge(train, label, how='inner', on='row_id')
df.shape
df.columns
df.select_dtypes(include=['float64']).describe().transpose()

# Shorter column names; 'ratespread' is the label column appended by the merge
df.columns = ['row_id', 'loan_type', 'property_type', 'loan_purpose', 'occupancy',
              'loan_amount', 'preapproval', 'msa_md', 'state_code', 'county_code',
              'ethnicity', 'race', 'sex', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam',
              'lender', 'co_applicant', 'ratespread']

# -1 marks missing codes in these columns; replace it with NaN
column = {'msa_md': -1, 'state_code': -1, 'county_code': -1}
df.replace(column, np.nan, inplace=True)
df.isna().sum()

# Replacement values for the columns with missing data
mode_statecode = df['state_code'].mode()[0]
mode_income = df['income'].mode()[0]
mode_pop = df['population'].mode()[0]
mode_minpop = df['min_pop'].mode()[0]
mode_mdfamincome = df['medianfam_income'].mode()[0]
median_tractmsamd = df['tract_msamd'].median()
mode_owneroccup = df['owner_occup'].mode()[0]
mode_oneto4fam = df['oneto4_fam'].mode()[0]

# Fill each column with its own replacement value
df['state_code'] = df['state_code'].fillna(mode_statecode)
df['income'] = df['income'].fillna(mode_income)
df['population'] = df['population'].fillna(mode_pop)
df['min_pop'] = df['min_pop'].fillna(mode_minpop)
df['medianfam_income'] = df['medianfam_income'].fillna(mode_mdfamincome)
df['tract_msamd'] = df['tract_msamd'].fillna(median_tractmsamd)
df['owner_occup'] = df['owner_occup'].fillna(mode_owneroccup)
df['oneto4_fam'] = df['oneto4_fam'].fillna(mode_oneto4fam)

df.to_csv('pythontest.csv', index=False)
import pandas as pd
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import sklearn.metrics as sklm
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
pd.pandas.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

# Prepared training and test sets (cleaned with the steps shown above)
df = final = pd.read_csv('pythontrain.csv')
test = pd.read_csv('pythontest.csv')
df
df.shape
df.columns
df.isna().sum()
test.head()

# Cast the coded categorical columns to object dtype so they are treated as classes
df.loan_type = df.loan_type.astype(object)
df.property_type = df.property_type.astype(object)
df.loan_purpose = df.loan_purpose.astype(object)
df.occupancy = df.occupancy.astype(object)
df.preapproval = df.preapproval.astype(object)
df.msa_md = df.msa_md.astype(object)
df.state_code = df.state_code.astype(object)
df.county_code = df.county_code.astype(object)
df.ethnicity = df.ethnicity.astype(object)
df.race = df.race.astype(object)
df.sex = df.sex.astype(object)
df.lender = df.lender.astype(object)
df.dtypes

# Numeric and categorical column groups; log-transform the numeric features
numeric = df[['loan_amount', 'income', 'population', 'min_pop', 'medianfam_income',
              'tract_msamd', 'owner_occup', 'oneto4_fam']]
categorical = df[['loan_type', 'property_type', 'loan_purpose', 'occupancy',
                  'preapproval', 'msa_md', 'state_code', 'county_code',
                  'ethnicity', 'race', 'sex', 'lender']]
log_num = np.log(numeric)

def plot_scatter(df, cols, col_y='ratespread'):
    # Scatter plot of the label against each column in cols
    sns.set_style("darkgrid")
    for col in cols:
        fig = plt.figure(figsize=(7, 6))
        ax = fig.gca()
        df.plot.scatter(x=col, y=col_y, ax=ax)
        ax.set_title('Scatter plot of ' + col_y + ' vs. ' + col)
        ax.set_xlabel(col)
        ax.set_ylabel(col_y)
        plt.show()

plot_scatter(df, log_num)
X = df.drop('ratespread', axis='columns')
y = df.ratespread
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train.shape, X_test.shape
X_train.describe().transpose()

# Columns kept for training (the rest are dropped)
training = [x for x in X_train.columns if x not in ['row_id', 'county_code',
            'ethnicity', 'sex', 'population', 'tract_msamd', 'owner_occup',
            'oneto4_fam', 'co_applicant', 'ratespread']]

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')

# Plot styling
large = 22; med = 16; small = 12
params = {'axes.titlesize': large,
          'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline
import random

# Scaler fitted on the training features (the scaled values are not used below)
scaler = StandardScaler()
scaler.fit(X_train[training])

# Gradient-boosted trees (XGBoost); it needs numeric dtypes, so the coded
# categorical columns are cast back to numbers here
import xgboost as xgb
X_train_num = X_train[training].astype(float)
X_test_num = X_test[training].astype(float)
xgbmod = xgb.XGBRegressor(max_depth=128, learning_rate=0.01, n_estimators=1000)
eval_set = [(X_test_num, y_test)]
xgbmod.fit(X_train_num, y_train, eval_set=eval_set, verbose=False)
prediction0 = xgbmod.predict(X_train_num)
print('XGB train mse: {}'.format(mean_squared_error(y_train, prediction0)))
prediction0 = xgbmod.predict(X_test_num)
print('XGB test mse: {}'.format(mean_squared_error(y_test, prediction0)))

# Random forest regressor
freg = RandomForestRegressor(n_estimators=2000, max_depth=32, criterion="mse",
                             random_state=1234)
freg.fit(X_train[training], y_train)
prediction = freg.predict(X_train[training])
print('random forest train mse: {}'.format(mean_squared_error(y_train, prediction)))
prediction = freg.predict(X_test[training])
print('random forest test mse: {}'.format(mean_squared_error(y_test, prediction)))

# Predict on the competition test set and write the submission file
prediction1 = []
for model in [freg]:
    prediction1.append(pd.Series(model.predict(test[training])))
final = pd.concat(prediction1, axis=1).mean(axis=1)
temp1 = pd.concat([test.row_id, final], axis=1)
temp1.columns = ['row_id', 'ratespread']
temp1.head()
temp1.to_csv('prediction.csv', index=False)

# Feature importances from the random forest
importance = pd.Series(freg.feature_importances_)
importance.index = training
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6),
                    color=['darkblue', 'red', 'gold', 'pink', 'purple', 'darkcyan'])
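The competition metric is the coefficient of determination rather than the mean squared error printed above. A short sketch of evaluating the two fitted models on the held-out split with that metric, assuming the variables defined in the code above are still in scope:

from sklearn.metrics import r2_score

# Coefficient of determination (R^2) on the held-out 10% split
for name, model in [('xgboost', xgbmod), ('random forest', freg)]:
    pred = model.predict(X_test[training].astype(float))
    print('{} test R^2: {:.4f}'.format(name, r2_score(y_test, pred)))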
7. Conclusion and Recommendation
This analysis concludes that predictions of the interest rate spread can be confidently made from
the information collected in loan applications. In particular, loan amount, applicant income, loan
type, property type, and loan purpose have a significant effect on the interest rate spread. The
trained model has proven effective in predicting the rate spread from the information given in loan
applications. The coefficient of determination (R squared) was 0.83 during training, and the model
achieved an R squared of 0.77 when tested against the test data in the competition. Although there
is a drop in R squared, it is not significant, and it can be concluded that the model generalizes
well. It is apparent from the joint plot regression line that the predicted and actual values show a
healthy correlation, and the scatter in the plot gives an intuitive sense of the error spread. Deeper
domain knowledge is required for a better understanding of the data; as a way forward, expert
consultation can be sought to guide better feature engineering. As more data are collected, the
model must be retrained; further feature engineering and retraining will have a significant impact
on the model's future predictive capability. The model can be deployed as a web service, which will
need to be administered and monitored.
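A sketch of how a joint plot like Figure 7.1 can be reproduced from the held-out predictions, assuming the variables from Section 6.1 are in scope; the R squared values quoted above come from the competition submission, not from this snippet:

import seaborn as sns

# Joint plot with a regression line: predicted label vs. actual label
pred = freg.predict(X_test[training].astype(float))
g = sns.jointplot(x=y_test, y=pred, kind='reg',
                  joint_kws={'scatter_kws': {'alpha': 0.2, 's': 5}})
g.set_axis_labels('actual rate spread', 'predicted rate spread')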
Figure 7.1: Jointplot of predicted label vs training label

More Related Content

Similar to Predicting Mortgage Rates From Government Data

Single Cell RNA Sequencing Market.pdf
 Single Cell RNA Sequencing Market.pdf Single Cell RNA Sequencing Market.pdf
Single Cell RNA Sequencing Market.pdfBIS Research Inc.
 
Face recognition vendor test 2002 supplemental report
Face recognition vendor test 2002   supplemental reportFace recognition vendor test 2002   supplemental report
Face recognition vendor test 2002 supplemental reportSungkwan Park
 
BI Project report
BI Project reportBI Project report
BI Project reporthlel
 
An introduction to data cleaning with r
An introduction to data cleaning with rAn introduction to data cleaning with r
An introduction to data cleaning with rthecar1992
 
Project Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelProject Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelSubhasis Mishra
 
IRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET Journal
 
QBD_1464843125535 - Copy
QBD_1464843125535 - CopyQBD_1464843125535 - Copy
QBD_1464843125535 - CopyBhavesh Jangale
 
Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)
Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)
Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)NAFCU Services Corporation
 
Multidimensional scaling & Conjoint Analysis
Multidimensional scaling & Conjoint AnalysisMultidimensional scaling & Conjoint Analysis
Multidimensional scaling & Conjoint AnalysisOmer Maroof
 
Energy Management System Market: Increasing Demand for Energy Conservation an...
Energy Management System Market: Increasing Demand for Energy Conservation an...Energy Management System Market: Increasing Demand for Energy Conservation an...
Energy Management System Market: Increasing Demand for Energy Conservation an...AmanpreetSingh409
 
Auditoría de TrueCrypt: Informe final fase II
Auditoría de TrueCrypt: Informe final fase IIAuditoría de TrueCrypt: Informe final fase II
Auditoría de TrueCrypt: Informe final fase IIChema Alonso
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionMatthieu Cisel
 
Corporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesCorporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesShantanu Deshpande
 

Similar to Predicting Mortgage Rates From Government Data (20)

Single Cell RNA Sequencing Market.pdf
 Single Cell RNA Sequencing Market.pdf Single Cell RNA Sequencing Market.pdf
Single Cell RNA Sequencing Market.pdf
 
Face recognition vendor test 2002 supplemental report
Face recognition vendor test 2002   supplemental reportFace recognition vendor test 2002   supplemental report
Face recognition vendor test 2002 supplemental report
 
Mrd template
Mrd templateMrd template
Mrd template
 
Big Data Social Network Analysis
Big Data Social Network AnalysisBig Data Social Network Analysis
Big Data Social Network Analysis
 
BI Project report
BI Project reportBI Project report
BI Project report
 
An introduction to data cleaning with r
An introduction to data cleaning with rAn introduction to data cleaning with r
An introduction to data cleaning with r
 
Project Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelProject Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring Model
 
IRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank Loans
 
QBD_1464843125535 - Copy
QBD_1464843125535 - CopyQBD_1464843125535 - Copy
QBD_1464843125535 - Copy
 
Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)
 
10.1.1.21.3147
10.1.1.21.314710.1.1.21.3147
10.1.1.21.3147
 
10.1.1.21.3147
10.1.1.21.314710.1.1.21.3147
10.1.1.21.3147
 
Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)
Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)
Populating a Data Quality Scorecard with Relevant Metrics (Whitepaper)
 
Multidimensional scaling & Conjoint Analysis
Multidimensional scaling & Conjoint AnalysisMultidimensional scaling & Conjoint Analysis
Multidimensional scaling & Conjoint Analysis
 
Global ems market
Global ems marketGlobal ems market
Global ems market
 
Energy Management System Market: Increasing Demand for Energy Conservation an...
Energy Management System Market: Increasing Demand for Energy Conservation an...Energy Management System Market: Increasing Demand for Energy Conservation an...
Energy Management System Market: Increasing Demand for Energy Conservation an...
 
Auditoría de TrueCrypt: Informe final fase II
Auditoría de TrueCrypt: Informe final fase IIAuditoría de TrueCrypt: Informe final fase II
Auditoría de TrueCrypt: Informe final fase II
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality Reduction
 
Corporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesCorporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniques
 
Aregay_Msc_EEMCS
Aregay_Msc_EEMCSAregay_Msc_EEMCS
Aregay_Msc_EEMCS
 

Recently uploaded

如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样
如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样
如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样muwyto
 
BLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOAL
BLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOALBLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOAL
BLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOALCaitlinCummins3
 
WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024
WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024
WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024Hector Del Castillo, CPM, CPMM
 
Stack and its operations, Queue and its operations
Stack and its operations, Queue and its operationsStack and its operations, Queue and its operations
Stack and its operations, Queue and its operationspoongothai11
 
PROGRAM FOR GRADUATION CEREMONY 2023-2024
PROGRAM FOR GRADUATION CEREMONY 2023-2024PROGRAM FOR GRADUATION CEREMONY 2023-2024
PROGRAM FOR GRADUATION CEREMONY 2023-2024alyssakayporras3
 
如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样
如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样
如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样qyguxu
 
如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证gakamzu
 
如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样
如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样
如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样qyguxu
 
We’re looking for a junior patent engineer to join our Team!
We’re looking for a junior patent engineer to join our Team!We’re looking for a junior patent engineer to join our Team!
We’re looking for a junior patent engineer to join our Team!Juli Boned
 
如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样
如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样
如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样muwyto
 
Common breast clinical based cases in Tanzania.pptx
Common breast clinical based cases in Tanzania.pptxCommon breast clinical based cases in Tanzania.pptx
Common breast clinical based cases in Tanzania.pptxJustineNDeodatus
 
如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样
如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样
如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样muwyto
 
Ochsen Screenplay Coverage - JACOB - 10.16.23.pdf
Ochsen Screenplay Coverage - JACOB - 10.16.23.pdfOchsen Screenplay Coverage - JACOB - 10.16.23.pdf
Ochsen Screenplay Coverage - JACOB - 10.16.23.pdfRachel Ochsenschlager
 
Kathleen McBride Costume Design Resume.pdf
Kathleen McBride Costume Design Resume.pdfKathleen McBride Costume Design Resume.pdf
Kathleen McBride Costume Design Resume.pdfKathleenMcBride8
 
如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样
如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样
如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样qyguxu
 
Job Hunting - pick over this fishbone for telephone interviews!.pptx
Job Hunting - pick over this fishbone for telephone interviews!.pptxJob Hunting - pick over this fishbone for telephone interviews!.pptx
Job Hunting - pick over this fishbone for telephone interviews!.pptxJon Stephenson
 
Building a Culture of Innovation How I Encourage It in My Team.pdf
Building a Culture of Innovation How I Encourage It in My Team.pdfBuilding a Culture of Innovation How I Encourage It in My Team.pdf
Building a Culture of Innovation How I Encourage It in My Team.pdfAlexis Alexandrou
 
IN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHOR
IN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHORIN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHOR
IN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHORpillahdonald
 
如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样
如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样
如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样qyguxu
 
Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 

Recently uploaded (20)

如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样
如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样
如何办理(UdeM毕业证书)蒙特利尔大学毕业证成绩单原件一模一样
 
BLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOAL
BLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOALBLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOAL
BLAHALIFHKSDFOILEWKHJSFDNLDSKFN,DLFKNFMELKFJAERPIOAL
 
WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024
WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024
WIOA Program Info Session | PMI Silver Spring Chapter | May 17, 2024
 
Stack and its operations, Queue and its operations
Stack and its operations, Queue and its operationsStack and its operations, Queue and its operations
Stack and its operations, Queue and its operations
 
PROGRAM FOR GRADUATION CEREMONY 2023-2024
PROGRAM FOR GRADUATION CEREMONY 2023-2024PROGRAM FOR GRADUATION CEREMONY 2023-2024
PROGRAM FOR GRADUATION CEREMONY 2023-2024
 
如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样
如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样
如何办理(CCA毕业证书)加利福尼亚艺术学院毕业证成绩单原件一模一样
 
如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(NEU毕业证书)东北大学毕业证成绩单本科硕士学位证留信学历认证
 
如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样
如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样
如何办理(USYD毕业证书)悉尼大学毕业证成绩单原件一模一样
 
We’re looking for a junior patent engineer to join our Team!
We’re looking for a junior patent engineer to join our Team!We’re looking for a junior patent engineer to join our Team!
We’re looking for a junior patent engineer to join our Team!
 
如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样
如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样
如何办理(CQU毕业证书)中央昆士兰大学毕业证成绩单原件一模一样
 
Common breast clinical based cases in Tanzania.pptx
Common breast clinical based cases in Tanzania.pptxCommon breast clinical based cases in Tanzania.pptx
Common breast clinical based cases in Tanzania.pptx
 
如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样
如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样
如何办理(laurentian毕业证书)劳伦森大学毕业证成绩单原件一模一样
 
Ochsen Screenplay Coverage - JACOB - 10.16.23.pdf
Ochsen Screenplay Coverage - JACOB - 10.16.23.pdfOchsen Screenplay Coverage - JACOB - 10.16.23.pdf
Ochsen Screenplay Coverage - JACOB - 10.16.23.pdf
 
Kathleen McBride Costume Design Resume.pdf
Kathleen McBride Costume Design Resume.pdfKathleen McBride Costume Design Resume.pdf
Kathleen McBride Costume Design Resume.pdf
 
如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样
如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样
如何办理(CBU毕业证书)浸会大学毕业证成绩单原件一模一样
 
Job Hunting - pick over this fishbone for telephone interviews!.pptx
Job Hunting - pick over this fishbone for telephone interviews!.pptxJob Hunting - pick over this fishbone for telephone interviews!.pptx
Job Hunting - pick over this fishbone for telephone interviews!.pptx
 
Building a Culture of Innovation How I Encourage It in My Team.pdf
Building a Culture of Innovation How I Encourage It in My Team.pdfBuilding a Culture of Innovation How I Encourage It in My Team.pdf
Building a Culture of Innovation How I Encourage It in My Team.pdf
 
IN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHOR
IN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHORIN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHOR
IN DOHA +27838792658 ABORTION PILLS FOR SALE IN DOHA, AL KHOR
 
如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样
如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样
如何办理(UoA毕业证书)奥克兰大学毕业证成绩单原件一模一样
 
Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Jakarta (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 

Predicting Mortgage Rates From Government Data

  • 1. DAT102x - Microsoft Professional Capstone: Data Science Predicting Mortgage Rates From Government Data Mehnaz Newaz November 2019
  • 2. Contents 1 Executive Summary 3 1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Key Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Data Exploration 5 2.1 Numerical Feature Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Categorical Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 Correlation and Apparent Relationships 9 3.1 Numerical Feature Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2 Numerical feature correlations in logarithmic scale . . . . . . . . . . . . . . . . . . . 11 3.3 Categorical Feature Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4 Data Preparation 15 4.1 Platform chosen for Data Cleaning: Python . . . . . . . . . . . . . . . . . . . . . . . 15 5 Machine Learning Model 16 5.1 Platform chosen for Predictive Modelling using Azure ML Studio . . . . . . . . . . . 16 6 Machine Learning Model using Python in Jupyter Notebooks 24 6.1 Python Codes Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7 Conclusion and Recommendation 27 1
  • 3. List of Figures 2.1 Feature Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Rate Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.5 Correlation Heat Map and Table of Numeric Features . . . . . . . . . . . . . . . . . 12 3.6 Box plots for Loan purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.7 Box plots for Loan type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.8 Box plots for Property Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.9 Box plots for Pre-approval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.10 Box plots for Occupancy type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.11 Box plots for Co-Applicant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5.1 Categorical features convert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5.2 Filter based Feature Selection Modelling . . . . . . . . . . . . . . . . . . . . . . . . . 19 5.3 Permutation Feature importance Modelling . . . . . . . . . . . . . . . . . . . . . . . 19 7.1 Jointplot of predicted label vs training label . . . . . . . . . . . . . . . . . . . . . . . 27 2
  • 4. 1. Executive Summary 1.1 Problem Description This report presents an analysis of the data for the rate spread of mortgage applications adapted from the Federal Financial Institutions Examination Council’s (FFIEC).Each row represents a HMDA- reported loan application. The data set contains 21 variables, each of these variables contain specific characteristic’s for the loan application that help predict rate spread for mortgage applications.This data set covers one particular year and there are 200,000 rows representing loan applications in the data set. 1.2 Analysis Data exploration for this data set included calculating summary and descriptive statistics, visual- izations for the data set,correlations between the 21 different features were calculated in order to predict the label rate spread. After data exploration and data preparation a predictive regression model was used to predict the rate spread for the mortgage applicants from the features that were created. 1.3 Key Findings After data exploration, the following conclusions were made: The data set includes 21 features out of which the features most helpful in predicting the rate spread for the mortgage applicants are the following: • This is a regression problem because the label being predicted is a numerical value.:Loan Rate Spread. • Thereare 13 categorical variables and 8 numerical features. • missing values, null values, and certain redundant features were dealt with for a better COD. • The best method found to deal with a regression problem with a large portion of the dataset to have categorical variables was Decision Forest Regression & Boosted Decision Tree Regression. • Both model were run and cross validation used to check if the model generalizes well. • Boosted Tree final version yielded COD of :0.7691. This rounded to 0.77 for the competition. • Boosted Decision Trees was the algorithm of choice as it yielded the best Coefficient of Deter- mination. 3
  • 5. • Feature engineering through two methods were done: – Filter based feature selection – Permutation Feature Importance • The most important features were found using permutation feature selection in my case it was 13 features. • Hypertuning was used to make the most out of tweaking the parameters which yielded a higher COD. • 13 features are listed below: • msa_md- Metropolitan statistical area/division • state_code - code for US states • lender - lenders which approved/denied loans • loan_amount - loan requested size in dollars. • loan_type - type of loan.Conventional,FHA-insured,VA-guaranteed ,FSA/RHS • property_type - type of property for loan: One to four-family,Manufactured housing,Multifamily • loan_purpose - purpose of requested loan: Home purchase,Home improvement,Refinancing • Occupancy- applicants primary dwelling for the loan request: Owner-occupied as a principal dwelling, Not owner occupied,Not applicable. • Preapproval- status of pre-approval for loan. was requested,not requested,Not applicable • applicant_income - Applicants income • applicant_ethnicity- Ethnicity of applicant:Hispanic or Latino,Not Hispanic or Latino,Information not provided ,Not applicable,No co-applicant • applicant_race - Applicants race:American Indian or Alaska Native,Asian,Black or African American,Native Hawaiian or Other Pacific Islander,White,Information not provided,Not ap- plicable,No co-applicant • minority_population_pct - minority percentage in population • ffiecmedian_family_income - median family income in population tract. 4
  • 6. 2. Data Exploration The first step towards data exploration is with summary and descriptive statistics: 2.1 Numerical Feature Statistics Summary statistics for minimum, maximum, mean, median, standard deviation, and distinct count were calculated for numeric columns, and the results taken from 216 observations are shown here: Figure 2.1: Feature Statistics Since Rate Spread is of interest in this analysis, it was noted that the mean and median of this value are not so far off and that the standard deviation indicates that there isn’t considerable variance in the rate spread rates for applicants. A histogram of the Rate Spread column shows that the rate spread values are right-skewed – in other words, most applicants have a rate spread at the lower end of the loan rate spread range. Figure 2.2: Rate Spread 5
  • 7. 2.2 Categorical Features: • loan_type: Conventional,FHA insured,VA guaranteed,FSA/RHS • property_type: one to four family, manufactured housing, multifamily • loan_purpose: home purchase, home improvement, refinancing • occupancy:principal dwelling, not owner occupied, not applicable • preapproval: requested, not requested, not applicable • msa_md - Metropolitan statistical area/division • state_code - code for US states • county_code - indicates the county in US • applicant_ethnicity: hispanic/latino, not hispanic/latino, information not given, not appli- cable, no co-applicant • applicant_race: American native, asian, black/African American, pacific islander, white, information not given, not applicable, no co-applicant • applicant_sex : Male,Female, no information, not applicable, not applicable • lender: lenders which approved/denied loans Bar charts were created to show the frequency of these features and these insights were indicated: • Pre-approval: Pre-approval requests for loans was higher than without request. • Loan types: Conventional loan types was the highest followed by FHA insured loans. • Occupancy: Most applicants have applied for property which is the principle dwelling. • Property types: Most of the loans are for property which is a one to four dwelling. • Loan purpose: The highest number of loans for is for home purchase followed by home improvements • Co-applicant: Most applications don’t have co-applicant • Ethnicity: The highest number of applicants applying is hispanic/latino. • Race: The highest number of applicants are American Natives • Sex: The highest number of applicants are Male followed by Females. • States: no information can be derived due to large number of categories. • County: no information can be derived due to large number of categories. • Lenders: no information can be derived due to large number of categories. • Metropolitan Statistical Area: no information can be derived due to large number of categories. 6
  • 8. (a) Bar charts for Categorical variables 7
  • 9. (a) County in US (b) Metropolitan Statistical Area (c) States in US (d) Lender 8
  • 10. 3. Correlation and Apparent Relationships After data exploration and individual feature analysis; the establishment of correlations and rela- tionships between the features and the label Rate Spread is crucial to find key insights. 3.1 Numerical Feature Relationships (a) Histogram of Numeric Feature 9
  • 11. In order to find correlations and relationship between the labels and the features a correlation table, and scatter-pot matrix was generated initially to compare the numeric features with one another and the label. The key features are shown below: (a) Numeric Scatter-plot matrix with log Rate Spread 10
  • 12. 3.2 Numerical feature correlations in logarithmic scale Most of the features and relationships don’t show much apparent relationships or linearity. In an attempt improve the fit of the features to Rate Spread the log normal value for rate spread was calculated. A correlation table of all the features against linearity may give more insights on this matter. (a) Numeric Scatter-plot matrix with log Rate Spread 11
  • 13. The correlation between the numeric columns was calculated and tabulates as follows: These corre- lations validate the plots by showing the negative correlation between loan amount, income, popu- lation, minority population, and median family income with Rate Spread. Figure 3.5: Correlation Heat Map and Table of Numeric Features 12
  • 14. 3.3 Categorical Feature Relationships After exploring the relationships between the numeric features an attempt was made to discern any relationships between the categorical features and rate spread. The following boxplots show the categorical columns that seem to exhibit relationship with the log of rate spread. Figure 3.6: Box plots for Loan purpose Figure 3.7: Box plots for Loan type Figure 3.8: Box plots for Property Type 13
  • 15. Figure 3.9: Box plots for Pre-approval Figure 3.10: Box plots for Occupancy type Figure 3.11: Box plots for Co-Applicant 14
  • 16. The box plots show more clear differences in terms of the median and range of rate spread across the categorical features as well as the distribution and outliers.For example: • Loan type: The highest spread in loan rate was observed in conventional loan type. FHA- insured following and large number of outliers. • Loan purpose: home purchasing and refinancing had lower loan rates and rate spread than home improvement which ranged from 1-7.5. • Property type: Manufactured housing had highest spread in loan rate followed by multifam- ily dwelling and lastly one to four family being the lowest in rate spread and rates offered. • Pre-approval: Requested or not requested was low rate spread than Not applicable. • Occupancy: Property occupied as principal dwelling had lowest rate spread compared to others. • Co-applicant: Sames rate spread regardless of presence/absence of co applicant. 4. Data Preparation Based on the anaysis of the loan rate spread data, a regression model was used to predict the loan rate spread for the applicants. Based on the relationships identified when analyzing the data a Boosted Decision Tree was created to predict the loan rate spread. The data given has 21 variables: i 12 categorical features ii 8 numeric features iii 1 bool feature iv 1 label Since the rate spread is the label being predicted we are predicting a number and hence a regression model is needed to predict the rate spread. The categorical features pose a big problem since they are present as numbers for each class it belongs to. The data types are mismatched for what they stand for in the data set hence it needs to be changed. 4.1 Platform chosen for Data Cleaning: Python 4.1.1 Steps taken for Data Preparation: • Numbers used to classify are converted to string variables for easier modelling with regression model • There are missing values in the columns: msa_md, state_code, county_code as well as null values in the entire data set. • Missing values rows are removed since they are a small percentage of the data set and since they are categorical and represent distinct areas/state/county;replacing them with any values such as mean,median, or mode would be misleading the data set. 15
  • 17. • Null values for numeric features are replaced with median since this is a highly right skewed data set replacing with the mean or mode would be misleading since the outliers represent a large portion of the data(in this case, this is the type of data provided) 5. Machine Learning Model 5.1 Platform chosen for Predictive Modelling using Azure ML Studio • The data set is added to AzureML Studio. Clear inspection of data set for any problems is done. • Clean missing data module for replacing/removing missing values. Median replacement for numeric and removal of rows from categorical features with missing values. • Since Classes were converted to string variables this changed the data type so any other type of data to be used in modelling needs to be corrected. – Edit metadata module is used for changes. – row_id is a feature that is cleared since it is only an identification number and contributes to nothing for predictive modelling. – Categorical variables are selected and chose as categorical and pass unchanged since these will be used for modelling. – Rate spread is a float in this data set and is the label being predicted so it is chosen as label. – Co-applicant is a Boolean feature and is chosen as such. • Since our data set has large numbers of categorical values as well as some numeric and not every feature is important as observed from the Data Analysis prior to Data Preparation. – Correlation with rate spread is not apparent for every variable in fact only a few. – Hence, Filter Based Feature Selection is used to identify columns that have the greatest predictive power for the input data set. • Feature Selection module is used so as to choose the right features so I can potentially improve the accuracy and efficiency of my model predictions. • Feature Selection method used is: – Mutual Information Score – Chi-Squared • Once the features have been identified the features with the highest scores were used. I used 12 features in this case. • Split Data model is used for a 70-30 spit of data at first. 16
• Boosted Decision Tree Regression is the machine learning model used, for the following reasons:
  – A large number of categorical features are present in this regression problem, which boosted decision trees handle well.
  – This module creates an ensemble of regression trees using boosting.
  – Boosting a decision tree ensemble tends to improve accuracy, with a small risk of reduced coverage.
  – Since this algorithm has no problem dealing with categorical features, and boosting improves accuracy without potentially misleading results, I chose it.
  – To get the most out of the model I used the Tune Model Hyperparameters module, with an entire grid sweep and the coefficient of determination as the regression metric (an illustrative grid-search sketch in Python follows the corresponding figure below).
  – This is extreme gradient boosting in the style of XGBoost, which yielded a high coefficient of determination in this competition.
  – The data set is then trained, scored and evaluated to assess performance.
  – The split was later increased to 90-10: boosted trees make decisions based on the number of leaves and trees, and the more training data the model had, the better it predicted in this case.
• I used another method for feature engineering:
  – Permutation Feature Importance (a Python sketch appears after Figure 5.3 below).
  – Permutation feature importance does not measure the association between a feature and a target value; instead it captures how much influence each feature has on the model's predictions.
  – This selected different columns compared to filter based feature selection.
• Once all these steps were completed, the final data set is downloaded and used in another experiment for prediction:
  – With hyperparameter tuning of the boosted tree algorithm and the best features according to my analysis, the new data set is trained and evaluated.
  – A significant boost in the coefficient of determination is observed.
  – The model is converted to a web service and deployed.
  – Once the model is deployed, Excel is used with the request-response service to predict on the test data set, which has been cleaned and prepared in the same way as the training data set.
• The predicted rate_spread, along with row_id, is written to a CSV file and submitted to obtain the coefficient of determination against a private test set, both to check for overfitting and to gauge how well the model predicts.
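For readers reproducing the filter-based selection outside Azure ML Studio, a minimal sketch using scikit-learn's mutual information scorer is given here. The file name and the label column name follow the Python notebook in Section 6; the helper itself is illustrative, not the Azure module used in the experiment.

# Minimal sketch (not the Azure ML module) of filter-based feature selection
# with a mutual information score, assuming the cleaned training frame from
# Section 6 with the label column named 'ratespread'.
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv('pythontrain.csv')            # cleaned training data (assumed file name)
X = df.drop(columns=['row_id', 'ratespread'])
y = df['ratespread']

# Score every candidate feature against the label and keep the 12 best,
# mirroring the Filter Based Feature Selection step described above.
scores = pd.Series(mutual_info_regression(X.fillna(X.median()), y), index=X.columns)
top_features = scores.sort_values(ascending=False).head(12)
print(top_features)

The chi-squared score used in the Azure experiment has no direct regression counterpart in scikit-learn, so only the mutual information variant is sketched here.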
Figure 5.1: Categorical features convert
Figure 5.2: Filter based Feature Selection Modelling
Figure 5.3: Permutation Feature Importance Modelling
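Figure 5.3 shows the Permutation Feature Importance module in the Azure experiment. As a rough Python equivalent, scikit-learn's permutation_importance can be applied to a fitted model; the sketch below assumes the random forest model and held-out split from the notebook in Section 6 and is illustrative only.

# Sketch of permutation feature importance with scikit-learn (not the Azure
# module): each feature is shuffled on held-out data and the drop in R squared
# is recorded as that feature's influence on the model's predictions.
import pandas as pd
from sklearn.inspection import permutation_importance

# 'freg', 'X_test', 'y_test' and 'training' are assumed from the Section 6 notebook
result = permutation_importance(freg, X_test[training], y_test,
                                n_repeats=5, random_state=1234, scoring='r2')
perm_importance = pd.Series(result.importances_mean, index=training)
print(perm_importance.sort_values(ascending=False))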
(a) Correlation Coefficients (b) Feature Importance
(a) Hyperparameter Tuning with Feature Engineering
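The figure above shows the Tune Model Hyperparameters sweep in Azure ML Studio. A comparable sweep can be sketched in Python with GridSearchCV over a gradient-boosted regressor; the parameter grid below is an assumption for illustration, not the grid used in the Azure experiment.

# Illustrative grid sweep over an XGBoost regressor, scored by the coefficient
# of determination (R squared), loosely mirroring the Tune Model Hyperparameters
# module. The grid values here are assumptions, not the report's settings.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    'max_depth': [4, 8, 16],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [200, 500],
}
search = GridSearchCV(XGBRegressor(), param_grid, scoring='r2', cv=3)
# X_train, y_train and 'training' are assumed from the Section 6 notebook
search.fit(X_train[training].astype(float), y_train)
print(search.best_params_, search.best_score_)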
(a) Updated model with Feature Engineering and Hyperparameter Tuning (b) Predictive Model in Azure
(a) Azure ML plugin Excel prediction (b) World Ranking Competition
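Panel (a) shows predictions being requested from the deployed web service through the Azure ML Excel plugin. The same request-response service can also be called programmatically; in the sketch below the endpoint URL, API key and input columns are placeholders, and the exact payload schema must be taken from the deployed service's API help page.

# Hypothetical request-response call to the deployed scoring web service.
# URL, API key and the input columns are placeholders, not values from the report.
import json
import requests  # assumes the requests package is installed

url = 'https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute'  # placeholder
api_key = '<api-key>'  # placeholder
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + api_key}

payload = {
    'Inputs': {
        'input1': {
            'ColumnNames': ['loan_type', 'loan_amount', 'income'],  # illustrative subset of features
            'Values': [[1, 200, 80]],
        }
    },
    'GlobalParameters': {}
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())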
6. Machine Learning Model using Python in Jupyter Notebooks

6.1 Python Codes Used

import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
%matplotlib inline

# Load the training features, training labels and test features
train = pd.read_csv('train_values.csv')
label = pd.read_csv('train_labels.csv')
test = pd.read_csv('test_values.csv')

# Join features and labels on row_id
df = pd.merge(train, label, how='inner', on='row_id')
df.shape
df.columns
df.select_dtypes(include=['float64']).describe().transpose()

# Rename the columns to shorter names; the last column is the label
df.columns = ['row_id', 'loan_type', 'property_type', 'loan_purpose', 'occupancy',
              'loan_amount', 'preapproval', 'msa_md', 'state_code', 'county_code',
              'ethnicity', 'race', 'sex', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam',
              'lender', 'co_applicant', 'ratespread']

# In these columns -1 is a placeholder for "missing"; convert it to NaN
column = {'msa_md': -1, 'state_code': -1, 'county_code': -1}
df.replace(column, np.nan, inplace=True)
df.isna().sum()

# Fill the missing values column by column (the area codes are filled with the
# mode here so that no NaNs remain and the same script can be applied to the
# test file; the Azure pipeline removed these rows instead)
df['msa_md'] = df['msa_md'].fillna(df['msa_md'].mode()[0])
df['state_code'] = df['state_code'].fillna(df['state_code'].mode()[0])
df['county_code'] = df['county_code'].fillna(df['county_code'].mode()[0])
df['income'] = df['income'].fillna(df['income'].mode()[0])
df['population'] = df['population'].fillna(df['population'].mode()[0])
df['min_pop'] = df['min_pop'].fillna(df['min_pop'].mode()[0])
df['medianfam_income'] = df['medianfam_income'].fillna(df['medianfam_income'].mode()[0])
df['tract_msamd'] = df['tract_msamd'].fillna(df['tract_msamd'].median())
df['owner_occup'] = df['owner_occup'].fillna(df['owner_occup'].mode()[0])
df['oneto4_fam'] = df['oneto4_fam'].fillna(df['oneto4_fam'].mode()[0])

# Save the cleaned training data; the same cleaning steps are repeated on
# test_values.csv to produce pythontest.csv
df.to_csv('pythontrain.csv', index=False)
import pandas as pd
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import sklearn.metrics as sklm
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
pd.pandas.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

# Load the cleaned training and test data produced in the previous step
df = pd.read_csv('pythontrain.csv')
test = pd.read_csv('pythontest.csv')
df.shape
df.columns
df.isna().sum()
test.head()

# The categorical codes are stored as numbers; cast them to object so they are
# treated as categories rather than quantities
for col in ['loan_type', 'property_type', 'loan_purpose', 'occupancy', 'preapproval',
            'msa_md', 'state_code', 'county_code', 'ethnicity', 'race', 'sex', 'lender']:
    df[col] = df[col].astype(object)
df.dtypes

numeric = df[['loan_amount', 'income', 'population', 'min_pop', 'medianfam_income',
              'tract_msamd', 'owner_occup', 'oneto4_fam']]
categorical = df[['loan_type', 'property_type', 'loan_purpose', 'occupancy', 'preapproval',
                  'msa_md', 'state_code', 'county_code', 'ethnicity', 'race', 'sex', 'lender']]

# Log-transform the right-skewed numeric features for visualisation
log_num = np.log(numeric)

def plot_scatter(df, cols, col_y='ratespread'):
    # Scatter plot of each column in cols against the label
    sns.set_style("darkgrid")
    for col in cols:
        fig = plt.figure(figsize=(7, 6))
        ax = fig.gca()
        df.plot.scatter(x=col, y=col_y, ax=ax)
        ax.set_title('Scatter plot of ' + col_y + ' vs. ' + col)
        ax.set_xlabel(col)
        ax.set_ylabel(col_y)
        plt.show()

# Plot the log-scaled numeric features against rate spread
plot_scatter(log_num.assign(ratespread=df['ratespread']), log_num.columns)
# Separate the features from the label and hold out 10% of the data for testing
X = df.drop('ratespread', axis='columns')
y = df.ratespread
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train.shape, X_test.shape
X_train.describe().transpose()

# Keep only the columns selected during feature engineering
training = [x for x in X_train.columns if x not in
            ['row_id', 'county_code', 'ethnicity', 'sex', 'population', 'tract_msamd',
             'owner_occup', 'oneto4_fam', 'co_applicant', 'ratespread']]

import xgboost as xgb

# Plot styling
large = 22; med = 16; small = 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline

# Scaler fitted for reference; the tree-based models below do not require scaling
scaler = StandardScaler()
scaler.fit(X_train[training])

# XGBoost regressor; the categorical codes are passed through as numeric values
xgbmod = xgb.XGBRegressor(max_depth=128, learning_rate=0.01, n_estimators=1000)
eval_set = [(X_test[training].astype(float), y_test)]
xgbmod.fit(X_train[training].astype(float), y_train, eval_set=eval_set, verbose=False)
prediction0 = xgbmod.predict(X_train[training].astype(float))
print('XGB train mse: {}'.format(mean_squared_error(y_train, prediction0)))
prediction0 = xgbmod.predict(X_test[training].astype(float))
print('XGB test mse: {}'.format(mean_squared_error(y_test, prediction0)))

# Random forest regressor for comparison
freg = RandomForestRegressor(n_estimators=2000, max_depth=32, random_state=1234)
freg.fit(X_train[training], y_train)
prediction = freg.predict(X_train[training])
print('random forest train mse: {}'.format(mean_squared_error(y_train, prediction)))
prediction = freg.predict(X_test[training])
print('random forest test mse: {}'.format(mean_squared_error(y_test, prediction)))

# Predict the rate spread for the competition test set and write the submission file
prediction1 = []
for model in [freg]:
    prediction1.append(pd.Series(model.predict(test[training])))
final = pd.concat(prediction1, axis=1).mean(axis=1)
temp1 = pd.concat([test.row_id, final], axis=1)
temp1.columns = ['row_id', 'ratespread']
temp1.head()
temp1.to_csv('prediction.csv', index=False)

# Feature importances from the random forest
importance = pd.Series(freg.feature_importances_)
importance.index = training
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6),
                    color=['darkblue', 'red', 'gold', 'pink', 'purple', 'darkcyan'])
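The notebook above reports mean squared error, while the competition metric is the coefficient of determination. Under the same held-out split, R squared can be checked with a short addition such as the following sketch, which reuses the variables defined above.

# Report R squared on the held-out split, the metric used in the competition
from sklearn.metrics import r2_score

print('random forest test R^2: {:.3f}'.format(
    r2_score(y_test, freg.predict(X_test[training]))))
print('XGB test R^2: {:.3f}'.format(
    r2_score(y_test, xgbmod.predict(X_test[training].astype(float)))))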
7. Conclusion and Recommendation

This analysis concludes that the interest rate spread can be predicted with confidence from the information collected on loan applications. In particular, loan amount, applicant income, loan type, property type and loan purpose have a significant effect in determining the rate spread.
The trained model has proven effective in predicting the rate spread from the information given on loan applications. The coefficient of determination (R squared) was 0.83 during training, and the model achieved an R squared of 0.77 when tested against the test data in the competition. Although there is a drop in R squared, it is not significant, and it can be concluded that the model generalizes well. The regression line in the joint plot shows a healthy correlation between the predicted and actual values, and the scatter of the points gives an intuitive sense of the error spread.
Deeper domain knowledge is required for a better understanding of the data. As a way forward, expert consultation can be sought to guide better feature engineering. As more data is collected, the model should be retrained with the new data; further feature engineering and retraining will have a significant impact on the model's future predictive capability. The model can be deployed as a web service, which will need to be administered and monitored.

Figure 7.1: Jointplot of predicted label vs training label
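For reference, a joint plot like Figure 7.1 can be produced with seaborn; the sketch below assumes the held-out labels and the random forest predictions from the notebook in Section 6, so the variable names are assumptions rather than the exact code used to draw the figure.

# Sketch of the joint plot of predicted vs. actual rate spread (cf. Figure 7.1),
# assuming y_test, freg, X_test and 'training' from the Section 6 notebook
import seaborn as sns
import matplotlib.pyplot as plt

predicted = freg.predict(X_test[training])
g = sns.jointplot(x=y_test, y=predicted, kind='reg')
g.set_axis_labels('actual rate spread', 'predicted rate spread')
plt.show()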