Microsoft Professional Program: Data Science
DAT102x Capstone Project
Predicting Mortgage Rates from Government Data
Mashfiq E Shahriar
November 2019
Contents
1 Executive Summary
1.1 Problem Description
1.2 Analysis Process
1.3 Key Findings
2 Data Exploration
2.1 Dataset
2.2 Numerical Feature Relationships
2.3 Categorical Feature Frequency
3 Data Preparation
3.1 Data Pre-processing
3.2 Data Wrangling
4 Machine Learning Model
4.1 Choosing Machine Learning Algorithm
4.2 Tuning Hyper-parameters and Cross-Validation
4.3 Joint Plot of Predicted Rate Spread
5 Conclusions & Way Forward
6 Appendix
6.1 Machine Learning with Python in Jupyter Notebook
6.2 Machine Learning Competition Dashboard
List of Figures
2.1 Dataset
2.2 Summary Statistics
2.3 Distribution of Training Label
2.4 Distribution of Numerical Features
2.5 Scatter-Plot Matrix for Numeric Features
2.6 Correlation Coefficients
2.7 Bar Charts
3.1 Feature Data Types
3.2 Data Set After Changing to Descriptive Strings
3.3 Feature Selection
4.1 Azure ML Tuning Hyperparameters Workflow
4.2 Tuning Hyperparameters Results
4.3 Azure ML Cross Validation Workflow
4.4 Cross Validation Results
4.5 Azure ML Final Model Workflow
4.6 Jointplot of Predicted Rate Spread
6.1 Competition Score
1. Executive Summary
1.1 Problem Description
This document presents an analysis and predictive modelling of data concerning mortgage loan
applications and interest rate spread. The goal was to predict the rate spread of mortgage
applications from the given data set, which is adapted from data published by the Federal Financial
Institutions Examination Council (FFIEC). Link to Problem. The value to be predicted was a
float, making this a regression problem. To measure the accuracy of the regression, a metric known
as the "coefficient of determination" or "R squared" was used. R squared is the proportion of the
variance in the dependent variable that is explained by the predictive model. The closer the R squared
value is to 1, the higher the accuracy of the predictive model. R squared is calculated by the formula:
R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²
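As an illustration of this metric, here is a minimal sketch (not part of the original analysis) that computes R squared both by the formula above and with scikit-learn's built-in helper, using invented toy values:

import numpy as np
from sklearn.metrics import r2_score

# Toy values for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)       # 0.989
print(r2_score(y_true, y_pred))  # same value via scikit-learn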
1.2 Analysis Process
The initial data set given for the problem contained 200,000 cases of mortgage loan applications.
There were 21 variables, i.e. features, in the data associated with the target label, rate spread.
To tackle the problem:
Data exploration was carried out, involving descriptive statistics and distribution plots for the
numerical features, followed by pair-plot and heat-map visualizations to reveal potential relationships
among them. Bar charts were used to identify relationships in the categorical variables.
Data preparation was done to pre-process the data, i.e. clean missing data, change features to
appropriate data types, and transform categorical values represented by numbers into strings. Further
wrangling involved detection and replacement of outliers, normalization of numeric features,
conversion of categorical features to Boolean columns (also known as 'one-hot encoding'), and feature
selection by analyzing features with Pearson correlation, Kendall correlation, mutual information,
and Chi-squared.
A machine learning model was created by selecting the best algorithm among a few regressors,
based on R squared. The chosen model was tuned for hyper-parameters, cross-validated, and deployed
to predict the test data.
1.3 Key Findings
The most important features for predictive modelling of rate spread turned out to be:
'loan_amount', 'loan_type', 'property_type', 'preapproval', 'loan_purpose', 'ffiecmedian_family_income',
'applicant_income', and 'minority_population_pct'.
The boosted decision tree regression algorithm performed best in training the model, achieving
the highest R squared compared to linear regression and decision forest regression. The trained
model was observed to generalize well when cross-validated, with a very small standard deviation
in R squared across folds. Upon deployment, the model predicted rate spread for the test data.
When compared to the actual rate spread withheld in the machine learning competition, an R
squared of 0.77 was achieved. The model was ranked #12 out of 700+ competitors in the machine
learning competition. Link to Competition
2. Data Exploration
2.1 Dataset
Data exploration started with visualization of the data frame. The original dataset can be seen
below, containing 23 columns: 21 feature columns, 1 label column and 1 ID column.
Figure 2.1: Dataset
Numerical Features:
• loan_amount: Size of the requested loan in thousands of dollars.
• applicant_income: In thousands of dollars.
• population: Total population in tract.
• minority_population_pct: Percentage of minority population to total population for tract.
• ffiecmedian_family_income: FFIEC median family income in dollars for the MSA/MD in which
the tract is located.
• tract_to_msa_md_income_pct: Percentage of tract median family income compared to
MSA/MD median family income.
• number_of_owner-occupied_units: Number of dwellings, including individual condominiums,
that are lived in by the owner.
• number_of_1_to_4_family_units: Dwellings that are built to house fewer than 5 families.
Categorical Features:
• loan_type: Indicates whether the loan granted, applied for, or purchased was conventional,
government-guaranteed, or government-insured.
• property_type: Indicates whether the loan or application was for a one-to-four-family dwelling
(other than manufactured housing), manufactured housing, or multifamily dwelling.
• loan_purpose: Indicates whether the purpose of the loan or application was for home purchase,
home improvement, or refinancing.
• occupancy: Indicates whether the property to which the loan application relates will be the
owner’s principal dwelling.
• preapproval: Indicates whether the application or loan involved a request for a pre-approval of
a home purchase loan.
• msa_md: Metropolitan Statistical Area/Metropolitan Division.
• state_code: U.S. state.
• county_code: County.
• applicant_ethnicity: Ethnicity of the applicant.
• applicant_race: Race of the applicant.
• applicant_sex: Sex of the applicant.
• lender: The authority in approving or denying this loan.
• co_applicant: Indicates whether there is a co-applicant (often a spouse) or not.
2.2 Numerical Feature Relationships
Summary statistics for minimum, maximum, mean, standard deviation, and distinct count were
calculated for numeric columns, and the results from 200,000 observations are shown in Figure 2.2.
Figure 2.2: Summary Statistics
Figure 2.3: Distribution of Training Label
Kernel density estimate plots of the training label and important numerical features were generated
to study their distributions. The training label in Figure 2.3 appears right-skewed with a high peak,
so high positive values for skewness and kurtosis can be assumed. A high kurtosis value indicates a
likelihood of outliers, which can also be seen visually. The majority of the data lie below 10, with a
few sparse positive outliers, as portrayed by the rug on the x-axis. The discontinuity of the rug also
shows that it is a distribution of discrete values.
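The skewness and kurtosis referred to above can be checked numerically. A minimal sketch with SciPy, assuming df is the merged training frame and 'ratespread' is the label column, as in the appendix code:

from scipy.stats import skew, kurtosis

label = df['ratespread'].dropna()
print('skewness:', skew(label))             # > 0 confirms the right skew
print('excess kurtosis:', kurtosis(label))  # > 0 confirms heavier tails than a normal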
(a) Distribution of Applicant Income (b) Distribution of Loan Amount
(c) Distribution of Median Family Income (d) Distribution of Minority Population Percentage
Figure 2.4: Distribution of Numerical Features
Applicant income in Figure 2.4a appears right-skewed with a high peak, so high positive values for
skewness and kurtosis can again be assumed, and the high kurtosis indicates a likelihood of outliers,
which can also be seen visually. The majority of the data lie below 200 (x-axis in thousands of
dollars), with quite a few positive outliers; outliers become more extreme and sparser beyond 2,000.
Loan amount in Figure 2.4b appears to have a very similar distribution to applicant income.
Loan amount and applicant income are positively correlated, as shown in the scatter plot matrix
later in the report.
Median family incomes of the metropolitan statistical areas in Figure 2.4c appear to be normally
distributed with a mean around 65,000. There are a few low-income areas below 40,000, as shown
by the stacks in the rug on the x-axis.
The minority population percentage distribution in Figure 2.4d is right-skewed. The mean of the
distribution is 34 percent, higher than what is generally witnessed in most areas; the higher mean is
due to positive skewness, as some areas have as high as 100 percent minority population. The mode
of the distribution is around 5 percent, the demographic of the highest frequency.
The following scatter plot matrix in Figure 2.5 was generated to visualize the relationships of the
numerical features with each other and the training label.
Figure 2.5: Scatter-Plot Matrix for Numeric Features
Figure 2.6a below shows the correlations between the log-transformed numerical features and
the log-transformed rate spread. This is followed by an annotated heat map in Figure 2.6b that shows
the correlations among the features and the rate spread. Log values were used to accentuate the
correlation numbers.
(a) Correlations of Numerical Features with Training Label
(b) Heatmap of Correlations
Figure 2.6: Correlation Coefficients
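A sketch of how such a log-transformed correlation heat map can be produced with pandas and Seaborn; it assumes the `numeric` feature frame and 'ratespread' label from the appendix code, and the + 1 shift (guarding against log of zero) is an assumption of this sketch:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

log_df = np.log(numeric.join(df['ratespread']) + 1)
corr = log_df.corr()  # pairwise Pearson correlations of the log-transformed columns

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlations of log-transformed features and rate spread')
plt.show()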
2.3 Categorical Feature Frequency
Figure 2.7 below shows the frequency of the categorical features.
Figure 2.7: Bar Charts
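Frequency bar charts like these can be reproduced with a short loop; this sketch assumes the `categorical` frame of categorical columns defined in the appendix code:

import matplotlib.pyplot as plt

for col in categorical.columns:
    categorical[col].value_counts().plot.bar(figsize=(8, 4), title=col)
    plt.ylabel('count')
    plt.show()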
3. Data Preparation
3.1 Data Pre-processing
The initial data set had missing values in eight of the numerical features. During the data exploration
stage of the project it was discovered that the features with missing values had right-skewed
distributions. Therefore, the missing values were replaced by the median values rather than the mean.
Missing values were replaced instead of dropping the rows in order not to lose training data points.
The initial data set also had columns with inappropriate data types: many of the categorical
columns were typed as integers or floats. Columns were changed to the appropriate data types
as shown in Figure 3.1. After correcting the data types, the numbers used to represent the unique
categories in the categorical columns were converted to descriptive strings. This not only makes the
data set visually readable but also eliminates problems that may arise during the predictive modelling
process. The data set after these changes is shown below in Figure 3.2, followed by a pandas sketch
of these steps.
Figure 3.1: Feature Data Types
Figure 3.2: Data Set After Changing to Descriptive Strings
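A pandas sketch of the pre-processing steps described above: median imputation and mapping numeric codes to descriptive, categorical strings. The loan_type mapping shown is a hypothetical illustration, not the full mapping used in the report:

import pandas as pd

# Median imputation suits right-skewed features better than the mean
df['applicant_income'] = df['applicant_income'].fillna(df['applicant_income'].median())

# Replace numeric category codes with descriptive strings, then mark the column categorical
loan_type_map = {1: 'Conventional', 2: 'FHA-insured',         # hypothetical code-to-label map
                 3: 'VA-guaranteed', 4: 'FSA/RHS-guaranteed'}
df['loan_type'] = df['loan_type'].map(loan_type_map).astype('category')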
3.2 Data Wrangling
The following four processes were carried out in Azure Machine Learning Studio (a scikit-learn
sketch of equivalent steps follows Figure 3.3):
• Detection of extreme outliers: the 'Clip Values' module was used to detect values above the 99th
percentile and replace them with a system-generated threshold value.
• Normalization of numeric features: the 'Normalize Data' module was used to rescale numeric
features to a standard range. Z-score rescaling was used to scale and constrain the numeric
features so that model training is not dominated by features with larger values. Normalization
did not appear to have any effect on tree-based algorithms, so it was omitted for the ensemble
tree algorithms.
• Converting categorical features to Boolean columns, also known as one-hot encoding: the 'Convert
to Indicator Values' module was used. One-hot encoding likewise did not appear to have any
effect on tree-based algorithms and was omitted for the ensemble tree algorithms.
• Feature selection: the 'Filter Based Feature Selection' module was used in Azure ML. Pearson
correlation and Kendall correlation were applied only to the numeric features, while mutual
information and Chi-squared were applied to all features, as these two methods can handle both
numeric and categorical features. The 8 features with the most predictive power for rate spread,
using Pearson correlation and mutual information, are shown below in Figure 3.3.
Figure 3.3: Feature Selection
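For readers working outside Azure ML Studio, here is a rough scikit-learn/pandas equivalent of the four wrangling steps above; `numeric` is the numeric feature frame from the appendix code, and the listed categorical columns are illustrative:

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# 1. Clip extreme outliers at the 99th percentile (cf. the 'Clip Values' module)
for col in numeric.columns:
    df[col] = df[col].clip(upper=df[col].quantile(0.99))

# 2. Z-score normalization (cf. 'Normalize Data')
df[numeric.columns] = (df[numeric.columns] - df[numeric.columns].mean()) / df[numeric.columns].std()

# 3. One-hot encoding (cf. 'Convert to Indicator Values')
df = pd.get_dummies(df, columns=['loan_type', 'property_type', 'loan_purpose'])

# 4. Filter-based feature selection via mutual information (cf. 'Filter Based Feature Selection')
X = df.drop(columns='ratespread').select_dtypes('number')
scores = pd.Series(mutual_info_regression(X, df['ratespread']), index=X.columns)
print(scores.sort_values(ascending=False).head(8))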
4. Machine Learning Model
4.1 Choosing Machine Learning Algorithm
The problem at hand is the prediction of continuous values, with training inputs and training labels
available. Therefore, it is supervised learning of the regression type. Azure Machine Learning
Studio was used for training the model.
Factors to consider when choosing a machine learning algorithm:
• Problem category: specific algorithms are meant to solve specific problems, and every algorithm
has its own inductive bias. It has already been established that our problem is supervised
regression.
• Size of the data set: here it boils down to the number of features. Certain data sets have a
large number of features compared to the number of data points, which becomes a bias/variance
issue: low-bias, high-variance models tend to overfit on data sets with many features, while
high-bias, low-variance models may not overfit but will be less accurate. Our data set has
ample data points compared to the number of features.
• Linearity: some algorithms depend on linearity in the data set; for example, linear regression
assumes data trends follow a straight line. In our case there are many nominal
variables, so the data set lacks linearity.
• Accuracy required: sometimes only an approximation is needed, and at other times higher
accuracy is needed. Some algorithms are not very accurate but have the advantage that they do
not overfit. In our case the metric of measurement is R squared, so accuracy is of utmost
importance.
• Training time: training time and computational power requirements differ for each algorithm.
Models such as neural network regression need long training times. Larger data sets
need longer training, and higher accuracy is also related to longer training time.
Taking the above points into account, 3 algorithms were chosen; their R squared values on the data
set were as follows (a scikit-learn sketch of such a comparison follows the list):
1. Boosted Decision Tree Regression: 0.74
2. Decision Forest Regression: 0.68
3. Linear Regression: 0.41
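A sketch of how a comparable shoot-out could be run with scikit-learn analogues of the three Azure ML regressors; X and y are assumed to be the prepared feature matrix and rate-spread label:

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

models = {
    'Boosted Decision Tree': GradientBoostingRegressor(),  # analogue of Azure's boosted tree
    'Decision Forest': RandomForestRegressor(n_estimators=100),
    'Linear Regression': LinearRegression(),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(f'{name}: R^2 = {r2:.2f}')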
Therefore, Boosted Decision Tree Regression was chosen for further tuning to enhance accuracy.
Azure's boosted tree for regression uses gradient boosting, which works as follows (a from-scratch
sketch follows the steps):
1. Gradient boosting starts by making a first leaf, which is the average of the label values.
2. Pseudo-residuals are then calculated for each data point by subtracting the average label value,
i.e. the first leaf, from the actual label values.
3. A decision tree is then created to predict the pseudo-residuals instead of the label. The tree's
predicted residual value is multiplied by the learning rate and added to the first leaf, resulting
in a new prediction.
4. The new prediction is used to calculate a new set of pseudo-residuals, and another tree is created
to predict them; its output is again multiplied by the learning rate and added to the previous
prediction. This process is repeated to reduce the residuals and converge the predictions toward
the target values. It is therefore a sequential ensemble method: it fits a sequence of weak learners
to make a strong model, with predictions combined through a weighted (learning-rate-scaled)
sum to make the final prediction.
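A from-scratch sketch of these four steps for squared-error regression; this is a simplified illustration of the technique, not Azure's implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting for squared-error regression (steps 1-4 above)."""
    base = np.mean(y)                 # step 1: the first 'leaf' is the label mean
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred          # steps 2 and 4: pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)        # step 3: a tree learns the residuals
        pred = pred + learning_rate * tree.predict(X)  # scaled update toward targets
        trees.append(tree)
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred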
4.2 Tuning Hyper-parameters and Cross-Validation
Hyper-parameters for the algorithm:
• Maximum number of leaves per tree
• Minimum number of samples per leaf node
• Learning rate
• Number of trees constructed
Refer to Figure 4.1 below, which shows the workflow in Azure ML Studio used to train the model.
Two models were trained, one with tuned hyper-parameters and one without. The model with
tuned hyper-parameters had better accuracy than the model with default parameters, as shown in
the metrics in Figure 4.2 on the next page, after which a scikit-learn sketch of an equivalent grid
search follows.
Figure 4.1: Azure ML Tuning Hyperparameters Workflow
Figure 4.2: Tuning Hyperparameters Results
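The same search could be expressed with scikit-learn's GridSearchCV over a gradient-boosted regressor; the grid values here are illustrative assumptions, not the report's tuned settings:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_leaf_nodes': [20, 32, 64],     # maximum number of leaves per tree
    'min_samples_leaf': [10, 20, 50],   # minimum number of samples per leaf node
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [100, 200, 500],    # number of trees constructed
}
search = GridSearchCV(GradientBoostingRegressor(), param_grid, scoring='r2', cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)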
After retrieving the best hyper-parameter values from the 'Tune Model Hyperparameters' module,
further model training was carried out. Cross-validation was done to examine the model's behaviour
across 10 folds; refer to Figure 4.3 below for the Azure ML workflow. The model appeared to
generalise well, as seen in Figure 4.4 on the next page: R squared values across the folds were close,
with a small standard deviation of 0.0024, and the mean R squared across the 10 folds was 0.7945.
A scikit-learn sketch follows Figure 4.4.
Figure 4.3: Azure ML Cross Validation Workflow
Figure 4.4: Cross Validation Results
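A scikit-learn sketch of the 10-fold cross-validation step, reusing search.best_estimator_ from the tuning sketch above:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(search.best_estimator_, X, y, cv=10, scoring='r2')
print('mean R^2: {:.4f}, std: {:.4f}'.format(scores.mean(), scores.std()))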
4.3 Joint Plot of Predicted Rate Spread
Final model training was done with 75% of the training data and tested with the remaining 25%;
refer to Figure 4.5 for the Azure ML workflow. The root mean squared error (RMSE) was 0.73.
It is apparent from the joint plot in Figure 4.6 on the next page that the model predicted the
rate spread as continuous values, whereas the training labels were discrete. The predicted values have
a right-skewed distribution, compared to the multiple peaks in the training label distribution. The
fitted regression line shows a linear relationship between the predicted values and the training label.
It can also be inferred from the fading scatter points that the model does a better job of predicting
high and low values of rate spread and is more error-prone for the middle values (3, 4, 5).
Figure 4.5: Azure ML Final Model Workflow
Figure 4.6: Jointplot of Predicted Rate Spread
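A Seaborn sketch of such a joint plot, assuming y_test holds the held-out labels and predictions the model's output for the same rows:

import pandas as pd
import seaborn as sns

results = pd.DataFrame({'actual': y_test.values, 'predicted': predictions})
sns.jointplot(data=results, x='predicted', y='actual', kind='reg')  # 'reg' adds the fitted line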
5. Conclusions & Way Forward
This analysis concludes that predictions of interest rate spread can be confidently made from the
information collected in loan applications. In particular, loan amount, loan type, property type,
preapproval requirement, and loan purpose have a significant effect on determining the interest rate
spread. The trained model has proven sufficiently effective at predicting rate spread from the given
loan application information. Future work can involve acquiring more domain knowledge to continue
feature engineering and data collection. The model predicted rate spread as continuous float values
from a training label that was discrete; further understanding of whether rate spread should be
discrete or a continuous float is needed, and expert domain knowledge can be consulted on this point.
The trained model can be deployed as a web service and used for prediction in a data pipeline.
Over time, as more data is collected, the new data should be used to retrain and update the existing
model. Retraining the model periodically with new data and enhanced features will increase its
predictive capabilities.
Resources utilised for this analysis were: Azure ML Studio, Jupyter Notebook, Anaconda, Python
3, and the packages scikit-learn, SciPy, NumPy, pandas, Matplotlib, and Seaborn.
6. Appendix
6.1 Machine Learning with Python in Jupyter Notebook
import pandas as pd
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from sklearn import preprocessing, linear_model
import sklearn.model_selection as ms
import sklearn.metrics as sklm
%matplotlib inline

# Load the competition files and merge features with labels on the row id
train = pd.read_csv('train_values.csv')
label = pd.read_csv('train_labels.csv')
test = pd.read_csv('test_values.csv')
df = pd.merge(train, label, how='inner', on='row_id')
df.shape
df.columns
df.select_dtypes(include=['float64']).describe().transpose()

# Shorter working names for the columns
df.columns = ['row_id', 'loan_type', 'property_type', 'loan_purpose', 'occupancy',
              'loan_amount', 'preapproval', 'msa_md', 'state_code', 'county_code',
              'ethnicity', 'race', 'sex', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam',
              'lender', 'co_applicant']

# -1 is a sentinel for missing values in these coded columns
df.replace({'msa_md': -1, 'state_code': -1, 'county_code': -1}, np.nan, inplace=True)
df.isna().sum()

# Impute missing values; mode()[0] extracts a scalar, since mode() returns a Series
df['state_code'] = df['state_code'].fillna(df['state_code'].mode()[0])
df['income'] = df['income'].fillna(df['income'].mode()[0])
df['population'] = df['population'].fillna(df['population'].mode()[0])
df['min_pop'] = df['min_pop'].fillna(df['min_pop'].mode()[0])
df['medianfam_income'] = df['medianfam_income'].fillna(df['medianfam_income'].mode()[0])
df['tract_msamd'] = df['tract_msamd'].fillna(df['tract_msamd'].median())
df['owner_occup'] = df['owner_occup'].fillna(df['owner_occup'].mode()[0])
df['oneto4_fam'] = df['oneto4_fam'].fillna(df['oneto4_fam'].mode()[0])

df.to_csv('pythontest.csv', index=False)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from sklearn import preprocessing, linear_model
import sklearn.model_selection as ms
import sklearn.metrics as sklm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import warnings

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

df = pd.read_csv('pythontrain.csv')
test = pd.read_csv('pythontest.csv')
df.shape
df.columns
df.isna().sum()
test.head()

# Cast the numerically coded categorical columns to object dtype
categorical_cols = ['loan_type', 'property_type', 'loan_purpose', 'occupancy',
                    'preapproval', 'msa_md', 'state_code', 'county_code',
                    'ethnicity', 'race', 'sex', 'lender']
df[categorical_cols] = df[categorical_cols].astype(object)
df.dtypes

numeric = df[['loan_amount', 'income', 'population', 'min_pop', 'medianfam_income',
              'tract_msamd', 'owner_occup', 'oneto4_fam']]
categorical = df[categorical_cols]
log_num = np.log(numeric)

def plot_scatter(df, cols, col_y='ratespread'):
    """Scatter plot of each column in cols against the label col_y."""
    sns.set_style("darkgrid")
    for col in cols:  # iterating a DataFrame yields its column names
        fig = plt.figure(figsize=(7, 6))
        ax = fig.gca()
        df.plot.scatter(x=col, y=col_y, ax=ax)
        ax.set_title('Scatter plot of ' + col_y + ' vs. ' + col)
        ax.set_xlabel(col)
        ax.set_ylabel(col_y)
        plt.show()

plot_scatter(df, log_num)
X = df.drop('ratespread', axis='columns')
y = df.ratespread
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train.shape, X_test.shape
X_train.describe().transpose()

# Keep only the selected features for training
training = [x for x in X_train.columns if x not in
            ['row_id', 'county_code', 'ethnicity', 'sex', 'population',
             'tract_msamd', 'owner_occup', 'oneto4_fam', 'co_applicant', 'ratespread']]

# Plot styling
import matplotlib as mpl
import warnings; warnings.filterwarnings(action='once')
large, med, small = 22, 16, 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline
import random

scaler = StandardScaler()
scaler.fit(X_train[training])

# Random forest regressor; "mse" was the squared-error criterion in scikit-learn at the time
freg = RandomForestRegressor(n_estimators=2000, max_depth=32,
                             criterion="mse", random_state=1234)
freg.fit(X_train[training], y_train)
prediction = freg.predict(X_train[training])
print('random forest train mse: {}'.format(mean_squared_error(y_train, prediction)))
prediction = freg.predict(X_test[training])
print('random forest test mse: {}'.format(mean_squared_error(y_test, prediction)))

# Predict on the competition test set (the list allows averaging several models)
prediction1 = []
for model in [freg]:
    prediction1.append(pd.Series(model.predict(test[training])))
final = pd.concat(prediction1, axis=1).mean(axis=1)
temp1 = pd.concat([test.row_id, final], axis=1)
temp1.columns = ['row_id', 'ratespread']
temp1.head()
temp1.to_csv('prediction.csv', index=False)

# Rank and plot feature importances
importance = pd.Series(freg.feature_importances_, index=training)
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6),
                    color=['darkblue', 'red', 'gold', 'pink', 'purple', 'darkcyan'])
6.2 Machine Learning Competition Dashboard
Figure 6.1: Competition Score

More Related Content

What's hot

Dissertation / Master's Thesis
Dissertation / Master's ThesisDissertation / Master's Thesis
Dissertation / Master's ThesisVimal Gopal
 
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...Evaldas Taroza
 
Web Application Forensics: Taxonomy and Trends
Web Application Forensics: Taxonomy and TrendsWeb Application Forensics: Taxonomy and Trends
Web Application Forensics: Taxonomy and TrendsKrassen Deltchev
 
Algorithms for Reinforcement Learning
Algorithms for Reinforcement LearningAlgorithms for Reinforcement Learning
Algorithms for Reinforcement Learningmustafa sarac
 
Ibm spss categories
Ibm spss categoriesIbm spss categories
Ibm spss categoriesDũ Lê Anh
 
Study of different approaches to Out of Distribution Generalization
Study of different approaches to Out of Distribution GeneralizationStudy of different approaches to Out of Distribution Generalization
Study of different approaches to Out of Distribution GeneralizationMohamedAmineHACHICHA1
 
MSc(Mathematical Statistics) Theses
MSc(Mathematical Statistics) ThesesMSc(Mathematical Statistics) Theses
MSc(Mathematical Statistics) ThesesCollins Okoyo
 
M re dissertation 97-2003
M re  dissertation 97-2003M re  dissertation 97-2003
M re dissertation 97-2003Vimal Gopal
 
Setting Goals and Choosing Metrics for Recommender System Evaluations
Setting Goals and Choosing Metrics for Recommender System EvaluationsSetting Goals and Choosing Metrics for Recommender System Evaluations
Setting Goals and Choosing Metrics for Recommender System EvaluationsGunnar Fabritius (Schroeder)
 
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...Yomna Mahmoud Ibrahim Hassan
 
Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...
Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...
Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...Manjunath Jois
 
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansionQuery-drift prevention for robust query expansion
Query-drift prevention for robust query expansionLiron Zighelnic
 

What's hot (18)

Notes econometricswithr
Notes econometricswithrNotes econometricswithr
Notes econometricswithr
 
Thesis van Heesch
Thesis van HeeschThesis van Heesch
Thesis van Heesch
 
Tutorial
TutorialTutorial
Tutorial
 
Dissertation / Master's Thesis
Dissertation / Master's ThesisDissertation / Master's Thesis
Dissertation / Master's Thesis
 
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
 
Web Application Forensics: Taxonomy and Trends
Web Application Forensics: Taxonomy and TrendsWeb Application Forensics: Taxonomy and Trends
Web Application Forensics: Taxonomy and Trends
 
Algorithms for Reinforcement Learning
Algorithms for Reinforcement LearningAlgorithms for Reinforcement Learning
Algorithms for Reinforcement Learning
 
Ibm spss categories
Ibm spss categoriesIbm spss categories
Ibm spss categories
 
Study of different approaches to Out of Distribution Generalization
Study of different approaches to Out of Distribution GeneralizationStudy of different approaches to Out of Distribution Generalization
Study of different approaches to Out of Distribution Generalization
 
MSc(Mathematical Statistics) Theses
MSc(Mathematical Statistics) ThesesMSc(Mathematical Statistics) Theses
MSc(Mathematical Statistics) Theses
 
Thesis_Main
Thesis_MainThesis_Main
Thesis_Main
 
M re dissertation 97-2003
M re  dissertation 97-2003M re  dissertation 97-2003
M re dissertation 97-2003
 
Setting Goals and Choosing Metrics for Recommender System Evaluations
Setting Goals and Choosing Metrics for Recommender System EvaluationsSetting Goals and Choosing Metrics for Recommender System Evaluations
Setting Goals and Choosing Metrics for Recommender System Evaluations
 
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
 
Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...
Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...
Master Thesis - A Column Generation Approach to Solve Multi-Team Influence Ma...
 
thesis
thesisthesis
thesis
 
my book
my bookmy book
my book
 
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansionQuery-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
 

Similar to Microsoft Professional Capstone: Data Science

Predicting Mortgage Rates From Government Data
Predicting Mortgage Rates From Government DataPredicting Mortgage Rates From Government Data
Predicting Mortgage Rates From Government DataMehnaz Newaz
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction modelsMuthu Kumaar Thangavelu
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurt Portelli
 
A Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsSandra Long
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network HamdaAnees
 
BI Project report
BI Project reportBI Project report
BI Project reporthlel
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisOktay Bahceci
 
Black_Friday_Sales_Trushita
Black_Friday_Sales_TrushitaBlack_Friday_Sales_Trushita
Black_Friday_Sales_TrushitaTrushita Redij
 
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...LinkedTV
 
A Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsA Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsChetan Hireholi
 
A Survey on Stroke Prediction
A Survey on Stroke PredictionA Survey on Stroke Prediction
A Survey on Stroke PredictionMohammadRakib8
 
A survey on heart stroke prediction
A survey on heart stroke predictionA survey on heart stroke prediction
A survey on heart stroke predictiondrubosaha
 
SDTMIG_v3.3_FINAL.pdf
SDTMIG_v3.3_FINAL.pdfSDTMIG_v3.3_FINAL.pdf
SDTMIG_v3.3_FINAL.pdfssusera19791
 
Al-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectAl-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectLeila Al-Mqbali
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_TrushitaTrushita Redij
 

Similar to Microsoft Professional Capstone: Data Science (20)

Predicting Mortgage Rates From Government Data
Predicting Mortgage Rates From Government DataPredicting Mortgage Rates From Government Data
Predicting Mortgage Rates From Government Data
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
report
reportreport
report
 
Milan_thesis.pdf
Milan_thesis.pdfMilan_thesis.pdf
Milan_thesis.pdf
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertation
 
T401
T401T401
T401
 
A Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency Algorithms
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
BI Project report
BI Project reportBI Project report
BI Project report
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_Analysis
 
Black_Friday_Sales_Trushita
Black_Friday_Sales_TrushitaBlack_Friday_Sales_Trushita
Black_Friday_Sales_Trushita
 
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
 
A Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsA Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software Defects
 
A Survey on Stroke Prediction
A Survey on Stroke PredictionA Survey on Stroke Prediction
A Survey on Stroke Prediction
 
A survey on heart stroke prediction
A survey on heart stroke predictionA survey on heart stroke prediction
A survey on heart stroke prediction
 
SDTMIG_v3.3_FINAL.pdf
SDTMIG_v3.3_FINAL.pdfSDTMIG_v3.3_FINAL.pdf
SDTMIG_v3.3_FINAL.pdf
 
Al-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research ProjectAl-Mqbali, Leila, Big Data - Research Project
Al-Mqbali, Leila, Big Data - Research Project
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Ashwin_Thesis
Ashwin_ThesisAshwin_Thesis
Ashwin_Thesis
 
Data mining of massive datasets
Data mining of massive datasetsData mining of massive datasets
Data mining of massive datasets
 

Recently uploaded

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 

Recently uploaded (20)

Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 

Microsoft Professional Capstone: Data Science

  • 1. Microsoft Professional Program: Data Science DAT102x Capstone Project Predicting Mortgage Rates from Government Data Mashfiq E Shahriar November 2019
  • 2. Contents 1 Executive Summary 3 1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Analysis Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Key Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Data Exploration 4 2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Numerical Feature Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Categorical Feature Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 Data Preparation 11 3.1 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Data Wrangling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 Machine Learning Model 13 4.1 Choosing Machine Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.2 Tuning Hyper-parameters and Cross-Validation . . . . . . . . . . . . . . . . . . . . . 14 4.3 Joint Plot of Predicted Rate Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5 Conclusions & way forward 19 6 Appendix 20 6.1 Machine Learning with Python in Jupyter Notebook . . . . . . . . . . . . . . . . . . 20 6.2 Machine Learning Competition Dashboard . . . . . . . . . . . . . . . . . . . . . . . . 23 1
  • 3. List of Figures 2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Distribution of Training Label . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Distribution of Numerical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.5 Scatter-Plot Matrix for Numeric Features . . . . . . . . . . . . . . . . . . . . . . . . 8 2.6 Correlation Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.7 Bar Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Feature Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Data Set After Changing to Descriptive Strings . . . . . . . . . . . . . . . . . . . . . 12 3.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.1 Azure ML Tuning Hyperparameters Workflow . . . . . . . . . . . . . . . . . . . . . . 14 4.2 Tuning Hyperparameters Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.3 Azure ML Cross Validation Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.4 Cross Validation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.5 Azure ML Final Model Worklflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.6 Jointplot of Predicted Rate Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 6.1 Competition Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2
  • 4. 1. Executive Summary 1.1 Problem Description This document presents an analysis and predictive modelling of data concerning mortgage loan applications and interest rate spread. Goal of the problem was to predict the rate spread of mortgage applications according to the given data set, which is adapted from the Federal Financial Institutions Examination Council’s (FFIEC). Link to Problem.The value to be predicted was a float value, making this a regression problem. To measure the accuracy of regression, metric known as the "coefficient of determination" or "R squared” was used. R squared is the proportion of the variance in the dependent variable that is explained by the predictive model. Closer the R squared value to 1, higher the accuracy of the predictive model. R-squared can be calculated by the formula: R2 = 1 − i (yi − ˆyi)2 / i (yi − ¯y)2 1.2 Analysis Process The initial data set given for the problem contained 200,000 cases of mortgage loan applications. There were 21 variables i.e. features in the data that were associated with rate spread i.e. target label. To tackle the problem: Data exploration was carried out, which involved descriptive statistics, distribution plots for numerical features. Followed by pair plot and heat map visualization to see potential relationships among them. Bar charts were used to identify relationships in categorical variables. Data preparation was done to pre-process the data i.e. clean missing data, change features to appropriate data types, transforming categorical values represented by numbers to strings. Further wrangling was done which involved detection & replacement of outliers, normalization of numeric fea- tures, converting categorical features to Boolean columns also known as ‘one-hot encoding’, feature Selection was performed by analyzing features in Pearson correlation, Kendall correlation, Mutual information and Chi squared. Machine learning model was created by selection of best algorithm amongst a few regressors, based on R squared. Chosen model was tuned for hyper-parameters, cross validated and deployed for predicting test data. 1.3 Key Findings Most important features for predictive modelling of rate spread turned out to be: ‘loan_amount’, ‘loan_type’, ‘property_type’, ‘preapproval’, ‘loan_purpose’, ‘ffiecmedian_family_income’, ‘applicant_income’, ‘minority_population_pct’. 3
  • 5. Boosted decision tree regression algorithm performed the best in training the model. It was able to achieve the highest R squared compared to Linear regression, Decision forest regression. Trained model was observed to generalize well when cross-validated, depicting significantly minute standard deviation in R squared across folds. Model upon deployment predicted rate spread based on the test data. When compared to actual rate spread with held in the machine learning competition R squared of 0.77 was achieved.Model was ranked # 12 in the Machine Learning Competition out of 700+ competitors. Link to Competition 2. Data Exploration 2.1 Dataset Data exploration started with visualization of the data frame. The original dataset can be seen below, containing 23 columns: 21 feature columns, 1 label column and 1 ID column. Figure 2.1: Dataset Numerical Features: • loan_amount: Size of the requested loan in thousands of dollars. • applicant_income: In thousands of dollars. • population: Total population in tract. • minority_population_pct: Percentage of minority population to total population for tract. • ffiecmedian_family_income: FFIEC Median family income in dollars for the • MSA/MD in which the tract is located. • tract_to_msa_md_income_pct: Percentage of tract median family income compared to MSA/MD median family income. • number_of_owner-occupied_units: Number of dwellings, including individual condominiums, that are lived in by the owner. • number_of_1_to_4_family_units: Dwellings that are built to house fewer than 5 families. 4
  • 6. Categorical Features: • loan_type: Indicates whether the loan granted, applied for, or purchased was conventional, government-guaranteed, or government-insured. • property_type: Indicates whether the loan or application was for a one-to-four-family dwelling (other than manufactured housing), manufactured housing, or multifamily dwelling. • loan_purpose: Indicates whether the purpose of the loan or application was for home purchase, home improvement, or refinancing. • occupancy: Indicates whether the property to which the loan application relates will be the owner’s principal dwelling. • preapproval: Indicate whether the application or loan involved a request for a pre-approval of a home purchase loan. • msa_md: Metropolitan Statistical Area/Metropolitan Division state_code: U.S. state. • county_code: County. • applicant_ethnicity: Ethnicity of the applicant. • applicant_race: Race of the applicant. • applicant_sex: Sex of the applicant. • lender: The authority in approving or denying this loan. • co_applicant: Indicates whether there is a co-applicant (often a spouse) or not. 5
2.2 Numerical Feature Relationships

Summary statistics for minimum, maximum, mean, standard deviation, and distinct count were calculated for the numeric columns; the results from 200,000 observations are shown in Figure 2.2.

Figure 2.2: Summary Statistics

Figure 2.3: Distribution of Training Label

Kernel density estimate plots of the training label and the important numerical features were generated to study their distributions. The training label in Figure 2.3 appears right-skewed with a high peak, so high positive values for skewness and kurtosis can be assumed. A high kurtosis value indicates a likelihood of outliers, and this can be seen visually as well: the majority of the data lie below 10, with a few sparse positive outliers portrayed by the rug on the x-axis. The discontinuity of the rug also shows that this is a distribution of discrete values.
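As a sketch of how these moments and the plot in Figure 2.3 can be produced, assuming the merged frame df from above and 'rate_spread' as the original label column name:

from scipy.stats import kurtosis, skew
import matplotlib.pyplot as plt
import seaborn as sns

y = df['rate_spread'].dropna()  # label column name assumed

# Positive skew -> long right tail; high kurtosis -> heavy tails/outliers
print('skewness: {:.2f}'.format(skew(y)))
print('kurtosis: {:.2f}'.format(kurtosis(y)))

# Kernel density estimate with a rug along the x-axis, as in Figure 2.3
fig, ax = plt.subplots(figsize=(8, 5))
sns.kdeplot(y, ax=ax)
sns.rugplot(y, ax=ax)
ax.set_title('Distribution of rate spread')
plt.show()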
(a) Distribution of Applicant Income  (b) Distribution of Loan Amount  (c) Distribution of Median Family Income  (d) Distribution of Minority Population Percentage

Figure 2.4: Distribution of Numerical Features

Applicant income in Figure 2.4a appears right-skewed with a high peak, so high positive values for skewness and kurtosis can be assumed here as well. The high kurtosis indicates a likelihood of outliers, which is visible in the plot: the majority of the data lie below 200 (x-axis in thousands), with quite a few positive outliers that become more extreme and sparser beyond 2,000.

The loan amount in Figure 2.4b has a distribution very similar to applicant income. Loan amount and applicant income are positively correlated, as shown in the scatter-plot matrix later in this report.

Median family incomes of the metropolitan statistical areas in Figure 2.4c appear normally distributed, with the mean around 65,000. There are a few low-income areas below 40,000, shown by the stacks in the rug on the x-axis.

The minority population percentage distribution in Figure 2.4d is right-skewed. The mean of the distribution is 34 percent, which is higher than what is generally witnessed in most areas; the higher mean is due to the positive skewness, as some areas have as much as 100 percent minority population. The mode of the distribution is around 5 percent, the demographic of the highest frequency.
The scatter-plot matrix in Figure 2.5 was generated to visualize the relationships of the numerical features with each other and with the training label.

Figure 2.5: Scatter-Plot Matrix for Numeric Features
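A matrix like Figure 2.5 can be reproduced with seaborn's pairplot; a sketch, where the subset of feature columns is chosen purely for illustration and 'rate_spread' is again the assumed label name:

import seaborn as sns
import matplotlib.pyplot as plt

cols = ['loan_amount', 'applicant_income', 'population',
        'minority_population_pct', 'rate_spread']

# Sample rows so that 200,000 points do not overwhelm the plot
sample = df[cols].dropna().sample(5000, random_state=0)
sns.pairplot(sample, plot_kws={'alpha': 0.2, 's': 10})
plt.show()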
Figure 2.6a below shows the correlations between the log-transformed numerical features and the log-transformed rate spread. This is followed by an annotated heat map in Figure 2.6b showing the correlations amongst the features and the rate spread. Log values were used to accentuate the correlation coefficients.

(a) Correlations of Numerical Features with Training Label  (b) Heatmap of Correlations

Figure 2.6: Correlation Coefficients
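A sketch of the log-transformed correlation computation; clipping at 1 before taking logs is an assumption about how zeros are handled, since the report does not specify:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

cols = ['loan_amount', 'applicant_income', 'population',
        'minority_population_pct', 'ffiecmedian_family_income']

logged = np.log(df[cols].clip(lower=1))
logged['log_rate_spread'] = np.log(df['rate_spread'].clip(lower=1))

corr = logged.corr()
print(corr['log_rate_spread'])  # each feature's correlation with the label

# Annotated heat map, as in Figure 2.6b
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()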
2.3 Categorical Feature Frequency

Figure 2.7 below shows the frequency of each category in the categorical features.

Figure 2.7: Bar Charts
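Frequency bar charts like those in Figure 2.7 can be produced from value_counts; a sketch over a handful of the categorical columns:

import matplotlib.pyplot as plt

cat_cols = ['loan_type', 'property_type', 'loan_purpose',
            'occupancy', 'preapproval', 'applicant_sex']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), cat_cols):
    df[col].value_counts().plot.bar(ax=ax)  # frequency of each category
    ax.set_title(col)
plt.tight_layout()
plt.show()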
3. Data Preparation

3.1 Data Pre-processing

The initial data set had missing values in eight of the numerical features. During the data exploration stage of the project it was discovered that the features with missing values had right-skewed distributions; therefore, the missing values were replaced by the median rather than the mean. Missing values were replaced, instead of dropping the rows, in order not to lose training data points.

The initial data set also stored many columns with inappropriate data types: several of the categorical columns were typed as integers or floats. The columns were changed to the appropriate data types as shown in Figure 3.1. After correcting the data types, the numbers used to represent the unique categories in the categorical columns were converted to descriptive strings. This not only makes the data set visually readable but also eliminates problems that may arise during the predictive modelling process. The data set after these changes is shown in Figure 3.2.

Figure 3.1: Feature Data Types
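The equivalent pre-processing in pandas might look like the sketch below. The median imputation and re-typing follow the report; the code-to-string table for 'loan_type' is illustrative only, as the actual mapping comes from the FFIEC data dictionary:

# Median imputation for the right-skewed numeric features
skewed = ['applicant_income', 'population', 'minority_population_pct',
          'ffiecmedian_family_income']
for col in skewed:
    df[col] = df[col].fillna(df[col].median())

# Replace numeric category codes with descriptive strings, then re-type
# as a categorical column (the mapping shown is a placeholder)
loan_type_map = {1: 'Conventional', 2: 'FHA-insured', 3: 'VA-guaranteed'}
df['loan_type'] = df['loan_type'].map(loan_type_map).astype('category')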
Figure 3.2: Data Set After Changing to Descriptive Strings

3.2 Data Wrangling

The following four processes were carried out in Azure Machine Learning Studio:

• Detection of extreme outliers: the 'Clip Values' module was used to detect values above the 99th percentile and replace them with a system-generated threshold value.

• Normalization of numeric features: the 'Normalize Data' module was used to rescale numeric features to a standard range, with Z-score rescaling chosen to scale and constrain the numeric features. This is performed so that model training is not dominated by numeric features with larger values. Normalization did not appear to have any effect on tree-based algorithms, so it was skipped for the ensemble tree algorithms.

• Converting categorical features to Boolean columns, also known as one-hot encoding: the 'Convert to Indicator Values' module was used. One-hot encoding likewise did not appear to have any effect on tree-based algorithms, so it too was skipped for the ensemble tree algorithms.

• Feature selection: the 'Filter Based Feature Selection' module was used in Azure ML. Pearson correlation and Kendall correlation were used only on the numeric features; mutual information and Chi squared were applied to all features, as these two methods can handle both numeric and categorical features. The 8 features with the most predictive power for rate spread, using Pearson correlation and mutual information, are shown in Figure 3.3.

Figure 3.3: Feature Selection
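The scikit-learn equivalents of these four Azure ML modules, as a rough sketch; the column choices are illustrative, and 'rate_spread' is again the assumed label name:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import mutual_info_regression

# Clip extreme outliers at the 99th percentile (cf. 'Clip Values')
for col in ['loan_amount', 'applicant_income']:
    df[col] = df[col].clip(upper=df[col].quantile(0.99))

# Z-score normalization of numeric features (cf. 'Normalize Data')
num_cols = ['loan_amount', 'applicant_income', 'population']
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# One-hot encoding (cf. 'Convert to Indicator Values')
encoded = pd.get_dummies(df, columns=['loan_type', 'property_type',
                                      'loan_purpose'])

# Rank features by mutual information with the label
# (cf. 'Filter Based Feature Selection')
X = encoded.drop(columns=['rate_spread', 'row_id']).select_dtypes('number')
mi = mutual_info_regression(X.fillna(0), encoded['rate_spread'])
print(pd.Series(mi, index=X.columns).sort_values(ascending=False).head(8))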
4. Machine Learning Model

4.1 Choosing Machine Learning Algorithm

The problem at hand is the prediction of continuous values, with training inputs and training labels available; therefore, it is supervised learning of the regression type. Azure Machine Learning Studio was used for training the model. Factors considered when choosing a machine learning algorithm:

• Problem category: specific algorithms are meant to solve specific problems, and every algorithm has its own inductive bias. It has already been established that our problem is supervised regression.

• Size of the data set: here it boils down to the number of features. Certain data sets have a large number of features compared to the number of data points, which becomes a bias/variance issue: low-bias, high-variance models tend to overfit on data sets with a large number of features, while high-bias, low-variance models may not overfit but will be less accurate. Our data set has ample data points compared to its number of features.

• Linearity: some algorithms depend on linearity in the data set; for example, linear regression assumes data trends follow a straight line. In our case there is a large number of nominal variables, so the data set lacks linearity.

• Accuracy required: sometimes only an approximation is needed, and at other times higher accuracy is needed. Some algorithms are not very accurate but have the advantage that they do not overfit. In our case the metric of measurement is R squared, so accuracy is of the utmost importance.

• Training time: training time and computational power requirements differ for each algorithm. Models such as neural network regression need long training times; larger data sets also lengthen training, and higher accuracy generally comes at the cost of longer training as well.

Taking the above points into account, three algorithms were chosen. Their R squared values on the data set were as follows:

1. Boosted Decision Tree Regression: 0.74
2. Decision Forest Regression: 0.68
3. Linear Regression: 0.41

Therefore, Boosted Decision Tree Regression was chosen for further tuning to enhance accuracy. Azure's boosted tree for regression uses gradient boosting:

1. Gradient boosting starts by making a single leaf, which is the average of the label values.

2. Pseudo-residuals are then calculated for each data point by subtracting the average label value, i.e. the first leaf, from the actual label values.
3. A decision tree is then created to predict the pseudo-residuals instead of the label. The decision tree's prediction, which is a residual value, is multiplied by the learning rate and added to the first leaf, resulting in a new label average.

4. The new label average is used to calculate a new set of pseudo-residuals, and another tree is created to predict the new residuals. The new residual value is multiplied by the learning rate and added to the previous tree. This process is repeated to reduce the residuals and converge the predictions towards the target values.

This is therefore a sequential ensemble method: it fits together a sequence of weak learners to make a strong model, and predictions are combined through a weighted sum (the learning rate) into a final prediction. A minimal sketch of this residual-fitting loop is given below, after Figure 4.1.

4.2 Tuning Hyper-parameters and Cross-Validation

Hyper-parameters for the algorithm:

• Maximum number of leaves per tree
• Minimum number of samples per leaf node
• Learning rate
• Number of trees constructed

Figure 4.1 below shows the workflow in Azure ML Studio used to train the model. Two models were trained, one with tuned hyper-parameters and one without. The model with tuned hyper-parameters had better accuracy than the model with default parameters, as shown in the metrics in Figure 4.2 on the next page.

Figure 4.1: Azure ML Tuning Hyperparameters Workflow
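The promised from-scratch sketch of the residual-fitting loop, using shallow scikit-learn decision trees as the weak learners; it illustrates steps 1-4 above under simplifying assumptions (squared-error loss, fixed tree size) and is not the exact algorithm Azure ML runs:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_leaves=20):
    base = np.mean(y)                # step 1: initial leaf = label average
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred         # step 2: pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)       # step 3: fit a tree to the residuals
        pred += learning_rate * tree.predict(X)  # step 4: scaled update
        trees.append(tree)
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    # Combine the weak learners through the learning-rate-weighted sum
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred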
Figure 4.2: Tuning Hyperparameters Results

After retrieving the best hyper-parameter values from the 'Tune Model Hyper-parameters' module, further model training was carried out. Cross-validation was then performed to assess the model's behaviour across 10 folds; refer to Figure 4.3 below for the Azure ML workflow. The model appears to generalise well, as seen in Figure 4.4 on the next page: the R squared values across the folds were close, with a small standard deviation of 0.0024, and the mean R squared over the 10 folds was 0.7945.

Figure 4.3: Azure ML Cross Validation Workflow
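Azure's 'Tune Model Hyperparameters' and cross-validation modules have rough scikit-learn counterparts. The sketch below uses GradientBoostingRegressor as a stand-in for Azure's boosted decision tree and searches over the four hyper-parameters listed in Section 4.2; the grid values are illustrative, and a prepared numeric feature matrix X_train and label vector y_train are assumed:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {
    'max_leaf_nodes': [20, 32, 64],    # maximum number of leaves per tree
    'min_samples_leaf': [1, 10, 50],   # minimum samples per leaf node
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [100, 300],        # number of trees constructed
}

search = GridSearchCV(GradientBoostingRegressor(), param_grid,
                      scoring='r2', cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)

# 10-fold cross-validation of the tuned model, as in Figures 4.3 and 4.4
scores = cross_val_score(search.best_estimator_, X_train, y_train,
                         scoring='r2', cv=10)
print('mean R^2: {:.4f}, std: {:.4f}'.format(scores.mean(), scores.std()))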
Figure 4.4: Cross Validation Results
4.3 Joint Plot of Predicted Rate Spread

Final model training was done with 75% of the training data and tested with the remaining 25%; refer to Figure 4.5 for the Azure ML workflow. The root mean squared error (RMSE) was 0.73. It is apparent from the joint plot in Figure 4.6 on the next page that the model predicted the rate spread as continuous values, whereas the training labels were discrete. The predicted values have a right-skewed distribution, in contrast to the multiple peaks of the training label distribution. The fitted regression line shows a linear relationship between the predicted values and the training label. It can also be inferred from the fading scatter points that the model does a better job of predicting high and low values of rate spread, and is more error-prone for the middle values (3, 4, 5).

Figure 4.5: Azure ML Final Model Workflow
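A sketch of this final evaluation step in scikit-learn, reusing the tuned estimator from the previous sketch and assuming prepared X and y:

import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 75/25 split, mirroring the Azure ML experiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)
model = search.best_estimator_.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE: {:.2f}'.format(rmse))

# Joint plot of actual labels against predictions, as in Figure 4.6
sns.jointplot(x=y_test, y=pred, kind='reg',
              joint_kws={'scatter_kws': {'alpha': 0.1}})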
Figure 4.6: Jointplot of Predicted Rate Spread
5. Conclusions & way forward

This analysis concludes that predictions of interest rate spread can be confidently made from the information collected on loan applications. In particular, loan amount, loan type, property type, preapproval requirement, and loan purpose have a significant effect on determining the interest rate spread. The trained model has proven to be sufficiently effective in predicting rate spread from the given loan application information.

Future work can involve acquiring more domain knowledge to continue feature engineering and data collection. The model predicted rate spread as continuous float values from a training label that was discrete; further understanding of whether rate spread must be discrete or a continuous float is needed, and expert knowledge can be consulted to deepen the understanding of the domain. The trained model can be deployed as a web service and used for prediction in a data pipeline. Over time, as more data are collected, the new data should be used to retrain and update the existing model; retraining the model periodically with new data and enhanced features will increase its predictive capabilities.

Resources utilised for this analysis: Azure ML Studio, Jupyter Notebook, Anaconda, Python 3, and the packages scikit-learn, SciPy, NumPy, pandas, Matplotlib, and Seaborn.
6. Appendix

6.1 Machine Learning with Python in Jupyter Notebook

import pandas as pd
import numpy as np
import numpy.random as nr
import math
import scipy.stats as ss
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, linear_model
import sklearn.model_selection as ms
import sklearn.metrics as sklm
%matplotlib inline

train = pd.read_csv('train_values.csv')
label = pd.read_csv('train_labels.csv')
test = pd.read_csv('test_values.csv')

# Join features and labels on the shared row identifier
df = pd.merge(train, label, how='inner', on='row_id')
df.shape
df.columns
df.select_dtypes(include=['float64']).describe().transpose()

# Shorter working names; the label column becomes 'ratespread'
df.columns = ['row_id', 'loan_type', 'property_type', 'loan_purpose', 'occupancy',
              'loan_amount', 'preapproval', 'msa_md', 'state_code', 'county_code',
              'ethnicity', 'race', 'sex', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam',
              'lender', 'co_applicant', 'ratespread']

# -1 is the missing-value sentinel in these geographic code columns
column = {'msa_md': -1, 'state_code': -1, 'county_code': -1}
df.replace(column, np.nan, inplace=True)
df.isna().sum()

# Imputation statistics; .mode() returns a Series, so take element [0]
mode_statecode = df['state_code'].mode()[0]
mode_income = df['income'].mode()[0]
mode_pop = df['population'].mode()[0]
mode_minpop = df['min_pop'].mode()[0]
mode_mdfamincome = df['medianfam_income'].mode()[0]
mode_tractmsamd = df['tract_msamd'].median()
mode_owneroccup = df['owner_occup'].mode()[0]
mode_oneto4fam = df['oneto4_fam'].mode()[0]

# Fill each column with its own statistic
df['state_code'] = df['state_code'].fillna(mode_statecode)
df['income'] = df['income'].fillna(mode_income)
df['population'] = df['population'].fillna(mode_pop)
df['min_pop'] = df['min_pop'].fillna(mode_minpop)
df['medianfam_income'] = df['medianfam_income'].fillna(mode_mdfamincome)
df['tract_msamd'] = df['tract_msamd'].fillna(mode_tractmsamd)
df['owner_occup'] = df['owner_occup'].fillna(mode_owneroccup)
df['oneto4_fam'] = df['oneto4_fam'].fillna(mode_oneto4fam)

# Persist the cleaned frame; the same steps are applied to the test file
df.to_csv('pythontest.csv', index=False)
import pandas as pd
import numpy as np
import numpy.random as nr
import math
import scipy.stats as ss
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, linear_model
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import sklearn.metrics as sklm
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('pythontrain.csv')
test = pd.read_csv('pythontest.csv')
df.shape
df.columns
df.isna().sum()
test.head()

# Re-type the categorical code columns as objects so they are not
# mistaken for continuous numeric features
for col in ['loan_type', 'property_type', 'loan_purpose', 'occupancy',
            'preapproval', 'msa_md', 'state_code', 'county_code',
            'ethnicity', 'race', 'sex', 'lender']:
    df[col] = df[col].astype(object)
df.dtypes

numeric = df[['loan_amount', 'income', 'population', 'min_pop',
              'medianfam_income', 'tract_msamd', 'owner_occup', 'oneto4_fam']]
categorical = df[['loan_type', 'property_type', 'loan_purpose', 'occupancy',
                  'preapproval', 'msa_md', 'state_code', 'county_code',
                  'ethnicity', 'race', 'sex', 'lender']]
log_num = np.log(numeric)  # log transform to accentuate correlations

def plot_scatter(df, cols, col_y='ratespread'):
    sns.set_style("darkgrid")
    for col in cols:  # iterating a DataFrame yields its column names
        fig = plt.figure(figsize=(7, 6))
        ax = fig.gca()
        df.plot.scatter(x=col, y=col_y, ax=ax)
        ax.set_title('Scatter plot of ' + col_y + ' vs. ' + col)
        ax.set_xlabel(col)
        ax.set_ylabel(col_y)
        plt.show()

plot_scatter(df, log_num)
X = df.drop('ratespread', axis='columns')
y = df.ratespread
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train.shape, X_test.shape
X_train.describe().transpose()

# Keep only the selected features (drop IDs and low-information columns)
training = [x for x in X_train.columns if x not in
            ['row_id', 'county_code', 'ethnicity', 'sex', 'population',
             'tract_msamd', 'owner_occup', 'oneto4_fam', 'co_applicant',
             'ratespread']]

import matplotlib as mpl
import warnings; warnings.filterwarnings(action='once')

# Plot styling
large, med, small = 22, 16, 12
params = {'axes.titlesize': med,
          'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline
import random

# Fitted for completeness; tree ensembles do not require feature scaling
scaler = StandardScaler()
scaler.fit(X_train[training])

# criterion="mse" was renamed "squared_error" in scikit-learn >= 1.0
freg = RandomForestRegressor(n_estimators=2000, max_depth=32,
                             criterion="mse", random_state=1234)
freg.fit(X_train[training], y_train)
prediction = freg.predict(X_train[training])
print('random forest train mse: {}'.format(mean_squared_error(y_train, prediction)))
prediction = freg.predict(X_test[training])
print('random forest test mse: {}'.format(mean_squared_error(y_test, prediction)))

# Predict on the competition test set and write the submission file
prediction1 = []
for model in [freg]:
    prediction1.append(pd.Series(model.predict(test[training])))
final = pd.concat(prediction1, axis=1).mean(axis=1)
temp1 = pd.concat([test.row_id, final], axis=1)
temp1.columns = ['row_id', 'ratespread']
temp1.head()
temp1.to_csv('prediction.csv', index=False)

# Rank and plot feature importances from the fitted forest
importance = pd.Series(freg.feature_importances_)
importance.index = training
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6),
                    color=['darkblue', 'red', 'gold', 'pink', 'purple', 'darkcyan'])
6.2 Machine Learning Competition Dashboard

Figure 6.1: Competition Score