2/3/2019 Ribero_Esteban_Assignment 4
http://localhost:8888/files/Google%20Drive/MSPA%20Master/Courses/MSDS%20422%20ML/Assignment%204/Ribero_Esteban_Assignment%204.html 1/12
Esteban Ribero, Assignment #4 - MSDS 422 | Winter 2019
Evaluating Random Forests and Gradient Boosting for Regression
Purpose and summary of results
The purpose of this exercise is to extend the analysis of the regression methods from Assignment 2 by employing
Random Forests and Gradient Boosting Regression Trees and comparing them with the best linear regression
models used previously. A cross-validation design was used for the comparisons. The data set used to perform the
study is again the Boston Housing Study, which contains 506 census tract observations and 13 variables. The
target variable to predict is the median value of homes in Boston in 1970. Table 1, extracted from Miller (2015) and
shown in the appendix, describes the variables contained in the data set.
Several versions of random forest and gradient boosting were fitted to the data and cross-validated while tuning
the hyperparameters. Random forest and gradient boosting performed substantially better than all the linear
regression models used previously. The default settings for these two methods already produced strong results,
reducing the Root Mean Squared Error (RMSE) across the test sets from about 0.52 to 0.35 or less, an improvement
of roughly 32%. Tuning some hyperparameters led to an optimized gradient boosting model with an RMSE of
0.30 and a 'pseudo R-squared' of 0.90, well above the ~0.72 for the linear regression models. A comparison of
feature importance is provided, and some managerial implications for a real estate brokerage firm are drawn
from the study.
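As a quick check of the arithmetic behind the improvement quoted above, the relative reduction in test RMSE can be computed directly. This is only a sketch: the helper name `relative_reduction` is illustrative and not part of the original notebook, and the inputs are the rounded values quoted in the summary.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline

# Rounded test RMSE values quoted in the summary (standardized units)
linear_rmse = 0.52    # approximate level of the linear regression models
ensemble_rmse = 0.35  # approximate level of the default ensemble models

reduction = relative_reduction(linear_rmse, ensemble_rmse)
print(f"Relative RMSE reduction: {reduction:.1%}")
```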
Loading the required packages
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
import seaborn as sns # pretty plotting, including heat map
import sklearn.linear_model # modeling routines from Scikit Learn
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt # for root mean-squared error calculation
from sklearn.preprocessing import StandardScaler #for scaling the data
from sklearn.model_selection import KFold #for cross-validation
#importing the regressors to be used
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
In [2]: #setup for displaying multiple outputs from a single Jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
The data
In [3]: # loading the data into a dataframe
boston_input = pd.read_csv('boston.csv')
# drop neighborhood from the data being considered
boston = boston_input.drop('neighborhood', axis=1)
#Setting up the data for fitting the models into numpy arrays
prelim_model_data = np.array([boston.mv,
boston.crim,
boston.zn,
boston.indus,
boston.chas,
boston.nox,
boston.rooms,
boston.age,
boston.dis,
boston.rad,
boston.tax,
boston.ptratio,
boston.lstat]).T
The data were already explored in the prior analysis, which raised some concerns regarding the distribution of some
explanatory variables as well as the presence of several outliers and extreme outliers. For the
purpose of this and the prior analysis these concerns were not addressed, and only a simple standardization of the
data using the StandardScaler was performed. This was done for the linear regression models, which are
sensitive to large differences in the scales of the variables. Although standardization is not necessary for random
forest and gradient boosting, the same standardized data set was used for ease of comparison. The standardization
centers the data around 0 and expresses all variables in units of standard deviation. This was done even for the
target variable, so the RMSE values reported below are in standard deviations of the median home value.
In [4]: # Scaling the data using standardization
scaler = StandardScaler()
model_data = scaler.fit_transform(prelim_model_data)
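To make the effect of the standardization concrete, the sketch below reproduces the z-score computation by hand with numpy and shows how an error measured on the standardized target could be converted back to the original units. The values in `mv` and the example `rmse_scaled` are illustrative assumptions, not results from the study.

```python
import numpy as np

# Illustrative home values (in thousands of dollars); not the full data set
mv = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

# z-score standardization: subtract the mean, divide by the standard deviation.
# StandardScaler uses the population standard deviation (ddof=0).
mean, std = mv.mean(), mv.std(ddof=0)
mv_scaled = (mv - mean) / std

# An error measured on the standardized target is expressed in standard
# deviations; multiplying by std converts it back to the original units.
rmse_scaled = 0.30  # example value in standardized units (assumption)
rmse_original_units = rmse_scaled * std

# Undo the transformation to recover the original values
mv_recovered = mv_scaled * std + mean

print(mv_scaled.round(3))
print(round(rmse_original_units, 3))
```

The same back-conversion is available on a fitted scaler through `inverse_transform`, provided the scaler was fitted on the column being converted.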
The following boxplot shows the results of the standardization for reference.
In [5]: #Boxplot for standardized variables
var_names = ['mv', 'crim','zn', 'indus', 'chas', 'nox', 
'rooms','age','dis','rad','tax','ptratio','lstat']
model_data_df = pd.DataFrame(model_data, columns = var_names)
fig, axis = plt.subplots(figsize=(12,10))
ax = plt.title('Boxplot for Standardized Features')
ax = sns.boxplot(data=model_data_df, orient="h")
As can be observed in the boxplot, all the variables have been centered and standardized. This process
maintains the shape of the distributions, so the differences in the range of values, the distributions, and the
presence of outliers can be easily observed.
Regression models and cross-validation
As before, the following code sets up the models to be evaluated. In this study, I compare the
best linear regression models from the prior exercise with several random forest and gradient boosting models.
The best performing linear regressions used for comparison are the baseline linear regression (no
regularization), the Ridge regression with alpha = 50, and the Lasso regression with alpha = 0.01.
In [6]: #Setup code for regression models being considered
RANDOM_SEED = 1 #to obtain reproducible results
SET_FIT_INTERCEPT = True #to include intercept in the regression
##Specifying the set of regression models being evaluated
names = ['a_Linear_Regression',
'b_Ridge_Regression_50',
'c_Lasso_Regression_0.01',
'd_Random_Forest',
'e_Random_Forest_100_log2',
'f_Random_Forest_100_4',
'g_Random_Forest_10_500_4',
'h_Gradient_Boosting',
'i_Gradient_Boosting_3_500',
'j_Gradient_Boosting_2_500_6',
'k_Gradient_Boosting_3_100_0.3',
'l_Gradient_Boosting_3_50_0.3']
For the set of Random Forests I used the default parameters from scikit-learn first, then fine-tuned them
iteratively until I reached a satisfactory result. After a full exploration I ended up with four versions. Sampling with
replacement (bootstrap) was used for all four:
d_Random_Forest (the default):
With 100 trees, unconstrained depth for the trees, and the ability to use all features.
e_Random_Forest_100_log2:
With 100 trees as well and unconstrained depth, but constraining the number of features considered at each
split to log2 of the number of features, which in this case is equivalent to setting max_features = 3. This
randomizes the features explored, leading to more diverse trees.
f_Random_Forest_100_4:
Same as before but increasing max_features to 4.
g_Random_Forest_10_500_4:
With 500 trees, to average across more trees and reduce the chance of overfitting; a maximum depth of 10 for
each tree, with the same goal; and max_features = 4.
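The `log2` setting described above can be checked with a line of arithmetic. The sketch below assumes, as I understand scikit-learn's behavior, that the logarithm is truncated to an integer with a minimum of 1; the variable names are illustrative.

```python
import math

n_features = 12  # explanatory variables used in this study

# log2(12) is about 3.58; truncating to an integer gives 3
max_features_log2 = max(1, int(math.log2(n_features)))
print(max_features_log2)  # 3
```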
For the set of Gradient Boosting Regression Trees, I ended up with five versions:
h_Gradient_Boosting (the default):
With 100 trees, a maximum depth of 3, and a learning rate of 0.1.
i_Gradient_Boosting_3_500:
Same as above but with 500 trees for a more complex model.
j_Gradient_Boosting_2_500_6:
Same as before but reducing the maximum depth to 2 in an attempt to reduce overfitting with shallower trees,
as well as limiting max_features to 6.
k_Gradient_Boosting_3_100_0.3:
Similar to the default model but increasing the learning rate to 0.3.
l_Gradient_Boosting_3_50_0.3:
Same as above but reducing the number of trees to 50 to reduce model complexity and balance the
increased learning rate.
In [7]: #code to set the parameters of the regressors
regressors = [
    LinearRegression(fit_intercept = SET_FIT_INTERCEPT),
    Ridge(alpha = 50, solver = 'cholesky',
          fit_intercept = SET_FIT_INTERCEPT,
          # data was standardized before, so no extra normalization
          random_state = RANDOM_SEED),
    Lasso(alpha = 0.01, max_iter=10000, tol=0.01,
          fit_intercept = SET_FIT_INTERCEPT,
          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 100, bootstrap=True,
                          max_depth = None,
                          max_features = None, # None = consider all features
                          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 100, bootstrap=True,
                          max_depth = None, max_features = 'log2',
                          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 100, bootstrap=True,
                          max_depth = None, max_features = 4,
                          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 500, bootstrap=True,
                          max_depth = 10, max_features = 4,
                          random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 100,
                              max_depth = 3, max_features = None,
                              learning_rate = 0.1, subsample = 1.0,
                              random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 500,
                              max_depth = 3, max_features = None,
                              learning_rate = 0.1, subsample = 1.0,
                              random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 500,
                              max_depth = 2, max_features = 6,
                              learning_rate = 0.1, subsample = 1.0,
                              random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 100,
                              max_depth = 3, max_features = None,
                              learning_rate = 0.3, subsample = 1.0,
                              random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 50,
                              max_depth = 3, max_features = None,
                              learning_rate = 0.3, subsample = 1.0,
                              random_state = RANDOM_SEED),
]
The following code sets up numpy arrays for storing the results as Python iterates over the for loops during the
cross-validation. Although the main performance indicator for this study is the Root Mean Squared Error (RMSE)
on the test sets, we also collect the RMSE for the training sets, to more easily identify overfitting, as well as a
measure of the variance explained by the model: R-squared for the linear regression models, and pseudo
R-squared for the random forest and gradient boosting models.
In [9]: #Setting up numpy arrays for storing results
N_FOLDS = 10 #number of folds for cross-validation
rmse_test = np.zeros((len(names), N_FOLDS))
rmse_train = np.zeros((len(names), N_FOLDS))
r2_test = np.zeros((len(names), N_FOLDS))
r2_train = np.zeros((len(names), N_FOLDS))
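The two metrics stored in these arrays can be written out directly. The following is a minimal numpy sketch of the formulas that `sqrt(mean_squared_error(...))` and `r2_score(...)` evaluate; the function names `rmse` and `r_squared` and the toy values are illustrative, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(round(rmse(y_true, y_pred), 4))       # 0.3536
print(round(r_squared(y_true, y_pred), 4))  # 0.9
```

Because the same formula is applied to every model, the "pseudo R-squared" for the tree ensembles is computed exactly like the R-squared of the linear models; only its interpretation differs.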
As before, I used a cross-validation design with 10 folds. This means that the data are split into a training set
and a test set ten times. The models are trained on each training set, and their prediction accuracy is validated on
the corresponding test set.
In [10]: # specifying the k-fold cross-validation design
kf = KFold(n_splits = N_FOLDS, shuffle=True, random_state = RANDOM_SEED)
index_for_fold = 0 # fold count initialized
for train_index, test_index in kf.split(model_data):
    # the structure of modeling data for this study has the
    # response variable coming first and explanatory variables later
    # so 1:model_data.shape[1] slices for explanatory variables
    # and 0 is the index for the response variable
    X_train = model_data[train_index, 1:model_data.shape[1]]
    X_test = model_data[test_index, 1:model_data.shape[1]]
    y_train = model_data[train_index, 0]
    y_test = model_data[test_index, 0]
    index_for_method = 0 # initialize
    for name, reg_model in zip(names, regressors):
        ## fit on the train set for this fold
        rmodel = reg_model.fit(X_train, y_train)
        ## evaluate the model on the test and train sets for this fold
        y_test_predict = reg_model.predict(X_test)
        y_train_predict = reg_model.predict(X_train)
        #R-squared
        r2_test[index_for_method, index_for_fold] = \
            r2_score(y_test, y_test_predict)
        r2_train[index_for_method, index_for_fold] = \
            r2_score(y_train, y_train_predict)
        #Root-mean squared error
        fold_method_rmse_test = \
            sqrt(mean_squared_error(y_test, y_test_predict))
        fold_method_rmse_train = \
            sqrt(mean_squared_error(y_train, y_train_predict))
        rmse_test[index_for_method, index_for_fold] = \
            fold_method_rmse_test
        rmse_train[index_for_method, index_for_fold] = \
            fold_method_rmse_train
        index_for_method += 1
    index_for_fold += 1
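To see what the 10-fold design above does to 506 observations, the sketch below splits a plain index array with the same `KFold` settings and checks that every observation lands in exactly one test set. It uses only row indices rather than the housing data; the names `test_sizes` and `times_in_test` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 506  # census tracts in the Boston Housing data
kf = KFold(n_splits=10, shuffle=True, random_state=1)

test_sizes = []
times_in_test = np.zeros(n_samples, dtype=int)
for train_index, test_index in kf.split(np.arange(n_samples)):
    test_sizes.append(len(test_index))
    times_in_test[test_index] += 1  # count test-set membership per row

# 506 is not divisible by 10, so some folds hold 51 rows and others 50
print(test_sizes)
print(times_in_test.min(), times_in_test.max())
```

Each model is therefore trained on roughly 455 rows and evaluated on the remaining 50 or 51 in every fold.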
The following code creates a pandas DataFrame with the results of each fold, then averages the results across all
folds.
In [11]: ##creating multilevel index for dataframes
model_name = names #to avoid confusion in next line
multi_index = pd.MultiIndex.from_product(
[model_name, np.arange(N_FOLDS)],
names=['model','fold'])
##the dataframe
fit_results_df = pd.DataFrame(
    np.hstack((rmse_train.reshape(N_FOLDS*len(names),1),
               rmse_test.reshape(N_FOLDS*len(names),1),
               r2_train.reshape(N_FOLDS*len(names),1),
               r2_test.reshape(N_FOLDS*len(names),1))),
    index=multi_index,
    columns=['Train_RMSE','Test_RMSE',
             'Train_r2','Test_r2'])
##averaging results across folds
av_fit = fit_results_df.groupby('model').mean()
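The averaging step groups the rows by the `model` level of the index. As a small sketch of the same pattern on made-up numbers, the example below also reports the standard deviation across folds, which the original notebook does not compute but which can help judge whether two averages differ meaningfully. The model names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy results: 2 hypothetical models x 3 folds (values are made up)
index = pd.MultiIndex.from_product(
    [['model_a', 'model_b'], np.arange(3)],
    names=['model', 'fold'])
toy = pd.DataFrame(
    {'Test_RMSE': [0.50, 0.52, 0.54, 0.30, 0.34, 0.32]},
    index=index)

# Mean and sample standard deviation of the test RMSE per model
summary = toy.groupby(level='model')['Test_RMSE'].agg(['mean', 'std'])
print(summary.round(3))
```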
Results
The following table shows the average results of the cross-validation across the 10 folds. The average RMSE for
the training sets and the test sets is presented, as well as the coefficient of determination, R-squared.
In [12]: print('----- Results of cross-validation across 10 folds -----\n\n',
      round(av_fit, ndigits=3))
----- Results of cross-validation across 10 folds -----

                               Train_RMSE  Test_RMSE  Train_r2  Test_r2
model
a_Linear_Regression                 0.510      0.516     0.739    0.719
b_Ridge_Regression_50               0.521      0.520     0.729    0.716
c_Lasso_Regression_0.01             0.516      0.519     0.734    0.715
d_Random_Forest                     0.129      0.350     0.983    0.863
e_Random_Forest_100_log2            0.126      0.339     0.984    0.878
f_Random_Forest_100_4               0.125      0.328     0.984    0.886
g_Random_Forest_10_500_4            0.140      0.326     0.981    0.888
h_Gradient_Boosting                 0.153      0.310     0.977    0.896
i_Gradient_Boosting_3_500           0.039      0.308     0.998    0.897
j_Gradient_Boosting_2_500_6         0.121      0.342     0.985    0.872
k_Gradient_Boosting_3_100_0.3       0.071      0.304     0.995    0.899
l_Gradient_Boosting_3_50_0.3        0.123      0.301     0.985    0.901
As can be seen above, the default random forest performs well relative to the linear models. Its average
test RMSE across the 10 folds is 0.350, well below the 0.516 of the best-performing linear regression model.
There is a clear sign of overfitting, given that its average performance on the training data is so much better, with an
RMSE of 0.129 and an R-squared of 0.983 (vs 0.863 on the test set). Constraining max_features to 3 (log2 of 12,
truncated) improves the performance slightly, reducing the RMSE to 0.339. After several
iterations, the best setting for max_features was 4, which reduces the RMSE further to 0.328. There is still
overfitting; increasing the number of trees to 500 and limiting the depth of the trees to
10 reduces the RMSE slightly, to 0.326.
It is possible that there is an even better combination of parameters that would improve the performance of the
random forests, given that there are still signs of overfitting. However, moving to gradient boosting improved
performance much faster: the default setting for gradient boosting provided a model with a test RMSE of
0.310 and an R-squared of 0.896. Increasing the number of trees in the model to 500 while keeping the default
max_depth of 3 and learning_rate of 0.1 improved the performance on the training set considerably (RMSE
= 0.039, R-squared = 0.998) and the performance on the test set slightly (RMSE = 0.308, R-squared = 0.897). So
there was an improvement overall, but the model is clearly overfitting the data given its increased complexity.
Note that for gradient boosting, unlike random forest, more trees increase complexity, following the training data more
closely, while for random forests more trees reduce the chances of overfitting.
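The effect of adding trees to a gradient boosting model can be inspected stage by stage. The sketch below uses synthetic data (an assumption for illustration, not the Boston Housing data) and scikit-learn's `staged_predict`, which yields the predictions after each boosting stage, to track training and test RMSE as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(300, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

gbr = GradientBoostingRegressor(n_estimators=500, max_depth=3,
                                learning_rate=0.1, random_state=1)
gbr.fit(X_train, y_train)

# RMSE after 1, 2, ..., 500 trees on the training and the held-out data
train_rmse = [np.sqrt(mean_squared_error(y_train, p))
              for p in gbr.staged_predict(X_train)]
test_rmse = [np.sqrt(mean_squared_error(y_test, p))
             for p in gbr.staged_predict(X_test)]

best_n_trees = int(np.argmin(test_rmse)) + 1
print("train RMSE with 1 vs 500 trees:", train_rmse[0], train_rmse[-1])
print("test RMSE with 500 trees:", test_rmse[-1])
print("number of trees with the lowest test RMSE:", best_n_trees)
```

On data like this the training error keeps shrinking as trees are added, while the test error typically levels off, which is the pattern the cross-validation results above suggest for the 500-tree model.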
In an attempt to decrease overfitting for the gradient boosting model with 500 trees, I limited the depth of the trees to 2
and introduced randomness in the feature space by restricting max_features to 6.
This did not improve performance and actually made the model perform worse than the default gradient
boosting (test RMSE of 0.342 vs 0.310). The next exploration involved increasing the learning rate while reducing
the number of trees used. These two parameters are closely related: the more trees, the more complex the model, and
the higher the learning rate, the stronger the correction at each iteration. I started with the default number of trees
(100) and increased the learning rate to 0.3. This had a larger impact on performance, reducing the
RMSE on the test data to 0.304, the best so far. The model is likely overfitting, since the RMSE on the training set is
0.071 and the training R-squared is 0.995, so there is still a chance to improve it slightly. Indeed, reducing the
number of trees to only 50 while keeping the learning rate at 0.3 and max_depth = 3 provides the best
performance: a test RMSE of 0.301 and a test R-squared of 0.901.
Feature importance
Now that we have a well-performing model, let us explore the contribution and importance of each of the features.
To do that we will train the models using the full data set.
The following code collects the full data for the features and the target variable.
In [13]: #X and Y train values for full dataset
X_f = model_data[:, 1:model_data.shape[1]]
y_f = model_data[:, 0]
The following code fits the linear regression models on the full data set and stores the
regression coefficients. The intercept is not collected, since we only care about the importance
of the features for comparability with the tree models.
In [14]: #setting the numpy array to collect the
#coefficients for the three linear models
regression_coef = np.zeros((len(names[:3]), #[:3] slices for just the
                                            #linear regression models
                            model_data.shape[1]-1))
#same code as before but using .coef_ to collect coefficients
index_for_method = 0 # initialize
for name, reg_model in zip(names[:3], regressors[:3]):
    # fit the model on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    #regression coefficients (features)
    regression_coef[index_for_method] = reg_model.coef_
    index_for_method += 1
The following code does the same for the tree models, using .feature_importances_ instead of .coef_ to get
the importance of each feature.
In [15]: #setting the numpy array to collect the
#feature importance for the tree models
feature_importance = np.zeros((len(names[3:]), #[3:] slices for the
                                               #tree models
                               model_data.shape[1]-1))
index_for_method = 0 # initialize
for name, reg_model in zip(names[3:], regressors[3:]):
    # fit the model on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    #feature importance
    feature_importance[index_for_method] = rmodel.feature_importances_
    index_for_method += 1
The following two blocks of code create pandas DataFrames with the coefficients for the linear regression models
and the feature importance for the tree models. Note that for ease of visualization with seaborn plots the
DataFrames are reshaped into a tidy format.
In [16]: #creating a dataframe for storing the feature importance for each model
column_names = var_names[1:].copy() #using the list of variables names
##the dataframe
feature_importance_pd = pd.DataFrame(feature_importance)
feature_importance_pd.columns = column_names
feature_importance_pd.index = names[3:]
#creating a subset of the dataframe for the best
#performing ensemble of trees models for visualization
fi_df = feature_importance_pd.loc[['d_Random_Forest',
                                   'g_Random_Forest_10_500_4',
                                   'h_Gradient_Boosting',
                                   'l_Gradient_Boosting_3_50_0.3'],:]
#reshaping the layout for ease of visualization
fi_df = fi_df.stack().reset_index() #making it tidy
fi_df.columns = ['model','feature','importance']
To properly compare the importance of each feature across linear and tree models, I converted the regression
coefficients of the linear models to their absolute values.
In [17]: #creating a dataframe for storing the regression coefficients
column_names = var_names[1:].copy() #using the list of variables names
##the dataframe
regression_coef_pd = pd.DataFrame(regression_coef)
regression_coef_pd.columns = column_names
regression_coef_pd.index = names[:3]
#reshaping the layout for ease of visualization
cf_df = regression_coef_pd
cf_df = cf_df.stack().reset_index()#making it tidy
cf_df.columns = ['model','feature','abs_magnitude']
#converting the magnitude of the coefficients to absolute values
cf_df['abs_magnitude'] = np.absolute(cf_df['abs_magnitude'])
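Tree-based `feature_importances_` in scikit-learn are normalized to sum to 1, whereas absolute regression coefficients are not. One possible way to put both on a comparable scale, which the original notebook does not do, is to rescale the absolute coefficients into shares. The coefficient values below are made up for illustration.

```python
import numpy as np

# Hypothetical standardized regression coefficients (made-up values)
coef = np.array([-0.10, 0.12, 0.29, -0.40])

abs_coef = np.abs(coef)                  # magnitude only, sign discarded
coef_share = abs_coef / abs_coef.sum()   # shares that sum to 1

print(coef_share.round(3))
```

Even after rescaling, the two measures answer different questions, so the comparison remains qualitative.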
The following plot compares the absolute magnitude of the linear regression coefficients with the importance
given to the features by the best and default random forest and gradient boosting models.
In [18]: sns.set()
fig, axis = plt.subplots(1,2, figsize=(12,8))
ax = plt.suptitle("__________ Feature Importance __________")
ax = plt.subplot(1,2,1)
ax = plt.title('feature coefficients (abs value)')
ax = sns.stripplot(data=cf_df, size=10, x='abs_magnitude',
y='feature', hue='model', palette ="Blues")
ax = plt.subplot(1,2,2)
ax = plt.title('feature importance')
ax = sns.stripplot(data=fi_df, size=10, x='importance',
y='feature', hue='model', palette ="Reds")
As can be seen, the pattern differs between the linear regression models and the tree models.
The two measures of importance are not the same, so direct comparison has to be done carefully. Still,
relative to the importance of the number of rooms (rooms) and the percent of the population of lower
socio-economic status (lstat), the tree-based models give much less importance to the other features. In fact, the
importance of zn, indus, chas, and rad is 0 or close to it, except for the optimized random forest, which gives
some small importance to indus. This model also has the most balanced importance relative to the other
tree-based models.
The best performing gradient boosting model relies mostly on the number of rooms and the percent of the
population of lower socio-economic status, as well as a little on the distance (dis) to employment centers. Some
smaller importance is given to the pupil/teacher ratio in public schools (ptratio) and air pollution (nox).
Conclusion
Ensemble models based on trees are among the most widely used models in machine learning given their
strong performance and ease of training. This study adds evidence that these models are indeed strong tools for
supervised learning. With the optimized gradient boosting model, the real estate brokerage firm can confidently
use this machine learning technique to estimate the values of houses in Boston at the time. Particular attention
should be given to the number of rooms and the percentage of the population of lower socio-economic status. These
might be the most obvious features, however, so additional attention should be given to the distance from
employment centers and the level of air pollution. The crime rate and the tax rate are less important but still
contribute to the precision of the model. The model's predictions would likely be a more accurate
estimate of the value of residential real estate than even the assessment of an expert, so it is strongly advised
that the firm use the model as the primary method and complement it with more traditional approaches if
desired.
References:
Thomas W. Miller. Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python.
Pearson Education, Old Tappan, N.J., 2015. Data sets and programs available at http://www.ftpress.com/miller/
and https://github.com/mtpa/.
Appendix

Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Random Forest and Gradient Boosting

  • 1. 2/3/2019 Ribero_Esteban_Assignment 4 http://localhost:8888/files/Google%20Drive/MSPA%20Master/Courses/MSDS%20422%20ML/Assignment%204/Ribero_Esteban_Assignment%204.html 1/12
Esteban Ribero, Assignment #4 - MSDS 422 | Winter 2019
Evaluating Random Forests and Gradient Boosting for Regression
Purpose and summary of results
The purpose of this exercise is to extend the analysis of the regression methods from Assignment 2 by fitting Random Forests and Gradient Boosting Regression Trees and comparing them with the best linear regression models used previously. A cross-validation design was used for the comparisons. The data set is again the Boston Housing Study, which contains 506 census tract observations and 13 variables. The target variable to predict is the median value of homes in Boston in 1970. Table 1, extracted from Miller (2015) and shown in the appendix, describes the variables contained in the data set.
Several versions of random forest and gradient boosting were fitted to the data and cross-validated while tweaking the hyperparameters. Both methods performed substantially better than all the linear regression models used previously. The default settings for these two methods already produced strong results, reducing the Root Mean Squared Error (RMSE) across the test sets from about 0.52 to 0.35 or less, an improvement of at least 32%. Tweaking some hyperparameters led to an optimized gradient boosting model with an RMSE of 0.30 and a 'pseudo R-squared' of 0.90, well above the ~0.72 for the linear regression models. A comparison of feature importance is provided, and some managerial implications for a real estate brokerage firm are drawn from the study.
Loading the required packages
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # static plotting
import seaborn as sns  # pretty plotting, including heat map
import sklearn.linear_model  # modeling routines from Scikit Learn
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt  # for root mean-squared error calculation
from sklearn.preprocessing import StandardScaler  # for scaling the data
from sklearn.model_selection import KFold  # for cross-validation
# importing the regressors to be used
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
In [2]:
# setup for displaying multiple outputs from a single Jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
  • 2. The data
In [3]:
# loading the data into a dataframe
boston_input = pd.read_csv('boston.csv')
# drop neighborhood from the data being considered
# (axis is passed by keyword; the positional form is rejected by newer pandas)
boston = boston_input.drop('neighborhood', axis=1)
# setting up the data for fitting the models into numpy arrays
prelim_model_data = np.array([boston.mv, boston.crim, boston.zn,
                              boston.indus, boston.chas, boston.nox,
                              boston.rooms, boston.age, boston.dis,
                              boston.rad, boston.tax, boston.ptratio,
                              boston.lstat]).T
The data was already explored in the prior analysis, which raised some concerns about the distribution of some explanatory variables and about the presence of several outliers and extreme outliers. For the purpose of this and the prior analysis these concerns were not addressed, and only a simple standardization of the data using the StandardScaler was performed. This was done for the linear regression models, which are sensitive to large differences in scale. Although standardization is not necessary for random forest and gradient boosting, the same standardized data set was used for ease of comparison. The standardization centers the data around 0 and expresses every variable in units of its standard deviation. This was done even for the target variable.
In [4]:
# scaling the data using standardization
scaler = StandardScaler()
model_data = scaler.fit_transform(prelim_model_data)
The following boxplot shows the results of the standardization for reference.
  • 3.
In [5]:
# boxplot for standardized variables
var_names = ['mv', 'crim', 'zn', 'indus', 'chas', 'nox', 'rooms',
             'age', 'dis', 'rad', 'tax', 'ptratio', 'lstat']
model_data_df = pd.DataFrame(model_data, columns=var_names)
fig, axis = plt.subplots(figsize=(12, 10))
ax = plt.title('Boxplot for Standardized Features')
ax = sns.boxplot(data=model_data_df, orient="h")
[Figure: Boxplot for Standardized Features]
As can be observed in the boxplots, all the variables have been centered and standardized. This process maintains the shape of the distributions, so the differences in the range of values, the distributions, and the presence of outliers can be easily observed.
Regression models and cross-validation
As before, the following code sets up the models to be evaluated. In this study, I compare the best linear regression models from the prior exercise with several random forest and gradient boosting models. The best performing linear regressions used for comparison are the baseline linear regression (no regularization), the Ridge regression with alpha = 50, and the Lasso regression with alpha = 0.01.
  • 4.
In [6]:
# setup code for regression models being considered
RANDOM_SEED = 1  # to obtain reproducible results
SET_FIT_INTERCEPT = True  # to include intercept in the regression
# specifying the set of regression models being evaluated
names = ['a_Linear_Regression',
         'b_Ridge_Regression_50',
         'c_Lasso_Regression_0.01',
         'd_Random_Forest',
         'e_Random_Forest_100_log2',
         'f_Random_Forest_100_4',
         'g_Random_Forest_10_500_4',
         'h_Gradient_Boosting',
         'i_Gradient_Boosting_3_500',
         'j_Gradient_Boosting_2_500_6',
         'k_Gradient_Boosting_3_100_0.3',
         'l_Gradient_Boosting_3_50_0.3']
For the set of Random Forests I used the default parameters from scikit-learn first, then fine-tuned them iteratively until I got to a satisfactory result. After a full exploration I ended up with four versions. Sampling with replacement (bootstrap) was used for all four:
d_Random_Forest (the default): With 100 trees, unconstrained depth for the trees, and the ability to use all features.
e_Random_Forest_100_log2: With 100 trees as well and unconstrained depth, but constraining the number of features considered at each split to log2 of the number of features, which in this case is equivalent to setting max_features = 3. This randomizes the exploration of the features, leading to more diverse trees.
f_Random_Forest_100_4: Same as before but increasing the range of feature exploration to 4.
g_Random_Forest_10_500_4: With 500 trees, to average across more trees and reduce the chances of overfitting, a maximum depth of 10 for each tree with the same goal, and max_features set to 4.
For the set of Gradient Boosting Regression Trees, I ended up with five versions:
h_Gradient_Boosting (the default): With 100 trees, a maximum depth of 3, and a learning rate of 0.1.
i_Gradient_Boosting_3_500: Same as above but with 500 trees for a more complex model.
j_Gradient_Boosting_2_500_6: Same as before but reducing the maximum depth to 2, in an attempt to reduce overfitting by stopping tree growth earlier, and limiting the feature exploration to 6.
k_Gradient_Boosting_3_100_0.3: Similar to the default model but increasing the learning rate to 0.3.
l_Gradient_Boosting_3_50_0.3: Same as above but reducing the number of trees to 50 to reduce model complexity and balance the increased learning rate.
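The claim that max_features = 'log2' amounts to 3 features with 12 explanatory variables can be checked directly. The sketch below is only an illustration on synthetic data (the arrays X and y are made up, not the Boston data); it reads the value that scikit-learn resolved for the first tree of a small forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the 12 explanatory variables (not the Boston data)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 12))
y = X[:, 0] + 0.1 * rng.normal(size=60)

n_features = X.shape[1]
expected = int(np.log2(n_features))  # integer part of log2(12)

forest = RandomForestRegressor(n_estimators=5, max_features='log2',
                               random_state=1).fit(X, y)
# Each fitted tree stores the number of features it considers per split
features_per_split = forest.estimators_[0].max_features_

print(expected, features_per_split)
```

Both numbers should agree, which is why e_Random_Forest_100_log2 and an explicit max_features = 3 describe the same setting for this data set.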
  • 5.
In [7]:
# code to set the parameters of the regressors
regressors = [
    LinearRegression(fit_intercept=SET_FIT_INTERCEPT),
    # data was standardized before fitting; the original call also passed
    # normalize=False, an argument removed from recent scikit-learn releases
    Ridge(alpha=50, solver='cholesky',
          fit_intercept=SET_FIT_INTERCEPT,
          random_state=RANDOM_SEED),
    Lasso(alpha=0.01, max_iter=10000, tol=0.01,
          fit_intercept=SET_FIT_INTERCEPT,
          random_state=RANDOM_SEED),
    # max_features=None uses all features; the original call used 'auto',
    # which meant the same for regressors and was removed in recent releases
    RandomForestRegressor(n_estimators=100, bootstrap=True, max_depth=None,
                          max_features=None, random_state=RANDOM_SEED),
    RandomForestRegressor(n_estimators=100, bootstrap=True, max_depth=None,
                          max_features='log2', random_state=RANDOM_SEED),
    RandomForestRegressor(n_estimators=100, bootstrap=True, max_depth=None,
                          max_features=4, random_state=RANDOM_SEED),
    RandomForestRegressor(n_estimators=500, bootstrap=True, max_depth=10,
                          max_features=4, random_state=RANDOM_SEED),
    GradientBoostingRegressor(n_estimators=100, max_depth=3,
                              max_features=None, learning_rate=0.1,
                              subsample=1, random_state=RANDOM_SEED),
    GradientBoostingRegressor(n_estimators=500, max_depth=3,
                              max_features=None, learning_rate=0.1,
                              subsample=1, random_state=RANDOM_SEED),
    GradientBoostingRegressor(n_estimators=500, max_depth=2,
                              max_features=6, learning_rate=0.1,
                              subsample=1, random_state=RANDOM_SEED),
    GradientBoostingRegressor(n_estimators=100, max_depth=3,
                              max_features=None, learning_rate=0.3,
                              subsample=1, random_state=RANDOM_SEED),
    GradientBoostingRegressor(n_estimators=50, max_depth=3,
                              max_features=None, learning_rate=0.3,
                              subsample=1, random_state=RANDOM_SEED),
]
The following code sets up numpy arrays for storing the results as Python iterates over the for loops during the cross-validation. Although the main performance indicator for this study is the Root Mean Squared Error (RMSE) on the test sets, we also collect the RMSE for the training sets, to identify overfitting more easily, as well as a measure of the variance explained by the model: R-squared for the linear regression models and pseudo R-squared for the random forest and gradient boosting models.
  • 6.
In [9]:
# setting up numpy arrays for storing results
N_FOLDS = 10  # number of folds for cross-validation
rmse_test = np.zeros((len(names), N_FOLDS))
rmse_train = np.zeros((len(names), N_FOLDS))
r2_test = np.zeros((len(names), N_FOLDS))
r2_train = np.zeros((len(names), N_FOLDS))
As before, I used a cross-validation design with 10 folds. This means that the data is split into a training set and a test set ten times. The models are trained on each training set and their prediction accuracy is validated on each of the ten test sets.
In [10]:
# specifying the k-fold cross-validation design
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=RANDOM_SEED)
index_for_fold = 0  # fold count initialized
for train_index, test_index in kf.split(model_data):
    # the structure of modeling data for this study has the
    # response variable coming first and explanatory variables later
    # so 1:model_data.shape[1] slices for explanatory variables
    # and 0 is the index for the response variable
    X_train = model_data[train_index, 1:model_data.shape[1]]
    X_test = model_data[test_index, 1:model_data.shape[1]]
    y_train = model_data[train_index, 0]
    y_test = model_data[test_index, 0]
    index_for_method = 0  # initialize
    for name, reg_model in zip(names, regressors):
        # fit on the train set for this fold
        rmodel = reg_model.fit(X_train, y_train)
        # evaluate the model on this fold
        y_test_predict = reg_model.predict(X_test)
        y_train_predict = reg_model.predict(X_train)
        # R-squared
        r2_test[index_for_method, index_for_fold] = r2_score(y_test, y_test_predict)
        r2_train[index_for_method, index_for_fold] = r2_score(y_train, y_train_predict)
        # root mean-squared error
        fold_method_rmse_test = sqrt(mean_squared_error(y_test, y_test_predict))
        fold_method_rmse_train = sqrt(mean_squared_error(y_train, y_train_predict))
        rmse_test[index_for_method, index_for_fold] = fold_method_rmse_test
        rmse_train[index_for_method, index_for_fold] = fold_method_rmse_train
        index_for_method += 1
    index_for_fold += 1
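For a single model, the same train and test metrics per fold can also be collected with scikit-learn's cross_validate helper instead of explicit loops. This is an alternative sketch, not the approach used in this study; the data below is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data standing in for the standardized Boston data
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X[:, 0] - 2.0 * X[:, 1] + 0.3 * rng.normal(size=120)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.3,
                                  max_depth=3, random_state=1)

# scikit-learn reports the negated MSE so that larger is always better
scores = cross_validate(model, X, y, cv=kf,
                        scoring=('neg_mean_squared_error', 'r2'),
                        return_train_score=True)

test_rmse = np.sqrt(-scores['test_neg_mean_squared_error'])
train_rmse = np.sqrt(-scores['train_neg_mean_squared_error'])

print('mean train RMSE:', round(train_rmse.mean(), 3))
print('mean test RMSE: ', round(test_rmse.mean(), 3))
print('mean test R2:   ', round(scores['test_r2'].mean(), 3))
```

The explicit loops in the notebook have the advantage of evaluating all twelve models on exactly the same folds in one pass; the helper is mainly convenient for quick checks of one model at a time.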
  • 7. The following code creates a pandas DataFrame with the results of each fold, then averages the results across all folds.
In [11]:
# creating multilevel index for dataframes
model_name = names  # to avoid confusion in next line
multi_index = pd.MultiIndex.from_product(
    [model_name, np.arange(N_FOLDS)], names=['model', 'fold'])
# the dataframe
fit_results_df = pd.DataFrame(
    np.hstack((rmse_train.reshape(N_FOLDS*len(names), 1),
               rmse_test.reshape(N_FOLDS*len(names), 1),
               r2_train.reshape(N_FOLDS*len(names), 1),
               r2_test.reshape(N_FOLDS*len(names), 1))),
    index=multi_index,
    columns=['Train_RMSE', 'Test_RMSE', 'Train_r2', 'Test_r2'])
# averaging results across folds
av_fit = fit_results_df.groupby('model').mean()
Results
The following table shows the average results of the cross-validation across the 10 folds. The average RMSE for the training sets and the test sets is presented, as well as the coefficient of determination, R-squared.
In [12]:
print('----- Results of cross-validation across 10 folds -----\n\n',
      av_fit.round(3))
----- Results of cross-validation across 10 folds -----

                               Train_RMSE  Test_RMSE  Train_r2  Test_r2
model
a_Linear_Regression                 0.510      0.516     0.739    0.719
b_Ridge_Regression_50               0.521      0.520     0.729    0.716
c_Lasso_Regression_0.01             0.516      0.519     0.734    0.715
d_Random_Forest                     0.129      0.350     0.983    0.863
e_Random_Forest_100_log2            0.126      0.339     0.984    0.878
f_Random_Forest_100_4               0.125      0.328     0.984    0.886
g_Random_Forest_10_500_4            0.140      0.326     0.981    0.888
h_Gradient_Boosting                 0.153      0.310     0.977    0.896
i_Gradient_Boosting_3_500           0.039      0.308     0.998    0.897
j_Gradient_Boosting_2_500_6         0.121      0.342     0.985    0.872
k_Gradient_Boosting_3_100_0.3       0.071      0.304     0.995    0.899
l_Gradient_Boosting_3_50_0.3        0.123      0.301     0.985    0.901
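Because the target variable was standardized together with the features, the RMSE values in the table are expressed in standard deviations of mv rather than in its original units. The sketch below shows how such a value could be converted back; the mv values are made-up numbers for illustration only, not taken from boston.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical median home values (illustrative numbers, not the real data)
mv = np.array([[24.0], [21.6], [34.7], [33.4], [36.2], [28.7]])

scaler = StandardScaler().fit(mv)
mv_sd = scaler.scale_[0]  # standard deviation used by the scaler (ddof=0)

rmse_standardized = 0.301  # best test RMSE reported in the table
# An RMSE in standardized units scales linearly with the target's std
rmse_original_units = rmse_standardized * mv_sd

print(round(mv_sd, 3), round(rmse_original_units, 3))
```

With the real data, the same multiplication by the standard deviation of mv would express the model error in the units in which home values are recorded.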
  • 8. As can be seen above, the default random forest performs well relative to the linear models. Its average test RMSE across the 10 folds is 0.350, well below the 0.516 of the best-performing linear regression model. There is a clear sign of overfitting, given that its average performance on the training data is so much better, with an RMSE of 0.129 and an R-squared of 0.983 (vs. 0.863 on the test sets). Constraining the number of features considered at each split to 3 (the integer part of log2 of 12) improves performance slightly, reducing the test RMSE to 0.339. After several iterations, the best setting for max_features was 4, which reduces the RMSE further to 0.328. There is still overfitting, and the random forest with the number of trees increased to 500 and the depth of the trees limited to 10 reduces the RMSE only slightly, to 0.326. Given the remaining signs of overfitting, a better combination of parameters may well exist for the random forests. However, moving to gradient boosting improved performance more quickly.
The default setting for gradient boosting produced a model with a test RMSE of 0.310 and an R-squared of 0.896. Increasing the number of trees to 500, while keeping the default max_depth of 3 and learning_rate of 0.1, improved performance on the training set considerably (RMSE = 0.039, R-squared = 0.998) and on the test set only slightly (RMSE = 0.308, R-squared = 0.897). So there was an improvement overall, but the model is clearly overfitting the data given its increased complexity. Note that for gradient boosting, unlike random forests, more trees increase model complexity and follow the training data more closely, whereas for random forests more trees reduce the chances of overfitting.
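The effect of adding trees to a gradient boosting model can be traced within a single fit using staged_predict, which yields the predictions after each boosting stage. The sketch below uses synthetic data and a single train/test split (both are assumptions for illustration, not the 10-fold design of this study).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear regression problem (not the Boston data)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.3 * rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                max_depth=3, random_state=1)
gbr.fit(X_train, y_train)

# RMSE after 1, 2, ..., 500 trees on the train and test splits
train_rmse = [np.sqrt(mean_squared_error(y_train, p))
              for p in gbr.staged_predict(X_train)]
test_rmse = [np.sqrt(mean_squared_error(y_test, p))
             for p in gbr.staged_predict(X_test)]

best_n = int(np.argmin(test_rmse)) + 1  # number of trees with lowest test RMSE
print('trees with lowest test RMSE:', best_n)
print('final train RMSE:', round(train_rmse[-1], 3))
print('final test RMSE: ', round(test_rmse[-1], 3))
```

The training error keeps falling as trees are added, while the test error typically flattens or rises after some point, which is the pattern behind the gap between train and test RMSE for the 500-tree model.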
In an attempt to decrease overfitting for the gradient boosting model with 500 trees, I limited the depth of the trees to 2 and introduced randomness in the feature space by restricting the number of features considered to 6. This did not improve performance and actually made the model perform worse than the default gradient boosting (test RMSE of 0.342 vs. 0.310).
Another exploration involved increasing the learning rate while reducing the number of trees. These two parameters are closely related: the more trees, the more complex the model, and the higher the learning rate, the stronger the correction at each iteration. I started with the default number of trees (100) and increased the learning rate to 0.3. This had a larger impact on performance, reducing the RMSE on the test data to 0.304, the best up to that point. The model is likely overfitting, since the RMSE on the training set is 0.071 and the training R-squared is 0.995, so there is still room to improve it slightly. Indeed, reducing the number of trees to only 50, while keeping the learning rate at 0.3 and max_depth = 3, gives the best performance: a test RMSE of 0.301 and a test R-squared of 0.901.
Feature importance
Now that we have a well-performing model, let us explore the contribution and importance of each of the features. To do that, we train the best models using the full data set. The following code collects the full data for the features and the target variable.
In [13]:
# X and y train values for full dataset
X_f = model_data[:, 1:model_data.shape[1]]
y_f = model_data[:, 0]
The following code fits the linear regression models on the full data set and stores the regression coefficients. The intercept is not collected, since we only care about the importance of the features for comparability with the tree models.
  • 9.
In [14]:
# setting the numpy array to collect the
# coefficients for the three linear models
regression_coef = np.zeros((len(names[:3]),  # [:3] slices for just the
                                             # linear regression models
                            model_data.shape[1]-1))
# same code as before but using .coef_ to collect coefficients
index_for_method = 0  # initialize
for name, reg_model in zip(names[:3], regressors[:3]):
    # fit on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    # regression coefficients (features)
    regression_coef[index_for_method] = reg_model.coef_
    index_for_method += 1
The following code does the same for the tree models, using .feature_importances_ instead of .coef_ to get the importance of each feature.
In [15]:
# setting the numpy array to collect the
# feature importance for the tree models
feature_importance = np.zeros((len(names[3:]),  # [3:] slices for the
                                                # tree models
                               model_data.shape[1]-1))
index_for_method = 0  # initialize
for name, reg_model in zip(names[3:], regressors[3:]):
    # fit on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    # feature importance
    feature_importance[index_for_method] = rmodel.feature_importances_
    index_for_method += 1
The following two code cells create pandas DataFrames with the coefficients for the linear regression models and the feature importance for the tree models. Note that, for ease of visualization with seaborn plots, the shape of each DataFrame is changed to a tidy format.
In [16]:
# creating a dataframe for storing the feature importance for each model
column_names = var_names[1:].copy()  # using the list of variable names
# the dataframe
feature_importance_pd = pd.DataFrame(feature_importance)
feature_importance_pd.columns = column_names
feature_importance_pd.index = names[3:]
# creating a subset of the dataframe for the best
# performing ensemble of trees models for visualization
fi_df = feature_importance_pd.loc[['d_Random_Forest',
                                   'g_Random_Forest_10_500_4',
                                   'h_Gradient_Boosting',
                                   'l_Gradient_Boosting_3_50_0.3'], :]
# reshaping the layout for ease of visualization
fi_df = fi_df.stack().reset_index()  # making it tidy
fi_df.columns = ['model', 'feature', 'importance']
  • 10. To compare the importance of each feature across the linear and tree models, I converted the regression coefficients of the linear models to their absolute values.
In [17]:
# creating a dataframe for storing the regression coefficients
column_names = var_names[1:].copy()  # using the list of variable names
# the dataframe
regression_coef_pd = pd.DataFrame(regression_coef)
regression_coef_pd.columns = column_names
regression_coef_pd.index = names[:3]
# reshaping the layout for ease of visualization
cf_df = regression_coef_pd
cf_df = cf_df.stack().reset_index()  # making it tidy
cf_df.columns = ['model', 'feature', 'abs_magnitude']
# converting the magnitude of the coefficients to absolute values
cf_df['abs_magnitude'] = np.absolute(cf_df['abs_magnitude'])
The following plot compares the absolute magnitude of the linear regression coefficients with the importance given to the features by the default and best random forest and gradient boosting models.
  • 11.
In [18]:
sns.set()
fig, axis = plt.subplots(1, 2, figsize=(12, 8))
ax = plt.suptitle("__________ Feature Importance __________")
ax = plt.subplot(1, 2, 1)
ax = plt.title('feature coefficients (abs value)')
ax = sns.stripplot(data=cf_df, size=10,
                   x='abs_magnitude', y='feature', hue='model',
                   palette="Blues")
ax = plt.subplot(1, 2, 2)
ax = plt.title('feature importance')
ax = sns.stripplot(data=fi_df, size=10,
                   x='importance', y='feature', hue='model',
                   palette="Reds")
[Figure: Feature Importance. Left panel: feature coefficients (abs value). Right panel: feature importance.]
As can be seen, the linear regression models and the tree-based models show different patterns. The two measures of importance are not the same, so direct comparison has to be done carefully. Even so, relative to the importance of the number of rooms (rooms) and the percent of the population of lower socio-economic status (lstat), the tree-based models give much less importance to the other features. In fact, the importance of zn, indus, chas and rad is 0 or close to it, except for one of the optimized random forests, which gives some small importance to indus. This model also has the most balanced importance relative to the other tree-based models.
The best performing gradient boosting model relies mostly on the number of rooms and the percent of the population of low socio-economic status, as well as a little on the distance (dis) to employment centers. Some smaller importance is given to the pupil/teacher ratio in public schools (ptratio) and air pollution (nox).
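One reason the two measures need careful comparison is that the impurity-based importances reported by scikit-learn's tree ensembles are normalized to sum to 1, while absolute regression coefficients are on an arbitrary scale. The sketch below illustrates this on synthetic data; rescaling the absolute coefficients to shares is an assumption made here for side-by-side reading, not a step taken in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data where the first feature matters most (not the Boston data)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.2 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)
linear = LinearRegression().fit(X, y)

importances = forest.feature_importances_  # already sums to 1
abs_coef = np.abs(linear.coef_)            # arbitrary scale
coef_share = abs_coef / abs_coef.sum()     # illustrative rescaling to shares

print('tree importances:  ', np.round(importances, 3))
print('abs coefficients:  ', np.round(abs_coef, 3))
print('coefficient shares:', np.round(coef_share, 3))
```

Even after rescaling, the two quantities answer different questions (share of impurity reduction versus size of a linear effect on standardized features), so the comparison stays qualitative.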
  • 12. Conclusion
Ensemble models based on trees are among the most widely used models in machine learning, given their strong performance and ease of training. This study adds evidence that these models are indeed strong tools for supervised learning. With the optimized gradient boosting model, the real estate brokerage firm can use this machine learning technique to estimate the values of houses in Boston at the time. Particular attention is to be given to the number of rooms and the percentage of the population of lower socio-economic status. These might be the most obvious features, however, so additional attention is to be given to the distance from employment centers and the level of air pollution. The crime rate and the tax rate are less important but still contribute to the precision of the model. The model's predictions would likely give a more accurate estimate of the value of residential real estate than even the assessment of an expert, so it is strongly advised that the firm use the model as the primary method and complement it with more traditional approaches if desired.
References:
Thomas W. Miller. Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python. Pearson Education, Old Tappan, N.J., 2015. Data sets and programs available at http://www.ftpress.com/miller/ and https://github.com/mtpa/.
Appendix