Random Forest and Gradient Boosting

2/3/2019 Ribero_Esteban_Assignment 4
Esteban Ribero, Assignment #4 - MSDS 422 | Winter 2019
Evaluating Random Forests and Gradient Boosting for Regression
Purpose and summary of results
The purpose of this exercise is to extend the analysis of the regression methods from assignment 2 by employing
Random Forests and Gradient Boosting Regression Trees and comparing them with the best linear regression
models used previously. A cross-validation design was used for the comparisons. The data set used to perform the
study is again the Boston Housing Study, which contains 506 census tract observations and 13 variables. The
target variable to predict is the median value of homes in Boston in 1970. Table 1, extracted from Miller (2015)
and shown in the appendix, describes the variables contained in the data set.
Several versions of random forest and gradient boosting were fitted to the data and cross-validated while varying
the hyperparameters. Random forest and gradient boosting performed substantially better than all the linear
regression models used previously. The default settings for these two methods already produced strong results,
reducing the Root Mean Squared Error (RMSE) across the test sets from about 0.52 to 0.35 or less, a 32%
improvement. Tuning some hyperparameters led to an optimized gradient boosting model with an RMSE of
0.30 and a 'pseudo R-squared' of 0.90, well above the ~0.72 for the linear regression models. A comparison of
the feature importance is provided, and some managerial implications for a real estate brokerage firm are drawn
from the study.
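The size of the improvement quoted above can be checked with a line of arithmetic. The sketch below uses the averaged test RMSE values reported in the results table of this study; the helper name `relative_reduction` is hypothetical and only serves the illustration.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline

# Averaged test RMSE values taken from the cross-validation results table
rmse_linear = 0.516      # a_Linear_Regression
rmse_rf_default = 0.350  # d_Random_Forest
rmse_gb_best = 0.301     # l_Gradient_Boosting_3_50_0.3

rf_gain = relative_reduction(rmse_linear, rmse_rf_default)
gb_gain = relative_reduction(rmse_linear, rmse_gb_best)
print(f"default random forest: {rf_gain:.1%}")
print(f"best gradient boosting: {gb_gain:.1%}")
```

The first figure corresponds to the 32% improvement mentioned in the summary; the second shows how much further the tuned gradient boosting model goes against the same baseline.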
Loading the required packages
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
import seaborn as sns # pretty plotting, including heat map
import sklearn.linear_model # modeling routines from Scikit Learn
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt # for root mean-squared error calculation
from sklearn.preprocessing import StandardScaler #for scaling the data
from sklearn.model_selection import KFold #for cross-validation
#importing the regressors to be used
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
In [2]: #setup for displaying multiple outputs from a single Jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The data
In [3]: # loading the data into a dataframe
boston_input = pd.read_csv('boston.csv')
# drop neighborhood from the data being considered
boston = boston_input.drop('neighborhood', axis=1)
#Setting up the data for fitting the models into numpy arrays
prelim_model_data = np.array([boston.mv,
boston.crim,
boston.zn,
boston.indus,
boston.chas,
boston.nox,
boston.rooms,
boston.age,
boston.dis,
boston.rad,
boston.tax,
boston.ptratio,
boston.lstat]).T
The data was already explored in the prior analysis, and some concerns were raised regarding the distribution of
some explanatory variables as well as the presence of several outliers and extreme outliers. For the
purpose of this and the prior analysis, these concerns were not addressed, and only a simple standardization of the
data using the StandardScaler was performed. This was done for the linear regression models, which are
sensitive to large differences in scale. Although this is not necessary for random forest and gradient
boosting, we used the same standardized data set for ease of comparison. The standardization centers each
variable at 0 and expresses it in units of its standard deviation. This was done even for the target variable, so
the RMSE values reported in this study are in standard-deviation units of the median home value.
In [4]: # Scaling the data using standardization
scaler = StandardScaler()
model_data = scaler.fit_transform(prelim_model_data)
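Because the target variable is standardized along with the features, an error measured on it is expressed in standard deviations rather than in the original units. The following is a minimal sketch of that relationship on synthetic numbers (not the Boston data); the names `y_raw`, `scaler_y`, and `rmse_std_units` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a target column (NOT the Boston data)
rng = np.random.default_rng(1)
y_raw = rng.normal(loc=22.0, scale=9.0, size=(200, 1))

scaler_y = StandardScaler()
y_std = scaler_y.fit_transform(y_raw)

# StandardScaler computes (x - mean) / std using the population std (ddof=0)
manual = (y_raw - y_raw.mean(axis=0)) / y_raw.std(axis=0)
print(np.allclose(y_std, manual))

# An RMSE measured on the standardized target is in standard-deviation units;
# multiplying by the fitted scale converts it back to the original units.
rmse_std_units = 0.30
rmse_raw_units = rmse_std_units * scaler_y.scale_[0]
print(rmse_raw_units)
```

Keeping a separate scaler for the target, as sketched here, is one way to report errors in the original units; the notebook itself scales all columns with a single scaler.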
The following boxplot shows the results of the standardization for reference

In [5]: #Boxplot for standardized variables
var_names = ['mv', 'crim','zn', 'indus', 'chas', 'nox',
'rooms','age','dis','rad','tax','ptratio','lstat']
model_data_df = pd.DataFrame(model_data, columns = var_names)
fig, axis = plt.subplots(figsize=(12,10))
ax = plt.title('Boxplot for Standardized Features')
ax = sns.boxplot(data=model_data_df, orient="h")
As can be observed in the boxplot, all the variables have been centered and standardized. This process
maintains the shape of the distributions, so the differences in the range of values, the distributions, and the
presence of outliers can be easily observed.
Regression models and cross-validation
As before, the following code sets up the models to be evaluated. In this study, I compare the
best linear regression models from the prior exercise with several random forest and gradient boosting models.
The best-performing linear regressions used for comparison are the baseline linear regression (no
regularization), the Ridge regression with alpha = 50, and the Lasso regression with alpha = 0.01.

In [6]: #Setup code for regression models being considered
RANDOM_SEED = 1 #to obtain reproducible results
SET_FIT_INTERCEPT = True #to include intercept in the regression
##Specifying the set of regression models being evaluated
names = ['a_Linear_Regression',
'b_Ridge_Regression_50',
'c_Lasso_Regression_0.01',
'd_Random_Forest',
'e_Random_Forest_100_log2',
'f_Random_Forest_100_4',
'g_Random_Forest_10_500_4',
'h_Gradient_Boosting',
'i_Gradient_Boosting_3_500',
'j_Gradient_Boosting_2_500_6',
'k_Gradient_Boosting_3_100_0.3',
'l_Gradient_Boosting_3_50_0.3']
For the set of Random Forests I used the default parameters from scikit-learn first, then fine-tuned them
iteratively until I got to a satisfactory result. After a full exploration I ended up with four versions. Sampling with
replacement (bootstrap) was used for all four:
d_Random_Forest (the default):
With 100 trees, unconstrained depth for the trees, and the ability to use all features.
e_Random_Forest_100_log2:
With 100 trees as well and unconstrained depth, but constraining the number of features considered at each
split to log2 of the number of features, which in this case is equivalent to setting max_features = 3 (log2 of
12, rounded down). This creates random exploration over the features, leading to more diversity among the trees.
f_Random_Forest_100_4:
Same as before but increasing the number of features considered at each split to 4.
g_Random_Forest_10_500_4:
With 500 trees, so that predictions are averaged across more trees to reduce the chance of overfitting, a
maximum depth of 10 for each tree with the same goal, and max_features set to 4.
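The claim that `max_features='log2'` amounts to 3 features for this data set can be checked directly. The sketch below assumes the rule is the floor of log2 of the number of features (with a minimum of 1) and confirms it against the value recorded by a fitted tree; the data is synthetic and only needs to have 12 columns.

```python
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor

n_features = 12  # number of explanatory variables in this study

# Rule as understood here: floor of log2(n_features), at least 1
expected = max(1, int(math.log2(n_features)))
print(expected)

# Synthetic data with 12 columns, only to inspect the fitted trees
rng = np.random.default_rng(1)
X = rng.normal(size=(60, n_features))
y = X[:, 0] + rng.normal(scale=0.1, size=60)

forest = RandomForestRegressor(n_estimators=5, max_features='log2',
                               random_state=1).fit(X, y)
# Each fitted tree records the resolved number of features tried per split
resolved = forest.estimators_[0].max_features_
print(resolved)
```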
For the set of Gradient Boosting Regression Trees, I ended up with five versions:
h_Gradient_Boosting (the default):
With 100 trees, maximum depth of 3, and the learning rate = 0.1.
i_Gradient_Boosting_3_500:
Same as above but with 500 trees for a more complex model.
j_Gradient_Boosting_2_500_6:
Same as before but reducing the maximum depth to 2 in an attempt to reduce overfitting with shallower trees,
as well as limiting the number of features considered at each split to 6.
k_Gradient_Boosting_3_100_0.3:
Similar to the default model but increasing the learning rate to 0.3.
l_Gradient_Boosting_3_50_0.3:
Same as above but reducing the number of trees to 50 to reduce model complexity and balance the
increased learning rate.

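In [7]: #code to set the parameters of the regressors
#random_state on the tree ensembles is an assumption (not visible in the source)
regressors = [
    LinearRegression(fit_intercept = SET_FIT_INTERCEPT),
    Ridge(alpha = 50, solver = 'cholesky', # data was standardized before
          fit_intercept = SET_FIT_INTERCEPT,
          random_state = RANDOM_SEED),
    Lasso(alpha = 0.01, max_iter=10000, tol=0.01,
          fit_intercept = SET_FIT_INTERCEPT),
    RandomForestRegressor(n_estimators = 100, bootstrap=True,
          max_depth = None, max_features=1.0, # all features ('auto' in older scikit-learn)
          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 100, bootstrap=True,
          max_depth = None, max_features= 'log2',
          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 100, bootstrap=True,
          max_depth = None, max_features= 4,
          random_state = RANDOM_SEED),
    RandomForestRegressor(n_estimators = 500, bootstrap=True,
          max_depth = 10, max_features= 4,
          random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 100,
          max_depth = 3, max_features=None,
          learning_rate = 0.1, subsample = 1.0, random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 500,
          max_depth = 3, max_features=None,
          learning_rate = 0.1, subsample = 1.0, random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 500,
          max_depth = 2, max_features=6,
          learning_rate = 0.1, subsample = 1.0, random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 100,
          max_depth = 3, max_features=None,
          learning_rate = 0.3, subsample = 1.0, random_state = RANDOM_SEED),
    GradientBoostingRegressor(n_estimators = 50,
          max_depth = 3, max_features=None,
          learning_rate = 0.3, subsample = 1.0, random_state = RANDOM_SEED),
]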
The following code sets up numpy arrays for storing the results as Python iterates over the for loops during the
cross-validation. Although the main performance indicator for this study is the Root Mean Squared Error (RMSE)
on the test sets, we also collect the RMSE for the training sets to more easily identify overfitting, as well as a
measure of the variance explained by the model: R-squared for the linear regression models and pseudo
R-squared for the random forest and gradient boosting models.

In [9]: #Setting up numpy arrays for storing results
N_FOLDS = 10 #number of folds for cross-validation
rmse_test = np.zeros((len(names), N_FOLDS))
rmse_train = np.zeros((len(names), N_FOLDS))
r2_test = np.zeros((len(names), N_FOLDS))
r2_train = np.zeros((len(names), N_FOLDS))
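For reference, the two metrics collected in this study can be written out by hand. The sketch below computes RMSE and R-squared from their definitions on a small made-up vector and compares them with the scikit-learn functions imported earlier; `r2_score` applies the same formula whether the predictions come from a linear model or from a tree ensemble.

```python
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example, not taken from the study
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.5])

# RMSE: square root of the mean squared residual
rmse_manual = sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```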
As before, I used a cross-validation design with 10 folds. This means that the data is split into a training set
and a test set ten times. The models are trained on each training set and their prediction accuracy is validated
on each of the ten test sets.
In [10]: # specifying the k-fold cross-validation design
kf = KFold(n_splits = N_FOLDS, shuffle=True, random_state = RANDOM_SEED)
index_for_fold = 0 # fold count initialized
for train_index, test_index in kf.split(model_data):
    # the structure of modeling data for this study has the
    # response variable coming first and explanatory variables later
    # so 1:model_data.shape[1] slices for explanatory variables
    # and 0 is the index for the response variable
    X_train = model_data[train_index, 1:model_data.shape[1]]
    X_test = model_data[test_index, 1:model_data.shape[1]]
    y_train = model_data[train_index, 0]
    y_test = model_data[test_index, 0]
    index_for_method = 0 # initialize
    for name, reg_model in zip(names, regressors):
        ## fit on the train set for this fold
        rmodel = reg_model.fit(X_train, y_train)
        ## evaluate the model on the test and train sets for this fold
        y_test_predict = reg_model.predict(X_test)
        y_train_predict = reg_model.predict(X_train)
        #R-squared
        r2_test[index_for_method, index_for_fold] = r2_score(y_test, y_test_predict)
        r2_train[index_for_method, index_for_fold] = r2_score(y_train, y_train_predict)
        #Root-mean squared error
        fold_method_rmse_test = sqrt(mean_squared_error(y_test, y_test_predict))
        fold_method_rmse_train = sqrt(mean_squared_error(y_train, y_train_predict))
        rmse_test[index_for_method, index_for_fold] = fold_method_rmse_test
        rmse_train[index_for_method, index_for_fold] = fold_method_rmse_train
        index_for_method += 1
    index_for_fold += 1
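The fold sizes implied by this design can be checked without the data itself. The sketch below splits a placeholder array with 506 rows, the number of observations in this study, using the same `KFold` settings; the names `placeholder` and `kf_demo` are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_obs = 506  # number of observations in the data set
placeholder = np.zeros((n_obs, 1))  # KFold only needs the number of rows

kf_demo = KFold(n_splits=10, shuffle=True, random_state=1)
test_sizes = []
seen = []
for train_idx, test_idx in kf_demo.split(placeholder):
    test_sizes.append(len(test_idx))
    seen.extend(test_idx)

print(test_sizes)
# Every observation lands in exactly one test fold
print(sorted(seen) == list(range(n_obs)))
```

Since 506 is not divisible by 10, the test folds are not all the same size, so each model is trained on roughly 455 observations and evaluated on roughly 50 in every fold.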

The following code creates a pandas DataFrame with the results of each fold, then averages the results across all
folds.
In [11]: ##creating multilevel index for dataframes
model_name = names #to avoid confusion in next line
multi_index = pd.MultiIndex.from_product(
[model_name, np.arange(N_FOLDS)],
names=['model','fold'])
##the dataframe
fit_results_df = pd.DataFrame(
    np.hstack((rmse_train.reshape(N_FOLDS*len(names),1),
               rmse_test.reshape(N_FOLDS*len(names),1),
               r2_train.reshape(N_FOLDS*len(names),1),
               r2_test.reshape(N_FOLDS*len(names),1))),
    index=multi_index,
    columns=['Train_RMSE','Test_RMSE',
             'Train_r2','Test_r2'])
##averaging results across folds
av_fit = fit_results_df.groupby('model').mean()
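The reshape above relies on the result arrays and the multilevel index being in the same order. A small sketch with two hypothetical models and three folds shows why this holds: a row-major reshape of a (models x folds) array lists all folds of the first model, then all folds of the second, which is the order `MultiIndex.from_product` generates.

```python
import numpy as np
import pandas as pd

demo_names = ['model_a', 'model_b']   # hypothetical model labels
n_folds = 3

# Rows are models, columns are folds, as in the result arrays above
scores = np.array([[0.1, 0.2, 0.3],
                   [0.7, 0.8, 0.9]])

index = pd.MultiIndex.from_product([demo_names, np.arange(n_folds)],
                                   names=['model', 'fold'])
# Row-major reshape walks all folds of the first model, then the second,
# which is the same order from_product generates
df = pd.DataFrame(scores.reshape(n_folds * len(demo_names), 1),
                  index=index, columns=['Test_RMSE'])

first_b = df.loc[('model_b', 0), 'Test_RMSE']
means = df.groupby('model').mean()
print(first_b)
print(means)
```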
Results
The following table shows the average results of the cross-validation across the 10 folds. The average RMSE for
the training sets and the test sets is presented, as well as the coefficient of determination, R-squared.
In [12]: print('----- Results of cross-validation across 10 folds -----\n\n',
      round(av_fit, ndigits=3))
----- Results of cross-validation across 10 folds -----

                               Train_RMSE  Test_RMSE  Train_r2  Test_r2
model
a_Linear_Regression                 0.510      0.516     0.739    0.719
b_Ridge_Regression_50               0.521      0.520     0.729    0.716
c_Lasso_Regression_0.01             0.516      0.519     0.734    0.715
d_Random_Forest                     0.129      0.350     0.983    0.863
e_Random_Forest_100_log2            0.126      0.339     0.984    0.878
f_Random_Forest_100_4               0.125      0.328     0.984    0.886
g_Random_Forest_10_500_4            0.140      0.326     0.981    0.888
h_Gradient_Boosting                 0.153      0.310     0.977    0.896
i_Gradient_Boosting_3_500           0.039      0.308     0.998    0.897
j_Gradient_Boosting_2_500_6         0.121      0.342     0.985    0.872
k_Gradient_Boosting_3_100_0.3       0.071      0.304     0.995    0.899
l_Gradient_Boosting_3_50_0.3        0.123      0.301     0.985    0.901

As can be seen above, the default random forest performs well relative to the linear models. Its average
test RMSE across the 10 folds is 0.350, well below the 0.516 of the best-performing linear regression model.
There is a clear sign of overfitting, given that its average performance on the training data is so much better, with an
RMSE of 0.129 and an R-squared of 0.983 (vs 0.863 on the test set). Constraining the number of features
considered at each split to 3 (log2 of 12, rounded down) improves performance slightly, reducing the RMSE to 0.339.
After several iterations, the best setting for max_features was 4, which reduces the RMSE further to 0.328. There is
still overfitting, so I increased the number of trees to 500 and limited the depth of the trees to 10, which
reduces the RMSE slightly to 0.326.
It is possible that an even better combination of parameters would improve the performance of the
random forests, given that there are still signs of overfitting. However, moving to gradient boosting improved
performance more quickly: the default setting for gradient boosting provided a model with a test RMSE of
0.310 and an R-squared of 0.896. Increasing the number of trees to 500 while keeping the default
max_depth of 3 and learning_rate of 0.1 improved performance on the training set considerably (RMSE
= 0.039, R-squared = 0.998) and performance on the test set only slightly (RMSE = 0.308, R-squared = 0.897). So
there was an overall improvement, but the model is clearly overfitting the data given its increased complexity.
Note that for gradient boosting, unlike random forest, adding trees increases complexity and follows the training
data more closely, while for random forests adding trees reduces the chances of overfitting.
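The effect of adding trees to a gradient boosting model can be traced one tree at a time with `staged_predict`. The following is a minimal sketch on a synthetic regression problem (not the Boston data, so the numbers will differ from those reported here); it records training and test RMSE after each boosting stage.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (NOT the Boston data)
X, y = make_regression(n_samples=300, n_features=12, noise=20.0,
                       random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                learning_rate=0.1, random_state=1)
gbr.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after each added tree
train_rmse = [np.sqrt(mean_squared_error(y_tr, p))
              for p in gbr.staged_predict(X_tr)]
test_rmse = [np.sqrt(mean_squared_error(y_te, p))
             for p in gbr.staged_predict(X_te)]

best_n = int(np.argmin(test_rmse)) + 1
print(f"train RMSE after 1 tree: {train_rmse[0]:.2f}, "
      f"after 200 trees: {train_rmse[-1]:.2f}")
print(f"lowest test RMSE at {best_n} trees")
```

Comparing the two curves is one way to choose the number of trees for a given learning rate, instead of refitting a separate model for each candidate value.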
In an attempt to decrease overfitting for the gradient boosting model with 500 trees, I limited the depth of the
trees to 2 and introduced randomness in the feature space by restricting the number of features considered at each
split to 6. This did not improve performance and actually made the model perform worse than the default gradient
boosting. I then explored increasing the learning rate while reducing the number of trees. These two parameters
are closely related: the more trees, the more complex the model, and the higher the learning rate, the stronger
the correction at each iteration. I started with the default number of trees (100) and increased the learning rate
to 0.3. This had a larger impact on performance, reducing the RMSE on the test data to 0.304, the best up to that
point. The model is likely overfitting, since the RMSE on the training set is 0.071 and the training R-squared is
0.995, so there is still a chance to improve it slightly. Indeed, reducing the number of trees to only 50 while
keeping the learning rate at 0.3 and max_depth = 3 provides the best performance: test RMSE of 0.301 and test
R-squared of 0.901.
Feature importance
Now that we have a well-performing model, let us explore the contribution and importance of each feature.
To do that, we will train the models using the full data set.
The following code collects the full data for the features and the target variable.
In [13]: #X and Y train values for full dataset
X_f = model_data[:, 1:model_data.shape[1]]
y_f = model_data[:, 0]
The following code fits the linear regression models to the full data set and stores the regression
coefficients. The intercept is not collected since we only care about the importance
of the features, for comparability with the tree models.

In [14]: #setting the numpy array to collect the
#coefficients for the three linear models
regression_coef = np.zeros((len(names[:3]), #[:3] slices for just the
                                            #linear regression models
                            model_data.shape[1]-1))
#same code as before but using .coef_ to collect coefficients
for index_for_method, (name, reg_model) in enumerate(
        zip(names[:3], regressors[:3])):
    # fit on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    #regression coefficients (features)
    regression_coef[index_for_method] = reg_model.coef_
The following code does the same for the tree models but uses .feature_importances_ instead of .coef_ to get
the importance of each feature.
In [15]: #setting the numpy array to collect the
#feature importance for the tree models
feature_importance = np.zeros((len(names[3:]), #[3:] slices for the
                                               #trees models
                               model_data.shape[1]-1))
for index_for_method, (name, reg_model) in enumerate(
        zip(names[3:], regressors[3:])):
    # fit on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    #feature importance
    feature_importance[index_for_method] = rmodel.feature_importances_
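Unlike regression coefficients, the values in `feature_importances_` are non-negative and sum to 1, so each value can be read as a share of the total importance. The sketch below illustrates this on synthetic data (not the Boston data) in which the target depends mostly on the first column; the model settings mirror the best gradient boosting configuration only as an example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (NOT the Boston data): the target depends strongly on
# column 0, weakly on column 1, and not at all on column 2
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                  learning_rate=0.3, random_state=1)
model.fit(X, y)

importances = model.feature_importances_
print(importances.round(3))
print(importances.sum())
```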
The following two blocks of code create pandas dataframes with the feature importance for the tree models
and the coefficients for the linear regression models. Note that for ease of visualization with seaborn plots,
each dataframe is reshaped to a tidy format.
In [16]: #creating a dataframe for storing the feature importance for each model
column_names = var_names[1:].copy() #using the list of variables names
##the dataframe
feature_importance_pd = pd.DataFrame(feature_importance)
feature_importance_pd.columns = column_names
feature_importance_pd.index = names[3:]
#creating a subset of the dataframe for the best
#performing ensemble of trees models for visualization
fi_df = feature_importance_pd.loc[['d_Random_Forest',
                                   'g_Random_Forest_10_500_4',
                                   'h_Gradient_Boosting',
                                   'l_Gradient_Boosting_3_50_0.3'],:]
#reshaping the layout for ease of visualization
fi_df = fi_df.stack().reset_index() #making it tidy
fi_df.columns = ['model','feature','importance']

To properly compare the importance of each feature across linear and trees models, I converted the regression
coefficients of the linear models to their absolute values.
In [17]: #creating a dataframe for storing the regression coefficients
column_names = var_names[1:].copy() #using the list of variables names
##the dataframe
regression_coef_pd = pd.DataFrame(regression_coef)
regression_coef_pd.columns = column_names
regression_coef_pd.index = names[:3]
#reshaping the layout for ease of visualization
cf_df = regression_coef_pd
cf_df = cf_df.stack().reset_index()#making it tidy
cf_df.columns = ['model','feature','abs_magnitude']
#converting the magnitude of the coefficients to absolute values
cf_df['abs_magnitude'] = np.absolute(cf_df['abs_magnitude'])
The following plot compares the absolute magnitude of the linear regression coefficients with the importance
given to the features by the default and best random forest and gradient boosting models.

In [18]: sns.set()
fig, axis = plt.subplots(1,2, figsize=(12,8))
ax = plt.suptitle("__________ Feature Importance __________")
ax = plt.subplot(1,2,1)
ax = plt.title('feature coefficients (abs value)')
ax = sns.stripplot(data=cf_df, size=10, x='abs_magnitude',
y='feature', hue='model', palette ="Blues")
ax = plt.subplot(1,2,2)
ax = plt.title('feature importance')
ax = sns.stripplot(data=fi_df, size=10, x='importance',
y='feature', hue='model', palette ="Reds")
As can be seen, the linear regression models and the tree-based models show different patterns.
The two measures of importance are not the same, so direct comparison has to be done carefully.
Still, relative to the importance of the number of rooms (rooms) and the percent of the population of lower
socio-economic status (lstat), the tree-based models give much less importance to the other features. In fact, the
importance of zn, indus, chas, and rad is 0 or close to it, except for one of the optimized random forests, which
gives some small importance to indus. This model also has the most balanced importance among the
tree-based models.
The best performing gradient boosting model relies mostly on the number of rooms and the percent of the
population of low socioeconomic status as well as a little on the distance (dis) to employment centers. Some
smaller importance is given to the pupil/teacher ratio in public schools (ptratio) and air pollution (nox).
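Since absolute coefficients and tree-based importances are on different scales, one possible way to place them side by side is to rescale the absolute coefficients so that they sum to 1, as the importances already do. This is not what was done in this study; the sketch below only illustrates the idea, and all numbers in it are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute coefficients for three features (illustrative only)
abs_coef = pd.Series({'rooms': 0.6, 'lstat': 0.8, 'dis': 0.6})
# Hypothetical tree-based importances, which already sum to 1
tree_importance = pd.Series({'rooms': 0.45, 'lstat': 0.45, 'dis': 0.10})

# Rescale the absolute coefficients into shares of their total
coef_share = abs_coef / abs_coef.sum()

comparison = pd.DataFrame({'linear_share': coef_share,
                           'tree_importance': tree_importance})
print(comparison)
```

Even after rescaling, the two columns measure different things (effect size per standard deviation versus contribution to impurity reduction), so the comparison remains qualitative.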

Conclusion
Ensemble models based on trees are among the most widely used models in machine learning given their
strong performance and ease of training. This study adds evidence that these models are indeed effective tools for
supervised learning. With the optimized gradient boosting model, the real estate brokerage firm can confidently
use this machine learning technique to estimate the values of houses in Boston at the time. Particular attention
should be given to the number of rooms and the percentage of the population of lower socio-economic status. These
may be the most obvious features, however, so additional attention should be given to the distance from
employment centers and the level of air pollution. The crime rate and the tax rate are less important but still
contribute to the precision of the model. The model’s predictions would likely be a more accurate
estimate of the value of residential real estate than even the assessment of an expert, so it is strongly advised
that the firm use the model as the primary method and complement it with more traditional approaches if
desired.
References:
Thomas W. Miller. Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python.
Pearson Education, Old Tappan, N.J., 2015. Data sets and programs available at http://www.ftpress.com/miller/
and https://github.com/mtpa/.
Appendix
