Ensemble Modelling - Assignment 3 - DA

Ensemble Modeling
Assignment 3 - Data Analytics
Syam Murali (A0134602U)
Arvind Kozhiyalam (A0134599N)
Upma Vermani (A0134605M)
Arun Sankar (A0134606X)

Executive Summary
Overview
Currently, the business places its orders for the next day based on the previous day’s demand.
However, several factors affect bike rental demand, and it is essential that they are considered while
forecasting. The proposed model takes these factors into account when generating forecasts.
Comparison of Model Forecasts for July 2012 – Dec 2012
Model            Profit ($)   % increase in profit over current model
Current Model    794,128      0
Proposed Model   1,004,036    26.4%
Through the implementation of the proposed model, the business stands to gain an additional profit of
$209,908 (26.4%).
The proposed model is an ensemble of linear regression models built on 18 months of data (Jan 2011 to
Jun 2012). The profit projections are tested on July 2012 – Dec 2012 data.
Recommendations
1. During weekdays, demand is high between 7am - 9am and 5pm - 7pm, moderate towards the
afternoon and negligible at night. It is recommended that the business ensure that sufficient
bikes are made available during peak hours based on the model’s hourly predictions.
2. Bike demand is highly seasonal, with the winter season having the highest overall demand.
3. Registered users are the primary business drivers; they rent on a regular basis to commute to
work and back, while casual users show occasional high demand when weather conditions are
suitable and on holidays and weekends. Significant differences were observed between the usage
patterns of casual and registered users, and the business should forecast for these two sets of
users separately.
4. The demand for casual bike rentals is mainly affected by 3 factors: temperature, weekend and
humidity. Demand for casual rentals increases sharply during weekends. It is recommended
that the business account for increased casual bike demand during the weekends.
5. High demand for rentals from registered users occurs during winter months. Demand also
depends on the day of the week, decreasing during weekends. Since demand from registered
users is lower during the weekends, the business can cut costs by not overstocking (based on
the model’s predictions).
6. Higher temperatures lead to an increase in demand, whereas an increase in humidity leads to a
lowering of demand. Bikers prefer warm days which are not humid.
7. The proposed model’s performance declines with age. Therefore, the business should also consider
retraining the model frequently, preferably every month.
8. Test data for multiple years would help generate better models.

Data Cleaning and Exploratory data analysis
Data Description
The data for analysis is from a two-year log of bikes being rented in a bike sharing system in
Washington, D.C., USA, known as Capital Bike Sharing (CBS).
Hourly data was considered for the analysis. The data set has a total of 17,379 hourly observations.
List of Variables
S.No.  Variable     Description                                                  Variable Type
1      instant      Record index
2      dteday       Date                                                         Continuous
3      season       Season                                                       Categorical
4      yr           Year                                                         Categorical
5      mnth         Month                                                        Categorical
6      hr           Hour                                                         Categorical
7      holiday      Whether day is holiday or not                                Flag
8      weekday      Day of the week                                              Categorical
9      workingday   1 if day is neither weekend nor holiday, otherwise 0         Flag
10     weathersit   1. Clear, Few clouds, Partly cloudy                          Categorical
                    2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
                    3. Light Snow, Light Rain + Thunderstorm + Scattered clouds,
                       Light Rain + Scattered clouds
                    4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
11     temp         Normalized temperature in Celsius                            Continuous
12     atemp        Normalized feeling temperature in Celsius                    Continuous
13     hum          Normalized humidity                                          Continuous
14     windspeed    Normalized wind speed                                        Continuous
15     casual       Count of casual users                                        Continuous
16     registered   Count of registered users                                    Continuous
17     cnt          Count of total rental bikes including both casual and        Continuous
                    registered
Data Cleaning
Outliers
The predictor variables were inspected for unusual values.
Boxplots of the predictor variables indicated that all values were within an acceptable range.
On inspecting the boxplot for the variable "cnt" (total count), some extreme values were noticed.

Total Bike Rentals
Casual Bike Rentals
Registered Bike Rentals
However, the outliers do not seem to be very extreme. Moreover, on aggregating the hourly data to daily
level, the values are at an acceptable level.
Since the values are within acceptable limits, all data points are retained for analysis.
Data Partitioning
For the purpose of this analysis, the data set is partitioned into two sections. The training set comprises
the hourly data from 2011 and the testing set contains the same for 2012.
The training set comprises 8,645 observations, whereas the testing set has 8,734 observations (all
observations are at hourly level).

Missing Values and Incorrect Observations
On inspection for missing values, it was noticed that the training data set had 115 missing observations.
The table below shows one such instance of missing observations. There is a time difference of 13 hours
between successive observations.
Record   Date         Hour
396      17-01-2011   23
397      18-01-2011   12
Missing observations can affect the performance of time-series-based predictions.
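A minimal sketch of how such gaps could be located, assuming each record's date and hour have been combined into a timestamp. The function name find_gaps is illustrative and not part of the original analysis; the two sample timestamps are records 396 and 397 from the table above.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, step=timedelta(hours=1)):
    """Return (previous, current, n_missing) for every gap larger than `step`."""
    gaps = []
    ordered = sorted(timestamps)
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > step:
            # Number of hourly records that should sit between the two.
            n_missing = int(delta / step) - 1
            gaps.append((prev, curr, n_missing))
    return gaps

# Records 396 and 397 (dates in the table are day-month-year).
records = [
    datetime(2011, 1, 17, 23),
    datetime(2011, 1, 18, 12),
]
gaps = find_gaps(records)
for prev, curr, n_missing in gaps:
    print(prev, "->", curr, "missing hourly records:", n_missing)
```

Summing n_missing over the whole training set would give the total count of missing observations.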
Certain anomalies were detected in the temperature distribution.
Temperature Distribution
The above graph shows the effects of wind chill below a temperature of about 0.3 and of humidity above
about 0.48.
Outliers: There are outliers to the lower right of the main grouping. On analysis, these 24 points were
all found to occur on a single day in August. It is safe to assume that there was an error in capturing
the data for that day.
Exploratory Data Analysis
Variable Correlations with cnt (Total bike rentals)
Variable correlations were computed to understand the relationship that cnt (total number of bikes
rented) has with the rest of the variables.
Variable     Correlation with cnt
season        0.22
mnth          0.18
hr            0.41
holiday      -0.02
weekday       0
workingday    0.01
weathersit   -0.14
temp          0.45
atemp         0.45
hum          -0.29
windspeed     0.09

The total number of bikes rented shows the highest correlation with temperature (0.45) and hour of the
day (0.41). Higher temperature seems to drive up the number of bike rentals. The total also shows a
negative correlation with humidity (-0.29): higher humidity appears to affect the total rentals negatively.
Factors such as holiday and working day, on the other hand, do not seem to have much effect on the
total bike rentals.
Since the total bike rentals is the sum of casual rentals and rentals by registered users, we examine
whether the above relationships hold true for casual and registered rentals as well.
Variable Correlations with Casual
Variable     Correlation with casual
season        0.14
mnth          0.09
hr            0.3
holiday       0.05
weekday      -0.01
workingday   -0.32
weathersit   -0.16
temp          0.48
atemp         0.47
hum          -0.31
windspeed     0.07
Similar relationships exist for temperature, humidity and hour of the day. However, working day seems
to be a factor for casual rentals, with fewer rentals occurring during working days.
Variable Correlations with Registered
Variable     Correlation with registered
season        0.22
mnth          0.19
hr            0.39
holiday      -0.05
weekday       0
workingday    0.13
weathersit   -0.12
temp          0.38
atemp         0.38
hum          -0.24
windspeed     0.08
Registered bike rentals and total bike rentals share a similar relationship with the variables.
Temperature, time of the day and humidity seem to be major factors.
Correlations between Variables
The correlations between variables are examined to identify variables which are highly correlated with
each other. Highly correlated predictors can make the regression coefficients unstable and lead to
inaccurate prediction results.
Temp (temperature) and atemp ("feels like" temperature) are highly correlated (0.99). We remove atemp,
since both variables convey almost the same information.

Season and month are also highly correlated. Both variables are retained and will be examined during
modeling to understand their impact on the model's VIF.
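The screening step for correlated predictors can be sketched as follows. This is an illustration only: the helper names, the 0.9 threshold and the toy values are assumptions made for the example and are not taken from the data set or from the original analysis.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return pairs

# Made-up illustrative values, not taken from the bike-sharing data.
toy = {
    "temp":  [0.20, 0.30, 0.50, 0.70, 0.80],
    "atemp": [0.22, 0.31, 0.48, 0.66, 0.75],
    "hum":   [0.80, 0.40, 0.60, 0.30, 0.70],
}
flagged = highly_correlated_pairs(toy)
print(flagged)
```

One variable from each flagged pair is then a candidate for removal, as was done with atemp.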
Impact of variables on Bike Rentals
Season: It is important to understand the importance of Season on Bike rentals, as climatic conditions
greatly affect the ridership.
Effect of Season on Ridership
The above chart shows that in the summer months, the difference between the number of registered and
casual riders on a given day is small. Conversely, in the colder months the difference is large: there
are more days with large differences. This effect is large enough to keep season in the model.
Continuing in this manner, it turns out that the strongest indicators of the difference in the number of
registered versus casual riders on a given day are:
1. Season
2. Working day (i.e. neither a holiday nor a weekend)
3. Weather (rain, sun, cloud, snow)
4. Year of the bike share program
This implies that registered riders are more likely than casual riders to ride in the winter, on a
working day and in worse weather.

Day of Week and Weather
The hourly demand for bikes shows two peaks: one at 8 in the morning and another at 5 in the evening.
The demand is steady during the noon hours and reduces slowly after 5 pm.
The same pattern is seen for registered rentals. This can be attributed to registered users using
bikes for their daily commute to and from work.
Casual bike users show a clear difference in this regard. The demand for casual bike rentals increases
slowly during the day and peaks at around 5 pm. There is little demand among casual users during the
morning hours.
Weather does not greatly affect the shape of the demand curve with respect to time of day, only its
magnitude and density.

The shape of the demand curve differs distinctly between working and non-working days. The two spikes
in ridership on working days suggest travel to work and back home.
Thus, there is a clear difference between the two groups, casual and registered. Therefore, these two
groups are analyzed separately.
Wind speed
Wind speed does not seem to have a significant effect on the number of bike rentals. There are a few
notable outliers for casual bike rentals in this regard.
Humidity
The number of rentals drops as humidity increases. This could be because conditions with higher
humidity are generally not conducive to biking.
New Variables
To aid in predictive modeling, a few trend variables were created.
trendYesterday: This variable contains the previous day's demand.
trendWeek: This variable is a moving average of the number of rentals over the past 7 days.
trendMonth: This variable is a moving average of the number of rentals over the past 30 days.
trendPrev: This variable contains the number of rentals for the same day of the previous week. (If the
current day is a Monday, the trend variable equals the number of rentals for the previous Monday.)

Since casual and registered users are quite distinct in their rental behavior, the trend variables were
created separately for the two groups.
e.g. trendYesterday_C is the trend variable for casual rentals and
trendYesterday_R is the corresponding variable for registered rentals.
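The trend variables can be sketched as below. This is a sketch under stated assumptions: it works on a plain list of daily totals for one user group, uses only days strictly before the current day, and leaves a value as None when there is not enough history. The function names are illustrative; how the original analysis handled the first days of the series is not stated.

```python
def mean_of_last(values, t, k):
    """Mean of the k values before position t, or None if history is too short."""
    if t < k:
        return None
    return sum(values[t - k:t]) / k

def trend_features(daily_counts):
    """Build the four trend variables for every day in the series."""
    rows = []
    for t in range(len(daily_counts)):
        rows.append({
            "trendYesterday": daily_counts[t - 1] if t >= 1 else None,
            "trendWeek": mean_of_last(daily_counts, t, 7),
            "trendMonth": mean_of_last(daily_counts, t, 30),
            # Same weekday one week earlier.
            "trendPrev": daily_counts[t - 7] if t >= 7 else None,
        })
    return rows

# Toy series: day i has i rentals (1, 2, ..., 40).
counts = list(range(1, 41))
features = trend_features(counts)
print(features[30])
```

Running the same function once on the casual series and once on the registered series would give the _C and _R variants.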
Modeling
Variables
Model Targets
Separate models were generated for Casual and Registered Rentals.
Casual rentals models Target = Casual (count of casual users)
Registered rental models Target = Registered (count of registered users)
Model Input variables
Casual Rentals Model                  Registered Rentals Model
Variable            Type              Variable            Type
season              Categorical       season              Categorical
mnth                Categorical       mnth                Categorical
hr                  Categorical       hr                  Categorical
holiday             Categorical       holiday             Categorical
weekday             Categorical       weekday             Categorical
workingday          Flag              workingday          Flag
weathersit          Categorical       weathersit          Categorical
temp                Continuous        temp                Continuous
hum                 Continuous        hum                 Continuous
windspeed           Continuous        windspeed           Continuous
trendYesterday_C    Continuous        trendYesterday_R    Continuous
trendWeek_C         Continuous        trendWeek_R         Continuous
trendMonth_C        Continuous        trendMonth_R        Continuous
trendPrevW_C        Continuous        trendPrevW_R        Continuous
Training and testing set
Train set: Jan 2011 - Dec 2011
Test set: Jan 2012 - Dec 2012
Model Building and Testing
Two separate models were generated for Casual and Registered Rentals, and their predictions were
combined to get the total rental prediction.

Different models were generated and their performance benchmarked to select the best model. The
following accuracy measures were considered for benchmarking.
1. RMSE : Root Mean Squared Error
2. MAD : Mean Absolute Deviation
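For reference, the two accuracy measures can be computed as in the sketch below. The function names and the toy numbers are illustrative and are not taken from the analysis.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mad(actual, predicted):
    """Mean Absolute Deviation: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy daily totals (made up for illustration).
actual = [100, 200, 300]
predicted = [110, 190, 330]
print(rmse(actual, predicted), mad(actual, predicted))
```

RMSE penalizes large errors more heavily than MAD, so a day with a very poor forecast raises RMSE more than it raises MAD.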
Model 0: Naive Model
This is a random walk model in which the previous day's demand is the prediction for the next day. This
model is used as the benchmark for the other models. The hourly predictions from the model were
aggregated to daily level for benchmarking and profit calculations.
Benchmarking was done on the test data set.
Accuracy of Naive Model
Measure Value
RMSE 1513.34
MAD 1112.26
Profit $1,441,205
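A minimal sketch of the naive benchmark follows. It assumes hourly rows given as (date, hour, count) tuples with ISO-formatted date strings, and it aggregates to daily totals before shifting by one day; the function names are illustrative.

```python
from collections import defaultdict

def aggregate_daily(hourly_rows):
    """Sum hourly counts into daily totals; rows are (date, hour, count)."""
    totals = defaultdict(int)
    for date, _hour, count in hourly_rows:
        totals[date] += count
    # ISO date strings sort chronologically.
    return dict(sorted(totals.items()))

def naive_forecast(daily_totals):
    """Predict each day's demand as the previous day's actual demand."""
    dates = list(daily_totals)
    return {today: daily_totals[yesterday]
            for yesterday, today in zip(dates, dates[1:])}

# Toy hourly rows (made up for illustration).
rows = [
    ("2012-01-01", 0, 5), ("2012-01-01", 1, 7),
    ("2012-01-02", 0, 3),
    ("2012-01-03", 0, 10),
]
daily = aggregate_daily(rows)
forecast = naive_forecast(daily)
print(forecast)
```

The first day has no forecast because there is no previous day to copy from.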
Model 1: Linear Regression
Linear regression was used to predict the number of bike rentals. Stepwise variable selection was used
to add and remove variables from the model.
Model for Casual Rentals
Final Model Coefficients:
                            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                 -3.23076    0.791070   -4.084  4.47e-05 ***
seasonsummer                0.991416    0.335680    2.953  0.003151 **
seasonwinter                1.035420    0.342314    3.025  0.002496 **
hr                          0.083244    0.022956    3.626  0.000289 ***
holidayTRUE                 3.329527    0.872225    3.817  0.000136 ***
weekdayTuesday              1.862357    0.552811    3.369  0.000758 ***
weekdayWednesday            2.312750    0.576970    4.008  6.16e-05 ***
weekdayThursday             2.414725    0.579703    4.165  3.14e-05 ***
weekdayFriday               3.880203    0.568145    6.830  9.09e-12 ***
weekdaySaturday             9.053754    0.574337   15.764   < 2e-16 ***
weekdaySunday               6.480375    0.542391   11.948   < 2e-16 ***
`weathersitlight weather`  -2.765033    0.502845   -5.499  3.93e-08 ***
temp                       11.566455    1.036110   11.163   < 2e-16 ***
hum                        -5.813031    0.810490   -7.172  7.99e-13 ***
trendYesterday_C            0.970608    0.009976   97.299   < 2e-16 ***
trendWeek_C                -0.217633    0.016549  -13.150   < 2e-16 ***
trendMonth_C                0.204651    0.013894   14.729   < 2e-16 ***
trendPrevW_C               -0.087799    0.009934   -8.839   < 2e-16 ***
Model Accuracy:
Residual standard error: 12.5 on 8627 degrees of freedom
Multiple R-squared: 0.8966, Adjusted R-squared: 0.8964
F-statistic: 4401 on 17 and 8627 DF, p-value: < 2.2e-16
Residuals:
Min 1Q Median 3Q Max
-114.943 -5.695 -0.286 4.535 93.291
Inference:
The model has a high adjusted R-squared value of 0.896. Thus, the model does a good job of predicting
the casual rentals.
The demand for casual bike rentals is mainly affected by 3 factors: temperature, weekend and
humidity. Higher temperatures lead to an increase in demand, whereas an increase in humidity leads to
a lowering of demand. Demand for casual rentals increases sharply during weekends.
Model for Registered Rentals
Stepwise variable selection was employed to fine-tune the model.
Final Model Coefficients:
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                  3.010317    3.288333    0.915     0.360
seasonsummer                 7.881060    1.626050    4.847  1.28e-06 ***
seasonwinter                16.270329    2.382837    6.828  9.18e-12 ***
mnth                         0.768124    0.324656    2.366     0.018 *
hr                           2.878827    0.124998   23.031   < 2e-16 ***
holidayTRUE                -19.045360    3.930302   -4.846  1.28e-06 ***
weekdaySaturday            -16.954830    1.830363   -9.263   < 2e-16 ***
weekdaySunday              -13.558294    1.901725   -7.129  1.09e-12 ***
`weathersitlight weather`  -19.623097    2.341296   -8.381   < 2e-16 ***
temp                        78.759302    5.144246   15.310   < 2e-16 ***
hum                        -48.388516    3.669374  -13.187   < 2e-16 ***
trendYesterday_R             0.789897    0.008391   94.136   < 2e-16 ***
trendWeek_R                 -0.662315    0.018233  -36.326   < 2e-16 ***
trendMonth_R                 0.410428    0.030785   13.332   < 2e-16 ***
trendPrevW_R                 0.045168    0.008667    5.212  1.91e-07 ***
Model Accuracy:
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 58.24 on 8630 degrees of freedom
Multiple R-squared: 0.7174, Adjusted R-squared: 0.7169
F-statistic: 1565 on 14 and 8630 DF, p-value: < 2.2e-16
Residuals:
Min 1Q Median 3Q Max
-274.92 -30.33 -4.91 21.03 340.44
Inference:
The model also has a good adjusted R-squared value of 0.72. However, this is not as good as the model
for casual rentals.
High demand for rentals from registered users occurs during winter months. Demand also depends on
the day of the week, decreasing during weekends. An increase in temperature drives up the
demand, whereas increased humidity drives it down.
Linear Regression Model – Accuracy
To test the overall model's accuracy, the models were run on the test data and the predictions were
summed up. The hourly predictions were further aggregated to daily values.
The model accuracy for daily predictions:
Measure Value
RMSE 852.26
MAD 758.32
Profit $1,762,746.16
The RMSE for this model is significantly better than the Naive Model.
There is also a significant increase in profit.
Improvement over Naive Model: $321,541.16
Model 2: Random Forest
Model for Casual Rentals
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6915, 6917, 6916, 6916, 6916
On tuning the model with different values of mtry (number of variables sampled):
mtry   RMSE
2      13.57467
13     10.65070
24     10.85156

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 13.
Model for Registered Rentals
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6916, 6916, 6916, 6917, 6915
On tuning the model with different values of mtry (number of variables sampled):
mtry   RMSE
2      45.94593
13     22.93559
24     22.72873
RMSE was used to select the optimal model using the smallest value. The final value used for the model
was mtry = 24.
The hourly predictions were further aggregated to daily values.
Random Forest Model – Accuracy
The model accuracy for daily predictions:
Measure Value
RMSE 932.92
MAD 781.48
Profit $1,760,181.58
The RMSE for this model is significantly better than the Naive Model.
There is also a significant increase in profit.
Improvement over Naive Model: $318,976.57
Model Comparison
The models are compared to identify the best performing model.
Model               RMSE      MAD       Profit ($)
Naïve Model         1513.34   1112.26   1,441,205
Linear Regression   852.26    758.32    1,762,746
Random Forest       932.92    781.48    1,760,182
Linear regression outperforms the other models. Both RMSE and MAD for linear regression are
lower than for the other models.

Ensembles
It is observed that the linear regression models provide the best testing results.
An ensemble of linear models was therefore generated to improve prediction accuracy.
To create sufficient variance in the predictions, each model was fitted on a random sample of the
training data (80% of the training data).
Ten different models were created by random sampling of the data.
The different models were tested on the test set to identify the best-performing models.
Performance of different Linear Regression Models considered for Ensemble
Model RMSE Profit
2 862.8418 1758522
3 840.572 1767477
4 903.3133 1744826
5 807.7037 1778867
6 837.5898 1768395
7 856.2363 1761868
8 863.7491 1759411
9 868.3082 1756820
10 836.6417 1769006
Models 3, 5, 6 and 10 have lower RMSE values than the other models.
The individual models may not be very stable, since they were generated from random samples of the data.
These models were therefore combined into an ensemble to generate the final model.
The predictions from the ensemble's member models are averaged to get the final predictions.
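The procedure above can be sketched as follows. To stay self-contained, the sketch fits a one-predictor least-squares line instead of the full multi-variable regression, samples without replacement (the report does not say which scheme was used), and averages all members rather than a selected subset. All names and the toy data are illustrative.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_ensemble(xs, ys, n_models=10, fraction=0.8, seed=0):
    """Fit each member on a random 80% subsample of the training data."""
    rng = random.Random(seed)
    k = int(len(xs) * fraction)
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(len(xs)), k)  # assumption: without replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_ensemble(models, x):
    """Average the predictions of all member models."""
    return sum(b0 + b1 * x for b0, b1 in models) / len(models)

# Toy data: y = 20 + 100 * x plus noise (made up for illustration).
noise = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [20 + 100 * x + noise.uniform(-5, 5) for x in xs]
models = fit_ensemble(xs, ys)
prediction = predict_ensemble(models, 0.5)
print(round(prediction, 2))
```

Averaging reduces the influence of any single unstable member, which is the motivation given above for combining the models.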
Ensemble Accuracy
Measure                                      Value
RMSE                                         830.27
Total Profit from Naive Predictions          $1,441,205
Total Profit from Model Predictions          $1,771,058
Improvement due to Model over Naive Model    $329,852.6
Model Comparison
Model               RMSE      Profit ($)
Naïve Model         1513.34   1,441,205
Linear Regression   852.26    1,762,746
Random Forest       932.92    1,760,182
Ensemble Model      830.27    1,771,058

The ensemble model has the lowest RMSE value and the highest profit.
Profit
Profit Calculations
• Profit per day = Revenue per day – Costs per day
• Revenue = bikes rented out to customers * revenue per bike
          = min(actual demand, predicted demand) * revenue per bike
• Costs = predicted demand * loan cost per bike
• Revenue per bike = $3
• Loan cost per bike = $2
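The profit rules above translate directly into code. The sketch below is illustrative (the function names and toy numbers are not from the original analysis) and uses the stated $3 revenue and $2 loan cost per bike as defaults.

```python
def daily_profit(actual, predicted, revenue_per_bike=3.0, cost_per_bike=2.0):
    """Profit for one day: rentals are capped by the bikes that were stocked."""
    rented = min(actual, predicted)
    return rented * revenue_per_bike - predicted * cost_per_bike

def total_profit(actuals, predictions, **prices):
    """Sum of daily profits over a period."""
    return sum(daily_profit(a, p, **prices) for a, p in zip(actuals, predictions))

# Over-forecast: 120 bikes stocked, only 100 rented.
over = daily_profit(actual=100, predicted=120)
# Under-forecast: 80 bikes stocked, all 80 rented, 20 customers turned away.
under = daily_profit(actual=100, predicted=80)
print(over, under, total_profit([100, 100], [120, 80]))
```

With these prices an over-forecast costs $2 per unused bike, while an under-forecast forgoes $1 of margin per missed rental, so over-stocking is penalized more heavily than under-stocking.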
Ensemble Model Profit
Model profit for 2012 = $1,771,058
Model profit as a percentage of total costs for 2012 = 49.2%
Naïve Model
Model profit for 2012 = $1,441,205
Model profit as a percentage of total costs for 2012 = 35.2%
Ensemble Model’s Performance with Revenue per rental
The ensemble model performs better than the naïve model only while revenue per rental is moderate
relative to cost.
The model performance was evaluated for different revenue values, keeping cost constant at $2.
The ensemble's improvement over the naïve model shrinks as revenue per rental increases, and at revenue
greater than about $10 per bike the naïve model performs better. This is because at high revenue levels
the importance of cost diminishes.
Revenue per rental ($)   Model profit ($)   Naïve profit ($)   Improvement ($)
2                        3,561,897          3,284,974          276,923
2.1                      159,302            -218,187.10        377,489
2.2                      338,386            -33,810.20         372,196
2.3                      517,470            150,566.70         366,903
2.4                      696,554            334,943.60         361,610
2.5                      875,638            519,320.50         356,317
2.6                      1,054,722          703,697.40         351,024
2.7                      1,233,806          888,074.30         345,731
2.8                      1,412,890          1,072,451.20       340,438
2.9                      1,591,974          1,256,828.10       335,145
3                        1,771,058          1,441,205          329,853
3.1                      1,950,142          1,625,581.90       324,560
3.2                      2,129,225          1,809,958.80       319,267
3.3                      2,308,309          1,994,335.70       313,974
3.4                      2,487,393          2,178,712.60       308,681
3.5                      2,666,477          2,363,089.50       303,388
Both models yield the same profit when revenue = $9.24 per rental.
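Because each model's profit is linear in the revenue per rental (revenue multiplies the bikes actually rented, while costs do not depend on it), the improvement of one model over the other is also linear in revenue. The break-even point can therefore be estimated by extending a straight line through any two rows of the table. The sketch below uses the rows for $3 and $3.5; the function name is illustrative.

```python
def break_even_revenue(r1, gain1, r2, gain2):
    """Revenue at which a linear improvement curve crosses zero."""
    slope = (gain2 - gain1) / (r2 - r1)
    return r1 - gain1 / slope

# Improvement values taken from the table rows for revenue 3 and 3.5.
break_even = break_even_revenue(3.0, 329_853, 3.5, 303_388)
print(round(break_even, 2))
```

The estimate lands close to the $9.24 quoted above; the small difference is plausibly due to rounding in the tabulated values.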
Ensemble Model’s Performance with Season and Months
To understand if model performance depended on seasons or months, the RMSE value was computed
for each season and month.
Season   RMSE
Spring   769.7461478
Summer   781.1242547
Fall     955.2417344
Winter   796.5976288
The model performance is lower for fall than for the other seasons.
Month       RMSE
January     682.0312805
February    670.8317117
March       923.6624846
April       791.4443095
May         816.0132923
June        746.1080826
July        767.1820004
August      906.9803393
September   1128.546405
October     1000.305523
November    732.9792307
December    651.5160767
Similarly, the model performance is lower for August, September, October and March.
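A breakdown of this kind can be produced by grouping the squared errors before averaging. The sketch below assumes records of the form (group, actual, predicted), where the group is a season or month label; the function name and the toy records are illustrative.

```python
from collections import defaultdict
from math import sqrt

def rmse_by_group(records):
    """Compute RMSE separately for each group label."""
    squared_errors = defaultdict(list)
    for group, actual, predicted in records:
        squared_errors[group].append((actual - predicted) ** 2)
    return {g: sqrt(sum(v) / len(v)) for g, v in squared_errors.items()}

# Toy daily records (made up for illustration).
records = [
    ("January", 100, 90),
    ("January", 200, 220),
    ("February", 150, 150),
]
by_month = rmse_by_group(records)
print(by_month)
```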
Ensemble Model’s Performance with Aging
The model performance was determined for each day to determine the performance change with aging
of the model. It was observed that the model performance declines slowly as the model ages.
Ensemble model with 18 months data
The models were rebuilt using 18 months training data.
Since Profit is calculated only for six months, we consider RMSE for checking model accuracy.
Model   RMSE       Profit
1       679.5514   999,557.1
2       657.0268   1,758,522
3       675.3358   1,000,259
4       660.0453   1,002,866
5       652.6955   1,004,317
6       644.9173   1,004,920
7       678.2894   998,691.5
8       664.2854   1,002,217
9       673.5327   1,001,058
10      696.3528   996,825.5
All the models have significantly better RMSE (lower value) when compared to the model built with 12
months data.
For ensemble, we consider models 2, 4, 5 and 6 since they have the smallest RMSE values.
[Figure: RMSE by Day. x-axis: Day (1 to 361); y-axis: RMSE (0 to 900)]

Accuracy of the Final Ensemble
RMSE: 653.1097
Total Profit from Naive Predictions: $794,128
Total Profit from Model Predictions: $1,004,036
Improvement due to Model over Naive Model: $209,908.3
Ensemble of models created with 18 months of data clearly outperforms an ensemble created with
12 months of training data.
Data balancing
The data was balanced by ensuring that the first six months of the year have half the probability of
being sampled compared to the next six months.
The model was rebuilt using the balanced data.
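One way to realize such a sampling scheme is weighted random sampling, sketched below. The sketch assumes rows of the form (month, payload) and draws with replacement via random.choices; the report does not state whether the original sampling was done with or without replacement, and the function name is illustrative.

```python
import random

def balanced_sample(rows, k, seed=0):
    """Sample k rows; months 1-6 get half the weight of months 7-12."""
    rng = random.Random(seed)
    weights = [1.0 if month <= 6 else 2.0 for month, _payload in rows]
    # Assumption: sampling with replacement.
    return rng.choices(rows, weights=weights, k=k)

# Toy rows: 50 records for each of the 12 months.
rows = [(month, i) for i in range(50) for month in range(1, 13)]
sample = balanced_sample(rows, k=6000)
late_share = sum(1 for month, _ in sample if month > 6) / len(sample)
print(round(late_share, 3))
```

With equal numbers of rows in each half of the year, about two thirds of the sampled rows come from the second half.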
Final accuracy and profit of the model after fine tuning the ensembles.
RMSE: 667.2481
Total Profit from Naive Predictions: $ 794,128
Total Profit from Model Predictions: $1,001,679
Improvement due to Model over Naive Model: $207,550.6
The model without data balancing performs slightly better than the model with balanced data.
Since each model in the ensemble is already fitted on a random sample of the training set, balancing
the data did not yield any further performance boost.
