IEOR 265 Final Project
Application of Machine Learning Techniques to Forecast
Bike Rental Demand in the Capital Bikeshare Program in
Washington, D.C.
by
Minchao Lin
May 8, 2015
Abstract
Forecasting demand is a crucial issue in efficient resource management, and machine learning
techniques can help build and refine models that learn from observed data and make predictions.
Specifically, supervised learning models the relation between a set of predictor variables and one
or more response variables on the basis of a finite set of observations.
The objective of this project is to combine historical usage patterns with weather data in
order to forecast the total count of bikes rented during each hour in the bike sharing system in
Washington, D.C. In this paper, multiple machine learning techniques, including ordinary least
squares regression, lasso regression, elastic net, ensemble learning methods, neural networks, and
local linear regression, are discussed, and their effectiveness in predicting the response variable is
evaluated and compared.
1 Introduction
1.1 Background
A bicycle sharing system provides bicycles for shared use to individuals on a short-term
basis. These systems are becoming increasingly popular in major cities as a convenient means of
transportation. As of June 2014, public bicycle sharing systems were available in 712 cities across
five continents, operating approximately 806,200 bicycles at 37,500 stations. With these systems,
bicycle rental is completely automated via a network of kiosk locations throughout a city, and
people are able to rent a bike from one location and return it to a different one. To determine the
right number of bicycles to meet demand in the city, historical data is a good resource for
performing the demand analysis.
1.2 Data Description
Hourly rental data spanning the two years 2011 to 2012 are provided for this project, with
variables including date & time, season, holiday, working day, weather, temperature, humidity,
wind speed, number of registered and non-registered user rentals initiated, and number of total
rentals. To test the effectiveness of a model, the historical data are split into three sets: the
training set comprises the first 15 days of each month, the test set comprises days 16 to 19 of
each month, and the validation set covers the 20th to the end of the month. Details on the
predictor variables and response variables are listed in the Appendix. Each method in the
following sections is first fitted on the training set and tested on the test set, with the mean
squared error calculated each time. The methods with the lowest mean squared errors are then
refitted on the combined training and test sets and evaluated on the validation set to obtain the
root mean squared logarithmic error.
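As a minimal sketch of this split in Matlab (assuming the hourly records are stored in a table T
with a datetime column named dates; these names are illustrative, and the actual code is in the
Appendix):

    % Split hourly observations by day of month.
    d = day(T.dates);                    % day of month for each hourly record
    trainSet = T(d <= 15, :);            % days 1 to 15: training set
    testSet  = T(d >= 16 & d <= 19, :);  % days 16 to 19: test set
    validSet = T(d >= 20, :);            % day 20 to month end: validation set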
1.2.1 Convert Categorical Variables to Dummy Variables
One solution to the dummy variable trap is to drop one of the dummy variables: if there are m
categories, use only m-1 of them in the model. The level left out can be thought of as the
reference value, and the fitted values of the remaining categories represent the change from this
reference. For the bike sharing demand data, year, month, hour, weekday, season, holiday,
working day, and weather are categorical variables and are converted to dummy variables.
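As an illustrative sketch, one way to build such m-1 dummy columns in Matlab is with dummyvar
from the Statistics Toolbox (the variable season here is a stand-in for any of the categorical
predictors, coded as positive integers):

    % Convert one categorical predictor to m-1 dummy columns.
    D = dummyvar(season);   % n-by-m indicator matrix, one column per category
    D = D(:, 2:end);        % drop the first column; category 1 becomes the reference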
1.2.2 Relationship between numerical predictor variables
Figure 1. Scatter plot matrix plotting each numerical variable against the others.
The scatter plot matrix above shows the relationships between the numerical predictor variables
and the response variable; ordered from top to bottom (and from left to right), the variables are
temperature, “feels like” temperature, relative humidity, wind speed, and total number of rentals.
The plot shows fairly independent relationships between the variables, except between
temperature and “feels like” temperature, which is reasonable as these two variables are generally
very close to each other. Because multicollinearity can increase the variance of the coefficient
estimates and make the estimates very sensitive to minor changes, we will apply regularization to
the methods to counteract this tendency.
1.3 Performance Metrics
For regression problems, a measure of the distance between the estimated outputs and the actual
outputs is used to quantify a model's performance. The Mean Squared Error penalizes bigger
differences more because of the squaring. If we instead want to reduce the penalty on bigger
differences, we can log-transform the quantities first; the effect of the logarithm is to balance the
emphasis on small and large predictive errors. For this project, the effectiveness of the models is
evaluated using the Mean Squared Error (MSE) and the Root Mean Squared Logarithmic Error
(RMSLE):
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}$$
where:
·n is the number of hours in the test set
·p_i is the predicted count
·a_i is the actual count
·log(x) is the natural logarithm
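In Matlab, this metric can be computed directly from the predicted and actual counts (a one-line
sketch; p and a are column vectors of predictions and actual values):

    % RMSLE between predicted counts p and actual counts a.
    rmsle = @(p, a) sqrt(mean((log(p + 1) - log(a + 1)).^2));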
2 Ordinary Least Squares Regression
2.1 Method Description
Ordinary least squares (OLS) is a method for estimating the unknown parameters of a linear
regression model, with the goal of minimizing the differences between the observed responses and
the predicted responses.
Let X be an n × p training data input matrix, where n is the total number of observations and p is
the number of features per observation; let Y be the n × 1 vector of training data response values;
and let β be the p × 1 vector of unknown parameters. Then the OLS estimate of β for the linear
model is defined as
$$\hat{\beta} = (X'X)^{-1}X'Y$$
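A minimal Matlab sketch of this fit (Xtrain, Ytrain, Xtest, and Ytest are illustrative names for the
matrices defined above; the backslash operator solves the same normal equations more stably than
forming the explicit inverse):

    % OLS fit; Xtrain includes a column of ones for the intercept.
    beta  = Xtrain \ Ytrain;           % least squares solution of Xtrain*beta = Ytrain
    Ypred = Xtest * beta;              % predicted responses on the test set
    mse   = mean((Ytest - Ypred).^2);  % mean squared error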
2.2 Performance Metric
Mean Squared Error = 10015
2.3 Result Analysis
The mean squared error is rather high: the relationship between bike rental demand and its
exogenous factors appears to be complex and nonlinear, making it difficult to model with
traditional linear regression.
3 Lasso Regularization and Elastic Net
3.1 Method Description
3.1.1 Lasso Regularization
Lasso regression is a regularized version of linear regression that minimizes the sum of squared
errors subject to a constraint on the L1-norm of the coefficients. In this paper, a 5-fold
cross-validated sequence of lasso models is fitted in order to produce shrinkage estimates with
potentially lower predictive error than ordinary least squares.
3.1.2 Elastic Net
Elastic net is a combination of ridge regression and lasso regularization. Like lasso, elastic net
can generate zero-valued coefficients. Empirical studies suggest that elastic net can outperform
lasso on data with highly correlated predictors.
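A sketch of both fits using the Statistics Toolbox lasso function (the Alpha value for the elastic
net and the variable names are illustrative):

    % 5-fold cross-validated lasso (Alpha defaults to 1) and elastic net (0 < Alpha < 1).
    [Blasso, infoL] = lasso(Xtrain, Ytrain, 'CV', 5);
    [Benet,  infoE] = lasso(Xtrain, Ytrain, 'CV', 5, 'Alpha', 0.5);
    idx   = infoL.IndexMinMSE;                              % lambda with minimum CV MSE
    Ypred = Xtest * Blasso(:, idx) + infoL.Intercept(idx);  % predict with the selected model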
3.2 Performance Metric
Figure 2. Lambda vs. MSE for Lasso fit Figure 3. Lambda vs. MSE for Elastic Net fit
Mean Squared Error of Lasso = 10101
Mean Squared Error of Elastic Net = 10118
3.3 Result Analysis
The large mean squared errors of both lasso and elastic net indicate that even regularized linear
regression is not a good approach to forecast the bike sharing demand. In the following sections,
we will explore multiple nonlinear regression techniques.
4 Ensemble Learning and Ensemble Regularization
4.1 Method Description
Ensemble methods use multiple learning algorithms to obtain better predictive performance. An
ensemble is a technique for combining many weak learners in order to produce a strong learner.
4.1.1 Least Squares Boosting
Least Squares Boosting is a type of ensemble learning which fits regression ensembles in order
to minimize mean squared error. At every step, the ensemble fits a new learner to the difference
between the observed response and the aggregated prediction of all learners grown previously.
4.1.2 Bagging
Bagging is another type of ensemble learning which works by training learners on resampled
versions of the data. The resampling is done by bootstrapping observations in the training set.
Although the flexibility of ensembles makes them prone to over-fitting the training data, bagging
tends to reduce this problem.
4.1.3 Ensemble Regularization
Ensemble regularization helps choose fewer weak learners in a way that does not diminish
predictive performance. Specifically, it finds an optimal set of learner weights by tuning the lasso
parameter to minimize the ensemble resubstitution error.
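The three variants can be fitted with the Statistics Toolbox ensemble functions; the sketch below
assumes 200 trees and an illustrative lambda grid:

    % Least-squares boosting and bagging with 200 regression trees.
    lsb = fitensemble(Xtrain, Ytrain, 'LSBoost', 200, 'Tree', 'Type', 'regression');
    bag = fitensemble(Xtrain, Ytrain, 'Bag', 200, 'Tree', 'Type', 'regression');
    % Lasso-regularize the bagged ensemble and keep learners with nonzero weights.
    bag = regularize(bag, 'Lambda', logspace(-1, 4, 10));
    cmp = shrink(bag, 'WeightColumn', 5);   % compact ensemble at the 5th lambda value
    Ypred = predict(cmp, Xtest);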
4.2 Performance Metric
Mean Squared Error of Least Squares Boosting = 10030
Mean Squared Error of Bagging = 3656.5
Mean Squared Error of Regularized Bagging = 3473.4
Root Mean Squared Logarithmic Error of Regularized Bagging for Validation data set = 0.63302
4.3 Result Analysis
4.3.1 Least Squares Boosting
1) Figure 4 estimates the generalization error by cross-validation. The curve shows that a smaller
ensemble, perhaps one containing 100 to 120 trees, is sufficient for satisfactory performance.
Figure 4. Number of trees vs. Cross-validated MSE
2) Predictor importance estimates for the model, with higher values representing greater
importance: we see that hour, month, atemp, temp, humidity, season, and year have the greatest
importance.
3) Let the errors be the differences between the predicted and the actual counts. The normal
probability plot of the errors shows that the residuals are close to normally distributed in the
center of the data but skew away from normality above and below the mean.
Figure 5. Normal Probability Plot
4) We separate the errors into groups by the categorical variables to see whether the error
distribution in any period differs significantly from the others.
Figure 6. Breakdown of Errors by hour Figure 7. Breakdown of Errors by month
Figure 8. Breakdown of Errors by weekday Figure 9. Breakdown of Errors by season
We observe that the errors during hours 7, 8, 17, and 18 change markedly from those in the hours
immediately before and after. In the breakdown of errors by weekday, the Saturday and Sunday
patterns appear different from those of the workdays, where the variance of the errors tends to be
smaller.
4.3.2 Bagging
1) Importance of Variables:
Compared to Least Squares Boosting, “hour” now becomes the only variable that stands out in
importance.
4.3.3 Regularized Bagging
1) Comparing regularized and unregularized ensembles:
Figure 10. Lasso parameter vs. Resubstitution MSE Figure 11. Lasso parameter vs. number of learners with nonzero weights
(‘x’ denotes value at lambda = 0 & logarithmic scale, same for all five figures)
Figure 12. Lambda vs. MSE for resubstituion and cross-validation Figure 13. Lambda vs. Number of learners for resubstituion and cv
From Figure 11, we can see that the number of learners has been reduced by over one third for
the regularized ensemble. Because the resubstitution MSE values are likely to be overly
optimistic, we cross-validate the ensemble over different values of lambda.
Figure 14. Number of trees
The cross-validated error in Figure 12 shows that the cross-validation MSE is almost flat for
lambda up to a bit over 10^3. With the regularization, there are only 42 trees in the new ensemble,
notably reduced from the 200 in the unregularized ensemble. The reduced ensemble is about
19.8% the size of the original while giving lower loss.
2) Figure 15 suggests that our model has problems predicting higher counts, where all of the
residuals are biased in the same direction; this indicates an effect at high counts that the model
does not capture well.
Figure 15. Predicted values vs. Residuals
3) A simple chart shows the predicted versus actual counts for six months of data in 2011:
Figure 16. True Count vs. Regularized Bag Ensemble for days 16 to 19 of January to June in 2011
The blue line represents the actual counts and the red line the predicted counts, for days 16 to 19
of January through June 2011. As the graph illustrates, the model is not very effective at
capturing the peak values of the real data.
5 Neural Network (NN)
According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60), “a
neural network is a system composed of many simple processing elements operating in parallel
whose function is determined by network structure, connection strengths, and the processing
performed at computing elements or nodes.” Generally, a neural network consists of many
processing units connected by communication channels that carry numeric data. The processing
units operate only on their local data and on the inputs they receive via the connections.
5.1 Method Description
In order to fit a neural network to the bike sharing demand data, parameters to configure include
the type of neural network, the number of layers for the neural network, the number of neurons
in each layer, the transfer functions between layers, the performance metric and the training
function. After multiple attempts, the best network structure is a cascade-forward network with
three hidden layers of 10, 15, and 10 neurons. The transfer functions are tangent sigmoid for the
three hidden layers and linear for the output layer. The performance metric is mean squared error
and the training function is Bayesian regularization backpropagation, which updates the weight
and bias values according to Levenberg-Marquardt optimization to minimize a combination of
squared errors and weights.
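A sketch of this configuration with the Neural Network Toolbox (cascadeforwardnet already
defaults to tangent sigmoid hidden layers; the transfer functions are set explicitly here to match
the description, and the variable names are illustrative):

    % Cascade-forward network with hidden layers of 10, 15, and 10 neurons,
    % trained with Bayesian regularization backpropagation.
    net = cascadeforwardnet([10 15 10], 'trainbr');
    net.layers{1}.transferFcn = 'tansig';   % tangent sigmoid on the hidden layers
    net.layers{2}.transferFcn = 'tansig';
    net.layers{3}.transferFcn = 'tansig';   % the output layer keeps the default 'purelin'
    net = train(net, Xtrain', Ytrain');     % the toolbox expects one column per sample
    Ypred = net(Xtest')';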
5.2 Best Performance Metric
Mean Squared Error = 2582.7
Root Mean Squared Logarithmic Error for Validation data set = 0.79834
5.3 Result Analysis
Figure 17. Cascade-forward network with three layers each with neurons 10, 15 and 10
Figure 18. Epochs vs. MSE Figure 19. Target value vs. Output
Figure 17 displays a view of the cascade-forward network that generates the best result. In
Figure 18, the mean squared error for the training data kept dropping while the MSE for the test
data stopped improving after around 34 epochs, which indicates the model started to over-fit the
data beyond that point, so training was stopped there. The R-squared value in Figure 19 indicates
that the model is a good fit.
6 Local Linear Regression (LLR)
For a regression model that is highly nonlinear and of unknown structure, local linear regression
may be applied. Local linear regression performs weighted local averaging, with the weights
determined by a kernel function. Within a radius of bandwidth h around an observation x0 in the
training data, a new input x uses the parameters fitted at x0 to generate its response, where for
each x0 the parameters β are determined locally by weighted ordinary least squares optimization.
6.1 Method Description
The local parameters at x_0 are obtained by weighted ordinary least squares,

$$(\hat{\beta}_0, \hat{\beta}) = \operatorname*{arg\,min}_{\beta_0,\,\beta}\;(Y - \beta_0 \mathbf{1}_n - X_0\beta)'\,W_h\,(Y - \beta_0 \mathbf{1}_n - X_0\beta)$$

where

$$W_h = \operatorname{diag}\!\left(K\!\left(\frac{\|x_1 - x_0\|}{h}\right), \ldots, K\!\left(\frac{\|x_n - x_0\|}{h}\right)\right), \qquad X_0 = X - \mathbf{1}_n x_0'$$

and X and Y are defined as in the previous description of ordinary least squares regression. For
the bike sharing demand data, bandwidths h = 0.5, 5, 10, and 30 are each applied, with a Gaussian
kernel function K.
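A self-contained Matlab sketch of the local fit at a single query point x0 (a row vector),
following the definitions above; the function name llrPredict is illustrative, and predicting a full
test set would apply it to each test input in turn:

    % Local linear regression prediction at query point x0 with bandwidth h.
    function yhat = llrPredict(X, Y, x0, h)
        n = size(X, 1);
        d = sqrt(sum((X - repmat(x0, n, 1)).^2, 2));  % distances ||x_i - x0||
        w = exp(-0.5 * (d / h).^2);                   % Gaussian kernel weights
        Z = [ones(n, 1), X - repmat(x0, n, 1)];       % intercept plus centered inputs X0
        b = (Z' * diag(w) * Z) \ (Z' * diag(w) * Y);  % weighted least squares solution
        yhat = b(1);                                  % the local fit evaluated at x0
    end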
6.2 Best Performance Metric
Mean Squared Error > 10000
6.3 Result Analysis
The high mean squared errors for all bandwidths h indicate that local linear regression might not
be an ideal method for predicting the bike sharing demand, although further tuning of the
bandwidth h may leave some room to improve the performance metric.
7 Conclusion
To conclude, among all the methods, ensemble learning and the neural network generate the best
mean squared errors. For these two methods, cross-validation is performed to ensure model
quality, and multiple attempts are necessary to find the best fit. For ensemble learning,
regularizing the bagged ensemble significantly improved the performance metric. For the neural
network, architectures with one to four layers and a range of neuron counts were tested. As
further steps, trying more tuning parameters for the current methods might improve the
predictions, and deeper analysis of the data and its features could identify any missing
relationships. For each method, Matlab code is provided in the Appendix for further explanation.
References
"Neural Network Toolbox." Http://www.mathworks.com/help/nnet/. Web.
S, Warren, and Cary Sarle, Cary. Ftp://ftp.sas.com/pub/neural/FAQ.html. Web.
Appendix
(i) Predictor Variables
Categorical Variables:
·Year: 2011 to 2012
·Month: January to December
·Hour: 01 to 24 hours
·Weekday: Sunday to Saturday
·Season: spring to winter
·Holiday: whether the day is considered a holiday
·Working day: whether the day is neither a weekend nor a holiday
·Weather:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
Numerical Variables:
·temperature in Celsius
· “feels like” temperature in Celsius
·relative humidity
·wind speed
(ii) Response Variables
·number of non-registered user rentals initiated
·number of registered user rentals initiated
·number of total rentals
(iii) Matlab code
The Matlab code for each method performed above is attached in a separate file named
“matlabcode_IEOR265paper_MinchaoLin.html”