Analysis on Bike Rental Data to Predict Future Use
MA 575
By: Miles Avila, Kevin Choi, JungTak Joo, Kimberly
Nguyen, Tianyuan Zhou
12/9/2014
Casual Model Building: J.J., T.Z.
Registered Model Building: K.C., J.J., K.N.
Introduction & Background: M.A.
Modeling and Analysis: K.N.
Prediction & Discussion: T.Z.
Proofreading & Formatting: M.A., K.C., K.N.
Abstract
The goal of this analysis is to predict the number of bike users on any given day in a year
using linear model techniques. Due to the increasing popularity of bike sharing and the amount of
available data, predictive models and analysis are increasingly important for understanding
bike users and programs. Our analysis begins with exploratory data analysis techniques including
scatterplots of the original data. The exploratory analysis provided preliminary insight about our
dataset, which helped us create our early models. We proceeded to improve our models using
variable selection, transformation, comparison, and testing for non-constant variance. Our final
predictive model is divided into two separate models: one for casual and one for registered bike
users. The final casual model is biased, mostly because the number of bike users increased in
2012, whereas the registered model, after a mean shift, is unbiased but has large variance.
Our predictive models suggest that the worst predictions for both models occurred around holidays
and during extreme weather conditions.
Introduction
Bike sharing is an innovative transportation program, ideal for short-distance point-to-point
trips, that lets users pick up a bicycle at any self-serve bike station and return it to any
other bike station located within the system's service area. These systems have become popular in
major metropolitan areas around the world. Currently, there are over 500 bike-sharing programs
worldwide, which together comprise over 500 thousand bicycles. Today, there exists great interest in
these systems due to their important role in traffic, environmental, and health issues. The way in
which these bikes may be rented is automated, which, when coupled with other sensor data such as
temperature and weather characteristics, facilitates the process of predicting use of the bikes in the
future. From the perspective of the companies that own these systems, it is of interest to create
accurate models in order to predict bike use on any given day. In contrast to other methods of
transportation, such as bus or subway, the duration of travel, departure and arrival positions are
explicitly recorded in these systems. This is a unique feature that lets the bike sharing system act as a
virtual sensor network that can be utilized as a tool for sensing mobility in the city. It may be
possible, even, to detect which events are most important in a city by monitoring these data.
Background
In this study, we create a model, fitted to the days of one year, that predicts the number of
bike-sharing users on the corresponding days of a different year. In general, such predictions are
difficult because there are many variables that are unaccounted for in our dataset. These include,
but are not limited to, business affairs among bike-sharing companies, an increase in the popularity
of the services (e.g., whether bike sharing has become a societal trend), and especially cost
fluctuations of the services.
The data here are a mix of numerical and categorical variables. These include the count of
users on any given day, split into casual and registered users; the state of the weather
(measured by temperature (temp), feeling temperature (atemp), humidity, wind speed, and weather
situation (weathersit)); and categorical variables describing what kind of day it was
(weekday, holiday, season, and month). The data set was collected during the years 2011 and 2012 in
Washington, D.C.
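The categorical variables enter the linear models later in this report as indicator terms such as season2, month2, or weathersit2. The sketch below illustrates that kind of treatment coding, with the first level as the baseline. The function name and the example values are hypothetical and are not taken from the dataset or from the authors' code.

```python
def dummy_code(values, levels, prefix):
    """Treatment-code a categorical column.

    The first entry of `levels` is the baseline and gets no column,
    so a variable with k levels yields k - 1 indicator columns.
    """
    rows = []
    for v in values:
        rows.append({f"{prefix}{lvl}": int(v == lvl) for lvl in levels[1:]})
    return rows


# Hypothetical season codes for five days (1 to 4); illustrative only.
season = [1, 2, 3, 4, 2]
encoded = dummy_code(season, levels=[1, 2, 3, 4], prefix="season")

for day, row in zip(season, encoded):
    print(day, row)
```

With this coding, a coefficient such as the one on season2 is read as the expected difference from the baseline season, holding the other regressors fixed.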
The core data set is the two-year historical log for the years 2011 and
2012 from the Capital Bikeshare system, Washington, D.C., USA, which is publicly available at
http://capitalbikeshare.com/system-data. The UCI Machine Learning Repository aggregated the data
into hourly and daily datasets and added the corresponding weather and seasonal information.
Weather information was extracted from http://www.freemeteo.com.
The essential goal of this study was to create a linear model that predicts the number of bike
users on a given day with constant variance and small residuals.
Modeling & Analysis
The first step we took in this process was to examine a scatterplot matrix in order to
understand the correlations among the variables (A1).
From here, we created an initial model with Count (cnt) as the response and all the other
variables in the dataset as the regressors (A2). To assess our model, we first tested whether
it violates the assumption of constant variance (A3). The non-constant variance test gives a
p-value of 0.05701636, so at a significance level of .05 we narrowly fail to reject the hypothesis
of constant variance. In addition, the residual plot suggests that the linearity assumption holds
(A4).
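The report does not say which non-constant variance test or software was used. As an illustration only, the sketch below implements a one-degree-of-freedom score test in the Breusch-Pagan / Cook-Weisberg style, in which the variance is allowed to depend on the fitted values. The function name and the toy residuals are assumptions.

```python
import math


def ncv_score_test(residuals, fitted):
    """Score test for non-constant variance against the fitted values.

    Scaled squared residuals are regressed on the fitted values; half the
    regression sum of squares is compared with a chi-square(1) distribution.
    """
    n = len(residuals)
    sigma2 = sum(e * e for e in residuals) / n        # ML variance estimate
    u = [e * e / sigma2 for e in residuals]           # scaled squared residuals
    f_bar = sum(fitted) / n
    u_bar = sum(u) / n
    sxx = sum((f - f_bar) ** 2 for f in fitted)
    sxy = sum((f - f_bar) * (ui - u_bar) for f, ui in zip(fitted, u))
    slope = sxy / sxx
    ss_reg = slope * slope * sxx                      # regression sum of squares
    stat = ss_reg / 2.0
    p_value = math.erfc(math.sqrt(stat / 2.0))        # chi-square(1) upper tail
    return stat, p_value


# Toy inputs (hypothetical): fitted values 1..8 with two residual patterns.
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
flat = [1, -1, 1, -1, 1, -1, 1, -1]                   # same spread everywhere
growing = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]  # spread grows with fit

stat_flat, p_flat = ncv_score_test(flat, fitted)
stat_grow, p_grow = ncv_score_test(growing, fitted)
print(stat_flat, p_flat)
print(stat_grow, p_grow)
```

A small p-value is evidence against constant variance; a p-value above the chosen level, as in the initial model here, means the hypothesis of constant variance is not rejected.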
Next, following convention, we applied a logarithmic transformation to the response variable
(A5). We tested once more for constant variance and, contrary to our expectations, this
model was far from having constant variance (A6). We also found, based on the residual plot,
that this model is not linear (A7).
Understanding that neither of these is the best model, we chose to use the AIC to
determine which variables should be included in order to obtain the best model. We conducted AIC
selection in the backward direction (A8). Running a linear model on the selected variables, we
obtain the following model (A9):
cnt = 1975.08 + 424.48*season2 + 850.09*season3 + 1151.59*season4
    + 185.36*month2 + 354.96*month3 + 897.26*month4 + 1637.06*month5
    + 1337.41*month6 + 573.99*month7 + 699.64*month8 + 1125.10*month9
    + 960.06*month10 + 552.16*month11 + 495.29*month12
    - 386.34*holiday + 3084.11*temp - 1330.70*humidity - 2015.69*windspeed
    - 280.06*weathersit2 - 1596.43*weathersit3
We also tested for constant variance, and the p-value is large enough that we fail to reject the
null hypothesis at .05 (A10). Having met the assumptions of constant variance and normality, we
decided to use the preceding model to predict the 2012 bicycle data.
We found that, on average, our predictions were lower than the actual count (cnt) of users
in 2012 on any given day (A11).
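Backward selection by AIC can be sketched as follows. This is a minimal illustration, not the authors' code: the report does not say which software was used, the AIC form n*log(RSS/n) + 2k (valid up to an additive constant for Gaussian linear models) is an assumption, and the column names and data values are made up.

```python
import math


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    # Augmented normal equations: (A'A | A'y).
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
         + [sum(A[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[i][p] / M[i][i] for i in range(p)]
    return sum((y[k] - sum(b * a for b, a in zip(beta, A[k]))) ** 2
               for k in range(n))


def model_aic(columns, y, names):
    """AIC (up to a constant) of the model using the named columns."""
    n = len(y)
    X = [[columns[name][i] for name in names] for i in range(n)]
    return n * math.log(ols_rss(X, y) / n) + 2 * (len(names) + 1)


def backward_select(columns, y):
    """Drop one regressor at a time while doing so lowers the AIC."""
    current = list(columns)
    best = model_aic(columns, y, current)
    while current:
        trials = [(model_aic(columns, y, [m for m in current if m != name]), name)
                  for name in current]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break
        best = trial_aic
        current.remove(drop)
    return current, best


# Hypothetical data: counts driven by temperature plus small deviations.
columns = {
    "temp": [0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.6, 0.4, 0.3, 0.2],
    "noise": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
}
deviations = [12, -8, 15, -20, 5, -11, 9, -14, 7, 5]
y = [1000 + 3000 * t + d for t, d in zip(columns["temp"], deviations)]

full_aic = model_aic(columns, y, list(columns))
selected, final_aic = backward_select(columns, y)
print(selected, final_aic)
```

Each step refits the model without one candidate regressor and keeps the removal only if the AIC decreases, so the final AIC can never exceed that of the full model.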
In an attempt to explain this result, we hypothesized that this may be due to the different
behaviors that casual and registered users display towards the bike sharing service, given the
different factors. For example, on an extremely cold day, a casual user may decide to take their car
rather than use the bike share service, whereas a registered user may decide to use the bike sharing
service despite the bad weather, because they have already paid for their account. We also thought
that advertising would affect casual and registered users differently. This led us to the decision
of creating separate models for casual and registered users in an attempt to obtain smaller residuals
when predicting 2012 data.
We started by creating a model for just casual users. Running a backward selection on all
our variables, we obtained the following model (A12):
casual = 1975.0791 + 185.3567*month2 + 354.9600*month3 + 897.2602*month4
    + 1637.0600*month5 + 1337.4082*month6 + 573.9875*month7 + 699.6399*month8
    + 1125.0984*month9 + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
    - 280.0560*weathersit2 - 1596.4329*weathersit3
    + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
    - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp
However, the backward selection model violates the assumptions of constant variance (A13)
and linearity (A14). In order to fix these violations, we applied the Box-Cox method and chose
to transform the response variable to the power of .4 (A15). The chosen power transformation
makes sense because the inverse response plot showed a relation close to a square root
between the number of casual users and the chosen regressors.
The model for casual users after the power transformation is (A16):
casual^0.4 = 1975.0791 + 185.3567*month2 + 354.9600*month3 + 897.2602*month4
    + 1637.0600*month5 + 1337.4082*month6 + 573.9875*month7 + 699.6399*month8
    + 1125.0984*month9 + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
    - 280.0560*weathersit2 - 1596.4329*weathersit3
    + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
    - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp
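A model fitted to the response raised to the power .4 produces predictions on that transformed scale, so they must be raised to the power 1/.4 to be read as user counts. The sketch below shows the round trip; the helper names and the example counts are hypothetical, and clipping negative values at zero before back-transforming is an assumption of this sketch.

```python
POWER = 0.4  # power chosen in the report via the Box-Cox method


def to_model_scale(count):
    """Transform a user count to the scale on which the model is fitted."""
    return count ** POWER


def to_count_scale(value):
    """Back-transform a model-scale value to a user count.

    Negative model-scale values are clipped at zero (an assumption here),
    since a fractional power of a negative number is not a real count.
    """
    return max(value, 0.0) ** (1.0 / POWER)


# Hypothetical daily casual-user counts; illustrative only.
casual_counts = [120, 650, 1500, 3000]
transformed = [to_model_scale(c) for c in casual_counts]
recovered = [to_count_scale(t) for t in transformed]

for c, t, r in zip(casual_counts, transformed, recovered):
    print(c, round(t, 3), round(r, 3))
```

The transformation compresses large counts more than small ones, which is how it can reduce variance that grows with the level of the response.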
Furthermore, we checked for linearity (A17) and non-constant variance (A18) for the above model.
In comparison to the original backward selected model for casual users (A14), our model
with the Box-Cox transformation is closer to linear. In addition, the p-value from the non-constant
variance test rose from 3.42573E-05 in the original model (A13) to 0.006777438 in the transformed
model (A18). Constant variance is still rejected at the .05 level, but the violation is less
severe, so we consider the transformed model better for casual users.
We tried to further improve the variance of the casual model by removing outliers. Using
the outlier test, we removed two potential outliers. We re-ran the transformed backward selected
model, but this did not improve the constancy of our variance. Therefore, we reverted to the
transformed casual model above (A16) to predict the 2012 bicycle dataset.
The mean of the residuals of the actual number of casual users is approximately
300. However, the mean of the residuals of our 2012 data using the transformed model is
approximately 1.92 (A19). Although the prediction results are not ideal, we decided to leave the
model as it is for now, move on to the registered users to see whether that group behaves better,
and then possibly figure out why our predictions have large residuals.
In addition to the casual model, we also created a model for registered users. Initially, we put all the
variables into a backward selection algorithm in order to decide which variables are most significant
(A20). Running a linear model on the significant variables yields the following model (A21):
registered = 1071.5075 + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
    + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5
    + 1121.6707*month6 + 404.6830*month7 + 611.2622*month8 + 891.2946*month9
    + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
    - 853.9828*holiday
    + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3 + 771.9176*weekday4
    + 728.6853*weekday5 + 119.0951*weekday6
    - 207.3430*weathersit2 - 1335.0019*weathersit3
    + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
The test of non-constant variance yielded a p-value of 0.7760796, leading us to conclude that our
model has constant variance (A22). In addition, the model fulfills the linearity assumption (A23).
The mean of the residuals from this model is 1765 (A24). Like the casual model, the registered
model underestimates the 2012 counts. Before considering any transformations to fix the
underestimation in our models, we took a second look at our data to see whether there was another
cause. We noticed that the numbers of both registered and casual users in 2012 seem to be much
larger than those in 2011, so we calculated the average numbers of registered and casual users in
both years. We found that, on average, there is a mean increase of 342 casual users and a mean
increase of 1859 registered users from 2011 to 2012 (A25). At the same time, temperature, humidity,
and weather situation overall did not change significantly (month, day of the week, and holidays
do not change either, obviously). Therefore, we have strong evidence that these increases are not
due to any of the variables available to us in the dataset, but to other factors that we have no
information about, such as the increasing popularity of the system or advertising. In order to
capture these increases, we applied a mean shift to the model for the registered users. In other
words, our model for the registered users has now become (A24):
registered = 1071.5075 + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
    + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5
    + 1121.6707*month6 + 404.6830*month7 + 611.2622*month8 + 891.2946*month9
    + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
    - 853.9828*holiday
    + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3 + 771.9176*weekday4
    + 728.6853*weekday5 + 119.0951*weekday6
    - 207.3430*weathersit2 - 1335.0019*weathersit3
    + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
    + 1764.549*year
In the model above, we added a "year" variable, and we obtained the coefficient of this variable from
the mean residual of our predicted values for the 2012 data. However, we decided against applying a
similar mean shift to the casual data because our casual users model has a transformed response
variable. A constant shift on the transformed scale does not correspond to a constant shift in the
number of users, which hinders the predictive value and interpretability of such a shift.
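The mean shift amounts to adding the average 2012 residual to every prediction. The sketch below illustrates this with made-up numbers (not the report's data) and also shows why the same idea is awkward on a power-transformed scale: a fixed shift of the transformed response corresponds to different shifts in user counts depending on the level. All function names are hypothetical.

```python
def mean_residual(actual, predicted):
    """Average of actual minus predicted; positive means underestimation."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def apply_mean_shift(predicted, shift):
    """Add a constant offset to every prediction."""
    return [p + shift for p in predicted]


def count_scale_effect(count, delta, power=0.4):
    """Change in user count caused by adding `delta` on the power scale."""
    return (count ** power + delta) ** (1.0 / power) - count


# Hypothetical registered-user counts and model predictions for four days.
actual_2012 = [5200, 6100, 4800, 7000]
predicted_2012 = [3500, 4300, 3100, 5200]

shift = mean_residual(actual_2012, predicted_2012)
shifted = apply_mean_shift(predicted_2012, shift)
print(shift, mean_residual(actual_2012, shifted))

# The same +1 on the transformed scale means different things in counts.
gap_low = count_scale_effect(100, 1.0)
gap_high = count_scale_effect(3000, 1.0)
print(gap_low, gap_high)
```

After the shift the mean residual is zero by construction, which is why the shifted model is unbiased on the data used to compute the shift but says nothing about how well the shift would carry over to a later year.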
Prediction
From our model constructed using 2011 data, we were able to explain a fair amount of the
variability in both registered and casual users of the Capital Bikeshare system in 2012 (R^2 of
around .66 in both cases). The casual user model is biased because it underestimates the number
of users in 2012. This underestimation could be due to many different factors, including bicycle
trends and advertising, but these factors are not included in our dataset. However, the variance of
the casual user model is rather small, with an MSE of 8.78. Our registered user model is unbiased
after the mean shift, with a mean residual of essentially zero. However, due to the large number of
registered users (and thus large fluctuations in the data), our estimates of registered users in
2012 have a large variance, with an MSE of 754653. Overall, the worst predictions for both models
occurred around holidays, when either very many or very few people were using bikes, and in
extreme weather conditions (such as when Hurricane Sandy hit in October 2012), when very few users,
if any, were using the bike system. Nonetheless, our model predicted well (A26).
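Bias and MSE as used above are related: the mean squared error of a set of predictions equals the squared mean residual plus the variance of the residuals. The sketch below computes the three quantities for a small set of made-up values; the function name and the numbers are hypothetical and are not the report's data.

```python
def prediction_summary(actual, predicted):
    """Return the mean residual (bias), MSE, and residual variance."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    bias = sum(residuals) / n
    mse = sum(r * r for r in residuals) / n
    variance = sum((r - bias) ** 2 for r in residuals) / n
    return {"bias": bias, "mse": mse, "residual_variance": variance}


# Hypothetical actual and predicted values; residuals are 2, 1, 0, 3.
actual = [10, 12, 9, 15]
predicted = [8, 11, 9, 12]

summary = prediction_summary(actual, predicted)
print(summary)
```

For an unbiased set of predictions the MSE and the residual variance coincide, whereas for a biased one the MSE also includes the squared bias.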
Discussion
One should note that the mean shift we applied to our registered model is a special case for
this project. Here, we had the luxury of observing the 2012 data and knowing about the
average increase, and were therefore able to make the proper adjustment to our model. However, in
most real-life situations, we would be using the data we have to create a model that predicts
future outcomes, and we would not know the future values of the response variable ahead of time.
Therefore, we need to be especially careful when we build these models. We need to gather as much
information as possible to maximize our chance of capturing all the predicting variables.
Furthermore, for datasets that are likely to see an increase in values (both predictor and
response), we should monitor the data closely and update the model promptly whenever we receive new
information. Finally, for data that show a strong and clear trend or pattern related to time,
other statistical techniques such as time series modeling would be more appropriate and would
likely result in better predictions.