MA 575
Analysis on Bike Rental Data to
Predict Future Use
By: Miles Avila, Kevin Choi, JungTak Joo, Kimberly
Nguyen, Tianyuan Zhou
12/9/2014
Casual Model Building: J.J., T.Z.
Registered Model Building: K.C., J.J., K.N.
Introduction & Background: M.A.
Modeling and Analysis: K.N.
Prediction & Discussion: T.Z.
Proofread & formatting: M.A., K.C., K.N.
Abstract
The goal of this analysis is to predict the number of bike users on any given day in a year
using linear model techniques. Given the increasing popularity of bike sharing and the amount of
available data, predictive models are increasingly important for understanding bike users and
programs. Our analysis begins with exploratory data analysis, including scatterplots of the
original data. The exploratory analysis provided preliminary insight into our dataset, which helped
us create our early models. We then improved our models using variable selection, transformation,
model comparison, and tests for non-constant variance. Our final predictive model is divided into
two separate models: one for casual and one for registered bike users. The final casual model is
biased, due mostly to the increase in bike users in 2012; the registered model, after a mean shift,
is unbiased but has large variance. For both models, the worst predictions occurred around holidays
and during extreme weather conditions.
Introduction
Bike sharing is an innovative transportation program, ideal for short-distance point-to-point
trips, that lets users pick up a bicycle at any self-serve bike station and return it to any
other bike station within the system's service area. These systems have become popular in
major metropolitan areas around the world. Currently, there are over 500 bike-sharing programs
worldwide, which together comprise over 500 thousand bicycles. Today, there is great interest in
these systems due to their important role in traffic, environmental, and health issues. The way in
which these bikes may be rented is automated, which, when coupled with other sensor data such as
temperature and weather characteristics, facilitates the process of predicting use of the bikes in the
future. From the perspective of the companies that own these systems, it is of interest to create
accurate models in order to predict bike use on any given day. In contrast to other methods of
transportation, such as bus or subway, the duration of travel, departure and arrival positions are
explicitly recorded in these systems. This is a unique feature that lets the bike sharing system act as a
virtual sensor network that can be utilized as a tool for sensing mobility in the city. It may be
possible, even, to detect which events are most important in a city by monitoring these data.
Background
In this study, we build a model from one year of data and use it to predict the number of
bike-sharing users on any given day of a different year. In general, such predictions are difficult
because many relevant variables are not accounted for in our dataset. These include, but are not
limited to, business affairs among bike-sharing companies, an increase in the popularity of the
services (i.e., whether bike sharing has become a societal trend), and especially fluctuations in
the cost of the services.
The data are a mix of numerical and categorical variables. They include the count of
users on any given day, split into casual and registered users; the state of the weather
(temperature, apparent or "feels-like" temperature, humidity, wind speed, and weather situation,
weathersit); and categorical variables describing what kind of day it was
(weekday, holiday, season, and month). The data set was collected in the years 2011 and 2012 in
Washington, D.C.
The core data set is the two-year historical log for the years 2011 and
2012 from the Capital Bikeshare system, Washington, D.C., USA, which is publicly available at
http://capitalbikeshare.com/system-data. The UCI Machine Learning Repository aggregated the data
into hourly and daily datasets and added the corresponding weather and seasonal information.
Weather information was extracted from http://www.freemeteo.com.
The essential goal of this study was to create a linear model that predicts the number of bike
users on a given day with constant variance and minimal residuals.
Modeling & Analysis
The first step we took in this process was to examine a scatterplot matrix in order to
understand the correlations among the variables (A1).
From here, we created an initial model with Count (cnt) as the response and all the other
variables in the dataset as regressors (A2). To assess our model, we first tested whether or not
it violates the assumption of constant variance (A3). At a significance level of .05, we narrowly
fail to reject the null hypothesis of constant variance: the non-constant variance test gives a
p-value of 0.05701636. The residual plot also supports the linearity assumption for this model
(A4).
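The non-constant variance test can be illustrated in base R. The sketch below assumes the test is the score test against the fitted values (the default behavior of ncvTest() in the car package, as far as we understand it). The helper name score_ncv_test and the simulated data are hypothetical and are not part of the bike-sharing analysis.

```r
# Score test for non-constant variance against the fitted values.
# score_ncv_test is a hypothetical helper, written for illustration only.
score_ncv_test <- function(model) {
  e <- resid(model)
  n <- length(e)
  # Squared residuals scaled by the ML estimate of the error variance
  u <- e^2 / (sum(e^2) / n)
  # Auxiliary regression of the scaled squared residuals on the fitted values
  aux <- lm(u ~ fitted(model))
  # Regression sum of squares of the auxiliary fit
  ss_reg <- sum((fitted(aux) - mean(u))^2)
  stat <- ss_reg / 2
  p <- pchisq(stat, df = 1, lower.tail = FALSE)
  c(statistic = stat, p.value = p)
}

# Simulated data whose error standard deviation grows with x
set.seed(1)
x <- runif(500, 1, 10)
y <- 2 + 3 * x + rnorm(500, sd = x)
fit <- lm(y ~ x)
result <- score_ncv_test(fit)
print(result)
```

A small p-value is evidence against constant variance; a p-value above the chosen significance level means the test fails to reject it.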
Next, we chose to transform the response variable with a logarithmic transformation, by
convention (A5). We tested once more for constant variance, and contrary to our expectations, this
model was far from having constant variance (A6). We also found that this model is not linear in
nature, based on the residual plot (A7).
Understanding that neither of these is the best model, we used the Akaike information
criterion (AIC) to determine which variables should be included in order to obtain the best model.
We ran AIC-based selection in the backward direction (A8). Fitting a linear model with the selected
variables gives the following model (A9):
cnt = 1975.08
    + 424.48*season2 + 850.09*season3 + 1151.59*season4
    + 185.36*month2 + 354.96*month3 + 897.26*month4 + 1637.06*month5
    + 1337.41*month6 + 573.99*month7 + 699.64*month8 + 1125.10*month9
    + 960.06*month10 + 552.16*month11 + 495.29*month12
    - 386.34*holiday + 3084.11*temp - 1330.70*humidity - 2015.69*windspeed
    - 280.06*weathersit2 - 1596.43*weathersit3
We also tested this model for constant variance; the p-value is large enough that we fail to reject
the null hypothesis at the .05 level (A10). Having met the assumptions of constant variance and
normality, we decided to use the preceding model to predict the 2012 bicycle data.
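As an illustration of the backward AIC selection described above, the following is a minimal sketch using step() from base R on simulated data. The data frame sim and its columns x1, x2, and x3 are hypothetical; only the selection mechanics mirror our procedure.

```r
# Simulated data: y depends on x1 and x2 only; x3 is pure noise.
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

# Start from the full model
full <- lm(y ~ x1 + x2 + x3, data = sim)

# Backward elimination: at each step drop the term whose removal lowers the
# AIC the most, and stop when no removal lowers it further
selected <- step(full, direction = "backward", trace = 0)

# Selected formula and the AIC of the full and selected models
print(formula(selected))
print(c(full = extractAIC(full)[2], selected = extractAIC(selected)[2]))
```

Because backward elimination only removes a term when doing so lowers the AIC, the selected model never has a higher AIC than the full model it started from.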
We found that, on average, our predictions were lower than the actual count (cnt) of users
on any given day in 2012 (A11).
In an attempt to explain this result, we hypothesized that it may be due to the different
ways casual and registered users respond to the same factors when using the bike sharing
service. For example, on an extremely cold day, a casual user may decide to take their car
rather than use the bike share service, whereas a registered user may decide to use the bike sharing
service despite the bad weather, because they have already paid for their account. We also thought
that advertising would affect casual and registered users differently. This led us to create
separate models for casual and registered users in an attempt to obtain smaller residuals
when predicting 2012 data.
We started by creating a model for just casual users. Running a backward selection on all
our variables, we obtained the following model (A12):
casual = 1975.0791
    + 185.3567*month2 + 354.9600*month3 + 897.2602*month4 + 1637.0600*month5
    + 1337.4082*month6 + 573.9875*month7 + 699.6399*month8 + 1125.0984*month9
    + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
    - 280.0560*weathersit2 - 1596.4329*weathersit3
    + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
    - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp
However, the backward selection model violates the assumptions of constant variance (A13)
and linearity (A14). To fix these violations, we applied the Box-Cox method and chose to raise the
response variable to the power 0.4 (A15). This power is plausible because the inverse response plot
showed a roughly square-root relationship between the number of casual users and the chosen
regressors.
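To illustrate how the Box-Cox method arrives at a power, the sketch below maximizes the Box-Cox profile log-likelihood over a grid in base R. It uses simulated data built so that the response raised to the power 0.4 is linear in x. The helper boxcox_loglik, the grid, and the data are hypothetical and are not the code we ran.

```r
# Profile log-likelihood of the Box-Cox parameter lambda for a linear model.
# boxcox_loglik is a hypothetical helper, written for illustration only.
boxcox_loglik <- function(lambda, y, X) {
  n <- length(y)
  # Box-Cox transform; the limit as lambda goes to 0 is log(y)
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(lm.fit(X, z)$residuals^2)
  # Normal log-likelihood (up to a constant) plus the Jacobian term
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Simulated positive response for which y^0.4 is linear in x
set.seed(7)
x <- runif(400, 0, 5)
y <- (5 + 2 * x + rnorm(400, sd = 0.5))^2.5

# Design matrix with an intercept column
X <- cbind(1, x)

# Evaluate the profile log-likelihood on a grid and pick the maximizer
lambdas <- seq(-1, 2, by = 0.05)
loglik <- sapply(lambdas, boxcox_loglik, y = y, X = X)
lambda_hat <- lambdas[which.max(loglik)]
print(lambda_hat)
```

In practice the maximizing value is usually rounded to a convenient nearby power, which is how a value such as 0.4 can be chosen.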
The model for casual users after the power transformation is (A16):
casual^0.4 = 1975.0791
    + 185.3567*month2 + 354.9600*month3 + 897.2602*month4 + 1637.0600*month5
    + 1337.4082*month6 + 573.9875*month7 + 699.6399*month8 + 1125.0984*month9
    + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
    - 280.0560*weathersit2 - 1596.4329*weathersit3
    + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
    - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp
Furthermore, we checked the above model for linearity (A17) and non-constant variance (A18).
Compared with the original backward-selected model for casual users (A14), the model
transformed with the Box-Cox method is closer to linear. The non-constant variance test also
improved: the p-value rose from 3.42573E-05 in the original model (A13) to
0.006777438 in the transformed model (A18). Although the test still rejects constant variance at
the .05 level, the transformed model is clearly the better of the two for casual users.
We tried to further stabilize the variance of the casual model by removing outliers. Using
the outlier test, we removed two potential outliers and re-fit the transformed backward-selected
model, but this did not improve the constancy of the variance. Therefore, we reverted to the
transformed casual model above (A16) to predict the 2012 bicycle dataset.
On the original scale, the mean of the residuals for the number of casual users in 2012 is
approximately 300; on the transformed scale, the mean of the residuals is approximately 1.92 (A19).
Although these prediction results are not ideal, we decided to leave the casual model as it was,
move on to the registered users to see whether that group behaved better, and then try to work out
why our predictions have large residuals.
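The two residual means above are on different scales, which is why they differ so much in size. As a minimal sketch of the back-transformation, the following fits a model to a response raised to the power 0.4 and then returns the predictions to the original scale with the inverse power 1/0.4 = 5/2. The data frames train and test are simulated and hypothetical.

```r
# Simulated data for which y^0.4 is linear in x (illustrative only)
set.seed(3)
train <- data.frame(x = runif(300, 0, 5))
train$y <- (5 + 2 * train$x + rnorm(300, sd = 0.5))^2.5
test <- data.frame(x = runif(100, 0, 5))
test$y <- (5 + 2 * test$x + rnorm(100, sd = 0.5))^2.5

# Fit on the transformed scale
fit_t <- lm(I(y^0.4) ~ x, data = train)

# Predictions and residuals on the transformed scale
pred_t <- predict(fit_t, newdata = test)
resid_t <- test$y^0.4 - pred_t

# Back-transform with the inverse power 1/0.4 = 5/2; pmax() guards against
# negative fitted values, which cannot be raised to a fractional power
pred <- pmax(pred_t, 0)^(5 / 2)
resid_orig <- test$y - pred

print(c(mean_transformed = mean(resid_t), mean_original = mean(resid_orig)))
```

Residuals on the transformed scale are in units of users^0.4, so they should only be compared with other quantities on that same scale.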
In addition to the casual model, we also created a model for registered users. Initially, we put
all the variables into a backward selection algorithm to decide which variables are most
significant (A20). Fitting a linear model with the selected variables yields the following model
(A21):
registered = 1071.5075
    + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
    + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5
    + 1121.6707*month6 + 404.6830*month7 + 611.2622*month8 + 891.2946*month9
    + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
    - 853.9828*holiday
    + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3
    + 771.9176*weekday4 + 728.6853*weekday5 + 119.0951*weekday6
    - 207.3430*weathersit2 - 1335.0019*weathersit3
    + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
The test for non-constant variance yielded a p-value of 0.7760796, so we fail to reject the
hypothesis that our model has constant variance (A22). In addition, the diagnostic plots show that
the model satisfies the linearity assumption (A23).
The mean of the residuals from this model is 1765 (A24). Like the casual model, the registered
model underestimates the 2012 counts. Before considering any transformations to fix the
underestimation in our models, we decided to take a second look at our data to see whether there
was another cause. We noticed that the numbers of both registered and casual users in 2012 seemed
to be much larger than in 2011, so we calculated the average numbers of registered and casual users
in both years. We found that, on average, there were 342 more casual users and 1859 more
registered users per day in 2012 than in 2011 (A25). At the same time, temperature, humidity, and
weather situation overall did not change significantly (and month, day of the week, and holidays
obviously do not change either). Therefore, we have strong evidence that these increases are not
due to any of the variables available to us in the dataset, but to other factors that we have no
information about, such as the increasing popularity of the system or advertising. To capture
these increases, we applied a mean shift to the model for the registered users. In other words, our
model for the registered users has now become (A24):
registered = 1071.5075
    + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
    + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5
    + 1121.6707*month6 + 404.6830*month7 + 611.2622*month8 + 891.2946*month9
    + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
    - 853.9828*holiday
    + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3
    + 771.9176*weekday4 + 728.6853*weekday5 + 119.0951*weekday6
    - 207.3430*weathersit2 - 1335.0019*weathersit3
    + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
    + 1764.549*year
In the model above, we added a “year” variable and set its coefficient to the mean residual of our
predictions for the 2012 data. However, we decided against applying a similar mean shift to the
casual model because it has a transformed response variable: a shift on the transformed scale does
not correspond to a constant shift in the number of users, which hinders both prediction and
interpretation.
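The mean shift can be sketched in a few lines of base R. The example below uses simulated two-year data with a single regressor; the data frames year1 and year2 and all the numbers in them are hypothetical and only illustrate the mechanics.

```r
# Simulated two-year data: year 2 has the same temperature effect as year 1
# plus a jump in the overall level (illustrative only)
set.seed(11)
year1 <- data.frame(temp = runif(365))
year1$users <- 1000 + 2000 * year1$temp + rnorm(365, sd = 300)
year2 <- data.frame(temp = runif(366))
year2$users <- 2800 + 2000 * year2$temp + rnorm(366, sd = 300)

# Model fitted on year 1 only
fit <- lm(users ~ temp, data = year1)

# The mean residual of the year-2 predictions is used as the shift
pred <- predict(fit, newdata = year2)
shift <- mean(year2$users - pred)

# Shifted predictions: by construction their residuals average to zero
pred_shifted <- pred + shift
print(c(shift = shift, mean_resid_after = mean(year2$users - pred_shifted)))
```

Note that the shift is computed from the observed year-2 responses, so it removes the bias but leaves the spread of the residuals unchanged.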
Prediction
From the models constructed using 2011 data, we were able to explain a fair amount of the
variability in both registered and casual users of the Capital Bikeshare system in 2012 (R^2 of
around .66 in both cases). The casual user model is biased because it underestimates the number of
users in 2012. This underestimation could stem from many factors, including bicycle
trends and advertising, but these factors are not included in our dataset. However, the variance of
the casual user model is rather small, with an MSE of 8.78 computed on the transformed (power 0.4)
scale. Our registered user model is unbiased after the mean shift, with a mean residual of
essentially zero. However, because of the large number of registered users (and thus large
fluctuations in the data), our estimates of registered users in 2012 have large variance, with an
MSE of 754653. Overall, the worst predictions for both models occurred around holidays, when either
very many or very few people used bikes, and in extreme weather conditions (such as when Hurricane
Sandy hit in October 2012), when very few users, if any, used the bike system. Nonetheless, our
models predicted well overall (A26).
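The summary statistics quoted above (mean residual, MSE, and R^2) can be computed from the observed and predicted values as in the following sketch. The function prediction_stats and its argument n_coef are hypothetical names, and dividing the residual sum of squares by n minus the number of coefficients is an assumption about the divisor.

```r
# Out-of-sample fit statistics. n_coef is the number of coefficients in the
# fitted model (a hypothetical argument name); MSE divides RSS by n - n_coef.
prediction_stats <- function(actual, predicted, n_coef) {
  rss <- sum((actual - predicted)^2)      # residual sum of squares
  syy <- sum((actual - mean(actual))^2)   # total sum of squares
  c(mean_resid = mean(actual - predicted),
    mse = rss / (length(actual) - n_coef),
    r2 = 1 - rss / syy)
}

# Tiny worked example
actual <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4)
out <- prediction_stats(actual, predicted, n_coef = 1)
print(out)
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5, so R^2 is 1 - 0.5/5 = 0.9.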
Discussion
One should note that the mean shift we applied to our registered model is a special case for
this project. Here we had the luxury of observing the 2012 data, knowing about the average
increase, and therefore being able to adjust our model accordingly. In most real-life situations,
however, we would use the data we have to build a model that predicts future outcomes, and we would
not know the future values of the response variable ahead of time. Therefore, we need to be
especially careful when we build these models, gathering as much information as possible to
maximize our chance of capturing all the relevant predictors. Furthermore, for datasets that are
likely to see an increase in values (both predictor and response), we should monitor the data
closely and update the model frequently and promptly as new information arrives. Finally, for data
that show a strong and clear trend or pattern over time, other statistical techniques such as time
series modeling would be more appropriate and would likely give better predictions.
Appendix
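# Assumptions: ncvTest() and invResPlot() come from the car package;
# TrainingSet holds the 2011 daily data and TestingSet the 2012 daily data.
library(car)
# A1: scatterplot matrix of the response and the candidate regressors
pairs(~ cnt + season + mnth + holiday + weekday + workingday + weathersit +
        temp + atemp + hum + windspeed, data = TrainingSet)
# A2: initial model with all variables
lm1 <- lm(cnt ~ factor(season) + factor(mnth) + holiday + factor(weekday) +
            workingday + factor(weathersit) + temp + atemp + hum + windspeed,
          data = TrainingSet)
# A3: non-constant variance test
ncvTest(lm1)
# A4: residuals against fitted values
plot(fitted(lm1), resid(lm1))
# A5: model with log-transformed response
lm2 <- lm(log(cnt) ~ factor(season) + factor(mnth) + holiday + factor(weekday) +
            workingday + factor(weathersit) + temp + atemp + hum + windspeed,
          data = TrainingSet)
summary(lm2)
# A6: non-constant variance test for the log model
ncvTest(lm2)
# A7: residuals against fitted values for the log model
plot(fitted(lm2), resid(lm2))
# A8: AIC-based selection, forward from the null model and backward from lm1
starting.model <- lm(cnt ~ 1, data = TrainingSet)
step(starting.model, scope = ~ factor(season) + factor(mnth) + holiday +
       factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum +
       windspeed, direction = "forward")
backward.model <- step(lm1, scope = ~ 1, direction = "backward")
# A9: coefficients of the backward-selected model
summary(backward.model)
# A10: non-constant variance test for the backward-selected model
ncvTest(backward.model)
# A11: predict 2012 and inspect the prediction residuals
fit1 <- predict(backward.model, TestingSet)
residuals1 <- TestingSet$cnt - fit1
plot(residuals1 ~ TestingSet$instant)
mean(residuals1)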
# A12: backward selection for casual users
starting.casual1 <- lm(casual ~ factor(season) + factor(mnth) + holiday +
                         factor(weekday) + workingday + factor(weathersit) +
                         temp + atemp + hum + windspeed, data = TrainingSet)
step(starting.casual1, scope = ~ 1, direction = "backward")
backwardCasual <- lm(casual ~ factor(mnth) + holiday + factor(weekday) +
                       factor(weathersit) + temp + hum + windspeed,
                     data = TrainingSet)
# A13: non-constant variance test
ncvTest(backwardCasual)
# A14: diagnostic plots
plot(backwardCasual)
# A15: inverse response plot
invResPlot(backwardCasual)
# A16: response raised to the power 0.4
backwardCasual3 <- lm((casual)^0.4 ~ factor(mnth) + holiday + factor(weekday) +
                        factor(weathersit) + temp + hum + windspeed,
                      data = TrainingSet)
# A17: diagnostic plots for the transformed model
plot(backwardCasual3)
# A18: non-constant variance test for the transformed model
ncvTest(backwardCasual3)
# A19: residuals on the transformed scale, then back-transformed (1/0.4 = 5/2)
fitTCasual <- predict(backwardCasual3, TestingSet)
residualTCasual <- (TestingSet$casual)^0.4 - fitTCasual
mean(residualTCasual)
fitCasual <- (fitTCasual)^(5/2)
residualCasual <- TestingSet$casual - fitCasual
mean(residualCasual)
# A20: backward selection for registered users
starting.registered1 <- lm(registered ~ factor(season) + factor(mnth) + holiday +
                             factor(weekday) + workingday + factor(weathersit) +
                             temp + atemp + hum + windspeed, data = TrainingSet)
step(starting.registered1, scope = ~ 1, direction = "backward")
# A21: model with the selected variables
backwardRegistered <- lm(registered ~ factor(season) + factor(mnth) + holiday +
                           factor(weekday) + factor(weathersit) + atemp + hum +
                           windspeed, data = TrainingSet)
summary(backwardRegistered)
# A22: non-constant variance test
ncvTest(backwardRegistered)
# A23: diagnostic plots
plot(backwardRegistered)
# A24: predict 2012 registered users; the mean residual is the mean shift
fitregistered <- predict(backwardRegistered, TestingSet)
residualRegistered <- TestingSet$registered - fitregistered
mean(residualRegistered)
# A25: change in mean daily users from 2011 to 2012
mean(TestingSet$casual) - mean(TrainingSet$casual)
mean(TestingSet$registered) - mean(TrainingSet$registered)
# A26: MSE and R^2 of the 2012 predictions (casual on the transformed scale)
RSSCasual <- sum(((TestingSet$casual)^0.4 - fitTCasual)^2)
MSECasual <- RSSCasual / 341
# Total sum of squares is centered on the mean of the transformed response
SYYCasual <- sum(((TestingSet$casual)^0.4 - mean((TestingSet$casual)^0.4))^2)
SSRegCasual <- SYYCasual - RSSCasual
R2Casual <- SSRegCasual / SYYCasual
RSSRegistered <- sum((TestingSet$registered - fitregistered)^2)
MSERegistered <- RSSRegistered / 337
SYYRegistered <- sum((TestingSet$registered - mean(TestingSet$registered))^2)
SSRegRegistered <- SYYRegistered - RSSRegistered
R2Registered <- SSRegRegistered / SYYRegistered

 
Final presentation
Final presentationFinal presentation
Final presentation
 
IRJET- Facial Age Estimation with Age Difference
IRJET-  	  Facial Age Estimation with Age DifferenceIRJET-  	  Facial Age Estimation with Age Difference
IRJET- Facial Age Estimation with Age Difference
 
Employee mode of commuting
Employee mode of commutingEmployee mode of commuting
Employee mode of commuting
 
Creative Methods for Transportation Modeling
Creative Methods for Transportation ModelingCreative Methods for Transportation Modeling
Creative Methods for Transportation Modeling
 
Data Driven Energy Economy Prediction for Electric City Buses Using Machine L...
Data Driven Energy Economy Prediction for Electric City Buses Using Machine L...Data Driven Energy Economy Prediction for Electric City Buses Using Machine L...
Data Driven Energy Economy Prediction for Electric City Buses Using Machine L...
 
Comparative Analysis of the Multi-modal Transportation Environments in the No...
Comparative Analysis of the Multi-modal Transportation Environments in the No...Comparative Analysis of the Multi-modal Transportation Environments in the No...
Comparative Analysis of the Multi-modal Transportation Environments in the No...
 
HealthOrzo – Your Health Matters
HealthOrzo – Your Health MattersHealthOrzo – Your Health Matters
HealthOrzo – Your Health Matters
 

Analysis on Bike Rental Data to Predict Future Use

  • 1. MA 575: Analysis on Bike Rental Data to Predict Future Use. By: Miles Avila, Kevin Choi, JungTak Joo, Kimberly Nguyen, Tianyuan Zhou. 12/9/2014. Casual Model Building: J.J., T.Z.; Registered Model Building: K.C., J.J., K.N.; Introduction & Background: M.A.; Modeling and Analysis: K.N.; Prediction & Discussion: T.Z.; Proofreading & formatting: M.A., K.C., K.N.
  • 2. Analysis on Bike Rental Data to Predict Future Use
Abstract
The goal of this analysis is to predict the number of bike users on any given day in a year using linear model techniques. Given the increasing popularity of bike sharing and the amount of available data, predictive models and analysis are becoming more important for understanding bike users and programs. Our analysis begins with exploratory data analysis techniques, including scatterplots of the original data. The exploratory analysis provided preliminary insight into our dataset, which helped us create our early models. We then improved our models using variable selection, transformation, model comparison, and tests for non-constant variance. Our final predictive model is divided into two separate models: one for casual and one for registered bike users. The final casual model is biased, due mostly to the increase in bike users in 2012, while the registered model, after a mean shift, is unbiased but has large variance. Our predictive models suggest that the worst predictions for both models occurred around holidays and during extreme weather conditions.
Introduction
Bike sharing is an innovative transportation program, ideal for short-distance point-to-point trips, that lets users pick up a bicycle at any self-serve bike station and return it to any other bike station within the system's service area. These systems have become popular in major metropolitan areas around the world. Currently, there are over 500 bike-sharing programs worldwide, which together comprise over 500 thousand bicycles. There is great interest in these systems today because of their important role in traffic, environmental, and health issues. Bike rental is automated, which, when coupled with other sensor data such as temperature and weather characteristics, makes it easier to predict future use of the bikes.
From the perspective of the companies that own these systems, it is of interest to create accurate models that predict bike use on any given day. In contrast to other methods of transportation, such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This unique feature lets the bike-sharing system act as a virtual sensor network that can be used to sense mobility in the city. It may even be possible to detect which events are most important in a city by monitoring these data.
Background
In this study, we create a model, fitted to one year of data, that predicts the number of bike-sharing users on any given day of a different year. In general, predictions are difficult because there are many variables that are unaccounted for in our dataset. These include, but are not limited to, business affairs among bike-sharing companies, an increase in the popularity of the services (i.e., whether bike sharing has become a societal trend), and especially cost fluctuations of the services. The data are a mix of numerical and categorical variables. They include the count of users on any given day, split into casual and registered users; the state of the weather (measured by temperature, feeling temperature, humidity, wind speed, and weather situation); and categorical variables describing what kind of day it was (weekday, holiday, season, and month). The data set was collected in 2011 and 2012 in Washington, D.C.
  • 3. The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. The UCI Machine Learning Repository aggregated the data into two datasets, one hourly and one daily, and added the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com. The essential goal of this study was to create a linear model that predicts the number of bike users on a given day with constant variance and minimal residual values.
Modeling & Analysis
The first step we took in this process was to examine a scatterplot matrix in order to understand the correlations among the variables (A1).
[Figure: scatterplot matrix of the variables (A1)]
From here, we created an initial model with count (cnt) as the response and all the other variables in the dataset as the regressors (A2). To assess our model, we first tested whether it violates the assumption of constant variance (A3). The p-value of the non-constant variance test is 0.05701636, so at a significance level of .05 we narrowly fail to reject the hypothesis of constant variance. The residual plot also suggests that the linearity assumption holds for this model (A4).
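As an illustration of what the non-constant variance test in (A3) measures, here is a minimal base-R sketch of a score test of the Breusch-Pagan / Cook-Weisberg type. It runs on synthetic data, not the bike-share data; the function name score_test_ncv is hypothetical, and the assumption that it matches the default behaviour of car::ncvTest has not been checked against the authors' output.

```r
# Score test for non-constant variance, written with base R only.
# Synthetic data for illustration; NOT the bike-share dataset.
set.seed(575)
n <- 200
x <- runif(n, 0, 1)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 2 * x)  # error spread grows with x

fit <- lm(y ~ x)

# Hypothetical helper: tests whether the error variance changes with the fitted values.
score_test_ncv <- function(model) {
  e <- resid(model)
  n <- length(e)
  u <- e^2 / (sum(e^2) / n)            # squared residuals scaled by the ML variance estimate
  aux <- lm(u ~ fitted(model))         # auxiliary regression on the fitted values
  ss_reg <- sum((fitted(aux) - mean(u))^2)
  stat <- ss_reg / 2                   # score statistic, compared to chi-squared with 1 df
  p <- pchisq(stat, df = 1, lower.tail = FALSE)
  c(statistic = stat, p.value = p)
}

score_test_ncv(fit)
```

A small p-value is evidence against constant variance; a p-value just above .05, as reported for the initial model, is only weak support for the assumption.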
  • 4. Next, following convention, we applied a logarithmic transformation to the response variable (A5). We tested once more for constant variance and, contrary to our expectations, this model was far from having constant variance (A6). Based on the residual plot, we also found that this model is not linear (A7). Since neither of these was a satisfactory model, we used AIC to determine which variables should be included in order to obtain the best model. We ran the AIC selection in the backward direction (A8). Fitting a linear model to the selected variables gives the following model (A9):
cnt = 1975.08 + 424.48*season2 + 850.09*season3 + 1151.59*season4
      + 185.36*month2 + 354.96*month3 + 897.26*month4 + 1637.06*month5 + 1337.41*month6 + 573.99*month7
      + 699.64*month8 + 1125.10*month9 + 960.06*month10 + 552.16*month11 + 495.29*month12
      - 386.34*holiday + 3084.11*temp - 1330.70*humidity - 2015.69*windspeed
      - 280.06*weathersit2 - 1596.43*weathersit3
We also tested for constant variance, and the p-value is large enough that we fail to reject the null hypothesis at .05 (A10). Having met the assumptions of constant variance and normality, we decided to use the preceding model to predict the 2012 bicycle data. We found that, on average, our predictions were lower than the actual count (cnt) of users on any given day in 2012 (A11). In an attempt to explain this result, we hypothesized that it may be due to the different ways casual and registered users respond to the same factors. For example, on an extremely cold day, a casual user may decide to take their car rather than use the bike-share service, whereas a registered user may decide to use the service despite the bad weather because they have already paid for their account. We also thought that advertisement would affect casual and registered users differently.
This led us to create separate models for casual and registered users in an attempt to obtain smaller residuals when predicting the 2012 data.
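The workflow described above (backward AIC selection on one year, then prediction for the next year) can be sketched as follows. This is a minimal example on synthetic data with hypothetical column names and a built-in level jump between years; it is not the authors' data or fitted model.

```r
# Backward AIC selection on a training year, then prediction on a held-out year.
# Synthetic data with hypothetical column names; NOT the bike-share dataset.
set.seed(1)
n <- 300
dat <- data.frame(
  yr    = rep(c(0, 1), each = n / 2),  # 0 = first year, 1 = second year
  temp  = runif(n),
  hum   = runif(n),
  noise = rnorm(n)                     # unrelated to the response by construction
)
dat$cnt <- 2000 + 3000 * dat$temp - 1300 * dat$hum +
  1500 * dat$yr + rnorm(n, sd = 400)   # second year is higher for reasons not in the regressors

train <- subset(dat, yr == 0)
test  <- subset(dat, yr == 1)

full     <- lm(cnt ~ temp + hum + noise, data = train)
backward <- step(full, direction = "backward", trace = 0)

pred       <- predict(backward, newdata = test)
mean_resid <- mean(test$cnt - pred)    # positive value means the model under-predicts
mean_resid
```

Because the year-to-year increase is not explained by any regressor available in the training year, the mean hold-out residual is large and positive, which mirrors the under-prediction reported in (A11).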
  • 5. We started by creating a model for casual users only. Running a backward selection on all our variables, we obtained the following model (A12):
casual = 1975.0791
      + 185.3567*month2 + 354.9600*month3 + 897.2602*month4 + 1637.0600*month5 + 1337.4082*month6 + 573.9875*month7
      + 699.6399*month8 + 1125.0984*month9 + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
      - 280.0560*weathersit2 - 1596.4329*weathersit3
      + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
      - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp
However, the backward selection model violates the assumptions of constant variance (A13) and linearity (A14). In order to fix these violations, we applied the Box-Cox method and chose to raise the response variable to the power of .4 (A15). The chosen power transformation makes sense because the inverse response plot showed a relation close to a square root between the number of casual users and the chosen regressors. The model for casual users after the power transformation is (A16):
casual^0.4 = 1975.0791
      + 185.3567*month2 + 354.9600*month3 + 897.2602*month4 + 1637.0600*month5 + 1337.4082*month6 + 573.9875*month7
      + 699.6399*month8 + 1125.0984*month9 + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
      - 280.0560*weathersit2 - 1596.4329*weathersit3
      + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
      - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp
Furthermore, we checked the above model for linearity (A17) and non-constant variance (A18). Our tests yielded the following results:
[Figures: residual plots (A17) and non-constant variance test output (A18) for the transformed casual model]
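For reference, the relation between the power-0.4 scale and the original scale can be written out as below. The symbols z, e_z and e_y are introduced here for illustration only; the exponent 5/2 is the one used in (A19).

```latex
\[
  z = \text{casual}^{0.4},
  \qquad
  \widehat{\text{casual}} = \hat{z}^{\,1/0.4} = \hat{z}^{\,5/2}
\]
\[
  e_z = \text{casual}^{0.4} - \hat{z} \quad \text{(transformed scale)},
  \qquad
  e_y = \text{casual} - \hat{z}^{\,5/2} \quad \text{(original scale)}
\]
```

Residuals on the two scales are in different units, so a mean of e_z and a mean of e_y are not directly comparable.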
  • 6. In comparison to the original backward-selected model for casual users (A14), the model fitted after the Box-Cox transformation is closer to linear. The non-constant variance test also improves: the p-value rose from 3.42573E-05 in the original model (A13) to 0.006777438 in the transformed model (A18), although constant variance is still rejected at the .05 level. We therefore prefer the transformed model for casual users. We tried to improve the variance of the casual model further by removing outliers. Using the outlier test, we removed two potential outliers and re-ran the transformed backward-selected model, but this did not make the variance more constant. We therefore reverted to the transformed casual model above (A16) to predict the 2012 bicycle dataset. On the original scale, the mean of the residuals for the number of casual users is approximately 300; on the transformed scale, the mean of the 2012 residuals is approximately 1.92 (A19). Although the prediction results are not ideal, we decided to leave the casual model as it is for now and move on to the registered users, to see whether that group behaves better and, possibly, to figure out why our predictions have large residuals. In addition to the casual model, we also created a model for registered users. Initially, we put all the variables into a backward selection algorithm in order to decide which variables are most significant (A20).
Running a linear model on the significant variables yields the following model (A21):
registered = 1071.5075 + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
      + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5 + 1121.6707*month6 + 404.6830*month7
      + 611.2622*month8 + 891.2946*month9 + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
      - 853.9828*holiday
      + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3 + 771.9176*weekday4 + 728.6853*weekday5 + 119.0951*weekday6
      - 207.3430*weathersit2 - 1335.0019*weathersit3
      + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
The test of non-constant variance yielded a p-value of 0.7760796, leading us to conclude that our model has constant variance (A22). In addition, the model fulfills the linearity assumption (A23):
[Figure: diagnostic plots for the registered model (A23)]
  • 7. The mean of the residuals from this model is 1765 (A24). Like the casual model, the registered model underestimates. Before considering any transformations to fix the underestimation in our models, we decided to take a second look at our data to see whether there was another cause. We noticed that the numbers of both registered and casual users in 2012 seem to be much larger than in 2011, so we calculated the average numbers of registered and casual users in both years. We found that, on average, there were 342 more casual users and 1859 more registered users per day in 2012 than in 2011 (A25). At the same time, temperature, humidity, and weather situation did not change significantly overall (and month, day of the week, and holidays obviously do not change either). Therefore, we have strong evidence that these increases are not due to any of the variables available to us in the dataset, but to other factors that we have no information about, such as the increasing popularity of the system or advertisement. In order to capture these increases, we applied a mean shift to the model for the registered users. In other words, our model for the registered users has now become (A24):
registered = 1071.5075 + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
      + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5 + 1121.6707*month6 + 404.6830*month7
      + 611.2622*month8 + 891.2946*month9 + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
      - 853.9828*holiday
      + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3 + 771.9176*weekday4 + 728.6853*weekday5 + 119.0951*weekday6
      - 207.3430*weathersit2 - 1335.0019*weathersit3
      + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
      + 1764.549*year
In the model above, we added a "year" variable, whose coefficient is the mean residual of our predicted values for the 2012 data.
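The mean shift can be sketched in a few lines of R. This is a minimal example on synthetic data with hypothetical names, assuming the shift is simply the mean hold-out residual added to every prediction for the later year, as the paragraph above describes.

```r
# Mean shift: add the mean hold-out residual as a constant offset for the later year.
# Synthetic data with hypothetical names; NOT the bike-share dataset.
set.seed(2)
n <- 200
train <- data.frame(atemp = runif(n))
train$registered <- 1000 + 2000 * train$atemp + rnorm(n, sd = 300)
test <- data.frame(atemp = runif(n))
test$registered <- 1000 + 2000 * test$atemp + 1800 + rnorm(n, sd = 300)  # level jump of 1800

fit      <- lm(registered ~ atemp, data = train)
raw_pred <- predict(fit, newdata = test)

shift        <- mean(test$registered - raw_pred)  # plays the role of the "year" coefficient
shifted_pred <- raw_pred + shift

mean(test$registered - shifted_pred)              # zero up to rounding, by construction
```

Since the shift is computed from the hold-out responses themselves, the shifted predictions have zero mean residual by construction, so the absence of bias after the shift is not independent evidence of model quality.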
However, we decided against applying a similar mean shift to the casual model because it has a transformed response variable. The transformation complicates the mean shift and hinders its predictive value and interpretability.
Prediction
With the models constructed from the 2011 data, we were able to explain a fair amount of the variability in both registered and casual users of the Capital Bikeshare system in 2012 (R^2 of around .66 in both cases).
  • 8. The casual user model is biased because it underestimates the number of users in 2012. The underestimation could be explained by many different factors, including bicycle trends and advertisement, but these factors are not included in our dataset. However, the variance of the casual user model is rather small, with an MSE of 8.78 on the transformed (power .4) scale. Our registered user model is unbiased after the mean shift, with a mean residual of essentially zero. However, because of the large number of registered users (and thus large fluctuations in the data), our estimates of registered users in 2012 have large variance, with an MSE of 754653. Overall, the worst predictions for both models occurred around holidays, when either very many or very few people were using bikes, and in extreme weather conditions (such as when Hurricane Sandy hit in October 2012), when very few users, if any, were using the bike system. Outside these cases, our models predicted reasonably well (A26).
Discussion
One should note that the mean shift we applied to our registered model is a special case for this project. Here we had the luxury of observing the 2012 data and knowing about the average increase, and were therefore able to make the proper adjustment to our model. In most real-life situations, however, we would be using the data we have to create a model that predicts future outcomes, and we would not know the future values of the response variable ahead of time. Therefore, we need to be especially careful when we build these models. We need to gather as much information as possible to maximize our chance of capturing all the predictive variables. Furthermore, for datasets that are likely to see an increase in values (both predictor and response), we should monitor the data closely and update the model frequently and quickly after receiving new information.
Finally, for data that show a strong and clear trend or pattern related to time, other statistical techniques such as time series modeling would be more appropriate and would result in better predictions.
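The hold-out MSE and R^2 quoted in the Prediction section follow the usual residual-sum-of-squares definitions, which the sketch below spells out. The function name holdout_metrics and its n_coef argument are hypothetical, and the four data points are made up purely to show the arithmetic.

```r
# Hold-out error summaries from actual and predicted values.
# n_coef is the number of fitted coefficients, used as the degrees-of-freedom correction.
holdout_metrics <- function(actual, predicted, n_coef) {
  rss <- sum((actual - predicted)^2)             # residual sum of squares
  syy <- sum((actual - mean(actual))^2)          # total sum of squares
  c(mse = rss / (length(actual) - n_coef),
    r2  = 1 - rss / syy)
}

# Toy numbers for illustration only.
holdout_metrics(actual    = c(1, 2, 3, 4),
                predicted = c(1.1, 1.9, 3.2, 3.8),
                n_coef    = 2)
# mse = 0.05, r2 = 0.98
```

For a model with a transformed response, both vectors should be on the same scale (for example, both raised to the power .4) before these quantities are compared across models.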
  • 9. Appendix
library(car)  # provides ncvTest() and invResPlot()
# Assumption: TrainingSet holds the 2011 rows and TestingSet the 2012 rows.
# The original calls relied on an attached data frame; data = TrainingSet is made explicit here.

A1:
pairs(~ cnt + season + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed, data = TrainingSet)
A2:
lm1 <- lm(cnt ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed, data = TrainingSet)
A3:
ncvTest(lm1)
A4:
plot(fitted(lm1), resid(lm1))  # residuals against fitted values
A5:
lm2 <- lm(log(cnt) ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed, data = TrainingSet)
summary(lm2)
A6:
ncvTest(lm2)
A7:
plot(fitted(lm2), resid(lm2))
A8:
starting.model <- lm(cnt ~ 1, data = TrainingSet)
step(starting.model, scope = ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed, direction = "forward")
backward.model <- step(lm1, scope = ~ 1, direction = "backward")
A9:
summary(backward.model)
A10:
ncvTest(backward.model)
A11:
fit1 <- predict(backward.model, TestingSet)
residuals1 <- TestingSet$cnt - fit1
plot(residuals1 ~ TestingSet$instant)
mean(residuals1)
A12:
starting.casual1 <- lm(casual ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed, data = TrainingSet)
step(starting.casual1, scope = ~ 1, direction = "backward")
backwardCasual <- lm(casual ~ factor(mnth) + holiday + factor(weekday) + factor(weathersit) + temp + hum + windspeed, data = TrainingSet)
A13:
ncvTest(backwardCasual)
A14:
plot(backwardCasual)
A15:
invResPlot(backwardCasual)
A16:
backwardCasual3 <- lm(casual^0.4 ~ factor(mnth) + holiday + factor(weekday) + factor(weathersit) + temp + hum + windspeed, data = TrainingSet)
A17:
plot(backwardCasual3)
A18:
ncvTest(backwardCasual3)
A19:
fitTCasual <- predict(backwardCasual3, TestingSet)
residualTCasual <- TestingSet$casual^0.4 - fitTCasual  # residuals on the transformed scale
mean(residualTCasual)
fitCasual <- fitTCasual^(5/2)                          # back-transform to the original scale
residualCasual <- TestingSet$casual - fitCasual
mean(residualCasual)
A20:
starting.registered1 <- lm(registered ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed, data = TrainingSet)
step(starting.registered1, scope = ~ 1, direction = "backward")
  • 10.
A21:
backwardRegistered <- lm(registered ~ factor(season) + factor(mnth) + holiday + factor(weekday) + factor(weathersit) + atemp + hum + windspeed, data = TrainingSet)
summary(backwardRegistered)
A22:
ncvTest(backwardRegistered)
A23:
plot(backwardRegistered)
A24:
fitregistered <- predict(backwardRegistered, TestingSet)
residualRegistered <- TestingSet$registered - fitregistered
mean(residualRegistered)
A25:
mean(TestingSet$casual) - mean(TrainingSet$casual)
mean(TestingSet$registered) - mean(TrainingSet$registered)
A26:
# Casual model, evaluated on the transformed (power 0.4) scale
RSSCasual <- sum((TestingSet$casual^0.4 - fitTCasual)^2)
MSECasual <- RSSCasual / 341
SYYCasual <- sum((TestingSet$casual^0.4 - mean(TestingSet$casual^0.4))^2)
SSRegCasual <- SYYCasual - RSSCasual
R2Casual <- SSRegCasual / SYYCasual
# Registered model, evaluated on the original scale
RSSRegistered <- sum((TestingSet$registered - fitregistered)^2)
MSERegistered <- RSSRegistered / 337
SYYRegistered <- sum((TestingSet$registered - mean(TestingSet$registered))^2)
SSRegRegistered <- SYYRegistered - RSSRegistered
R2Registered <- SSRegRegistered / SYYRegistered