IEOR 265 Final Project
Application of Machine Learning Techniques to Forecast
Bike Rental Demand in the Capital Bikeshare Program in
Washington, D.C.
by
Minchao Lin
May 8, 2015
Abstract
Forecasting demand is a crucial issue in efficient resource management, and machine learning
techniques can help build and refine models that learn from observed data and make predictions.
Specifically, supervised learning models the relation between a set of predictor variables and one
or more response variables on the basis of a finite set of observations.
The objective of this project is to combine historical usage patterns with weather data in
order to forecast the total count of bikes rented during each hour in the bike sharing system in
Washington, D.C. In this paper, multiple machine learning techniques, including ordinary least
squares regression, lasso regression, elastic net, ensemble learning methods, neural networks, and
local linear regression, are discussed, and their effectiveness in predicting the response variable is
evaluated and compared.
1 Introduction
1.1 Background
A bicycle sharing system provides bicycles for shared use to individuals on a short-term
basis. These systems are becoming increasingly popular in major cities as a convenient means of
transportation. As of June 2014, public bicycle sharing systems were available in 712 cities across
five continents, operating approximately 806,200 bicycles at 37,500 stations. With these systems,
bicycle rental is completely automated via a network of kiosk locations throughout a city, and
people are able to rent a bike from one location and return it to a different one. To determine the
right number of bicycles to meet demand in the city, historical data is a good resource for
performing the demand analysis.
1.2 Data Description
Hourly rental data spanning the two years 2011 to 2012 are provided for this project, with
variables including date & time, season, holiday, working day, weather, temperature, humidity,
wind speed, number of registered and non-registered user rentals initiated, and number of total
rentals. To test the effectiveness of a model, the historical data are split into three sets: the
training set comprises the first 15 days of each month, the test set comprises days 16 to 19 of
each month, and the validation set covers the 20th to the end of the month. Details on the
predictor variables and response variables are listed in the Appendix. Each method in the
following sections is first fitted on the training set and tested on the test set, with the mean
squared error calculated each time. The methods with the lowest mean squared errors are then
refitted on the combined training and test sets and evaluated on the validation set to obtain the
root mean squared logarithmic error.
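As a minimal sketch of this split in Matlab (assuming the hourly records are stored in a table T
with a datetime column named dates; these names are illustrative, and the actual code is in the
Appendix):

    % Split hourly observations by day of month.
    d = day(T.dates);                    % day of month for each hourly record
    trainSet = T(d <= 15, :);            % days 1 to 15: training set
    testSet  = T(d >= 16 & d <= 19, :);  % days 16 to 19: test set
    validSet = T(d >= 20, :);            % day 20 to month end: validation set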
1.2.1 Convert Categorical Variables to Dummy Variables
One solution to the dummy variable trap is to drop one of the dummy variables: if there are m
categories, use only m-1 of them in the model. The level left out can be thought of as the
reference value, and the fitted values of the remaining categories represent the change from this
reference. For the bike sharing demand data, year, month, hour, weekday, season, holiday,
working day, and weather are categorical variables and are converted to dummy variables.
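As an illustrative sketch, one way to build such m-1 dummy columns in Matlab is with dummyvar
from the Statistics Toolbox (the variable season here is a stand-in for any of the categorical
predictors, coded as positive integers):

    % Convert one categorical predictor to m-1 dummy columns.
    D = dummyvar(season);   % n-by-m indicator matrix, one column per category
    D = D(:, 2:end);        % drop the first column; category 1 becomes the reference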
1.2.2 Relationship between numerical predictor variables
Figure 1. Scatter plot matrix plotting each numerical variable against the others.
The scatter plot matrix above shows the relationships between the numerical predictor variables
and the response variable; ordered from top to bottom (and from left to right), the variables are
temperature, “feels like” temperature, relative humidity, wind speed, and total number of rentals.
The plot shows fairly independent relationships between the variables, except between
temperature and “feels like” temperature, which is reasonable as these two variables are generally
very close to each other. Because multicollinearity can increase the variance of the coefficient
estimates and make the estimates very sensitive to minor changes, we will apply regularization to
the methods to counteract this tendency.
1.3 Performance Metrics
For regression problems, a measure of the distance between the estimated outputs and the actual
outputs is used to quantify a model's performance. The Mean Squared Error penalizes bigger
differences more because of the squaring. If we instead want to reduce the penalty on bigger
differences, we can log-transform the quantities first; the effect of the logarithm is to balance the
emphasis on small and large predictive errors. For this project, the effectiveness of the models is
evaluated using the Mean Squared Error (MSE) and the Root Mean Squared Logarithmic Error
(RMSLE):
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}$$
where:
·n is the number of hours in the test set
·p_i is the predicted count
·a_i is the actual count
·log(x) is the natural logarithm
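In Matlab, this metric can be computed directly from the predicted and actual counts (a one-line
sketch; p and a are column vectors of predictions and actual values):

    % RMSLE between predicted counts p and actual counts a.
    rmsle = @(p, a) sqrt(mean((log(p + 1) - log(a + 1)).^2));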
2 Ordinary Least Squares Regression
2.1 Method Description
Ordinary least squares (OLS) is a method for estimating the unknown parameters of a linear
regression model, with the goal of minimizing the differences between the observed responses and
the predicted responses.
Let X be an n × p training data input matrix, where n is the total number of observations and p is
the number of features per observation; let Y be the n × 1 vector of training data response values;
and let β be the p × 1 vector of unknown parameters. Then the OLS estimate of β for the linear
model is defined as
$$\hat{\beta} = (X'X)^{-1}X'Y$$
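A minimal Matlab sketch of this fit (Xtrain, Ytrain, Xtest, and Ytest are illustrative names for the
matrices defined above; the backslash operator solves the same normal equations more stably than
forming the explicit inverse):

    % OLS fit; Xtrain includes a column of ones for the intercept.
    beta  = Xtrain \ Ytrain;           % least squares solution of Xtrain*beta = Ytrain
    Ypred = Xtest * beta;              % predicted responses on the test set
    mse   = mean((Ytest - Ypred).^2);  % mean squared error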
2.2 Performance Metric
Mean Squared Error = 10015
2.3 Result Analysis
The mean squared error is rather high: the relationship between bike rental demand and its
exogenous factors appears to be complex and nonlinear, making it difficult to model with
traditional linear regression.
3 Lasso Regularization and Elastic Net
3.1 Method Description
3.1.1 Lasso Regularization
Lasso regression is a regularized version of linear regression that minimizes the sum of squared
errors subject to a constraint on the L1-norm of the coefficients. In this paper, a 5-fold
cross-validated sequence of lasso models is fitted in order to produce shrinkage estimates with
potentially lower predictive error than ordinary least squares.
3.1.2 Elastic Net
Elastic net is a combination of ridge regression and lasso regularization. Like lasso, elastic net
can generate zero-valued coefficients. Empirical studies suggest that elastic net can outperform
lasso on data with highly correlated predictors.
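A sketch of both fits using the Statistics Toolbox lasso function (the Alpha value for the elastic
net and the variable names are illustrative):

    % 5-fold cross-validated lasso (Alpha defaults to 1) and elastic net (0 < Alpha < 1).
    [Blasso, infoL] = lasso(Xtrain, Ytrain, 'CV', 5);
    [Benet,  infoE] = lasso(Xtrain, Ytrain, 'CV', 5, 'Alpha', 0.5);
    idx   = infoL.IndexMinMSE;                              % lambda with minimum CV MSE
    Ypred = Xtest * Blasso(:, idx) + infoL.Intercept(idx);  % predict with the selected model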
3.2 Performance Metric
Figure 2. Lambda vs. MSE for Lasso fit Figure 3. Lambda vs. MSE for Elastic Net fit
Mean Squared Error of Lasso = 10101
Mean Squared Error of Elastic Net = 10118
3.3 Result Analysis
The large mean squared errors of both lasso and elastic net indicate that even regularized linear
regression is not a good approach to forecast the bike sharing demand. In the following sections,
we will explore multiple nonlinear regression techniques.
4 Ensemble Learning and Ensemble Regularization
4.1 Method Description
Ensemble methods use multiple learning algorithms to obtain better predictive performance. An
ensemble is a technique for combining many weak learners in order to produce a strong learner.
4.1.1 Least Squares Boosting
Least Squares Boosting is a type of ensemble learning which fits regression ensembles in order
to minimize mean squared error. At every step, the ensemble fits a new learner to the difference
between the observed response and the aggregated prediction of all learners grown previously.
4.1.2 Bagging
Bagging is another type of ensemble learning which works by training learners on resampled
versions of the data. The resampling is done by bootstrapping observations in the training set.
Although the flexibility of ensembles makes them prone to over-fitting the training data, bagging
tends to reduce this problem.
4.1.3 Ensemble Regularization
Ensemble regularization helps choose fewer weak learners in a way that does not diminish
predictive performance. Specifically, it finds an optimal set of learner weights by tuning the lasso
parameter to minimize the ensemble resubstitution error.
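The three variants can be fitted with the Statistics Toolbox ensemble functions; the sketch below
assumes 200 trees and an illustrative lambda grid:

    % Least-squares boosting and bagging with 200 regression trees.
    lsb = fitensemble(Xtrain, Ytrain, 'LSBoost', 200, 'Tree', 'Type', 'regression');
    bag = fitensemble(Xtrain, Ytrain, 'Bag', 200, 'Tree', 'Type', 'regression');
    % Lasso-regularize the bagged ensemble and keep learners with nonzero weights.
    bag = regularize(bag, 'Lambda', logspace(-1, 4, 10));
    cmp = shrink(bag, 'WeightColumn', 5);   % compact ensemble at the 5th lambda value
    Ypred = predict(cmp, Xtest);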
4.2 Performance Metric
Mean Squared Error of Least Squares Boosting = 10030
Mean Squared Error of Bagging = 3656.5
Mean Squared Error of Regularized Bagging = 3473.4
Root Mean Squared Logarithmic Error of Regularized Bagging for Validation data set = 0.63302
4.3 Result Analysis
4.3.1 Least Squares Boosting
1) Figure 4 estimates the generalization error by cross-validation. The curve shows that a smaller
ensemble, perhaps one containing 100 to 120 trees, is sufficient for satisfactory performance.
Figure 4. Number of trees vs. Cross-validated MSE
2) Predictor importance estimates for the model, with higher values representing greater
importance: we see that hour, month, atemp, temp, humidity, season, and year have the greatest
importance.
3) Let the errors be the differences between the predicted and the actual counts. The normal
probability plot of the errors shows that the residuals are close to normally distributed in the
center of the data but skew away from normality above and below the mean.
Figure 5. Normal Probability Plot
4) We separate the errors into groups by the categorical variables to see whether the error
distribution in any period differs significantly from the others.
Figure 6. Breakdown of Errors by hour Figure 7. Breakdown of Errors by month
Figure 8. Breakdown of Errors by weekday Figure 9. Breakdown of Errors by season
We observe that the errors during hours 7, 8, 17, and 18 change markedly from those in the hours
immediately before and after. In the breakdown of errors by weekday, the Saturday and Sunday
patterns appear different from those of the workdays, where the variance of the errors tends to be
smaller.
4.3.2 Bagging
1) Importance of Variables:
Compared to Least Squares Boosting, “hour” now becomes the only variable that stands out in
importance.
4.3.3 Regularized Bagging
1) Comparing regularized and unregularized ensembles:
Figure 10. Lasso parameter vs. Resubstitution MSE Figure 11. Lasso parameter vs. number of learners with nonzero weights
(‘x’ denotes value at lambda = 0 & logarithmic scale, same for all five figures)
Figure 12. Lambda vs. MSE for resubstituion and cross-validation Figure 13. Lambda vs. Number of learners for resubstituion and cv
From Figure 11, we can see that the number of learners has been reduced by over one third for
the regularized ensemble. Because the resubstitution MSE values are likely to be overly
optimistic, we cross-validate the ensemble over different values of lambda.
Figure 14. Number of trees
The cross-validated error in Figure 12 shows that the cross-validation MSE is almost flat for
lambda up to a bit over 10^3. With the regularization, there are only 42 trees in the new ensemble,
notably reduced from the 200 in the unregularized ensemble. The reduced ensemble is about
19.8% the size of the original while giving lower loss.
2) Figure 15 suggests that our model has problems predicting higher counts, where all of the
residuals are biased in the same direction; this indicates an effect at high counts that the model
does not capture well.
Figure 15. Predicted values vs. Residuals
3) A simple chart shows the predicted versus actual counts for six months of data in 2011:
Figure 16. True Count vs. Regularized Bag Ensemble for days 16 to 19 of January to June in 2011
The blue line represents the actual counts and the red line the predicted counts, for days 16 to 19
of January through June 2011. As the graph illustrates, the model is not very effective at
capturing the peak values of the real data.
5 Neural Network (NN)
According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60), “a
neural network is a system composed of many simple processing elements operating in parallel
whose function is determined by network structure, connection strengths, and the processing
performed at computing elements or nodes.” Generally, a neural network consists of many
processing units connected by communication channels that carry numeric data. The processing
units operate only on their local data and on the inputs they receive via the connections.
5.1 Method Description
In order to fit a neural network to the bike sharing demand data, parameters to configure include
the type of neural network, the number of layers for the neural network, the number of neurons
in each layer, the transfer functions between layers, the performance metric and the training
function. After multiple attempts, the best network structure is a cascade-forward network with
three hidden layers of 10, 15, and 10 neurons. The transfer functions are tangent sigmoid for the
three hidden layers and linear for the output layer. The performance metric is mean squared error
and the training function is Bayesian regularization backpropagation, which updates the weight
and bias values according to Levenberg-Marquardt optimization to minimize a combination of
squared errors and weights.
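A sketch of this configuration with the Neural Network Toolbox (cascadeforwardnet already
defaults to tangent sigmoid hidden layers; the transfer functions are set explicitly here to match
the description, and the variable names are illustrative):

    % Cascade-forward network with hidden layers of 10, 15, and 10 neurons,
    % trained with Bayesian regularization backpropagation.
    net = cascadeforwardnet([10 15 10], 'trainbr');
    net.layers{1}.transferFcn = 'tansig';   % tangent sigmoid on the hidden layers
    net.layers{2}.transferFcn = 'tansig';
    net.layers{3}.transferFcn = 'tansig';   % the output layer keeps the default 'purelin'
    net = train(net, Xtrain', Ytrain');     % the toolbox expects one column per sample
    Ypred = net(Xtest')';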
5.2 Best Performance Metric
Mean Squared Error = 2582.7
Root Mean Squared Logarithmic Error for Validation data set = 0.79834
5.3 Result Analysis
Figure 17. Cascade-forward network with three layers each with neurons 10, 15 and 10
Figure 18. Epochs vs. MSE Figure 19. Target value vs. Output
Figure 17 displays a view of the cascade-forward network that generates the best result. In
Figure 18, the mean squared error for the training data kept dropping while the MSE for the test
data stopped improving after around 34 epochs, which indicates the model started to over-fit the
data beyond that point, so training was stopped there. The R-squared value in Figure 19 indicates
that the model is a good fit.
6 Local Linear Regression (LLR)
For a regression model that is highly nonlinear and of unknown structure, local linear regression
may be applied. Local linear regression performs weighted local averaging, with the weights
determined by a kernel function. Within a radius of bandwidth h around an observation x0 in the
training data, a new input x uses the parameters fitted at x0 to generate its response, where for
each x0 the parameters β are determined locally by weighted ordinary least squares optimization.
6.1 Method Description
The local parameters at x_0 are obtained by weighted ordinary least squares,

$$(\hat{\beta}_0, \hat{\beta}) = \operatorname*{arg\,min}_{\beta_0,\,\beta}\;(Y - \beta_0 \mathbf{1}_n - X_0\beta)'\,W_h\,(Y - \beta_0 \mathbf{1}_n - X_0\beta)$$

where

$$W_h = \operatorname{diag}\!\left(K\!\left(\frac{\|x_1 - x_0\|}{h}\right), \ldots, K\!\left(\frac{\|x_n - x_0\|}{h}\right)\right), \qquad X_0 = X - \mathbf{1}_n x_0'$$

and X and Y are defined as in the previous description of ordinary least squares regression. For
the bike sharing demand data, bandwidths h = 0.5, 5, 10, and 30 are each applied, with a Gaussian
kernel function K.
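A self-contained Matlab sketch of the local fit at a single query point x0 (a row vector),
following the definitions above; the function name llrPredict is illustrative, and predicting a full
test set would apply it to each test input in turn:

    % Local linear regression prediction at query point x0 with bandwidth h.
    function yhat = llrPredict(X, Y, x0, h)
        n = size(X, 1);
        d = sqrt(sum((X - repmat(x0, n, 1)).^2, 2));  % distances ||x_i - x0||
        w = exp(-0.5 * (d / h).^2);                   % Gaussian kernel weights
        Z = [ones(n, 1), X - repmat(x0, n, 1)];       % intercept plus centered inputs X0
        b = (Z' * diag(w) * Z) \ (Z' * diag(w) * Y);  % weighted least squares solution
        yhat = b(1);                                  % the local fit evaluated at x0
    end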
6.2 Best Performance Metric
Mean Squared Error > 10000
6.3 Result Analysis
The high mean squared errors for all bandwidths h indicate that local linear regression might not
be an ideal method for predicting the bike sharing demand, although further tuning of the
bandwidth h may leave some room to improve the performance metric.
7 Conclusion
To conclude, among all the methods, ensemble learning and the neural network generate the best
mean squared errors. For these two methods, cross-validation is performed to ensure model
quality, and multiple attempts are necessary to find the best fit. For ensemble learning,
regularizing the bagged ensemble significantly improved the performance metric. For the neural
network, architectures with one to four layers and a range of neuron counts were tested. As
further steps, trying more tuning parameters for the current methods might improve the
predictions, and deeper analysis of the data and its features could identify any missing
relationships. For each method, Matlab code is provided in the Appendix for further explanation.
References
"Neural Network Toolbox." Http://www.mathworks.com/help/nnet/. Web.
S, Warren, and Cary Sarle, Cary. Ftp://ftp.sas.com/pub/neural/FAQ.html. Web.
Appendix
(i) Predictor Variables
Categorical Variables:
·Year: 2011 to 2012
·Month: January to December
·Hour: 01 to 24 hours
·Weekday: Sunday to Saturday
·Season: spring to winter
·Holiday: whether the day is considered a holiday
·Working day: whether the day is neither a weekend nor a holiday
·Weather:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
Numerical Variables:
·temperature in Celsius
· “feels like” temperature in Celsius
·relative humidity
·wind speed
(ii) Response Variables
·number of non-registered user rentals initiated
·number of registered user rentals initiated
·number of total rentals
(iii) Matlab code
The Matlab code for each method performed above is attached in a separate file named
“matlabcode_IEOR265paper_MinchaoLin.html”