NYC Open Data Project II -- Predict Where to Get and Return My Citibike

1. Citibike data and prediction
Which station should I choose?
Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)
Check out our blog and newsletters at http://www.supstat.com and http://nycdatascience.com
2. Overview
1. Overview
2. Citibike Data
3. Scraping
4. Data Description
5. Modeling
3. Citibike
Citibike runs a public bike-share service with many stations across NYC.
Riders want to pick up a bike at a station with at least one available bike, and return it to a station near their destination with at least one available slot.
Our goal is to predict where to rent and where to return.
4. Citibike
5. Citibike data
Where are the data sets?
Citibike does a great job of opening its data. It publishes historical trip data sets, but that is not what we are looking for here.
6. Citibike data
Where can we find, for each station, the current bikes and slots?
We can visit http://citibikenyc.com/stations/json to see the current data.
With historical data, we want to make predictions and guide people toward a better choice.
7. Historical data
We want to scrape data from this endpoint every 5 minutes.
How do we do that in R?
8. Data scraping
We use the following code. executionTime is the time at which we get the data.

require(rjson)
jsonURL = "http://citibikenyc.com/stations/json"
json_data = fromJSON(file = jsonURL)
names(json_data)
## [1] "executionTime"   "stationBeanList"
json_data$executionTime
## [1] "2014-04-24 11:11:03 AM"
9. Data scraping
Our data comes back as a list. We want to turn it into a data.frame.
What can we get from this data?

names(json_data$stationBeanList[[1]])
##  [1] "id"                    "stationName"
##  [3] "availableDocks"        "totalDocks"
##  [5] "latitude"              "longitude"
##  [7] "statusValue"           "statusKey"
##  [9] "availableBikes"        "stAddress1"
## [11] "stAddress2"            "city"
## [13] "postalCode"            "location"
## [15] "altitude"              "testStation"
## [17] "lastCommunicationTime" "landMark"
10. Data scraping
We just need id, availableDocks, availableBikes, and executionTime.

executionTime = json_data$executionTime
ids = sapply(json_data$stationBeanList, function(x) x$id)
free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
data = data.frame(time = executionTime, station_id = ids, free = free, bikes = bikes)
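Each 5-minute snapshot has to be persisted somewhere. The deck later mentions a writeDB.R step and queries a PostgreSQL table named citibike, so a minimal sketch of that step might look like the following, assuming the connection settings shown on the PostgreSQL slides and a citibike table whose columns match this data.frame:

require(RPostgreSQL)
conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
                 dbname = "station_all", host = "127.0.0.1", port = "5432")
# append this snapshot to the citibike table (assumed to exist with matching columns)
dbWriteTable(conn, "citibike", data, append = TRUE, row.names = FALSE)
dbDisconnect(conn)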
11. Data scraping
And we get something like this:

head(data)
##                      time station_id free bikes
## 1 2014-04-24 11:11:03 AM          72   19    18
## 2 2014-04-24 11:11:03 AM          79   13    15
## 3 2014-04-24 11:11:03 AM          82   10    17
## 4 2014-04-24 11:11:03 AM          83   44    17
## 5 2014-04-24 11:11:03 AM         116    8    30
## 6 2014-04-24 11:11:03 AM         119   16     2
12. CRON
We use cron to schedule our tasks, including our web scraper.
Cron logging is off by default. First open the rsyslog config and delete the '#' before '#cron.*':

sudo vi /etc/rsyslog.d/50-default.conf

Then restart rsyslog:

sudo service rsyslog restart

Now cron logging is enabled. Use this to check the cron log:

sudo vi /var/log/cron.log
13. CRON
Then we can restart the cron service:

sudo service cron restart

If the following command returns a pid, our cron service is running:

pgrep cron

Or you can use this alternative command:

ps aux | grep 'cron'
14. CRON
The simplest way to add tasks is to create a .sh script.
For example, we create a shell script named "citibike.sh". It is preferred to use absolute paths:

/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R
15. CRON
The final step is to add our script to the list of cron tasks:

sudo vi /etc/crontab

We add the following line to the end of crontab, then restart cron to apply the change:

*/5 * * * * root /home/vivianzhang/citibike/citibike.sh

Here the first field "*/5" means run every 5 minutes. The next four fields are hour, day of month, month, and day of week, and the rest is the command to run.
16. CRON
Other examples of cron tasks:

0 23-7/2,8 * * * echo "Have a good dream:)" >> /tmp/test.txt

Minute 0, from 23:00 to 7:00 every 2 hours, plus the "," adds 8:00 — so this task prints a sentence into /tmp/test.txt at 23:00, 1:00, 3:00, 5:00, 7:00 and 8:00.
What if we want to schedule a job every 30 minutes?

0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...
17. CRONTAB
On an Apple Mac machine, we use crontab.
1. Create a file, or open an existing one, to hold your task description, such as 'crontest'.
2. Edit your tasks as described previously.
3. Start crontab and list the running tasks.
4. Check whether it runs correctly.
5. You can remove all the cron tasks once you are done.
18. CRONTAB

# make a new crontab file
sudo touch /etc/crontest
# edit the file
sudo vi /etc/crontest
# content of the file: run the job every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt
# load the job into your cron task list
crontab /etc/crontest
# check the crontab list
crontab -l
# check whether the output is written to your temp file
vi /tmp/test.txt
19. CRONTAB

# you should see a few lines in the file
# remove the cron job
crontab -r
# double-check that the job is removed
crontab -l
20. PostgreSQL
We choose PostgreSQL as the database; it is open source and R-friendly.
We can easily connect to it with a command like this:

require(RPostgreSQL)
conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
                 dbname = "station_all", host = "127.0.0.1", port = "5432")
21. PostgreSQL
Our server has only 1 GB of memory, so we can't fetch too many records at once; 10,000 records per fetch is okay.
The following code extracts the first 100 records of the query result, and the second fetch pulls the 101st through 10,000th records:

res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)
data2 <- fetch(res, n = -1)
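The same idea extends to the whole table: a small sketch (not from the slides) that streams the result in 10,000-row chunks, so no more than one chunk is held in memory at a time:

res <- dbSendQuery(conn, "SELECT * FROM citibike")
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 10000)
  # process or aggregate each chunk here before moving on
}
dbClearResult(res)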
22. PostgreSQL
The table may be larger than the available memory. An alternative is to work with PostgreSQL directly and copy the table to a local file.
First we need a valid database user. To use the default PostgreSQL user:

sudo su - postgres
psql
\c station_all

Then, in the interactive interface, use the following SQL command to export the table:

COPY (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER;
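If needed, the exported file can then be pulled back into R; a one-line sketch using the path from the COPY command above:

citibike_all <- read.csv("/tmp/data.csv", stringsAsFactors = FALSE)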
23. Data preprocessing
It is easy to handle the date column with the following code:

dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")

Our data is clean, and the useful information includes:
· time
· available bikes
· available spots
24. Data preprocessing
We extract the data for a single station and name it "data_all". This is what we are going to use. Let us explore the first 10,000 records.

load("data_all.rda")
head(data_all)
##          station_time bikes free
## 1 2013-08-21 14:10:00     1   37
## 2 2013-08-21 14:15:00     2   36
## 3 2013-08-21 14:20:00     2   36
## 4 2013-08-21 14:25:00     2   36
## 5 2013-08-21 14:30:00     2   36
## 6 2013-08-21 14:35:00     3   35
data = data_all[1:10000, ]
25. Time Series Model
We would like to predict the ratio of available bikes at this station.

data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
head(data)
##          station_time bikes free total   ratio
## 1 2013-08-21 14:10:00     1   37    38 0.02632
## 2 2013-08-21 14:15:00     2   36    38 0.05263
## 3 2013-08-21 14:20:00     2   36    38 0.05263
## 4 2013-08-21 14:25:00     2   36    38 0.05263
## 5 2013-08-21 14:30:00     2   36    38 0.05263
## 6 2013-08-21 14:35:00     3   35    38 0.07895
26. Time Series Model
The time interval between our data points is 5 minutes. Let's check whether there are any trends:

five_day_ind = 1:(288 * 5)
plot(data$ratio[five_day_ind], type = "l")
27. Time Series Model
Then we turn it into a time series object with frequency = 288 and check the data. There is one NA value in our sequence.

data.ts <- ts(data$ratio, start = 1, frequency = 288)
sum(is.na(data.ts))
## [1] 1
28. Time Series Model
Use the following code to fill it with the previous value.

na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))
## [1] FALSE
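That works here because the single NA has a non-NA value right before it. For runs of consecutive NAs, a last-observation-carried-forward fill is more robust; a small sketch, assuming the zoo package is available:

library(zoo)
# carry the last observed ratio forward over any run of NAs
data.ts <- ts(na.locf(as.numeric(data.ts), na.rm = FALSE), start = 1, frequency = 288)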
29. Time Series Model
The "seasonal" pattern is obvious, and we need to make use of this information.
stl smooths the series, extracts the seasonal pattern, and lets us focus on the higher-level trend.

fit <- stl(data.ts, "periodic")
colnames(fit$time.series)
## [1] "seasonal"  "trend"     "remainder"
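As a quick sanity check (not on the slides), the whole decomposition can be plotted in one call; the standard stl plot stacks the data and its three components:

plot(fit)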
30. Time Series Model
The fitted result looks like:

head(fit$time.series)
##      seasonal  trend remainder
## [1,]  -0.2251 0.2772 -0.025791
## [2,]  -0.2133 0.2784 -0.012396
## [3,]  -0.2126 0.2795 -0.014250
## [4,]  -0.2156 0.2806 -0.012383
## [5,]  -0.2067 0.2817 -0.022373
## [6,]  -0.2089 0.2828  0.005042
31. Time Series Model
The black line is the original data, showing what percentage of bikes is available at each time point. The red line is the extracted seasonal effect.

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)
32. Time Series Model
The green line is the trend:

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
lines(fit$time.series[five_day_ind, 2], col = 3)
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)
33. Time Series Model
We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the combined trend-plus-seasonal effect; the remaining difference is the remainder.

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
34. Time Series Model
A typical Citibike trip is around 30 minutes, and a regular user pays additional charges for a journey over 30 minutes.
So we focus on predicting the next 30 minutes; since the data updates every 5 minutes, that means forecasting 6 data points.
35. Time Series Model
With the R package 'forecast', we can do time series prediction easily.

library(forecast)
# h is the number of periods to forecast
pred = as.numeric(forecast(fit, h = 6)$mean)
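To eyeball the result (also not on the slides), the forecast object can be plotted directly, with prediction intervals:

plot(forecast(fit, h = 6))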
36. Machine Learning Model
Machine learning can also be applied to time series data. Here we use GBM for demonstration.
Before we apply gbm to our data, we need to extract some time-related features. In particular, we need to use previous values as predictors.
37. Feature extraction

traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
head(traindata)
##                  time       y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895
38. Feature extraction
Time points at which to make predictions:

h = 6
new_time <- seq(from = traindata$time[nrow(traindata)], by = '5 min', length.out = h + 1)[-1]
new_time
## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"
39. Feature extraction
Let's combine our train and test data before building further features.

test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
test_id
## [1] 2001 2002 2003 2004 2005 2006
40. Feature extraction
The service may be more popular on weekends than on weekdays, so we need a variable to mark the day of the week.

traindata$weekday <- as.factor(weekdays(traindata$time))
head(traindata$weekday)
## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
41. Feature extraction
The time stamp is useful:

hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
traindata$time_hms <- hh + 60 * mm + 3600 * ss
head(traindata)
##                  time       y   weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday      614
## 2 2013-08-21 14:15:00 0.05263 Wednesday      914
## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214
## 4 2013-08-21 14:25:00 0.05263 Wednesday     1514
## 5 2013-08-21 14:30:00 0.05263 Wednesday     1814
## 6 2013-08-21 14:35:00 0.07895 Wednesday     2114
42. Feature extraction
How do we bring in previous values? We compute lagged time series. A lagged time series is a "delayed" copy of the series, as shown below:

f_lag <- function(x, lag = 0) c(rep(NA, lag), x[1:(length(x) - lag)])
f_lag(1:10, 1)
##  [1] NA  1  2  3  4  5  6  7  8  9
f_lag(1:10, 4)
##  [1] NA NA NA NA  1  2  3  4  5  6
43. Feature extraction
To use the value from 12:30 when predicting at 12:40, we add lagged versions of y as columns:

for (lag in 1:12) {
    traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]
##                  time       y   weekday time_hms   lag_1   lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday      614      NA      NA    NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday      914 0.02632      NA    NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214 0.05263 0.02632    NA
##   lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA    NA    NA    NA     NA     NA     NA
44. Feature extraction
Don't worry about those NAs! They are inevitable in a lagged series.

traindata[1:10, 5:7]
##      lag_1   lag_2   lag_3
## 1       NA      NA      NA
## 2  0.02632      NA      NA
## 3  0.05263 0.02632      NA
## 4  0.05263 0.05263 0.02632
## 5  0.05263 0.05263 0.05263
## 6  0.05263 0.05263 0.05263
## 7  0.07895 0.05263 0.05263
## 8  0.05263 0.07895 0.05263
## 9  0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263
45. Feature extraction
Finally, we have our data:

test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
head(train)
##         y   weekday time_hms   lag_1   lag_2   lag_3   lag_4   lag_5 lag_6
## 1 0.02632 Wednesday      614      NA      NA      NA      NA      NA    NA
## 2 0.05263 Wednesday      914 0.02632      NA      NA      NA      NA    NA
## 3 0.05263 Wednesday     1214 0.05263 0.02632      NA      NA      NA    NA
## 4 0.05263 Wednesday     1514 0.05263 0.05263 0.02632      NA      NA    NA
## 5 0.05263 Wednesday     1814 0.05263 0.05263 0.05263 0.02632      NA    NA
## 6 0.07895 Wednesday     2114 0.05263 0.05263 0.05263 0.05263 0.02632    NA
##   lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA     NA     NA     NA
## 4    NA    NA    NA     NA     NA     NA
## 5    NA    NA    NA     NA     NA     NA
## 6    NA    NA    NA     NA     NA     NA
46. Machine Learning Model
Now we can use gbm to make predictions.
Wait, what is gbm?
47. Machine Learning Model
gbm refers to a supervised learning algorithm that goes by several names:
· In the original publication, "gbm" is short for "Gradient Boosting Machine".
· In the R package, it is short for "Generalized Boosting Model".
· Its Wikipedia page calls it "Gradient boosting".
48. Machine Learning Model
gbm is built on a relatively simple principle; briefly, "hundreds of heads are better than one".
The algorithm generates many regression trees and combines their results into the final model.
49. Machine Learning Model
With the following code we can fit the model. Here n.trees is the number of "heads" (trees) for this problem.

library(gbm)
model <- gbm(formula = y ~ .,
             data = train[c('y', 'weekday', 'time_hms', paste('lag_', 1:12, sep = ''))],
             distribution = 'gaussian',
             n.trees = 2000,
             interaction.depth = 5,
             shrinkage = 0.01,
             cv.folds = 0,
             keep.data = F)
50. Machine Learning Model
Using too many trees at prediction time can cause overfitting, so we need to validate the number of trees instead of simply using all of them.
gbm provides a convenient tool for this; here OOB means "Out Of Bag":

best_ntree <- gbm.perf(model, method = "OOB")
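OOB is fast but, per the gbm documentation, tends to underestimate the best iteration count. An alternative sketch (not from the slides) lets gbm run 5-fold cross-validation internally and reads the tree count off the CV curve, at the cost of refitting the model several times; model_cv and best_ntree_cv are names introduced here:

model_cv <- gbm(formula = y ~ .,
                data = train[c('y', 'weekday', 'time_hms', paste('lag_', 1:12, sep = ''))],
                distribution = 'gaussian', n.trees = 2000, interaction.depth = 5,
                shrinkage = 0.01, cv.folds = 5, keep.data = F)
best_ntree_cv <- gbm.perf(model_cv, method = "cv")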
51. Machine Learning Model
Then we can make the prediction:

best_ntree
## [1] 539
predict(model, as.data.frame(test[1, , drop = F]), n.trees = best_ntree, type = 'response')
## [1] 0.1287
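The slide predicts only the first of the six test rows; for the later rows, the most recent lag columns are still NA because they are exactly the values being forecast. One way to handle this, sketched below and not taken from the slides, is to forecast recursively and feed each prediction back into the lag features of the following rows (test_filled and preds are names introduced here):

preds <- numeric(h)
test_filled <- test
for (i in seq_len(h)) {
  preds[i] <- predict(model, as.data.frame(test_filled[i, , drop = F]),
                      n.trees = best_ntree, type = "response")
  # the value predicted for step i becomes lag_k of step i + k
  for (k in seq_len(h - i)) {
    test_filled[[paste("lag_", k, sep = "")]][i + k] <- preds[i]
  }
}
preds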
52. Performance testing
How do we compare these two models? We set up a test.
Each day gives us 288 data points, and we predict the next 6 points using data from the previous week, i.e. 2016 data points.
We randomly choose 50 time points, make predictions for the next 30 minutes, and compare performance with RMSE:

rmse = function(pred, real) sqrt(mean((pred - real)^2))
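The slides report only the resulting error vectors (stl_precision, gbm_precision, y_precision), not the harness itself. A minimal sketch of how such a backtest could be organized for the stl/forecast model under the setup just described (50 random cut points, a one-week window of 2016 points, a 6-step forecast, RMSE per 5-minute horizon); data_all is from the preprocessing slides, and the *_by_horizon names are introduced here:

library(forecast)
set.seed(1)
ratio_all <- data_all$bikes / (data_all$bikes + data_all$free)
cuts <- sample(2016:(length(ratio_all) - 6), 50)     # last index of each training window
err <- matrix(NA, nrow = length(cuts), ncol = 6)
for (j in seq_along(cuts)) {
  y.ts <- ts(ratio_all[(cuts[j] - 2015):cuts[j]], frequency = 288)
  na.pos <- which(is.na(y.ts))
  na.pos <- na.pos[na.pos > 1]
  y.ts[na.pos] <- y.ts[na.pos - 1]                   # same fill as in the earlier NA step
  fit_j <- stl(y.ts, "periodic")
  pred_j <- as.numeric(forecast(fit_j, h = 6)$mean)
  err[j, ] <- pred_j - ratio_all[cuts[j] + 1:6]
}
stl_rmse_by_horizon <- sqrt(colMeans(err^2, na.rm = TRUE))   # comparable to stl_precision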
53. Performance testing
Here is the result. We can see that gbm is slightly better than the time series prediction.

stl_precision
## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698
gbm_precision
## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258
54. Performance testing
However, our performance is still not ideal.
Consider the most straightforward prediction: assume the data stays unchanged for the next 30 minutes. How does that do, and why is this happening?

y_precision
## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311
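For completeness, the "stay the same" baseline needs only the last observed ratio; a sketch that reuses cuts and ratio_all from the harness above (again, not the slides' exact code):

err_naive <- matrix(NA, nrow = length(cuts), ncol = 6)
for (j in seq_along(cuts)) {
  # the forecast for every horizon is just the last observed ratio
  err_naive[j, ] <- ratio_all[cuts[j]] - ratio_all[cuts[j] + 1:6]
}
naive_rmse_by_horizon <- sqrt(colMeans(err_naive^2, na.rm = TRUE))   # comparable to y_precision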
55. Performance testing
This picture gives some hints:

plot(diff(data.ts), type = "l")
56. Performance testing
We can see that the data tends to stay the same over the next 5 minutes, or even longer.
In many 5-minute intervals nobody visits this station at all, which is why the most straightforward prediction outperforms the two more advanced methods.

sum(diff(data.ts) == 0)
## [1] 6622
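To put the count in perspective, 6,622 unchanged steps out of the 9,999 differences is roughly two-thirds of all 5-minute intervals; one line confirms the proportion:

mean(diff(data.ts) == 0)   # proportion of 5-minute steps with no change, about 0.66 here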
57. More to do
There are many things to do in the future:
· Apply other algorithms to this problem, such as neural networks.
· Use information from nearby stations: empty nearby stations will push people toward this one.
· Combine with weather records: nobody rides on a rainy day!
· Path finding: plan the whole trip for people.
The sky is the limit!
58. Our Packages
We are developing an R package for Citibike, including:
· Data scraping
· Database interaction and retrieval
· Time series prediction
· GBM prediction
There was also an app written in Ruby on Rails offering our prediction service. Our Heroku instance went to sleep since the service didn't get much traffic, but one of our meetup members spent some time to bring it back up today and emailed me the link!