Forecasting analysis on us flights v1


This is a final project done for Edureka on time series and forecasting analysis in R. The data consists of US domestic flight details from 1990 to 2009, downloaded from infochimps.com. The data set contains more than 3.5 million records and is read using the RevoScaleR package. The data is cleansed, analysed and broken into smaller subsets, which are then used to forecast flight activity for January 2010 to June 2011.

Published in: Technology, Business

Transcript of "Forecasting analysis on us flights v1"

  1. 1. FORECASTING PROJECT ON US DOMESTIC FLIGHTS (In Revolution Analytics) Prepared By: Wyendrila Roy http://in.linkedin.com/pub/wyendrila-roy/5/3a/876
  2. 2. Acknowledgement
     This project was done as the final project of the course titled “Business Analytics with R”. I am very thankful to our course instructor, Mr. Ajay Ohri, Founder, DecisionStats, for giving me the opportunity to do a project in Time Series Analysis using R and for providing the support and guidance that allowed me to complete the project on time. I am extremely grateful to him for providing the big data set and the links needed to start the project and to understand Time Series Analysis.
     In this project I have chosen the topic “Forecasting on US Domestic Flights”, where I have analyzed the flight activity at the top domestic airports of the US and then presented a prediction of the same for January 2010 to June 2011. Due to the size of the data set, this project was done in Revolution Analytics. I am grateful for the resourceful articles and publications provided by Revolution Analytics, which helped me understand the tool as well as the topic. I would also like to extend my sincere regards to the support team of Edureka for their constant and timely support.
  3. 3. Table of Contents
     Methodology  4
        1. Overview  4
        2. Data Source  4
        3. Limitations  4
        4. Tool/Package Used  4
        5. File Format Used  4
     The Analysis  5
        1. Importing the Data  5
        2. Exploring the Data  5
        3. Aggregating the Data  8
        4. Building the Time Series  9
        5. Predict Future Values based on the Time Series Analysis  15
     Conclusion  28
     References  28
  4. 4. Methodology
     1. Overview
     In this report we analyze time series data in R. We use the “data step” functions in Revolution Analytics’ RevoScaleR package to access a large data file, manipulate it, sort it, extract the data we need, and aggregate the records with monthly time stamps into multiple monthly time series. We then use ordinary R time series functions for some basic analysis, followed by forecasting functions to predict domestic flight activity at the top US airports for the period January 2010 to June 2011.
     2. Data Source
     The data set used in this report is the airlines “edge” flight data set (77,242 KB) from infochimps.com. It contains 3.5 million monthly domestic flight records from 1990 to 2009.
     3. Limitations
     The main difficulty was extracting the time series from time-stamped data embedded in this very large data set. Data sets of this kind are too large to be read into memory and processed with standard R functions.
     4. Tool/Package Used
     This report uses Revolution Analytics’ add-on package RevoScaleR™, which is designed for high-performance statistical analysis of large data sets in the R environment. According to Revolution Analytics, the package can process, visualize and model very large data sets in a fraction of the time of legacy systems, without the need to deploy expensive or specialized hardware.
     5. File Format Used
     RevoScaleR provides a data file type with the extension .xdf that has been optimized for “data chunking”, that is, accessing parts of an xdf file for independent processing. Xdf files store data in a binary format. The format provides very fast access to a specified set of rows for a specified set of columns, and new rows and columns can be added without rewriting the entire file. RevoScaleR also provides an R class, RxDataSource, designed to support the use of external memory algorithms with .xdf files.
  5. 5. The Analysis
     1. Importing the Data
     We use the RevoScaleR function to read the text file into the special xdf binary format used by RevoScaleR functions:
     > rxGetInfoXdf("Flights",getVarInfo=TRUE)
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\Flights.xdf
     Number of observations: 3606803
     Number of variables: 5
     Number of blocks: 8
     Variable information:
     Var 1: origin_airport
            683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
     Var 2: destin_airport
            708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
     Var 3: passengers, Type: integer, Low/High: (0, 89597)
     Var 4: flights, Type: integer, Low/High: (0, 1128)
     Var 5: month, Type: character
     > rxGetInfoXdf(file="Flights",numRows=10,startRow=1)
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\Flights.xdf
     Number of observations: 3606803
     Number of variables: 5
     Number of blocks: 8
     Data (10 rows starting with row 1):
        origin_airport destin_airport passengers flights  month
     1             MHK            AMW         21       1 200810
     2             EUG            RDM         41      22 199011
     3             EUG            RDM         88      19 199012
     4             EUG            RDM         11       4 199010
     5             MFR            RDM          0       1 199002
     6             MFR            RDM         11       1 199003
     7             MFR            RDM          2       4 199001
     8             MFR            RDM          7       1 199009
     9             MFR            RDM          7       2 199011
     10            SEA            RDM          8       1 199002
     2. Exploring the Data
     Now we will sort the file by flights to find the origin/destination pairs that have the most monthly flights, and pick out the two top origin airports.
  6. 6. > rxSort(inData="Flights", outFile = "sortFlights", sortByVars="flights",
     +        decreasing = TRUE,overwrite=TRUE)
     > rxGetInfoXdf(file="sortFlights")
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\sortFlights.xdf
     Number of observations: 3606803
     Number of variables: 5
     Number of blocks: 8
     > mostflights5 <- rxGetInfoXdf(file="sortFlights",numRows=5,startRow=1)
     > mostflights5
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\sortFlights.xdf
     Number of observations: 3606803
     Number of variables: 5
     Number of blocks: 8
     Data (5 rows starting with row 1):
       origin_airport destin_airport passengers flights  month
     1            SFO            LAX      83153    1128 199412
     2            LAX            SFO      80450    1126 199412
     3            HNL            OGG      73014    1058 199408
     4            OGG            HNL      77011    1056 199408
     5            OGG            HNL      63020    1044 199412
     > top5f <- as.data.frame(mostflights5[[5]])
     > topOA <- unique(as.vector(top5f$origin_airport))
     > # Select the top 2
     > top2 <- topOA[1:2]
     > top2
     [1] "SFO" "LAX"
     From the above output we can see that the two origin airports with the most flights are San Francisco International (SFO) and Los Angeles International (LAX).
     Next, we use the RevoScaleR function rxDataStep to build a new file “mostFlights” containing only those flights that originate in either SFO or LAX.
     > rxGetInfoXdf("mostFlights",numRows=10,startRow=1)
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\mostFlights.xdf
     Number of observations: 144505
     Number of variables: 6
     Number of blocks: 8
     Data (10 rows starting with row 1):
  7. 7.
        origin_airport destin_airport passengers flights  month origin
     1             SFO            RDM       1413      92 199003    SFO
     2             SFO            RDM       1394      88 199006    SFO
     3             SFO            RDM        922      86 199001    SFO
     4             SFO            RDM       1661      93 199008    SFO
     5             SFO            RDM       1093      88 199005    SFO
     6             SFO            RDM        995      79 199011    SFO
     7             SFO            RDM       1080      83 199004    SFO
     8             SFO            RDM       1279      78 199012    SFO
     9             SFO            RDM       1080      83 199002    SFO
     10            SFO            RDM       1493      92 199007    SFO
     > rxGetInfoXdf("mostFlights",getVarInfo=TRUE)
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\mostFlights.xdf
     Number of observations: 144505
     Number of variables: 6
     Number of blocks: 8
     Variable information:
     Var 1: origin_airport
            683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
     Var 2: destin_airport
            708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
     Var 3: passengers, Type: integer, Low/High: (0, 83153)
     Var 4: flights, Type: integer, Low/High: (0, 1128)
     Var 5: month, Type: character
     Var 6: origin
            2 factor levels: SFO LAX
  8. 8. > rxHistogram(~flights|origin, data="mostFlights")
     The transformation function, xform, used in rxDataStep creates a new variable, origin, with only two levels (“SFO” and “LAX”) to hold the information on origin airports. The last line of code in this section produces a histogram of monthly flights for each origin.
     3. Aggregating the Data
     Now we will break the month variable (which we originally imported as character data) into a month and a year component in order to proceed with our Time Series Analysis.
     > xfunc = function(data){
     +   data$Month = as.integer(substring(data$month,5,6))
     +   data$Year = as.integer(substring(data$month,1,4))
     +   return(data)}
     > # Add a new variable for time series work
     > rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX",
     +               overwrite = TRUE, transformVars="month",transformFunc = xfunc)
     > rxGetInfoXdf(file="SFO_LAX", numRows=10,startRow=1)
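The string handling inside xfunc can be tried on a few values in plain R before it is applied to the xdf file. The snippet below is a minimal sketch on made-up month stamps in the same "YYYYMM" layout; the names month and parts are illustrative and do not appear in the report.

```r
# Toy month stamps in the same "YYYYMM" character layout as the flights file
month <- c("199003", "199006", "200810")

# Characters 1-4 hold the year, characters 5-6 hold the month
parts <- data.frame(
  Year  = as.integer(substring(month, 1, 4)),
  Month = as.integer(substring(month, 5, 6))
)
print(parts)
```

Because substring() is vectorised, the same two expressions work unchanged on each chunk of rows that a data step passes to the transformation function.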
  9. 9. > rxDataStepXdf(inFile="SFO_LAX", outFile = "SFO.LAX",
     +               varsToDrop=c("origin_airport","month"),
     +               overwrite=TRUE)
     > rxGetInfoXdf(file="SFO.LAX",numRows=10,startRow=1)
     File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\SFO.LAX.xdf
     Number of observations: 144505
     Number of variables: 6
     Number of blocks: 8
     Data (10 rows starting with row 1):
        destin_airport passengers flights origin Month Year
     1             RDM       1413      92    SFO     3 1990
     2             RDM       1394      88    SFO     6 1990
     3             RDM        922      86    SFO     1 1990
     4             RDM       1661      93    SFO     8 1990
     5             RDM       1093      88    SFO     5 1990
     6             RDM        995      79    SFO    11 1990
     7             RDM       1080      83    SFO     4 1990
     8             RDM       1279      78    SFO    12 1990
     9             RDM       1080      83    SFO     2 1990
     10            RDM       1493      92    SFO     7 1990
     The transformation function, xfunc, used in rxDataStepXdf uses ordinary R string handling functions to break apart the month data. A second data step drops the unnecessary variables from our final file, SFO.LAX.
     4. Building the Time Series
     The function rxCube computes the average number of flights and the number of records for each combination of Year, Month and origin airport.
     > xfunc <- function(data){
     +   data$Month <- as.integer(substring(data$month,5,6))
     +   data$Year <- as.integer(substring(data$month,1,4))
     +   return(data)
     + }
     >
     > rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX",
     +               overwrite = TRUE, transformVars="month",transformFunc = xfunc)
     > rxGetInfoXdf(file="SFO_LAX",numRows=10,startRow=1)
     > t1 <- rxCube(flights ~ F(Year):F(Month):origin, removeZeroCounts=TRUE,data = "SFO_LAX")
     > t1 <- as.data.frame(t1)
  10. 10. > head(t1)
        F_Year F_Month origin  flights Counts
      1   1990       1    SFO 39.04225    284
      2   1991       1    SFO 38.42034    295
      3   1992       1    SFO 46.23954    263
      4   1993       1    SFO 44.39464    261
      5   1994       1    SFO 36.15417    240
      6   1995       1    SFO 45.76768    198
      From the above table, we see that there were 284 records where the originating airport was SFO for the first month of 1990. The average number of flights across these 284 records was 39.04225. Multiplying the average by the count gives the total number of flights for each month. The next bit of code does this and forms the time information into a proper date. Note that we have reduced the data sufficiently so that we are now working with an ordinary data frame, t1.
      Now we will compute the total flights out and combine month and year into a date:
      > t1$flights_out <- t1$flights*t1$Counts
      > names(t1) <- c("Year","Month","origin","avg.flights.per.destin","total.destin","flights.out")
      > t1$Date <- as.Date(as.character(paste(t1$Month,"- 28 -",t1$Year)),"%m - %d - %Y")
      > head(t1)
        Year Month origin avg.flights.per.destin total.destin flights.out       Date
      1 1990     1    SFO               39.04225          284       11088 1990-01-28
      2 1991     1    SFO               38.42034          295       11334 1991-01-28
      3 1992     1    SFO               46.23954          263       12161 1992-01-28
      4 1993     1    SFO               44.39464          261       11587 1993-01-28
      5 1994     1    SFO               36.15417          240        8677 1994-01-28
      6 1995     1    SFO               45.76768          198        9062 1995-01-28
      Now we extract the SFO data, sort it by date to form a time series, and plot it.
      > SFO.t1 <- t1[t1$origin=="SFO",]
      > SFO.t1 <- SFO.t1[order(SFO.t1$Date),]
      > x <- SFO.t1$Date
      > y <- SFO.t1$flights.out
      > library(ggplot2)
      > qplot(x,y, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of SFO")
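The step from per-group averages and counts to monthly totals can be checked on a small in-memory data frame. The sketch below is an assumption-laden stand-in: it uses base R's aggregate() and merge() rather than rxCube, and the data frame toy with its values is invented for illustration.

```r
# Invented records: one row per origin/destination pair and month
toy <- data.frame(
  origin  = c("SFO", "SFO", "SFO", "LAX"),
  Year    = c(1990L, 1990L, 1990L, 1990L),
  Month   = c(1L, 1L, 2L, 1L),
  flights = c(10, 20, 30, 40)
)

# Average flights and record count per Year/Month/origin cell
avg <- aggregate(flights ~ Year + Month + origin, data = toy, FUN = mean)
cnt <- aggregate(flights ~ Year + Month + origin, data = toy, FUN = length)
names(cnt)[names(cnt) == "flights"] <- "Counts"

# Join the two summaries on their shared keys, then rebuild the totals
cube <- merge(avg, cnt)
cube$flights.out <- cube$flights * cube$Counts

# Stamp each cell with the 28th of its month, as the report does
cube$Date <- as.Date(sprintf("%d-%02d-28", cube$Year, cube$Month))
print(cube)
```

On data that fits in memory, summing flights directly would give the same totals; the average-times-count route is shown only because it mirrors the columns that the cube returns.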
  11. 11. We use the R function, ts, to form the data into a time series object, and use the function stl to perform a seasonal decomposition.
  12. 12. > SFO.ts <- ts(y,start=x[1],freq=12)
      > sd.SFO <- stl(SFO.ts,s.window="periodic")
      > plot(sd.SFO)
      [Figure: seasonal decomposition of SFO.ts, with panels data, seasonal, trend and remainder plotted against time]
      In the above graph, the first panel reproduces the time series. The second panel shows the periodic, seasonal component. The third panel displays the trend and the fourth panel displays the residuals.
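The decomposition can be reproduced on synthetic data with the stats package alone. The sketch below builds an invented monthly series (the trend, seasonal amplitude and noise level are arbitrary assumptions) and passes the start as c(1990, 1). A Date passed as start is coerced to its underlying day count, which appears to be why the time axis in the decomposition plots runs from about 7335 to 7350 rather than from 1990 to 2010.

```r
set.seed(1)

# Invented monthly series: linear trend + yearly cycle + noise, 20 years long
n <- 240
t <- seq_len(n)
y <- 10000 + 5 * t + 800 * sin(2 * pi * t / 12) + rnorm(n, sd = 200)

# Start in January 1990 so the time axis is expressed in calendar years
y.ts <- ts(y, start = c(1990, 1), frequency = 12)

# Seasonal-trend decomposition with a fixed (periodic) seasonal pattern
dec <- stl(y.ts, s.window = "periodic")
print(head(dec$time.series))

# A Date is stored as days since 1970-01-01
day_count <- as.numeric(as.Date("1990-01-28"))
print(day_count)
```

The three columns of dec$time.series (seasonal, trend, remainder) add up to the original series, which is a quick sanity check on any decomposition.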
  13. 13. We may now repeat the above steps for the LAX data:
      > LAX.t1 <- t1[t1$origin=="LAX",]
      > LAX.t1 <- LAX.t1[order(LAX.t1$Date),]
      > a <- LAX.t1$Date
      > b <- LAX.t1$flights.out
      > qplot(a,b, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of LAX")
      [Figure: Monthly Flights Out of LAX; y-axis: Number of Flights, x-axis: 1990 to 2010]
  14. 14. > LAX.ts <- ts(b,start=a[1],freq=12)
      > sd.LAX <- stl(LAX.ts,s.window="periodic")
      > plot(sd.LAX)
      [Figure: seasonal decomposition of LAX.ts, with panels data, seasonal, trend and remainder plotted against time]
  15. 15. 5. Predict Future Values based on the Time Series Analysis
      We can now proceed with the forecasting analysis. From here on we work with the SFO time series and predict its values for the period January 2010 to June 2011, using Simple Exponential Smoothing as well as an ARIMA model.
      > SFO.ts = ts(y, start = c(1990), freq=12)
      > plot.ts(SFO.ts)
      [Figure: time plot of SFO.ts, 1990 to 2010]
      Simple Exponential Smoothing
      > fit <- HoltWinters(SFO.ts, beta=FALSE, gamma=FALSE)
      > fit
      Smoothing parameters:
       alpha: 0.6726511
       beta : FALSE
       gamma: FALSE
      Coefficients:
            [,1]
      a 10987.14
  16. 16. By default, HoltWinters() just makes forecasts for the same time period covered by our original time series. In this case, our original time series included the number of flights originating from SFO from 1990 to 2009, so the forecasts are also for 1990-2009. In the example above, we have stored the output of the HoltWinters() function in the list variable “fit”.
      > plot(fit)
      [Figure: Holt-Winters filtering; observed and fitted values, 1990 to 2010]
      The plot shows the original time series in black, and the forecasts in red. As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum of squared errors is stored in a named element of the list variable “fit” called “SSE”, so we can get its value by typing:
      > fit$SSE
      [1] 95039885
      That is, here the sum of squared errors is 95039885.
      As explained above, by default HoltWinters() just makes forecasts for the time period covered by the original data, which is 1990-2009 in this case. We can make forecasts for further time points by using the “forecast.HoltWinters()” function in the R “forecast” package.
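The in-sample fit, its sum of squared errors, and an out-of-sample forecast can all be obtained with the stats package that ships with R. The following is a minimal sketch on an invented random-walk series, not the SFO data; predict() with prediction.interval = TRUE is used here in the role that forecast.HoltWinters() plays in the report.

```r
set.seed(123)

# Invented monthly series without trend or seasonality (random walk around 10000)
y <- ts(10000 + cumsum(rnorm(240, sd = 300)), start = c(1990, 1), frequency = 12)

# Simple exponential smoothing: switch off the trend (beta) and seasonal (gamma) terms
fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)
print(fit$alpha)

# In-sample one-step-ahead errors and their sum of squares
errors <- y - fit$fitted[, "xhat"]
sse_by_hand <- sum(errors^2)
print(c(stored = fit$SSE, by_hand = sse_by_hand))

# 18 months beyond the end of the data, with 95% prediction bounds
pred <- predict(fit, n.ahead = 18, prediction.interval = TRUE, level = 0.95)
print(head(pred))
```

With beta and gamma switched off, every point forecast equals the last smoothed level (the coefficient a), which is why the point forecasts in a simple exponential smoothing table are constant while the prediction bounds widen.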
  17. 17. > Forecast <- forecast.HoltWinters(fit, h=18)
      > Forecast
               Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
      Jan 2010       10987.14 10177.296 11796.98 9748.591 12225.68
      Feb 2010       10987.14 10011.132 11963.14 9494.466 12479.81
      Mar 2010       10987.14  9869.403 12104.87 9277.711 12696.56
      Apr 2010       10987.14  9743.726 12230.55 9085.504 12888.77
      May 2010       10987.14  9629.634 12344.64 8911.015 13063.26
      Jun 2010       10987.14  9524.415 12449.86 8750.096 13224.18
      Jul 2010       10987.14  9426.272 12548.00 8600.000 13374.28
      Aug 2010       10987.14  9333.946 12640.33 8458.798 13515.48
      Sep 2010       10987.14  9246.509 12727.77 8325.076 13649.20
      Oct 2010       10987.14  9163.260 12811.02 8197.757 13776.52
      Nov 2010       10987.14  9083.648 12890.63 8076.001 13898.27
      Dec 2010       10987.14  9007.235 12967.04 7959.137 14015.14
      Jan 2011       10987.14  8933.663 13040.61 7846.619 14127.66
      Feb 2011       10987.14  8862.637 13111.64 7737.995 14236.28
      Mar 2011       10987.14  8793.911 13180.36 7632.886 14341.39
      Apr 2011       10987.14  8727.273 13247.00 7530.973 14443.30
      May 2011       10987.14  8662.545 13311.73 7431.980 14542.30
      Jun 2011       10987.14  8599.571 13374.70 7335.670 14638.61
      The forecast.HoltWinters() function gives the forecast for our 18-month period, an 80% prediction interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted value for Jan 2010 is about 10987.14, with a 95% prediction interval of (9748.591, 12225.68).
      To plot the predictions made by forecast.HoltWinters(), we can use the “plot.forecast()” function:
  18. 18. [Figure: Forecasts from HoltWinters, 1990 to 2010]
      Here the forecasts for January 2010 to June 2011 are plotted as a dark blue line, the 80% prediction interval as the blue shaded area, and the 95% prediction interval as a light blue shaded area.
      The ‘forecast errors’ are calculated as the observed values minus the predicted values, for each time point. We can only calculate the forecast errors for the time period covered by our original time series, which is 1990-2009 for the flight data. The in-sample forecast errors are stored in the named element “residuals” of the list variable returned by forecast.HoltWinters().
      We will now obtain a correlogram of the in-sample forecast errors for lags 1-20. We can calculate a correlogram of the forecast errors using the “acf()” function in R. To specify the maximum lag that we want to look at, we use the “lag.max” parameter in acf().
  19. 19. > acf(Forecast$residuals, lag.max=20)
      [Figure: correlogram of Forecast$residuals (ACF against lag)]
      To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a Ljung-Box test.
      > Box.test(Forecast$residuals, lag=20, type="Ljung-Box")
              Box-Ljung test
      data: Forecast$residuals
      X-squared = 370.1992, df = 20, p-value < 2.2e-16
      The p-value is far below 0.05, so there is strong evidence of non-zero autocorrelation in the in-sample forecast errors at lags 1-20. This suggests that simple exponential smoothing leaves structure in the series unexplained and that the model could be improved upon.
      It is also a good idea to check whether the forecast errors are normally distributed with mean zero and constant variance. To check whether the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors:
  20. 20. > plot.ts(Forecast$residuals)
      [Figure: time plot of Forecast$residuals, 1990 to 2010]
      The plot shows that the in-sample forecast errors seem to have roughly constant variance over time, although the size of the fluctuations at the start of the time series may be slightly smaller than at later dates. The fluctuations in the period 2000-2005 are quite high.
      To check whether the forecast errors are normally distributed with mean zero, we can plot a histogram of the forecast errors, with an overlaid normal curve that has mean zero and the same standard deviation as the distribution of forecast errors.
  21. 21. [Figure: Histogram of forecasterrors; density against forecast error, with an overlaid normal curve]
      The plot shows that the distribution of forecast errors is roughly centered on zero, and is more or less normally distributed, although it seems to be slightly skewed to the left compared to a normal curve. However, the left skew is relatively small, and so it is plausible that the forecast errors are normally distributed with mean zero.
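The three residual checks above (correlogram, Ljung-Box test, and comparison with a normal curve) can be sketched with base R alone. The example below runs on an invented series rather than the SFO data, and it computes the histogram and the normal density without drawing them, so the names h and normal_density are illustrative only.

```r
set.seed(7)

# Invented monthly series and a simple exponential smoothing fit
y <- ts(10000 + cumsum(rnorm(240, sd = 300)), start = c(1990, 1), frequency = 12)
fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)

# In-sample forecast errors: observed minus one-step-ahead fitted values
errors <- y - fit$fitted[, "xhat"]

# Autocorrelations of the errors for lags 0 to 20 (no plot)
ac <- acf(errors, lag.max = 20, plot = FALSE)

# Ljung-Box test of the joint hypothesis of zero autocorrelation at lags 1-20
lb <- Box.test(errors, lag = 20, type = "Ljung-Box")
print(lb)

# Histogram densities and the matching zero-mean normal density at the bin centres
h <- hist(as.numeric(errors), plot = FALSE)
normal_density <- dnorm(h$mids, mean = 0, sd = sd(errors))
print(round(cbind(mid = h$mids, observed = h$density, normal = normal_density), 6))
```

A small p-value from Box.test() indicates that autocorrelation remains in the errors; a large one is consistent with the errors behaving like white noise.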
  22. 22. Autoregressive Integrated Moving Average (ARIMA) models
      Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series that allows for non-zero autocorrelations in the irregular component.
      a. Differencing a Time Series
      ARIMA models are defined for stationary time series. Therefore, if you start off with a non-stationary time series, you will first need to ‘difference’ the time series until you obtain a stationary time series. If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used.
      > SFO_diff <- diff(SFO.ts, differences=1)
      > plot.ts(SFO_diff)
      [Figure: time plot of SFO_diff (first differences), 1990 to 2010]
      The resulting time series of first differences (above) does not appear to be stationary in mean. Therefore, we can difference the time series twice, to see if that gives us a stationary time series:
  23. 23. > SFO_diff_1 <- diff(SFO.ts, differences=2)
      > plot.ts(SFO_diff_1)
      [Figure: time plot of SFO_diff_1 (second differences), 1990 to 2010]
      The time series of second differences (above) does appear to be stationary in mean and variance, as the level of the series stays roughly constant over time, and the variance of the series appears roughly constant over time. Thus, it appears that we need to difference the time series of the ‘SFO flights’ twice in order to achieve a stationary series. This means that we can use an ARIMA(p,d,q) model for the above time series, where d (order of differencing) = 2, i.e., ARIMA(p,2,q). The next step is to figure out the values of p and q for the ARIMA model. To do this, we usually need to examine the correlogram and partial correlogram of the stationary time series.
      b. Autocorrelations
      > acf(SFO_diff_1, lag.max=20) # plot a correlogram
      > acf(SFO_diff_1, lag.max=20, plot=FALSE)
      Autocorrelations of series ‘SFO_diff_1’, by lag
      0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
       1.000 -0.709  0.264  0.084 -0.361  0.553 -0.652  0.529 -0.327  0.078  0.231
      0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
      -0.572  0.787 -0.621  0.286  0.035 -0.295  0.501 -0.623  0.504 -0.287
  24. 24. [Figure: correlogram of SFO_diff_1 (ACF against lag)]
      We see from the correlogram that the autocorrelations at lags 1, 2, and 3 exceed the significance bounds, but they are decreasing and near zero after lag 3, although there are other autocorrelations between lags 1-20 that exceed the significance bounds.
      c. Partial Autocorrelations
      > pacf(SFO_diff_1, lag.max=20) # plot a partial correlogram
      > pacf(SFO_diff_1, lag.max=20, plot=FALSE)
      Partial autocorrelations of series ‘SFO_diff_1’, by lag
      0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 0.9167
      -0.709 -0.481  0.055 -0.326  0.232 -0.357 -0.027 -0.401 -0.019 -0.053 -0.464
      1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
       0.130  0.096  0.045 -0.020  0.100  0.118 -0.013 -0.073  0.005
  25. 25. [Figure: partial correlogram of SFO_diff_1 (Partial ACF against lag)]
      The partial correlogram shows that the partial autocorrelations at lags 1 and 2 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag. The partial autocorrelations are near zero after lag 2. Taking p = 2 from the partial correlogram and q = 3 from the correlogram gives ARIMA(2,2,3) as a candidate model.
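The differencing and correlogram steps can be sketched on synthetic data with the stats package. The series below is invented (a doubly integrated noise series, so that two differences are needed), and the names d2, bound and lags_months are illustrative. Because the series has frequency 12, acf() and pacf() report lags in years, so a printed lag of 0.0833 is one month and 1.0000 is twelve months; the approximate 95% significance bound drawn on a correlogram is 1.96 divided by the square root of the series length.

```r
set.seed(99)

# Invented series that needs two rounds of differencing
y <- ts(cumsum(cumsum(rnorm(240))), start = c(1990, 1), frequency = 12)

# Second differences; equivalent to differencing the first differences again
d2 <- diff(y, differences = 2)

# Approximate 95% significance bound for sample autocorrelations
bound <- qnorm(0.975) / sqrt(length(d2))

# Autocorrelations and partial autocorrelations up to lag 20 (no plots)
ac <- acf(d2, lag.max = 20, plot = FALSE)
pc <- pacf(d2, lag.max = 20, plot = FALSE)

# Convert lags from years to months and list those outside the bound
lags_months <- round(ac$lag[, 1, 1] * frequency(d2))
sig_acf  <- lags_months[abs(ac$acf[, 1, 1]) > bound & lags_months > 0]
sig_pacf <- round(pc$lag[, 1, 1] * frequency(d2))[abs(pc$acf[, 1, 1]) > bound]

print(bound)
print(sig_acf)
print(sig_pacf)
```

Listing the lags that fall outside the bound, rather than reading them off the plot, makes it easier to justify the choice of p and q.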
  26. 26. Forecasting using an ARIMA model
      > SFO_arima <- arima(SFO.ts, order=c(2,2,3)) # fit an ARIMA(2,2,3) model
      > SFO_arima
      Series: SFO.ts
      ARIMA(2,2,3)
      Coefficients:
                ar1      ar2     ma1      ma2     ma3
            -1.7315  -0.9996  0.7477  -0.7477  -1.000
      s.e.   0.0014   0.0006  0.0211   0.0218   0.021
      sigma^2 estimated as 215044: log likelihood=-1806.58
      AIC=3625.17 AICc=3625.53 BIC=3646
      >
      > SFO_arimaforecasts <- forecast.Arima(SFO_arima, h=18)
      > SFO_arimaforecasts
               Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
      Jan 2010       11141.50 10543.492 11739.51 10226.924 12056.08
      Feb 2010       10705.97  9854.680 11557.25  9404.037 12007.89
      Mar 2010       11356.13 10317.891 12394.36  9768.281 12943.97
      Apr 2010       10666.23  9459.164 11873.29  8820.184 12512.27
      May 2010       11211.39  9863.664 12559.11  9150.221 13272.56
      Jun 2010       10957.55  9475.979 12439.12  8691.684 13223.41
      Jul 2010       10852.63  9249.182 12456.08  8400.366 13304.90
      Aug 2010       11288.52  9572.908 13004.13  8664.718 13912.32
      Sep 2010       10639.15  8812.668 12465.63  7845.787 13432.51
      Oct 2010       11328.32  9402.626 13254.00  8383.228 14273.40
      Nov 2010       10784.62  8758.111 12811.13  7685.342 13883.90
      Dec 2010       11037.64  8918.355 13156.93  7796.473 14278.81
      Jan 2011       11143.50  8933.374 13353.62  7763.406 14523.59
      Feb 2011       10707.79  8408.235 13007.34  7190.924 14224.65
      Mar 2011       11356.89  8974.390 13739.40  7713.169 15000.62
      Apr 2011       10668.99  8200.825 13137.16  6894.256 14443.73
      May 2011       11211.75  8664.892 13758.61  7316.668 15106.83
      Jun 2011       10960.08  8333.068 13587.08  6942.415 14977.74
  27. 27. > plot.forecast(SFO_arimaforecasts)
      [Figure: Forecasts from ARIMA(2,2,3), 1990 to 2010]
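Fitting an ARIMA model and producing an 18-month forecast with prediction bounds can be sketched with stats::arima() and predict(). The series below is invented and the order (1,1,0) is chosen to match how that series was generated; it is not the ARIMA(2,2,3) model fitted to the SFO data. The 95% bounds are built by hand from the forecast standard errors under a normality assumption.

```r
set.seed(42)

# Invented series: an AR(1) process with coefficient 0.5, integrated once
shocks <- as.numeric(stats::filter(rnorm(240), filter = 0.5, method = "recursive"))
y <- ts(10000 + cumsum(shocks), start = c(1990, 1), frequency = 12)

# Fit ARIMA(1,1,0): one autoregressive term on the first differences
fit <- arima(y, order = c(1, 1, 0))
print(fit)

# Point forecasts and standard errors for the next 18 months
fc <- predict(fit, n.ahead = 18)

# Approximate 95% prediction bounds, assuming normally distributed errors
lower95 <- fc$pred - qnorm(0.975) * fc$se
upper95 <- fc$pred + qnorm(0.975) * fc$se
print(cbind(forecast = fc$pred, lower95, upper95))
```

The standard errors grow with the forecast horizon, so the bounds widen month by month, which is the same pattern visible in the Lo 95 and Hi 95 columns of the forecast tables.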
  28. 28. Conclusion
      In this report, we worked on a large data set, the airlines flight data set from infochimps.com, which consists of 3.5 million monthly domestic flight records from 1990 to 2009. We started by analyzing the data set, identifying the variables it contains and their data types, and computing basic summary statistics. The next task was to prepare the data for analysis, which in addition to cleaning the data also involved supplementing the data set with additional information, removing unnecessary variables, and transforming some variables in a way that made sense for the contemplated analysis. The data set became smaller as the analysis proceeded. We prepared two subsets of the overall data, containing the flights originating from SFO and from LAX, and carried out the time series analysis on both of them. Finally, we did our forecasting analysis on the SFO time series data. We used two forecasting techniques: Simple Exponential Smoothing and forecasting based on an ARIMA model.
      References
      Books/Articles
      1. Little Book of R on Time Series Analysis - Avril Coghlan
      2. Introduction to R's time series facilities - Michael Lundholm
      3. Working with Time Series Data in R - Eric Zivot
      4. White Papers on Big Data and Data Step - Revolution Analytics
      Websites
      1. http://www.inside-r.org/howto/extracting-time-series-large-data-sets
      2. http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
      3. http://www.revolutionanalytics.com/
