3. ICO price index (in US dollars) from January
2013 to April 2013.
You can find the data file on Blackboard. Download it and put it
in the same folder as your R program file.
Then, use the following command to read the data
coffee <- read.table('Coffee_prices_2013.txt', sep = 't', header
= TRUE)
1. Make a time series plot (not scatterplot) of price against
time. Use “Coffee Price” and “Time” as label for
y-axis and x-axis, respectively. Which time series components
are evident from the plot?
2. Smooth the coffee price series using simple moving averages
(SMA) of length 2 and 8. Add the two
smoothed curves (one in red and one in green) to the plot made
in (a) and compare them.
3. Apply single exponential smoothing (SES) to the coffee price
series with weights α = 0.8 and α = 0.2,
respectively. Add the two smoothed curves (one in orange and
one in purple) to the plot made in (a) and
compare them.
4. Find autocorrelation between the original time series and
each of the first 5 lags. Then, fit an autoregressive
model with the lags whose autocorrelations are greater than 0.8.
Write down the fitted model and add the
smoothed curve in blue to the plot made in (a). Which lag does
the model depend on most? Why?
5. Suppose we know that the next value in the series was, in
fact, 138.90. Compute the corresponding
absolute percentage error (APE) for each of the models you
have fitted before. Which model gives us the best
4. prediction?
1
Coffee Prices 2013
Chapter 20: Time Series Analysis
Home Depot Sales
The Home Depot chain of home improvement stores grew in the
1980s and 1990s faster than any other retailer
in history. By 2005, it was the second largest retailer in the
United States. But its extraordinary record of
growth was slowed by the financial crisis of 2008. How do
different methods of modeling time series compare
for understanding these data?
Let’s first read the dataset into R
HD <- read.table('Home_Depot_2012_GE19.txt', sep = 't',
header = TRUE)
and look at its structure
str(HD)
## 'data.frame': 72 obs. of 6 variables:
## $ Sales: num 3.57 4.15 4 3.75 4.36 ...
## $ year : num 1995 1995 1996 1996 1996 ...
## $ Q1 : int 1 0 0 0 1 0 0 0 1 0 ...
## $ Q2 : int 0 1 0 0 0 1 0 0 0 1 ...
## $ Q3 : int 0 0 1 0 0 0 1 0 0 0 ...
## $ Q4 : int 0 0 0 1 0 0 0 1 0 0 ...
As we can see, there are 72 observations for each of 6 variables.
Both Sales and year are numerical variables,
5. while Q1, Q2, Q3 and Q4 are the indicator variables for the
quarters.
Let’s plot the time series of Home Depot quarterly sales:
plot(HD$year, HD$Sales, xlab = 'Year', ylab = 'Sales',
main = 'Home Depot Quarterly Sales 1995 - 2012')
lines(HD$year, HD$Sales)
1995 2000 2005 2010
5
1
0
1
5
2
0
2
5
Home Depot Quarterly Sales 1995 − 2012
Year
S
a
le
s
As we can see, there was a consistent increasing trend until the
6. end of 2006. After that, sales fell sharply.
1
They appear to have been recovering. Throughout this period,
however, there are fluctuations around the
trend that appear to be seasonal because they repeat every four
quarters.
Smoothing Methods
First, Let’s try a simple moving average (SMA). For data with a
strong seasonal component, it’s a good idea
to choose a moving average length based on the period. The
period of Home Depot sales is 4. We therefore
apply a SMA of length 2 and 4, respectively. We use SMA()
function in TTR package to do this:
library(TTR)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
sma1 <- SMA(HD$Sales, n = 2)
sma2 <- SMA(HD$Sales, n = 4)
The sma1 and sma2 return the smoothed series, which are
plotted below:
par(mfrow = c(1, 2))
plot(HD$year, HD$Sales, xlab = 'Year', ylab = 'Sales', main =
'SMA of Length 2')
lines(HD$year, HD$Sales)
lines(HD$year, sma1, col = 'red')
7. plot(HD$year, HD$Sales, xlab = 'Year', ylab = 'Sales', main =
'SMA of Length 4')
lines(HD$year, HD$Sales)
lines(HD$year, sma2, col = 'red')
1995 2000 2005 2010
5
1
0
1
5
2
0
2
5
SMA of Length 2
Year
S
a
le
s
1995 2000 2005 2010
5
1
8. 0
1
5
2
0
2
5
SMA of Length 4
Year
S
a
le
s
As expected, the SMA of longer length shows a smoother series.
The SMA of length 4 smooths out most of
the seasonal effects, and has trouble modeling the sudden
change in 2007. But, it provides a good description
of the trend. The SMA of length 2 is more wiggly, capturing the
seasonal effects and the sudden change
better.
Second, let’s try an exponential moving average (EMA). We
need to choose a smoothing weight. We will use
α = 0.5, which weights the current data value equally all the rest
in the past. For comparison reasons, we
2
9. also try α = 0.2, which makes the smoothing heavily depend on
the data in the past. The initial value is
chosen as the first observed data point. We use EMA() function
in TTR package to do this:
ema1 <- EMA(HD$Sales, ratio = 0.5, n = 1)
ema2 <- EMA(HD$Sales, ratio = 0.2, n = 1)
par(mfrow = c(1, 2))
plot(HD$year, HD$Sales, xlab = 'Year', ylab = 'Sales', main =
'EMA (alpha = 0.5)')
lines(HD$year, HD$Sales)
lines(HD$year, ema1, col = 'red')
plot(HD$year, HD$Sales, xlab = 'Year', ylab = 'Sales', main =
'EMA (alpha = 0.2)')
lines(HD$year, HD$Sales)
lines(HD$year, ema2, col = 'red')
1995 2000 2005 2010
5
1
0
1
5
2
0
2
5
10. EMA (alpha = 0.5)
Year
S
a
le
s
1995 2000 2005 2010
5
1
0
1
5
2
0
2
5
EMA (alpha = 0.2)
Year
S
a
le
s
11. As we can see, the EMA with a bigger α provides a less smooth
series. The EMA of α = 0.5 follows the
seasonal pattern more closely and models the sudden change
better. The EMA of α = 0.2 totally fails,
because it is too smooth to capture the seasonal effects, and
responds too slowly to the trend (under-estimate
before 2007 and over-estimate after).
Autoregressive Method
Let’s fit an autoregressive model. Because we know that the
seasonal pattern is of 4 quarters long, we use 4
lagged variables as the predictors. The autocorrelations
regarding those lags are given by
acf(HD$Sales, lag.max = 4, plot = FALSE)
##
## Autocorrelations of series 'HD$Sales', by lag
##
## 0 1 2 3 4
## 1.000 0.912 0.838 0.831 0.838
It’s interesting to see that the autocorrelation of lag4 is 0.838,
the second largest among all. It is also the
evidence of the seasonal effects.
3
Now let’s use ar() function to fit the autoregressive model:
ar1 <- ar(HD$Sales, aic = FALSE, order.max = 4, demean =
FALSE,
intercept = TRUE, method = 'ols')
12. ar1
##
## Call:
## ar(x = HD$Sales, aic = FALSE, order.max = 4, method =
"ols", demean = FALSE, intercept = TRUE)
##
## Coefficients:
## 1 2 3 4
## 0.4695 -0.2551 0.2233 0.4871
##
## Intercept: 1.694 (0.4544)
##
## Order selected 4 sigma^2 estimated as 1.612
The fitted AR(4) model is
ŷt = 1.694 + 0.4695yt−1 − 0.255yt−2 + 0.223yt−3 + 0.487yt−4
Unfortunately, the ar1 does not provide the fitted values. But,
we can use the residuals e to compute them
using ŷ = y − e:
fitted.ar1 <- HD$Sales - ar1$resid
Let’s plot the fitted series and the residuals:
par(mfrow = c(1, 2))
plot(HD$year, HD$Sales, xlab = 'Year', ylab = 'Sales', main =
'AR(4)')
lines(HD$year, HD$Sales)
lines(HD$year, fitted.ar1, col = 'red')
plot(HD$year, ar1$resid, xlab = 'Year', ylab = 'Residuals')
abline(0, 0)
1995 2000 2005 2010
14. 4
Year
R
e
si
d
u
a
ls
As we can see, the AR(4) model follows the seasonal pattern
well, but seems to have great difficulty capturing
the sudden change in 2007. The residual plot shows some
disturbance around the time of the financial
4
crisis. Because sales at that time stopped resembling previous
behavior with respect to growth, the 4 lagged
variables are less successful predictors.
Multiple Regression-based Model
Let’s fit a multiple regression model to Sales using year and the
dummy variables. We use year to model
trend and the dummy variables seasonal component. Since there
is a linear trend until the end of 2006, we
will fit the model to the time series before 2007 only. Note that
we will use the last quarter Q4 as baseline, so
15. only put the first three dummy variables in the model.
HD.new <- HD[HD$year<2007,]
imod <- lm(Sales ~ year + Q1 + Q2 + Q3, data = HD.new)
summary(imod)
##
## Call:
## lm(formula = Sales ~ year + Q1 + Q2 + Q3, data = HD.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9176 -0.6004 -0.1046 0.3649 4.7269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.515e+03 1.084e+02 -32.414 < 2e-16 ***
## year 1.762e+00 5.414e-02 32.554 < 2e-16 ***
## Q1 5.885e-01 4.967e-01 1.185 0.242652
## Q2 1.755e+00 4.921e-01 3.567 0.000901 ***
## Q3 4.554e-01 4.878e-01 0.934 0.355686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.178 on 43 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.9629
## F-statistic: 305.7 on 4 and 43 DF, p-value: < 2.2e-16
The coefficient of year is positive and highly significant,
showing a positive linear trend in the time series of
Sales . The significant coefficient of Q2 indicates that the sales
in second quarter is on average significantly
higher than the last quarter (baseline). The whole model looks
great with high R2 and significant F-test.
Comparison
16. So far, we have applied a few different methods to the Home
Depot series. Suppose we want to forecast
the sales for the first quarter of 2013. (Note: the original series
ends at the fourth quarter of 2012.) In the
smoothing methods, we use the last average in the series to
forecast:
(yhat.sma1 <- sma1[length(sma1)])
## [1] 18.627
(yhat.sma2 <- sma2[length(sma2)])
## [1] 18.908
5
(yhat.ema1 <- ema1[length(ema1)])
## [1] 18.89295
(yhat.ema2 <- ema2[length(ema2)])
## [1] 18.38313
In the autoregressive model, we may use predict() function to
forecast:
(yhat.ar1 <- predict(ar1, n.ahead = 1, se.fit = FALSE))
## Time Series:
## Start = 73
## End = 73
## Frequency = 1
## [1] 19.31656
In the multiple regression, we also need predict() function to
forecast:
17. data.new <- data.frame(year = 2013, Q1 = 1, Q2 = 0, Q3 = 0)
yhat <- predict(imod, newdata = data.new)
We know that the actual sales in the first quarter of 2013 were
$19.124$B. Which model does provide the
best prediction? We can compute absolute percentage error
(APE) for each prediction:
y.true <- 19.124
abs(y.true - yhat.sma1)/abs(y.true)*100
## [1] 2.598829
abs(y.true - yhat.sma2)/abs(y.true)*100
## [1] 1.129471
abs(y.true - yhat.ema1)/abs(y.true)*100
## [1] 1.208153
abs(y.true - yhat.ema2)/abs(y.true)*100
## [1] 3.87401
abs(y.true - yhat.ar1)/abs(y.true)*100
## Time Series:
## Start = 73
## End = 73
## Frequency = 1
## [1] 1.006905
abs(y.true - yhat)/abs(y.true)*100
## 1
## 76.65737
The AR(4) model seems to offer the best prediction, because it
yields the smallest APE (1.01%). The multiple
regression model gives the worst forecast, because the financial
crisis changed the trend of the time series
18. after 2007, but the model assumes the trend remains the same.
6
Home Depot SalesSmoothing MethodsAutoregressive
MethodMultiple Regression-based ModelComparison