What is ARIMA Forecasting and How Can it Be Used for Enterprise Analysis?
Writing Sample 2
Australian Red Wine
Time Series Data Analysis
Julia L. Nickle
Predict 411 – Section 57
11/28/2012
Introduction
As discussed in Assignment 7, the modeling of time series data requires the data to be
stationary, assuming constant mean and variance. If a data series is non-stationary, an ARIMA
model cannot be properly fit, resulting in inaccurate and inefficient forecasts. The purpose of this
assignment is to explore the Australian red wine data set, which covers monthly sales over 11
years, and efficiently generate forecasts for ten periods ahead.
Initial analysis reveals the data set suffers from both non-stationarity and seasonality. In
order for the non-stationary series to become stationary, the data must be transformed so that
the mean, variance, and autocorrelation structure do not change over time (e-Handbook of
Statistical Methods, 2012). Time series data that has been modified to account for this will show
no trend and constant variance over time. When a series suffers from seasonality, it shows
periodic fluctuations, that is, recurring peaks and declines in the data. This can often be
corrected through differencing, meaning taking the change in the series from one period to the
next (Nau, 2005). Because differencing can result in lost observations, time series analysis calls
for large sample sizes. Another viable remedy is creating a multiplicative model that accounts for
seasonality.
Once non-stationarity and seasonality issues are remedied, the series can be identified,
checked, and forecasted. This analysis of the Australian red wine data set covers its
transformation to a stationary series, accounting for seasonality, identification methods through
autocorrelation function and partial autocorrelation function plots, as well as fitting the data to
various models. Finally, this analysis includes forecasts of red wine sales as suggested by an
optimal model.
Data
The Australian red wine data set consists of 142 observations. The data was collected
from January 1980 through October 1991 and includes the monthly sales, in kiloliters, of red
wine. Simple descriptive statistics show the mean of the series as 1,477.77 with a standard
deviation of 531.03. A time series plot illustrates an upward trend and seasonal pattern within the
data. Sales in kiloliters appear to peak in July and drop during January. This visualization
indicates the data is affected by both non-stationarity and seasonality.
As constant variance over time cannot be assumed, the data needs to be transformed in
order to utilize the Box-Jenkins method for time series analysis. Transforming the data into a
stationary series may be as simple as taking the natural log of sales in kiloliters. The time series
plot of this transformation shows that the variability steadied somewhat, but further
transformation is necessary. Taking the first difference at lag 1 eliminates the upward trend,
achieving stationarity; the change can be seen in the time series plot after differencing. However,
the issue of seasonality still remains. Additional differencing at lag 12 accounts for the issue,
resulting in a time series plot that shows the Australian red wine data set as stationary and
unaffected by seasonality. The series is ready to move forward with the next steps in the
identification portion of the Box-Jenkins method.
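The transformation sequence described above (natural log, first difference at lag 1, then a difference at lag 12) can be sketched in a few lines of Python. This is a minimal illustration only: the `sales` values are a synthetic placeholder with a trend and a 12-month cycle, not the actual red wine data, and the helper name `difference` is hypothetical.

```python
import math

def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the first `lag` observations are lost."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Synthetic placeholder: 142 monthly values with growth and a fixed seasonal pattern.
# (Assumption for illustration; these are NOT the Australian red wine figures.)
sales = [1000 * (1.01 ** t) * (1 + 0.3 * math.sin(2 * math.pi * t / 12))
         for t in range(142)]

log_sales = [math.log(v) for v in sales]   # steadies the variance
d1 = difference(log_sales, lag=1)          # removes the trend
d1_12 = difference(d1, lag=12)             # removes the 12-month seasonal pattern

print(len(log_sales), len(d1), len(d1_12))  # 142 141 129
```

The observation counts match the PROC ARIMA output in the appendix: 141 observations after the first difference and 129 residuals once the lag-12 difference is also applied.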
Analysis
The best practice to identify an appropriate model for the Australian red wine data set
begins with an evaluation of the series’ autocorrelation and partial autocorrelation function plots
in addition to an assessment of the autocorrelation check for white noise. Analysis begins with
the original, non-stationary data set. PROC ARIMA output illustrates the results of the white
noise test; the autocorrelations up to lag 36 are highly significant with p-values <.0001. The
null hypothesis, which states that none of the autocorrelations up to a given lag are significantly
different from zero, can be rejected, meaning that the series is not white noise and an ARIMA
model is warranted for the data to be forecasted accurately. However, with the Australian red
wine data set in its original
state, affected by non-stationarity and seasonality, forecasts will be unreliable and inaccurate.
Results for the data set using the natural log of sales in kiloliters are similar; while the variability
is less erratic, the series is still too unstable for effective modeling. The autocorrelation function
plot clearly demonstrates the effect of non-stationarity, featuring slow decays and increases.
Modeling the series after taking the first difference proves to be a step in the right
direction. As previously stated, the time series plot illustrates constant variance, yet there are
still significant peaks within the set. Moving to model the log of sales in kiloliters after both the
first differencing and the difference at lag 12 provides a better solution. The autocorrelation
check for white noise is consistent; the p-values are highly significant at <.0001 through lag 36.
The ACF plot shows a sharp drop after lag one; the drops continue until the series dips below
zero. The PACF plot shows a slightly more stable, exponential decay over the lags. Overall,
however, the plots are similar, which suggests that a mixed model might be the best way to
represent the series. To determine whether or not this is accurate, the series should be fit to both
AR and MA models, and the results compared with those of a mixed ARMA model.
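As a rough illustration of what the ACF and PACF plots summarize, the following Python sketch computes sample autocorrelations and derives partial autocorrelations from them with the Durbin-Levinson recursion. The function names are hypothetical, and the sketch is not a reproduction of how PROC ARIMA computes its output.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf_from_acf(acf):
    """Partial autocorrelations for lags 1..len(acf)-1 via Durbin-Levinson."""
    pacf = []
    prev = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, len(acf)):
        num = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# Theoretical ACF of an AR(1) process with coefficient 0.5: r_k = 0.5 ** k.
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125]))  # [0.5, 0.0, 0.0]
```

A PACF that cuts off after a few lags while the ACF decays points toward an AR model; the reverse pattern points toward an MA model. When both decay, as described above, a mixed model is a reasonable candidate.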
Proc ARIMA output illustrates how well the Australian red wine data set fits to an MA
model of order 1. The moving average term of the model is significant with a p-value of .0003.
However, the autocorrelation check of residuals shows a highly significant p-value of <.0001 at
lag 12. Therefore, the MA(1) model is not sufficient to represent the Australian red wine series.
Moreover, if the data is fitted to an MA(12) model, results show that none of the MA terms are
significant. Taken as a whole, it appears that an MA model alone is not suitable for the data set.
If the series is fitted to an AR(1) model, results show the AR term as highly significant,
with a p-value <.0001. But, as with the MA(1) model, the autocorrelation check of residuals
shows a highly significant p-value of <.0001 at lag 12. When the order is raised to AR(12), not
all of the AR parameters are significant, however. Parameters 1, 5, 8, 11, and 12 are significant
while the rest of the parameters are unnecessary and do not need to be included.
Because the data exhibit both trend and seasonal components, the series may be better
represented by a multiplicative model with differencing at lags 1 and 12. PROC ARIMA shows
that a multiplicative AR(1,12) model is a decent fit, but not an ideal one. Both AR terms
are highly significant with p-values <.0001. Fit statistics show an AIC of -147.262 and an SBC
of -138.683 with a standard error of 0.135171. Yet, the autocorrelation check of residuals shows
significant p-values through lag 24.
The multiplicative MA(1,12) model, according to fit statistics, provides a better fit. Not
only are both MA terms highly significant, but the AIC and SBC values are smaller: -184.973
and -176.394, respectively. Moreover, the standard error estimate is smaller, at 0.11679. Here,
the null hypothesis of the autocorrelation check of residuals cannot be rejected; none of the
p-values are statistically significant. This indicates that the residuals behave like white noise
and that the model provides an adequate fit to the data. None of the other models thus far passed
this check at every lag. Additionally, according to the Q-Q
plot, the residuals appear normally distributed.
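The AIC and SBC values reported for the two multiplicative models differ only in their penalty terms. A minimal Python sketch, assuming the standard definitions AIC = -2 ln L + 2k and SBC = -2 ln L + k ln n, shows how one follows from the other. The function name `sbc_from_aic` is hypothetical; n = 129 is the number of residuals in the appendix output, and k = 3 (a mean plus two AR or two MA terms) is an assumption for the AR(1,12) model, whose parameter table is not shown.

```python
import math

def sbc_from_aic(aic, n_params, n_resid):
    """SBC = AIC + k * (ln(n) - 2), which follows from the two definitions."""
    return aic + n_params * (math.log(n_resid) - 2)

# Values quoted in the text; k and n as described in the lead-in.
sbc_ma = sbc_from_aic(-184.973, n_params=3, n_resid=129)
sbc_ar = sbc_from_aic(-147.262, n_params=3, n_resid=129)

print(round(sbc_ma, 3), round(sbc_ar, 3))
```

Both results agree with the reported SBC values (-176.394 and -138.683) to within rounding. Because the two models carry the same penalty, the gap between them comes entirely from the likelihood term.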
As a precaution, the data should also be modeled by a multiplicative ARMA(1,12),
because the ACF and PACF indicated an ARMA might best represent the series. However,
PROC ARIMA output shows that of the 5 parameters, only the MA terms are significant. As
such, the multiplicative MA(1,12) model appears to provide the optimal fit to the Australian red
wine data set.
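Written out, the selected model takes the following form. This is a sketch that assumes the usual PROC ARIMA sign convention for moving average factors, with B the backshift operator, y_t the monthly sales in kiloliters, and a_t the white noise error; the estimates are those in the appendix table.

```latex
\[
(1 - B)(1 - B^{12}) \log y_t
  = \mu + (1 - \theta_1 B)(1 - \Theta_1 B^{12})\, a_t
  = \mu + a_t - \theta_1 a_{t-1} - \Theta_1 a_{t-12} + \theta_1 \Theta_1 a_{t-13}
\]
\[
\hat{\mu} = -0.0005585, \qquad
\hat{\theta}_1 = 0.78686, \qquad
\hat{\Theta}_1 = 0.75201, \qquad
\hat{\theta}_1 \hat{\Theta}_1 \approx 0.59173
\]
```

The lag-13 cross term is what makes the model multiplicative: its coefficient is fixed by the product of the two factors rather than estimated as a separate parameter.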
PROC ARIMA produces forecasts as log values of sales, so the forecast values are
exponentiated to return them to kiloliters. Using the MA(1,12) model, forecasts of sales in
kiloliters for ten periods ahead show values ranging from 1,123.58 to 2,885.36. The average
forecast is 2,095.61 with a standard deviation of 166.37. Lower and upper confidence limits
range from 878.10 to 2,188.17 and 1,416.46 to 3,734.57, respectively. The average standard error
for the ten forecasts is .127983. Visually, the forecasted values decrease between observations
144 and 146, begin to increase steadily between observations 148 and 151, and then taper off
before observation 152.
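The back-transformation from the log scale can be sketched as follows. The input numbers are hypothetical placeholders, not values from the PROC ARIMA output, and the function name `back_transform` is an assumption for illustration. The sketch exponentiates the point forecast and the 95% limits directly, which is the simple approach described above.

```python
import math

def back_transform(log_forecast, log_std_error, z=1.96):
    """Exponentiate a log-scale forecast and its approximate 95% limits."""
    point = math.exp(log_forecast)
    lower = math.exp(log_forecast - z * log_std_error)
    upper = math.exp(log_forecast + z * log_std_error)
    return point, lower, upper

# Hypothetical log-scale forecast and standard error (illustration only).
point, lower, upper = back_transform(7.60, 0.12)
print(f"forecast={point:.2f} kL, 95% limits=({lower:.2f}, {upper:.2f})")
```

Because the exponential function is not linear, the limits on the original scale are asymmetric around the point forecast: the upper limit sits farther from the forecast than the lower one, as in the ranges reported above.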
Summary/Conclusions:
Exploratory analysis of the Australian red wine data set using time series plots reveals the
set’s non-stationary nature and sensitivity to seasonality. In order to appropriately apply the
Box-Jenkins method for time series data analysis, the set requires transformation. Only when the
data is stationary and seasonality is accounted for can the series be identified correctly. During
this critical step in the Box-Jenkins process, ACF and PACF plots suggest that the data set may be
best represented by a mixed ARMA model. Evaluation of AR(1), AR(12), MA(1), and
MA(12) models reiterated the data set’s need to be fit to a multiplicative or a multiplicative
mixed model.
A multiplicative model is more useful in this scenario as the Australian red wine data set
suffers from seasonality. The model must take into account higher and lower value proportions,
rather than assume their difference is constant (Box, Jenkins, & Reinsel, 2008). In other words, a
multiplicative model assumes seasonal effects act proportionally on the data series. Of the three
multiplicative models, the MA(1,12) model performed best and proved to be sufficient for the
data series. Additionally, values forecasted from the MA(1,12) model have small standard
errors, are reported with 95% confidence limits, and appear logical and consistent with the rest
of the series.
Future Work
If more time were available, it may be beneficial to consider the other orders suggested
by the smallest canonical correlation method (SCAN) and the extended sample autocorrelation
function (ESACF). SCAN and ESACF methods provide valuable suggestions from which to
uncover the order of a time series model (SAS, 2010). Each method proposed the ARMA(3,3)
model as optimal, but it might be interesting to compare results to other recommended mixed
models: ARMA(5,3), ARMA(1,5), and ARMA(2,5). Fit statistics of these models including AIC,
SBC, and standard error estimates might show a better fit, resulting in different and potentially
more accurate forecasts. Additionally, because these other models appear sufficient to represent
the data, it might be worthwhile to generate forecasts from them. Afterwards, the forecasts could
be compared to the MA(1,12) values and evaluated for accuracy.
References
SAS. (2010, April). Retrieved November 25, 2012, from SAS/STAT(R) 9.2 User's Guide, Second
Edition: http://support.sas.com/
e-Handbook of Statistical Methods. (2012). Retrieved November 24, 2012, from
NIST/SEMATECH: http://www.itl.nist.gov/div898/handbook/
Box, G. E., Jenkins, G. M., & Reinsel, G. C. (2008). Time Series Analysis, Forecasting and
Control. Hoboken: John Wiley & Sons, Inc.
Nau, R. (2005, May 15). Statistical Forecasting. Retrieved November 24, 2012, from
Stationarity and differencing: http://people.duke.edu/~rnau/411diff.htm
Appendix
Time Series Plot of the Australian Red Wine Dataset
Time Series Plot of the Australian Red Wine Dataset w/ Natural Log,
Name of Variable = log_sales
Period(s) of Differencing 1
Mean of Working Series 0.010527
Standard Deviation 0.271498
Number of Observations 141
Observation(s) eliminated by differencing 1
Autocorrelation Check for White Noise
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
     6       31.48   6      <.0001  -0.240  -0.100   0.065   0.034  -0.023  -0.374
    12      122.91  12      <.0001  -0.062   0.048   0.042  -0.089  -0.172   0.735
    18      146.28  18      <.0001  -0.136  -0.088   0.075   0.063  -0.066  -0.322
    24      218.84  24      <.0001  -0.055   0.027   0.029  -0.095  -0.125   0.626
    30      239.26  30      <.0001  -0.099  -0.090   0.056   0.099  -0.092  -0.272
    36      308.42  36      <.0001  -0.058   0.022   0.046  -0.111  -0.109   0.575
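The chi-square statistic in the first row of this table can be approximately reproduced from the six listed autocorrelations. The sketch below assumes the Ljung-Box form of the statistic, Q = n(n + 2) Σ r_k² / (n − k), with n = 141 observations after differencing; the function name `ljung_box_q` is hypothetical.

```python
def ljung_box_q(autocorrs, n_obs):
    """Ljung-Box Q statistic from autocorrelations at lags 1..m."""
    return n_obs * (n_obs + 2) * sum(
        r * r / (n_obs - k) for k, r in enumerate(autocorrs, start=1)
    )

# Autocorrelations at lags 1-6 from the table (rounded to three decimals).
r_1_to_6 = [-0.240, -0.100, 0.065, 0.034, -0.023, -0.374]
q6 = ljung_box_q(r_1_to_6, n_obs=141)
print(round(q6, 2))  # close to the reported 31.48; the inputs are rounded
```

The large spike at lag 6, together with the spikes at lags 12, 24, and 36 in later rows, is what drives the statistic far above what white noise would produce.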
Final Multiplicative MA(1,12) Model with Forecast
Conditional Least Squares Estimation
Parameter    Estimate  Standard Error  t Value  Approx Pr > |t|  Lag
MU         -0.0005585       0.0007340    -0.76           0.4481    0
MA1,1         0.78686         0.05565    14.14           <.0001    1
MA2,1         0.75201         0.06917    10.87           <.0001   12
Constant Estimate -0.00056
Variance Estimate 0.01364
Std Error Estimate 0.11679
AIC -184.973
SBC -176.394
Number of Residuals 129
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
     6        4.86   4      0.3023   0.057   0.002  -0.069  -0.065   0.116  -0.100
    12        8.89  10      0.5422  -0.041   0.083   0.013   0.084  -0.112  -0.000
    18       11.22  16      0.7956  -0.092   0.028   0.048  -0.035  -0.013  -0.053
    24       12.58  22      0.9443  -0.036   0.010  -0.020  -0.014   0.069   0.043