WAL_HUMN6100_08_A_EN-CC.mp4
Time Series Design
Program Transcript
RICHARD BALKIN: Time series design simply refers to a study done over time, as
opposed to collecting data at one particular instant. Often, time series design
is really the single subject design. But you can have multiple participants in time
series design. In the article that we'll even discuss for this week, we see a time
series design occur as an element of looking at changes across a program over
time, and perceptions that participants have about that program over time.
So time series design can be used to look at whether data are predictable, whether
information can be verified, and whether this information can be replicated over
time. In other words, in a good time series design, I should be able to conduct a
study like this again and get the same result. So in a time series design, instead
of looking at how changes may occur between groups, we may see how change
occurs with a single subject, or even within a group, or for a program over a
particular period of time.
And that period of time can even be more longitudinal in nature. We can look at
changes across a few months, but we can also look at changes across years.
Additionally, if multiple subjects are used in a time series design, and if the
research is longitudinal in nature, you need to take into consideration attrition
rates. Are the participants who began the study the same participants at the end
of the study? Was there attrition? And maybe consider why attrition might occur.
For example, is the researcher able to keep up with all of the participants at the
beginning, intermediate, and latter stages of the study? Attrition is normal in any
research study, but it also needs to be accounted for.
An example of some time series research that I've conducted in the past was
when I worked as a therapist at a psychiatric hospital. At that time, we were
very interested in seeing what happened to our clients once they left the
hospital. We knew how they were when they were admitted. They were either a
danger to self or a danger to others. And we had an idea of how stable they were
when they were discharged. But how were they doing one month, three months, six
months, and 12 months after treatment?
So we had an aftercare program. And through the active aftercare program, we
were able to do some post-care follow-up with each of the clients once they left
the hospital. One of our experiences was that after six months it was very difficult
to continue to get feedback from the participants. One of the reasons simply was
that working with this population, they were highly transient. Phone numbers
would change. Addresses would change. And we just weren't able to get a lot of
one-year follow up. Or, perhaps a child had relapsed and the parents were
maybe angry at the treatment center, didn't want to respond to our queries. So
those elements can play ...
… months, and the six-month, and the 12-month intervals, was essential in terms
of doing a time series design, and finding out: did kids relapse or regress to their
previous high-risk behavior after receiving treatment at the hospital? And what
were the influencing factors? We also would want to know information such as,
did they continue in outpatient counseling, for example?
In examining an article that uses time series design, we've
selected an article
that's quite multi-faceted. So in this particular article, they use
a four-phase
design to conduct the time series research. The 12-month
baseline pre-exposure
phase assessed program and patient outcomes. In Phase II,
which occurs after
six months of training, MDFT experts train Adolescent Day
Treatment Program
staff and administrators. Then in Phase III they have an implementation
stage. And this is at 14 months. And then at Phase IV, they have a Durability
a Durability
Practice Phase, which is around 18 months.
So let's take a look at how the program dimensions changed
over time through
this time series design. So these program dimensions included
aspects like
autonomy, and clarity, and program organization, and control.
And what they
noticed was that, as a result of implementing this MDFT program, participants,
patients within the program, noticed positive differences among these program
dimensions. So here what we end up with is a statistically
significant difference in
the way a program is perceived by the primary stakeholders, in
this case, the
patients who are experiencing treatment in the day program.
So imagine being able to implement an intervention that across
time improves
your program and improves receptiveness to treatment. And that
was the
importance of the study. Hopefully, when practitioners see this,
they can see a
treatment model that affects the quality of care. And they may
be more apt to use
such a model in their programs.
In terms of multicultural, ethical, and legal considerations, we might want to once
again review: who was the sample? Who are the participants in this study? So that
we make sure that the participants in the study are truly
generalizable to the
population of interest.
Additionally, whenever doing a time series design, you want to
think about and
consider, what occurs during the study? What is the
intervention? What is the
change that we're looking at? Is this change positive or not? For
example, what
would happen if the study was being conducted and immediately
a negative
consequence as a result of the intervention is occurring? Well,
of course the
ethical thing to do would be to stop the study.
… observations in order to fit a seasonal model to the data.
Goals of time series analysis:
1. Descriptive: identify patterns in correlated data, such as trends and seasonal variation
2. Explanation: understand and model the data
3. Forecasting: predict short-term trends from previous patterns
4. Intervention analysis: determine how a single event changes the time series
5. Quality control: detect deviations of a specified size that indicate a problem
Time series are analyzed in order to understand the
underlying structure and function that produce the
observations. Understanding the mechanisms of a time series
allows a mathematical model to be developed that explains the
data in such a
way that prediction, monitoring, or control can occur.
Examples include prediction/forecasting,
which is widely used in economics and business.
Monitoring of ambient conditions, or of an input or an output, is
common
in science and industry. Quality control
is used in computer science, communications, and industry.
It is assumed that a time series data set has at least one
systematic pattern. The most common patterns are trends and seasonality.
Trends are generally linear or quadratic. To find trends, moving averages
or regression analysis is often used. Seasonality is a trend that repeats
itself systematically over time. A second assumption is that the data
exhibit enough of a random process that it is hard to identify the
systematic patterns within the data. Time series analysis techniques often
apply some type of filter to the data in order to dampen the error. Other
potential patterns have to do with lingering effects of earlier observations
or earlier random errors.
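To make the idea of filtering concrete, here is a minimal Python sketch of a centered moving average, one of the trend-finding filters mentioned above. The data are hypothetical, invented purely for illustration:

```python
def moving_average(series, window):
    """Centered moving average; returns None where the window doesn't fit."""
    half = window // 2
    out = []
    for i in range(len(series)):
        if i < half or i + half >= len(series):
            out.append(None)  # window extends past the ends of the data
        else:
            out.append(sum(series[i - half:i + half + 1]) / window)
    return out

# Hypothetical series: upward trend plus alternating noise
data = [t + (1 if t % 2 else -1) for t in range(12)]
smoothed = moving_average(data, 3)  # noise is dampened, trend shows through
```

A wider window dampens more of the random error, but it can also blur short seasonal patterns.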
There are numerous software programs that will analyze time
series, such as SPSS, JMP, and SAS/ETS.
For those who want to learn or are comfortable with coding,
Matlab, S-PLUS, and R are other software packages that can
perform time series analyses. Excel can be used if linear
regression analysis
is all that is required (that is, if all you want to find out is the
magnitude
of the most obvious trend). A word of
caution about using multiple regression techniques with time
series data:
because of the autocorrelated nature of time series, time series data
violate the assumption of independence of errors.
Type I error rates will increase substantially when
autocorrelation is
present. Also, inherent patterns in the
data may dampen or enhance the effect of an intervention; in
time series
analysis, patterns are accounted for within the analysis.
Observations made over time can be either discrete or
continuous. Both types of observations
can be equally spaced, unequally spaced, or have missing data.
Discrete measurements can be recorded at any
time interval, but are most often taken at evenly spaced
intervals. Continuous measurements can be spaced
randomly in time, such as measuring earthquakes as they occur because an
instrument is constantly recording, or can entail constant measurement of a
measurement of a
natural phenomenon such as air temperature, or a process such
as velocity of an
airplane.
Time series are very complex because each observation is
somewhat dependent upon the previous observation, and often is
influenced by
more than one previous observation.
Random error is also influential from one observation to
another. These influences are called
autocorrelation—dependent relationships between successive
observations of the
same variable. The challenge of time
series analysis is to extract the autocorrelation elements of the
data, either
to understand the trend itself or to model the underlying
mechanisms.
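The dependence between successive observations can be sketched in a few lines of Python. The series below is hypothetical; the formula (products of mean deviations divided by the sum of squared deviations) is the standard sample autocorrelation:

```python
def autocorr(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# Hypothetical alternating series: each value is anticorrelated with the next
wave = [1, -1, 1, -1, 1, -1, 1, -1]
r1 = autocorr(wave, 1)  # strongly negative at lag 1
r2 = autocorr(wave, 2)  # strongly positive at lag 2
```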
Time series reflect the stochastic nature of most
measurements over time. Thus, data may
be skewed, with mean and variation not constant, non-normally
distributed, and
not randomly sampled or independent.
Another non-normal aspect of time series observations is that
they are
often not evenly spaced in time due to instrument failure, or
simply due to
variation in the number of days in a month.
There are two main approaches used to analyze time series:
(1) in the time domain or (2) in the frequency domain. Many
techniques are available to analyze data
within each domain. Analysis in the time
domain is most often used for stochastic observations. One
common technique is the Box-Jenkins ARIMA
method, which can be used for univariate (a single
data set) or multivariate (comparing two or more data sets)
analyses. The ARIMA technique uses
moving averages, detrending, and regression methods to detect
and remove
autocorrelation in the data.
Below, I will demonstrate a Box-Jenkins ARIMA time domain
analysis of a
single data set.
Analysis in the frequency domain is often used for periodic
and cyclical observations. Common techniques are spectral
analysis, harmonic
analysis, and periodogram analysis. A
specialized technique is Fast Fourier Transform (FFT).
Mathematically, frequency domain techniques
use fewer computations than time domain techniques, thus for
complex data,
analysis in the frequency domain is most common. However,
frequency analysis is more difficult
to understand, so time domain analysis is generally used outside
of the
sciences.
Time series analysis using
ARIMA methods
Using the ARIMA (auto-regressive, integrated, moving
average) method is an iterative, exploratory process intended to best fit your
time series observations by using three steps—identification, estimation, and
diagnostic checking—to build an adequate model for a time
series. The auto-regressive component
(AR) in ARIMA is designated as p, the
integrated component (I) as d, and
moving average (MA) as q. The AR component represents the
lingering
effects of previous observations. The I component represents
trends, including
seasonality. And the MA component
represents lingering effects of previous random
shocks (or error). To fit an ARIMA
model to a time series, the order of each model component must
be selected.
A small integer value (usually 0, 1, or 2) is found for each
component. The goal is to find the most
parsimonious model with the smallest number of estimated
parameters needed to
adequately model the patterns in the observed data.
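As an aside, the AR idea of lingering effects of previous observations can be illustrated with a small Python sketch that simulates an AR(1) process and recovers its coefficient by regressing each value on its predecessor. This is a toy illustration with simulated data, not the full Box-Jenkins estimation a package such as JMP performs:

```python
import random

random.seed(42)

# Simulate an AR(1) process: y_t = phi * y_{t-1} + shock_t
phi_true = 0.7
y = [0.0]
for _ in range(2000):
    y.append(phi_true * y[-1] + random.gauss(0, 1))

# Estimate phi by least squares of y_t on y_{t-1} (through the origin)
num = sum(y[t - 1] * y[t] for t in range(1, len(y)))
den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
phi_hat = num / den  # should land near 0.7
```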
In order to demonstrate time series analysis, I introduce a
data set of monthly precipitation totals from Portola,
CA in the Sierra Nevada
in Table 1. When a time series has strong
seasonality, as my data set does, a slightly different type of
ARIMA (p,d,q) process is used, which is often called SARIMA
(p,d,q)*(P,D,Q), where S stands for seasonal. In this model, not
only are there possible
AR, I, and MA terms for the data, there is a second set of AR, I,
and MA terms
that take into account the seasonality of the data.
Time series data are correlated,
which means that measurements are related to one another and
change together to
some degree. Thus, each observation is
partially predictable from previous observations, or from
previous random
shocks, or from both. An assumption made
after analysis is that the correlations inherent in the data set
have been
adequately modeled. Thus after a model
has been built, any leftover variations are considered to be
independent and
normally distributed with mean zero and constant variance over
time. These leftover variations are used to
interpret the data.
Regardless of which technique is used, the first step in any
time series analysis is to plot the observed values against time.
A number of qualitative aspects are
noticeable as you visually inspect the graph.
In Figure 1, we see that there is a 12-month pattern of seasonality, no
evidence of a linear trend, and variation from the mean appears to be
approximately equal across time.
Figure 1. Monthly precipitation data from the NOAA weather station in
Portola, CA, from January 1999 through April 2004. Precipitation occurs
cyclically; December falls on numbers 12, 24, 36, 48, 60, and 72.
Mean = 1.66 inches/month, standard deviation = 2.09, n = 76.
Is there a trend to this data set? The simplest linear equation
would be y = b,
where b is the random shock, or error, of the data set. The
linear equation for my data set is y =
-0.0018x + 1.6688. With a slope of
-0.0018, there is no significant linear trend.
This data set needs no further work to eliminate a linear or
quadratic
trend.
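The slope check above can be reproduced with a short ordinary least-squares sketch in Python. The series here is hypothetical (for the precipitation data, the software reported y = -0.0018x + 1.6688):

```python
def linear_fit(ys):
    """Ordinary least-squares fit of y = slope * x + intercept, x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# A hypothetical trendless, oscillating series: slope comes out near zero
series = [1.0, 3.0, 1.0, 3.0, 1.0]
slope, intercept = linear_fit(series)
```

A slope near zero, as with the precipitation data, suggests no linear trend worth removing.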
If removal of the trend—detrending—is needed, I would
proceed to differencing. Ordinary least
squares analysis is another method used to detect and remove
trends. Differencing has advantages of ease of use
and simplicity, but also has disadvantages including over-correcting for
trends, which skews the correlations in a negative direction.
There are other problems with differencing
that are covered in textbooks.
Differencing
means calculating the difference among pairs of observations at
some time
interval. A difference of one time
interval apart is calculated by subtracting value #1 from value
#2, then #2
from #3, and on, and plotting that data to determine if mean of 0
and a
constant variance are present. If differencing of one does not detrend the
data, calculate a difference of 2 by subtracting difference #2 from
difference #3, and so on. Use a log transformation on the
differences if necessary to stabilize the mean and variance.
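The differencing steps described above can be sketched in a few lines of Python, using hypothetical series for illustration:

```python
def difference(series, lag=1):
    """Difference a series at the given lag: value[t] - value[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series with a linear trend: first differencing removes it
trended = [3 * t + 5 for t in range(8)]
first_diff = difference(trended)            # constant 3s: trend removed

# Hypothetical 4-period seasonal pattern: seasonal differencing removes it
seasonal = [10, 2, 6, 4] * 3
seasonal_diff = difference(seasonal, lag=4)  # all zeros: seasonality removed
```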
Seasonal
autocorrelation is different from a linear or quadratic data trend
in that
it is predictably spaced in time. Our
precipitation data can be expected to have a 12-month seasonal
pattern, whereas
daily observations might have a 7-day pattern, and hourly
observations often
have a 24-hour pattern.
In order to detect seasonality, plot the autocorrelation function (ACF) by
calculating and graphing the residuals (observed minus mean for each data
point). The graph of the residuals against a specified time interval is
called a lagged autocorrelation function or a correlogram. The null
hypothesis for the ACF is that the time series observations are not
correlated to one another, i.e., that any pattern in the data is from random
shocks only. The residuals can be calculated using Equation 1:

e_t = y_t − ȳ   (Equation 1)

where y_t is the observed value at time t and ȳ is the mean of the series.
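A minimal Python sketch of the ACF computed from mean residuals, together with the approximate 2-standard-error limit (commonly approximated as 2/√N), using a hypothetical series with a period-4 cycle:

```python
import math

def acf(series, nlags):
    """Autocorrelation function from mean residuals (observed minus mean)."""
    n = len(series)
    mean = sum(series) / n
    resid = [y - mean for y in series]   # residuals, per Equation 1
    denom = sum(e * e for e in resid)
    return [sum(resid[t] * resid[t + k] for t in range(n - k)) / denom
            for k in range(1, nlags + 1)]

# Hypothetical series repeating a period-4 seasonal cycle
series = [10, 2, 6, 4] * 10
lags = acf(series, nlags=len(series) // 4)   # Box & Jenkins' N/4 rule of thumb
bound = 2 / math.sqrt(len(series))           # approximate 2-SE significance limit
significant = [k + 1 for k, r in enumerate(lags) if abs(r) > bound]
```

The large spike at lag 4 mirrors how the precipitation data spike at lag 12.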
In time series analysis a lag is defined as follows: an event
occurring at time t + k (k > 0) is said to lag behind an event occurring at
time t, the extent of the lag being k.
In 1970, Box and Jenkins wrote, “…to obtain a
useful estimate of the autocorrelation function, we would need at least 50
observations and the estimated autocorrelations would be calculated for
k = 0, 1, …, k, where k was not larger than N/4”. For my data set of
78 observations, I specified 19 autocorrelation lags (78/4 = 19.5).
A rule of thumb for an ACF is if there are plotted residuals
that are greater than 2 standard errors away from the zero mean,
they indicate
statistically significant autocorrelation.
In Figure 2, there are 2 residual values, at lag 6 and lag 12, that lie
more than 2 standard errors—that is, the approximate 95%
confidence limits—from
the zero mean. I interpret this as a
6-month seasonal pattern that cycles between summer when
there is little to no
precipitation, and winter when precipitation is at its peak. So,
even though the linear equation reveals
no trend, graphing the ACF reveals seasonality.
I used the JMP software program from SAS to analyze my data
set. Though I will not cover how to
perform a time series analysis in the spectral domain, I did use
the spectral
density graph to verify that the biggest seasonal pattern occurs
at 12-month
intervals, not at 6-month intervals. In
Figure 3, notice the large spike at period 12.
Figure 2. Lagged autocorrelation function of Portola, CA precipitation data.
Visual inspection shows significant deviations from zero correlation at
lags 1, 6, and 12, with values very close to significance at lags 7 and 13.
Interpretation suggests that there are two seasonal (rainy season and dry
season) patterns spaced about 6 months apart. Number of autocorrelation lags
equals 19.
Figure 3. Spectral density as a function of period. A strong signal appears
at about period 12, corresponding to a yearly cycle.
The partial autocorrelation function (PACF) is also used to
detect trends and seasonality. Figure 4
is the PACF of the precipitation data.
In general, the PACF is the amount of correlation between a
variable and
its lag that is not explained by correlations at all lower-order
lags. The equation to obtain partial
autocorrelations is very complex, and is best explained in time series
textbooks.
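One standard route those textbooks take is the Durbin-Levinson recursion, which derives the partial autocorrelations from the ordinary autocorrelations. A Python sketch, fed with the theoretical autocorrelations of an AR(1) process so the expected answer is known in advance:

```python
def pacf_from_acf(rho):
    """Partial autocorrelations from autocorrelations rho[0..K]
    via the Durbin-Levinson recursion; rho[0] must be 1."""
    K = len(rho) - 1
    pacf = [None, rho[1]]            # pacf[k] = phi_kk; lag 0 unused
    phi_prev = {1: rho[1]}
    for k in range(2, K + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - j] for j in range(1, k))
        den = 1 - sum(phi_prev[j] * rho[j] for j in range(1, k))
        phi_kk = num / den
        phi = {j: phi_prev[j] - phi_kk * phi_prev[k - j] for j in range(1, k)}
        phi[k] = phi_kk
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf

# Theoretical AR(1) autocorrelations with phi = 0.5: rho_k = 0.5 ** k
rho = [0.5 ** k for k in range(5)]
p = pacf_from_acf(rho)  # lag 1 equals 0.5; higher lags vanish for a pure AR(1)
```

This cutoff after lag p is exactly what makes the PACF useful for identifying the AR order.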
Figure 4. Lagged partial autocorrelation function of Portola, CA
precipitation data. Significant deviation from zero is evident at lags 1, 6,
and 12, suggesting the same 6-month seasonal pattern.
Now that our observations against time, as well as the ACF,
and PACF have been graphed, we can begin to match our
patterns to idealized
ARIMA models. The easy way to analyze a
time series data set is to simply input numerous variations of
ARIMA. There are also systematic steps that you can
take that will help suggest the best values for the AR, I, and MA
terms.
Here I present a few general rules to apply when working to
identify the best-fit ARIMA model. These
rules come from the Duke University
website http://www.duke.edu/~rnau/411home.htm, that, along
with other textbooks and websites listed
below, was instrumental in helping me understand time series
analysis, and
specifically in helping me understand the nuances of seasonally
affected time
series.
After adjusting the data by a seasonal difference of 1 using
JMP, a visual inspection shows that the ACF decays more slowly than the
PACF (Figure 5). I used Duke’s Rule #3:
The optimal order of differencing is often the order of
differencing at which
the standard deviation is lowest, to help me determine that my
data needed no
differencing for trend but did need to be differenced for
seasonality (both
options available in JMP). A seasonal
difference of 1 yields a standard deviation of 1.89, the lowest
value of the
iterations that I tried.
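That rule of thumb can be mimicked in Python: difference the series seasonally and compare standard deviations. The series below is hypothetical, built with a strong 12-month cycle:

```python
import statistics

def seasonal_difference(series, period):
    """Subtract the value one full season earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Hypothetical monthly series: strong 12-month cycle plus small irregularities
cycle = [5, 4, 3, 2, 1, 0, 0, 0, 1, 2, 3, 6]
series = [cycle[t % 12] + 0.1 * (t % 5) for t in range(60)]

sd_raw = statistics.pstdev(series)
sd_seasonal = statistics.pstdev(seasonal_difference(series, 12))
# Per the rule of thumb, the lower standard deviation after seasonal
# differencing suggests a seasonal difference of 1 is appropriate here.
```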
Figure 5. ACF and PACF after seasonal differencing of 1. All ACF and PACF
lags fall below significance levels, indicating that autocorrelation has
been eliminated.
Using the iterative approach of checking model values via
JMP, I found that the lowest values of Akaike’s Information Criterion (AIC),
Schwarz’s Bayesian Criterion, and the -2LogLikelihood for my data set are
obtained with an ARIMA (0,0,0)(1,1,1). According to Duke’s Rule 8, it is
possible for an AR term and an MA term to cancel each other out. They
suggested that I try a model with one fewer AR term and one fewer MA term,
particularly if it takes more than 10 iterations for the model to converge.
My model took 6 iterations to converge.
Duke’s Rule 12 states that if a series has a strong and
consistent seasonal pattern, never use more than one order of
seasonal
differencing or more than 2 orders of total differencing
(seasonal + nonseasonal). Rule
13 states that if the autocorrelation at the seasonal period is
positive,
consider adding an SAR term, and if negative try adding an
SMA term to the
model. Do not mix SAR and SMA terms in
the same model.
Duke’s rules for seasonality suggest that I not accept a
mixed model as the best-fit model for my data.
I eliminated the AR and MA terms, but that model yielded a
higher value
of AIC, Schwarz’s Bayesian Criterion, and a much higher value
of the
-2LogLikelihood. I also successively
eliminated the AR or the MA term while leaving the other term
in, but still got
higher values for all test parameters. Based
on the parameter values, I believe that the ARIMA
(0,0,0)(1,1,1)
is the best model for my data.
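For readers who want to see how such comparisons work numerically: for least-squares fits, AIC is often computed as n·ln(RSS/n) + 2k, where k is the number of estimated parameters, and the smaller value wins. The residual sums of squares below are hypothetical, not the values JMP produced:

```python
import math

def aic_from_rss(n, rss, n_params):
    """Akaike's Information Criterion for a least-squares model:
    AIC = n * ln(RSS / n) + 2 * k. Smaller is better."""
    return n * math.log(rss / n) + 2 * n_params

# Hypothetical candidate models fit to the same 76 observations:
n = 76
candidates = {
    "ARIMA(0,0,0)(1,1,1)": aic_from_rss(n, rss=260.0, n_params=2),
    "ARIMA(1,0,1)(1,1,1)": aic_from_rss(n, rss=255.0, n_params=4),  # barely better fit
}
best = min(candidates, key=candidates.get)
```

The slightly better fit of the larger model is not worth its two extra parameters, which is the parsimony principle AIC encodes.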
Table 2. Parameter estimates of the most likely SARIMA models. Model #4,
SARIMA (0,0,0)(1,1,1), has the lowest variance, AIC, SBC, RSquare, and
-2LogLH. About 20 models were tested; these four had the lowest scores.
[Table of parameter estimates not reproduced.]
I have demonstrated best-fitting an ARIMA model to a time
series using description and explanation phases of time series
analysis. If I were to continue with this exercise, I
could use this model to predict precipitation for the next year or
two. Most software programs are capable of
extrapolating values based on previous patterns in the data set.
This topic is covered in textbooks.
There are numerous books, websites, and software programs
available for working with time series.
I found that most of the books that were solely dedicated to
time series
were quite dense with formulas, thus difficult to understand.
Some websites were somewhat easier to
understand but only a couple offered a step-by-step process to
guide you
through an analysis. I used just one
software program, JMP, and used the help guide extensively.
The help guide was useful in understanding
the generated graphs, but offered definitions without
elaboration as to how to interpret the defined data. If you are going to
analyze a time series, I suggest using multiple resources, and, especially
if you are new to time series analysis (as I am), find a knowledgeable
person who can help you with the interpretation of your results.
Books:

Brockwell, P.J., & Davis, R.A. (2002). Introduction to time series and forecasting (2nd ed.). Springer, New York.
If the CD-ROM is available, this text will walk you through many analyses.

Box, G.E.P., Jenkins, G.M., & Reinsel, G.C. (1994). Time series analysis: Forecasting and control (3rd ed.). Prentice Hall, Englewood Cliffs, NJ.
These guys wrote the book on ARIMA processes.

Chatfield, C. (2004). The analysis of time series: An introduction (6th ed.). Chapman and Hall, London, UK.
This book is pretty understandable, though still lots of formulas.

Glass, G.V., Willson, V.L., & Gottman, J.M. (1975). Design and analysis of time-series experiments. Colorado Associated University Press, Boulder, CO.
An excellent discussion of problems and solutions to ARIMA techniques.

Klein, J.L. (1997). Statistical visions in time: A history of time series analysis, 1662–1938. Cambridge University Press, New York.
An interesting read about time series from a historical perspective.

Tabachnick, B.G., & Fidell, L.S. (2001). Using multivariate statistics (4th ed.). Allyn and Bacon, Needham Heights, MA.
The time series chapter is understandable and easily followed.
Websites:

http://www.duke.edu/~rnau/411home.htm
This is the best website that I found in my web searches. It is a step-by-step guide to understanding many aspects of time series, including a series of ‘rules’ to use when analyzing your data.

http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
An introduction to time series analysis from an engineering point of view, with two worked examples. Very helpful.

http://home.ubalt.edu/ntsbarsh/Business-stat/stat-data/Forecast.htm
Extensive website with LOTS of useful information once you get through the business talk. Has applets for determining stationarity, seasonality, mean, variance, etc.

http://www.statsoftinc.com/textbook/stathome.html
Useful for definitions; would be great if they had examples of actual analyses.

http://www.bized.ac.uk/timeweb/index.htm
Step-by-step explanation of time series analysis, including examples of how to use Excel to adjust for seasonality and analyze the data by using linear regression, all in the Crunching section.

http://www.sagepub.com/Home.aspx
Type in "time series" in the product search to see available books that are short but sweet.

http://www.wrh.noaa.gov/cnrfc/monthly_precip.php
Website for my precipitation data.

http://www.jmp.com/
Website for the software package that I used in this presentation.

http://www.spss.com/
Extensive and easy to use statistical software package.

http://www.r-project.org/
Free software for analyzing time series data sets, but you need to code.

http://www.wessa.net/
Free statistics and forecasting software (didn’t try it out, so can’t say how good it is).
Graph Based Framework for Time Series Prediction. Yadav & Toshniwal.
TRIM 7 (2), July–Dec 2011.
Graph Based Framework for Time Series Prediction
Vivek Yadav*
Durga Toshniwal**
Abstract
Purpose: A time series comprises a sequence of observations ordered in
time. A major task of data mining with regard to time series data is
predicting future values. In time series there is a general notion that
some aspect of past patterns will continue in the future. Existing time
series techniques fail to capture the knowledge present in databases needed
to make good predictions of future values.
Design/Methodology/Approach: Graph matching techniques are applied to time
series data in this paper.
Findings: The study found that the use of graph matching techniques on
time-series data can be useful for finding hidden patterns in a time series
database.
Research Implications: The study motivates mapping time series data to
graphs and using existing graph mining techniques to discover patterns from
time series data, then using the derived patterns for making predictions.
Originality/Value: The study maps time-series data as graphs and uses graph
mining techniques to discover knowledge from time series data.
Keywords: Data mining; Time Series Prediction; Graph Mining; Graph Matching
Paper Type: Conceptual
Introduction
Data mining is the process of extracting meaningful and potentially useful
patterns from large datasets. Nowadays, data mining is becoming an
increasingly important tool used by modern business processes to transform
data into business intelligence, giving business processes an informational
advantage to make their strategic business decisions based on past observed
patterns rather than on intuitions or beliefs (Clifton, 2011). A graph based
framework for time series prediction is a step towards exploring a new,
efficient approach for time series prediction where predictions are based on
patterns observed in the past.
Time series data consist of sequences of values or events obtained over
repeated instances of time. Mostly these values or events are collected at
equally spaced, discrete time intervals (e.g., hourly, daily, weekly,
monthly, yearly). When observations over time are made on only one
variable, the series is called a univariate time series. Data mining on
time-series data is popular in many
time series. Data mining on Time-series data is popular in many
applications, such as stock market analysis, economic and sales
forecasting, budgetary analysis, utility studies, inventory
studies, yield
* Department of Electronics & Computer Engineering, IIT
Roorkee.
email: [email protected]
** Assistant Professor. Department of Electronics & Computer
Engineering, IIT Roorkee.
email: [email protected]
D
Graph Based Framework for Time Series Prediction Yadav*&
Toshniwal
TRIM 7 (2) July - Dec 2011 75
projections, workload projections, process and quality control,
32. observation of natural phenomena (such as atmosphere,
temperature,
wind, earthquake), scientific and engineering experiments, and
medical
treatments (Han & Kamber, 2006).
A time series dataset consists of values {Y1, Y2, Y3, …, Yt}, where each Yi
represents the value of the variable under study at time i. One of the
major goals of data mining on time series is forecasting the time series,
i.e., predicting the future value Yt+1. The successive observations in a
time series are statistically dependent on time, and time series modeling
is concerned with techniques for the analysis of such dependencies. In time
series analysis, a basic assumption is made that some aspect of past
patterns will continue in the future. Under this assumption, time series
prediction is assumed to be based on past values of the main variable Y.
Time series prediction can be useful in planning and in measuring the
performance of predicted values against actual observed values of the main
variable Y.
Time series modeling is advantageous for forecasting because historical sequences of observations on the main variable are readily available: they are recorded as past observations and can be purchased or gathered from published secondary sources. In time series modeling, the prediction of values for future periods is based on the pattern of past values of the variable under study, but the model does not generally account for explanatory variables which may have affected the system. There are two reasons for resorting to such models. First, the system may not be understood, and even if it is understood it may be extremely difficult to measure the cause-and-effect relationships of parameters affecting the time series. Second, the main concern may be only to predict the next value and not to explicitly know why it was observed (Box, Jenkins & Reinsel, 1976).
Time series analysis uses four major components to characterize time series data (Madsen, 2008). First, the trend component indicates the general direction in which the series is moving over a long interval of time, denoted by T. Second, the cyclic component refers to cycles, that is, long-term oscillations about a trend line or curve, which may or may not be periodic, denoted by C. Third, the seasonal component captures systematic, calendar-related variation, denoted by S. Fourth, the random component characterizes the sporadic motion of the series due to random or chance events, denoted by R. Time series modeling is also referred to as the decomposition of a time series into these four basic components. The time series variable Y at time t can be modeled as the product of the four components at time t, i.e., Yt = Tt × Ct × St × Rt, using the multiplicative model proposed by Box, Jenkins and Reinsel (1970), where Tt is the trend component at time t, Ct the cyclic component, St the seasonal component, and Rt the random component. As an alternative, an additive model (Balestra & Nerlove, 1966; Bollerslev, 1987) can also be used, in which Yt = Tt + Ct + St + Rt, with the symbols having the same meanings as above. Since the multiplicative model is the most popular, we will use it for the time series decomposition. An example of time series data is the airline passenger data set (Fig. 1), in which the main variable Y is the number of passengers (in thousands) in an airline, recorded monthly from January 1949 to December 1960. Clearly, the time series is affected by an increasing trend and by seasonal and cyclic variations.
Fig. 1: Time series of the airline passenger data from 1949 to 1960, recorded monthly.
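The multiplicative decomposition above can be illustrated numerically. All component values below are hypothetical, chosen only to show how the four factors combine into a single observation Yt:

```python
# Multiplicative decomposition: Yt = Tt * Ct * St * Rt.
# All component values below are hypothetical, for illustration only.
T_t = 200.0  # trend: general level of the series at time t (thousands)
C_t = 1.05   # cyclic factor: 5% above the long-term trend line
S_t = 1.20   # seasonal factor: a peak travel month
R_t = 0.98   # random factor: small chance fluctuation

Y_t = T_t * C_t * S_t * R_t
print(round(Y_t, 2))  # -> 246.96
```

Under the additive model the same components would instead be summed: Yt = Tt + Ct + St + Rt.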
Review of Literature
In time series analysis there is an important notion of de-seasonalizing the time series (Box & Pierce, 1970). It rests on the assumption that if the time series has a seasonal pattern of L periods, then taking a moving average Mt of L periods yields the mean value for the year, which is free of seasonality and contains little randomness (owing to averaging). Thus Mt = Tt × Ct (Box, Jenkins & Reinsel, 1976). To determine the seasonal component, one simply divides the original series by the moving average, i.e., Yt/Mt = (Tt × Ct × St × Rt)/(Tt × Ct) = St × Rt. Averaging over months eliminates randomness and yields the seasonal component St. The de-seasonalized time series can then be computed as Yt/St.
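A minimal sketch of this de-seasonalizing procedure, using a hypothetical quarterly series (L = 4) with an upward trend. The toy numbers are not from the airline data, and the alignment of each moving average to its window is simplified here:

```python
# Hypothetical quarterly series (L = 4): an upward trend of +1 per period
# with a repeating 4-period seasonal pattern. Not the airline data.
L = 4
Y = [10.0, 20.0, 15.0, 5.0, 14.0, 24.0, 19.0, 9.0, 18.0, 28.0, 23.0, 13.0]

# M_t: L-period moving average, approximating Tt * Ct (seasonality averages out).
M = [sum(Y[t:t + L]) / L for t in range(len(Y) - L + 1)]

# Y_t / M_t isolates St * Rt; averaging the ratios per season removes Rt.
seasonal = []
for s in range(L):
    ratios = [Y[t] / M[t] for t in range(s, len(M), L)]
    seasonal.append(sum(ratios) / len(ratios))

# De-seasonalized series: Y_t / S_t.
deseason = [Y[t] / seasonal[t % L] for t in range(len(Y))]
```

With the toy series the moving averages come out as a clean rising line (12.5, 13.5, …, 20.5), which is exactly the trend-plus-cycle part the text denotes Tt × Ct.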
The approach described in Box et al. (1976) for predicting the time series uses regression to fit a curve to the de-seasonalized series using the least squares method. To predict values, the model projects the de-seasonalized series into the future using regression and multiplies it by the seasonal component. The least squares method is explained in Johnson and Wichern (2002).
Exponential smoothing (Shumway & Stoffer, 1982) extends the above method to make more accurate predictions. It forecasts by weighting the most recent observation Yt by α and the most recent forecast Ft by (1 − α), where 0 ≤ α ≤ 1. Thus the forecast is given by Ft+1 = α·Yt + (1 − α)·Ft. The optimal α is chosen as the one yielding the smallest MSE (mean squared error) during training.
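A sketch of this smoothing and α-selection step, on a hypothetical series. Seeding the first forecast with the first observation is a common convention, not something prescribed here:

```python
# Simple exponential smoothing: F_{t+1} = a * Y_t + (1 - a) * F_t, 0 <= a <= 1.
def smooth(Y, a):
    F = [Y[0]]  # seed the first forecast with the first observation
    for y in Y[:-1]:
        F.append(a * y + (1 - a) * F[-1])
    return F  # one-step-ahead forecasts, aligned with Y

def mse(Y, F):
    return sum((y - f) ** 2 for y, f in zip(Y, F)) / len(Y)

# Hypothetical series; grid-search a for the smallest mean squared error.
Y = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0]
best_a = min((a / 100 for a in range(1, 100)), key=lambda a: mse(Y, smooth(Y, a)))
```

The grid search over α in steps of 0.01 stands in for whatever optimization the original training used.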
ARIMA (Auto-Regressive Integrated Moving Average) models have also been proposed (Box et al., 1970, 1976; Hamilton, 1989). An ARIMA model is specified as ARIMA(p, d, q), where p denotes the order of auto-regression, d the order of differencing, and q the order of the moving average; fitting consists of finding the values of p, d and q that best fit the data. Zhang (2003) proposed a hybrid ARIMA and neural network model for time series forecasting.
Proposed Work: Graph Based Framework for Time Series Prediction
In this paper, I propose a graph-based framework for time series prediction. The motivation for using graphs is to capture the tacit historical patterns present in the dataset. The idea behind creating a graph over a time series is to exploit two facts. First, some aspect of the time series pattern will continue in the future, and a graph is a data structure well suited to modeling a pattern. Second, similarity can be computed between graphs to identify similar patterns and their order of occurrence. Thus, a graph is created to store a pattern over the time series and to make predictions based on the similarity of the observed pattern to historical data, as an alternative to regression and curve fitting. The major shortcoming of regression and curve fitting is that they require expert knowledge of the curve equation and the number of parameters in it. With too many parameters the model overfits; with too few, it underfits (Han & Kamber, 2006). The complete pattern in a time series is not known initially, and it is affected by the random component, which makes regression harder; hence deciding the curve equation and the number of parameters in it is a major issue.
To further explore the concept of a pattern, let there be a monthly time series of N years whose first observation is in the first month of year k: Data = {Y1(k) Y2(k) … Y12(k), Y1(k+1) Y2(k+1) … Y12(k+1), …, Y1(k+N) Y2(k+N) … Y12(k+N)}, where Y1(k) is the value of the variable under study for the first month of year k and Y12(k+N) is its value for the twelfth month of year k+N. In general, let d be the time interval that makes up a pattern: if a pattern is to be stored yearly and data is available monthly, d = 12; if data is available quarterly, d = 4; and so on. The successor of an observation Yij (month i of year j), with 1 ≤ i ≤ 12 and k ≤ j ≤ k+N, is Yi'j', where i' = i+1 and j' = j if i < 12, and i' = 1 and j' = j+1 otherwise. A graph over each successive window of d observations is created to store the pattern; this is called the 'last-pattern-observed-graph'. To make predictions, we also store in a second graph the knowledge of how the last observed pattern affected the next observation; this is called the 'knowledge-graph'. For example, given the data above, the last-pattern-observed-graph for January of year k+1 is generated from {Y1(k) Y2(k) … Y12(k)}, and the knowledge-graph for January of year k+1 is generated from {Y1(k) Y2(k) … Y12(k), Y1(k+1)}. The knowledge-graph is created with the intuition of capturing how the variable under study changed over the last d observations and how this affected the (d+1)-th observation.
For time series data, the graph is created with the motivation of modeling each observation as a vertex and representing the effect of variation between observations over time as edges. The number of vertices in the graph equals the time interval over which a pattern is to be stored. The edges account for the effect of each observation on the others: since past values affect future values but not vice versa, edges are created from each vertex to all subsequent observations, weighted by the change in angle with the horizontal. The generated graphs can be represented in memory either by an adjacency matrix or by an adjacency list (Cormen, 2001). I have used the adjacency list representation to save memory: each graph has n(n−1)/2 edges, so the space required is n(n−1)/2 with an adjacency list, compared with n² for an adjacency matrix.
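One way to sketch this construction, assuming (as the text suggests) that the edge from observation i to a later observation j is weighted by the angle with the horizontal of the segment joining the two points. The function name and the choice of degrees are conventions of this sketch:

```python
import math

def pattern_graph(values):
    """Adjacency list: vertex i -> [(j, angle)] for every later observation j,
    where angle is the inclination (degrees) of the segment joining
    (i, values[i]) and (j, values[j])."""
    n = len(values)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # rise over run of the segment, expressed as an angle
            angle = math.degrees(math.atan2(values[j] - values[i], j - i))
            adj[i].append((j, angle))
    return adj

# A 4-observation window yields n(n-1)/2 = 6 directed edges.
g = pattern_graph([112.0, 118.0, 132.0, 129.0])
edge_count = sum(len(edges) for edges in g.values())
```

Storing only the edge lists matches the n(n−1)/2 space figure given above, since no cell is kept for absent (backward) edges.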
The dataset of N tuples is partitioned into two sets: the first m tuples for training, and the remaining N−m tuples for validation of the model. During the training phase, a knowledge-graph is generated over each successive window of d+1 observations across the m training tuples (with month indices wrapping from 12 back to 1 as the year advances); thus m−12 knowledge-graphs are generated. These graphs are partitioned into d sets (here d = 12), where each graph is stored under the interval whose knowledge it captures (i.e., the graphs for all Januaries are stored together, all Februaries together, etc.). To implement this we have used an array of size d of linked lists of graphs.
Each linked list stores all the knowledge-graphs corresponding to the interval whose knowledge they represent. The graphs are partitioned to ease the search: when making a prediction, the model queries for all patterns observed w.r.t. a particular month, and since the graphs are already stored in partitioned form, this query executes in O(1) time.
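The partitioned store can be sketched as an array of d buckets of graph lists; the helper names and placeholder graph objects below are hypothetical:

```python
# d buckets, one per month for monthly data (d = 12); bucket m holds every
# knowledge-graph whose prediction target falls in month m.
d = 12
buckets = [[] for _ in range(d)]

def store(month_index, graph):
    buckets[month_index % d].append(graph)

def candidates(month_index):
    # all stored patterns for this month, retrieved in O(1) without a scan
    return buckets[month_index % d]

store(0, {"label": "pattern ending Jan, year k"})
store(0, {"label": "pattern ending Jan, year k+1"})
store(1, {"label": "pattern ending Feb, year k"})
```

A Python list plays the role of the array of linked lists; either container gives the same constant-time bucket lookup by month index.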
To predict the next value in the time series, the model takes the last d known observations preceding the month for which the prediction is to be made and computes the last-pattern-observed-graph. The model then searches, among the knowledge-graphs stored in the partition for that month, for the one most similar to the last-pattern-observed-graph, considering only as many vertices of each knowledge-graph as the last-pattern-observed-graph has. To compute the similarity between two graphs, the graph edit distance technique is used (Brown, 2004; Bunke & Riesen, 2008). The key idea of the graph edit distance approach is to model structural variation by edit operations reflecting modifications in structure and labeling. A standard set of edit operations is given by insertions, deletions, and substitutions of both nodes and edges. When calculating the graph edit distance between time-series graphs g1 (source) and g2 (destination), only substitutions of edges (changes in angle) are required to make g2 similar to g1, and the costs incurred by the edit operations are summed. The graph with the least edit cost is the most similar and is selected as the basis of the prediction.
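Since all graphs here share the same vertex set and edge structure by construction, the edit distance reduces to summing edge-substitution costs. A sketch, assuming the cost of substituting an edge is the absolute difference of the two angles, with hand-built toy graphs in the adjacency-list form (vertex -> list of (target, angle)):

```python
def edit_distance(g1, g2):
    """Sum of absolute angle differences over g1's edges, matched against
    the corresponding edges of g2 (g2 may have extra vertices, ignored)."""
    cost = 0.0
    for i, edges in g1.items():
        g2_edges = dict(g2[i])  # target -> angle for the same source vertex
        for j, angle in edges:
            cost += abs(angle - g2_edges[j])
    return cost

def most_similar(g1, knowledge_graphs):
    return min(knowledge_graphs, key=lambda g: edit_distance(g1, g))

# Toy graphs; gA and gB carry one extra vertex (the known next observation),
# which the truncated comparison simply never visits.
g1 = {0: [(1, 10.0), (2, 20.0)], 1: [(2, 15.0)], 2: []}
gA = {0: [(1, 12.0), (2, 20.0), (3, 18.0)],
      1: [(2, 15.0), (3, 16.0)], 2: [(3, 17.0)], 3: []}
gB = {0: [(1, 40.0), (2, 0.0), (3, 30.0)],
      1: [(2, 5.0), (3, 30.0)], 2: [(3, 30.0)], 3: []}
best = most_similar(g1, [gA, gB])  # gA, at edit cost 2.0 versus 60.0
```

Restricting the operations to edge substitutions is what the text describes; a general graph edit distance would also price node and edge insertions and deletions.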
To make the prediction, the model takes into account the structural difference between the two graphs in a vertex-ordered, weighted-average manner. To predict from graph g1 (the last-pattern-observed-graph) using graph g2 (the knowledge-graph most similar to g1), every vertex in g1 predicts the angle between itself and the predicted value using the knowledge of g2, taking into account the differences between its edges and those of its corresponding vertex in g2 in a weighted-average manner, where edge differences to vertices closer to the value being predicted are given more weight (a way of applying exponential smoothing within the graph-based approach). Each vertex thus predicts an angle, and the predicted value is the average of the values predicted by the vertices. Once the actual value is observed, a new knowledge-graph is generated to capture the pattern corresponding to the last observation; in this way the model learns iteratively.
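A sketch of this per-vertex prediction step. The recency weighting below (weights growing with the edge target's index) is one plausible reading of the exponential-smoothing idea described above, and converting a predicted angle back to a value via the tangent over the time gap is likewise an assumption of this sketch:

```python
import math

def predict(values, g1, g2):
    """values: the last d observations; g1: their pattern graph;
    g2: the most similar knowledge-graph, whose extra vertex n is the
    observation being predicted. Returns the averaged forecast."""
    n = len(values)
    per_vertex = []
    for i in range(n):
        g1_edges = dict(g1[i])
        g2_edges = dict(g2[i])
        base = g2_edges[n]  # angle from vertex i to the target, as seen in g2
        targets = sorted(g1_edges)
        if targets:
            # recency weighting: differences on edges to later vertices count more
            weights = [2.0 ** j for j in targets]
            correction = sum(w * (g1_edges[j] - g2_edges[j])
                             for w, j in zip(weights, targets)) / sum(weights)
        else:
            correction = 0.0
        angle = base + correction
        # turn the predicted angle back into a value across the time gap n - i
        per_vertex.append(values[i] + (n - i) * math.tan(math.radians(angle)))
    return sum(per_vertex) / len(per_vertex)

# Toy linear case: values rise by 1 per step, so every angle is 45 degrees
# and every vertex should point at the next value on the line.
g1 = {0: [(1, 45.0), (2, 45.0)], 1: [(2, 45.0)], 2: []}
g2 = {0: [(1, 45.0), (2, 45.0), (3, 45.0)],
      1: [(2, 45.0), (3, 45.0)], 2: [(3, 45.0)], 3: []}
forecast = predict([1.0, 2.0, 3.0], g1, g2)
```

On the linear toy case all corrections vanish and each vertex lands on the continuation of the line, so the averaged forecast is 4.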
Experimental Results
The code implementing the graph-based time series prediction approach discussed above is written in Java. The approach was applied to the airline passenger data set, first used in Brown (1962) and then in Box et al. (1976). It records the number of airline passengers, in thousands, observed monthly between January 1949 and December 1960. I have used two years of data (1949 and 1950) for training and estimated the remaining data on a monthly basis, learning iteratively as each observation is recorded.
Fig. 2 shows the actual and predicted numbers of passengers using the graph-based framework applied directly to the airline passenger time series, and Fig. 3 the corresponding monthly percentage error; the average percentage error recorded on the time series is 7.05. Fig. 4 shows the actual and predicted numbers of passengers using the framework applied to the de-seasonalized time series (computed using the moving-average concept), and Fig. 5 the corresponding monthly percentage error; the average percentage error on the de-seasonalized series is 5.81.
Fig. 2: Actual and predicted number of passengers using the graph-based framework applied to the airline passenger time series (APTS).
Fig. 3: Percentage error between actual and predicted values using the graph-based framework applied to the airline passenger time series (APTS).
Fig. 4: Actual and predicted number of passengers using the graph-based framework applied to the de-seasonalized airline passenger time series (APTS).
Fig. 5: Percentage error between actual and predicted values using the graph-based framework applied to the de-seasonalized airline passenger time series (APTS).
Conclusion & Discussion
A new graph-based approach for time series prediction has been proposed and implemented. The results show that the graph-based framework achieves 94.19 percent accuracy on the de-seasonalized airline passenger time series (computed using the moving-average concept) and 92.95 percent accuracy on the direct time series. The accuracy on the de-seasonalized series is better because that series contains only the cyclic and trend factors, whereas the direct series contains all four factors (cyclic, trend, seasonal, and random), which makes prediction more difficult. Thus, applying the graph-based framework in conjunction with the moving average offers good accuracy.
The graph-based framework for time series prediction incorporates the concepts of exponential smoothing, moving averages, and graph mining to enhance its accuracy. It is a good alternative to regression: in the proposed approach there is no need for expert domain knowledge of the curve equation and the number of its parameters. The results validate that the new approach has a good accuracy rate.
References
Balestra, P., & Nerlove, M. (1966). Pooling cross section and time series data in the estimation of a dynamic model: The demand for natural gas. Econometrica, 34(3), 585-612.
Bollerslev, T. (1987). A conditionally heteroskedastic time series model for speculative prices and rates of return. The Review of Economics and Statistics, 69(3), 542-547.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1970). Time series analysis. Oakland, CA: Holden-Day.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1976). Time series analysis: Forecasting and control. San Francisco, CA: Holden-Day.
Box, G. E. P., & Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65(332), 1509-1526.
Brown, R. G. (1962). Smoothing, forecasting and prediction of discrete time series. Englewood Cliffs, NJ: Prentice-Hall.
Brown, R. G. (2004). Smoothing, forecasting and prediction of discrete time series. Mineola, NY: Dover Publications.
Bunke, H., & Riesen, K. (2008). Graph classification based on dissimilarity space embedding. In N. da Vitoria Lobo, T. Kasparis, F. Roli, J. Kwok, M. Georgiopoulos, G. Anagnostopoulos & M. Loog (Eds.), Structural, Syntactic, and Statistical Pattern Recognition (Vol. 5342, pp. 996-1007). Berlin/Heidelberg: Springer.
Clifton, C. (2011). Data mining. In Encyclopaedia Britannica. Retrieved from http://www.britannica.com/EBchecked/topic/1056150/data-mining
Cormen, T. H. (2001). Introduction to algorithms. Cambridge, MA: The MIT Press.
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2), 357-384.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.
Johnson, R. A., & Wichern, D. W. (2002). Applied multivariate statistical analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Madsen, H. (2008). Time series analysis. Boca Raton, FL: Chapman and Hall/CRC Press.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4), 253-264.
Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159-175. doi:10.1016/s0925-2312(01)00702-0