The document discusses time series analysis and provides examples of time series data. It begins by defining time series as data that is measured at successive points in time, such as stock prices, weather data, and software errors. Examples are given of carbon dioxide levels measured at Mauna Loa, long-distance phone call prices in the US, and the Nikkei Stock Index. The document then discusses common components of time series like trends, seasonality, and noise. Methods for smoothing time series data like running averages and exponential smoothing are also introduced.
2. http://publicationslist.org/junio
What is it about?
Time series are an incredibly common kind of data
Stock market
CPU utilization
Meteorology - daily rainfall, wind speed, and temperature
Sociology - crime figures, employment figures
Software engineering – number of errors
Networks – number of nodes and edges
First examples
Consider a data set with the concentration (ppm) of carbon
dioxide (CO2) in the atmosphere, as measured by the
observatory on Mauna Loa on Hawaii, recorded at monthly
intervals since 1959
The plot shows two common features in time series:
Trend: a steady, long-term linear growth
Seasonality: a regular periodic pattern, on a 12-month cycle
First examples
Consider the data set with the price of long-distance phone
calls in the US over the last century
The plot shows a strong nonlinear trend
The single-log plot (inset) shows that the data fall on a straight line on a logarithmic scale, indicating exponential behavior – a usual pattern for growth/decay processes
This example asks for closer inspection:
• Has the long-distance call service changed over time?
• Were the prices adjusted for inflation?
• What explains the uncharacteristically low prices for a couple of years in the late 1970s? Did the breakup of the AT&T system have anything to do with it?
First examples
Consider the data set with the development of the Japanese
stock market as represented by the Nikkei Stock Index over
the last 40 years shown with a 31-point Gaussian smoothing
filter
The plot shows a change in behavior after 1990 (the big Japanese bubble), after which a long-term increasing trend turned into an oscillatory decreasing trend
The seasonality also changed significantly after that point
First examples
Consider a data set with the number of daily calls placed in a
call center for a time period slightly longer than two years
This example is way more challenging with its complex structure
Actually, it is not clear whether the high-frequency variation in the plot is noise or has some form of regularity
In an initial analysis, not many conclusions can be drawn from the plot – apparently no trend, no seasonality, and no change in behavior
As time-series analysis commonly relies on long-term data, it is important to verify that data acquisition was homogeneous over the whole period; otherwise the series may change its behavior in ways that are hard to make sense of
Main components
As we have seen, the main components observed are:
Trend: linear or non-linear, with a characteristic magnitude
Seasonality: additive, for example, every 12 months the sales
increase by 3 million; or multiplicative, for example, every 12
months the sales increase by 1.4 times what was observed in the last
cycle
Noise: some form of random variation, quite common
Other: change in behavior, special outliers, missing data, and anything
remarkable
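To make these components concrete, here is a minimal Python sketch that generates a synthetic series as a sum of a linear trend, an additive seasonal cycle, and Gaussian noise (the function name and parameter values are invented for illustration):

```python
import math
import random

def synthetic_series(n, slope=0.5, amplitude=3.0, period=12, noise=1.0):
    """Synthetic series = linear trend + additive seasonality + Gaussian noise."""
    random.seed(0)  # fixed seed so the noise is reproducible
    return [slope * i                                         # trend
            + amplitude * math.sin(2 * math.pi * i / period)  # seasonality
            + random.gauss(0, noise)                          # noise
            for i in range(n)]
```

Multiplying the factors instead of summing them would give a multiplicative series.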
Assumptions
Standard methods of time-series analysis make a number of assumptions, all of which are frequently violated in real-world scenarios:
Data points have been taken at equally spaced time steps, with
no missing data points: demands interpolation in case of missing
points, or re-sampling in case of insufficient sampling
The time series is sufficiently long (at least 50 points): requires
smoothing methods to define a continuous curve, even where
there are no points
The series is stationary, it has no trend, no seasonality, and the
character (amplitude and frequency) of any noise does not change
with time: may require breaking the series into multiple
segments to be analyzed separately
Smoothing
Just as with two-variable data, it is useful to fit a curve
according to the available data (actually, a time series is a
special case of two-variable data)
Smoothing helps in:
Reducing noise
Interpolating missing/insufficient values
Running averages
The method known as running (moving, rolling, or floating) average is straightforward: for any odd number of consecutive points, replace the centermost value with the average of all points in the window
The smoothed point s_i is given by:
s_i = 1/(2k+1) · Σ_{j=−k..+k} x_{i+j}
where the x_i are the data points
For example, for a 5-point (k=2) moving average, consider point x_10 = 4 and its neighbors x_8 = 4, x_9 = 7, x_11 = 2, x_12 = 9, so
s_10 = (4+7+4+2+9)/5 = 26/5 = 5.2
And so forth for any point
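The running average above can be sketched in a few lines of Python (the function name is mine; edge points without a full window are simply left unsmoothed):

```python
def running_average(x, k):
    """(2k+1)-point running average; edge points are left as-is."""
    s = list(x)
    for i in range(k, len(x) - k):
        s[i] = sum(x[i - k:i + k + 1]) / (2 * k + 1)
    return s
```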
Weighted running averages
Running averages do not work well in the presence of outliers, which may distort the curve
The weighted running average technique lessens this problem by using weights that give more importance to points at the center of the moving window
The weights w_j can be defined manually; for instance, for a 5-point window they could be (1/9, 2/9, 1/3, 2/9, 1/9)
Or they can be defined by a function, in which case the Gaussian is the usual first choice
In either case, the weights must be peaked at the center, drop toward the edges, and add up to 1
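A minimal sketch of a Gaussian-weighted running average in Python (function names and the sigma default are my choices for illustration):

```python
import math

def gaussian_weights(k, sigma=1.0):
    """Weights peaked at the center, dropping toward the edges, summing to 1."""
    w = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(-k, k + 1)]
    total = sum(w)
    return [v / total for v in w]

def weighted_running_average(x, w):
    """Apply a (2k+1)-point weighted window; edge points are left as-is."""
    k = len(w) // 2
    s = list(x)
    for i in range(k, len(x) - k):
        s[i] = sum(w[j + k] * x[i + j] for j in range(-k, k + 1))
    return s
```

Because the weights sum to 1, a constant series passes through unchanged.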
Running averages
For example: considering synthetic data (filled line) and an 11-
point moving average
The plot shows that the simple technique can represent the data reasonably well, but whenever an outlier (spike) appears, the curve is abruptly distorted until the outlier leaves the window
The weighted version of the technique gives better results: instead of abrupt distortions, it shows smoothed peaks that point out the original outliers
Single exponential smoothing
Running averages are intrinsically local and may not capture the global
behavior of the series
An improved method is exponential smoothing, which, in its single form, starts from a simple recursive definition:
s_i = α·x_i + (1 − α)·s_{i−1}
with 0 ≤ α ≤ 1, and s_0 = x_0 (or s_0 = (1/n)·Σ_{j=0..n−1} x_j, averaging the n initial values)
That is, the i-th smoothed point is a mix between the actual point x_i and the previous smoothed point s_{i−1}, where α can be chosen by trial and error
By mathematical induction, this recursion leads to the exponential expression:
s_i = α·Σ_{j=0..i−1} (1 − α)^j·x_{i−j} + (1 − α)^i·x_0
which provides any smoothed value s_i as a function of all the previous values of x
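The recursion is short enough to sketch directly in Python (seeding with s_0 = x_0, as above; the function name is mine):

```python
def exp_smooth(x, alpha):
    """Single exponential smoothing: s_i = alpha*x_i + (1-alpha)*s_{i-1},
    seeded with s_0 = x_0."""
    s = [x[0]]
    for xi in x[1:]:
        s.append(alpha * xi + (1 - alpha) * s[-1])
    return s
```

The two extremes show the role of α: α = 1 reproduces the series, α = 0 freezes it at the seed.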
Single exponential smoothing
The single exponential smoothing provides good smoothing curves and,
for some cases, forecasting
It is limited, though, for series that present trend or seasonality, situations in which the technique cannot be accurately used for prediction
There are two more advanced exponential smoothing techniques:
Double exponential smoothing, for series with trend but without seasonality
Triple exponential smoothing, for series with trend and seasonality; this technique is known as the Holt–Winters method
The Holt–Winters method is a powerful technique able to reproduce the full behavior of additive or multiplicative time series
Double and triple exponential smoothing
Double exponential smoothing (level s_i plus a trend factor t_i):
s_i = α·x_i + (1 − α)·(s_{i−1} + t_{i−1})
t_i = β·(s_i − s_{i−1}) + (1 − β)·t_{i−1}
Additive triple exponential smoothing (adding a seasonality factor p_i with season length L):
s_i = α·(x_i − p_{i−L}) + (1 − α)·(s_{i−1} + t_{i−1})
t_i = β·(s_i − s_{i−1}) + (1 − β)·t_{i−1}
p_i = γ·(x_i − s_i) + (1 − γ)·p_{i−L}
Forecasting: x_{i+m} ≈ s_i + m·t_i + p_{i−L+m}
Multiplicative triple exponential smoothing:
s_i = α·(x_i / p_{i−L}) + (1 − α)·(s_{i−1} + t_{i−1})
t_i = β·(s_i − s_{i−1}) + (1 − β)·t_{i−1}
p_i = γ·(x_i / s_i) + (1 − γ)·p_{i−L}
Forecasting: x_{i+m} ≈ (s_i + m·t_i)·p_{i−L+m}
Exponential smoothing depends on mixing parameters, which are required by software packages:
• Single exponential smoothing: α
• Double exponential smoothing: α and β
• Triple exponential smoothing: α, β, and γ
More on time-series analysis:
http://www.statsoft.com/textbook/time-series-analysis/
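As a sketch of double exponential smoothing (the seeding choices s_0 = x_0 and t_0 = x_1 − x_0 are one common convention among several):

```python
def double_exp_smooth(x, alpha, beta):
    """Double exponential smoothing: level s and trend t,
    seeded with s_0 = x_0 and t_0 = x_1 - x_0."""
    s, t = [x[0]], [x[1] - x[0]]
    for xi in x[1:]:
        s.append(alpha * xi + (1 - alpha) * (s[-1] + t[-1]))
        t.append(beta * (s[-1] - s[-2]) + (1 - beta) * t[-1])
    return s, t
```

An m-step-ahead forecast is then s[-1] + m * t[-1]; on a perfectly linear series this extrapolation is exact.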
Triple exponential smoothing
For example, the additive Holt–Winters plot for a dataset with the
number of US monthly international flight passengers
The years 1949 through 1957 were used to “train” the algorithm,
and the years 1958 through 1960 were forecasted
Note how well the forecast agrees with the actual data
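A compact sketch of additive Holt–Winters in Python. Initialization conventions vary between implementations; this one seeds the level with the first season's mean, the trend with the difference between the first two seasons' means, and the seasonal factors with deviations from the first season's mean:

```python
def holt_winters_additive(x, L, alpha, beta, gamma):
    """Additive Holt-Winters: level s, trend t, seasonal factors p (period L).
    Returns the one-step-ahead fitted values plus the final state."""
    s = sum(x[:L]) / L                           # initial level
    t = (sum(x[L:2 * L]) - sum(x[:L])) / L ** 2  # initial trend
    p = [x[i] - s for i in range(L)]             # initial seasonal factors
    fitted = []
    for i, xi in enumerate(x):
        fitted.append(s + t + p[i % L])          # forecast made before seeing x_i
        s_prev = s
        s = alpha * (xi - p[i % L]) + (1 - alpha) * (s + t)
        t = beta * (s - s_prev) + (1 - beta) * t
        p[i % L] = gamma * (xi - s) + (1 - gamma) * p[i % L]
    return fitted, s, t, p
```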
Autocorrelation and correlogram
As mentioned, time series are mainly characterized by trends and seasonality
Trend is analyzed by means of smoothing, function fitting (modeling), and plotting
Seasonality can benefit from the techniques of autocorrelation and the correlogram
Autocorrelation and correlogram
The correlation between two time series is obtained as follows:
For each time step, multiply the response values (y_i) of the two series, considering their deviations from the mean
Sum up all the products
Normalize
The correlation of two identical series is 1, and it is −1 for series that are exact inverses of each other
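The steps above can be sketched in Python (the function name is mine; the normalization divides by the product of the two series' standard deviations):

```python
def correlation(y1, y2):
    """Normalized correlation of two equally long series:
    1 for identical series, -1 for exactly inverted ones."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    den = (sum((a - m1) ** 2 for a in y1)
           * sum((b - m2) ** 2 for b in y2)) ** 0.5
    return num / den
```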
Autocorrelation and correlogram
Seasonality:
Formally defined as the correlation between each i-th element and
the (i+k)-th element – k is usually called the lag
Measured by the Autocorrelation Function - ACF, i.e., the correlation
between the two terms xi and xi+k
If the measurement error is not too large, seasonality can be
visually identified as a pattern that repeats every k moments in
time
Autocorrelation and correlogram
If seasonality is present, then the behavior of the series
should repeat at every k time units, where k is named lag
The problem, hence, is: how to identify analytically what is
the lag of the series?
The answer is: compare the time series with itself, shifted by increasing values (lags) of k; for each lag, calculate the correlation
Hence, the autocorrelation of a given series at lag k is given by
c(k) = Σ_i (x_i − x̄)·(x_{i+k} − x̄) / Σ_i (x_i − x̄)²
where the denominator normalizes according to lag 0, that is, to the correlation of the series with itself
Autocorrelation and correlogram
Autocorrelation basic algorithm:
1.Let k = 0
2.Start with two copies of the series (original and copy)
3.Subtract the mean from all values in both series
4.Multiply the values at corresponding time steps with each other
5.Sum up the results for all time steps
6. Normalize with the variance of the original series; this is the correlation for lag k, that is, c(k)
7. Shift the copy by 1 time step
8. Let k ← k + 1
9. Repeat from step 4 while k < kmax
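The algorithm above can be condensed into a short Python sketch (the function name is mine; the variance of the mean-subtracted series normalizes every lag, so c(0) = 1):

```python
def autocorrelation(x, kmax):
    """c(k) for k = 0..kmax-1, normalized so that c(0) = 1."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]        # subtract the mean (step 3)
    var = sum(v * v for v in d)      # normalization, lag 0 (step 6)
    return [sum(d[i] * d[i + k] for i in range(n - k)) / var  # steps 4-5, shifted by k
            for k in range(kmax)]
```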
According to this algorithm:
Initially (lag 0), the two signals are perfectly aligned and the
correlation is 1
Then, as we shift the signals they slowly move out of phase and
the correlation drops
How quickly it drops tells us how much “memory” there is
in the data:
If quickly, we know that, after a few steps, the signal has lost all
memory of its recent past
If slowly, then we know that we are dealing with a process that
is relatively steady over longer periods of time
Autocorrelation and correlogram
The correlogram is the plot of correlation versus lag for a given time series
For example: consider a data set with the number of daily calls placed in a call center for a time period slightly longer than two years – as presented earlier
[Figure: the time series and its (auto)correlogram, x axis 0 ≤ lag ≤ 500]
From the correlogram we can observe that:
The series has a long “memory” (long cycles): it takes the
correlation almost 100 days to fall to zero, indicating that the
frequency of calls changes more or less once per quarter but not
more frequently
There is a pronounced secondary peak at a lag of 365 days: the
call center data is highly seasonal and repeats itself on a yearly
basis, when the series repeats its response behavior (high
correlation)
There is a small but regular sawtooth structure; if we look
closely, we will find that the first peak of the sawtooth is at a lag of
7 days and that all repeating ones occur at multiples of 7 - this is
the signature of the high-frequency component that we see in the
plot of the series; that is, the traffic to the call center exhibits a
secondary seasonal component with 7-day periodicity, the
traffic depends on the day of the week
CO2 measurements above Mauna Loa in Hawaii
Consider again the data set with the concentration (ppm) of
carbon dioxide (CO2) in the atmosphere, as measured by the
observatory on Mauna Loa on Hawaii, recorded at monthly
intervals since 1959
CO2 measurements above Mauna Loa in Hawaii
The series can be better analyzed numerically if the horizontal axis is expressed as incremental monthly indexes and the graph passes through the origin (a vertical translation of −315)
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines
CO2 measurements above Mauna Loa in Hawaii
The series has a trend that seems to be a power law of the form b·(x/a)^k with k greater than 1, as the curve is convex; a first guess is k = 2, b = 35, and a = 350 (matching the upper rightmost part of the series)
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines, 35*(x/350)**2
CO2 measurements above Mauna Loa in Hawaii
By trial and error, a better guess for k is 1.35
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines, 35*(x/350)**1.35
CO2 measurements above Mauna Loa in Hawaii
To verify the accuracy of the model function, we can plot the residual by subtracting
the trend from the data
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315 - 35*($0/350)**1.35) with lines
CO2 measurements above Mauna Loa in Hawaii
The model seems fine except for the seasonality, which consists of regular oscillations that can be captured by a sine, since the series starts at (0,0); also, the series is monthly with a cycle of one year, so a guess is that the pattern repeats every 12 points; the amplitude is around 3, as we can observe in the former plots
We can compare the residual and our seasonality model in Gnuplot (taking f(x) = 315 + 35*(x/350)**1.35 as the trend) with:
plot "data" using 0:($2-f($0)) with lines, 3*sin(2*pi*x/12) with lines
At this point the model is given by the power-law function plus the sine
function
f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12)
plot "data" using 0:2 with lines, f(x)
which is pretty close to the actual phenomenon
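The fitted model can equally well be checked outside Gnuplot; a direct Python transcription of the same function:

```python
import math

def f(x):
    """Trend (power law) plus seasonality (sine), as fitted by eye above."""
    return 315 + 35 * (x / 350) ** 1.35 + 3 * math.sin(2 * math.pi * x / 12)
```

At x = 0 the model returns the 315 ppm baseline, and half a seasonal cycle later the sine term vanishes while the trend term is still small.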