The document discusses time series analysis and provides examples of time series data. It begins by defining time series as data that is measured at successive points in time, such as stock prices, weather data, and software errors. Examples are given of carbon dioxide levels measured at Mauna Loa, long-distance phone call prices in the US, and the Nikkei Stock Index. The document then discusses common components of time series like trends, seasonality, and noise. Methods for smoothing time series data like running averages and exponential smoothing are also introduced.
2. http://publicationslist.org/junio
What is it about?
Time series are an incredibly common kind of data
Stock market
CPU utilization
Meteorology - daily rainfall, wind speed, and temperature
Sociology - crime figures, employment figures
Software engineering – number of errors
Networks – number of nodes and edges
First examples
Consider a data set with the concentration (ppm) of carbon
dioxide (CO2) in the atmosphere, as measured by the
observatory on Mauna Loa on Hawaii, recorded at monthly
intervals since 1959
The plot shows two common features in time series:
Trend: a steady, long-term linear growth
Seasonality: a regular periodic pattern, on a 12-month cycle
First examples
Consider the data set with the price of long-distance phone
calls in the US over the last century
The plot shows a strong nonlinear trend
The single-log plot (inset) shows that the data fall on a straight line on a logarithmic scale, indicating exponential behavior – a usual pattern for growth/decay processes
This example asks for closer inspection:
• Has the long-distance call service changed over time?
• Were the prices adjusted for inflation?
• What explains the uncharacteristically low prices for a couple of years in the late 1970s? Did the breakup of the AT&T system have anything to do with it?
First examples
Consider the data set with the development of the Japanese
stock market as represented by the Nikkei Stock Index over
the last 40 years shown with a 31-point Gaussian smoothing
filter
The plot shows a change in behavior after 1990 (the big Japanese bubble), after which a long-term increasing trend turned into an oscillatory decreasing trend
The seasonality also changed significantly after that point
First examples
Consider a data set with the number of daily calls placed in a
call center for a time period slightly longer than two years
This example is way more challenging with its complex structure
Actually, it is not clear whether the high-frequency variation in the plot is noise or has some form of regularity
In an initial analysis, not many conclusions can be drawn from the plot – apparently no trend, no seasonality, and no change in behavior
As time-series analysis commonly relies on long-term data, it is important to verify that data acquisition was homogeneous over the whole period; otherwise the series may change its behavior in ways that are hard to make sense of
Main components
As we have seen, the main components observed are:
Trend: linear or non-linear, with a characteristic magnitude
Seasonality: additive, for example, every 12 months the sales
increase by 3 million; or multiplicative, for example, every 12
months the sales increase by 1.4 times what was observed in the last
cycle
Noise: some form of random variation, quite common
Other: change in behavior, special outliers, missing data, and anything
remarkable
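To make these components concrete, here is a minimal Python sketch that generates a synthetic series as a sum of a linear trend, an additive seasonal cycle, and Gaussian noise (the function name and parameter values are invented for illustration):

```python
import math
import random

def synthetic_series(n, slope=0.5, amplitude=3.0, period=12, noise=1.0):
    """Synthetic series = linear trend + additive seasonality + Gaussian noise."""
    random.seed(0)  # fixed seed so the noise is reproducible
    return [slope * i                                         # trend
            + amplitude * math.sin(2 * math.pi * i / period)  # seasonality
            + random.gauss(0, noise)                          # noise
            for i in range(n)]
```

Multiplying the factors instead of summing them would give a multiplicative series.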
Assumptions
Standard methods of time-series analysis make a number of assumptions, all of which are frequently violated in real-world scenarios:
Data points have been taken at equally spaced time steps, with
no missing data points: demands interpolation in case of missing
points, or re-sampling in case of insufficient sampling
The time series is sufficiently long (at least 50 points): requires
smoothing methods to define a continuous curve, even where
there are no points
The series is stationary, it has no trend, no seasonality, and the
character (amplitude and frequency) of any noise does not change
with time: may require breaking the series into multiple
segments to be analyzed separately
Smoothing
Just as with two-variable data, it is useful to fit a curve
according to the available data (actually, a time series is a
special case of two-variable data)
Smoothing helps in:
Reducing noise
Interpolating missing/insufficient values
Running averages
The method known as running (moving, rolling, or floating) average is straightforward: for any odd number of consecutive points, replace the centermost value with the average of all points in the window
The smoothed point s_i is given by:
s_i = 1/(2k+1) · Σ_{j=−k..+k} x_{i+j}
where the x_i are the data points
For example, for a 5-point (k=2) moving average, consider point x_10 = 4 and its neighbors x_8 = 4, x_9 = 7, x_11 = 2, x_12 = 9, so
s_10 = (4+7+4+2+9)/5 = 26/5 = 5.2
And so forth for any point
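The running average above can be sketched in a few lines of Python (the function name is mine; edge points without a full window are simply left unsmoothed):

```python
def running_average(x, k):
    """(2k+1)-point running average; edge points are left as-is."""
    s = list(x)
    for i in range(k, len(x) - k):
        s[i] = sum(x[i - k:i + k + 1]) / (2 * k + 1)
    return s
```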
Weighted running averages
Running averages do not work well in the presence of outliers, which may distort the curve
The weighted running average technique lessens this problem by using weights that give more importance to points at the center of the moving window
The weights w_j can be defined manually; for instance, for a 5-point window they could be (1/9, 2/9, 1/3, 2/9, 1/9)
Or they can be defined by a function, in which case the Gaussian is the usual first choice
In either case, the weights must be peaked at the center, drop toward the edges, and add up to 1
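A minimal sketch of a Gaussian-weighted running average in Python (function names and the sigma default are my choices for illustration):

```python
import math

def gaussian_weights(k, sigma=1.0):
    """Weights peaked at the center, dropping toward the edges, summing to 1."""
    w = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(-k, k + 1)]
    total = sum(w)
    return [v / total for v in w]

def weighted_running_average(x, w):
    """Apply a (2k+1)-point weighted window; edge points are left as-is."""
    k = len(w) // 2
    s = list(x)
    for i in range(k, len(x) - k):
        s[i] = sum(w[j + k] * x[i + j] for j in range(-k, k + 1))
    return s
```

Because the weights sum to 1, a constant series passes through unchanged.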
Running averages
For example: considering synthetic data (filled line) and an 11-
point moving average
The plot shows that the simple technique can represent the data reasonably well, but whenever an outlier (spike) appears, the curve is abruptly distorted until the outlier leaves the window
The weighted version of the technique gives better results: instead of abrupt distortions, it shows smoothed peaks that point out the original outliers
Single exponential smoothing
Running averages are intrinsically local and may not capture the global
behavior of the series
An improved method is exponential smoothing, which, in its single form, starts from a simple recursive definition:
s_i = α·x_i + (1 − α)·s_{i−1}
with 0 ≤ α ≤ 1, and s_0 = x_0 (or s_0 = (1/n)·Σ_{j=0..n−1} x_j, averaging the n initial values)
That is, the i-th smoothed point is a mix between the actual point x_i and the previous smoothed point s_{i−1}, where α can be chosen by trial and error
By mathematical induction, this recursion leads to the exponential expression:
s_i = α·Σ_{j=0..i−1} (1 − α)^j·x_{i−j} + (1 − α)^i·x_0
which provides any smoothed value s_i as a function of all the previous values of x
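The recursion is short enough to sketch directly in Python (seeding with s_0 = x_0, as above; the function name is mine):

```python
def exp_smooth(x, alpha):
    """Single exponential smoothing: s_i = alpha*x_i + (1-alpha)*s_{i-1},
    seeded with s_0 = x_0."""
    s = [x[0]]
    for xi in x[1:]:
        s.append(alpha * xi + (1 - alpha) * s[-1])
    return s
```

The two extremes show the role of α: α = 1 reproduces the series, α = 0 freezes it at the seed.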
Single exponential smoothing
The single exponential smoothing provides good smoothing curves and,
for some cases, forecasting
It is limited, though, for series that present trend or seasonality, situations in which the technique cannot be accurately used for prediction
There are two more advanced exponential smoothing techniques:
Double exponential smoothing, for series with trend but without seasonality
Triple exponential smoothing, for series with trend and seasonality; this technique is known as the Holt–Winters method
The Holt–Winters method is a powerful technique able to reproduce the full behavior of additive or multiplicative time series
Double and triple exponential smoothing
Double exponential smoothing (level s_i plus a trend factor t_i):
s_i = α·x_i + (1 − α)·(s_{i−1} + t_{i−1})
t_i = β·(s_i − s_{i−1}) + (1 − β)·t_{i−1}
Additive triple exponential smoothing (adding a seasonality factor p_i with season length L):
s_i = α·(x_i − p_{i−L}) + (1 − α)·(s_{i−1} + t_{i−1})
t_i = β·(s_i − s_{i−1}) + (1 − β)·t_{i−1}
p_i = γ·(x_i − s_i) + (1 − γ)·p_{i−L}
Forecasting: x_{i+m} ≈ s_i + m·t_i + p_{i−L+m}
Multiplicative triple exponential smoothing:
s_i = α·(x_i / p_{i−L}) + (1 − α)·(s_{i−1} + t_{i−1})
t_i = β·(s_i − s_{i−1}) + (1 − β)·t_{i−1}
p_i = γ·(x_i / s_i) + (1 − γ)·p_{i−L}
Forecasting: x_{i+m} ≈ (s_i + m·t_i)·p_{i−L+m}
Exponential smoothing depends on mixing parameters, which are required by software packages:
• Single exponential smoothing: α
• Double exponential smoothing: α and β
• Triple exponential smoothing: α, β, and γ
More on time-series analysis:
http://www.statsoft.com/textbook/time-series-analysis/
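As a sketch of double exponential smoothing (the seeding choices s_0 = x_0 and t_0 = x_1 − x_0 are one common convention among several):

```python
def double_exp_smooth(x, alpha, beta):
    """Double exponential smoothing: level s and trend t,
    seeded with s_0 = x_0 and t_0 = x_1 - x_0."""
    s, t = [x[0]], [x[1] - x[0]]
    for xi in x[1:]:
        s.append(alpha * xi + (1 - alpha) * (s[-1] + t[-1]))
        t.append(beta * (s[-1] - s[-2]) + (1 - beta) * t[-1])
    return s, t
```

An m-step-ahead forecast is then s[-1] + m * t[-1]; on a perfectly linear series this extrapolation is exact.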
Triple exponential smoothing
For example, the additive Holt–Winters plot for a dataset with the
number of US monthly international flight passengers
The years 1949 through 1957 were used to “train” the algorithm,
and the years 1958 through 1960 were forecasted
Note how well the forecast agrees with the actual data
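A compact sketch of additive Holt–Winters in Python. Initialization conventions vary between implementations; this one seeds the level with the first season's mean, the trend with the difference between the first two seasons' means, and the seasonal factors with deviations from the first season's mean:

```python
def holt_winters_additive(x, L, alpha, beta, gamma):
    """Additive Holt-Winters: level s, trend t, seasonal factors p (period L).
    Returns the one-step-ahead fitted values plus the final state."""
    s = sum(x[:L]) / L                           # initial level
    t = (sum(x[L:2 * L]) - sum(x[:L])) / L ** 2  # initial trend
    p = [x[i] - s for i in range(L)]             # initial seasonal factors
    fitted = []
    for i, xi in enumerate(x):
        fitted.append(s + t + p[i % L])          # forecast made before seeing x_i
        s_prev = s
        s = alpha * (xi - p[i % L]) + (1 - alpha) * (s + t)
        t = beta * (s - s_prev) + (1 - beta) * t
        p[i % L] = gamma * (xi - s) + (1 - gamma) * p[i % L]
    return fitted, s, t, p
```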
Autocorrelation and correlogram
As mentioned, time series are mainly characterized by trends and seasonality
Trend is analyzed by means of smoothing, function fitting (modeling), and plotting
Seasonality can benefit from the techniques of autocorrelation and the correlogram
Autocorrelation and correlogram
The correlation between two time series is obtained as follows:
For each time step, multiply the response values (y_i) of the two series, considering their deviations from the mean
Sum up all the products
Normalize
The correlation of two identical series is 1, and it is −1 for series that are exact inverses of each other
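The steps above can be sketched in Python (the function name is mine; the normalization divides by the product of the two series' standard deviations):

```python
def correlation(y1, y2):
    """Normalized correlation of two equally long series:
    1 for identical series, -1 for exactly inverted ones."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    den = (sum((a - m1) ** 2 for a in y1)
           * sum((b - m2) ** 2 for b in y2)) ** 0.5
    return num / den
```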
Autocorrelation and correlogram
Seasonality:
Formally defined as the correlation between each i-th element and
the (i+k)-th element – k is usually called the lag
Measured by the Autocorrelation Function - ACF, i.e., the correlation
between the two terms xi and xi+k
If the measurement error is not too large, seasonality can be
visually identified as a pattern that repeats every k moments in
time
Autocorrelation and correlogram
If seasonality is present, then the behavior of the series
should repeat at every k time units, where k is named lag
The problem, hence, is: how to identify analytically what is
the lag of the series?
The answer is: compare the time series with itself, shifted by increasing values (lags) of k; for each lag, calculate the correlation
Hence, the autocorrelation of a given series at lag k is given by
c(k) = Σ_i (x_i − x̄)·(x_{i+k} − x̄) / Σ_i (x_i − x̄)²
where the denominator normalizes according to lag 0, that is, to the correlation of the series with itself
Autocorrelation and correlogram
Autocorrelation basic algorithm:
1.Let k = 0
2.Start with two copies of the series (original and copy)
3.Subtract the mean from all values in both series
4.Multiply the values at corresponding time steps with each other
5.Sum up the results for all time steps
6. Normalize with the variance of the original series; this is the correlation for lag k, that is, c(k)
7. Shift the copy by 1 time step
8. Let k ← k + 1
9. Repeat from step 4 while k < kmax
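The algorithm above can be condensed into a short Python sketch (the function name is mine; the variance of the mean-subtracted series normalizes every lag, so c(0) = 1):

```python
def autocorrelation(x, kmax):
    """c(k) for k = 0..kmax-1, normalized so that c(0) = 1."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]        # subtract the mean (step 3)
    var = sum(v * v for v in d)      # normalization, lag 0 (step 6)
    return [sum(d[i] * d[i + k] for i in range(n - k)) / var  # steps 4-5, shifted by k
            for k in range(kmax)]
```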
According to this algorithm:
Initially (lag 0), the two signals are perfectly aligned and the
correlation is 1
Then, as we shift the signals they slowly move out of phase and
the correlation drops
How quickly it drops tells us how much “memory” there is
in the data:
If quickly, we know that, after a few steps, the signal has lost all
memory of its recent past
If slowly, then we know that we are dealing with a process that
is relatively steady over longer periods of time
Autocorrelation and correlogram
The correlogram is the plot of correlation versus lag for a given time series
For example: consider a data set with the number of daily calls placed in a call center for a time period slightly longer than two years – as presented earlier
[Figure: the time series and its (auto)correlogram, x axis 0 ≤ lag ≤ 500]
From the correlogram we can observe that:
The series has a long “memory” (long cycles): it takes the
correlation almost 100 days to fall to zero, indicating that the
frequency of calls changes more or less once per quarter but not
more frequently
There is a pronounced secondary peak at a lag of 365 days: the
call center data is highly seasonal and repeats itself on a yearly
basis, when the series repeats its response behavior (high
correlation)
There is a small but regular sawtooth structure; if we look
closely, we will find that the first peak of the sawtooth is at a lag of
7 days and that all repeating ones occur at multiples of 7 - this is
the signature of the high-frequency component that we see in the
plot of the series; that is, the traffic to the call center exhibits a
secondary seasonal component with 7-day periodicity, the
traffic depends on the day of the week
CO2 measurements above Mauna Loa in Hawaii
Consider again the data set with the concentration (ppm) of
carbon dioxide (CO2) in the atmosphere, as measured by the
observatory on Mauna Loa on Hawaii, recorded at monthly
intervals since 1959
CO2 measurements above Mauna Loa in Hawaii
The series can be better analyzed numerically if the horizontal axis is expressed as incremental monthly indexes and the graph passes through the origin (a vertical translation of −315)
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines
CO2 measurements above Mauna Loa in Hawaii
The series has a trend that seems to be a power law of the form b·(x/a)^k with k greater than 1, as the curve is convex; a first guess is k = 2, b = 35, and a = 350 (matching the upper rightmost part of the series)
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines, 35*(x/350)**2
CO2 measurements above Mauna Loa in Hawaii
By trial and error, a better guess for k is 1.35
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines, 35*(x/350)**1.35
CO2 measurements above Mauna Loa in Hawaii
To verify the accuracy of the model function, we can plot the residual by subtracting
the trend from the data
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315 - 35*($0/350)**1.35) with lines
CO2 measurements above Mauna Loa in Hawaii
The model seems fine except for the seasonality, which consists of regular oscillations that can be captured by a sine, since the series starts at (0,0); also, the series is monthly with a cycle of one year, so a guess is that the pattern repeats every 12 points; the amplitude is around 3, as we can observe in the former plots
We can compare the residual and our seasonality model in Gnuplot (taking f(x) = 315 + 35*(x/350)**1.35 as the trend) with:
plot "data" using 0:($2-f($0)) with lines, 3*sin(2*pi*x/12) with lines
At this point the model is given by the power-law function plus the sine
function
f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12)
plot "data" using 0:2 with lines, f(x)
which is pretty close to the actual phenomenon
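The fitted model can equally well be checked outside Gnuplot; a direct Python transcription of the same function:

```python
import math

def f(x):
    """Trend (power law) plus seasonality (sine), as fitted by eye above."""
    return 315 + 35 * (x / 350) ** 1.35 + 3 * math.sin(2 * math.pi * x / 12)
```

At x = 0 the model returns the 315 ppm baseline, and half a seasonal cycle later the sine term vanishes while the trend term is still small.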