2. 2
Mining Time-Series Data
A time series is a sequence of data points, typically measured at
successive times spaced at (often uniform) intervals
Time series analysis: a subfield of statistics comprising methods
that attempt to understand such time series, often either to
understand the underlying context of the data points or to make
forecasts (or predictions)
Methods for time series analysis
Frequency-domain methods: Model-free analyses, well-suited to
exploratory investigations
spectral analysis vs. wavelet analysis
Time-domain methods: Auto-correlation and cross-correlation
analysis
Motif-based time-series analysis
Applications
Financial: stock price, inflation
Industry: power consumption
Scientific: experiment results
Meteorological: precipitation
3. 3
Mining Time-Series Data
Regression Analysis
Trend Analysis
Similarity Search in Time Series Data
Motif-Based Search and Mining in Time
Series Data
Summary
4. 4
Time-Series Data Analysis: Prediction &
Regression Analysis
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given input
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or
predictor variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
5. 5
What is Regression?
Modeling the relationship between one response
variable and one or more predictor variables
Analyzing the confidence of the model
E.g., height vs. weight
6. 6
Regression Yields Analytical Model
Discrete data points → Analytical model
General relationship
Easy calculation
Further analysis
Application - Prediction
7. 7
Application - Detrending
Obtain the trend for irregular data series
Subtract trend
Reveal oscillations
8. 8
Linear Regression - Single Predictor
Model is linear
y = w0 + w1 x
where w0 (y-intercept) and w1
(slope) are regression
coefficients
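Method of least squares (x: predictor variable, y: response variable):
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the means of the $x_i$ and $y_i$ over the training set D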
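The least-squares estimates for the single-predictor model can be computed directly from the two sums in the formula. The following is a minimal sketch; the function name `least_squares_line` and the sample data are illustrative assumptions, not part of the slides.

```python
def least_squares_line(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors of y = w0 + w1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimate w1.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical training data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = least_squares_line(xs, ys)
print(w0, w1)
```

Because the sample points lie exactly on a line, the fit recovers the intercept and slope that generated them; with noisy data the same code returns the best-fitting line in the squared-error sense.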
9. 9
Linear Regression – Multiple Predictors
Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
E.g., for 2-D data:
$y = w_0 + w_1 x_1 + w_2 x_2$
Solvable by
Extension of the least-squares method
$(X^T X) W = X^T Y \rightarrow W = (X^T X)^{-1} X^T Y$
Commercial software (SAS, S-Plus)
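As a rough illustration of the least-squares extension to several predictors, the sketch below builds the normal equations in their standard form, (X^T X) W = X^T Y, and solves them with a small Gaussian elimination routine. The helper names and the data are assumptions made for this example; in practice a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple_regression(rows, ys):
    """Return [w0, w1, ..., wk] from the normal equations (X^T X) W = X^T Y."""
    # Prepend a constant 1 to each row so that w0 acts as the intercept.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical 2-D data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
weights = fit_multiple_regression(rows, ys)
print(weights)
```

Solving the linear system directly avoids forming the matrix inverse explicitly, which is usually the more stable choice.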
10. 10
Nonlinear Regression with Linear Method
Polynomial regression model
E.g., $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
Let $x_2 = x^2$, $x_3 = x^3$
$y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
Log-linear regression model
E.g., $y = \exp(w_0 + w_1 x + w_2 x_2 + w_3 x_3)$
Let $y' = \log(y)$
$y' = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
11. 11
Generalized Linear Regression
Response y
Distribution function in the exponential family
Variance of y depends on E( y), not a constant
$E(y) = g^{-1}(w_0 + w_1 x + w_2 x_2 + w_3 x_3)$, where g is the link function
Examples
Logistic regression (binomial regression): probability of
some event occurring
Poisson regression: number of customers
…
References: Nelder and Wedderburn, 1972; McCullagh and
Nelder, 1989
12. 12
Regression Tree (Breiman et al., 1984)
Figure source: http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
Partition the domain space
Leaf: (1) a continuous-valued
prediction; (2) average value
13. 13
Model Tree (Quinlan, 1992)
Leaf – a linear equation
More general than regression tree
Figure source: http://datamining.ihe.nl/research/model-trees.htm
14. 14
Regression Trees and Model Trees
Regression tree: proposed in CART system (Breiman et al. 1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training tuples
that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than linear
regression when the data cannot be represented well by a simple
linear model
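To make the leaf-average idea concrete, the sketch below grows a regression tree of depth one (a single split) on one predictor, choosing the threshold that minimizes the summed squared error of the two leaves. It is a simplified illustration in the spirit of CART, not the full algorithm of Breiman et al.; the function names and data are assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on x that minimizes the total SSE of the two leaves."""
    best = None
    candidates = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best


def predict(stump, x):
    """Each leaf predicts the average y of the training tuples that reach it."""
    _, threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Hypothetical data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
stump = best_split(xs, ys)
print(stump[1], predict(stump, 2), predict(stump, 10))
```

A model tree would differ only at the leaves, where a linear equation fitted to the tuples in each leaf would replace the stored average.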
15. 15
Predictive Modeling in
Multidimensional Databases
Predictive modeling: Predict data values or construct
generalized linear models based on the database data
One can only predict value ranges or category
distributions
Method outline
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgment, etc.
Multi-level prediction: drill-down and roll-up analysis
18. 18
References
Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models.
Journal of the Royal Statistical Society A, 135, 370-384.
Chatfield, C. (1984). The Analysis of Time Series: An Introduction, 3rd ed.
Chapman & Hall.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd ed.
Chapman and Hall, London.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and
Regression Trees. Chapman & Hall (Wadsworth, Inc.), New York.
Quinlan, J.R. (1992). Learning with continuous classes. In: Adams and Sterling
(Eds.), Proceedings of Artificial Intelligence'92, World Scientific, Singapore, pp.
343-348.
Acknowledgment
This presentation integrates Xiaopeng Li’s slides in his CS 512 class
presentation
20. 20
A time series can be illustrated as a time-series graph
which describes a point moving with the passage of time
21. 21
Categories of Time-Series Movements
Long-term or trend movements (trend curve): general direction in
which a time series is moving over a long interval of time
Cyclic movements or cycle variations: long term oscillations about
a trend line or curve
e.g., business cycles, may or may not be periodic
Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series appears to
follow during corresponding months of successive years
Irregular or random movements
Time series analysis: decomposition of a time series into these four
basic movements
Additive model: TS = T + C + S + I
Multiplicative model: TS = T × C × S × I
22. 22
Estimation of Trend Curve
The freehand method
Fit the curve by looking at the graph
Costly and barely reliable for large-scale data mining
The least-square method
Find the curve minimizing the sum of the squares of
the deviation of points on the curve from the
corresponding data points
The moving-average method
23. 23
Moving Average
Moving average of order n
Smooths the data
Eliminates cyclic, seasonal and irregular movements
Loses the data at the beginning or end of a series
Sensitive to outliers (the effect can be reduced by a weighted
moving average)
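A moving average of order n replaces each run of n consecutive values with their mean. The sketch below shows a plain and a weighted variant; the function names, the sample series, and the weights (1, 4, 1) are illustrative assumptions.

```python
def moving_average(values, n):
    """Simple moving average of order n: mean of each run of n consecutive values."""
    if n < 1 or n > len(values):
        raise ValueError("order n must be between 1 and len(values)")
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


def weighted_moving_average(values, weights):
    """Weighted moving average; weights are applied left to right in each window."""
    n = len(weights)
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + n])) / total
        for i in range(len(values) - n + 1)
    ]


# Illustrative series of nine values.
series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(series, 3))                    # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(weighted_moving_average(series, [1, 4, 1]))   # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

The output has n - 1 fewer points than the input, which is the loss of data at the ends of the series. Giving the center of the window a larger weight is one common choice; other weightings are equally valid.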
24. 24
Trend Discovery in Time-Series (1):
Estimation of Seasonal Variations
Seasonal index
Set of numbers showing the relative values of a variable during
the months of the year
E.g., if the sales during October, November, and December are
80%, 120%, and 140% of the average monthly sales for the
whole year, respectively, then 80, 120, and 140 are seasonal
index numbers for these months
Deseasonalized data
Data adjusted for seasonal variations for better trend and cyclic
analysis
Divide the original monthly data by the seasonal index numbers
(expressed as fractions, e.g., 0.80 for an index of 80) for the
corresponding months
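The seasonal index and the deseasonalizing step can be sketched as follows. The sales figures are invented for illustration, and the index is estimated from a single base year for brevity; in practice it would usually be averaged over several years.

```python
def seasonal_index(monthly_values):
    """Seasonal index per month: the month's value as a percentage of the monthly average."""
    average = sum(monthly_values) / len(monthly_values)
    return [100.0 * v / average for v in monthly_values]


def deseasonalize(monthly_values, index):
    """Divide each month's value by its seasonal index expressed as a fraction."""
    return [v / (i / 100.0) for v, i in zip(monthly_values, index)]


# Hypothetical monthly sales for a base year (average monthly sales = 100).
base_year = [90, 80, 100, 100, 110, 100, 90, 100, 90, 80, 120, 140]
index = seasonal_index(base_year)

# Hypothetical sales for the following year: same seasonal pattern, 10% higher level.
next_year = [99, 88, 110, 110, 121, 110, 99, 110, 99, 88, 132, 154]
adjusted = deseasonalize(next_year, index)

print(index[9:])   # October, November, December
print(adjusted)
```

In this example the index numbers for October, November, and December are 80, 120, and 140, and the adjusted series is flat at about 110, so what remains after removing the seasonal pattern is the change in overall level.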
25. November 17, 2023 Data Mining: Concepts and Techniques 25
Seasonal Index
Figure: seasonal index (vertical axis, 0 to 160) by month (horizontal axis, 1 to 12)
Raw data from http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf
26. 26
Trend Discovery in Time-Series (2)
Estimation of cyclic variations
If (approximate) periodicity of cycles occurs, cyclic
index can be constructed in much the same manner as
seasonal indexes
Estimation of irregular variations
By adjusting the data for trend, seasonal and cyclic
variations
With the systematic analysis of the trend, cyclic, seasonal,
and irregular components, it is possible to make long- or
short-term predictions with reasonable quality
28. 28
Similarity Search in Time-Series Analysis
A normal database query finds exact matches
Similarity search finds data sequences that differ only
slightly from the given query sequence
Two categories of similarity queries
Whole matching: find sequences that are similar, as a whole,
to the query sequence
Subsequence matching: find sequences that contain
subsequences similar to the query sequence
Typical Applications
Financial market
Market basket data analysis
Scientific databases
Medical diagnosis
29. 29
Data Transformation
Many techniques for signal analysis require the data to
be in the frequency domain
Usually data-independent transformations are used
The transformation matrix is determined a priori
discrete Fourier transform (DFT)
discrete wavelet transform (DWT)
The Euclidean distance between two signals in the time domain is
the same as their Euclidean distance in the frequency
domain
30. 30
Discrete Fourier Transform
DFT does a good job of concentrating energy in the first
few coefficients
If we keep only the first few DFT coefficients, we can
compute a lower bound of the actual distance
Feature extraction: keep the first few coefficients (F-index)
as representative of the sequence
31. 31
DFT (continued)
Parseval’s Theorem
The Euclidean distance between two signals in the time
domain is the same as their distance in the frequency
domain
Keeping only the first few coefficients (e.g., f = 0 to 3) underestimates
the distance, so there will be no false dismissals
$$\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$$
$$\sum_{f=0}^{3} |F(S)[f] - F(Q)[f]|^2 \le \sum_{t=0}^{n-1} |S[t] - Q[t]|^2$$
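Both properties can be checked numerically. The sketch below uses a naive DFT scaled by 1/sqrt(n), a normalization assumed here so that energy is preserved exactly (the slides do not state one), and compares the full and truncated frequency-domain distances against the time-domain distance. The sequences S and Q are made up.

```python
import cmath
import math


def dft(signal):
    """Orthonormal DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t/n)."""
    n = len(signal)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(signal))
        for f in range(n)
    ]


def squared_distance(a, b):
    """Squared Euclidean distance; works for real or complex sequences."""
    return sum(abs(p - q) ** 2 for p, q in zip(a, b))


# Two hypothetical sequences of equal length.
S = [1.0, 2.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]
Q = [1.5, 2.5, 3.0, 3.0, 1.0, 1.0, 0.5, 0.0]

FS, FQ = dft(S), dft(Q)
time_dist = squared_distance(S, Q)
freq_dist = squared_distance(FS, FQ)
truncated_dist = squared_distance(FS[:4], FQ[:4])  # coefficients f = 0..3 only

print(time_dist, freq_dist, truncated_dist)
```

The first two numbers agree up to rounding, and the third is never larger, because dropping coefficients only removes non-negative terms from the sum.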
32. 32
Multidimensional Indexing in Time-Series
Multidimensional index construction
Constructed for efficient accessing using the first few
Fourier coefficients
Similarity search
Use the index to retrieve the sequences that are at
most a certain small distance away from the query
sequence
Perform post-processing by computing the actual
distance between sequences in the time domain and
discard any false matches
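The two-phase search can be sketched as a filter step in feature space followed by a refinement step in the time domain. For simplicity this example scans a small in-memory dictionary instead of using a real multidimensional index such as an R-tree. The function names, the choice of k = 3 coefficients, the 1/sqrt(n) scaling, and the data are all assumptions.

```python
import cmath
import math


def dft_features(seq, k):
    """First k coefficients of the orthonormal DFT of seq."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(seq)) / math.sqrt(n)
        for f in range(k)
    ]


def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def range_query(database, query, eps, k=3):
    """Return names of sequences within distance eps of query (filter, then refine)."""
    q_feat = dft_features(query, k)
    # Filter: the feature-space distance never exceeds the true distance,
    # so discarding candidates beyond eps causes no false dismissals.
    candidates = [
        name for name, seq in database.items()
        if distance(dft_features(seq, k), q_feat) <= eps
    ]
    # Refine: compute the actual time-domain distance to drop false matches.
    return [name for name in candidates if distance(database[name], query) <= eps]


# Hypothetical database of equal-length sequences.
database = {
    "a": [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0],
    "b": [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.5],
    "c": [4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0],
}
query = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
matches = range_query(database, query, eps=1.0)
print(matches)
```

In a real system the features would be computed once and stored in the index, so only the query's features are computed at search time.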
33. 33
Subsequence Matching
Break each sequence into a set of
pieces using a sliding window of length w
Extract the features of the
subsequence inside the window
Map each sequence to a “trail” in
the feature space
Divide the trail of each sequence
into “subtrails” and represent each
of them with minimum bounding
rectangle
Use a multi-piece assembly
algorithm to search for longer
sequence matches
35. 35
Enhanced Similarity Search Methods
Allow for gaps within a sequence or differences in offsets
or amplitudes
Normalize sequences with amplitude scaling and offset
translation
Two subsequences are considered similar if one lies within
an envelope of width ε around the other, ignoring outliers
Two sequences are said to be similar if they have enough
non-overlapping time-ordered pairs of similar
subsequences
Parameters specified by a user or expert: sliding window
size, width of an envelope for similarity, maximum gap,
and matching fraction
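One common way to normalize for offset and amplitude is to subtract the mean and divide by the standard deviation, so that sequences differing only by a shift and a positive scale factor become identical. This is one possible choice, assumed here for illustration; the slides do not prescribe a specific normalization.

```python
import statistics


def normalize(seq):
    """Offset translation (subtract the mean) and amplitude scaling (divide by the std. dev.)."""
    mean = statistics.fmean(seq)
    std = statistics.pstdev(seq)
    if std == 0:
        return [0.0 for _ in seq]  # a constant sequence has no shape to compare
    return [(v - mean) / std for v in seq]


# Hypothetical sequences: the second is the first scaled by 3 and shifted by 10.
a = [1.0, 2.0, 4.0, 2.0, 1.0]
b = [3 * v + 10 for v in a]
na, nb = normalize(a), normalize(b)
print(na)
print(nb)
```

After normalization the two sequences coincide up to rounding, so a distance-based comparison treats them as the same shape.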
36. 36
Steps for Performing a Similarity Search
Atomic matching
Find all pairs of gap-free windows of a small length that
are similar
Window stitching
Stitch similar windows to form pairs of large similar
subsequences allowing gaps between atomic matches
Subsequence Ordering
Linearly order the subsequence matches to determine
whether enough similar pieces exist
37. 37
Similar Time Series Analysis
Figure: VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund groups
38. 38
Query Languages for Time Sequences
Time-sequence query language
Should be able to specify sophisticated queries like
Find all of the sequences that are similar to some sequence in class
A, but not similar to any sequence in class B
Should be able to support various kinds of queries: range queries,
all-pair queries, and nearest neighbor queries
Shape definition language
Allows users to define and query the overall shape of time
sequences
Uses human readable series of sequence transitions or macros
Ignores the specific details
E.g., the pattern up, Up, UP can be used to describe
increasing degrees of rising slopes
Macros: spike, valley, etc.
40. 40
Sequence Distance
A function that measures the dissimilarity of two
sequences (of possibly unequal length)
Example: Euclidean distance between time series Q and C (of equal length n)
$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
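The Euclidean distance translates directly into code. This minimal sketch assumes both sequences have the same length n, which this particular measure requires; the function name and sample values are illustrative.

```python
import math


def euclidean_distance(q, c):
    """D(Q, C) = sqrt(sum_i (q_i - c_i)^2) for sequences of equal length n."""
    if len(q) != len(c):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))


# Hypothetical sequences.
Q = [1.0, 2.0, 3.0]
C = [1.0, 4.0, 6.0]
print(euclidean_distance(Q, C))
```

Here the squared differences are 0, 4, and 9, so the distance is the square root of 13. Sequences of unequal length need a different measure or a prior alignment step.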