Analysis of Time Series Data & Pattern Sequencing

Multidimensional analysis and descriptive mining of complex data objects
Time Series and Sequence Data
• Archishman Bandyopadhyay (39136)
• Gaurang Dhume (39168)

Data Stream Mining
Key characteristics:
• Temporally ordered
• Fast changing
• Massive
• Potentially infinite
Streams of data flow in and out of a computer system continuously and with varying update rates.
Data stream management systems (DSMS) are used to perform data mining on the data stream.
[Figure: input (I/P) and output (O/P) streams]
Introduction to Business Intelligence – Archishman Bandyopadhyay & Gaurang Dhume , SIBM Pune (MBA 39 , Marketing A)
Data Stream Mining is the process of extracting
knowledge structures from continuous, rapid
data records.
A data stream is an ordered sequence of
instances that in many applications of data
stream mining can be read only once or a small
number of times using limited computing and
storage capabilities.
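The "read only once, with limited storage" constraint can be illustrated with a small sketch. It assumes a hypothetical stream of numeric readings (the `sensor_readings` generator and its values are made up for illustration) and keeps a running mean and variance using Welford's method, one common single-pass technique. The slides do not prescribe it.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method) in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Fold one new record into the summary; the record itself is not stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance of the values seen so far.
        return self._m2 / self.count if self.count else 0.0


def sensor_readings():
    """Hypothetical stream: a generator, so each record can be read only once."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


stats = RunningStats()
for reading in sensor_readings():
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Only three numbers are kept regardless of how long the stream runs, which is what lets this kind of summary work on a potentially infinite stream.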

Time Series
A time series is a collection of
observations made sequentially in
time.
e.g.
• Financial time series like stock
fluctuations with time,
• Sales revenue with time,
• Budgetary analysis,
• Utility studies, inventory studies,
• Yield projections,
• Workload projections,
• Process and quality control,
• Observation of natural phenomena
(such as atmosphere, temperature,
wind, earthquake)
Major time series data mining tasks :
1. Indexing
2. Clustering
3. Classification
4. Prediction
5. Anomaly Detection

Trend Analysis: Y = f(t)
Goals for time series analysis:
• Modelling time series (i.e., to gain
insight into the mechanisms or
underlying forces that generate the
time series )
• Forecasting time series (i.e., to
predict the future values of the
time-series variables).
Major components or movements :
• Trend or long-term movement
• Cyclic movement
• Seasonal movements
• Irregular/Random movements
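The trend and seasonal movements listed above can be separated with a short sketch. It assumes an additive model (value = trend + seasonal effect + irregular part) and uses a centered moving average to estimate the trend. Both are illustrative choices rather than the method used in the slides, and the quarterly `sales` figures are synthetic.

```python
def centered_moving_average(series, period):
    """Estimate the trend with a moving average centered on each point."""
    half = period // 2
    trend = [None] * len(series)  # undefined at the edges of the series
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: half weight on both end points keeps the window centered.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def seasonal_effects(series, trend, period):
    """Average the detrended values for each position within the period."""
    sums = [0.0] * period
    counts = [0] * period
    for i, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            sums[i % period] += value - level
            counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]


# Synthetic quarterly sales over three years (hypothetical numbers).
sales = [400, 670, 590, 460, 480, 750, 670, 540, 560, 830, 750, 620]

trend = centered_moving_average(sales, period=4)
effects = seasonal_effects(sales, trend, period=4)

print(trend)    # long-term movement, None where the window does not fit
print(effects)  # one seasonal effect per quarter
```

Whatever remains after subtracting the trend and the seasonal effect from each observation corresponds to the cyclic and irregular movements.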
[Figure: Germany, long-term interest rate (%), Jan-01 to Jul-11, labelled "Long Term Trend"]
[Figure: Ice cream sales (Rs. Mn), 97-Q1 to 99-Q4, labelled "Seasonal Movement"]

Sequential Pattern Mining
A sequential database is any database that consists of sequences of ordered events, with or without a concrete notion of time.
Examples
• Web page traversal sequences
• Customer shopping transaction sequences (renting “Star Wars”, then “Empire Strikes Back”, then “Return of the Jedi”, in that order)
• Collection of ordered events within an
interval
Applications
• Targeted marketing
• Customer retention
• Weather prediction, etc.
Pattern mining consists of discovering
interesting, useful, and unexpected
patterns in databases
Various types of patterns can be
discovered in databases, such as :
• Frequent item-sets,
• Associations,
• Subgraphs,
• Sequential rules, and
• Periodic patterns
Example:
Seq. ID   Sequence
1         {a, b}, {c}, {f, g}, {g}, {e}
2         {a, d}, {c}, {b}, {a, b, e, f}
3         {a}, {b}, {f, g}, {e}
4         {b}, {f, g}
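To make the idea of a sequential pattern concrete, the sketch below counts how many sequences of the four-sequence example database contain a given pattern, i.e. its support. The function names (`contains`, `support`) and the two query patterns are illustrative choices, not part of the original slides.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered subsequence.

    Each pattern element must be a subset of some itemset of the sequence,
    and the matched itemsets must appear in the same order.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1  # greedy: match each pattern element as early as possible
    return pos == len(pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in database.values() if contains(seq, pattern))


# The example sequence database, keyed by Seq. ID.
database = {
    1: [{"a", "b"}, {"c"}, {"f", "g"}, {"g"}, {"e"}],
    2: [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    3: [{"a"}, {"b"}, {"f", "g"}, {"e"}],
    4: [{"b"}, {"f", "g"}],
}

print(support(database, [{"b"}, {"f", "g"}]))         # sequences 1, 3 and 4
print(support(database, [{"a"}, {"f", "g"}, {"e"}]))  # sequences 1 and 3
```

A pattern is called frequent when its support reaches a user-chosen minimum; with a minimum support of 3, only the first of the two patterns above would qualify.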

Applying Sequential Pattern Mining to Time Series
A time series can be converted to a sequence by discretizing it (transforming the numbers into symbols).
Techniques for analysing sequences can then also be applied to the time series. One of the most popular discretization algorithms is SAX (Symbolic Aggregate approXimation).
Step 1 : PAA (piecewise aggregate approximation)
Split the time series into 8 segments and replace each segment by its average, to reduce the dimensionality.
n = 11 (original data points)
w = 4 (number of symbols)
v = 8 (PAA data points)
Step 2 : Each PAA data point is replaced by a symbol. Each symbol represents an interval of values, with the intervals chosen so that each is equally probable under the normal distribution.
Four symbols are created:
a = [-Infinity, 4.50]
b = [4.50, 6.20]
c = [6.20, 7.90]
d = [7.90, Infinity]
The result is a sequence of symbols: a, a, c, d, d, c, c, b
Step 3 : This sequence is the symbolic representation of the
time series.
Then, after a time series has been converted to a sequence, it is
possible to apply traditional pattern mining algorithms on the
time series
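The two SAX steps can be sketched in a few lines. Several things here are assumptions made for illustration: the 16-point `series` is hypothetical (the slide's original 11-point series is not reproduced in the text), the simple `paa` helper only handles lengths that divide evenly into segments, and values that fall exactly on a breakpoint are assigned to the upper interval. The breakpoints 4.50, 6.20 and 7.90 are taken from the slide as given rather than recomputed from a normal distribution.

```python
from bisect import bisect_right


def paa(series, segments):
    """Piecewise aggregate approximation: replace each segment by its mean."""
    if len(series) % segments != 0:
        raise ValueError("this sketch needs len(series) divisible by segments")
    size = len(series) // segments
    return [sum(series[i:i + size]) / size for i in range(0, len(series), size)]


def to_symbols(values, breakpoints, alphabet="abcd"):
    """Map each value to the symbol of the interval it falls into."""
    return [alphabet[bisect_right(breakpoints, v)] for v in values]


# Hypothetical series, chosen so that its PAA has 8 points.
series = [2.5, 3.5, 3.5, 4.5, 6.5, 7.5, 8.0, 9.0,
          8.5, 9.5, 7.0, 8.0, 6.0, 7.0, 4.5, 5.5]
breakpoints = [4.50, 6.20, 7.90]  # interval boundaries from the slide

reduced = paa(series, segments=8)
symbols = to_symbols(reduced, breakpoints)

print(reduced)           # 8 segment averages
print("".join(symbols))  # symbolic representation of the series
```

The resulting string can be fed to any sequence mining routine, since it is now an ordinary sequence of symbols.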

Common Sequential Pattern Mining Algorithms
I. Apriori-based Approaches (if a sequence S is not frequent, then none of the supersequences of S is frequent):
• GSP (Generalized Sequential Pattern)
• SPADE (Sequential PAttern Discovery using Equivalence classes)
II. Pattern-Growth-based Approaches (the sequence database is recursively projected into a set of smaller projected databases based on the current sequential pattern(s), and sequential patterns are grown in each projected database by exploring only locally frequent fragments):
• FreeSpan
• PrefixSpan (Prefix-Projected Sequential Pattern Growth)
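The pattern-growth idea can be sketched with a simplified PrefixSpan. This version assumes every event is a single item (no itemsets), which is a deliberate simplification of the published algorithm, and the three-sequence database is hypothetical. A projected database is represented as a list of (sequence, start position) pairs instead of copied suffixes.

```python
from collections import defaultdict


def prefixspan(database, min_support):
    """Simplified PrefixSpan for sequences of single items."""
    results = []

    def mine(prefix, projected):
        # Count, per item, how many projected suffixes contain it.
        counts = defaultdict(int)
        for seq, start in projected:
            for item in set(seq[start:]):
                counts[item] += 1

        for item, sup in sorted(counts.items()):
            if sup < min_support:
                continue  # only locally frequent items can extend the prefix
            new_prefix = prefix + [item]
            results.append((new_prefix, sup))

            # Project: keep the part of each sequence after the first match.
            new_projected = []
            for seq, start in projected:
                try:
                    pos = seq.index(item, start)
                except ValueError:
                    continue
                new_projected.append((seq, pos + 1))
            mine(new_prefix, new_projected)

    mine([], [(seq, 0) for seq in database])
    return results


# Hypothetical database of single-item sequences.
database = [list("abc"), list("ac"), list("bc")]

for pattern, sup in prefixspan(database, min_support=2):
    print("".join(pattern), sup)
```

Each recursive call works on a smaller projected database, so no candidate patterns are generated that do not actually occur in the data, in contrast to the Apriori-style generate-and-test loop.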

Common Sequential Pattern Mining Algorithms
Apriori algorithm, worked example (Apriori builds candidates bottom-up, level by level; pattern growth instead recursively projects the database)
A large supermarket tracks sales data by stock-keeping unit (SKU) for each item: each item, such as "butter" or "bread", is identified by a numerical SKU. The supermarket has a database of transactions where each transaction is a set of SKUs that were bought together.
Let the database of transactions consist of the following item sets:
Item sets
{1,2,3,4}
{1,2,4}
{1,2}
{2,3,4}
{2,3}
{3,4}
{2,4}
Count the number of occurrences, called the support, of each member item separately. By scanning the database for the first time, we obtain the following result:
Item   Support
{1}    3
{2}    6
{3}    4
{4}    5
Say that an item set is frequent if it appears in at least 3
transactions of the database: the value 3 is the support
threshold.
All the itemsets of size 1 have a support of at least 3, so they
are all frequent.
The next step is to generate a list of all pairs of the frequent
items.
For example, regarding the pair {1,2}: the list of transactions above shows items 1 and 2 appearing together in three of the itemsets; therefore, we say the itemset {1,2} has a support of three.
Itemset   Support
{1,2}     3
{1,3}     1
{1,4}     2
{2,3}     3
{2,4}     4
{3,4}     3
The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3, so they are frequent. The pairs {1,3} and {1,4} are not. Now, because {1,3} and {1,4} are not frequent, any larger set which contains {1,3} or {1,4} cannot be frequent. In this way, we can prune sets: when we look for frequent triples in the database, the only candidate left is {2,3,4}. It appears in just 2 transactions ({1,2,3,4} and {2,3,4}), which is below the threshold, so there are no frequent triples.
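The worked example can be reproduced with a compact Apriori sketch. The function name `apriori` and its structure are illustrative rather than a reference implementation; the transactions and the support threshold of 3 come from the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        single = frozenset([item])
        count = support(single)
        if count >= min_support:
            current[single] = count

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: scan the database and keep candidates meeting the threshold.
        counted = {c: support(c) for c in candidates}
        current = {c: s for c, s in counted.items() if s >= min_support}
    return frequent


transactions = [
    {1, 2, 3, 4}, {1, 2, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {3, 4}, {2, 4},
]

frequent_itemsets = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is where the Apriori property pays off: {1,2,3} and {1,2,4} are discarded without scanning the database, because each contains an infrequent pair.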

- Thank You -
Sources:
• http://data-mining.philippe-fournier-viger.com/introduction-time-series-mining-spmf/
• http://data-mining.philippe-fournier-viger.com/introduction-sequential-pattern-mining/
• http://web.engr.illinois.edu/~hanj/cs512/bk2chaps/chapter_8.pdf
• Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 107–144 (2007)
