Multidimensional analysis and descriptive mining of complex data objects
Time Series and Sequence Data
• Archishman Bandyopadhyay (39136)
• Gaurang Dhume (39168)
Data Stream Mining
Key characteristics :
• Temporally ordered
• Fast changing
• Massive
• Potentially infinite
Streams of data flow in and
out of a computer system
continuously and with
varying update rates.
Data stream management
systems (DSMS) are used
to perform data mining on
the data stream
[Figure: data stream input (I/P) and output (O/P)]
Introduction to Business Intelligence – Archishman Bandyopadhyay & Gaurang Dhume , SIBM Pune (MBA 39 , Marketing A)
Data Stream Mining is the process of extracting
knowledge structures from continuous, rapid
data records.
A data stream is an ordered sequence of
instances that in many applications of data
stream mining can be read only once or a small
number of times using limited computing and
storage capabilities.
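The single-pass, limited-memory constraint can be illustrated with a running summary that is updated once per record and never stores the stream itself. The sketch below uses Welford's online mean/variance update as one possible example; the class name `RunningStats` and the sample values are hypothetical and do not come from the slides.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method) in O(1) memory."""

    def __init__(self):
        self.n = 0        # number of records seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one record from the stream; the record is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the records seen so far."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:   # hypothetical stream; each record is read once
    stats.update(value)
print(stats.mean, stats.variance)
```

The same pattern (a small state object plus an `update` call per record) applies to other stream summaries such as counts or sketches.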
Time Series
A time series is a collection of
observations made sequentially in
time.
e.g.
• Financial time series like stock
fluctuations with time,
• Sales revenue with time,
• Budgetary analysis,
• Utility studies, inventory studies,
• Yield projections,
• Workload projections,
• Process and quality control,
• Observation of natural phenomena
(such as atmosphere, temperature,
wind, earthquake)
Major time series data mining tasks :
1. Indexing
2. Clustering
3. Classification
4. Prediction
5. Anomaly Detection
Trend Analysis: Y = f(t)
Goals for Time series analysis :
• Modelling time series (i.e., to gain
insight into the mechanisms or
underlying forces that generate the
time series)
• Forecasting time series (i.e., to
predict the future values of the
time-series variables).
Major components or movements :
• Trend or long-term movement
• Cyclic movement
• Seasonal movements
• Irregular/Random movements
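As a rough illustration of separating the trend from the seasonal movement, the sketch below assumes an additive model (value = trend + seasonal + irregular) and quarterly data with a period of 4. It estimates the trend with a centered moving average and the seasonal effect as the average detrended value per quarter. The function names and the sales figures are hypothetical; the slides do not specify a decomposition method.

```python
def centered_moving_average(series, period):
    """Trend estimate for an even period (e.g. 4 quarters).

    Uses period + 1 points with half weight on the two end points, so each
    season contributes equally. Positions without a full window are None.
    """
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[i] = total / period
    return trend


def seasonal_effects(series, trend, period):
    """Average detrended value for each position in the seasonal cycle."""
    effects = []
    for p in range(period):
        diffs = [series[i] - trend[i]
                 for i in range(p, len(series), period)
                 if trend[i] is not None]
        effects.append(sum(diffs) / len(diffs))
    return effects


# Hypothetical quarterly sales: linear trend plus a repeating seasonal pattern.
season = [-60.0, 80.0, 40.0, -60.0]
sales = [500.0 + 20.0 * i + season[i % 4] for i in range(12)]

trend = centered_moving_average(sales, period=4)
print(trend)
print(seasonal_effects(sales, trend, period=4))
```

Cyclic and irregular movements are not modelled here; whatever remains after removing the trend and seasonal estimates would contain both.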
[Figure: Germany long-term interest rate (%), half-yearly from Jan-01 to Jul-11; y-axis 0 to 6]
[Figure: Ice cream sales (Rs. Mn), quarterly from 97-Q1 to 99-Q4; y-axis 400 to 1000]
[Figure labels: Long Term Trend, Seasonal Movement]
Sequential Pattern Mining
A sequential database is any database
that consists of sequences of ordered
events, with or without concrete
notions of time.
Examples
• Web page traversal sequences
• Customer shopping transaction
sequences (Renting “Star Wars”, then
“Empire Strikes Back”, then “Return of
the Jedi” in that order)
• Collection of ordered events within an
interval
Applications
• Targeted marketing
• Customer retention
• Weather prediction, etc.
Pattern mining consists of discovering
interesting, useful, and unexpected
patterns in databases
Various types of patterns can be
discovered in databases, such as :
• Frequent item-sets,
• Associations,
• Subgraphs,
• Sequential rules, and
• Periodic patterns
Example:
Seq. ID   Sequence
1         {a, b}, {c}, {f, g}, {g}, {e}
2         {a, d}, {c}, {b}, {a, b, e, f}
3         {a}, {b}, {f, g}, {e}
4         {b}, {f, g}
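To make the example database concrete, the sketch below counts the support of a sequential pattern, that is, the number of sequences that contain it. A sequence is taken to contain a pattern if the pattern's itemsets can be matched, in order, to itemsets of the sequence that include them. The helper names `contains` and `support` and the two query patterns are illustrative choices, not taken from the slides.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Both are lists of sets. Each pattern itemset must be a subset of some
    itemset of the sequence, and the matches must respect the order.
    """
    pos = 0  # index of the next pattern itemset to match
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


# The four sequences from the example table.
database = [
    [{"a", "b"}, {"c"}, {"f", "g"}, {"g"}, {"e"}],
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    [{"a"}, {"b"}, {"f", "g"}, {"e"}],
    [{"b"}, {"f", "g"}],
]

print(support(database, [{"b"}, {"f", "g"}]))  # sequences 1, 3 and 4
print(support(database, [{"a"}, {"e"}]))       # sequences 1, 2 and 3
```

Greedy left-to-right matching is sufficient here because matching a pattern itemset at the earliest possible position never rules out a later match.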
Applying Sequential Pattern Mining to Time Series
A time series can be converted
to a sequence by discretizing it
(transforming the numbers into
symbols).
Techniques for analysing
sequences can then be applied to
the time series. One of the most
popular discretization
algorithms is SAX (Symbolic
Aggregate approXimation).
Step 1 : PAA (piecewise
aggregate approximation)
Split the time series into 8
segments and replace each
segment by its average, to
reduce the dimensionality.
n = 11 (original data points)
w = 4 (number of symbols)
v = 8 (PAA data points)
Step 2 : Each PAA data point is
replaced by a symbol. The
symbols represent intervals of
values chosen so that each
interval is equally probable
under the normal distribution.
Four symbols are created:
a = (-Infinity, 4.50)
b = [4.50, 6.20)
c = [6.20, 7.90)
d = [7.90, Infinity)
The result is a sequence of symbols: a, a, c, d, d, c, c, b
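The "equally probable under the normal distribution" rule can be sketched as follows: for k symbols, the cut points are the quantiles of a normal distribution at 1/k, 2/k, ..., (k-1)/k. The function name is hypothetical, and the mean of 6.2 and standard deviation of 2.52 used in the second call are assumptions chosen because they roughly reproduce the slide's cut points; the slides do not state these parameters.

```python
from statistics import NormalDist


def equiprobable_breakpoints(num_symbols, mean=0.0, stdev=1.0):
    """Cut points that split a normal distribution into equally likely intervals."""
    dist = NormalDist(mean, stdev)
    return [dist.inv_cdf(k / num_symbols) for k in range(1, num_symbols)]


# Standard normal, four symbols: roughly -0.67, 0.00, 0.67.
print(equiprobable_breakpoints(4))

# Assumed mean and standard deviation; gives roughly 4.50, 6.20, 7.90.
print(equiprobable_breakpoints(4, mean=6.2, stdev=2.52))
```

SAX as usually described first normalizes the series to zero mean and unit variance and then applies the standard-normal cut points; applying the cut points in the original units, as above, is equivalent when the same mean and standard deviation are used.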
Step 3 : This sequence is the symbolic representation of the
time series.
Once a time series has been converted to a sequence, traditional
pattern mining algorithms can be applied to it.
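The three steps can be put together in a short sketch. It assumes the slide's cut points (4.50, 6.2, 7.90) are applied directly to the segment averages, so no normalization step is included, and it handles a length that is not a multiple of the segment count (11 points into 8 segments) by repeating each point before averaging. The 11 input values are made up for illustration, since the slides do not list the original data, and the function names are hypothetical.

```python
from bisect import bisect_right


def paa(series, segments):
    """Piecewise aggregate approximation: one average per segment.

    Each point is repeated `segments` times so the stretched series splits
    evenly into `segments` chunks of len(series) values each.
    """
    n = len(series)
    stretched = [x for x in series for _ in range(segments)]
    return [sum(stretched[i * n:(i + 1) * n]) / n for i in range(segments)]


def to_symbols(values, breakpoints, alphabet="abcd"):
    """Map each value to the symbol of the interval it falls in.

    A value equal to a breakpoint goes to the upper interval, matching
    intervals that are closed on the left.
    """
    return [alphabet[bisect_right(breakpoints, v)] for v in values]


def sax(series, segments, breakpoints, alphabet="abcd"):
    """PAA followed by symbol assignment."""
    return to_symbols(paa(series, segments), breakpoints, alphabet)


# Hypothetical series of n = 11 points, reduced to 8 PAA points and 4 symbols.
series = [3.0, 4.1, 5.5, 7.2, 8.4, 9.0, 8.1, 7.0, 6.5, 6.8, 5.2]
print(paa(series, 8))
print(sax(series, 8, breakpoints=[4.5, 6.2, 7.9]))
```

The resulting list of symbols can then be passed to any sequence mining routine in place of the numeric series.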
Common Sequential Pattern Mining Algorithms
I. Apriori-based approaches (if a sequence S is not frequent, then
none of the supersequences of S is frequent):
• GSP (Generalized Sequential Pattern)
• SPADE (Sequential PAttern Discovery using Equivalence classes)
II. Pattern-growth-based approaches (the sequence database is
recursively projected into a set of smaller projected databases based on
the current sequential pattern(s), and sequential patterns are grown in
each projected database by exploring only locally frequent fragments):
• FreeSpan
• PrefixSpan (Prefix-Projected Sequential Pattern Growth)
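The projection idea behind the pattern-growth family can be sketched in a few lines. This is a simplified PrefixSpan-style miner that assumes every event is a single item rather than an itemset; it is meant to show the recursion, not to reproduce the published algorithm in full. The function name and the toy database are hypothetical.

```python
from collections import Counter


def prefix_growth(database, min_support):
    """Frequent sequential patterns for sequences of single items.

    Returns a list of (pattern, support) pairs, where each pattern is a
    list of items.
    """
    results = []

    def mine(prefix, projected):
        # Count each item once per projected sequence (locally frequent items).
        counts = Counter()
        for seq in projected:
            counts.update(set(seq))
        for item, sup in sorted(counts.items()):
            if sup < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, sup))
            # Project: keep what follows the first occurrence of the item.
            suffixes = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            mine(pattern, [s for s in suffixes if s])

    mine([], database)
    return results


# Hypothetical toy database of three sequences.
toy = [list("abc"), list("ac"), list("bc")]
for pattern, sup in prefix_growth(toy, min_support=2):
    print(pattern, sup)
```

Each recursive call works only on the suffixes that follow the current prefix, so the search never generates a candidate that does not occur in the data.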
Common Sequential Pattern Mining Algorithms
Algorithm (Apriori: bottom-up, level-wise candidate generation; Pattern Growth: depth-first, divide-and-conquer projection)
A large supermarket tracks sales data by
stock-keeping unit (SKU) for each item: each
item, such as "butter" or "bread", is
identified by a numerical SKU. The
supermarket has a database of transactions
where each transaction is a set of SKUs that
were bought together.
Let the database of transactions consist of
the following itemsets:
Item sets
{1,2,3,4}
{1,2,4}
{1,2}
{2,3,4}
{2,3}
{3,4}
{2,4}
Count the number of occurrences, called the support, of each
item separately. Scanning the database for the first time
gives the following result:
Item Support
{1} 3
{2} 6
{3} 4
{4} 5
Say that an itemset is frequent if it appears in at least 3
transactions of the database: the value 3 is the support
threshold.
All the itemsets of size 1 have a support of at least 3, so they
are all frequent.
The next step is to generate a list of all pairs of the frequent
items and count the support of each pair.
For example, items 1 and 2 appear together in three of the
transactions above ({1,2,3,4}, {1,2,4} and {1,2}); therefore, the
itemset {1,2} has a support of three.
Item Support
{1,2} 3
{1,3} 1
{1,4} 2
{2,3} 3
{2,4} 4
{3,4} 3
The pairs {1,2}, {2,3}, {2,4}, and
{3,4} all meet or exceed the
minimum support of 3, so they
are frequent. The pairs {1,3}
and {1,4} are not. Because
{1,3} and {1,4} are not
frequent, any larger set that
contains {1,3} or {1,4} cannot
be frequent. In this way, we can
prune sets: when looking for
frequent triples, only {2,3,4}
remains as a candidate. It
appears in just two transactions
({1,2,3,4} and {2,3,4}), so the
database has no frequent triples.
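The worked example can be reproduced with a compact level-wise sketch: count the candidates of size k, keep those that meet the threshold, join them into candidates of size k + 1, and prune any candidate that has an infrequent subset. The function name `apriori` and its return format are illustrative choices; the transactions and the threshold of 3 come from the example above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once to count every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: s for c, s in counts.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent itemsets into candidates one item larger ...
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune any candidate with an infrequent (k-1)-subset.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


transactions = [{1, 2, 3, 4}, {1, 2, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {3, 4}, {2, 4}]
result = apriori(transactions, min_support=3)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On this database the pruning step removes {1,2,3} and {1,2,4} before any counting, so {2,3,4} is the only triple whose support is actually computed.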
- Thank You -
Sources:
• http://data-mining.philippe-fournier-viger.com/introduction-time-series-mining-spmf/
• http://data-mining.philippe-fournier-viger.com/introduction-sequential-pattern-mining/
• http://web.engr.illinois.edu/~hanj/cs512/bk2chaps/chapter_8.pdf
• Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 107–144 (2007)

Analysis of Time Series Data & Pattern Sequencing
