SlideShare a Scribd company logo
1 of 30
Mining Sequential
Patterns
1
Sequential Pattern Mining
 Sequence database – consists of sequences of ordered
elements or events (with or without time)
 Sequential Pattern Mining is the mining of frequently
occurring ordered events or subsequences as patterns
 Example:
 Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital camera, within 3
months.
 Web access patterns, Weather prediction
 Usually categorical or symbolic data
 Numeric data analysis – Time Series Analysis
2
Sequential Pattern Mining
 I = {I1, I2, …Ip} – Set of items
 Sequence s = <e1 e2 e3… el>
 Ordered list of events
 Each event is an element of the sequence
 In Shopping databases, event – shopping trip in which customers bought items (x1 x2 …xq)
 All trips by customer form a sequence
 Item can occur at most once in an event, but several times in a sequence
 Sequence with length l : l-sequence
 A sequence α = <a1a2…an> is a subsequence of β = <b1b2..bm> denoted as α
⊆ β if there exists integers j1, j2, … jn between 1 and m such that a1 ⊆ bj1, a2 ⊆
bj2,… an ⊆ bjn
 α = <(ab),d> and β = <(abc),(de)> α is a sub-sequence of β
3
Sequential Pattern Mining
 A sequence database, S, is a set of tuples, <SID,s>, where SID is a
sequence ID and s is a sequence
 A tuple <SID, s> is said to contain a sequence a, if a is a
subsequence of s
 The support of a sequence α in a sequence database S is the
number of tuples in the database containing α
 Given the minimum support threshold, a sequence a is frequent in
sequence database S if support S(a) >= min sup
 A frequent sequence is called a sequential pattern
 A sequential pattern with length l is called an l-pattern
4
Sequential Patterns
 Length of Sequence 1 – 9
 It contains ‘a’ multiple times but sequence 1 will contribute only one
to the support of <a>
 Sequence <a(bc)df> - Subsequence of Sequence 1
 Support for <(ab)c> is 2 (Present in 1 and 3)
 Frequent as it satisfies minimum support of 2
5
SID Sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>
Scalable Methods for Mining
Sequential Patterns
 Full Set Vs Closed Set
 A sequential patterns s is closed if there exists no s’ where s’ is a proper
super-sequence of s and s’ has same support as s
 Subsequences of a frequent sequence are also ferquent
 Mining closed sequential patterns avoids generation of unnecessary sub-sequences
 GSP – Candidate generate and test approach on horizontal data
format
 SPADE – Candidate generate and test approach on vertical data
format
 PrefixScan – does not require candidate generation
 All approaches exploit Apriori property – every non empty
subsequence of a sequential pattern is a sequential pattern
6
GSP: Sequential Pattern Mining
 GSP (Generalized Sequential Patterns)
 Multi-pass, Candidate generate and test approach proposed by Agrawal
and Srikant
 Outline of the method
 Initially, every item in DB is a candidate of length-1
 for each level (i.e., sequences of length-k) do
 Scan database to collect support count for each candidate sequence
 Generate candidate length-(k+1) sequences from length-k frequent
sequences using Apriori
 A (k-1)-sequence w1 is merged with another (k-1)-sequence w2 to produce a
candidate k-sequence if the subsequence obtained by removing the first event in w1
is the same as the subsequence obtained by removing the last event in w2
 repeat until no frequent sequence or no candidate can be found
 Major strength: Candidate pruning by Apriori
 Weakness : Generates large number of candidates
7
Generalized Sequential Patterns
8
 Initial candidates: all singleton sequences
 <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
 Scan database once, count support for candidates
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
Cand Sup
<a> 4
<b> 4
<c> 1
<d> 2
<e> 1
<f> 4
<g> 1
<h> 1
Generalized Sequential Patterns
9
Cand Sup
<a> 4
<b> 4
<d> 2
<f> 4
Length 2 Candidates generated by joinLength 2 Candidates generated by join
<aa> <ab> <ad> <af> <ba> <bb> <bd> <bf>
<da> <db><dd> <df> <fa> <fb> <fd> <ff>
<(ab)> <(ad)> <(af)> <(bd)> <(bf)> <(df)>
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
Length 2 Frequent SequencesLength 2 Frequent Sequences
<ba> <da> <db> <df> <fa>
<(ab)> <(af)> <(bf)>
Generalized Sequential Patterns
10
Length 2 Frequent SequencesLength 2 Frequent Sequences
<ba> <da> <db> <df> <fa>
<(ab)> <(af)> <(bf)>
Length 3 Candidates generated by joinLength 3 Candidates generated by join
<ba> and <(ab)> - <b(ab)> {1}
<ba> and <(af)> - <b(af)> {1}
<da> and <(ab)> - <d(ab)> {1}
<da> and <(af)> - <d(af)> {1}
<db> and <(bf)> - <d(bf)> {1, 4}
<db> and <ba> - <dba> {1, 4}
<df> and <fa> - <dfa> {1, 4}
<fa> and <(ab)> - <f(ab)> -
<fa> and <(af)> - <f(af)> {1}
<(ab)> and <(bf)> - <(abf)> {1,2,3}
<(ab)> and <ba> - <(ab)a> {1}
<(af)> and <fa> - <(af)a) {1}
<(bf)> and <fa> - <(bf)a> {1, 4}
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
Length 3 Frequent SequencesLength 3 Frequent Sequences
<dba> <dfa> <(abf)> <(bf)a> <d(bf)>
Length 4 Candidates generated by joinLength 4 Candidates generated by join
<d(bf)> and <(bf)a> - <d(bf)a> {1, 4}
<(abf)> and <(bf)a> - <(abf)a> {1}
SPADE: Sequential Pattern Mining on
Vertical data format
 SPADE – Sequential PAttern Discovery using Equivalent classes
 Vertical data format <itemset: (sequence_ID, event_ID)>
 ID_list : Set of (sequence_ID, event_ID) pairs for a given itemset
 Mapping from horizontal to vertical format requires one scan
 Support of k-sequences can be determined by joining the ID lists of
(k-1) sequences
 To find candidate 2-sequences, all single items are joined if they are
frequent, if they share the same sequence identifier and if their event
identifier follows a sequential ordering
 Patterns are grown similarly
11
Vertical Data Format
12
SPADE - Example
13
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
SID EID Itemset
1 1 cd
1 2 abc
1 3 abf
1 4 acdf
2 1 abf
2 2 e
3 1 abf
4 1 dgh
4 2 bf
4 3 agh
Vertical Data Format
SPADE - Example
14
SID EID Itemset
1 1 cd
1 2 abc
1 3 abf
1 4 acdf
2 1 abf
2 2 e
3 1 abf
4 1 dgh
4 2 bf
4 3 agh
Vertical Data Format
SID EID
1 2
1 3
1 4
2 1
3 1
4 3
a
SID EID
1 2
1 3
2 1
3 1
4 2
b
SID EID
1 1
1 2
1 4
c
SID EID
1 1
1 4
4 1
d
SID EID
2 2
e
SID EID
1 3
1 4
2 1
3 1
4 2
f
SID EID
4 1
4 3
g
SID EID
4 1
4 3
h (ab) – 1,2; 1,3; 2,1;
3,1
(ad) – 1,4
(af) – 1,4; 2,1;3,1
(bd) – NIL
(bf) – 1,3; 2,1; 3,1;4,2
(df) – 1,4
SPADE - Example
15
SID EID
1 2
1 3
1 4
2 1
3 1
4 3
a
SID EID
1 2
1 3
2 1
3 1
4 2
b
SID EID
1 1
1 4
4 1
d
SID EID
1 3
1 4
2 1
3 1
4 2
f
SID EID(a) EID(a)
1 2 3
1 3 4
1 2 4
aa
SID EID(a) EID(b)
1 2 3
ab
SID EID(a) EID(d)
1 2 4
1 3 4
ad
SID EID(a) EID(b)
1 2 3
1 3 4
af
SID EID(b) EID(a)
1 2 3
1 3 4
4 2 3
ba
SID EID(b) EID(b)
1 2 3
bb
SID EID(b) EID(d)
1 2 4
1 3 4
bd
SID EID(b) EID(f)
1 2 3
1 3 4
bf
SPADE - Example
16
SID EID
1 2
1 3
1 4
2 1
3 1
4 3
a
SID EID
1 2
1 3
2 1
3 1
4 2
b
SID EID
1 1
1 4
4 1
d
SID EID
1 3
1 4
2 1
3 1
4 2
f
SID EID(d) EID(a)
1 1 2
1 1 3
1 1 4
4 1 3
da
SID EID(d) EID(b)
1 1 2
1 1 3
4 1 2
db
SID EID(d) EID(d)
1 1 4
dd
SID EID(d) EID(f)
1 1 3
1 1 4
4 1 2
df
SID EID(f) EID(a)
1 3 4
4 2 3
fa
SID EID(f) EID(b)
- - -
fb
SID EID(f) EID(d)
1 3 4
fd
SID EID(f) EID(f)
1 3 4
ff
SPADE
 Reduces scans of the sequence database
 As the length of the frequent sequence increases, the
size of ID_list decreases – results in fast joins
 But large set of candidates are still generated
17
PrefixSpan: Prefix-Projected Sequential
Pattern Growth
 Pattern Growth – does not require candidate
generation
 Finds frequent single items
 Constructs FP-tree
 Projected databases associated with each frequent
item are generated from FP-tree
 Builds prefix patterns which it concatenates with suffix
patterns to find frequent patterns
18
PrefixSpan
 Given a sequence α = <e1e2…en> (where each ei corresponds to a
frequent event), a sequence β = <e’1e’2…e’m> (m<=n) is called a
prefix of α iff
 e'i = ei for i<=m-1
 e'm ⊆ em
 All frequent items in (em– e’m) are alphabetically after those in e’m
 Sequence γ = <e”m em+1…en> is called the suffix of α wrt prefix β
denoted as γ = α/β where e”m = (em – e’m )
19
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Prefix Suffix (Prefix-Based Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<a(ab)> <(_c)(ac)d(cf)>
PrefixSpan
 Mining Sequential patterns:
 {<x1>,<x2>,…<xn>} – complete set of length-1 sequential patterns.
 The complete set of Sequential patterns in S can be partitioned into n disjoint
subsets.
 ith
subset is the set of sequential patterns with prefix <xi>
 α : length-l sequential pattern and {β1, β2,… βm} – set of all length
(l+1) sequential patterns with prefix α.
 The complete set of sequential patterns with α as a prefix can be partitioned
into m disjoint subsets
 jth
subset is the set of patterns prefixed with βj
20
PrefixSpan Example
 Step 1: find length-1 sequential patterns
 <a>, <b>, <c>, <d>, <e>, <f>
 Step 2: divide search space. The complete set of seq.
pat. can be partitioned into 6 subsets:
 The ones having prefix <a>;
 The ones having prefix <b>;
 …
 The ones having prefix <f>
21
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
PrefixSpan : Example
 Only need to consider projections w.r.t. <a>
 <a>-projected database: (only first occurrence of a is considered) <(abc)
(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
 In <a> projected database – frequent items are a:2, b:4, _b:2, c:4, d:2
and f:2
 Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>,
<(ab)>, <ac>, <ad>, <af>
 Further partition into 6 subsets
 Having prefix <aa> - {< (_bc)(ac)d(cf)>, <(_e)>} – No frequent subsequences
 Having prefix <(ab)> - <(_c)(ac)d(cf)> and <(df)cb>
 Frequent patterns: <c> <d> <f> <dc> : <(ab)c> <(ab)d> <(ab)f> <(ab)dc>
 Having prefix <ac> - <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
 Frequent patterns: <a> <b> <c> : <aca> <acb> <acc>
….
22
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
PrefixSpan
 No candidate sequence needs to be generated
 Projected databases keep shrinking
 Major cost of PrefixSpan: constructing projected databases
 Can be improved by pseudo-projections
 Pseudo-projection
 Registers the index of the corresponding sequence and the starting
position of the projected suffix in the sequence instead of physical
projection
 Avoids physically copying postfixes
 Efficient in running time and space when database can be held in main
memory
 For large data combination of physical and pseudo projection
23
Sequential Pattern Mining
 Performance rating : PrefixSpan, SPADE, GSP
 All three are slow when there is a large number of frequent
subsequences
 Closed Sequential Patterns
 Closed Subsequences – contain no super sequence with the same support
 Reduces number of sequences considered
 CloSpan
 Based on equivalence of projected databases
 Two projected sequence databases are equivalent iff the total number of items match
 CloSpan prunes non-closed sequences whenever two projected databases are exactly the same by
stopping the growth of one
 Requires Post-processing to eliminate any remaining non-closed sequential patterns
 BIDE (Bidirectional Search) algorithm avoids additional checking
24
Sequential Pattern Mining
 Multi-dimensional, multi-level Sequential patterns
 Additional information maybe associated with Sequence ID –
Customer age, address, group and profession
 Additional information associated with items – item category, brand,
model type, model number, place…
 Example: “Retired customers who purchase a digital camera are
likely to purchase a color printer within a month”
 Additional information can be attached with Sequence ID / Item ID
25
Constraint based Mining of
Sequential Patterns
 Un-focused mining reduces the efficiency and usability of
frequent-pattern mining
 Constraint based mining incorporates user-specified
constraints to reduce the search space
 Regular expressions can be used to specify pattern templates
 Helps to improve efficiency of mining and interestingness of
patterns
26
Constraint based Mining
 Constraints related to duration T
 Constraints related to maximal or minimal length – anti-
monotonic / monotonic constraints
 Anti-monotonic constraint: T<= 10
 Monotonic : T > 10
 Succinct : T = 2005
 Periodic patterns – related to sets of partitioned sequences such
as every two weeks before and after an earthquake
27
Constraint based Mining
 Event folding window w
 specifies the periodicity for treating events as occurring together
 w=0 – No event sequence folding
 w=T – time-insensitive frequent patterns
 Gap between events
 Gap=0 – Strictly consecutive sequential patterns
 min_gap and max_gap
 Exact gaps and approximate gaps
 Serial Episodes
 Set of events occurring in total order
 Parallel Episodes
 Occurrence order is trivial
 Examples
 (A|B)C*(D|E) – A and B first occur (relative order is unimportant) followed by
any number of C events followed by D and E (in any order)
 C = <a*{bb|(bc)d|dd}> a-projected databases followed by SuffixSpan
28
Periodicity Analysis for Time-Related
Sequence data
 Periodicity analysis – mining of periodic patterns – searching for
recurring patterns in time-related sequence data
 Seasons, tides, traffic patterns, power consumption
 Often performed over time-series data
 Full periodic pattern
 Every point in time contributes (precisely or approximately) to the periodicity
 Example: All days in a year – contribute to season cycle
 Partial periodic pattern
 Specifies periodic behavior of a time-related sequence at some but not all points
of time
 Example: ABC reads the paper between 7:00-7:30 am every week day
29
Periodicity Analysis
 Synchronous periodicity – event occurs at a relatively fixed offset in
each “stable” period
 3 pm every day
 Asynchronous periodicity – event fluctuates in loosely defined
period
 Precise or approximate – depending on data value or offset within a
period
 Mining partial periodicity – leads to the discovery of cyclic or
periodic association rules (rules that associate a set of events that
occur periodically)
 Example: If tea sells well between 3 – 5 pm dinner will also sell well
between 7 – 9 pm on weekends
30

More Related Content

What's hot

Data cube computation
Data cube computationData cube computation
Data cube computation
Rashmi Sheikh
 
Sequential pattern mining
Sequential pattern miningSequential pattern mining
Sequential pattern mining
kiran said
 

What's hot (20)

Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
 
Graph mining ppt
Graph mining pptGraph mining ppt
Graph mining ppt
 
Type Checking(Compiler Design) #ShareThisIfYouLike
Type Checking(Compiler Design) #ShareThisIfYouLikeType Checking(Compiler Design) #ShareThisIfYouLike
Type Checking(Compiler Design) #ShareThisIfYouLike
 
Mining single dimensional boolean association rules from transactional
Mining single dimensional boolean association rules from transactionalMining single dimensional boolean association rules from transactional
Mining single dimensional boolean association rules from transactional
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
Lect12 graph mining
Lect12 graph miningLect12 graph mining
Lect12 graph mining
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Data cube computation
Data cube computationData cube computation
Data cube computation
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
 
Sequential pattern mining
Sequential pattern miningSequential pattern mining
Sequential pattern mining
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
 
Sequential Pattern Mining and GSP
Sequential Pattern Mining and GSPSequential Pattern Mining and GSP
Sequential Pattern Mining and GSP
 

Similar to 5.3 mining sequential patterns

ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .ppt
ChellamuthuHaripriya
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
Shani729
 
Wireless sensor network Apriori an N-RMP
Wireless sensor network Apriori an N-RMP Wireless sensor network Apriori an N-RMP
Wireless sensor network Apriori an N-RMP
Amrit Khandelwal
 

Similar to 5.3 mining sequential patterns (20)

My6asso
My6assoMy6asso
My6asso
 
ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .ppt
 
03
0303
03
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Cs501 mining frequentpatterns
Cs501 mining frequentpatternsCs501 mining frequentpatterns
Cs501 mining frequentpatterns
 
Mining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential PatternsMining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential Patterns
 
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
 
SPADE -
SPADE - SPADE -
SPADE -
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012
 
presentation
presentationpresentation
presentation
 
Declare Your Language: Virtual Machines & Code Generation
Declare Your Language: Virtual Machines & Code GenerationDeclare Your Language: Virtual Machines & Code Generation
Declare Your Language: Virtual Machines & Code Generation
 
PrefixSpan With Spark at Akanoo
PrefixSpan With Spark at AkanooPrefixSpan With Spark at Akanoo
PrefixSpan With Spark at Akanoo
 
Seminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mmeSeminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mme
 
Workshop on command line tools - day 2
Workshop on command line tools - day 2Workshop on command line tools - day 2
Workshop on command line tools - day 2
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
 
Wireless sensor network Apriori an N-RMP
Wireless sensor network Apriori an N-RMP Wireless sensor network Apriori an N-RMP
Wireless sensor network Apriori an N-RMP
 

More from Krish_ver2

More from Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
 

Recently uploaded

Recently uploaded (20)

Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 

5.3 mining sequential patterns

  • 2. Sequential Pattern Mining  Sequence database – consists of sequences of ordered elements or events (with or without time)  Sequential Pattern Mining is the mining of frequently occurring ordered events or subsequences as patterns  Example:  Customer shopping sequences:  First buy computer, then CD-ROM, and then digital camera, within 3 months.  Web access patterns, Weather prediction  Usually categorical or symbolic data  Numeric data analysis – Time Series Analysis 2
  • 3. Sequential Pattern Mining  I = {I1, I2, …Ip} – Set of items  Sequence s = <e1 e2 e3… el>  Ordered list of events  Each event is an element of the sequence  In Shopping databases, event – shopping trip in which customers bought items (x1 x2 …xq)  All trips by customer form a sequence  Item can occur at most once in an event, but several times in a sequence  Sequence with length l : l-sequence  A sequence α = <a1a2…an> is a subsequence of β = <b1b2..bm> denoted as α ⊆ β if there exists integers j1, j2, … jn between 1 and m such that a1 ⊆ bj1, a2 ⊆ bj2,… an ⊆ bjn  α = <(ab),d> and β = <(abc),(de)> α is a sub-sequence of β 3
  • 4. Sequential Pattern Mining  A sequence database, S, is a set of tuples, <SID,s>, where SID is a sequence ID and s is a sequence  A tuple <SID, s> is said to contain a sequence a, if a is a subsequence of s  The support of a sequence α in a sequence database S is the number of tuples in the database containing α  Given the minimum support threshold, a sequence a is frequent in sequence database S if support S(a) >= min sup  A frequent sequence is called a sequential pattern  A sequential pattern with length l is called an l-pattern 4
  • 5. Sequential Patterns  Length of Sequence 1 – 9  It contains ‘a’ multiple times but sequence 1 will contribute only one to the support of <a>  Sequence <a(bc)df> - Subsequence of Sequence 1  Support for <(ab)c> is 2 (Present in 1 and 3)  Frequent as it satisfies minimum support of 2 5 SID Sequence 1 <a(abc)(ac)d(cf)> 2 <(ad)c(bc)(ae)> 3 <(ef)(ab)(df)cb> 4 <eg(af)cbc>
  • 6. Scalable Methods for Mining Sequential Patterns  Full Set Vs Closed Set  A sequential patterns s is closed if there exists no s’ where s’ is a proper super-sequence of s and s’ has same support as s  Subsequences of a frequent sequence are also ferquent  Mining closed sequential patterns avoids generation of unnecessary sub-sequences  GSP – Candidate generate and test approach on horizontal data format  SPADE – Candidate generate and test approach on vertical data format  PrefixScan – does not require candidate generation  All approaches exploit Apriori property – every non empty subsequence of a sequential pattern is a sequential pattern 6
  • 7. GSP: Sequential Pattern Mining  GSP (Generalized Sequential Patterns)  Multi-pass, Candidate generate and test approach proposed by Agrawal and Srikant  Outline of the method  Initially, every item in DB is a candidate of length-1  for each level (i.e., sequences of length-k) do  Scan database to collect support count for each candidate sequence  Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori  A (k-1)-sequence w1 is merged with another (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2  repeat until no frequent sequence or no candidate can be found  Major strength: Candidate pruning by Apriori  Weakness : Generates large number of candidates 7
  • 8. Generalized Sequential Patterns 8  Initial candidates: all singleton sequences  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>  Scan database once, count support for candidates Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> Cand Sup <a> 4 <b> 4 <c> 1 <d> 2 <e> 1 <f> 4 <g> 1 <h> 1
  • 9. Generalized Sequential Patterns 9 Cand Sup <a> 4 <b> 4 <d> 2 <f> 4 Length 2 Candidates generated by joinLength 2 Candidates generated by join <aa> <ab> <ad> <af> <ba> <bb> <bd> <bf> <da> <db><dd> <df> <fa> <fb> <fd> <ff> <(ab)> <(ad)> <(af)> <(bd)> <(bf)> <(df)> Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> Length 2 Frequent SequencesLength 2 Frequent Sequences <ba> <da> <db> <df> <fa> <(ab)> <(af)> <(bf)>
  • 10. Generalized Sequential Patterns 10 Length 2 Frequent SequencesLength 2 Frequent Sequences <ba> <da> <db> <df> <fa> <(ab)> <(af)> <(bf)> Length 3 Candidates generated by joinLength 3 Candidates generated by join <ba> and <(ab)> - <b(ab)> {1} <ba> and <(af)> - <b(af)> {1} <da> and <(ab)> - <d(ab)> {1} <da> and <(af)> - <d(af)> {1} <db> and <(bf)> - <d(bf)> {1, 4} <db> and <ba> - <dba> {1, 4} <df> and <fa> - <dfa> {1, 4} <fa> and <(ab)> - <f(ab)> - <fa> and <(af)> - <f(af)> {1} <(ab)> and <(bf)> - <(abf)> {1,2,3} <(ab)> and <ba> - <(ab)a> {1} <(af)> and <fa> - <(af)a) {1} <(bf)> and <fa> - <(bf)a> {1, 4} Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> Length 3 Frequent SequencesLength 3 Frequent Sequences <dba> <dfa> <(abf)> <(bf)a> <d(bf)> Length 4 Candidates generated by joinLength 4 Candidates generated by join <d(bf)> and <(bf)a> - <d(bf)a> {1, 4} <(abf)> and <(bf)a> - <(abf)a> {1}
  • 11. SPADE: Sequential Pattern Mining on Vertical data format  SPADE – Sequential PAttern Discovery using Equivalent classes  Vertical data format <itemset: (sequence_ID, event_ID)>  ID_list : Set of (sequence_ID, event_ID) pairs for a given itemset  Mapping from horizontal to vertical format requires one scan  Support of k-sequences can be determined by joining the ID lists of (k-1) sequences  To find candidate 2-sequences, all single items are joined if they are frequent, if they share the same sequence identifier and if their event identifier follows a sequential ordering  Patterns are grown similarly 11
  • 13. SPADE - Example 13 Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> SID EID Itemset 1 1 cd 1 2 abc 1 3 abf 1 4 acdf 2 1 abf 2 2 e 3 1 abf 4 1 dgh 4 2 bf 4 3 agh Vertical Data Format
  • 14. SPADE - Example 14 SID EID Itemset 1 1 cd 1 2 abc 1 3 abf 1 4 acdf 2 1 abf 2 2 e 3 1 abf 4 1 dgh 4 2 bf 4 3 agh Vertical Data Format SID EID 1 2 1 3 1 4 2 1 3 1 4 3 a SID EID 1 2 1 3 2 1 3 1 4 2 b SID EID 1 1 1 2 1 4 c SID EID 1 1 1 4 4 1 d SID EID 2 2 e SID EID 1 3 1 4 2 1 3 1 4 2 f SID EID 4 1 4 3 g SID EID 4 1 4 3 h (ab) – 1,2; 1,3; 2,1; 3,1 (ad) – 1,4 (af) – 1,4; 2,1;3,1 (bd) – NIL (bf) – 1,3; 2,1; 3,1;4,2 (df) – 1,4
  • 15. SPADE - Example 15 SID EID 1 2 1 3 1 4 2 1 3 1 4 3 a SID EID 1 2 1 3 2 1 3 1 4 2 b SID EID 1 1 1 4 4 1 d SID EID 1 3 1 4 2 1 3 1 4 2 f SID EID(a) EID(a) 1 2 3 1 3 4 1 2 4 aa SID EID(a) EID(b) 1 2 3 ab SID EID(a) EID(d) 1 2 4 1 3 4 ad SID EID(a) EID(b) 1 2 3 1 3 4 af SID EID(b) EID(a) 1 2 3 1 3 4 4 2 3 ba SID EID(b) EID(b) 1 2 3 bb SID EID(b) EID(d) 1 2 4 1 3 4 bd SID EID(b) EID(f) 1 2 3 1 3 4 bf
  • 16. SPADE - Example 16 SID EID 1 2 1 3 1 4 2 1 3 1 4 3 a SID EID 1 2 1 3 2 1 3 1 4 2 b SID EID 1 1 1 4 4 1 d SID EID 1 3 1 4 2 1 3 1 4 2 f SID EID(d) EID(a) 1 1 2 1 1 3 1 1 4 4 1 3 da SID EID(d) EID(b) 1 1 2 1 1 3 4 1 2 db SID EID(d) EID(d) 1 1 4 dd SID EID(d) EID(f) 1 1 3 1 1 4 4 1 2 df SID EID(f) EID(a) 1 3 4 4 2 3 fa SID EID(f) EID(b) - - - fb SID EID(f) EID(d) 1 3 4 fd SID EID(f) EID(f) 1 3 4 ff
  • 17. SPADE  Reduces scans of the sequence database  As the length of the frequent sequence increases, the size of ID_list decreases – results in fast joins  But large set of candidates are still generated 17
  • 18. PrefixSpan: Prefix-Projected Sequential Pattern Growth  Pattern Growth – does not require candidate generation  Finds frequent single items  Constructs FP-tree  Projected databases associated with each frequent item are generated from FP-tree  Builds prefix patterns which it concatenates with suffix patterns to find frequent patterns 18
  • 19. PrefixSpan  Given a sequence α = <e1e2…en> (where each ei corresponds to a frequent event), a sequence β = <e’1e’2…e’m> (m<=n) is called a prefix of α iff  e'i = ei for i<=m-1  e'm ⊆ em  All frequent items in (em– e’m) are alphabetically after those in e’m  Sequence γ = <e”m em+1…en> is called the suffix of α wrt prefix β denoted as γ = α/β where e”m = (em – e’m ) 19 <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)> Prefix Suffix (Prefix-Based Projection) <a> <(abc)(ac)d(cf)> <aa> <(_bc)(ac)d(cf)> <a(ab)> <(_c)(ac)d(cf)>
  • 20. PrefixSpan  Mining Sequential patterns:  {<x1>,<x2>,…<xn>} – complete set of length-1 sequential patterns.  The complete set of Sequential patterns in S can be partitioned into n disjoint subsets.  ith subset is the set of sequential patterns with prefix <xi>  α : length-l sequential pattern and {β1, β2,… βm} – set of all length (l+1) sequential patterns with prefix α.  The complete set of sequential patterns with α as a prefix can be partitioned into m disjoint subsets  jth subset is the set of patterns prefixed with βj 20
  • 21. PrefixSpan Example  Step 1: find length-1 sequential patterns  <a>, <b>, <c>, <d>, <e>, <f>  Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:  The ones having prefix <a>;  The ones having prefix <b>;  …  The ones having prefix <f> 21 SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc>
  • 22. PrefixSpan : Example  Only need to consider projections w.r.t. <a>  <a>-projected database: (only first occurrence of a is considered) <(abc) (ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>  In <a> projected database – frequent items are a:2, b:4, _b:2, c:4, d:2 and f:2  Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>  Further partition into 6 subsets  Having prefix <aa> - {< (_bc)(ac)d(cf)>, <(_e)>} – No frequent subsequences  Having prefix <(ab)> - <(_c)(ac)d(cf)> and <(df)cb>  Frequent patterns: <c> <d> <f> <dc> : <(ab)c> <(ab)d> <(ab)f> <(ab)dc>  Having prefix <ac> - <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>  Frequent patterns: <a> <b> <c> : <aca> <acb> <acc> …. 22 SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc>
  • 23. PrefixSpan  No candidate sequence needs to be generated  Projected databases keep shrinking  Major cost of PrefixSpan: constructing projected databases  Can be improved by pseudo-projections  Pseudo-projection  Registers the index of the corresponding sequence and the starting position of the projected suffix in the sequence instead of physical projection  Avoids physically copying postfixes  Efficient in running time and space when database can be held in main memory  For large data combination of physical and pseudo projection 23
  • 24. Sequential Pattern Mining  Performance rating : PrefixSpan, SPADE, GSP  All three are slow when there is a large number of frequent subsequences  Closed Sequential Patterns  Closed Subsequences – contain no super sequence with the same support  Reduces number of sequences considered  CloSpan  Based on equivalence of projected databases  Two projected sequence databases are equivalent iff the total number of items match  CloSpan prunes non-closed sequences whenever two projected databases are exactly the same by stopping the growth of one  Requires Post-processing to eliminate any remaining non-closed sequential patterns  BIDE (Bidirectional Search) algorithm avoids additional checking 24
  • 25. Sequential Pattern Mining  Multi-dimensional, multi-level Sequential patterns  Additional information maybe associated with Sequence ID – Customer age, address, group and profession  Additional information associated with items – item category, brand, model type, model number, place…  Example: “Retired customers who purchase a digital camera are likely to purchase a color printer within a month”  Additional information can be attached with Sequence ID / Item ID 25
  • 26. Constraint based Mining of Sequential Patterns  Un-focused mining reduces the efficiency and usability of frequent-pattern mining  Constraint based mining incorporates user-specified constraints to reduce the search space  Regular expressions can be used to specify pattern templates  Helps to improve efficiency of mining and interestingness of patterns 26
  • 27. Constraint based Mining  Constraints related to duration T  Constraints related to maximal or minimal length – anti- monotonic / monotonic constraints  Anti-monotonic constraint: T<= 10  Monotonic : T > 10  Succinct : T = 2005  Periodic patterns – related to sets of partitioned sequences such as every two weeks before and after an earthquake 27
  • 28. Constraint based Mining  Event folding window w  specifies the periodicity for treating events as occurring together  w=0 – No event sequence folding  w=T – time-insensitive frequent patterns  Gap between events  Gap=0 – Strictly consecutive sequential patterns  min_gap and max_gap  Exact gaps and approximate gaps  Serial Episodes  Set of events occurring in total order  Parallel Episodes  Occurrence order is trivial  Examples  (A|B)C*(D|E) – A and B first occur (relative order is unimportant) followed by any number of C events followed by D and E (in any order)  C = <a*{bb|(bc)d|dd}> a-projected databases followed by SuffixSpan 28
  • 29. Periodicity Analysis for Time-Related Sequence data  Periodicity analysis – mining of periodic patterns – searching for recurring patterns in time-related sequence data  Seasons, tides, traffic patterns, power consumption  Often performed over time-series data  Full periodic pattern  Every point in time contributes (precisely or approximately) to the periodicity  Example: All days in a year – contribute to season cycle  Partial periodic pattern  Specifies periodic behavior of a time-related sequence at some but not all points of time  Example: ABC reads the paper between 7:00-7:30 am every week day 29
  • 30. Periodicity Analysis  Synchronous periodicity – event occurs at a relatively fixed offset in each “stable” period  3 pm every day  Asynchronous periodicity – event fluctuates in loosely defined period  Precise or approximate – depending on data value or offset within a period  Mining partial periodicity – leads to the discovery of cyclic or periodic association rules (rules that associate a set of events that occur periodically)  Example: If tea sells well between 3 – 5 pm dinner will also sell well between 7 – 9 pm on weekends 30