Sequence mining algorithm

Monica Dăgădiţă
ISI
• Introduction to sequence mining
• Why sequence mining?
• Sequence mining algorithms
• SPADE
    ◦ Motivation
    ◦ Definitions and examples
    ◦ Algorithm
    ◦ Implementation




                     Data Mining   11/8/2011   2
• Aim: finding statistically relevant patterns between data examples
  where the values are delivered in a sequence

• Originally introduced for market basket analysis
  (customer behaviour prediction)

• 2 types of sequence mining:
    ◦ string mining: biology (gene/protein sequences)
    ◦ itemset mining: marketing and CRM applications

• Discovering patterns:
    ◦ Bookstore: 70% of the people who buy Jane Austen's
      "Pride and Prejudice" also buy "Emma" within a month
    ◦ Website: finding the most frequent sequences of accessed pages

• Usage:
    ◦ Promotions
    ◦ Shelf placement
    ◦ Website restructuring
    ◦ Recommender systems

• Apriori
• GSP (Generalized Sequential Pattern)
• FreeSpan (Frequent pattern-projected Sequential pattern mining)
• PrefixSpan (Prefix-projected Sequential pattern mining)
• SPADE (Sequential PAttern Discovery using Equivalence classes)




• Problems of existing solutions:
    ◦ Repeated database scans
    ◦ Complex internal data structures

• Key features of SPADE:
    ◦ Fixed number of database scans
    ◦ Vertical id-list database format
    ◦ Decomposition of the search space into smaller pieces
      that are processed independently




• Items: the alphabet is a set of m distinct items
   I = {i1, i2, …, im}
• Event: non-empty, unordered collection of items (an itemset)
   (i1 i2 … ik)
• Sequence: ordered list of events
   <e1 -> e2 -> … -> en>
• k-sequence: sequence with k items in total
   (B->AC) is a 3-sequence



• Subsequence: given two sequences α = <a_1 -> a_2 -> … -> a_n>
  and β = <b_1 -> b_2 -> … -> b_m>, α is called a subsequence of β,
  denoted α ⊆ β, if there exist integers 1 ≤ j_1 < j_2 < … < j_n ≤ m
  such that a_1 ⊆ b_{j_1}, a_2 ⊆ b_{j_2}, …, a_n ⊆ b_{j_n}

• Examples:
   1. (B->AC) is a subsequence of (AB->E->ACD), since B ⊆ AB and AC ⊆ ACD
   2. (AB->E) is not a subsequence of (ABE), since (ABE) is a single event
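
The definition above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides. It assumes single-letter item names and the slides' `->` notation; the helper names `parse`, `is_subsequence` and `length` are hypothetical.

```python
from typing import FrozenSet, List

Event = FrozenSet[str]
Sequence = List[Event]

def parse(text: str) -> Sequence:
    """Parse the slides' notation, e.g. 'AB->E->ACD', into a list of events.

    Assumption: every item is a single letter.
    """
    return [frozenset(event) for event in text.split("->")]

def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Greedy left-to-right match: each event of alpha must be a subset of
    an event of beta that comes strictly after the previous match."""
    j = 0
    for a in alpha:
        # Skip events of beta that do not contain the whole event a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha needs a later event of beta
    return True

def length(seq: Sequence) -> int:
    """k for a k-sequence: total number of items over all events."""
    return sum(len(event) for event in seq)

example_1 = is_subsequence(parse("B->AC"), parse("AB->E->ACD"))
example_2 = is_subsequence(parse("AB->E"), parse("ABE"))
k = length(parse("B->AC"))
```

Matching each event at the earliest possible position is enough here, because an earlier match never rules out a later one.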




• Id-lists of the frequent items (1-sequences)
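
A sketch of how such id-lists might be built, assuming the input arrives as (sid, eid, items) rows with one row per event. The rows below are illustrative data chosen for this example, not taken from the slides, and the names `build_id_lists` and `support` are hypothetical.

```python
from collections import defaultdict

# Illustrative input (assumption): one (sid, eid, items) row per event.
db = [
    (1, 10, "CD"), (1, 15, "ABC"), (1, 20, "ABF"), (1, 25, "ACDF"),
    (2, 15, "ABF"), (2, 20, "E"),
    (3, 10, "ABF"),
    (4, 10, "DGH"), (4, 20, "BF"), (4, 25, "AGH"),
]

def build_id_lists(rows):
    """Vertical format: map each item to the (sid, eid) pairs where it occurs."""
    id_lists = defaultdict(list)
    for sid, eid, items in rows:
        for item in items:
            id_lists[item].append((sid, eid))
    return dict(id_lists)

def support(id_list):
    """Support = number of distinct input sequences (sids) in the id-list."""
    return len({sid for sid, _ in id_list})

id_lists = build_id_lists(db)
min_sup = 2
frequent_items = {
    item: id_list
    for item, id_list in id_lists.items()
    if support(id_list) >= min_sup
}
```

With this data and `min_sup = 2`, the frequent items are A, B, D and F; for example, the id-list of D is `[(1, 10), (1, 25), (4, 10)]`.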




• D->BF->A
    ◦ Step 1: D->B
    ◦ Step 2: D->BF
    ◦ Step 3: D->BF->A

• Keeping an eid column for every item of the sequence is not space-efficient
    ◦ Solution: keep only 2 columns, (sid, eid), for each sequence
    ◦ sid: id of the input sequence
    ◦ eid: event id of the sequence's last item


• D->BF->A (space-efficient id-list joins)

   D->B
   SID   EID
   1     15
   1     20
   4     20

   D->BF
   SID   EID
   1     20
   4     20

   D->BF->A
   SID   EID
   1     25
   4     25
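
The three (sid, eid) tables above can be reproduced with two kinds of id-list join. This is a minimal Python sketch under stated assumptions: the item id-lists for A, B, D and F are illustrative values chosen to be consistent with the tables, and the function names `temporal_join` and `equality_join` are hypothetical. As a simplification, the last step joins against the id-list of the single item A.

```python
# Illustrative item id-lists (assumption), as (sid, eid) pairs.
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
D = [(1, 10), (1, 25), (4, 10)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]

def temporal_join(first, second):
    """Id-list of 'first -> second': keep the (sid, eid) pairs of `second`
    that occur after some occurrence of `first` in the same sid."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and earliest[sid] < eid]

def equality_join(left, right):
    """Id-list of two patterns ending in the same event:
    keep the (sid, eid) pairs present in both lists."""
    shared = set(right)
    return [pair for pair in left if pair in shared]

d_b = temporal_join(D, B)        # Step 1: D->B
d_f = temporal_join(D, F)        # helper list: D->F
d_bf = equality_join(d_b, d_f)   # Step 2: D->BF
d_bf_a = temporal_join(d_bf, A)  # Step 3: D->BF->A
```

Only the eid of the last event is carried forward, which is why two columns are enough at every step.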


• Complete lattice representation




• Decomposing the lattice => smaller pieces that can be solved independently

• Equivalence classes
    ◦ 2 sequences are in the same class (Θk) if they share
      a common prefix of length k
    ◦ Example: k = 1: Θ1 -> {[A], [B], [D], [F]}
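
A small Python sketch of the prefix-based partition described above. It assumes single-letter items and the slides' `->` notation; the sample sequences and the helper names `items_in_order`, `prefix` and `equivalence_classes` are hypothetical.

```python
from collections import defaultdict

def items_in_order(seq: str) -> list:
    """Flatten 'D->BF->A' into tokens ['D', '->B', 'F', '->A'].

    A leading '->' marks the first item of a new event, so the first k
    tokens are exactly the k-item prefix of the sequence.
    """
    tokens = []
    for n, event in enumerate(seq.split("->")):
        for m, item in enumerate(sorted(event)):
            tokens.append(("->" if n > 0 and m == 0 else "") + item)
    return tokens

def prefix(seq: str, k: int) -> str:
    """The k-item prefix of a sequence, in the same notation."""
    return "".join(items_in_order(seq)[:k])

def equivalence_classes(seqs, k):
    """Group sequences with at least k items by their common k-item prefix."""
    classes = defaultdict(list)
    for seq in seqs:
        if len(items_in_order(seq)) >= k:
            classes[prefix(seq, k)].append(seq)
    return dict(classes)

sequences = ["D->B", "D->BF", "D->BF->A", "B->A", "BF->A"]
by_first_item = equivalence_classes(sequences, 1)
by_two_items = equivalence_classes(sequences, 2)
```

Each class can then be mined on its own, since extending a sequence never changes its prefix.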




• SPADE(min_sup, D)
   // min_sup: minimum support
   // D: initial dataset
   F1 <- {frequent items or 1-sequences}
   F2 <- {frequent 2-sequences}
   E <- {equivalence classes [X]Θ1}
   for all [X] in E
     Enumerate_frequent_seq([X], min_sup)




• Enumerate_frequent_seq(S, min_sup)
   for all Ai in S
     Ti <- {}
     for all Aj in S, with j ≥ i
       R <- Ai v Aj   // join of the two id-lists
       if R satisfies min_sup
         Ti <- Ti U {R}
     end
     if depth-first search
       Enumerate_frequent_seq(Ti, min_sup)   // DFS
   end
   if breadth-first search
     for all non-empty Ti
       Enumerate_frequent_seq(Ti, min_sup)   // BFS
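
The pseudocode can be turned into a small runnable sketch. The Python below is an illustrative depth-first variant under several assumptions, not the author's or any library's implementation: items are single letters, the input rows are made-up example data, patterns are tuples of event strings such as `("D", "BF", "A")`, and the recursion starts from the class of all frequent items instead of computing F2 in a separate pass. It also loops over ordered pairs (Ai, Aj) and keeps only the results that extend Ai, which stands in for the `j ≥ i` loop.

```python
from collections import defaultdict

def support(id_list):
    """Number of distinct input sequences (sids) in an id-list."""
    return len({sid for sid, _ in id_list})

def temporal_join(first, second):
    """(sid, eid) pairs of `second` that follow an occurrence of `first`."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(s, e) for s, e in second if s in earliest and earliest[s] < e]

def equality_join(left, right):
    """(sid, eid) pairs present in both id-lists."""
    shared = set(right)
    return [pair for pair in left if pair in shared]

def is_event_atom(pattern):
    """True if the last item was added to an existing event (e.g. D->BF)."""
    return len(pattern[-1]) > 1

def enumerate_frequent_seq(atoms, min_sup, found):
    """atoms: (pattern, id_list) pairs that share a common prefix."""
    for pat_i, list_i in atoms:
        x = pat_i[-1][-1]
        next_class = []  # Ti: frequent patterns having pat_i as prefix
        for pat_j, list_j in atoms:
            y = pat_j[-1][-1]
            candidates = []
            # Add y to the last event of pat_i (equality join).
            if is_event_atom(pat_i) == is_event_atom(pat_j) and y > x:
                candidates.append((pat_i[:-1] + (pat_i[-1] + y,),
                                   equality_join(list_i, list_j)))
            # Append y as a new event after pat_i (temporal join).
            if not is_event_atom(pat_j):
                candidates.append((pat_i + (y,),
                                   temporal_join(list_i, list_j)))
            for pattern, id_list in candidates:
                if support(id_list) >= min_sup:
                    next_class.append((pattern, id_list))
                    found[pattern] = support(id_list)
        if next_class:
            enumerate_frequent_seq(next_class, min_sup, found)  # DFS

def spade(rows, min_sup):
    """rows: (sid, eid, items) tuples sorted by sid and eid."""
    id_lists = defaultdict(list)
    for sid, eid, items in rows:
        for item in items:
            id_lists[item].append((sid, eid))
    atoms = [((item,), id_list) for item, id_list in sorted(id_lists.items())
             if support(id_list) >= min_sup]
    found = {pattern: support(id_list) for pattern, id_list in atoms}
    enumerate_frequent_seq(atoms, min_sup, found)
    return found

# Illustrative input (assumption): one (sid, eid, items) row per event.
db = [
    (1, 10, "CD"), (1, 15, "ABC"), (1, 20, "ABF"), (1, 25, "ACDF"),
    (2, 15, "ABF"), (2, 20, "E"),
    (3, 10, "ABF"),
    (4, 10, "DGH"), (4, 20, "BF"), (4, 25, "AGH"),
]
found = spade(db, min_sup=2)
```

On this data the sketch finds 18 frequent sequences, including `("D", "BF", "A")` with support 2. A breadth-first variant would collect every `next_class` first and recurse only after the outer loop has finished.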


• The R Project for Statistical Computing
    ◦ A different implementation of the S language
    ◦ S was developed at Bell Laboratories (formerly AT&T,
      now Lucent Technologies) by John Chambers and colleagues
    ◦ arulesSequences package



