1. Mining Frequent Patterns
  - © Jiawei Han and Micheline Kamber
  - http://www-sal.cs.uiuc.edu/~hanj/bk2/
  - Chp 5, Chp 8.3
  - Modified by Donghui Zhang
2. Chapter 5: Mining Association Rules in Large Databases
  - Association rule mining
  - Algorithms Apriori and FP-Growth
  - Max and closed patterns
  - Mining various kinds of association/correlation rules
  - Constraint-based association mining
  - Sequential pattern mining
3. What Is Association Mining?
  - Association rule mining
    - First proposed by Agrawal, Imielinski and Swami [AIS93]
    - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, etc.
    - Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
  - Motivation: finding regularities in data
    - What products were often purchased together? Beer and diapers?!
    - What are the subsequent purchases after buying a PC?
    - What kinds of DNA are sensitive to this new drug?
    - Can we automatically classify web documents?
4. Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
  - Foundation for many essential data mining tasks
    - Association, correlation, causality
    - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
    - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
  - Broad applications
    - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
    - Web log (click stream) analysis, DNA sequence analysis, etc.
5. Basic Concepts: Frequent Patterns and Association Rules
  - Itemset X = {x1, ..., xk}
  - Find all the rules X ⇒ Y with minimum confidence and support
    - support, s: probability that a transaction contains X ∪ Y
    - confidence, c: conditional probability that a transaction having X also contains Y
  - Let min_support = 50%, min_conf = 50%:
    - A ⇒ C (50%, 66.7%)
    - C ⇒ A (50%, 100%)

  Transaction-id | Items bought
  10 | A, B, C
  20 | A, C
  30 | A, D
  40 | B, E, F

  (Venn diagram: customer buys beer, customer buys both, customer buys diaper)
6. Mining Association Rules—an Example
  - For rule A ⇒ C:
    - support = support({A} ∪ {C}) = 50%
    - confidence = support({A} ∪ {C}) / support({A}) = 66.7%
  - Min. support 50%, min. confidence 50%

  Transaction-id | Items bought
  10 | A, B, C
  20 | A, C
  30 | A, D
  40 | B, E, F

  Frequent pattern | Support
  {A} | 75%
  {B} | 50%
  {C} | 50%
  {A, C} | 50%
7. From Mining Association Rules to Mining Frequent Patterns (i.e., Frequent Itemsets)
  - Given a frequent itemset X, how do we find the association rules?
  - Examine every proper nonempty subset S of X.
  - Confidence(S ⇒ X − S) = support(X) / support(S)
  - Compare with min_conf.
  - An optimization is possible (refer to exercises 6.1, 6.2); a sketch of the basic step follows.
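A minimal Python sketch of this step (illustrative, not from the book); the supports below are the frequent itemsets of the slide-5/6 example, written as fractions:

```python
# Enumerate rules S => X-S from frequent itemsets and keep those whose
# confidence clears min_conf.
from itertools import combinations

def gen_rules(supports, min_conf):
    rules = []
    for X, sup_X in supports.items():
        for r in range(1, len(X)):                     # proper subsets S
            for S in map(frozenset, combinations(X, r)):
                conf = sup_X / supports[S]             # support(X)/support(S)
                if conf >= min_conf:
                    rules.append((set(S), set(X - S), sup_X, conf))
    return rules

supports = {frozenset("A"): 0.75, frozenset("B"): 0.50,
            frozenset("C"): 0.50, frozenset("AC"): 0.50}
for S, Y, sup, conf in gen_rules(supports, min_conf=0.5):
    print(S, "=>", Y, f"({sup:.0%}, {conf:.1%})")
# {'A'} => {'C'} (50%, 66.7%) and {'C'} => {'A'} (50%, 100.0%)
```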
8. Chapter 5: Mining Association Rules in Large Databases
  - Association rule mining
  - Algorithms Apriori and FP-Growth
  - Max and closed patterns
  - Mining various kinds of association/correlation rules
  - Constraint-based association mining
  - Sequential pattern mining
9. Apriori: A Candidate Generation-and-Test Approach
  - Any subset of a frequent itemset must be frequent
    - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    - Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
  - Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
  - Method:
    - Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
    - Test the candidates against the DB
  - Performance studies show its efficiency and scalability
  - Agrawal & Srikant 1994; Mannila et al. 1994
10. The Apriori Algorithm—An Example (min_sup = 2)

  Database TDB:
  Tid | Items
  10 | A, C, D
  20 | B, C, E
  30 | A, B, C, E
  40 | B, E

  1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
  L1: {A}:2, {B}:3, {C}:3, {E}:3
  C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
  L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
  C3: {B,C,E}
  3rd scan → L3: {B,C,E}:2
11. The Apriori Algorithm
  - Pseudo-code:
      Ck: candidate itemsets of size k
      Lk: frequent itemsets of size k

      L1 = {frequent items};
      for (k = 1; Lk != ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1 that are contained in t
          Lk+1 = candidates in Ck+1 with min_support
      end
      return ∪k Lk;
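A compact, runnable rendering of this level-wise loop (an illustrative sketch, not the book's code); the join-and-prune details of the next two slides are folded into gen(), and the slide-10 database is used so the output can be checked against C1/L1 through L3:

```python
from itertools import combinations

def apriori(db, min_sup):
    db = [frozenset(t) for t in db]

    def gen(Lk):                       # self-join + prune (slides 12-13)
        prev, out = set(Lk), []
        for p in Lk:
            for q in Lk:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(s in prev
                           for s in map(tuple, combinations(c, len(p)))):
                        out.append(c)
        return out

    Ck = [(i,) for i in sorted({i for t in db for i in t})]
    frequent = {}
    while Ck:
        counts = {c: sum(set(c) <= t for t in db) for c in Ck}
        Lk = sorted(c for c, n in counts.items() if n >= min_sup)
        frequent.update((c, counts[c]) for c in Lk)
        Ck = gen(Lk)
    return frequent

db = [("A", "C", "D"), ("B", "C", "E"), ("A", "B", "C", "E"), ("B", "E")]
for itemset, sup in apriori(db, min_sup=2).items():
    print(itemset, sup)                # reproduces L1, L2 and L3 above
```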
12. Important Details of Apriori
  - How to generate candidates?
    - Step 1: self-joining Lk
    - Step 2: pruning
  - How to count supports of candidates?
  - Example of candidate generation
    - L3 = {abc, abd, acd, ace, bcd}
    - Self-joining: L3 * L3
      - abcd from abc and abd
      - acde from acd and ace
    - Pruning:
      - acde is removed because ade is not in L3
    - C4 = {abcd}
13. How to Generate Candidates?
  - Suppose the items in Lk-1 are listed in an order
  - Step 1: self-joining Lk-1

      insert into Ck
      select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  - Step 2: pruning

      forall itemsets c in Ck do
          forall (k-1)-subsets s of c do
              if (s is not in Lk-1) then delete c from Ck
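The same join-and-prune step in Python (illustrative; it assumes each frequent (k-1)-itemset is stored as a sorted tuple) and reproducing the slide-12 example:

```python
from itertools import combinations

def apriori_gen(Lk_minus_1):
    prev = set(Lk_minus_1)
    k = len(Lk_minus_1[0]) + 1
    Ck = []
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # self-join: agree on the first k-2 items, p's last < q's last
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))
# [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3
```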
14. Efficient Implementation of Apriori in SQL
  - Hard to get good performance out of pure SQL (SQL-92) based approaches alone
  - Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
    - Get orders-of-magnitude improvement
  - S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98
15. Challenges of Frequent Pattern Mining
  - Challenges
    - Multiple scans of the transaction database
    - Huge number of candidates
    - Tedious workload of support counting for candidates
  - Improving Apriori: general ideas
    - Reduce passes of transaction-database scans
    - Shrink the number of candidates
    - Facilitate support counting of candidates
16. Dynamically Reduce the Number of Transactions
  - If a transaction does not contain any frequent k-itemset, it cannot contain any frequent (k+1)-itemset.
  - Remove such transactions from further consideration.
17. Partition: Scan Database Only Twice
  - Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
    - Scan 1: partition the database and find local frequent patterns
    - Scan 2: consolidate global frequent patterns
  - A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB'95
18. Sampling for Frequent Patterns
  - Select a sample of the original database, mine frequent patterns within the sample using Apriori
  - Scan the database once to verify frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
    - Example: check abcd instead of ab, ac, ..., etc.
  - Scan the database again to find missed frequent patterns
  - H. Toivonen. Sampling large databases for association rules. In VLDB'96
19. DHP: Reduce the Number of Candidates
  - A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
    - Candidates: a, b, c, d, e
    - Hash entries: {ab, ad, ae}, {bd, be, de}, ...
    - Frequent 1-itemsets: a, b, d, e
    - ab is not a candidate 2-itemset if the sum of counts of {ab, ad, ae} is below the support threshold
  - J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
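A sketch of the DHP idea with a made-up hash function and the slide-10 transactions; only the bucket-count test is shown, not the full algorithm:

```python
# While scanning for 1-itemsets, hash every 2-itemset of each transaction
# into a small bucket table; a pair can only become a candidate if its
# bucket count clears min_sup.
from itertools import combinations

NBUCKETS = 7
def bucket(pair):
    x, y = sorted(pair)
    return (ord(x) * 31 + ord(y)) % NBUCKETS     # hypothetical hash

db = [("a", "c", "d"), ("b", "c", "e"), ("a", "b", "c", "e"), ("b", "e")]
counts = [0] * NBUCKETS
for t in db:
    for pair in combinations(sorted(t), 2):
        counts[bucket(pair)] += 1

min_sup = 2
print(counts[bucket(("a", "b"))] >= min_sup)   # False: ab is pruned early
print(counts[bucket(("b", "c"))] >= min_sup)   # True: bc stays a candidate
```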
20. Bottleneck of Frequent-Pattern Mining
  - Multiple database scans are costly
  - Mining long patterns needs many passes of scanning and generates lots of candidates
    - To find the frequent itemset i1 i2 ... i100:
      - Number of scans: 100
      - Number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
  - Bottleneck: candidate generation-and-test
  - Can we avoid candidate generation?
21. Mining Frequent Patterns Without Candidate Generation
  - Grow long patterns from short ones using local frequent items
    - "a" is a frequent pattern
    - Get all transactions having "a": DB|a
    - "ab" is frequent in DB if and only if "b" is frequent in DB|a
22-23. Main Idea of FP-Growth
  - Let the frequent items in DB be: a, b, c, ...
  - Find frequent itemsets containing "a"
    - by checking DB|a
  - Find frequent itemsets containing "b" but not "a"
    - by checking (DB − {a})|b
  - Find frequent itemsets containing "c" but not "a" or "b"
    - by checking (DB − {a,b})|c
  - ...
24-29. Construct FP-tree from a Transaction Database (min_support = 3)

  TID | Items bought | (ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
  300 | {b, f, h, j, o, w} | {f, b}
  400 | {b, c, k, s, p} | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

  - Scan DB once, find frequent 1-itemsets (single-item patterns)
  - Sort frequent items in frequency-descending order: the f-list = f-c-a-b-m-p
  - Scan DB again, construct the FP-tree by inserting each (ordered) transaction:
    - After TID 100: a single path {} → f:1 → c:1 → a:1 → m:1 → p:1
    - After TID 200: shared prefix f:2 → c:2 → a:2, branching into m:1 → p:1 and b:1 → m:1
    - After TID 300: f:3, with a new child b:1 directly under f
    - After TID 400: a new branch under the root, c:1 → b:1 → p:1
    - After TID 500: final counts f:4, c:3, a:3, m:2, p:2 along the first path

  Final FP-tree (header table: f:4, c:4, a:3, b:3, m:3, p:3, each entry heading a node-link chain):

  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1
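The two scans above can be sketched in a few dozen lines of Python. This is an illustrative implementation, not the authors' code; ties in the f-list are broken alphabetically here, so the list comes out c-f-a-b-m-p rather than the slides' f-c-a-b-m-p (any fixed frequency-descending order is valid):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                     # item -> FPNode

def build_fptree(transactions, min_support):
    # Scan 1: count item frequencies and build the f-list.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    # Scan 2: drop infrequent items, order the rest by the f-list, insert.
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                header[item].append(child)     # node-link chain
            child.count += 1
            node = child
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(db, min_support=3)
print(flist)                                   # ['c', 'f', 'a', 'b', 'm', 'p']
```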
30. Benefits of the FP-tree Structure
  - Completeness
    - Preserves complete information for frequent-pattern mining
    - Never breaks a long pattern of any transaction
  - Compactness
    - Reduces irrelevant info: infrequent items are gone
    - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
    - Never larger than the original database (not counting node-links and the count fields)
    - For the Connect-4 DB, the compression ratio can be over 100
31. Partition Patterns and Databases
  - Frequent patterns can be partitioned into subsets according to the f-list
    - F-list = f-c-a-b-m-p
    - Patterns containing p
    - Patterns having m but no p
    - ...
    - Patterns having c but none of a, b, m, p
    - Pattern f
  - This partitioning is complete and non-redundant
32. Find Patterns Having P From P-Conditional Database
  - Start at the frequent-item header table in the FP-tree
  - Traverse the FP-tree by following the node-links of each frequent item p
  - Accumulate all transformed prefix paths of item p to form p's conditional pattern base

  Conditional pattern bases (from the FP-tree of slide 29):
  Item | Conditional pattern base
  c | f:3
  a | fc:3
  b | fca:1, f:1, c:1
  m | fca:2, fcab:1
  p | fcam:2, cb:1
33. From Conditional Pattern-Bases to Conditional FP-trees
  - For each pattern base:
    - Accumulate the count for each item in the base
    - Construct the FP-tree for the frequent items of the pattern base
  - E.g., consider the conditional pattern base for p: {fcam:2, cb:1}
    - Only c is frequent (count 3), so the p-conditional FP-tree is the single node c:3
    - All frequent patterns relating to p: p, cp
34. From Conditional Pattern-Bases to Conditional FP-trees
  - For each pattern base:
    - Accumulate the count for each item in the base
    - Construct the FP-tree for the frequent items of the pattern base
  - m-conditional pattern base: fca:2, fcab:1
    - f, c, a are all frequent (count 3); b (count 1) is dropped
    - m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3
    - All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
35. Some Pitfalls When Applying FP-Growth
  - Failing to remove infrequent items.
  - Using a relative min-support when finding frequent itemsets recursively (the absolute support count should be carried into the conditional databases).
  - Using the wrong count for individual patterns in the conditional pattern base.
  - Forgetting to report the single frequent items as part of "all frequent itemsets".
36. Mining Frequent Patterns With FP-trees
  - Idea: frequent-pattern growth
    - Recursively grow frequent patterns by pattern and database partition
  - Method
    - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
    - Repeat the process on each newly created conditional FP-tree
    - Until the resulting FP-tree is empty, or it contains only one path: a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
37. Algorithm FP_growth(tree, α)

      // initially, call FP_growth(FP-tree, null)
      if tree contains a single path P then
          for each combination β of nodes in P do
              generate pattern β ∪ α with support = minimum support of nodes in β
      else for each ai in the header of tree do
          generate pattern β = ai ∪ α with support = ai.support
          construct β's conditional pattern base and β's conditional FP-tree treeβ
          if treeβ ≠ ∅ then call FP_growth(treeβ, β)
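A compact illustration of the recursion, working directly on conditional pattern bases (lists of (prefix, count) pairs) rather than physical FP-trees; this simplification keeps the sketch self-contained but follows the same growth logic:

```python
from collections import defaultdict

def fp_growth(pattern_base, alpha, min_support, results):
    counts = defaultdict(int)                 # local item frequencies
    for items, count in pattern_base:
        for item in items:
            counts[item] += count
    for item, support in counts.items():
        if support < min_support:             # absolute count, not relative
            continue
        beta = alpha | {item}
        results[frozenset(beta)] = support
        # Project on 'item': keep the part of each prefix above it.
        projected = [(items[:items.index(item)], count)
                     for items, count in pattern_base if item in items]
        projected = [(p, c) for p, c in projected if p]
        if projected:
            fp_growth(projected, beta, min_support, results)

# p's conditional pattern base from slide 32: {fcam:2, cb:1}
base_p = [(list("fcam"), 2), (list("cb"), 1)]
results = {frozenset({"p"}): 3}    # report the item itself (cf. slide 35)
fp_growth(base_p, {"p"}, min_support=3, results=results)
print(results)                     # adds frozenset({'c', 'p'}): 3, i.e. cp:3
```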
38. A Special Case: Single Prefix Path in FP-tree
  - Suppose a (conditional) FP-tree T has a shared single prefix path P
  - Mining can be decomposed into two parts
    - Reduction of the single prefix path into one node
    - Concatenation of the mining results of the two parts
  (Figure: a tree whose prefix path {} → a1:n1 → a2:n2 → a3:n3 branches into b1:m1, C1:k1, C2:k2, C3:k3 is split into the prefix path reduced to node r1, plus the multi-path part rooted at r1 with branches b1:m1, C1:k1, C2:k2, C3:k3)
39. FP-Growth vs. Apriori: Scalability With the Support Threshold
  (Figure: run time vs. support threshold on data set T25I20D10K)
40. Why Is FP-Growth the Winner?
  - Divide-and-conquer:
    - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
    - Leads to focused search of smaller databases
  - Other factors
    - No candidate generation, no candidate test
    - Compressed database: the FP-tree structure
    - No repeated scan of the entire database
    - Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
41. Chapter 5: Mining Association Rules in Large Databases
  - Association rule mining
  - Algorithms Apriori and FP-Growth
  - Max and closed patterns
  - Mining various kinds of association/correlation rules
  - Constraint-based association mining
  - Sequential pattern mining
42. Max-Patterns and Closed Patterns
  - If there are frequent patterns with many items, enumerating all of them is costly.
  - We may be interested in finding only the 'boundary' frequent patterns.
  - Two types follow.
43. Max-Patterns
  - A frequent pattern {a1, ..., a100} implies C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 frequent sub-patterns!
  - Max-pattern: a frequent pattern without any proper frequent super-pattern
    - BCDE and ACD are max-patterns
    - BCD is not a max-pattern

  Min_sup = 2
  Tid | Items
  10 | A, B, C, D, E
  20 | B, C, D, E
  30 | A, C, D, F
44. MaxMiner: Mining Max-Patterns
  - R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98
  - Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable:

  ∅ (ABCD)
  ├── A (BCD)
  │   ├── AB (CD)
  │   │   ├── ABC (D)
  │   │   │   └── ABCD ()
  │   │   └── ABD ()
  │   ├── AC (D)
  │   │   └── ACD ()
  │   └── AD ()
  ├── B (CD)
  │   ├── BC (D)
  │   │   └── BCD ()
  │   └── BD ()
  ├── C (D)
  │   └── CD ()
  └── D ()
45. Local Pruning Techniques (e.g., at node A)
  - Check the frequency of ABCD and of AB, AC, AD.
  - If ABCD is frequent, prune the whole sub-tree below A.
  - If AC is NOT frequent, remove C from the parenthesis (the tail) before expanding.
  (Set-enumeration tree as on the previous slide)
46. Algorithm MaxMiner
  - Initially, generate one node N = ∅ (ABCD), where h(N) = ∅ and t(N) = {A,B,C,D}.
  - Consider expanding N:
    - If h(N) ∪ t(N) is frequent, do not expand N.
    - If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
  - Apply global pruning techniques (next slide).
47. Global Pruning Technique (Across Sub-Trees)
  - When a max-pattern is identified (e.g., ABCD), prune all nodes N (e.g., B, C and D) where h(N) ∪ t(N) is a subset of ABCD.
  (Set-enumeration tree as on slide 44)
48. Example: Root Node ∅ (ABCDEF), Min_sup = 2

  Tid | Items
  10 | A, B, C, D, E
  20 | B, C, D, E
  30 | A, C, D, F

  Counted items | Frequency
  ABCDEF | 0
  A | 2
  B | 2
  C | 3
  D | 3
  E | 2
  F | 1

  ABCDEF is infrequent and F is infrequent, so the root expands to:
  A (BCDE), B (CDE), C (DE), D (E), E ()
  Max patterns so far: (none)
49. Example: Node A (BCDE)

  Counted items | Frequency
  ABCDE | 1
  AB | 1
  AC | 2
  AD | 2
  AE | 1

  ABCDE is infrequent; AB and AE are infrequent, so B and E are removed from the tail.
  Node A expands to: AC (D), AD ()
  Max patterns so far: (none)
50. Example: Node B (CDE)

  Counted items | Frequency
  BCDE | 2
  BC, BD, BE | (need not be counted)

  h(N) ∪ t(N) = BCDE is frequent, so node B is not expanded.
  Max patterns so far: BCDE
51. Example: Node AC (D)

  Counted items | Frequency
  ACD | 2

  h(N) ∪ t(N) = ACD is frequent, so node AC is not expanded; ACD is a max pattern.
  (Global pruning removes the remaining nodes AD, C, D, E, whose h(N) ∪ t(N) is contained in BCDE or ACD.)
  Max patterns found: BCDE, ACD
52. Frequent Closed Patterns
  - For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern
    - "ab" is a frequent closed pattern in the example below
  - A concise representation of frequent patterns
  - Reduces the number of patterns and rules
  - N. Pasquier et al. In ICDT'99

  Min_sup = 2
  TID | Items
  10 | a, b, c
  20 | a, b, c
  30 | a, b, d
  40 | a, b, d
  50 | e, f
53. Max Pattern vs. Frequent Closed Pattern
  - max pattern ⇒ closed pattern
    - If itemset X is a max pattern, adding any item to it would not yield a frequent pattern; thus there exists no item y such that every transaction containing X also contains y.
  - closed pattern ⇏ max pattern
    - "ab" is a closed pattern, but not max (abc and abd are still frequent)

  Min_sup = 2
  TID | Items
  10 | a, b, c
  20 | a, b, c
  30 | a, b, d
  40 | a, b, d
  50 | e, f
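Both definitions can be checked mechanically once all frequent itemsets and their supports are known. A brute-force sketch (illustrative only; the supports are hand-counted from this slide's TDB with min_sup = 2):

```python
# Closed: no proper superset has the same support.
# Maximal: no proper superset is frequent at all (freq holds only frequent sets).
def classify(freq):                     # freq: frozenset -> support
    closed, maximal = [], []
    for X, s in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] < s for Y in supersets):
            closed.append(X)
        if not supersets:
            maximal.append(X)
    return closed, maximal

freq = {frozenset("a"): 4, frozenset("b"): 4, frozenset("ab"): 4,
        frozenset("c"): 2, frozenset("d"): 2,
        frozenset("ac"): 2, frozenset("bc"): 2, frozenset("abc"): 2,
        frozenset("ad"): 2, frozenset("bd"): 2, frozenset("abd"): 2}
closed, maximal = classify(freq)
print(sorted("".join(sorted(X)) for X in closed))    # ['ab', 'abc', 'abd']
print(sorted("".join(sorted(X)) for X in maximal))   # ['abc', 'abd']
```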
54. Mining Frequent Closed Patterns: CLOSET
  - F-list: list of all frequent items in support-ascending order
    - F-list: d-a-f-e-c
  - Divide the search space
    - Patterns having d
    - Patterns having a but not d, etc.
  - Find frequent closed patterns recursively
    - Among the transactions having d, cfa is frequent and closed, so cfad is a frequent closed pattern
  - J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

  Min_sup = 2
  TID | Items
  10 | a, c, d, e, f
  20 | a, b, e
  30 | c, e, f
  40 | a, c, d, f
  50 | c, e, f
55. Chapter 5: Mining Association Rules in Large Databases
  - Association rule mining
  - Algorithms Apriori and FP-Growth
  - Max and closed patterns
  - Mining various kinds of association/correlation rules
  - Constraint-based association mining
  - Sequential pattern mining
56. Mining Various Kinds of Rules or Regularities
  - Multi-level rules
  - Multi-dimensional rules
  - Correlation analysis
57. Multiple-Level Association Rules
  - Items often form hierarchies
  - Flexible support settings: items at lower levels are expected to have lower support
  - The transaction database can be encoded based on dimensions and levels
  - Explore shared multi-level mining

  Uniform support:
    Level 1 (min_sup = 5%): Milk [support = 10%]
    Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
  Reduced support:
    Level 1 (min_sup = 5%)
    Level 2 (min_sup = 3%)
58. ML Associations with Flexible Support Constraints
  - Why flexible support constraints?
    - Real-life occurrence frequencies vary greatly
      - Diamonds, watches, and pens in a shopping basket
    - Uniform support may not be an interesting model
  - A flexible model
    - The lower the level, the more dimension combinations, and the longer the pattern, usually the smaller the support
    - General rules should be easy to specify and understand
    - Special items and special groups of items may be specified individually and have higher priority
59. Multi-Level Mining: Progressive Deepening
  - A top-down, progressive deepening approach:
    - First mine high-level frequent items: milk (15%), bread (10%)
    - Then mine their lower-level "weaker" frequent itemsets: reduced-fat milk (5%), wheat bread (4%)
  - Different min_support thresholds across levels lead to different algorithms:
    - If adopting the same min_support across all levels: toss item t if any of t's ancestors is infrequent (see the sketch below).
    - If adopting a reduced min_support at lower levels: even if a high-level item is not frequent, we may need to examine its descendants. Use a level-passage threshold to control this.
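A sketch of the uniform-support test ("toss t if any ancestor is infrequent"); the hierarchy and support values here are hypothetical, loosely following the numbers on slide 57:

```python
parent = {"2% milk": "milk", "skim milk": "milk", "wheat bread": "bread"}
support = {"milk": 0.10, "2% milk": 0.06, "skim milk": 0.04,
           "bread": 0.10, "wheat bread": 0.04}

def frequent_with_ancestors(item, min_sup):
    # An item survives only if it and every ancestor clear min_sup.
    while item is not None:
        if support.get(item, 0.0) < min_sup:
            return False
        item = parent.get(item)
    return True

print(frequent_with_ancestors("2% milk", 0.05))    # True
print(frequent_with_ancestors("skim milk", 0.05))  # False: tossed at level 2
```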
60. Multi-Level Association: Redundancy Filtering
  - Some rules may be redundant due to "ancestor" relationships between items.
  - Example
    - milk ⇒ wheat bread [support = 8%, confidence = 70%]
    - reduced-fat milk ⇒ wheat bread [support = 2%, confidence = 72%]
  - We say the first rule is an ancestor of the second rule.
  - A rule is redundant if its support and confidence are close to the "expected" values based on the rule's ancestor.
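A numeric check of the example, under the assumption (not stated on the slide) that reduced-fat milk makes up about a quarter of milk sales:

```python
ancestor_support = 0.08      # milk => wheat bread
child_share = 0.25           # assumed share of milk that is reduced-fat
expected_support = ancestor_support * child_share
print(expected_support)      # 0.02 == the observed 2%: the rule is redundant
```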
61. Multi-Dimensional Association
  - Single-dimensional rules:
      buys(X, "milk") ⇒ buys(X, "bread")
  - Multi-dimensional rules: ≥ 2 dimensions or predicates
    - Inter-dimension association rules (no repeated predicates):
        age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
    - Hybrid-dimension association rules (repeated predicates):
        age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
  - No details.
62-64. Interestingness Measure: Correlations (Lift)
  - Compute the support and confidence:
    - play basketball ⇒ eat cereal [40%, 66.7%]
    - play basketball ⇒ not eat cereal [20%, 33.3%]
  - Dilemma: the probability that a student eats cereal is 75%, but among students playing basketball the probability of eating cereal (66.7%) is below average. So the second rule is the more accurate one.

                | Basketball | Not basketball | Sum (row)
  Cereal        | 40         | 35             | 75
  Not cereal    | 20         | 5              | 25
  Sum (col.)    | 60         | 40             | 100
65. Correlation Interpretation
  - corr(A,B) = P(A ∪ B) / (P(A) · P(B)), where P(A ∪ B) is the probability that a transaction contains both A and B
  - corr(A,B) = 1 ⇒ A and B are independent.
  - corr(A,B) > 1 ⇒ A and B are positively correlated: the occurrence of one encourages the other.
  - corr(A,B) < 1 ⇒ A and B are negatively correlated: the occurrence of one discourages the other.
66. Correlation Example
  - For instance, A = 'buy milk', B = 'buy bread'.
  - Among 100 persons, 30 buy milk and 30 buy bread, i.e., P(A) = P(B) = 30%.
  - If A and B were independent, about 9 persons should buy both milk and bread; that is, among those who buy milk, the probability of also buying bread stays at 30%.
  - Here, P(A ∪ B) = P(A) · P(B|A) is the probability that a person buys both milk and bread.
  - If 20 persons buy both milk and bread, then corr(A,B) = 0.20 / (0.30 × 0.30) ≈ 2.2 > 1: the occurrence of one encourages the other.
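The same arithmetic in a few runnable lines, covering both this milk/bread example and the basketball/cereal table from slide 62:

```python
P_A, P_B = 0.30, 0.30            # buy milk, buy bread
expected_both = P_A * P_B        # 0.09 -> 9 of 100 persons if independent
observed_both = 0.20             # 20 of 100 persons buy both
print(round(observed_both / expected_both, 2))  # 2.22 > 1: positive corr.

# Basketball/cereal: P(B) = 0.60, P(C) = 0.75, P(B and C) = 0.40
print(round(0.40 / (0.60 * 0.75), 2))           # 0.89 < 1: negative corr.
```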
67. Chapter 5: Mining Association Rules in Large Databases
  - Association rule mining
  - Algorithms Apriori and FP-Growth
  - Max and closed patterns
  - Mining various kinds of association/correlation rules
  - Constraint-based association mining
  - Sequential pattern mining
68. Constraint-Based Data Mining
  - Finding all the patterns in a database autonomously? Unrealistic!
    - The patterns could be too many but not focused!
  - Data mining should be an interactive process
    - The user directs what is to be mined using a data mining query language (or a graphical user interface)
  - Constraint-based mining
    - User flexibility: provides constraints on what is to be mined
    - System optimization: explores such constraints for efficient mining
69. Constraints in Data Mining
  - Knowledge type constraint:
    - classification, association, etc.
  - Data constraint (using SQL-like queries):
    - find product pairs sold together in stores in Vancouver in Dec. '00
  - Dimension/level constraint:
    - in relevance to region, price, brand, customer category
  - Rule (or pattern) constraint:
    - small sales (price < $10) triggers big sales (sum > $200)
  - Interestingness constraint:
    - strong rules: min_support ≥ 3%, min_confidence ≥ 60%
70. Constrained Mining vs. Constraint-Based Search
  - Constrained mining vs. constraint-based search/reasoning
    - Both aim at reducing the search space
    - Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
    - Constraint-pushing vs. heuristic search
    - How to integrate the two is an interesting research problem
  - Constrained mining vs. query processing in DBMS
    - Database query processing requires finding all answers
    - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing
71. Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
  - Given a frequent-pattern mining query with a set of constraints C, the algorithm should be
    - sound: it only finds frequent sets that satisfy the given constraints C
    - complete: all frequent sets satisfying the given constraints C are found
  - A naïve solution
    - First find all frequent sets, and then test them for constraint satisfaction
  - More efficient approaches:
    - Analyze the properties of the constraints comprehensively
    - Push them as deeply as possible inside the frequent-pattern computation
72. Anti-Monotonicity in Constraint-Based Mining
  - Anti-monotonicity
    - When an itemset S violates the constraint, so does any of its supersets
    - sum(S.price) ≤ v is anti-monotone
    - sum(S.price) ≥ v is not anti-monotone
  - Example: C: range(S.profit) ≤ 15 is anti-monotone
    - Itemset ab violates C (range = 40 − 0 = 40 > 15)
    - So does every superset of ab

  TDB (min_sup = 2)
  TID | Transaction
  10 | a, b, c, d, f
  20 | b, c, d, f, g, h
  30 | a, c, d, e, f
  40 | c, e, f, g

  Item | Profit
  a | 40
  b | 0
  c | -20
  d | 10
  e | -30
  f | 30
  g | 20
  h | -10
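A sketch of how such a constraint is pushed into Apriori: a candidate that already violates it is discarded before support counting, since no superset can recover. Item prices here are hypothetical, echoing the sum(S.price) constraint of slides 79-80:

```python
price = {"a": 4, "b": 3, "c": 2, "d": 1}

def violates(itemset, v=5):
    # Anti-monotone constraint sum(S.price) <= v (all prices non-negative).
    return sum(price[i] for i in itemset) > v

candidates = [("a", "b"), ("a", "c"), ("a", "d"), ("c", "d")]
survivors = [c for c in candidates if not violates(c)]
print(survivors)   # [('a', 'd'), ('c', 'd')]: ab and ac are never counted
```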
73. Which Constraints Are Anti-Monotone?

  Constraint | Anti-monotone
  v ∈ S | no
  S ⊇ V | no
  S ⊆ V | yes
  min(S) ≤ v | no
  min(S) ≥ v | yes
  max(S) ≤ v | yes
  max(S) ≥ v | no
  count(S) ≤ v | yes
  count(S) ≥ v | no
  sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes
  sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no
  range(S) ≤ v | yes
  range(S) ≥ v | no
  avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
  support(S) ≥ ξ | yes
  support(S) ≤ ξ | no
74. Monotonicity in Constraint-Based Mining
  - Monotonicity
    - When an itemset S satisfies the constraint, so does any of its supersets
    - sum(S.price) ≥ v is monotone
    - min(S.price) ≤ v is monotone
  - Example: C: range(S.profit) ≥ 15
    - Itemset ab satisfies C
    - So does every superset of ab

  (TDB and profit table as on slide 72)
75. Which Constraints Are Monotone?

  Constraint | Monotone
  v ∈ S | yes
  S ⊇ V | yes
  S ⊆ V | no
  min(S) ≤ v | yes
  min(S) ≥ v | no
  max(S) ≤ v | no
  max(S) ≥ v | yes
  count(S) ≤ v | no
  count(S) ≥ v | yes
  sum(S) ≤ v (∀a ∈ S, a ≥ 0) | no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0) | yes
  range(S) ≤ v | no
  range(S) ≥ v | yes
  avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
  support(S) ≥ ξ | no
  support(S) ≤ ξ | yes
76. Succinctness
  - Succinctness:
    - Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
    - Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
    - min(S.price) ≤ v is succinct
    - sum(S.price) ≥ v is not succinct
  - Optimization: if C is succinct, C is pre-counting prunable
77. Which Constraints Are Succinct?

  Constraint | Succinct
  v ∈ S | yes
  S ⊇ V | yes
  S ⊆ V | yes
  min(S) ≤ v | yes
  min(S) ≥ v | yes
  max(S) ≤ v | yes
  max(S) ≥ v | yes
  count(S) ≤ v | weakly
  count(S) ≥ v | weakly
  sum(S) ≤ v (∀a ∈ S, a ≥ 0) | no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no
  range(S) ≤ v | no
  range(S) ≥ v | no
  avg(S) θ v, θ ∈ {=, ≤, ≥} | no
  support(S) ≥ ξ | no
  support(S) ≤ ξ | no
78. The Apriori Algorithm — Example
  (Figure: the level-wise Apriori flow over Database D: scan D for C1 and L1, join to C2, scan D for L2, join to C3, scan D for L3; cf. slide 10)
79. Naïve Algorithm: Apriori + Constraint
  (Same flow as slide 78, with the constraint sum(S.price) < 5 tested only on the final frequent itemsets)
80. The Constrained Apriori Algorithm: Push an Anti-Monotone Constraint Deep
  (Same flow as slide 78, with candidates violating sum(S.price) < 5 pruned during mining)
81. The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
  (Same flow as slide 78, with the succinct constraint min(S.price) ≤ 1 used to restrict candidate generation from the start)
82. Converting "Tough" Constraints
  - Convert tough constraints into anti-monotone or monotone ones by properly ordering items
  - Examine C: avg(S.profit) ≥ 25
    - Order items in value-descending order: <a, f, g, d, b, h, c, e>
    - If an itemset afb violates C, so do afbh, afb*
    - The constraint becomes anti-monotone!

  (TDB and profit table as on slide 72)
83. Convertible Constraints
  - Let R be an order of items
  - Convertible anti-monotone
    - If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R
    - Example: avg(S) ≥ v w.r.t. item-value-descending order
  - Convertible monotone
    - If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R
    - Example: avg(S) ≤ v w.r.t. item-value-descending order
84. Strongly Convertible Constraints
  - avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
    - If an itemset af violates the constraint, so does every itemset with af as a prefix, such as afd
  - avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
    - If an itemset (e.g., d) satisfies the constraint, so does every itemset having it as a prefix w.r.t. R⁻¹, such as df and dfa
  - Thus, avg(X) ≥ 25 is strongly convertible

  Item | Profit
  a | 40
  b | 0
  c | -20
  d | 10
  e | -30
  f | 30
  g | 20
  h | -10
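A quick numeric illustration of why the descending order works, using this slide's profit table: the running average over ever-longer prefixes only decreases, so once a prefix violates avg ≥ 25, every extension of it does too:

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}
order = sorted(profit, key=profit.get, reverse=True)   # a,f,g,d,b,h,c,e

def avg(items):
    return sum(profit[i] for i in items) / len(items)

averages = [avg(order[:k]) for k in range(1, len(order) + 1)]
print(averages)   # [40.0, 35.0, 30.0, 25.0, 20.0, 15.0, 10.0, 5.0]
```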
85. What Constraints Are Convertible?

  Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
  avg(S) ≤ v, ≥ v | Yes | Yes | Yes
  median(S) ≤ v, ≥ v | Yes | Yes | Yes
  sum(S) ≤ v (items of any value, v ≥ 0) | Yes | No | No
  sum(S) ≤ v (items of any value, v ≤ 0) | No | Yes | No
  sum(S) ≥ v (items of any value, v ≥ 0) | No | Yes | No
  sum(S) ≥ v (items of any value, v ≤ 0) | Yes | No | No
  ...
86. Combining Them Together—A General Picture

  Constraint | Anti-monotone | Monotone | Succinct
  v ∈ S | no | yes | yes
  S ⊇ V | no | yes | yes
  S ⊆ V | yes | no | yes
  min(S) ≤ v | no | yes | yes
  min(S) ≥ v | yes | no | yes
  max(S) ≤ v | yes | no | yes
  max(S) ≥ v | no | yes | yes
  count(S) ≤ v | yes | no | weakly
  count(S) ≥ v | no | yes | weakly
  sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
  range(S) ≤ v | yes | no | no
  range(S) ≥ v | no | yes | no
  avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
  support(S) ≥ ξ | yes | no | no
  support(S) ≤ ξ | no | yes | no
87. Classification of Constraints
  (Diagram: the anti-monotone, monotone, and succinct constraint classes overlap; convertible anti-monotone and convertible monotone intersect in the strongly convertible class; constraints outside all classes are inconvertible)
88. Chapter 5 (in fact 8.3): Mining Association Rules in Large Databases
  - Association rule mining
  - Algorithms Apriori and FP-Growth
  - Max and closed patterns
  - Mining various kinds of association/correlation rules
  - Constraint-based association mining
  - Sequential pattern mining
89. Sequence Databases and Sequential Pattern Analysis
  - Transaction databases, time-series databases vs. sequence databases
  - Frequent patterns vs. (frequent) sequential patterns
  - Applications of sequential pattern mining
    - Customer shopping sequences:
      - First buy a computer, then a CD-ROM, and then a digital camera, within 3 months
    - Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
    - Telephone calling patterns, Web log click streams
    - DNA sequences and gene structures
  90. 90. A Sequence and a Subsequence <ul><li>A sequence is an ordered list of itemsets. Each itemset itself is unordered. </li></ul><ul><li>E.g. < C (MP) (ST) > is a sequence. The corresponding record means that a person bought a computer, then a monitor and a printer, and then a scanner and a table. </li></ul><ul><li>A subsequence is < P S >. Here {P} ⊆ {M,P} and {S} ⊆ {S,T}. </li></ul><ul><li>< C (ST) > is also a subsequence. </li></ul><ul><li>< S P > is not a subsequence, since its elements occur out of order. </li></ul>
  91. 91. What Is Sequential Pattern Mining? <ul><li>Given a set of sequences, find the complete set of frequent subsequences (a support-counting sketch follows the table) </li></ul>A sequence database:
SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
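A minimal sketch of containment and support counting over this database (the helper names are assumptions, not from the slides); it confirms that <(ab)c> reaches min_sup = 2.

```python
# A minimal sketch; db is transcribed from the table above.
def is_subsequence(sub, seq):
    """Greedy left-to-right matching: each itemset of `sub` must be a
    subset of a later itemset of `seq`, preserving order."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:   # itemset containment
            i += 1
    return i == len(sub)

db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],   # SID 10
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],               # SID 20
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],        # SID 30
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],           # SID 40
]

def support(pattern):
    return sum(is_subsequence(pattern, seq) for seq in db)

print(support([{'a', 'b'}, {'c'}]))   # <(ab)c> -> 2 (SIDs 10 and 30)
```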
  92. 92. Challenges on Sequential Pattern Mining <ul><li>A huge number of possible sequential patterns are hidden in databases </li></ul><ul><li>A mining algorithm should </li></ul><ul><ul><li>find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold </li></ul></ul><ul><ul><li>be highly efficient and scalable, involving only a small number of database scans </li></ul></ul><ul><ul><li>be able to incorporate various kinds of user-specific constraints </li></ul></ul>
  93. 93. Studies on Sequential Pattern Mining <ul><li>Concept introduction and an initial Apriori-like algorithm </li></ul><ul><ul><li>R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95 </li></ul></ul><ul><li>GSP—an Apriori-based, influential mining method (developed at IBM Almaden) </li></ul><ul><ul><li>R. Srikant & R. Agrawal. “Mining sequential patterns: Generalizations and performance improvements,” EDBT’96 </li></ul></ul><ul><li>From sequential patterns to episodes (Apriori-like + constraints) </li></ul><ul><ul><li>H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent episodes in event sequences,” Data Mining and Knowledge Discovery, 1997 </li></ul></ul><ul><li>Mining sequential patterns with constraints </li></ul><ul><ul><li>M.N. Garofalakis, R. Rastogi & K. Shim. “SPIRIT: Sequential pattern mining with regular expression constraints,” VLDB’99 </li></ul></ul>
  94. 94. A Basic Property of Sequential Patterns: Apriori <ul><li>A basic property: Apriori (Agrawal & Srikant’94) </li></ul><ul><ul><li>If a sequence S is not frequent </li></ul></ul><ul><ul><li>Then none of the super-sequences of S is frequent </li></ul></ul><ul><ul><li>E.g., if <hb> is infrequent, so are <hab> and <(ah)b> </li></ul></ul>Given support threshold min_sup = 2:
Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>
  95. 95. GSP—A Generalized Sequential Pattern Mining Algorithm <ul><li>GSP (Generalized Sequential Pattern) mining algorithm </li></ul><ul><ul><li>proposed by Srikant and Agrawal, EDBT’96 </li></ul></ul><ul><li>Outline of the method </li></ul><ul><ul><li>Initially, every item in DB is a candidate of length 1 </li></ul></ul><ul><ul><li>for each level (i.e., sequences of length k) do </li></ul></ul><ul><ul><ul><li>scan database to collect support count for each candidate sequence </li></ul></ul></ul><ul><ul><ul><li>generate candidate length-(k+1) sequences from length-k frequent sequences using the Apriori property </li></ul></ul></ul><ul><ul><li>repeat until no frequent sequence or no candidate can be found </li></ul></ul><ul><li>Major strength: candidate pruning by the Apriori property </li></ul>
  96. 96. Finding Length-1 Sequential Patterns <ul><li>Examine GSP using an example </li></ul><ul><li>Initial candidates: all singleton sequences </li></ul><ul><ul><li><a>, <b>, <c>, <d>, <e>, <f>, <g>, <h> </li></ul></ul><ul><li>Scan database once, count support for candidates (a verification snippet follows) </li></ul>min_sup = 2
Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>
Cand | <a> | <b> | <c> | <d> | <e> | <f> | <g> | <h>
Sup  |  3  |  5  |  4  |  3  |  3  |  2  |  1  |  1
With min_sup = 2, <g> and <h> are pruned, leaving 6 length-1 sequential patterns.
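A quick check of these counts, with the database transcribed as plain strings (illustration only): the support of a single item is just the number of sequences containing it anywhere.

```python
# A minimal verification sketch; db is transcribed from the table above.
db = ['(bd)cb(ac)', '(bf)(ce)b(fg)', '(ah)(bf)abf', '(be)(ce)d', 'a(bd)bcb(ade)']
for item in 'abcdefgh':
    print(item, sum(item in seq for seq in db))
# prints: a 3, b 5, c 4, d 3, e 3, f 2, g 1, h 1
```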
  97. 97. Generating Length-2 Candidates
51 length-2 candidates. Without the Apriori property, 8*8 + 8*7/2 = 92 candidates would be needed; with it, only the 6 frequent length-1 patterns are joined: 6*6 = 36 candidates of the form <xy> (x may equal y) plus 6*5/2 = 15 of the form <(xy)>, giving 51. Apriori prunes 44.57% of the candidates (see the enumeration sketch below).
Candidates of the form <xy>, x, y ∈ {a, …, f}:
     <a>  <b>  <c>  <d>  <e>  <f>
<a>  <aa> <ab> <ac> <ad> <ae> <af>
<b>  <ba> <bb> <bc> <bd> <be> <bf>
<c>  <ca> <cb> <cc> <cd> <ce> <cf>
<d>  <da> <db> <dc> <dd> <de> <df>
<e>  <ea> <eb> <ec> <ed> <ee> <ef>
<f>  <fa> <fb> <fc> <fd> <fe> <ff>
Candidates of the form <(xy)>, x < y:
     <b>     <c>     <d>     <e>     <f>
<a>  <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>          <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                  <(cd)>  <(ce)>  <(cf)>
<d>                          <(de)>  <(df)>
<e>                                  <(ef)>
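A small sketch enumerating these candidates (assumed helper code, using Python's itertools):

```python
# Enumerate the length-2 candidates from the six length-1 patterns.
from itertools import combinations, product

L1 = ['a', 'b', 'c', 'd', 'e', 'f']   # the six length-1 sequential patterns

# <xy>: x and y in separate elements; x may equal y  ->  6*6 = 36
seq_cands = [f'<{x}{y}>' for x, y in product(L1, repeat=2)]
# <(xy)>: x and y together in one element; unordered, x != y  ->  6*5/2 = 15
elem_cands = [f'<({x}{y})>' for x, y in combinations(L1, 2)]

print(len(seq_cands) + len(elem_cands))   # 51
print(8 * 8 + 8 * 7 // 2)                 # 92 without length-1 pruning
print(1 - 51 / 92)                        # ~0.4457, i.e. 44.57% pruned
```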
  98. 98. Finding Length-2 Sequential Patterns <ul><li>Scan database one more time, collect support count for each length-2 candidate </li></ul><ul><li>There are 19 length-2 candidates which pass the minimum support threshold </li></ul><ul><ul><li>They are length-2 sequential patterns </li></ul></ul>
  99. 99. Generating Length-3 Candidates and Finding Length-3 Patterns <ul><li>Generate Length-3 Candidates </li></ul><ul><ul><li>Self-join length-2 sequential patterns </li></ul></ul><ul><ul><ul><li>Based on the Apriori property </li></ul></ul></ul><ul><ul><ul><li><ab>, <aa> and <ba> are all length-2 sequential patterns  <aba> is a length-3 candidate </li></ul></ul></ul><ul><ul><ul><li><(bd)>, <bb> and <db> are all length-2 sequential patterns  <(bd)b> is a length-3 candidate </li></ul></ul></ul><ul><ul><li>46 candidates are generated </li></ul></ul><ul><li>Find Length-3 Sequential Patterns </li></ul><ul><ul><li>Scan database once more, collect support counts for candidates </li></ul></ul><ul><ul><li>19 out of 46 candidates pass support threshold </li></ul></ul>
  100. 100. The GSP Mining Process (min_sup = 2)
1st scan: 8 candidates → 6 length-1 sequential patterns (<a> <b> <c> <d> <e> <f> <g> <h>)
2nd scan: 51 candidates → 19 length-2 sequential patterns; 10 candidates not in DB at all (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>)
3rd scan: 46 candidates → 19 length-3 sequential patterns; 20 candidates not in DB at all (<abb> <aab> <aba> <baa> <bab> …)
4th scan: 8 candidates → 6 length-4 sequential patterns (<abba> <(bd)bc> …)
5th scan: 1 candidate → 1 length-5 sequential pattern (<(bd)cba>)
Candidates are eliminated either because they cannot pass the support threshold or because they do not appear in the DB at all.
Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>
  101. 101. The GSP Algorithm <ul><li>Take sequences of the form <x> as length-1 candidates </li></ul><ul><li>Scan database once, find F 1 , the set of length-1 sequential patterns </li></ul><ul><li>Let k=1; while F k is not empty do </li></ul><ul><ul><li>Form C k+1 , the set of length-(k+1) candidates from F k ; </li></ul></ul><ul><ul><li>If C k+1 is not empty, scan database once, find F k+1 , the set of length-(k+1) sequential patterns </li></ul></ul><ul><ul><li>Let k=k+1; </li></ul></ul><ul><li>A runnable sketch of this loop follows. </li></ul>
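A minimal, self-contained sketch of the level-wise loop, under one simplifying assumption: candidates are formed by extending each frequent sequence with one frequent item (an s-extension opening a new element, or an i-extension growing the last element). This over-generates compared with GSP's self-join + prune, but finds the same sequential patterns.

```python
# A simplified level-wise sketch, not the full GSP join/prune.
# Sequences are tuples of frozensets.
def is_subsequence(sub, seq):
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:   # ordered itemset containment
            i += 1
    return i == len(sub)

def gsp(db, min_sup):
    db = [tuple(frozenset(e) for e in seq) for seq in db]
    items = sorted({x for seq in db for e in seq for x in e})
    freq_items = [x for x in items
                  if sum(any(x in e for e in seq) for seq in db) >= min_sup]
    F = {(frozenset([x]),) for x in freq_items}          # F1
    result = set(F)
    while F:                                             # level k -> k+1
        C = {p + (frozenset([x]),) for p in F for x in freq_items}   # s-extensions
        C |= {p[:-1] + (p[-1] | {x},) for p in F for x in freq_items
              if x not in p[-1]}                                     # i-extensions
        F = {c for c in C                                # one scan per level
             if sum(is_subsequence(c, seq) for seq in db) >= min_sup}
        result |= F
    return result
```

Run on the five-sequence database above with min_sup = 2, it returns the same sequential patterns as GSP, though it tests more candidates per level than GSP's join would.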
  102. 102. Bottlenecks of GSP <ul><li>A huge set of candidates could be generated </li></ul><ul><ul><li>1,000 frequent length-1 sequences generate 1,000*1,000 + 1,000*999/2 = 1,499,500 length-2 candidates! </li></ul></ul><ul><li>Multiple scans of database in mining </li></ul><ul><li>Real challenge: mining long sequential patterns </li></ul><ul><ul><li>An exponential number of short candidates </li></ul></ul><ul><ul><li>A length-100 sequential pattern needs 10^30 candidate sequences! </li></ul></ul>
  103. 103. PrefixSpan: Prefix-Projected Sequential Pattern Mining <ul><li>A divide-and-conquer approach </li></ul><ul><ul><li>Recursively project a sequence database into a set of smaller databases based on the current (short) frequent patterns </li></ul></ul><ul><ul><li>To project on one pattern, remove a prefix matching that pattern from the beginning of each sequence (a projection sketch follows the table) </li></ul></ul><ul><li>Example projections of sequence <a(abc)(ac)d(cf)> </li></ul>Frequent Pattern | Projection
<a>              | <(abc)(ac)d(cf)>
<b>              | <(_c)(ac)d(cf)>
<cd>             | <(cf)>
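A minimal sketch of the projection step for a single-item pattern (the helper name is an assumption; PrefixSpan applies this recursively with longer prefixes). Items within an element are taken in alphabetical order, and '_' marks the remainder of a partially matched element, as in the table above.

```python
# Project one sequence on a single-item prefix.
def project(seq, item):
    for i, element in enumerate(seq):
        if item in element:
            rest = {x for x in element if x > item}   # items after `item` in the element
            head = [frozenset({'_'}) | frozenset(rest)] if rest else []
            return head + list(seq[i + 1:])
    return None   # item absent: the sequence contributes nothing here

s = [frozenset('a'), frozenset('abc'), frozenset('ac'), frozenset('d'), frozenset('cf')]
print(project(s, 'a'))   # <(abc)(ac)d(cf)>
print(project(s, 'b'))   # <(_c)(ac)d(cf)>
```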
  104. 104. Mining Sequential Patterns by Prefix Projections <ul><li>Step 1: find length-1 sequential patterns </li></ul><ul><ul><li><a>, <b>, <c>, <d>, <e>, <f> </li></ul></ul><ul><li>Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets: </li></ul><ul><ul><li>The ones having prefix <a>; </li></ul></ul><ul><ul><li>The ones having prefix <b>; </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>The ones having prefix <f> </li></ul></ul>SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
  105. 105. Finding Seq. Patterns with Prefix <a> <ul><li>Only need to consider projections w.r.t. <a> </li></ul><ul><ul><li><a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> </li></ul></ul><ul><li>Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, by checking the frequency of items such as a and _a in the projected database </li></ul><ul><ul><li>Further partition into 6 subsets </li></ul></ul><ul><ul><ul><li>Having prefix <aa>; </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul><ul><ul><ul><li>Having prefix <af> </li></ul></ul></ul>SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
  106. 106. Completeness of PrefixSpan
SDB → length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Having prefix <a> → <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  → length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  → having prefix <aa>: <aa>-projected db; … ; having prefix <af>: <af>-projected db
Having prefix <b> → <b>-projected database → …
Having prefix <c>, …, <f> → …
SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
  107. 107. Efficiency of PrefixSpan <ul><li>No candidate sequence needs to be generated </li></ul><ul><li>Projected databases keep shrinking </li></ul><ul><li>Major cost of PrefixSpan: constructing projected databases </li></ul><ul><ul><li>Can be improved by bi-level projections </li></ul></ul>
  108. 108. Optimization Techniques in PrefixSpan <ul><li>Physical projection vs. pseudo-projection </li></ul><ul><ul><li>Pseudo-projection may reduce the effort of projection when the projected database fits in main memory </li></ul></ul><ul><li>Parallel projection vs. partition projection </li></ul><ul><ul><li>Partition projection may avoid the blowup of disk space </li></ul></ul>
  109. 109. Speed-up by Pseudo-projection <ul><li>Major cost of PrefixSpan: projection </li></ul><ul><ul><li>Postfixes of sequences often appear repeatedly in recursive projected databases </li></ul></ul><ul><li>When the (projected) database can be held in main memory, use pointers to form projections (see the sketch below) </li></ul><ul><ul><li>Pointer to the sequence </li></ul></ul><ul><ul><li>Offset of the postfix </li></ul></ul>Example with s = <a(abc)(ac)d(cf)>:
s|<a>: (pointer to s, offset 2) → <(abc)(ac)d(cf)>
s|<ab>: (pointer to s, offset 4) → <(_c)(ac)d(cf)>
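A minimal sketch of the pseudo-projection idea (the representation here is an assumption, not the paper's exact layout): a projected database is a list of (sequence-id, offset) pointers into the single in-memory copy of the data, and a postfix is materialized only on demand by slicing. A full version would also record a within-element offset for '_'-marked elements, as in the offsets 2 and 4 above.

```python
# Pseudo-projection: pointers instead of physical postfix copies.
store = {10: ['a', 'abc', 'ac', 'd', 'cf']}   # SID 10: <a(abc)(ac)d(cf)>

def postfix(sid, offset):
    """Read a postfix through a pointer instead of storing a copy."""
    return store[sid][offset:]

s_proj_a = [(10, 1)]   # s|<a>: point into s, skipping the matched first element
print([postfix(*p) for p in s_proj_a])   # [['abc', 'ac', 'd', 'cf']]
```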
  110. 110. Pseudo-Projection vs. Physical Projection <ul><li>Pseudo-projection avoids physically copying postfixes </li></ul><ul><ul><li>Efficient in running time and space when the database can be held in main memory </li></ul></ul><ul><li>However, it is not efficient when the database cannot fit in main memory </li></ul><ul><ul><li>Disk-based random accessing is very costly </li></ul></ul><ul><li>Suggested approach: </li></ul><ul><ul><li>Integrate physical and pseudo-projection </li></ul></ul><ul><ul><li>Swap to pseudo-projection once the data set fits in memory </li></ul></ul>
  111. 111. PrefixSpan Is Faster than GSP and FreeSpan
  112. 112. Effect of Pseudo-Projection
  113. 113. Extensions of Sequential Pattern Mining <ul><li>Closed- and max- sequential patterns </li></ul><ul><ul><li>Finding only the most meaningful (longest) sequential patterns </li></ul></ul><ul><li>Constraint-based sequential pattern growth </li></ul><ul><ul><li>Adding user-specific constraints </li></ul></ul><ul><li>From sequential patterns to structured patterns </li></ul><ul><ul><li>Beyond sequential patterns, mining structured patterns in XML documents </li></ul></ul>
  114. 114. Summary <ul><li>Association rule mining </li></ul><ul><li>Algorithms Apriori and FP-Growth </li></ul><ul><li>Max and closed patterns </li></ul><ul><li>Mining various kinds of association/correlation rules </li></ul><ul><li>Constraint-based association mining </li></ul><ul><li>Sequential pattern mining </li></ul>
