Association Rules Mining with SQL
    Association Rules Mining with SQL: Presentation Transcript

    • Association Rules Mining with SQL Kirsten Nelson Deepen Manek November 24, 2003
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Early data mining applications
      • Most early mining systems were developed largely on file systems, with specialized data structures and buffer management strategies devised for each
      • All data was read into memory before beginning computation
      • This limits the amount of data that can be mined
    • Advantage of SQL and RDBMS
      • Make use of database indexing and query processing capabilities
      • More than a decade spent on making these systems robust, portable, scalable, and concurrent
      • Exploit underlying SQL parallelization
      • For long-running algorithms, use checkpointing and space management
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Use of Database in Data Mining
      • “ Loose coupling” of application and data
        • How would you write an Apriori program?
        • Use SQL statements in an application
        • Use a cursor interface to read through records sequentially for each pass
        • Still two major performance problems:
          • Copying of record from database to memory
          • Process context switching for each record retrieved
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Tightly-coupled applications
      • Push computations into the database system to avoid performance degradation
      • Take advantage of user-defined functions (UDFs)
      • Does not require changes to database software
      • Two types of UDFs we will use:
        • Ones that are executed only a few times, regardless of the number of rows
        • Ones that are executed once for each selected row
    • Tight-coupling using UDFs
      • Procedure TightlyCoupledApriori():
      • begin
        • exec sql connect to database;
        • exec sql select allocSpace() into :blob from onerecord;
        • exec sql select * from sales where GenL1(:blob, TID, ITEMID) = 1;
        • notDone := true;
    • Tight-coupling using UDFs
      • while notDone do {
      • exec sql select aprioriGen(:blob) into :blob from onerecord;
      • exec sql select * from sales where itemCount(:blob, TID, ITEMID) = 1;
      • exec sql select GenLk(:blob) into :notDone from onerecord;
      • }
    • Tight-coupling using UDFs
      • exec sql select getResult(:blob) into :resultBlob from onerecord;
      • exec sql select deallocSpace(:blob) from onerecord;
      • compute Answer using resultBlob;
      • end
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Methodology
      • Comparison done with Association Rules against IBM DB2
      • Only consider generation of frequent itemsets using Apriori algorithm
      • Five alternatives considered:
        • Loose-coupling through SQL cursor interface – as described earlier
        • UDF tight-coupling – as described earlier
        • Stored-procedure to encapsulate mining algorithm
        • Cache-mine – caching data and mining on the fly
        • SQL implementations to force processing in the database
          • Consider two classes of implementations
            • SQL-92 – four different implementations
            • SQL-OR (with object relational extensions) – six implementations
    • Architectural Options
      • Stored procedure
        • Apriori algorithm encapsulated as a stored procedure
        • Implication: runs in the same address space as the DBMS
        • Mined results stored back into the DBMS.
      • Cache-mine
        • Variation of stored-procedure
        • Read entire data once from DBMS, temporarily cache data in a lookaside buffer on a local disk
        • Cached data is discarded when execution completes
        • Disadvantage – requires additional disk space for caching
        • Use Intelligent Miner’s “space” option
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Terminology
      • Use the following terminology
        • T: table of items
          • {tid,item} pairs
          • Data is normally sorted by transaction id
        • C k : candidate k-itemsets
          • Obtained from joining and pruning frequent itemsets from previous iteration
        • F k : frequent items sets of length k
          • Obtained from C k and T
    • Candidate Generation in SQL – join step
      • Generate C k from F k-1 by joining F k-1 with itself
        • insert into Ck select I1.item1, …, I1.itemk-1, I2.itemk-1
        • from Fk-1 I1, Fk-1 I2
        • where I1.item1 = I2.item1 and … and
        • I1.itemk-2 = I2.itemk-2 and
        • I1.itemk-1 < I2.itemk-1
    • Candidate Generation Example
      • F 3 is {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}}
      • C 4 is {{1,2,3,4},{1,3,4,5}}
      Table F3 (I1) and Table F3 (I2), both containing:
      item1  item2  item3
        1      2      3
        1      2      4
        1      3      4
        1      3      5
        2      3      4
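The join step above can be run end to end with SQLite from Python. This is an illustrative sketch, not the papers' implementation; the table and column names (F3, item1..item3) follow the slides' notation:

```python
import sqlite3

# Apriori candidate generation for k = 4: self-join F3 on the first
# k-2 items, with a strict inequality on the last item.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE F3 (item1 INT, item2 INT, item3 INT)")
con.executemany("INSERT INTO F3 VALUES (?,?,?)",
                [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

c4 = con.execute("""
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2
    WHERE I1.item1 = I2.item1
      AND I1.item2 = I2.item2
      AND I1.item3 < I2.item3
""").fetchall()
print(sorted(c4))   # [(1, 2, 3, 4), (1, 3, 4, 5)]
```

The result matches the slide's C4 before pruning: {1,2,3} joins {1,2,4} and {1,3,4} joins {1,3,5}.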
    • Pruning
      • Modify the candidate generation algorithm to ensure that, for each itemset in C k , all k of its subsets of length (k-1) are in F k-1
        • Do a k-way join, skipping item n-2 when joining with the n th table (2<n≤k)
        • Create primary index (item 1 , …, item k-1 ) on F k-1 to efficiently process k-way join
      • For k=4, this becomes
        • insert into C4 select I1.item1, I1.item2, I1.item3, I2.item3
        • from F3 I1, F3 I2, F3 I3, F3 I4
        • where I1.item1 = I2.item1 and I1.item2 = I2.item2 and I1.item3 < I2.item3 and
        • I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3 and
        • I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
    • Pruning Example
      • Evaluate join with I 3 using previous example
      • C 4 is {1,2,3,4}
      Tables F3 (I1), F3 (I2), and F3 (I3), each containing:
      item1  item2  item3
        1      2      3
        1      2      4
        1      3      4
        1      3      5
        2      3      4
    • Support counting using SQL
      • Two different approaches
        • Use the SQL-92 standard
          • Use ‘standard’ SQL syntax such as joins and subqueries to find support of itemsets
        • Use object-relational extensions of SQL (SQL-OR)
          • User Defined Functions (UDFs) & table functions
          • Binary Large Objects (BLOBs)
    • Support Counting using SQL-92
      • 4 different methods, two of which are detailed in the papers
        • K-way Joins
        • SubQuery
      • Other methods not discussed because of unacceptable performance
        • 3-way join
        • 2 Group-Bys
    • SQL-92: K-way join
      • Obtain F k by joining C k with table T of (tid,item)
      • Perform group by on the itemset
        • insert into Fk select item1, …, itemk, count(*)
        • from Ck, T t1, …, T tk
        • where t1.item = Ck.item1 and … and
        • tk.item = Ck.itemk and
        • t1.tid = t2.tid and … and
        • tk-1.tid = tk.tid
        • group by item1, …, itemk
        • having count(*) > :minsup
    • K-way join example
      • C 3 ={B,C,E} and minimum support required is 2
      • Insert into F 3 {B,C,E,2}
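A minimal runnable sketch of the K-way join for k = 3, using SQLite from Python. The transaction table below is an assumed dataset consistent with the slides' running examples (C3 = {B,C,E}, minsup = 2); the HAVING clause is written with >= so that a support of exactly 2 qualifies, as in the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (tid INT, item TEXT)")
rows = [(10, i) for i in "ACD"] + [(20, i) for i in "BCE"] + \
       [(30, i) for i in "ABCE"] + [(40, i) for i in "BE"]
con.executemany("INSERT INTO T VALUES (?,?)", rows)
con.execute("CREATE TABLE C3 (item1 TEXT, item2 TEXT, item3 TEXT)")
con.execute("INSERT INTO C3 VALUES ('B','C','E')")

# One copy of T per candidate position, all tied to the same tid.
f3 = con.execute("""
    SELECT item1, item2, item3, COUNT(*)
    FROM C3, T t1, T t2, T t3
    WHERE t1.item = C3.item1 AND t2.item = C3.item2 AND t3.item = C3.item3
      AND t1.tid = t2.tid AND t2.tid = t3.tid
    GROUP BY item1, item2, item3
    HAVING COUNT(*) >= 2
""").fetchall()
print(f3)   # [('B', 'C', 'E', 2)]
```

{B,C,E} is counted once for tid 20 and once for tid 30, giving the slide's F3 row {B,C,E,2}.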
    • K-way join: Pass-2 optimization
      • When calculating C 2 , no pruning is required after we join F 1 with itself
      • Don’t calculate and materialize C 2 - replace C 2 in 2-way join algorithm with join of F 1 with itself
        • insert into F2 select I1.item1, I2.item1, count(*)
        • from F1 I1, F1 I2, T t1, T t2
        • where I1.item1 < I2.item1 and
        • t1.item = I1.item1 and t2.item = I2.item1 and
        • t1.tid = t2.tid
        • group by I1.item1, I2.item1
        • having count(*) > :minsup
    • SQL-92: SubQuery based
      • Split support counting into cascade of k subqueries
      • n th subquery Q n finds all tids that match the distinct itemsets formed by the first n items of C k
        • insert into Fk select item1, …, itemk, count(*)
        • from (Subquery Qk) t
        • group by item1, …, itemk having count(*) > :minsup
        • Subquery Qn (for any n between 1 and k):
        • select item1, …, itemn, tid
        • from T tn, (Subquery Qn-1) as rn-1,
        • (select distinct item1, …, itemn from Ck) as dn
        • where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn-1
        • and rn-1.tid = tn.tid and tn.item = dn.itemn
    • Example of SubQuery based
      • Using previous example from class
        • C 3 = {B,C,E}, minimum support = 2
      • Q 0 : No subquery Q 0
      • Q 1 in this case becomes
        • select item1, tid
        • from T t1,
        • (select distinct item1 from C3) as d1
        • where t1.item = d1.item1
    • Example of SubQuery based cnt’d
      • Q 2 becomes
        • select item1, item2, tid from T t2, (Subquery Q1) as r1,
        • (select distinct item1, item2 from C3) as d2 where r1.item1 = d2.item1 and r1.tid = t2.tid and t2.item = d2.item2
    • Example of SubQuery based cnt’d
      • Q 3 becomes
        • select item1, item2, item3, tid from T t3, (Subquery Q2) as r2,
        • (select distinct item1, item2, item3 from C3) as d3
        • where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
        • r2.tid = t3.tid and t3.item = d3.item3
    • Example of SubQuery based cnt’d
      • Output of Q 3 is
      • Insert statement becomes
        • insert into F3 select item1, item2, item3, count(*)
        • from (Subquery Q3) t
        • group by item1, item2, item3 having count(*) > :minsup
      • Insert the row {B,C,E,2}
      • For Q 2 , pass-2 optimization can be used
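The whole subquery cascade for k = 3 can be checked against the same assumed running dataset; Q1 through Q3 are written as nested derived tables, each level extending the matched prefix by one item:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (tid INT, item TEXT)")
rows = [(10, i) for i in "ACD"] + [(20, i) for i in "BCE"] + \
       [(30, i) for i in "ABCE"] + [(40, i) for i in "BE"]
con.executemany("INSERT INTO T VALUES (?,?)", rows)
con.execute("CREATE TABLE C3 (item1 TEXT, item2 TEXT, item3 TEXT)")
con.execute("INSERT INTO C3 VALUES ('B','C','E')")

# Innermost derived table is Q1; each enclosing level is Qn joining
# T with Qn-1 and the distinct n-item prefixes of C3.
f3 = con.execute("""
    SELECT item1, item2, item3, COUNT(*)
    FROM (SELECT d3.item1, d3.item2, d3.item3, t3.tid
          FROM T t3,
               (SELECT d2.item1, d2.item2, t2.tid
                FROM T t2,
                     (SELECT d1.item1, t1.tid
                      FROM T t1,
                           (SELECT DISTINCT item1 FROM C3) d1
                      WHERE t1.item = d1.item1) r1,
                     (SELECT DISTINCT item1, item2 FROM C3) d2
                WHERE r1.item1 = d2.item1 AND r1.tid = t2.tid
                  AND t2.item = d2.item2) r2,
               (SELECT DISTINCT item1, item2, item3 FROM C3) d3
          WHERE r2.item1 = d3.item1 AND r2.item2 = d3.item2
            AND r2.tid = t3.tid AND t3.item = d3.item3)
    GROUP BY item1, item2, item3
    HAVING COUNT(*) >= 2
""").fetchall()
print(f3)   # [('B', 'C', 'E', 2)]
```

Q1 finds tids containing B (20, 30, 40), Q2 narrows to the prefix {B,C} (20, 30), and Q3 reaches {B,C,E} with support 2.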
    • Performance Comparisons of SQL-92 approaches
      • Used Version 5 of DB2 UDB and RS/6000 Model 140
        • 200 Mhz CPU, 256 MB main memory, 9 GB of disk space, Transfer rate of 8 MB/sec
      • Used 4 different item sets based on real-world data
      • Built the following indexes, which are not included in any cost calculations
        • Composite index (item1, …, itemk) on C k
        • k different indices on each of the k items in C k
        • (item,tid) and (tid,item) indexes on the data table T
    • Performance Comparisons of SQL-92 approaches
      • Among the SQL-92 approaches, the best performance was obtained by SubQuery
      • Even SubQuery was only comparable to loose-coupling in some cases, and failed to complete in others
        • For Dataset-C at 2% support, SubQuery outperforms loose-coupling, but decreasing support to 1% makes SubQuery take 10 times as long to complete
        • Lower support increases the size of C k and F k at each pass, causing the joins to process more rows
    • Support Counting using SQL with object-relational extensions
      • 6 different methods, four of which are detailed in the papers
        • GatherJoin
        • GatherCount
        • GatherPrune
        • Vertical
      • Other methods not discussed because of unacceptable performance
        • Horizontal
        • SBF
    • SQL Object-Relational Extension: GatherJoin
      • Generates all possible k-item combinations of items contained in a transaction and joins them with C k
        • An index is created on all items of C k
      • Uses the following table functions
        • Gather: Outputs records {tid,item-list}, with item-list being a BLOB or VARCHAR containing all items associated with the tid
        • Comb-K: returns all k-item combinations from the transaction
          • Output has k attributes T_itm 1 , …, T_itm k
    • GatherJoin
        • insert into Fk select item1, …, itemk, count(*)
        • from Ck,
        • (select t2.T_itm1, …, t2.T_itmk from T,
        • table(Gather(T.tid, T.item)) as t1,
        • table(Comb-K(t1.tid, t1.item-list)) as t2)
        • where t2.T_itm1 = Ck.item1 and … and
        • t2.T_itmk = Ck.itemk
        • group by Ck.item1, …, Ck.itemk
        • having count(*) > :minsup
    • Example of GatherJoin
      • t 1 (output from Gather) looks like:
      • t 2 (generated by Comb-K from t 1 ) will be joined with C 3 to obtain F 3
        • 1 row from Tid 10
        • 1 row from Tid 20
        • 4 rows from Tid 30
      • Insert {B,C,E,2}
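Gather and Comb-K are easy to mimic in plain Python, which makes GatherJoin's data flow concrete. The dataset is the same assumed running example; the real Gather and Comb-K are DB2 table functions, not Python:

```python
from itertools import combinations

def gather(rows):
    """Gather: group a (tid, item) stream into {tid: item-list}."""
    out = {}
    for tid, item in rows:
        out.setdefault(tid, []).append(item)
    return out

def comb_k(item_list, k):
    """Comb-K: all k-item combinations within one transaction."""
    return combinations(sorted(item_list), k)

rows = [(10, i) for i in "ACD"] + [(20, i) for i in "BCE"] + \
       [(30, i) for i in "ABCE"] + [(40, i) for i in "BE"]
c3 = {("B", "C", "E")}

# The join with C3 becomes a set-membership test on each combination.
support = {}
for tid, items in gather(rows).items():
    for combo in comb_k(items, 3):
        if combo in c3:
            support[combo] = support.get(combo, 0) + 1
print(support)   # {('B', 'C', 'E'): 2}
```

Tids 10 and 20 each emit one 3-item combination, tid 30 emits four, and tid 40 emits none; only {B,C,E} matches C3, twice.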
    • GatherJoin: Pass 2 optimization
      • When calculating C 2 , no pruning is required after we join F 1 with itself
      • Don’t calculate and materialize C 2 - replace C 2 with a join to F1 before the table function
        • Gather is only passed frequent 1-itemset rows
        • insert into F2 select t2.T_itm1, t2.T_itm2, count(*) from F1 I1,
        • (select t2.T_itm1, t2.T_itm2 from T, table(Gather(T.tid, T.item)) as t1,
        • table(Comb-K(t1.tid, t1.item-list)) as t2 where T.item = I1.item1)
        • group by t2.T_itm1, t2.T_itm2
        • having count(*) > :minsup
    • Variations of GatherJoin - GatherCount
      • Perform the GROUP BY inside the table function Comb-K for pass 2 optimization
      • Output of the table function Comb-K
        • Not the candidate frequent itemsets (C k )
        • But the actual frequent itemsets (F k ) along with the corresponding support
      • Use a 2-dimensional array to store possible frequent itemsets in Comb-K
        • May lead to excessive memory use
    • Variations of GatherJoin - GatherPrune
      • Push the join with C k into the table function Comb-K
      • C k is converted into a BLOB and passed as an argument to the table function.
        • The BLOB must be passed on every invocation of Comb-K, i.e., once for each row of table T
    • SQL Object-Relational Extension: Vertical
      • For each item, create a BLOB containing the tids the item belongs to
        • Use function Gather to generate {item,tid-list} pairs, storing results in table TidTable
        • Tid-list are all in the same sorted order
      • Use function Intersect to compare two different tid-lists and extract common values
      • Pass-2 optimization can be used for Vertical
        • Similar to K-way join method
        • Upcoming example does not show optimization
    • Vertical
        • insert into Fk select item1, …, itemk, count(tid-list) as cnt
        • from (Subquery Qk) t where cnt > :minsup
        • Subquery Qn (for any n between 2 and k):
        • select item1, …, itemn,
        • Intersect(rn-1.tid-list, tn.tid-list) as tid-list
        • from TidTable tn, (Subquery Qn-1) as rn-1,
        • (select distinct item1, …, itemn from Ck) as dn
        • where rn-1.item1 = dn.item1 and … and
        • rn-1.itemn-1 = dn.itemn-1 and
        • tn.item = dn.itemn
        • Subquery Q1: (select * from TidTable)
    • Example of Vertical
      • Using previous example from class
        • C 3 = {B,C,E}, minimum support = 2
      • Q 1 is TidTable
    • Example of Vertical cnt’d
      • Q 2 becomes
        • select item1, item2, Intersect(r1.tid-list, t2.tid-list) as tid-list
        • from TidTable t2, (Subquery Q1) as r1,
        • (select distinct item1, item2 from C3) as d2
        • where r1.item1 = d2.item1 and t2.item = d2.item2
    • Example of Vertical cnt’d
      • Q 3 becomes
        • select item1, item2, item3, Intersect(r2.tid-list, t3.tid-list) as tid-list
        • from TidTable t3, (Subquery Q2) as r2,
        • (select distinct item1, item2, item3 from C3) as d3
        • where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
        • t3.item = d3.item3
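The heart of Vertical is the Intersect function, a linear merge over two sorted tid-lists. A plain-Python sketch, with tid-lists shown as Python lists rather than BLOBs and the TidTable contents assumed from the running example:

```python
def intersect(a, b):
    """Merge-intersect two sorted tid-lists in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# TidTable: for each item, the sorted list of tids it occurs in.
tidtable = {"A": [10, 30], "B": [20, 30, 40],
            "C": [10, 20, 30], "E": [20, 30, 40]}

# Support of candidate {B, C, E} = length of the chained intersection.
tids = intersect(intersect(tidtable["B"], tidtable["C"]), tidtable["E"])
print(tids, len(tids))   # [20, 30] 2
```

Because every tid-list is kept in the same sorted order, each Intersect is a single merge pass with no sorting or hashing.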
    • Performance Comparisons using SQL-OR
    • Performance Comparisons using SQL-OR
    • Performance comparison of SQL object-relational approaches
      • Vertical has the best overall performance, sometimes an order of magnitude better than the other three approaches
        • The majority of its time is spent transforming the data into {item, tid-list} pairs
        • Vertical spends too much time on the second pass
      • Pass-2 optimization has huge impact on performance of GatherJoin
        • For Dataset-B with support of 0.1 %, running time for Pass 2 went from 5.2 hours to 10 minutes
      • Comb-K in GatherJoin generates large number of potential frequent itemsets we must work with
    • Hybrid approach
      • Previous charts and algorithm analysis show
        • Vertical spends too much time on pass 2 compared to other algorithms, especially when the support is decreased
        • GatherJoin degrades when the # of frequent items per transaction increases
      • To improve performance, use a hybrid algorithm
        • Use Vertical for most cases
        • When size of candidate itemset is too large, GatherJoin is a good option if number of frequent items per transaction (N f ) is not too large
        • When N f is large, GatherCount may be the only good option
    • Architecture Comparisons
      • Compare five alternatives
        • Loose-Coupling, Stored-procedure
          • Basically the same except for the address space in which the program runs
          • Because of limited difference in performance, focus solely on stored procedure in following charts
        • Cache-Mine
        • UDF tight-coupling
        • Best SQL approach (Hybrid)
    • Performance Comparisons of Architectures
    • Performance Comparisons of Architectures cnt’d
    • Performance Comparisons of Architectures cnt’d
      • Cache-Mine is the best or close to the best performance in all cases
        • Factor of 0.8 to 2 times faster than SQL approach
      • Stored procedure is the worst
        • The gap between Stored-procedure and Cache-Mine is directly related to the number of passes through the data
          • Passes increase when the support goes down
          • May need to make multiple passes if all candidates cannot fit in memory
      • UDF time per pass decreases 30-50% compared to stored procedure because of tighter coupling with DB
    • Performance Comparisons of Architectures cnt’d
      • SQL approach comes in second in performance to Cache-Mine
        • Somewhat better than Cache-Mine for high support values
        • 1.8 – 3 times better than Stored-procedure/loose-coupling approach, getting better when support value decreases
        • Cost of converting to Vertical format is less than cost of converting to binary format in Cache-Mine
        • For second pass through data, SQL approach takes much more time than Cache-Mine, particularly when we decrease the support
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Taxonomies - example
      • Hierarchy: Beverages covers Soft Drinks and Alcoholic Drinks; Soft Drinks covers Pepsi and Coke; Alcoholic Drinks covers Beer; Snacks covers Pretzels and Chocolate Bar
      • Example rule: Soft Drinks → Pretzels with 30% confidence, 2% support
      Parent            Child
      Beverages         Soft Drinks
      Beverages         Alcoholic Drinks
      Soft Drinks       Pepsi
      Soft Drinks       Coke
      Alcoholic Drinks  Beer
      Snacks            Pretzels
      Snacks            Chocolate Bar
    • Taxonomy augmentation
      • Algorithms similar to previous slides
      • Requires two additions to algorithm
        • Pruning itemsets containing an item and its ancestor
        • Pre-computing the ancestors for each item
      • Will also consider support counting
    • Pruning items and ancestors
      • In the second pass we will join F 1 with F 1 to give C 2
      • This will give, for example:
      • beverages,pepsi
      • snacks,coke
      • pretzels,chocolate bar
      • But beverages,pepsi is redundant!
    • Pruning items and ancestors
      • The following modification to the SQL statement eliminates such redundant combinations from being selected:
      • insert into C2 (select I1.item1, I2.item1 from F1 I1, F1 I2
      • where I1.item1 < I2.item1) except
      • (select ancestor, descendant from Ancestor union
      • select descendant, ancestor from Ancestor)
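A runnable check of the pruning statement in SQLite. Note that SQLite applies compound set operators left to right, so the "except (A union B)" form above is expressed here as two successive EXCEPTs; the tiny F1 and Ancestor contents are assumed for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE F1 (item1 TEXT)")
con.executemany("INSERT INTO F1 VALUES (?)",
                [("Beverages",), ("Pepsi",), ("Pretzels",)])
con.execute("CREATE TABLE Ancestor (ancestor TEXT, descendant TEXT)")
con.execute("INSERT INTO Ancestor VALUES ('Beverages', 'Pepsi')")

# Generate ordered pairs, then subtract ancestor/descendant pairs
# in both orientations.
c2 = con.execute("""
    SELECT I1.item1, I2.item1 FROM F1 I1, F1 I2
    WHERE I1.item1 < I2.item1
    EXCEPT SELECT ancestor, descendant FROM Ancestor
    EXCEPT SELECT descendant, ancestor FROM Ancestor
""").fetchall()
print(sorted(c2))   # [('Beverages', 'Pretzels'), ('Pepsi', 'Pretzels')]
```

The redundant pair (Beverages, Pepsi) is eliminated while the two meaningful candidates survive.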
    • Pre-computing ancestors
      • An ancestor table is created
        • Format (ancestor, descendant)
        • Use the transitive closure operation
      • insert into Ancestor with R-Tax (ancestor, descendant) as
      • (select parent, child from Tax union all
      • select p.ancestor, c.child from R-Tax p, Tax c
      • where p.descendant = c.parent)
      • select ancestor, descendant from R-Tax
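Modern SQL spells this transitive-closure computation as a recursive common table expression. A SQLite sketch over the example taxonomy (SQLite requires WITH RECURSIVE where DB2 accepts a plain WITH):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Tax (parent TEXT, child TEXT)")
con.executemany("INSERT INTO Tax VALUES (?,?)", [
    ("Beverages", "Soft Drinks"), ("Beverages", "Alcoholic Drinks"),
    ("Soft Drinks", "Pepsi"), ("Soft Drinks", "Coke"),
    ("Alcoholic Drinks", "Beer"),
    ("Snacks", "Pretzels"), ("Snacks", "Chocolate Bar"),
])

# Base case: direct parent/child edges; recursive case: extend each
# known ancestor path by one more Tax edge.
anc = con.execute("""
    WITH RECURSIVE RTax(ancestor, descendant) AS (
        SELECT parent, child FROM Tax
        UNION ALL
        SELECT p.ancestor, c.child FROM RTax p, Tax c
        WHERE p.descendant = c.parent
    )
    SELECT ancestor, descendant FROM RTax
""").fetchall()
print(("Beverages", "Pepsi") in anc)   # True
```

Besides the seven direct edges, the closure adds the transitive pairs (Beverages, Pepsi), (Beverages, Coke), and (Beverages, Beer), ten rows in total.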
    • Support Counting
      • Extensions to handle taxonomies
        • Straightforward, but
        • Non-trivial
      • Need an extended transaction table
        • For example, if we have {coke, pretzels}
        • We add also {soft drinks, pretzels}, {beverages, pretzels}, {coke, snacks}, {soft drinks, snacks}, {beverages, snacks}
    • Extended transaction table
      • Can be obtained by the following SQL
      • Query to generate T*
      • select item, tid from T union
      • select distinct A.ancestor as item, T.tid
      • from T, Ancestor A
      • where A.descendant = T.item
      • The “select distinct” clause ensures each ancestor appears only once per transaction – e.g. beverages should not be added twice to a transaction containing both pepsi and coke
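The T* query can be verified directly in SQLite. The single-transaction data below is assumed, chosen so that two items (pepsi and coke) share the ancestors soft drinks and beverages:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (tid INT, item TEXT)")
con.executemany("INSERT INTO T VALUES (?,?)",
                [(1, "Pepsi"), (1, "Coke"), (1, "Pretzels")])
con.execute("CREATE TABLE Ancestor (ancestor TEXT, descendant TEXT)")
con.executemany("INSERT INTO Ancestor VALUES (?,?)", [
    ("Soft Drinks", "Pepsi"), ("Soft Drinks", "Coke"),
    ("Beverages", "Pepsi"), ("Beverages", "Coke"),
    ("Beverages", "Soft Drinks"), ("Snacks", "Pretzels"),
])

# Original items plus each distinct ancestor for the same tid.
tstar = con.execute("""
    SELECT item, tid FROM T
    UNION
    SELECT DISTINCT A.ancestor AS item, T.tid
    FROM T, Ancestor A
    WHERE A.descendant = T.item
""").fetchall()
print(sorted(tstar))
```

The extended transaction holds six rows: the three original items plus Soft Drinks, Beverages, and Snacks, each exactly once despite Pepsi and Coke sharing two ancestors.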
    • Pipelining of Query
      • No need to actually build T*
      • Make following modification to SQL:
        • insert into Fk with T*(tid, item) as (Query for T*)
        • select item1, …, itemk, count(*)
        • from Ck, T* t1, …, T* tk
        • where t1.item = Ck.item1 and … and
        • tk.item = Ck.itemk and
        • t1.tid = t2.tid and … and
        • tk-1.tid = tk.tid
        • group by item1, …, itemk
        • having count(*) > :minsup
    • Organization of Presentation
      • Overview – Data Mining and RDBMS
      • Loosely-coupled data and programs
      • Tightly-coupled data and programs
      • Architectural approaches
      • Methods of writing efficient SQL
        • Candidate generation, pruning, support counting
        • K-way join, SubQuery, GatherJoin, Vertical, Hybrid
      • Integrating taxonomies
      • Mining sequential patterns
    • Sequential patterns
      • Similar to papers covered on Nov 17
      • Input is sequences of transactions
        • E.g. ((computer,modem),(printer))
      • Similar to association rules, but dealing with sequences as opposed to sets
      • Can also specify maximum and minimum time gaps, as well as sliding time windows
        • Max-gap, min-gap, window-size
    • Input and output formats
      • Input has three columns:
        • Sequence identifier (sid)
        • Transaction time (time)
        • Item identifier (item)
      • Output format is a collection of frequent sequences, in a fixed-width table
        • (item 1 , eno 1 ,…,item k , eno k , len)
        • For smaller lengths, extra column values are set to NULL
    • GSP algorithm
      • Similar to algorithms shown earlier
      • Each C k has items and element numbers, but no len column – every candidate in C k has fixed length k
      • Candidates are generated in two steps
        • Join – join F k-1 with itself
          • Sequence s 1 joins with s 2 if the subsequence obtained by dropping the first item of s 1 is the same as the one obtained by dropping the last item of s 2
          • When generating C 2 , we need to generate sequences where both of the items appear as a single element as well as two separate elements
        • Prune
          • All candidate sequences that have a non-frequent contiguous (k-1) subsequence are deleted
    • GSP – Join SQL
      • insert into Ck
      • select I1.item1, I1.eno1, …, I1.itemk-1, I1.enok-1,
      • I2.itemk-1, I1.enok-1 + I2.enok-1 – I2.enok-2
      • from Fk-1 I1, Fk-1 I2
      • where I1.item2 = I2.item1 and … and I1.itemk-1 = I2.itemk-2 and
      • I1.eno3 – I1.eno2 = I2.eno2 – I2.eno1 and … and
      • I1.enok-1 – I1.enok-2 = I2.enok-2 – I2.enok-3
    • GSP – Prune SQL
      • Write as a k-way join, similar to before
      • There are at most k contiguous subsequences of length (k-1) for which F k-1 needs to be checked for membership
      • Note that all (k-1) subsequences may not be contiguous because of the max-gap constraint between consecutive elements.
    • GSP – Support Counting
      • In each pass, we use the candidate table C k and the input data-sequences table D to count the support
      • K-way join
        • We use select distinct before the group by to ensure that only distinct data-sequences are counted
        • We have additional predicates between sequence numbers to handle the special time elements
    • GSP – Support Counting SQL
      • (Ck.enoj = Ck.enoi and abs(dj.time – di.time) ≤ window-size) or
      • (Ck.enoj = Ck.enoi + 1 and dj.time – di.time ≤ max-gap and dj.time – di.time > min-gap) or
      • (Ck.enoj > Ck.enoi + 1)
    • References
      • Developing Tightly-Coupled Data Mining Applications on a Relational Database System
        • Rakesh Agrawal, Kyuseok Shim, 1996
      • Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications
        • Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal, 1998
        • Refers to 1) above
      • Mining Generalized Association Rules and Sequential Patterns Using SQL Queries
        • Shiby Thomas, Sunita Sarawagi, 1998
        • Refers to 1) and 2) above