Association Rules Mining with SQL


  1. 1. Association Rules Mining with SQL Kirsten Nelson Deepen Manek November 24, 2003
  2. 2. Organization of Presentation <ul><li>Overview – Data Mining and RDBMS </li></ul><ul><li>Loosely-coupled data and programs </li></ul><ul><li>Tightly-coupled data and programs </li></ul><ul><li>Architectural approaches </li></ul><ul><li>Methods of writing efficient SQL </li></ul><ul><ul><li>Candidate generation, pruning, support counting </li></ul></ul><ul><ul><li>K-way join, SubQuery, GatherJoin, Vertical, Hybrid </li></ul></ul><ul><li>Integrating taxonomies </li></ul><ul><li>Mining sequential patterns </li></ul>
  3. 3. Early data mining applications <ul><li>Most early mining systems were developed largely on file systems, with specialized data structures and buffer management strategies devised for each </li></ul><ul><li>All data was read into memory before beginning computation </li></ul><ul><li>This limits the amount of data that can be mined </li></ul>
  4. 4. Advantage of SQL and RDBMS <ul><li>Make use of database indexing and query processing capabilities </li></ul><ul><li>More than a decade spent on making these systems robust, portable, scalable, and concurrent </li></ul><ul><li>Exploit underlying SQL parallelization </li></ul><ul><li>For long-running algorithms, use checkpointing and space management </li></ul>
  6. 6. Use of Database in Data Mining <ul><li>“Loose coupling” of application and data </li></ul><ul><ul><li>How would you write an Apriori program? </li></ul></ul><ul><ul><li>Use SQL statements in an application </li></ul></ul><ul><ul><li>Use a cursor interface to read through records sequentially for each pass </li></ul></ul><ul><ul><li>Still two major performance problems: </li></ul></ul><ul><ul><ul><li>Copying of each record from the database into application memory </li></ul></ul></ul><ul><ul><ul><li>Process context switching for each record retrieved </li></ul></ul></ul>
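A minimal sketch of the loose-coupling pattern described above, using Python's built-in `sqlite3` module as a stand-in for the RDBMS. The `sales` table layout and the sample rows are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: (tid, item) pairs.
rows = [(10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
        (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")]

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (tid integer, item text)")
conn.executemany("insert into sales values (?, ?)", rows)

def count_items_loosely(conn):
    """One Apriori pass done in the application: every record is copied
    out of the database through a cursor and counted in program memory."""
    counts = Counter()
    cursor = conn.execute("select tid, item from sales order by tid")
    for _tid, item in cursor:  # one fetch per record
        counts[item] += 1
    return counts

item_counts = count_items_loosely(conn)
min_support = 2  # assumed threshold
frequent_1 = sorted(i for i, c in item_counts.items() if c >= min_support)
print(frequent_1)
```

Every pass of Apriori would repeat this full scan through the cursor, which is where the per-record copying and context-switching costs listed above come from.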
  8. 8. Tightly-coupled applications <ul><li>Push computations into the database system to avoid performance degradation </li></ul><ul><li>Take advantage of user-defined functions (UDFs) </li></ul><ul><li>Does not require changes to database software </li></ul><ul><li>Two types of UDFs we will use: </li></ul><ul><ul><li>Ones that are executed only a few times, regardless of the number of rows </li></ul></ul><ul><ul><li>Ones that are executed once for each selected row </li></ul></ul>
  9. 9. Tight-coupling using UDFs <ul><li>Procedure TightlyCoupledApriori(): </li></ul><ul><li>begin </li></ul><ul><ul><li>exec sql connect to database; </li></ul></ul><ul><ul><li>exec sql select allocSpace() into :blob from onerecord; </li></ul></ul><ul><ul><li>exec sql select * from sales where GenL 1 (:blob, TID, ITEMID) = 1; </li></ul></ul><ul><ul><li>notDone := true; </li></ul></ul>
  10. 10. Tight-coupling using UDFs <ul><li>while notDone do { </li></ul><ul><li>exec sql select aprioriGen(:blob) into :blob from onerecord; </li></ul><ul><li>exec sql select * from sales where itemCount(:blob, TID, ITEMID) = 1; </li></ul><ul><li>exec sql select GenL k (:blob) into :notDone from onerecord; </li></ul><ul><li>} </li></ul>
  11. 11. Tight-coupling using UDFs <ul><li>exec sql select getResult(:blob) into :resultBlob from onerecord; </li></ul><ul><li>exec sql select deallocSpace(:blob) from onerecord; </li></ul><ul><li>compute Answer using resultBlob; </li></ul><ul><li>end </li></ul>
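The per-row UDF idea in the procedure above can be sketched with `sqlite3.Connection.create_function`. This is only an analogy: SQLite runs the Python callback inside the application process, so it shows the query shape but not the address-space benefit of a DB2 UDF. The sample rows, the `counts` work area standing in for `:blob`, and the choice to have the UDF always return 0 (so that no rows are shipped back) are assumptions.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: (tid, itemid) pairs.
rows = [(10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
        (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")]

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (tid integer, itemid text)")
conn.executemany("insert into sales values (?, ?)", rows)

counts = Counter()  # plays the role of the :blob work area (assumption)

def item_count(tid, itemid):
    """Invoked by the engine once per scanned row; updates the work area."""
    counts[itemid] += 1
    return 0  # never equals 1, so the predicate selects no rows

conn.create_function("itemCount", 2, item_count)
returned = conn.execute(
    "select * from sales where itemCount(tid, itemid) = 1").fetchall()

print(returned)                # no rows leave the query
print(sorted(counts.items()))  # counting happened as a side effect
```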
  13. 13. Methodology <ul><li>Comparison done with Association Rules against IBM DB2 </li></ul><ul><li>Only consider generation of frequent itemsets using Apriori algorithm </li></ul><ul><li>Five alternatives considered: </li></ul><ul><ul><li>Loose-coupling through SQL cursor interface – as described earlier </li></ul></ul><ul><ul><li>UDF tight-coupling – as described earlier </li></ul></ul><ul><ul><li>Stored-procedure to encapsulate mining algorithm </li></ul></ul><ul><ul><li>Cache-mine – caching data and mining on the fly </li></ul></ul><ul><ul><li>SQL implementations to force processing in the database </li></ul></ul><ul><ul><ul><li>Consider two classes of implementations </li></ul></ul></ul><ul><ul><ul><ul><li>SQL-92 – four different implementations </li></ul></ul></ul></ul><ul><ul><ul><ul><li>SQL-OR (with object relational extensions) – six implementations </li></ul></ul></ul></ul>
  14. 14. Architectural Options <ul><li>Stored procedure </li></ul><ul><ul><li>Apriori algorithm encapsulated as a stored procedure </li></ul></ul><ul><ul><li>Implication: runs in the same address space as the DBMS </li></ul></ul><ul><ul><li>Mined results stored back into the DBMS. </li></ul></ul><ul><li>Cache-mine </li></ul><ul><ul><li>Variation of stored-procedure </li></ul></ul><ul><ul><li>Read entire data once from DBMS, temporarily cache data in a lookaside buffer on a local disk </li></ul></ul><ul><ul><li>Cached data is discarded when execution completes </li></ul></ul><ul><ul><li>Disadvantage – requires additional disk space for caching </li></ul></ul><ul><ul><li>Use Intelligent Miner’s “space” option </li></ul></ul>
  16. 16. Terminology <ul><li>Use the following terminology </li></ul><ul><ul><li>T: table of items </li></ul></ul><ul><ul><ul><li>{tid,item} pairs </li></ul></ul></ul><ul><ul><ul><li>Data is normally sorted by transaction id </li></ul></ul></ul><ul><ul><li>C k : candidate k-itemsets </li></ul></ul><ul><ul><ul><li>Obtained from joining and pruning frequent itemsets from previous iteration </li></ul></ul></ul><ul><ul><li>F k : frequent items sets of length k </li></ul></ul><ul><ul><ul><li>Obtained from C k and T </li></ul></ul></ul>
  17. 17. Candidate Generation in SQL – join step <ul><li>Generate C k from F k-1 by joining F k-1 with itself </li></ul><ul><ul><li>insert into C k select I 1 .item 1 ,…,I 1 .item k-1 ,I 2 .item k-1 </li></ul></ul><ul><ul><li>from F k-1 I 1 ,F k-1 I 2 </li></ul></ul><ul><ul><li>where I 1 .item 1 = I 2 .item 1 and </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>I 1 .item k-2 = I 2 .item k-2 and </li></ul></ul><ul><ul><li>I 1 .item k-1 < I 2 .item k-1 </li></ul></ul>
  18. 18. Candidate Generation Example <ul><li>F 3 is {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}} </li></ul><ul><li>C 4 is {{1,2,3,4},{1,3,4,5}} </li></ul><ul><li>Table F 3 , used under both aliases I 1 and I 2 , has rows (item 1 , item 2 , item 3 ): (1,2,3), (1,2,4), (1,3,4), (1,3,5), (2,3,4) </li></ul>
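The join step can be checked against this example with a small runnable sketch. It uses Python's `sqlite3` module in place of DB2; the table name `F3` and the column names `item1` to `item3` are spelled without subscripts, which is an assumption about how the slide notation maps to real identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table F3 (item1 int, item2 int, item3 int)")
conn.executemany(
    "insert into F3 values (?, ?, ?)",
    [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)],
)

# Self-join of F3: the first k-2 items must match, the last item of I1
# must be smaller than the last item of I2.
join_sql = """
select I1.item1, I1.item2, I1.item3, I2.item3
from F3 I1, F3 I2
where I1.item1 = I2.item1
  and I1.item2 = I2.item2
  and I1.item3 < I2.item3
order by 1, 2, 3, 4
"""
c4 = conn.execute(join_sql).fetchall()
print(c4)
```

The two rows produced correspond to the candidates {1,2,3,4} and {1,3,4,5} before any pruning.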
  19. 19. Pruning <ul><li>Modify the candidate generation algorithm to ensure that, for each candidate in C k , all k of its subsets of length (k-1) are in F k-1 </li></ul><ul><ul><li>Do a k-way join, skipping item n-2 when joining with the n th table (2<n≤k) </li></ul></ul><ul><ul><li>Create primary index (item 1 , …, item k-1 ) on F k-1 to efficiently process k-way join </li></ul></ul><ul><li>For k=4, this becomes </li></ul><ul><ul><li>insert into C 4 select I 1 .item 1 , I 1 .item 2 , I 1 .item 3 , I 2 .item 3 from F 3 I 1 , F 3 I 2 , </li></ul></ul><ul><ul><li>F 3 I 3 , F 3 I 4 where I 1 .item 1 = I 2 .item 1 … and I 1 .item 3 < I 2 .item 3 and </li></ul></ul><ul><ul><li>I 1 .item 2 = I 3 .item 1 and I 1 .item 3 = I 3 .item 2 and I 2 .item 3 = I 3 .item 3 and </li></ul></ul><ul><ul><li>I 1 .item 1 = I 4 .item 1 and I 1 .item 3 = I 4 .item 2 and I 2 .item 3 = I 4 .item 3 </li></ul></ul>
  20. 20. Pruning Example <ul><li>Evaluate join with I 3 using previous example </li></ul><ul><li>C 4 is {{1,2,3,4}}; the candidate {1,3,4,5} is pruned because its subset {3,4,5} is not in F 3 </li></ul><ul><li>Table F 3 , used under the aliases I 1 , I 2 and I 3 , has rows (item 1 , item 2 , item 3 ): (1,2,3), (1,2,4), (1,3,4), (1,3,5), (2,3,4) </li></ul>
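A runnable sketch of the combined join-and-prune query for k=4, again with `sqlite3` standing in for DB2 and with subscript-free identifiers (`F3`, `item1`, ...) as an assumed spelling of the slide notation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table F3 (item1 int, item2 int, item3 int)")
conn.executemany(
    "insert into F3 values (?, ?, ?)",
    [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)],
)

prune_sql = """
select I1.item1, I1.item2, I1.item3, I2.item3
from F3 I1, F3 I2, F3 I3, F3 I4
where I1.item1 = I2.item1 and I1.item2 = I2.item2 and I1.item3 < I2.item3
  -- I3 must contain the 3-subset that skips item1
  and I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3
  -- I4 must contain the 3-subset that skips item2
  and I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
"""
c4_pruned = conn.execute(prune_sql).fetchall()
print(c4_pruned)
```

The subsets that skip item3 and item4 need no extra join, because they are exactly the two rows I1 and I2 that produced the candidate.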
  21. 21. Support counting using SQL <ul><li>Two different approaches </li></ul><ul><ul><li>Use the SQL-92 standard </li></ul></ul><ul><ul><ul><li>Use ‘standard’ SQL syntax such as joins and subqueries to find support of itemsets </li></ul></ul></ul><ul><ul><li>Use object-relational extensions of SQL (SQL-OR) </li></ul></ul><ul><ul><ul><li>User Defined Functions (UDFs) & table functions </li></ul></ul></ul><ul><ul><ul><li>Binary Large Objects (BLOBs) </li></ul></ul></ul>
  22. 22. Support Counting using SQL-92 <ul><li>4 different methods, two of which are detailed in the papers </li></ul><ul><ul><li>K-way Joins </li></ul></ul><ul><ul><li>SubQuery </li></ul></ul><ul><li>Other methods not discussed because of unacceptable performance </li></ul><ul><ul><li>3-way join </li></ul></ul><ul><ul><li>2 Group-Bys </li></ul></ul>
  23. 23. SQL-92: K-way join <ul><li>Obtain F k by joining C k with table T of (tid,item) </li></ul><ul><li>Perform group by on the itemset </li></ul><ul><ul><li>insert into F k select item 1 ,…,item k ,count(*) </li></ul></ul><ul><ul><li>from C k , T t 1 , …, T t k </li></ul></ul><ul><ul><li>where t 1 .item = C k .item 1 and … and </li></ul></ul><ul><ul><li>t k .item = C k .item k and </li></ul></ul><ul><ul><li>t 1 .tid = t 2 .tid and … and </li></ul></ul><ul><ul><li>t k-1 .tid = t k .tid </li></ul></ul><ul><ul><li>group by item 1 ,…,item k </li></ul></ul><ul><ul><li>having count(*) > :minsup </li></ul></ul>
  24. 24. K-way join example <ul><li>C 3 ={B,C,E} and minimum support required is 2 </li></ul><ul><li>Insert into F 3 {B,C,E,2} </li></ul>
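The K-way join for this example can be sketched as follows. The transaction rows are an assumption (the slide's data table is not in the transcript); they are chosen so that {B,C,E} has support 2. The slides write `count(*) > :minsup`, while this sketch uses `>=` so that a support of 2 passes a minimum support of 2 as in the example; that choice is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid int, item text)")
conn.execute("create table C3 (item1 text, item2 text, item3 text)")
# Hypothetical transactions: 10={A,C,D}, 20={B,C,E}, 30={A,B,C,E}, 40={B,E}
conn.executemany("insert into T values (?, ?)", [
    (10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
    (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")])
conn.execute("insert into C3 values ('B', 'C', 'E')")

# One copy of T per item position, all tied to the same transaction id.
kway_sql = """
select C3.item1, C3.item2, C3.item3, count(*)
from C3, T t1, T t2, T t3
where t1.item = C3.item1 and t2.item = C3.item2 and t3.item = C3.item3
  and t1.tid = t2.tid and t2.tid = t3.tid
group by C3.item1, C3.item2, C3.item3
having count(*) >= :minsup
"""
f3 = conn.execute(kway_sql, {"minsup": 2}).fetchall()
print(f3)
```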
  25. 25. K-way join: Pass-2 optimization <ul><li>When calculating C 2 , no pruning is required after we join F 1 with itself </li></ul><ul><li>Don’t calculate and materialize C 2 - replace C 2 in 2-way join algorithm with join of F 1 with itself </li></ul><ul><ul><li>insert into F 2 select I 1 .item 1 , I 2 .item 1 ,count(*) </li></ul></ul><ul><ul><li>from F 1 I 1 , F 1 I 2 , T t 1 , T t 2 </li></ul></ul><ul><ul><li>where I 1 .item 1 < I 2 .item 1 and </li></ul></ul><ul><ul><li>t 1 .item = I 1 .item 1 and t 2 .item = I 2 .item 1 and </li></ul></ul><ul><ul><li>t 1 .tid = t 2 .tid </li></ul></ul><ul><ul><li>group by I 1 .item 1 ,I 2 .item 1 </li></ul></ul><ul><ul><li>having count(*) > :minsup </li></ul></ul>
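A sketch of the pass-2 shortcut, in which C 2 is never materialized and F 1 is joined with itself inside the counting query. The transaction rows, the subscript-free table names, and the use of `>=` for the support test are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid int, item text)")
conn.execute("create table F1 (item1 text)")
# Hypothetical transactions: 10={A,C,D}, 20={B,C,E}, 30={A,B,C,E}, 40={B,E}
conn.executemany("insert into T values (?, ?)", [
    (10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
    (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")])

minsup = 2
# Pass 1: frequent single items.
conn.execute(
    "insert into F1 select item from T group by item having count(*) >= ?",
    (minsup,))

# Pass 2: F1 joined with itself replaces the candidate table C2.
pass2_sql = """
select I1.item1, I2.item1, count(*)
from F1 I1, F1 I2, T t1, T t2
where I1.item1 < I2.item1
  and t1.item = I1.item1 and t2.item = I2.item1
  and t1.tid = t2.tid
group by I1.item1, I2.item1
having count(*) >= :minsup
order by 1, 2
"""
f2 = conn.execute(pass2_sql, {"minsup": minsup}).fetchall()
print(f2)
```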
  26. 26. SQL-92: SubQuery based <ul><li>Split support counting into cascade of k subqueries </li></ul><ul><li>n th subquery Q n finds all tids that match the distinct itemsets formed by the first n items of C k </li></ul><ul><ul><li>insert into F k select item 1 , …, item k , count(*) </li></ul></ul><ul><ul><li>from (Subquery Q k ) t </li></ul></ul><ul><ul><li>Group by item 1 , item 2 … , item k having count(*) > :minsup </li></ul></ul><ul><ul><li>Subquery Q n (for any n between 1 and k): </li></ul></ul><ul><ul><li>select item 1 , …, item n , tid </li></ul></ul><ul><ul><li>from T t n , (Subquery Q n-1 ) as r n-1 , </li></ul></ul><ul><ul><li>(select distinct item 1 , …, item n from C k ) as d n </li></ul></ul><ul><ul><li>where r n-1 .item 1 = d n .item 1 and … and r n-1 .item n-1 = d n .item n-1 </li></ul></ul><ul><ul><li>and r n-1 .tid = t n .tid and t n .item = d n .item n </li></ul></ul>
  27. 27. Example of SubQuery based <ul><li>Using previous example from class </li></ul><ul><ul><li>C 3 = {B,C,E}, minimum support = 2 </li></ul></ul><ul><li>Q 0 : No subquery Q 0 </li></ul><ul><li>Q 1 in this case becomes </li></ul><ul><ul><li>select item 1 , tid </li></ul></ul><ul><ul><li>From T t 1 , </li></ul></ul><ul><ul><li>(select distinct item 1 from C 3 ) as d 1 </li></ul></ul><ul><ul><li>where t 1 .item = d 1 .item1 </li></ul></ul>
  28. 28. Example of SubQuery based cnt’d <ul><li>Q 2 becomes </li></ul><ul><ul><li>select item 1 , item 2 , tid from T t 2 , (Subquery Q 1 ) as r 1 , </li></ul></ul><ul><ul><li>(select distinct item 1 , item 2 from C 3 ) as d 2 where r 1 .item 1 = d 2 .item 1 and r 1 .tid = t 2 .tid and t 2 .item = d 2 .item 2 </li></ul></ul>
  29. 29. Example of SubQuery based cnt’d <ul><li>Q 3 becomes </li></ul><ul><ul><li>select item 1 ,item 2 ,item 3 , tid from T t 3 , (Subquery Q 2 ) as r 2 , </li></ul></ul><ul><ul><li>(select distinct item 1 ,item 2 ,item 3 from C 3 ) as d 3 </li></ul></ul><ul><ul><li>where r 2 .item 1 = d 3 .item 1 and r 2 .item 2 = d 3 .item 2 and </li></ul></ul><ul><ul><li>r 2 .tid = t 3 .tid and t 3 .item = d 3 .item 3 </li></ul></ul>
  30. 30. Example of SubQuery based cnt’d <ul><li>Output of Q 3 is </li></ul><ul><li>Insert statement becomes </li></ul><ul><ul><li>insert into F 3 select item 1 , item 2 , item 3 , count(*) </li></ul></ul><ul><ul><li>from (Subquery Q 3 ) t </li></ul></ul><ul><ul><li>group by item 1 , item 2 ,item 3 having count(*) > :minsup </li></ul></ul><ul><li>Insert the row {B,C,E,2} </li></ul><ul><li>For Q 2 , pass-2 optimization can be used </li></ul>
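The cascade Q 1 , Q 2 , Q 3 from the last few slides can be assembled and run as one nested query. This is a sketch under assumptions: the transaction rows are hypothetical, identifiers are written without subscripts, and the support test uses `>=` so that the row {B,C,E,2} passes a minimum support of 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid int, item text)")
conn.execute("create table C3 (item1 text, item2 text, item3 text)")
# Hypothetical transactions: 10={A,C,D}, 20={B,C,E}, 30={A,B,C,E}, 40={B,E}
conn.executemany("insert into T values (?, ?)", [
    (10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
    (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")])
conn.execute("insert into C3 values ('B', 'C', 'E')")

# Q1: tids containing the first item of some candidate.
q1 = """select d1.item1 as item1, t1.tid as tid
        from T t1, (select distinct item1 from C3) as d1
        where t1.item = d1.item1"""
# Q2: extend Q1 matches with the second item in the same transaction.
q2 = f"""select d2.item1 as item1, d2.item2 as item2, t2.tid as tid
         from T t2, ({q1}) as r1,
              (select distinct item1, item2 from C3) as d2
         where r1.item1 = d2.item1
           and r1.tid = t2.tid and t2.item = d2.item2"""
# Q3: extend Q2 matches with the third item.
q3 = f"""select d3.item1 as item1, d3.item2 as item2, d3.item3 as item3,
                t3.tid as tid
         from T t3, ({q2}) as r2,
              (select distinct item1, item2, item3 from C3) as d3
         where r2.item1 = d3.item1 and r2.item2 = d3.item2
           and r2.tid = t3.tid and t3.item = d3.item3"""

final_sql = f"""select item1, item2, item3, count(*)
                from ({q3}) t
                group by item1, item2, item3
                having count(*) >= :minsup"""
f3 = conn.execute(final_sql, {"minsup": 2}).fetchall()
q3_rows = sorted(conn.execute(q3).fetchall())
print(q3_rows)
print(f3)
```

Printing `q3_rows` shows the intermediate output of Q 3 : one row per transaction that contains the whole candidate.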
  31. 31. Performance Comparisons of SQL-92 approaches <ul><li>Used Version 5 of DB2 UDB and RS/6000 Model 140 </li></ul><ul><ul><li>200 Mhz CPU, 256 MB main memory, 9 GB of disk space, Transfer rate of 8 MB/sec </li></ul></ul><ul><li>Used 4 different item sets based on real-world data </li></ul><ul><li>Built the following indexes, which are not included in any cost calculations </li></ul><ul><ul><li>Composite index (item1, …, itemk) on C k </li></ul></ul><ul><ul><li>k different indices on each of the k items in C k </li></ul></ul><ul><ul><li>(item,tid) and (tid,item) indexes on the data table T </li></ul></ul>
  32. 32. Performance Comparisons of SQL-92 approaches <ul><li>Among the SQL-92 approaches, the best performance was obtained by SubQuery </li></ul><ul><li>SubQuery was only comparable to loose-coupling in some cases, failing to complete in other cases </li></ul><ul><ul><li>On DataSet C with a support of 2%, SubQuery outperforms loose-coupling, but when support is decreased to 1%, SubQuery takes 10 times as long to complete </li></ul></ul><ul><ul><li>Lower support will increase the size of C k and F k at each step, causing the join to process more rows </li></ul></ul>
  33. 33. Support Counting using SQL with object-relational extensions <ul><li>6 different methods, four of which are detailed in the papers </li></ul><ul><ul><li>GatherJoin </li></ul></ul><ul><ul><li>GatherCount </li></ul></ul><ul><ul><li>GatherPrune </li></ul></ul><ul><ul><li>Vertical </li></ul></ul><ul><li>Other methods not discussed because of unacceptable performance </li></ul><ul><ul><li>Horizontal </li></ul></ul><ul><ul><li>SBF </li></ul></ul>
  34. 34. SQL Object-Relational Extension: GatherJoin <ul><li>Generates all possible k-item combinations of items contained in a transaction and joins them with C k </li></ul><ul><ul><li>An index is created on all items of C k </li></ul></ul><ul><li>Uses the following table functions </li></ul><ul><ul><li>Gather: Outputs records {tid,item-list}, with item-list being a BLOB or VARCHAR containing all items associated with the tid </li></ul></ul><ul><ul><li>Comb-K: returns all k-item combinations from the transaction </li></ul></ul><ul><ul><ul><li>Output has k attributes T_itm 1 , …, T_itm k </li></ul></ul></ul>
  35. 35. GatherJoin <ul><ul><li>insert into F k select item 1 ,…, item k , count(*) </li></ul></ul><ul><ul><li>from C k , </li></ul></ul><ul><ul><li>(select t 2 .T_itm 1 ,…,t 2 .T_itm k from T, </li></ul></ul><ul><ul><li>table(Gather(T.tid,T.item)) as t 1 , </li></ul></ul><ul><ul><li>table(Comb-K(t 1 .tid,t 1 .item-list)) as t 2 ) </li></ul></ul><ul><ul><li>where t 2 .T_itm 1 = C k .item 1 and … and </li></ul></ul><ul><ul><li>t 2 .T_itm k = C k .item k </li></ul></ul><ul><ul><li>group by C k .item 1 ,…,C k .item k </li></ul></ul><ul><ul><li>having count(*) > :minsup </li></ul></ul>
  36. 36. Example of GatherJoin <ul><li>t 1 (output from Gather) looks like: </li></ul><ul><li>t 2 (generated by Comb-K from t 1 ) will be joined with C 3 to obtain F 3 </li></ul><ul><ul><li>1 row from Tid 10 </li></ul></ul><ul><ul><li>1 row from Tid 20 </li></ul></ul><ul><ul><li>4 rows from Tid 30 </li></ul></ul><ul><li>Insert {B,C,E,2} </li></ul>
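The data flow of GatherJoin can be imitated in plain Python. This is not the DB2 table-function API: `gather` and `comb_k` are hypothetical stand-ins for the Gather and Comb-K table functions, and the transaction rows are assumed sample data chosen to be consistent with the row counts listed above.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical transactions: 10={A,C,D}, 20={B,C,E}, 30={A,B,C,E}, 40={B,E}
T = [(10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
     (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")]

def gather(rows):
    """Stand-in for Gather: emit one (tid, item-list) record per tid."""
    by_tid = defaultdict(list)
    for tid, item in rows:
        by_tid[tid].append(item)
    return [(tid, sorted(items)) for tid, items in sorted(by_tid.items())]

def comb_k(item_list, k):
    """Stand-in for Comb-K: all k-item combinations of one transaction."""
    return list(combinations(item_list, k))

candidates = {("B", "C", "E")}  # C3
support = Counter()
rows_per_tid = {}
for tid, item_list in gather(T):
    combos = comb_k(item_list, 3)
    rows_per_tid[tid] = len(combos)
    for combo in combos:
        if combo in candidates:  # the join with C3
            support[combo] += 1

print(rows_per_tid)
print(dict(support))
```

The four rows from Tid 30 illustrate why Comb-K output grows quickly: a transaction with n items produces "n choose k" combinations, most of which match no candidate.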
  37. 37. GatherJoin: Pass 2 optimization <ul><li>When calculating C 2 , no pruning is required after we join F 1 with itself </li></ul><ul><li>Don’t calculate and materialize C 2 - replace C 2 with a join to F1 before the table function </li></ul><ul><ul><li>Gather is only passed frequent 1-itemset rows </li></ul></ul><ul><ul><li>insert into F 2 select I 1 .item 1 , I 2 .item 1 , count(*) from F 1 I 1 , </li></ul></ul><ul><ul><li>(select t 2 .T_itm 1 ,t 2 .T_itm 2 from T, table(Gather(T.tid,T.item)) as t 1 , </li></ul></ul><ul><ul><li>table(Comb-K(t 1 .tid,t 1 .item-list)) as t 2 where T.item = I 1 .item 1 ) </li></ul></ul><ul><ul><li>group by t 2 .T_itm 1 ,t 2 .T_itm 2 </li></ul></ul><ul><ul><li>having count(*) > :minsup </li></ul></ul>
  38. 38. Variations of GatherJoin - GatherCount <ul><li>Perform the GROUP BY inside the table function Comb-K for pass 2 optimization </li></ul><ul><li>Output of the table function Comb-K </li></ul><ul><ul><li>Not the candidate frequent itemsets (C k ) </li></ul></ul><ul><ul><li>But the actual frequent itemsets (F k ) along with the corresponding support </li></ul></ul><ul><li>Use a 2-dimensional array to store possible frequent itemsets in Comb-K </li></ul><ul><ul><li>May lead to excessive memory use </li></ul></ul>
  39. 39. Variations of GatherJoin - GatherPrune <ul><li>Push the join with C k into the table function Comb-K </li></ul><ul><li>C k is converted into a BLOB and passed as an argument to the table function. </li></ul><ul><ul><li>Will have to pass the BLOB for each invocation of Comb-K - # of rows in table T </li></ul></ul>
  40. 40. SQL Object-Relational Extension: Vertical <ul><li>For each item, create a BLOB containing the tids the item belongs to </li></ul><ul><ul><li>Use function Gather to generate {item,tid-list} pairs, storing results in table TidTable </li></ul></ul><ul><ul><li>Tid-list are all in the same sorted order </li></ul></ul><ul><li>Use function Intersect to compare two different tid-lists and extract common values </li></ul><ul><li>Pass-2 optimization can be used for Vertical </li></ul><ul><ul><li>Similar to K-way join method </li></ul></ul><ul><ul><li>Upcoming example does not show optimization </li></ul></ul>
  41. 41. Vertical <ul><ul><li>insert into F k select item 1 , …, item k , count(tid-list) as cnt </li></ul></ul><ul><ul><li>from (Subquery Q k ) t where cnt > :minsup </li></ul></ul><ul><ul><li>Subquery Q n (for any n between 2 and k) </li></ul></ul><ul><ul><li>Select item 1 , …, item n , </li></ul></ul><ul><ul><li>Intersect(r n-1 .tid-list, t n .tid-list) as tid-list </li></ul></ul><ul><ul><li>from TidTable t n , (Subquery Q n-1 ) as r n-1 , </li></ul></ul><ul><ul><li>(select distinct item 1 , …, item n from C k ) as d n </li></ul></ul><ul><ul><li>where r n-1 .item 1 = d n .item 1 and … and </li></ul></ul><ul><ul><li>r n-1 .item n-1 = d n .item n-1 and </li></ul></ul><ul><ul><li>t n .item = d n .item n </li></ul></ul><ul><ul><li>Subquery Q 1 : (select * from TidTable) </li></ul></ul>
  42. 42. Example of Vertical <ul><li>Using previous example from class </li></ul><ul><ul><li>C 3 = {B,C,E}, minimum support = 2 </li></ul></ul><ul><li>Q 1 is TidTable </li></ul>
  43. 43. Example of Vertical cnt’d <ul><li>Q 2 becomes </li></ul><ul><ul><li>Select item 1 , item 2 , Intersect(r 1 .tid-list, t 2 .tid-list) as tid-list </li></ul></ul><ul><ul><li>from TidTable t 2 , (Subquery Q 1 ) as r 1 , </li></ul></ul><ul><ul><li>(select distinct item 1 , item 2 from C 3 ) as d 2 </li></ul></ul><ul><ul><li>where r 1 .item 1 = d 2 .item 1 and t 2 .item = d 2 .item 2 </li></ul></ul>
  44. 44. Example of Vertical cnt’d <ul><li>Q 3 becomes </li></ul><ul><ul><li>select item 1 , item 2 , item 3 , intersect(r 2 .tid-list, t 3 .tid-list) as tid-list </li></ul></ul><ul><ul><li>from TidTable t 3 , (Subquery Q 2 ) as r 2 , </li></ul></ul><ul><ul><li>(select distinct item 1 , item 2 , item 3 from C 3 ) as d 3 </li></ul></ul><ul><ul><li>where r 2 .item 1 = d 3 .item 1 and r 2 .item 2 = d 3 .item 2 and </li></ul></ul><ul><ul><li>t 3 .item = d 3 .item 3 </li></ul></ul>
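The tid-list intersections behind Q 2 and Q 3 can be traced in plain Python. `build_tid_table` and `intersect` are hypothetical stand-ins for the Gather step and the Intersect UDF, and the transaction rows are assumed sample data in which {B,C,E} has support 2.

```python
from collections import defaultdict

# Hypothetical transactions: 10={A,C,D}, 20={B,C,E}, 30={A,B,C,E}, 40={B,E}
T = [(10, "A"), (10, "C"), (10, "D"), (20, "B"), (20, "C"), (20, "E"),
     (30, "A"), (30, "B"), (30, "C"), (30, "E"), (40, "B"), (40, "E")]

def build_tid_table(rows):
    """Vertical layout: item -> sorted list of tids containing the item."""
    table = defaultdict(list)
    for tid, item in sorted(rows):
        table[item].append(tid)
    return dict(table)

def intersect(a, b):
    """Merge-style intersection of two tid-lists in the same sorted order."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

tid_table = build_tid_table(T)
q2 = intersect(tid_table["B"], tid_table["C"])  # tid-list of {B,C}
q3 = intersect(q2, tid_table["E"])              # tid-list of {B,C,E}
print(q2, q3, len(q3))
```

Because every tid-list is kept in the same sorted order, each intersection is a single linear merge, and the support of an itemset is just the length of its tid-list.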
  45. 45. Performance Comparisons using SQL-OR
  46. 46. Performance Comparisons using SQL-OR
  47. 47. Performance comparison of SQL object-relational approaches <ul><li>Vertical has best overall performance, sometimes an order of magnitude better than other 3 approaches </li></ul><ul><ul><li>Majority of time is spent transforming the data into {item,tid-list} pairs </li></ul></ul><ul><ul><li>Vertical spends too much time on the second pass </li></ul></ul><ul><li>Pass-2 optimization has huge impact on performance of GatherJoin </li></ul><ul><ul><li>For Dataset-B with support of 0.1 %, running time for Pass 2 went from 5.2 hours to 10 minutes </li></ul></ul><ul><li>Comb-K in GatherJoin generates large number of potential frequent itemsets we must work with </li></ul>
  48. 48. Hybrid approach <ul><li>Previous charts and algorithm analysis show </li></ul><ul><ul><li>Vertical spends too much time on pass 2 compared to other algorithms, especially when the support is decreased </li></ul></ul><ul><ul><li>GatherJoin degrades when the # of frequent items per transaction increases </li></ul></ul><ul><li>To improve performance, use a hybrid algorithm </li></ul><ul><ul><li>Use Vertical for most cases </li></ul></ul><ul><ul><li>When size of candidate itemset is too large, GatherJoin is a good option if number of frequent items per transaction (N f ) is not too large </li></ul></ul><ul><ul><li>When N f is large, GatherCount may be the only good option </li></ul></ul>
  49. 49. Architecture Comparisons <ul><li>Compare five alternatives </li></ul><ul><ul><li>Loose-Coupling, Stored-procedure </li></ul></ul><ul><ul><ul><li>Basically the same except for the address space in which the program runs </li></ul></ul></ul><ul><ul><ul><li>Because of limited difference in performance, focus solely on stored procedure in following charts </li></ul></ul></ul><ul><ul><li>Cache-Mine </li></ul></ul><ul><ul><li>UDF tight-coupling </li></ul></ul><ul><ul><li>Best SQL approach (Hybrid) </li></ul></ul>
  50. 50. Performance Comparisons of Architectures
  51. 51. Performance Comparisons of Architectures cnt’d
  52. 52. Performance Comparisons of Architectures cnt’d <ul><li>Cache-Mine is the best or close to the best performance in all cases </li></ul><ul><ul><li>Factor of 0.8 to 2 times faster than SQL approach </li></ul></ul><ul><li>Stored procedure is the worst </li></ul><ul><ul><li>Its difference from Cache-Mine is directly related to the number of passes through the data </li></ul></ul><ul><ul><ul><li>Passes increase when the support goes down </li></ul></ul></ul><ul><ul><ul><li>May need to make multiple passes if all candidates cannot fit in memory </li></ul></ul></ul><ul><li>UDF time per pass decreases 30-50% compared to stored procedure because of tighter coupling with DB </li></ul>
  53. 53. Performance Comparisons of Architectures cnt’d <ul><li>SQL approach comes in second in performance to Cache-Mine </li></ul><ul><ul><li>Somewhat better than Cache-Mine for high support values </li></ul></ul><ul><ul><li>1.8 – 3 times better than Stored-procedure/loose-coupling approach, getting better when support value decreases </li></ul></ul><ul><ul><li>Cost of converting to Vertical format is less than cost of converting to binary format in Cache-Mine </li></ul></ul><ul><ul><li>For second pass through data, SQL approach takes much more time than Cache-Mine, particularly when we decrease the support </li></ul></ul>
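  55. 55. Taxonomies - example <ul><li>Taxonomy tree: Beverages has children Soft Drinks (Pepsi, Coke) and Alcoholic Drinks (Beer); Snacks has children Pretzels and Chocolate Bar </li></ul><ul><li>Example rule: Soft Drinks ⇒ Pretzels with 30% confidence, 2% support </li></ul><ul><li>Taxonomy table with columns (Parent, Child): (Beverages, Soft Drinks), (Beverages, Alcoholic Drinks), (Soft Drinks, Pepsi), (Soft Drinks, Coke), (Alcoholic Drinks, Beer), (Snacks, Pretzels), (Snacks, Chocolate Bar) </li></ul>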
  56. 56. Taxonomy augmentation <ul><li>Algorithms similar to previous slides </li></ul><ul><li>Requires two additions to algorithm </li></ul><ul><ul><li>Pruning itemsets containing an item and its ancestor </li></ul></ul><ul><ul><li>Pre-computing the ancestors for each item </li></ul></ul><ul><li>Will also consider support counting </li></ul>
  57. 57. Pruning items and ancestors <ul><li>In the second pass we will join F 1 with F 1 to give C 2 </li></ul><ul><li>This will give, for example: </li></ul><ul><li>beverages,pepsi </li></ul><ul><li>snacks,coke </li></ul><ul><li>pretzels,chocolate bar </li></ul><ul><li>But beverages,pepsi is redundant! </li></ul>
  58. 58. Pruning items and ancestors <ul><li>The following modification to the SQL statement eliminates such redundant combinations from being selected: </li></ul><ul><li>insert into C 2 (select I 1 .item 1 , I 2 .item 1 from F 1 I 1 , F 1 I 2 </li></ul><ul><li>where I 1 .item 1 < I 2 .item 1 ) except </li></ul><ul><li>(select ancestor, descendant from Ancestor union </li></ul><ul><li>select descendant, ancestor from Ancestor) </li></ul>
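A runnable sketch of this pruning step with `sqlite3`. The contents of `F1` are assumed for illustration, and the `Ancestor` rows are the closure of the example taxonomy written in lower case. SQLite does not accept parenthesized operands in a compound select, so the union of the two `Ancestor` orientations is wrapped in a subquery; that rewrite is an adaptation, not the slide's exact statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table F1 (item1 text)")
conn.execute("create table Ancestor (ancestor text, descendant text)")
# Assumed frequent 1-itemsets.
conn.executemany("insert into F1 values (?)",
                 [("beverages",), ("pepsi",), ("pretzels",), ("snacks",)])
# Ancestor pairs for the example taxonomy (transitive closure).
conn.executemany("insert into Ancestor values (?, ?)", [
    ("beverages", "soft drinks"), ("beverages", "alcoholic drinks"),
    ("beverages", "pepsi"), ("beverages", "coke"), ("beverages", "beer"),
    ("soft drinks", "pepsi"), ("soft drinks", "coke"),
    ("alcoholic drinks", "beer"),
    ("snacks", "pretzels"), ("snacks", "chocolate bar")])

c2_sql = """
select I1.item1, I2.item1
from F1 I1, F1 I2
where I1.item1 < I2.item1
except
select * from (
    select ancestor, descendant from Ancestor
    union
    select descendant, ancestor from Ancestor
)
"""
c2 = sorted(conn.execute(c2_sql).fetchall())
print(c2)
```

Both orientations of each ancestor pair are removed because the `<` ordering of the self-join may place either the ancestor or the descendant first.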
  59. 59. Pre-computing ancestors <ul><li>An ancestor table is created </li></ul><ul><ul><li>Format (ancestor, descendant) </li></ul></ul><ul><ul><li>Use the transitive closure operation </li></ul></ul><ul><li>insert into Ancestor with R-Tax (ancestor, descendant) as </li></ul><ul><li>(select parent, child from Tax union all </li></ul><ul><li>select p.ancestor, c.child from R-Tax p, Tax c </li></ul><ul><li>where p.descendant = c.parent) </li></ul><ul><li>select ancestor, descendant from R-Tax </li></ul>
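The transitive closure can be tried out with a recursive common table expression in SQLite. The `Tax` rows follow the example taxonomy (in lower case); the CTE is named `R_Tax` because a hyphen is not valid in an unquoted identifier, and the `recursive` keyword is the SQLite spelling. Both are adaptations of the slide's statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Tax (parent text, child text)")
conn.executemany("insert into Tax values (?, ?)", [
    ("beverages", "soft drinks"), ("beverages", "alcoholic drinks"),
    ("soft drinks", "pepsi"), ("soft drinks", "coke"),
    ("alcoholic drinks", "beer"),
    ("snacks", "pretzels"), ("snacks", "chocolate bar")])

closure_sql = """
with recursive R_Tax(ancestor, descendant) as (
    select parent, child from Tax
    union all
    select p.ancestor, c.child
    from R_Tax p, Tax c
    where p.descendant = c.parent
)
select ancestor, descendant from R_Tax
"""
ancestors = set(conn.execute(closure_sql).fetchall())
print(len(ancestors))
print(sorted(d for a, d in ancestors if a == "beverages"))
```

The seven direct parent-child edges are extended by three indirect pairs, all under `beverages`.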
  60. 60. Support Counting <ul><li>Extensions to handle taxonomies </li></ul><ul><ul><li>Straightforward, but </li></ul></ul><ul><ul><li>Non-trivial </li></ul></ul><ul><li>Need an extended transaction table </li></ul><ul><ul><li>For example, if we have {coke, pretzels} </li></ul></ul><ul><ul><li>We add also {soft drinks, pretzels}, {beverages, pretzels}, {coke, snacks}, {soft drinks, snacks}, {beverages, snacks} </li></ul></ul>
  61. 61. Extended transaction table <ul><li>Can be obtained by the following SQL </li></ul><ul><li>Query to generate T* </li></ul><ul><li>select item, tid from T union </li></ul><ul><li>select distinct A.ancestor as item, T.tid </li></ul><ul><li>from T, Ancestor A </li></ul><ul><li>where A.descendant = T.item </li></ul><ul><li>The “select distinct” clause gets rid of items with common ancestor – e.g. don’t want {beverages, beverages} being added twice from {pepsi, coke} </li></ul>
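The query for T* can be run on a one-transaction example to see the de-duplication at work. The transaction {pepsi, coke, pretzels} and the subset of `Ancestor` rows are assumed sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid int, item text)")
conn.execute("create table Ancestor (ancestor text, descendant text)")
# Hypothetical transaction 1 = {pepsi, coke, pretzels}
conn.executemany("insert into T values (?, ?)",
                 [(1, "pepsi"), (1, "coke"), (1, "pretzels")])
conn.executemany("insert into Ancestor values (?, ?)", [
    ("soft drinks", "pepsi"), ("beverages", "pepsi"),
    ("soft drinks", "coke"), ("beverages", "coke"),
    ("snacks", "pretzels")])

t_star_sql = """
select item, tid from T
union
select distinct A.ancestor as item, T.tid
from T, Ancestor A
where A.descendant = T.item
"""
t_star = sorted(conn.execute(t_star_sql).fetchall())
print(t_star)
```

Although both pepsi and coke contribute `beverages` and `soft drinks`, each ancestor appears only once per transaction in the result.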
  62. 62. Pipelining of Query <ul><li>No need to actually build T* </li></ul><ul><li>Make following modification to SQL: </li></ul><ul><ul><li>insert into F k with T*(tid, item) as (Query for T*) </li></ul></ul><ul><ul><li>select item 1 ,…,item k ,count(*) </li></ul></ul><ul><ul><li>from C k , T* t 1 , …, T* t k </li></ul></ul><ul><ul><li>where t 1 .item = C k .item 1 and … and </li></ul></ul><ul><ul><li>t k .item = C k .item k and </li></ul></ul><ul><ul><li>t 1 .tid = t 2 .tid and … and </li></ul></ul><ul><ul><li>t k-1 .tid = t k .tid </li></ul></ul><ul><ul><li>group by item 1 ,…,item k </li></ul></ul><ul><ul><li>having count(*) > :minsup </li></ul></ul>
  64. 64. Sequential patterns <ul><li>Similar to papers covered on Nov 17 </li></ul><ul><li>Input is sequences of transactions </li></ul><ul><ul><li>E.g. ((computer,modem),(printer)) </li></ul></ul><ul><li>Similar to association rules, but dealing with sequences as opposed to sets </li></ul><ul><li>Can also specify maximum and minimum time gaps, as well as sliding time windows </li></ul><ul><ul><li>Max-gap, min-gap, window-size </li></ul></ul>
  65. 65. Input and output formats <ul><li>Input has three columns: </li></ul><ul><ul><li>Sequence identifier (sid) </li></ul></ul><ul><ul><li>Transaction time (time) </li></ul></ul><ul><ul><li>Item identifier (item) </li></ul></ul><ul><li>Output format is a collection of frequent sequences, in a fixed-width table </li></ul><ul><ul><li>(item 1 , eno 1 ,…,item k , eno k , len) </li></ul></ul><ul><ul><li>For smaller lengths, extra column values are set to NULL </li></ul></ul>
  66. 66. GSP algorithm <ul><li>Similar to algorithms shown earlier </li></ul><ul><li>Each C k has transactions and times, but no length – has fixed length of k </li></ul><ul><li>Candidates are generated in two steps </li></ul><ul><ul><li>Join – join F k-1 with itself </li></ul></ul><ul><ul><ul><li>Sequence s 1 joins with s 2 if the subsequence obtained by dropping the first item of s 1 is the same as the one obtained by dropping the last item of s 2 </li></ul></ul></ul><ul><ul><ul><li>When generating C 2 , we need to generate sequences where both of the items appear as a single element as well as two separate elements </li></ul></ul></ul><ul><ul><li>Prune </li></ul></ul><ul><ul><ul><li>All candidate sequences that have a non-frequent contiguous (k-1) subsequence are deleted </li></ul></ul></ul>
  67. 67. GSP – Join SQL <ul><li>insert into C k </li></ul><ul><li>select I 1 .item 1 , I 1 .eno 1 , ... , I 1 .item k-1 , I 1 .eno k-1 , </li></ul><ul><li>I 2 .item k-1 , I 1 .eno k-1 + I 2 .eno k-1 – I 2 .eno k-2 </li></ul><ul><li>from F k-1 I 1 , F k-1 I 2 </li></ul><ul><li>where I 1 .item 2 = I 2 .item 1 and ... and I 1 .item k-1 = I 2 .item k-2 and </li></ul><ul><li>I 1 .eno 3 -I 1 .eno 2 = I 2 .eno 2 – I 2 .eno 1 and ... and </li></ul><ul><li>I 1 .eno k-1 – I 1 .eno k-2 = I 2 .eno k-2 – I 2 .eno k-3 </li></ul>
  68. 68. GSP – Prune SQL <ul><li>Write as a k-way join, similar to before </li></ul><ul><li>There are at most k contiguous subsequences of length (k-1) for which F k-1 needs to be checked for membership </li></ul><ul><li>Note that all (k-1) subsequences may not be contiguous because of the max-gap constraint between consecutive elements. </li></ul>
  69. 69. GSP – Support Counting <ul><li>In each pass, we use the candidate table C k and the input data-sequences table D to count the support </li></ul><ul><li>K-way join </li></ul><ul><ul><li>We use select distinct before the group by to ensure that only distinct data-sequences are counted </li></ul></ul><ul><ul><li>We have additional predicates between sequence numbers to handle the special time elements </li></ul></ul>
  70. 70. GSP – Support Counting SQL <ul><li>(C k .eno j = C k .eno i and abs(d j .time – d i .time) ≤ window-size) or (C k .eno j = C k .eno i + 1 and d j .time – d i .time ≤ max-gap and d j .time – d i .time > min-gap) or (C k .eno j > C k .eno i + 1) </li></ul>
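The time predicate above can be read as a small boolean function over one pair of matched items. In this sketch the function name `elements_compatible` and its parameter names are hypothetical, and the max-gap comparison is assumed to be an inclusive upper bound.

```python
def elements_compatible(eno_i, eno_j, time_i, time_j,
                        window_size, max_gap, min_gap):
    """Check one (i, j) pair of candidate items against matched data times.

    eno_* are element numbers within the candidate sequence; time_* are the
    transaction times of the data rows matched to those items.
    """
    if eno_j == eno_i:
        # Same element: both items must fall inside the sliding window.
        return abs(time_j - time_i) <= window_size
    if eno_j == eno_i + 1:
        # Adjacent elements: the time gap must respect min-gap and max-gap.
        gap = time_j - time_i
        return min_gap < gap <= max_gap
    # Non-adjacent later elements carry no direct constraint here.
    return eno_j > eno_i + 1

# Usage with assumed constraint values.
print(elements_compatible(1, 1, 10, 12, window_size=5, max_gap=30, min_gap=1))
print(elements_compatible(1, 2, 10, 50, window_size=5, max_gap=30, min_gap=1))
```

The first call passes because both items fall within the window; the second fails because the gap of 40 exceeds the assumed max-gap of 30.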
  71. 71. References <ul><li>Developing Tightly-Coupled Data Mining Applications on a Relational Database System </li></ul><ul><ul><li>Rakesh Agrawal, Kyuseok Shim, 1996 </li></ul></ul><ul><li>Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications </li></ul><ul><ul><li>Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal, 1998 </li></ul></ul><ul><ul><li>Refers to 1) above </li></ul></ul><ul><li>Mining Generalized Association Rules and Sequential Patterns Using SQL Queries </li></ul><ul><ul><li>Shiby Thomas, Sunita Sarawagi, 1998 </li></ul></ul><ul><ul><li>Refers to 1) and 2) above </li></ul></ul>
