Mining Interesting Sets and Rules in Relational Databases

Presentation given at the Dutch-Belgian Database Day 2009 in Delft

Published in: Education, Technology

  1. 1. Mining Interesting Sets and Rules in Relational Databases
     Bart Goethals, Wim Le Page, Michael Mampaey
     ADReM Research Group
  7. 7. Itemset Mining
     Find frequently occurring sets of items in DB

     Course
     CID  title        credits  project  room
     1    C++          10       Yes      G010
     2    Databases    10       No       G010
     3    Thesis       20       No       G006
     4    Algebra      10       No       G004
     5    Compilers    5        No       G015
     6    Calculus     10       No       G005
     7    Physics      30       Yes      G005
     8    Data Mining  30       Yes      G010
     9    Telecom      10       No       G004
     10   AI           10       No       G010
     11   Graphics     10       No       G010

     Itemset                          Support
     { (credits=10) }                 7
     { (room=G010) }                  5
     { (project=Yes), (credits=30) }  2
     { (credits=5), (room=G005) }     0
     ...

     Problem: Extend to Multi-relational Databases
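The single-table support counts on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the names `COURSES`, `COLUMNS` and `support` are illustrative and do not come from the presentation, and an itemset is modelled as a list of (attribute, value) pairs.

```python
# Course table as shown on the slide: (CID, title, credits, project, room).
COLUMNS = ("CID", "title", "credits", "project", "room")
COURSES = [
    (1, "C++", 10, "Yes", "G010"),
    (2, "Databases", 10, "No", "G010"),
    (3, "Thesis", 20, "No", "G006"),
    (4, "Algebra", 10, "No", "G004"),
    (5, "Compilers", 5, "No", "G015"),
    (6, "Calculus", 10, "No", "G005"),
    (7, "Physics", 30, "Yes", "G005"),
    (8, "Data Mining", 30, "Yes", "G010"),
    (9, "Telecom", 10, "No", "G004"),
    (10, "AI", 10, "No", "G010"),
    (11, "Graphics", 10, "No", "G010"),
]


def support(itemset, rows=COURSES, columns=COLUMNS):
    """Count the rows in which every (attribute, value) item holds."""
    index = {name: i for i, name in enumerate(columns)}
    return sum(
        all(row[index[attr]] == value for attr, value in itemset)
        for row in rows
    )


# The four itemsets listed on the slide.
for itemset in (
    [("credits", 10)],
    [("room", "G010")],
    [("project", "Yes"), ("credits", 30)],
    [("credits", 5), ("room", "G005")],
):
    print(itemset, support(itemset))
```

In a single table every row is one transaction, so counting rows is unambiguous. That is exactly what stops being true once several tables are involved.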
  13. 13. Relational DB Running Example

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Professor
      PID  name     surname
      A    Jan      P
      B    Jan      H
      C    Jan      VDB
      D    Piet     V
      E    Erik     B
      F    Flor     C
      G    Gerrit   DC
      H    Patrick  S
      I    Susan    S

      Student
      SID  name     study
      1    Wim      CompSci
      2    Jeroen   CompSci
      3    Michael  CompSci
      4    Joris    Math
      5    Calin    Math
      6    Adriana  Math

      Relations: Teaches, Takes
  14. 14. Each entity has a Key

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Professor
      PID  name     surname
      A    Jan      P
      B    Jan      H
      C    Jan      VDB
      D    Piet     V
      E    Erik     B
      F    Flor     C
      G    Gerrit   DC
      H    Patrick  S
      I    Susan    S

      Student
      SID  name     study
      1    Wim      CompSci
      2    Jeroen   CompSci
      3    Michael  CompSci
      4    Joris    Math
      5    Calin    Math
      6    Adriana  Math
  15. 15. Binary Relations

      Teaches
      PID  CID
      A    1
      A    2
      B    2
      B    3
      C    4
      D    5
      D    6
      E    7
      F    8
      G    9
      G    10
      G    11
      I    11

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Professor
      PID  name     surname
      A    Jan      P
      B    Jan      H
      C    Jan      VDB
      D    Piet     V
      E    Erik     B
      F    Flor     C
      G    Gerrit   DC
      H    Patrick  S
      I    Susan    S
  19. 19. Existing Approaches
      Join all tables into one
      Apply standard FIM algorithm
  24. 24. Existing Approaches
      { (name=Jan), (credits=10) }
      support = 8 / 18

      PID  name     surname  CID   title        credits  project  room  SID   name     surname
      A    Jan      P        1     C++          10       Yes      G010  1     Wim      LP
      A    Jan      P        1     C++          10       Yes      G010  2     Jeroen   A
      A    Jan      P        1     C++          10       Yes      G010  3     Michael  A
      A    Jan      P        2     Databases    10       No       G010  1     Wim      LP
      A    Jan      P        2     Databases    10       No       G010  5     Calin    G
      B    Jan      H        3     Thesis       20       No       G006  4     Joris    VG
      B    Jan      H        2     Databases    10       No       G010  1     Wim      LP
      B    Jan      H        2     Databases    10       No       G010  5     Calin    G
      C    Jan      VDB      4     Algebra      10       No       G004  NULL  NULL     NULL
      D    Piet     V        5     Compilers    5        No       G015  NULL  NULL     NULL
      D    Piet     V        6     Calculus     10       No       G005  NULL  NULL     NULL
      E    Erik     B        7     Physics      30       Yes      G005  NULL  NULL     NULL
      F    Flor     C        8     Data Mining  30       Yes      G010  NULL  NULL     NULL
      G    Gerrit   DC       9     Telecom      10       No       G004  NULL  NULL     NULL
      G    Gerrit   DC       10    AI           10       No       G010  NULL  NULL     NULL
      G    Gerrit   DC       11    Graphics     10       No       G010  6     Adriana  P
      H    Patrick  S        NULL  NULL         NULL     NULL     NULL  NULL  NULL     NULL
      I    Susan    S        11    Graphics     10       No       G010  6     Adriana  P

      ✘ Join table is huge
      ✘ Meaning of support?
  27. 27. The “Key Idea”
      Do not count tuples, count key values.
      Itemset has multiple supports for multiple tables
  35. 35. Counting Key Values
      { (name=Jan), (credits=10) }

      Key values  Support
      {A,B,C}     3 professors
      {1,2,4}     3 courses
      {1,2,3,5}   4 students

      PID  name     surname  CID   title        credits  project  room  SID   name     surname
      A    Jan      P        1     C++          10       Yes      G010  1     Wim      LP
      A    Jan      P        1     C++          10       Yes      G010  2     Jeroen   A
      A    Jan      P        1     C++          10       Yes      G010  3     Michael  A
      A    Jan      P        2     Databases    10       No       G010  1     Wim      LP
      A    Jan      P        2     Databases    10       No       G010  5     Calin    G
      B    Jan      H        3     Thesis       20       No       G006  4     Joris    VG
      B    Jan      H        2     Databases    10       No       G010  1     Wim      LP
      B    Jan      H        2     Databases    10       No       G010  5     Calin    G
      C    Jan      VDB      4     Algebra      10       No       G004  NULL  NULL     NULL
      D    Piet     V        5     Compilers    5        No       G015  NULL  NULL     NULL
      D    Piet     V        6     Calculus     10       No       G005  NULL  NULL     NULL
      E    Erik     B        7     Physics      30       Yes      G005  NULL  NULL     NULL
      F    Flor     C        8     Data Mining  30       Yes      G010  NULL  NULL     NULL
      G    Gerrit   DC       9     Telecom      10       No       G004  NULL  NULL     NULL
      G    Gerrit   DC       10    AI           10       No       G010  NULL  NULL     NULL
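To contrast tuple counting with key-value counting, the sketch below runs both over the join table from the slides. Only the columns needed for the itemset { (name=Jan), (credits=10) } are kept. The names `JOIN` and `key_supports` are illustrative assumptions and are not taken from the presentation.

```python
# Rows of the join table on the slides, projected onto
# (PID, professor name, CID, credits, SID). None stands for NULL.
JOIN = [
    ("A", "Jan", 1, 10, 1), ("A", "Jan", 1, 10, 2), ("A", "Jan", 1, 10, 3),
    ("A", "Jan", 2, 10, 1), ("A", "Jan", 2, 10, 5),
    ("B", "Jan", 3, 20, 4), ("B", "Jan", 2, 10, 1), ("B", "Jan", 2, 10, 5),
    ("C", "Jan", 4, 10, None),
    ("D", "Piet", 5, 5, None), ("D", "Piet", 6, 10, None),
    ("E", "Erik", 7, 30, None), ("F", "Flor", 8, 30, None),
    ("G", "Gerrit", 9, 10, None), ("G", "Gerrit", 10, 10, None),
    ("G", "Gerrit", 11, 10, 6),
    ("H", "Patrick", None, None, None),
    ("I", "Susan", 11, 10, 6),
]


def key_supports(rows, predicate):
    """Return (number of matching tuples, distinct key values per entity)."""
    keys = {"PID": set(), "CID": set(), "SID": set()}
    matching = 0
    for pid, name, cid, credits, sid in rows:
        if not predicate(name, credits):
            continue
        matching += 1
        for entity, value in (("PID", pid), ("CID", cid), ("SID", sid)):
            if value is not None:  # NULL keys are not counted
                keys[entity].add(value)
    return matching, keys


tuples, keys = key_supports(JOIN, lambda name, credits: name == "Jan" and credits == 10)
print(f"tuple support: {tuples} / {len(JOIN)}")
for entity, values in keys.items():
    print(entity, sorted(values), len(values))
```

The tuple count (8 of 18) depends on how often each professor and course is repeated by the join. The three key counts each answer a plain question: how many professors, courses, or students are involved.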
  36. 36. Relational Itemset Support
      Itemset {(room=G010),(study=Math)}
      π_CID (σ_{room=G010, study=Math} (course ⋈ student))
  43. 43. Naive Approach
      π_CID (σ_{room=G010, study=Math} (course ⋈ student))
      1. Compute Join
      2. Apply Selection
      3. Perform Projection
      ✘ Takes a lot of time & space
      ✘ Unnecessary blowup & shrinking
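The three naive steps can be written out directly for the itemset {(room=G010),(study=Math)}. This is a sketch under stated assumptions: the slides never list the Takes relation as a table, so the `TAKES` pairs below are read off the join table shown earlier. The names `COURSE_ROOM`, `STUDENT_STUDY` and `naive_course_support` are hypothetical.

```python
# Room per course and study per student, from the Course and Student tables.
COURSE_ROOM = {1: "G010", 2: "G010", 3: "G006", 4: "G004", 5: "G015", 6: "G005",
               7: "G005", 8: "G010", 9: "G004", 10: "G010", 11: "G010"}
STUDENT_STUDY = {1: "CompSci", 2: "CompSci", 3: "CompSci",
                 4: "Math", 5: "Math", 6: "Math"}
# (SID, CID) pairs: an assumption, reconstructed from the join table on the slides.
TAKES = [(1, 1), (2, 1), (3, 1), (1, 2), (5, 2), (4, 3), (6, 11)]


def naive_course_support(room, study):
    # 1. Compute join: one tuple per (student, course) pair in Takes.
    joined = [(cid, COURSE_ROOM[cid], sid, STUDENT_STUDY[sid]) for sid, cid in TAKES]
    # 2. Apply selection: keep tuples that satisfy every item of the itemset.
    selected = [t for t in joined if t[1] == room and t[3] == study]
    # 3. Perform projection onto CID, which removes duplicate course keys.
    return {cid for cid, _room, _sid, _study in selected}


course_keys = naive_course_support("G010", "Math")
print(sorted(course_keys), len(course_keys))
```

An inner join is enough here because a tuple with a NULL student can never satisfy study=Math. The intermediate `joined` list is the part that blows up on real data and is then shrunk again by the projection.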
  49. 49. SMuRFIG Algorithm
      Simple Multi-Relational Frequent Itemset Generator
  52. 52. SMuRFIG Algorithm
      Computes supports of all frequent itemsets
      - Efficiently
      - For all keys simultaneously
  55. 55. Efficient computation: KeyID Lists
      Eclat Algorithm
      - TID lists: transaction identifiers where itemset occurs
      SMuRFIG Algorithm
      - KeyID lists: list of tuples identified by key values
  56. 56. KeyID Lists: Basic Operations
      ∩ Intersection
      ↠ Propagation
  67. 67. KeyID List Intersection
      { (credits=10), (room=G010) }

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Itemset                     KeyID list             Support
      {credits=10}                1, 2, 4, 6, 9, 10, 11  7
      {room=G010}                 1, 2, 8, 10, 11        5
      {(credits=10),(room=G010)}  1, 2, 10, 11           4

      {credits=10} ∩ {room=G010} = {(credits=10),(room=G010)}
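The intersection on this slide can be sketched as a merge over two sorted KeyID lists, in the style of Eclat's TID-list intersection. The function name `intersect` and the sorted-list representation are assumptions made for illustration; the slides do not say how SMuRFIG stores its lists.

```python
def intersect(a, b):
    """Intersect two sorted KeyID lists in a single linear merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# KeyID lists (course keys) from the slide.
credits_10 = [1, 2, 4, 6, 9, 10, 11]   # support 7
room_g010 = [1, 2, 8, 10, 11]          # support 5

both = intersect(credits_10, room_g010)
print(both, len(both))
```

Each list is at most as long as the entity table and each element is visited once. That fits the "linear in size of entity table" claim made on the complexity slide.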
  74. 74. KeyID List Propagation
      {(name=Jan)}

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Professor
      PID  name     surname
      A    Jan      P
      B    Jan      H
      C    Jan      VDB
      D    Piet     V
      E    Erik     B
      F    Flor     C
      G    Gerrit   DC
      H    Patrick  S
      I    Susan    S

      Student
      SID  name     study
      1    Wim      CompSci
      2    Jeroen   CompSci
      3    Michael  CompSci
      4    Joris    Math
      5    Calin    Math
      6    Adriana  Math

      KeyID list = {A,B,C} ↠ KeyID list = {1,2,3,4} ↠ KeyID list = {1,2,3,4,5}
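One plausible reading of the propagation step is a single scan over the relation table that translates keys of one entity into keys of the neighbouring entity. The sketch below follows that reading. `TEACHES` is the relation shown on the Binary Relations slide, while `TAKES` is an assumption reconstructed from the join table, since the slides never list it. The function name `propagate` is hypothetical.

```python
# (PID, CID) pairs from the Teaches table on the slides.
TEACHES = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 4), ("D", 5), ("D", 6),
           ("E", 7), ("F", 8), ("G", 9), ("G", 10), ("G", 11), ("I", 11)]
# (SID, CID) pairs: an assumption, read off the join table on the slides.
TAKES = [(1, 1), (2, 1), (3, 1), (1, 2), (5, 2), (4, 3), (6, 11)]


def propagate(key_ids, relation, src, dst):
    """Map a KeyID list across a binary relation with one scan of that relation.

    src and dst are the column positions of the source and target key.
    """
    wanted = set(key_ids)
    return sorted({row[dst] for row in relation if row[src] in wanted})


professors = ["A", "B", "C"]                     # KeyID list of {(name=Jan)}
courses = propagate(professors, TEACHES, 0, 1)   # Professor -> Course
students = propagate(courses, TAKES, 1, 0)       # Course -> Student
print(professors, courses, students)
```

The relation table is scanned once per propagation, which matches the "linear in size of relation table" statement on the complexity slide.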
  77. 77. Complexity
      ∩ Intersection: linear in size of entity table
      ↠ Propagation: linear in size of relation table
  84. 84. ‘Interestingness’
      Classical Frequent Itemset Mining
      - overload of ‘interesting’ patterns
      - solutions: closed sets, maximal sets, NDI, ...
      Relational Itemset Mining
      - same problems
      - similar solutions can be applied
  85. 85. ‘Interestingness’
      Additional combinatorial blow-up

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Professor
      PID  name     surname
      A    Jan      P
      B    Jan      H
      C    Jan      VDB
      D    Piet     V
      E    Erik     B
      F    Flor     C
      G    Gerrit   DC
      H    Patrick  S
      I    Susan    S
      ...

      2 courses
      3 professors
      6 courses per professor
      = specific to relational data
  89. 89. Support Deviation: Main Idea
      - Compute expected support of an itemset based on the DB’s structure
      - Compare with actual support of the itemset
      - Only consider relational itemsets which diverge sufficiently
  93. 93. Summary
      • Multi-relational support
        - count key values
        - easily interpretable
      • SMuRFIG algorithm
        - depth-first
        - efficient
        - support for all tables at once
        - scalable: linear in number of tuples, tables
      • Support deviation
  94. 94. for more info
      • SMuRFIG
        - http://www.adrem.ua.ac.be/smurfig
        - or Google “smurfig”
      • Paper
        - “Mining Interesting Sets and Rules in Relational Databases”, B. Goethals, W. Le Page, M. Mampaey, Proceedings of ACM SAC, 2010.
  95. 95. Questions?
