Mining Interesting Sets and Rules in Relational Databases
 

Presentation given at the Dutch-Belgian Database Day 2009 in Delft

    Presentation Transcript

    • Mining Interesting Sets and Rules in Relational Databases
      Bart Goethals, Wim Le Page, Michael Mampaey
      ADReM Research Group
    • Itemset Mining
      Find frequently occurring sets of items in a database (DB).

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Example itemsets and their supports:
      { (credits=10) }                  7
      { (room=G010) }                   5
      { (project=Yes), (credits=30) }   2
      { (credits=5), (room=G005) }      0
      ...

      Problem: extend to multi-relational databases.
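The supports listed on this slide can be checked with a few lines of Python. This is only a minimal sketch: the names `COURSES`, `COLUMNS` and `support` are illustrative and do not come from the presentation, and support is taken to be the number of Course rows that match every item.

```python
# Course table from the slide: (CID, title, credits, project, room).
COURSES = [
    (1, "C++", 10, "Yes", "G010"),
    (2, "Databases", 10, "No", "G010"),
    (3, "Thesis", 20, "No", "G006"),
    (4, "Algebra", 10, "No", "G004"),
    (5, "Compilers", 5, "No", "G015"),
    (6, "Calculus", 10, "No", "G005"),
    (7, "Physics", 30, "Yes", "G005"),
    (8, "Data Mining", 30, "Yes", "G010"),
    (9, "Telecom", 10, "No", "G004"),
    (10, "AI", 10, "No", "G010"),
    (11, "Graphics", 10, "No", "G010"),
]
COLUMNS = ("CID", "title", "credits", "project", "room")


def support(itemset, rows=COURSES, columns=COLUMNS):
    """Count the rows that match every (attribute, value) pair in the itemset."""
    index = {name: i for i, name in enumerate(columns)}
    return sum(
        all(row[index[attr]] == value for attr, value in itemset)
        for row in rows
    )


print(support([("credits", 10)]))                      # 7
print(support([("room", "G010")]))                     # 5
print(support([("project", "Yes"), ("credits", 30)]))  # 2
print(support([("credits", 5), ("room", "G005")]))     # 0
```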
    • Relational DB Running Example
      Relations between the tables: Teaches, Takes.

      Course
      CID  title        credits  project  room
      1    C++          10       Yes      G010
      2    Databases    10       No       G010
      3    Thesis       20       No       G006
      4    Algebra      10       No       G004
      5    Compilers    5        No       G015
      6    Calculus     10       No       G005
      7    Physics      30       Yes      G005
      8    Data Mining  30       Yes      G010
      9    Telecom      10       No       G004
      10   AI           10       No       G010
      11   Graphics     10       No       G010

      Professor
      PID  name     surname
      A    Jan      P
      B    Jan      H
      C    Jan      VDB
      D    Piet     V
      E    Erik     B
      F    Flor     C
      G    Gerrit   DC
      H    Patrick  S
      I    Susan    S

      Student
      SID  name     study
      1    Wim      CompSci
      2    Jeroen   CompSci
      3    Michael  CompSci
      4    Joris    Math
      5    Calin    Math
      6    Adriana  Math
    • Each entity has a Key
      Course: CID. Professor: PID. Student: SID.
    • Binary Relations
      Teaches (between Professor and Course)
      PID  CID
      A    1
      A    2
      B    2
      B    3
      C    4
      D    5
      D    6
      E    7
      F    8
      G    9
      G    10
      G    11
      I    11
    • Existing Approaches
      Join all tables into one, then apply a standard FIM algorithm.

      PID  name     surname  CID   title        credits  project  room  SID   name     surname
      A    Jan      P        1     C++          10       Yes      G010  1     Wim      LP
      A    Jan      P        1     C++          10       Yes      G010  2     Jeroen   A
      A    Jan      P        1     C++          10       Yes      G010  3     Michael  A
      A    Jan      P        2     Databases    10       No       G010  1     Wim      LP
      A    Jan      P        2     Databases    10       No       G010  5     Calin    G
      B    Jan      H        3     Thesis       20       No       G006  4     Joris    VG
      B    Jan      H        2     Databases    10       No       G010  1     Wim      LP
      B    Jan      H        2     Databases    10       No       G010  5     Calin    G
      C    Jan      VDB      4     Algebra      10       No       G004  NULL  NULL     NULL
      D    Piet     V        5     Compilers    5        No       G015  NULL  NULL     NULL
      D    Piet     V        6     Calculus     10       No       G005  NULL  NULL     NULL
      E    Erik     B        7     Physics      30       Yes      G005  NULL  NULL     NULL
      F    Flor     C        8     Data Mining  30       Yes      G010  NULL  NULL     NULL
      G    Gerrit   DC       9     Telecom      10       No       G004  NULL  NULL     NULL
      G    Gerrit   DC       10    AI           10       No       G010  NULL  NULL     NULL
      G    Gerrit   DC       11    Graphics     10       No       G010  6     Adriana  P
      H    Patrick  S        NULL  NULL         NULL     NULL     NULL  NULL  NULL     NULL
      I    Susan    S        11    Graphics     10       No       G010  6     Adriana  P

      Example: { (name=Jan), (credits=10) } has support = 8 / 18
      ✘ Join table is huge
      ✘ Meaning of support?
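To see where 8 / 18 comes from, the join can be rebuilt in a short Python sketch. Several things here are assumptions rather than part of the presentation: the `TAKES` pairs are read off the join table on the slide (the transcript does not list Takes separately), the NULL rows are reproduced with left outer joins, and all names (`PROF_NAME`, `CREDITS`, `join_rows`) are illustrative.

```python
# Data from the slides. TAKES is (SID, CID), read off the join table (assumption).
PROF_NAME = {"A": "Jan", "B": "Jan", "C": "Jan", "D": "Piet", "E": "Erik",
             "F": "Flor", "G": "Gerrit", "H": "Patrick", "I": "Susan"}
CREDITS = {1: 10, 2: 10, 3: 20, 4: 10, 5: 5, 6: 10,
           7: 30, 8: 30, 9: 10, 10: 10, 11: 10}
TEACHES = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 4), ("D", 5), ("D", 6),
           ("E", 7), ("F", 8), ("G", 9), ("G", 10), ("G", 11), ("I", 11)]
TAKES = [(1, 1), (2, 1), (3, 1), (1, 2), (5, 2), (4, 3), (6, 11)]


def join_rows():
    """Left outer join Professor -> Teaches -> Takes, as (PID, CID, SID) triples."""
    rows = []
    for pid in PROF_NAME:
        cids = [c for p, c in TEACHES if p == pid] or [None]
        for cid in cids:
            sids = [s for s, c in TAKES if c == cid] or [None]
            for sid in sids:
                rows.append((pid, cid, sid))
    return rows


rows = join_rows()
# Tuples of the join that contain both (name=Jan) and (credits=10).
matching = [
    (pid, cid, sid) for pid, cid, sid in rows
    if PROF_NAME[pid] == "Jan" and cid is not None and CREDITS[cid] == 10
]
print(f"support = {len(matching)} / {len(rows)}")  # support = 8 / 18
```

The count depends on how often each professor and course is repeated in the join, which is the "Meaning of support?" objection on the slide.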
    • The “Key Idea”
      Do not count tuples, count key values.
      An itemset has multiple supports, one per table.
    • Counting Key Values
      Itemset { (name=Jan), (credits=10) } in the join table:
      - PID values {A,B,C}: 3 professors
      - CID values {1,2,4}: 3 courses
      - SID values {1,2,3,5}: 4 students
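The three supports can be reproduced by collecting the distinct key values among the matching join tuples. The sketch below hard-codes the join rows from the slide as (PID, name, CID, credits, SID); `JOIN` and `key_supports` are illustrative names, not part of SMuRFIG.

```python
# Join table from the slide, reduced to (PID, professor name, CID, credits, SID).
JOIN = [
    ("A", "Jan", 1, 10, 1), ("A", "Jan", 1, 10, 2), ("A", "Jan", 1, 10, 3),
    ("A", "Jan", 2, 10, 1), ("A", "Jan", 2, 10, 5),
    ("B", "Jan", 3, 20, 4), ("B", "Jan", 2, 10, 1), ("B", "Jan", 2, 10, 5),
    ("C", "Jan", 4, 10, None),
    ("D", "Piet", 5, 5, None), ("D", "Piet", 6, 10, None),
    ("E", "Erik", 7, 30, None), ("F", "Flor", 8, 30, None),
    ("G", "Gerrit", 9, 10, None), ("G", "Gerrit", 10, 10, None),
    ("G", "Gerrit", 11, 10, 6),
    ("H", "Patrick", None, None, None), ("I", "Susan", 11, 10, 6),
]


def key_supports(rows, name, credits):
    """Distinct PID, CID and SID values among rows matching both items."""
    pids, cids, sids = set(), set(), set()
    for pid, prof_name, cid, course_credits, sid in rows:
        if prof_name == name and course_credits == credits:
            pids.add(pid)
            cids.add(cid)
            if sid is not None:  # NULL means no student takes this course
                sids.add(sid)
    return pids, cids, sids


pids, cids, sids = key_supports(JOIN, "Jan", 10)
print(sorted(pids), len(pids))  # ['A', 'B', 'C'] 3
print(sorted(cids), len(cids))  # [1, 2, 4] 3
print(sorted(sids), len(sids))  # [1, 2, 3, 5] 4
```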
    • Relational Itemset Support
      Itemset {(room=G010),(study=Math)}
      π_CID(σ_{room=G010, study=Math}(course ⋈ student))
    • Naive Approach
      π_CID(σ_{room=G010, study=Math}(course ⋈ student))
      1. Compute Join
      2. Apply Selection
      3. Perform Projection
      ✘ Takes a lot of time & space
      ✘ Unnecessary blowup & shrinking
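The three steps of the naive approach can be written out directly. This is a sketch under assumptions: Course and Student are joined through a `TAKES` list of (SID, CID) pairs read off the join table shown earlier in the deck, and the variable names are illustrative.

```python
# Data from the slides; TAKES is (SID, CID), read off the join table (assumption).
ROOM = {1: "G010", 2: "G010", 3: "G006", 4: "G004", 5: "G015", 6: "G005",
        7: "G005", 8: "G010", 9: "G004", 10: "G010", 11: "G010"}
STUDY = {1: "CompSci", 2: "CompSci", 3: "CompSci", 4: "Math", 5: "Math", 6: "Math"}
TAKES = [(1, 1), (2, 1), (3, 1), (1, 2), (5, 2), (4, 3), (6, 11)]

# 1. Compute the join: one (CID, room, SID, study) tuple per Takes pair.
joined = [(cid, ROOM[cid], sid, STUDY[sid]) for sid, cid in TAKES]

# 2. Apply the selection room=G010 and study=Math.
selected = [row for row in joined if row[1] == "G010" and row[3] == "Math"]

# 3. Perform the projection on CID, keeping distinct values only.
projected = sorted({row[0] for row in selected})

print(projected, len(projected))  # [2, 11] 2
```

Even on this small example the intermediate join is larger than the final answer, which illustrates the blowup and shrinking that the slide objects to.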
    • SMuRFIG Algorithm
      Simple Multi-Relational Frequent Itemset Generator
    • SMuRFIG Algorithm
      Computes the supports of all frequent itemsets, efficiently, for all keys simultaneously.
    • Efficient computation: KeyID Lists
      - Eclat Algorithm: TID lists, the transaction identifiers where an itemset occurs
      - SMuRFIG Algorithm: KeyID lists, the lists of tuples identified by key values
    • KeyID Lists: Basic Operations
      ∩ Intersection
      ↠ Propagation
    • KeyID List Intersection
      Itemset { (credits=10), (room=G010) } on the Course table:
      {credits=10}                 KeyID list {1,2,4,6,9,10,11}   support 7
      {room=G010}                  KeyID list {1,2,8,10,11}       support 5
      {(credits=10),(room=G010)}   {1,2,4,6,9,10,11} ∩ {1,2,8,10,11} = {1,2,10,11}   support 4
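A small sketch of the intersection step, assuming KeyID lists are kept as sorted Python lists of CID values. The merge-style `intersect` function is one common way to intersect sorted lists in a single pass; it is an illustration and not necessarily how SMuRFIG implements it.

```python
# (CID, credits, room) for each course, copied from the Course table on the slides.
COURSES = [
    (1, 10, "G010"), (2, 10, "G010"), (3, 20, "G006"), (4, 10, "G004"),
    (5, 5, "G015"), (6, 10, "G005"), (7, 30, "G005"), (8, 30, "G010"),
    (9, 10, "G004"), (10, 10, "G010"), (11, 10, "G010"),
]

# KeyID lists of the two single items; sorted because COURSES is sorted by CID.
credits_10 = [cid for cid, credits, _ in COURSES if credits == 10]
room_g010 = [cid for cid, _, room in COURSES if room == "G010"]


def intersect(a, b):
    """Intersect two sorted KeyID lists in one pass over both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


both = intersect(credits_10, room_g010)
print(credits_10, len(credits_10))  # [1, 2, 4, 6, 9, 10, 11] 7
print(room_g010, len(room_g010))    # [1, 2, 8, 10, 11] 5
print(both, len(both))              # [1, 2, 10, 11] 4
```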
    • KeyID List Propagation
      Itemset {(name=Jan)}, propagated from Professor to Course to Student:
      KeyID list = {A,B,C} ↠ KeyID list = {1,2,3,4} ↠ KeyID list = {1,2,3,4,5}
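Propagation can be sketched as a single pass over a binary relation. In the code below, `propagate` is an illustrative helper, and the `TAKES` pairs are an assumption read off the join table shown earlier in the deck, since the transcript only lists Teaches explicitly.

```python
# Teaches as (PID, CID) from the slides; Takes as (SID, CID), read off the
# join table shown earlier in the deck (assumption).
TEACHES = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 4), ("D", 5), ("D", 6),
           ("E", 7), ("F", 8), ("G", 9), ("G", 10), ("G", 11), ("I", 11)]
TAKES = [(1, 1), (2, 1), (3, 1), (1, 2), (5, 2), (4, 3), (6, 11)]


def propagate(keys, relation):
    """Map a set of key values across a relation given as (source, target) pairs."""
    return {target for source, target in relation if source in keys}


professors = {"A", "B", "C"}  # KeyID list of {(name=Jan)} in Professor
courses = propagate(professors, TEACHES)
# Takes is stored as (SID, CID), so flip each pair to go from Course to Student.
students = propagate(courses, [(cid, sid) for sid, cid in TAKES])

print(sorted(courses))   # [1, 2, 3, 4]
print(sorted(students))  # [1, 2, 3, 4, 5]
```

Each call scans the relation once, so its cost grows with the number of pairs in the relation table.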
    • Complexity
      ∩ Intersection: linear in size of entity table
      ↠ Propagation: linear in size of relation table
    • ‘Interestingness’
      Classical Frequent Itemset Mining
      - overload of ‘interesting’ patterns
      - solutions: closed sets, maximal sets, NDI, ...
      Relational Itemset Mining
      - same problems
      - similar solutions can be applied
    • ‘Interestingness’
      Additional combinatorial blow-up (illustrated on the Course and Professor tables)
      2 courses, 3 professors, 6 courses per professor = specific to relational data
    • Support Deviation: Main Idea
      - Compute expected support of an itemset based on the DB’s structure
      - Compare with actual support of the itemset
      - Only consider relational itemsets which diverge sufficiently
    • Summary
      • Multi-relational support
        - count key values
        - easily interpretable
      • SMuRFIG algorithm
        - depth-first
        - efficient
        - support for all tables at once
        - scalable: linear in number of tuples, tables
      • Support deviation
    • For more info
      • SMuRFIG
        - http://www.adrem.ua.ac.be/smurfig
        - or Google “smurfig”
      • Paper
        - “Mining Interesting Sets and Rules in Relational Databases”, B. Goethals, W. Le Page, M. Mampaey, Proceedings of ACM SAC, 2010.
    • Questions?