Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Efficient Algorithms for Association Finding and
Frequent Association Pattern Mining
Gong Cheng, Daxin Liu, Yuzhong Qu
Web...
Background and motivation
• To suggest friends, recognize suspected terrorists,
answer questions … based on massive graph ...
Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query en...
Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query en...
Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query en...
Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query en...
Problem statement
1. How to efficiently find associations in a possibly
very large graph?
2. How to help users explore a p...
Problem statement
1. How to efficiently find associations in a possibly
very large graph?
2. How to help users explore a p...
Association finding: Problem
• To find all the associations having a limited diameter
(Diameter = Greatest distance betwee...
Association finding: Basic solution
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Chris
attended
Pap...
Association finding: Basic solution
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Chris
attended
Pap...
Association finding: Optimization
• Distance-based search space pruning
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
cor...
Association finding: Optimization
• Distance computation
• Materializing offline computed results: O(V2) space
• Online co...
Association finding: Deduplication
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Chris
attended
Pape...
Association finding: Deduplication
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Chris
attended
Pape...
Association finding: Deduplication
• Canonical code
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
co...
Frequent association pattern mining
• Association pattern: A conceptual abstract that
summarizes a group of associations
C...
Frequent association pattern mining
• Problem: To mine all the association patterns matched by
more than a threshold propo...
Frequent association pattern mining
• Basic solution:
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
...
Frequent association pattern mining
• Canonical code
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference
attended
reviewer
Ali...
Frequent association pattern mining
• Canonical code
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference
attended
reviewer
Ali...
Experiments
• Datasets
• LinkedMDB: 1M vertices and 2M arcs
• DBpedia (2015-04): 4M vertices and 15M arcs
• Parameter sett...
Experiments
• Results: Association finding
BSC: Basic solution (not pruning)
PRN: Optimized solution (distance-based pruni...
Experiments
• Results: Frequent association pattern mining
• LinkedMDB
• <10,000 associations: <21ms
• 13,531 associations...
Takeaway messages
• Subgraph finding and mining are faster than what we expected.
• Consider distance oracle and canonical...
Takeaway messages
• Subgraph finding and mining are faster than what we expected.
• Consider distance oracle and canonical...
Upcoming SlideShare
Loading in …5
×

Efficient Algorithms for Association Finding and Frequent Association Pattern Mining

615 views

Published on

Presented at ISWC'16, Kobe, 10/21/2016.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Efficient Algorithms for Association Finding and Frequent Association Pattern Mining

  1. 1. Efficient Algorithms for Association Finding and Frequent Association Pattern Mining Gong Cheng, Daxin Liu, Yuzhong Qu Websoft Research Group National Key Laboratory for Novel Software Technology Nanjing University, China Websoft
  2. 2. Background and motivation • To suggest friends, recognize suspected terrorists, answer questions … based on massive graph data
  3. 3. Problem statement An association connecting a set of query entities is a minimal subgraph that • contains all the query entities, and • is connected. Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice
  4. 4. Problem statement An association connecting a set of query entities is a minimal subgraph that • contains all the query entities, and • is connected. Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice
  5. 5. Problem statement An association connecting a set of query entities is a minimal subgraph that • contains all the query entities, and • is connected. Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice
  6. 6. Problem statement An association connecting a set of query entities is a minimal subgraph that • contains all the query entities, and • is connected. Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice • Tree-structured • Leaves ⊆ Query entities
  7. 7. Problem statement 1. How to efficiently find associations in a possibly very large graph? 2. How to help users explore a possibly large set of associations that have been found? Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice
  8. 8. Problem statement 1. How to efficiently find associations in a possibly very large graph? 2. How to help users explore a possibly large set of associations that have been found? Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice Association finding Frequent association pattern mining
  9. 9. Association finding: Problem • To find all the associations having a limited diameter (Diameter = Greatest distance between any pair of vertices) Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice
  10. 10. Association finding: Basic solution Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Chris attended Paper-AisAuthorOf acceptedAt ISWC Alice Bob reviewer ISWC ISWC An association  A set of paths
  11. 11. Association finding: Basic solution Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Chris attended Paper-AisAuthorOf acceptedAt ISWC Alice Bob reviewer ISWC ISWC 1. Path finding 2. Path merging
  12. 12. Association finding: Optimization • Distance-based search space pruning Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice • DiameterConstraint ≤ 3 • Length(AliceDan) = 1 • Distance(Dan, Bob) = 4 Length+Distance > DiameterConstraint
  13. 13. Association finding: Optimization • Distance computation • Materializing offline computed results: O(V2) space • Online computing: O(E) time per pair • Using distance oracle: a space-time trade-off Chris Bob Paper-A Paper-B Dan isAuthorOf knows correspondingAuthor acceptedAt Ellen knows ISWC isAuthorOf COLD acceptedAt attended attended attended reviewer reviewer Frank knows Alice
  14. 14. Association finding: Deduplication Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Chris attended Paper-AisAuthorOf acceptedAt ISWC Alice Bob reviewer ISWC ISWC An association  A set of paths
  15. 15. Association finding: Deduplication Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Chris attended Paper-AisAuthorOf Alice Bob reviewer ISWC ISWC A duplicate association  Another set of paths Paper-A acceptedAt Paper-A acceptedAt
  16. 16. Association finding: Deduplication • Canonical code Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper-A))$ = … Paper-A,acceptedAt,code(Tree(ISWC))$ … = … ISWC,reviewer,code(Tree(Bob)),~attended,code(Tree(Chris))$ … = … Bob$ … (Assuming Bob precedes Chris)
  17. 17. Frequent association pattern mining • Association pattern: A conceptual abstract that summarizes a group of associations Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Association  Association pattern (Intermediate entity  Class) Chris Bob PaperisAuthorOf acceptedAt Conference attended reviewer Alice
  18. 18. Frequent association pattern mining • Problem: To mine all the association patterns matched by more than a threshold proportion of associations Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Association  Association pattern (Intermediate entity  Class) Chris Bob PaperisAuthorOf acceptedAt Conference attended reviewer Alice
  19. 19. Frequent association pattern mining • Basic solution: Chris Bob Paper-AisAuthorOf acceptedAt ISWC attended reviewer Alice Association  Association pattern (Intermediate entity  Class) Chris Bob PaperisAuthorOf acceptedAt Conference attended reviewer Alice Calculating the frequency of an association pattern = Counting the occurrence of its canonical code
  20. 20. Frequent association pattern mining • Canonical code Chris Bob PaperisAuthorOf acceptedAt Conference attended reviewer Alice Paper isAuthorOf acceptedAt Conference code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper)),isAuthorOf,code(Tree(Paper))$ Paper equals Paper. So which subtree should go first? (Canonical code may not be unique!)? ?
  21. 21. Frequent association pattern mining • Canonical code Chris Bob PaperisAuthorOf acceptedAt Conference attended reviewer Alice Paper isAuthorOf acceptedAt Conference code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper)),isAuthorOf,code(Tree(Paper))$ Smallest leaf as its proxy to be compared Equality would never happen. (Canonical code is now unique!) (Assuming Bob precedes Chris)
  22. 22. Experiments • Datasets • LinkedMDB: 1M vertices and 2M arcs • DBpedia (2015-04): 4M vertices and 15M arcs • Parameter settings • Diameter constraint (λ): 2, 4 • Number of query entities (n): 2, 3, 4, 5 • Test queries • 1,000 random sets of query entities under each setting of λ and n • Hardware configuration • 3.3GHz CPU, 24GB memory • Data graphs: in memory • Distance oracles: on disk
  23. 23. Experiments • Results: Association finding BSC: Basic solution (not pruning) PRN: Optimized solution (distance-based pruning) PRN-1: Optimized solution (distance-based pruning except for the last level of search)
  24. 24. Experiments • Results: Frequent association pattern mining • LinkedMDB • <10,000 associations: <21ms • 13,531 associations: 68ms • DBpedia • <10,000 associations: <65ms • 1,198,968 associations: 2909ms
  25. 25. Takeaway messages • Subgraph finding and mining are faster than what we expected. • Consider distance oracle and canonical code in your own research.
  26. 26. Takeaway messages • Subgraph finding and mining are faster than what we expected. • Consider distance oracle and canonical code in your own research.

×