Substructure Similarity Search in Graph Databases
  • SIGMOD : Special Interest Group on Management of Data
  • Transcript

    • 1. Substructure Similarity Search in Graph Databases, SIGMOD '05. Xifeng Yan, Philip S. Yu, Jiawei Han. Reporter : Chang-Yang Chen, 2005/10/18
    • 2. Outline
      • Introduction
      • Related Work
      • Preliminary Concepts
      • Structural Filtering
      • Feature Set Selection
      • Algorithm Implementation
      • Empirical Study
      • Conclusion
    • 3. Introduction
      • Emergence of massive, complex structural data, in the form of sequences, trees, and graphs.
      • Tremendous efforts have been put into building practical graph query systems.
      • Kinds of structure search queries :
        • full structure search
        • substructure search
        • full structure similarity search
    • 4. Example
    • 5. Contribution
      • Grafil (Graph Similarity Filtering)
      • Grafil models each query graph as a set of features and transforms edge deletions into feature misses in the query graph.
      • Two data structures :
        • feature-graph matrix
        • edge-feature matrix
      • Multi-filter composition strategy
    • 6. Related Work
      • L. Holder et al. Substructure discovery in the subdue system.
      • G. Navarro. A guided tour to approximate string matching.
      • X. Yan, P. Yu, and J. Han. Graph indexing: A frequent structure-based approach.
    • 7. Preliminary Concepts
      • Three categories of similarity measure :
        • physical property-based (e.g. toxicity and weight)
        • feature-based
        • structure-based
      • Simulating the graph matching process in a way similar to the string matching process.
    • 8. Preliminary Concepts
      • Substructure similarity between G and Q : $|E(P)| / |E(Q)|$
        • P is the maximum common subgraph of G and Q.
      • Relaxation ratio : $\theta = 1 - |E(P)| / |E(Q)|$
      • Example : target graph in Figure 1(a) and the query graph in Figure 2.
        • Their maximum common subgraph has 11 edges.
        • Their substructure similarity is about 92%.
        • The relaxation ratio is about 8%.
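The arithmetic in the example can be checked with a minimal sketch. It assumes that similarity is the fraction of the query's edges kept in the maximum common subgraph P, and that the query graph has 12 edges (an assumption, chosen because 11 of 12 edges matches the 92% on the slide). The function names are illustrative only.

```python
def substructure_similarity(common_edges: int, query_edges: int) -> float:
    """Fraction of the query's edges that appear in the maximum common subgraph."""
    if query_edges <= 0:
        raise ValueError("the query graph must have at least one edge")
    return common_edges / query_edges


def relaxation_ratio(common_edges: int, query_edges: int) -> float:
    """Fraction of the query's edges that must be relaxed to obtain a match."""
    return 1.0 - substructure_similarity(common_edges, query_edges)


# 11 common edges comes from the slide; 12 query edges is an assumption.
similarity = substructure_similarity(11, 12)
ratio = relaxation_ratio(11, 12)
print(f"similarity = {similarity:.0%}, relaxation ratio = {ratio:.0%}")
# prints "similarity = 92%, relaxation ratio = 8%"
```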
    • 9. Structural Filtering
      • A graph that matches the relaxed query should still have at least three occurrences of these features.
    • 10. Feature-Graph Matrix Index
      • Without complex computation
      • Flexibly insert and delete features
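One way to picture such an index is a nested dictionary with one row per feature and one column per graph. The sketch below is an illustration under assumptions, not the presented implementation: the feature names (`fa`, `fb`, `fc`), graph ids, and counts are all hypothetical, and a graph's feature misses are taken to be the sum of the positive differences between the query's counts and the graph's counts.

```python
# Toy feature-graph matrix: rows are features, columns are graphs,
# and each entry counts how often the feature occurs in the graph.
# Feature names, graph ids, and all counts are hypothetical.
feature_graph = {
    "fa": {"G1": 0, "G2": 1, "G3": 2},
    "fb": {"G1": 1, "G2": 0, "G3": 2},
    "fc": {"G1": 2, "G2": 0, "G3": 4},
}


def feature_misses(matrix, query_counts, graph_id):
    """Count the query's feature occurrences that the graph cannot supply."""
    misses = 0
    for feature, needed in query_counts.items():
        available = matrix.get(feature, {}).get(graph_id, 0)
        misses += max(0, needed - available)
    return misses


def filter_candidates(matrix, graph_ids, query_counts, max_misses):
    """Keep graphs whose feature misses stay within the allowed bound."""
    return [g for g in graph_ids
            if feature_misses(matrix, query_counts, g) <= max_misses]


# Hypothetical query with 7 feature occurrences; allowing 4 misses means
# a candidate must still supply at least 3 of them.
query_counts = {"fa": 1, "fb": 2, "fc": 4}
candidates = filter_candidates(feature_graph, ["G1", "G2", "G3"],
                               query_counts, max_misses=4)
print(candidates)  # ['G1', 'G3']

# Inserting or deleting a feature only touches one row of the matrix.
feature_graph["fd"] = {"G1": 1, "G2": 1, "G3": 0}
del feature_graph["fa"]
```

The lookup uses only counting and comparison, with no structure comparison at filtering time, and adding or dropping a feature is a single dictionary operation.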
    • 11. Framework
      • Steps of similarity search :
        • Index construction
        • Feature miss estimation
          • Compute the upper bound of feature misses. This upper bound is written as $d_{max}$.
        • Query processing
        • Query relaxation
          • If the user needs more matches than those returned from the previous step, relax the query further and iterate Steps 2 to 4.
      • The proposed filtering algorithm should return a candidate answer set that is as small as possible.
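The iteration over Steps 2 to 4 can be sketched as a filter, verify, relax loop. This is a toy under loud assumptions: graphs are reduced to sets of edge labels, so the "verification" is a set difference and not real subgraph matching, and the names `similarity_search`, `size_filter`, and `toy_verify` are hypothetical. Index construction (Step 1) is not shown.

```python
def similarity_search(query, database, filter_fn, verify_fn,
                      min_matches, max_deletions):
    """Filter, verify, and relax until enough matches are found."""
    matches = []
    for k in range(max_deletions + 1):  # k = number of relaxed query edges
        # Cheap pruning: produce a small candidate answer set.
        candidates = filter_fn(query, database, k)
        # Validation: check the actual matches among the candidates.
        matches = [g for g in candidates if verify_fn(query, database[g], k)]
        if len(matches) >= min_matches:
            return k, matches
    return max_deletions, matches


# Toy data: each graph is just a set of edge labels (an assumption).
database = {
    "G1": {"ab", "bc", "cd"},
    "G2": {"ab", "bc"},
    "G3": {"ab"},
}
query = {"ab", "bc", "cd"}


def size_filter(query, database, k):
    """Necessary condition: a match needs at least |query| - k edges."""
    return [g for g, edges in database.items()
            if len(edges) >= len(query) - k]


def toy_verify(query, edges, k):
    """Toy check: at most k query edges are absent from the graph."""
    return len(query - edges) <= k


k, matches = similarity_search(query, database, size_filter, toy_verify,
                               min_matches=2, max_deletions=2)
print(k, matches)  # 1 ['G1', 'G2']
```

The filter never discards a true match in this toy, so a tighter filter only saves verification work; that is the sense in which a smaller candidate set is better.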
    • 12. Feature Miss Estimation
      • Filtering algorithms only work for moderate relaxation ratios and need a validation algorithm to check the actual matches.
    • 13. Feature Miss Estimation
      • M(r, ·) denotes the rth row vector of matrix M.
      • M(·, c) denotes the cth column vector of matrix M.
      • |M(r, ·)| represents the number of non-zero entries in M(r, ·).
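The notation can be made concrete with a small edge-feature matrix. The sketch assumes that each row stands for a query edge, each column for one feature occurrence in the query, and that an entry is 1 when the edge belongs to that occurrence, so deleting an edge misses every occurrence in its row. The matrix values are hypothetical, and the exhaustive search over k rows is a brute-force stand-in, not the estimation algorithm from the slides.

```python
from itertools import combinations

# Hypothetical edge-feature matrix: 3 query edges x 4 feature occurrences.
M = [
    [1, 1, 0, 0],  # edge e1
    [0, 1, 1, 0],  # edge e2
    [0, 0, 1, 1],  # edge e3
]


def row_nonzeros(matrix, r):
    """|M(r, .)|: number of feature occurrences that contain edge r."""
    return sum(1 for value in matrix[r] if value)


def column(matrix, c):
    """M(., c): which edges belong to feature occurrence c."""
    return [row[c] for row in matrix]


def max_feature_misses(matrix, k):
    """Largest number of feature occurrences hit by deleting k edges.

    Tries every set of k rows, so it is only practical for tiny matrices.
    """
    if not matrix:
        return 0
    k = min(k, len(matrix))
    n_cols = len(matrix[0])
    best = 0
    for rows in combinations(range(len(matrix)), k):
        hit = sum(1 for c in range(n_cols)
                  if any(matrix[r][c] for r in rows))
        best = max(best, hit)
    return best


print([row_nonzeros(M, r) for r in range(len(M))])  # [2, 2, 2]
print(column(M, 1))                                 # [1, 1, 0]
print(max_feature_misses(M, 1))                     # 2
print(max_feature_misses(M, 2))                     # 4
```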
    • 14. Feature Miss Estimation
    • 15. Estimation Refinement
      • For a given row of M, any optimal solution should fall into one of the following two cases :
        • Case 1 : the row is selected in this solution.
        • Case 2 : the row is not selected in this solution.
    • 16. Estimation Refinement
    • 17. Frequency Difference
    • 18. Feature Set Selection
      • Should we use all the features together in a single filter?
      • Feature conjugation :
        • Compensating for the misses of one feature by adding more occurrences of another feature in G.
    • 19. Feature Set Selection
      • Selectivity :
        • The average frequency difference of a feature within D and Q.
      • Rules for feature set selection :
        • Rule 1. Select a large number of features.
        • Rule 2. Make sure features cover the query graph uniformly.
        • Rule 3. Separate features with different selectivity.
    • 20. Multi-filter composition strategy
      • Make a trade-off among these rules.
        • Group features by their size to create feature sets.
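Grouping by size can be sketched as follows. The feature names and sizes are hypothetical, and combining the filters by intersecting their candidate sets is an assumption made for illustration: it treats each filter as a necessary condition that a graph must pass.

```python
from collections import defaultdict


def group_features_by_size(feature_sizes):
    """Put features with the same size (number of edges) in one feature set."""
    groups = defaultdict(list)
    for feature, size in feature_sizes.items():
        groups[size].append(feature)
    return dict(groups)


def compose_filters(candidate_sets):
    """Keep only the graphs that pass every filter."""
    result = None
    for candidates in candidate_sets:
        result = set(candidates) if result is None else result & set(candidates)
    return result if result is not None else set()


# Hypothetical features with their sizes in edges.
feature_sizes = {"fa": 1, "fb": 1, "fc": 2, "fd": 3}
groups = group_features_by_size(feature_sizes)
print(groups)  # {1: ['fa', 'fb'], 2: ['fc'], 3: ['fd']}

# Hypothetical candidate sets, one per filter (one filter per feature set).
per_filter = [{"G1", "G2", "G3"}, {"G1", "G3"}, {"G3", "G4"}]
print(compose_filters(per_filter))  # {'G3'}
```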
    • 21. Algorithm Implementation
      • Two components :
        • Base component and clustering component.
    • 22. Empirical Study
      • Grafil's performance is compared with two algorithms :
        • Using individual edges as features (denoted as Edge)
        • Using all features of a query graph (denoted as Allfeature)
      • Two kinds of datasets are used :
        • A real dataset and a series of synthetic datasets.
    • 23. Experiments on the Chemical Compound Dataset
    • 24. Experiments on the Chemical Compound Dataset
      • filtering ratio :
    • 25. Experiments on the Synthetic Datasets
      • A typical dataset may have 10,000 graphs and use 200 seed fragments with 10 kinds of nodes and edges.
      • When the types of labels in a graph become very diverse, Edge will perform nearly as well as Grafil.
        • In that case a graph can be treated as a set of tuples {node1 label, node2 label, edge label} instead of a complex structure.
        • This setting will generate (10 × 11)/2 × 10 = 550 different edge tuples.
        • Most of the graphs in this synthetic dataset have 30 to 100 edges.
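The count of 550 edge tuples can be reproduced by enumeration. The sketch assumes edges are undirected, so the two node labels form an unordered pair in which both labels may be the same; the integer labels are placeholders.

```python
from itertools import combinations_with_replacement

node_labels = range(10)  # 10 kinds of nodes
edge_labels = range(10)  # 10 kinds of edges

# Unordered node-label pairs (repetition allowed) times edge labels.
node_pairs = list(combinations_with_replacement(node_labels, 2))
edge_tuples = {(a, b, e) for (a, b) in node_pairs for e in edge_labels}

print(len(node_pairs))   # 55, i.e. (10 * 11) / 2
print(len(edge_tuples))  # 550
```

With 550 possible tuples and only 30 to 100 edges per graph, an individual edge tuple is already quite selective, which is consistent with Edge performing nearly as well as Grafil on this setting.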
    • 26. Experiments on the Synthetic Datasets
    • 27. Conclusion
      • Grafil explores a filtering algorithm that uses indexed structural patterns, without doing costly structure comparisons.
      • The approach is attractive in terms of accuracy and efficiency.
      • The techniques developed in Grafil can be directly applied to searching inexact non-consecutive sequences, trees, and other complicated structures as well.
