Substructure Similarity Search in Graph Databases


  • SIGMOD: Special Interest Group on Management of Data

    1. Substructure Similarity Search in Graph Databases
       SIGMOD '05
       Xifeng Yan, Philip S. Yu, Jiawei Han
       Reporter: Chang-Yang Chen, 2005/10/18
    2. Outline
       • Introduction
       • Related Work
       • Preliminary Concepts
       • Structural Filtering
       • Feature Set Selection
       • Algorithm Implementation
       • Empirical Study
       • Conclusion
    3. Introduction
       • Emergence of massive, complex structural data in the form of sequences, trees, and graphs.
       • Tremendous efforts have been put into building practical graph query systems.
       • Kinds of structure search queries:
           • full structure search
           • substructure search
           • full structure similarity search
    4. Example
    5. Contribution
       • Grafil (Graph Similarity Filtering)
       • Grafil models each query graph as a set of features and transforms edge deletions into feature misses in the query graph.
       • Two data structures:
           • feature-graph matrix
           • edge-feature matrix
       • Multi-filter composition strategy
    6. Related Work
       • L. Holder et al. Substructure discovery in the Subdue system.
       • G. Navarro. A guided tour to approximate string matching.
       • X. Yan, P. Yu, and J. Han. Graph indexing: A frequent structure-based approach.
    7. Preliminary Concepts
       • Three categories of similarity measure:
           • physical property-based (e.g., toxicity and weight)
           • feature-based
           • structure-based
       • Simulating the graph matching process in a way similar to the string matching process.
    8. Preliminary Concepts
       • Substructure similarity between G and Q: |E(P)| / |E(Q)|
           • P is the maximum common subgraph of G and Q.
       • Relaxation ratio: 1 - |E(P)| / |E(Q)|
       • Example: the target graph in Figure 1(a) and the query graph in Figure 2.
           • The common subgraph has 11 edges.
           • Their substructure similarity is 92%.
           • The relaxation ratio is 8%.
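The two ratios in the example can be checked with a few lines of Python. This is a minimal sketch, assuming the similarity is the number of common-subgraph edges divided by the number of query edges and the relaxation ratio is its complement. The function names are hypothetical, and the query size of 12 edges is inferred from the 11-edge common subgraph and the quoted 92%, not stated on the slide.

```python
def substructure_similarity(common_edges: int, query_edges: int) -> float:
    """Fraction of the query's edges kept in the maximum common subgraph."""
    if query_edges <= 0:
        raise ValueError("query graph must have at least one edge")
    return common_edges / query_edges


def relaxation_ratio(common_edges: int, query_edges: int) -> float:
    """Fraction of the query's edges that had to be dropped to get a match."""
    return 1.0 - substructure_similarity(common_edges, query_edges)


# Assumed sizes: 11 common edges out of an inferred 12 query edges.
sim = substructure_similarity(11, 12)
theta = relaxation_ratio(11, 12)
print(f"similarity = {sim:.0%}, relaxation ratio = {theta:.0%}")
# prints: similarity = 92%, relaxation ratio = 8%
```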
    9. Structural Filtering
       • Should have at least three occurrences of these features.
    10. Feature-Graph Matrix Index
       • Without complex computation
       • Flexibly insert and delete features
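A feature-graph matrix can be pictured as a table of feature counts per graph, which turns filtering into a comparison of counts. The example below is an illustration only. The feature names, graph identifiers, counts, and the allowed-miss threshold `d_max` are invented, and the frequency-difference rule (summing, per feature, how many query occurrences the graph lacks) is an assumption about how such a matrix would be used, not something stated on this slide.

```python
from typing import Dict, List

# Hypothetical feature-graph matrix: one row per feature, one column per
# graph; each entry is how often the feature occurs in that graph.
feature_graph: Dict[str, Dict[str, int]] = {
    "fa": {"G1": 0, "G2": 1, "G3": 2},
    "fb": {"G1": 1, "G2": 2, "G3": 3},
    "fc": {"G1": 0, "G2": 1, "G3": 2},
}

# Occurrences of the same features in the (hypothetical) query graph.
query_counts: Dict[str, int] = {"fa": 1, "fb": 4, "fc": 2}


def frequency_difference(graph_id: str) -> int:
    """Sum, over features, of the query occurrences the graph is missing."""
    return sum(
        max(0, q - feature_graph[f][graph_id]) for f, q in query_counts.items()
    )


def candidates(d_max: int) -> List[str]:
    """Keep graphs whose missing occurrences do not exceed d_max."""
    graph_ids = sorted(next(iter(feature_graph.values())))
    return [g for g in graph_ids if frequency_difference(g) <= d_max]


print(candidates(4))  # prints: ['G2', 'G3']
```

Only lookups and additions are involved, and adding or removing a feature is a single dictionary update, which is one way to read the two bullet points above.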
    11. Framework
       • Steps of similarity search:
           1. Index construction
           2. Feature miss estimation
               • Compute the upper bound of feature misses. This upper bound is written as d_max.
           3. Query processing
           4. Query relaxation
               • If the user needs more matches than those returned from the previous step, iterate Steps 2 to 4.
       • The proposed filtering algorithm should return a candidate answer set that is as small as possible.
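The relaxation loop in these steps can be sketched as follows. This is a toy model under stated assumptions: the index is a hand-written dictionary, the filter is a stand-in lower bound rather than Grafil's feature-based filter, and `filter_candidates`, `verify`, and `similarity_search` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Hypothetical pre-built index (step 1): graph id -> (cheap lower bound on
# missing edges, true number of missing edges). All values are invented.
index: Dict[str, Tuple[int, int]] = {
    "G1": (0, 0),
    "G2": (1, 2),
    "G3": (1, 1),
    "G4": (3, 3),
}


def filter_candidates(k: int) -> List[str]:
    """Steps 2-3: keep graphs that the cheap bound cannot rule out."""
    return [g for g, (lower, _) in index.items() if lower <= k]


def verify(graph_id: str, k: int) -> bool:
    """Stand-in for the costly structure comparison on a candidate."""
    return index[graph_id][1] <= k


def similarity_search(wanted: int, max_relaxation: int) -> Tuple[int, List[str]]:
    """Step 4: relax the query one edge at a time until enough matches."""
    matches: List[str] = []
    k = 0
    for k in range(max_relaxation + 1):
        matches = [g for g in filter_candidates(k) if verify(g, k)]
        if len(matches) >= wanted:
            break
    return k, matches


result = similarity_search(wanted=2, max_relaxation=3)
print(result)  # prints: (1, ['G1', 'G3'])
```

In this toy run, G2 passes the filter at k = 1 but fails verification, which is why a smaller candidate set directly saves verification work.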
    12. Feature Miss Estimation
       • Filtering algorithms work only for moderate relaxation ratios and need a validation algorithm to check the actual matches.
    13. Feature Miss Estimation
       • M(r, ·) denotes the r-th row vector of matrix M.
       • M(·, c) denotes the c-th column vector of matrix M.
       • |M(r, ·)| represents the number of non-zero entries in M(r, ·).
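The row and column notation can be illustrated on a small 0/1 matrix. The sketch below assumes an edge-feature matrix in which rows are query edges, columns are feature occurrences, and an entry of 1 means that deleting the edge destroys that occurrence. The matrix values are invented. The bound is found by exhaustive search over every choice of k rows, which is practical only for tiny inputs and is not necessarily how Grafil computes its estimate.

```python
from itertools import combinations
from typing import List

# Hypothetical edge-feature matrix M: M[r][c] == 1 means edge r is part of
# feature occurrence c, so deleting edge r misses occurrence c.
M: List[List[int]] = [
    [0, 1, 1, 1, 0, 0, 0],  # edge e1
    [1, 1, 0, 0, 1, 0, 0],  # edge e2
    [1, 0, 1, 0, 0, 1, 1],  # edge e3
]


def row_nonzeros(m: List[List[int]], r: int) -> int:
    """|M(r, .)|: number of non-zero entries in row r."""
    return sum(1 for v in m[r] if v != 0)


def col_nonzeros(m: List[List[int]], c: int) -> int:
    """|M(., c)|: number of non-zero entries in column c."""
    return sum(1 for row in m if row[c] != 0)


def max_feature_misses(m: List[List[int]], k: int) -> int:
    """Largest number of columns hit by any choice of k rows (brute force)."""
    best = 0
    n_cols = len(m[0])
    for rows in combinations(range(len(m)), k):
        covered = sum(1 for c in range(n_cols) if any(m[r][c] for r in rows))
        best = max(best, covered)
    return best


print(row_nonzeros(M, 2))        # prints: 4
print(max_feature_misses(M, 1))  # prints: 4
print(max_feature_misses(M, 2))  # prints: 6
```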
    14. Feature Miss Estimation
    15. Estimation Refinement
       • Any optimal solution that leads to should be in the following two cases:
           • Case 1: is selected in this solution.
           • Case 2: is not selected in this solution.
    16. Estimation Refinement
    17. Frequency Difference
    18. Feature Set Selection
       • Should we use all the features together in a single filter?
       • Feature conjugation:
           • Compensating the misses of by adding more occurrences of another feature in G.
    19. Feature Set Selection
       • Selectivity:
           • average frequency difference within D and Q, written as .
       • Rules for feature set selection:
           • Rule 1. Select a large number of features.
           • Rule 2. Make sure features cover the query graph uniformly.
           • Rule 3. Separate features with different selectivity.
    20. Multi-filter composition strategy
       • Make a trade-off among these rules.
           • Group features by their size to create feature sets.
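One way to picture the multi-filter strategy is to split the features into groups by size, run one filter per group, and keep only the graphs that pass every filter. The sketch below assumes exactly that intersection scheme. The feature sizes, per-group miss bounds, and per-graph missing counts are invented for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical features and their sizes (number of edges).
feature_size: Dict[str, int] = {"f1": 1, "f2": 1, "f3": 2, "f4": 3, "f5": 3}

# Hypothetical number of query occurrences of each feature that each
# graph is missing.
missing: Dict[str, Dict[str, int]] = {
    "G1": {"f1": 0, "f2": 1, "f3": 0, "f4": 0, "f5": 1},
    "G2": {"f1": 2, "f2": 2, "f3": 0, "f4": 0, "f5": 0},
    "G3": {"f1": 0, "f2": 0, "f3": 1, "f4": 2, "f5": 2},
}


def group_by_size(sizes: Dict[str, int]) -> Dict[int, List[str]]:
    """Create one feature set per feature size."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for feature, size in sizes.items():
        groups[size].append(feature)
    return dict(groups)


def multi_filter(groups: Dict[int, List[str]], bounds: Dict[int, int]) -> List[str]:
    """Apply one filter per feature set and intersect the survivors."""
    survivors = set(missing)
    for size, feats in groups.items():
        passed = {
            g for g in survivors
            if sum(missing[g][f] for f in feats) <= bounds[size]
        }
        survivors &= passed
    return sorted(survivors)


groups = group_by_size(feature_size)
print(groups)  # prints: {1: ['f1', 'f2'], 2: ['f3'], 3: ['f4', 'f5']}
print(multi_filter(groups, {1: 2, 2: 1, 3: 2}))  # prints: ['G1']
```

Here G2 is pruned by the size-1 filter and G3 by the size-3 filter, so each group removes graphs that the others let through.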
    21. Algorithm Implementation
       • Two components:
           • Base component and clustering component.
    22. Empirical Study
       • Compare performance with two algorithms:
           • Using individual edges as features (denoted as Edge)
           • Using all features of a query graph (denoted as Allfeature)
       • Two kinds of datasets are used:
           • A real dataset and a series of synthetic datasets.
    23. Experiments on the Chemical Compound Dataset
    24. Experiments on the Chemical Compound Dataset
       • Filtering ratio:
    25. Experiments on the Synthetic Datasets
       • A typical dataset may have 10,000 graphs and use 200 seed fragments with 10 kinds of nodes and edges.
       • When the types of labels in a graph become very diverse, Edge will perform nearly as well as Grafil.
           • A set of tuples {node1 label, node2 label, edge label} instead of a complex structure.
           • This setting will generate (10 × 11)/2 × 10 = 550 different edge tuples.
           • Most graphs in this synthetic dataset have 30 to 100 edges.
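The count of 550 edge tuples can be reproduced by enumeration. The sketch assumes that the two node labels in a tuple are unordered (an undirected edge), so node-label pairs are combinations with repetition. The variable names are illustrative.

```python
from itertools import combinations_with_replacement

node_labels = range(10)  # 10 kinds of nodes
edge_labels = range(10)  # 10 kinds of edges

# Unordered pairs of node labels, allowing both endpoints to share a label.
node_pairs = list(combinations_with_replacement(node_labels, 2))

# One tuple per (node pair, edge label) combination.
edge_tuples = [(a, b, e) for (a, b) in node_pairs for e in edge_labels]

print(len(node_pairs))   # prints: 55
print(len(edge_tuples))  # prints: 550
```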
    26. Experiments on the Synthetic Datasets
    27. Conclusion
       • Exploring the filtering algorithm using indexed structural patterns, without doing costly structure comparisons.
       • Attractive in terms of accuracy and efficiency.
       • The techniques developed in Grafil can be directly applied to searching inexact non-consecutive sequences, trees, and other complicated structures as well.