Trends In Graph Data Management And Mining


Keynote speech at Symposium on Emerging Trends in Database Technologies (ETDT), Pune Institute of Engineering and Technology, October 2004.

  1. 1. Trends in Graph Data Management and Mining Srinath Srinivasa IIIT Bangalore [email_address]
  2. 2. No data is an island…
  3. 3. Outline <ul><li>Graph Data and its characteristics </li></ul><ul><li>Structural Queries </li></ul><ul><li>Storage Models for Graphs </li></ul><ul><li>Data Models for Graph Databases </li></ul><ul><li>Structural Indexes </li></ul><ul><li>Mining Frequent Subgraphs </li></ul><ul><ul><li>gSpan </li></ul></ul><ul><ul><li>FBT </li></ul></ul>
  4. 4. Graph Data A graph G = (V,E) is a collection of nodes (vertices) and edges. A graph represents a “relationship structure” among different data elements. A graph database is a collection of different graphs representing different relationship structures.
  5. 5. Graph database versus Relational database A relational database maintains different instances of the same relationship structure (represented by its ER schema). A graph database maintains different relationship structures.
  6. 6. Graph Database Applications <ul><li>Software Engineering </li></ul><ul><ul><li>UML diagrams, flowcharts, state machines, … </li></ul></ul><ul><li>Knowledge Management </li></ul><ul><ul><li>Ontologies, Semantic nets, … </li></ul></ul><ul><li>Bioinformatics </li></ul><ul><ul><li>Molecular structures, bio-pathways, … </li></ul></ul><ul><li>CAD </li></ul><ul><ul><li>Electrical circuits, IC designs, … </li></ul></ul><ul><li>Cartography, XML Bases, HTML Webs, … </li></ul>
  7. 7. Queries over Graph Databases <ul><li>Attribute Queries </li></ul><ul><ul><li>Queries over attributes and values in nodes and edges. Equivalent to a relational query within a given schema </li></ul></ul><ul><li>Structural Queries </li></ul><ul><ul><li>Queries over the relationship structure itself. Examples: Structural similarity, substructure, template matching, etc. </li></ul></ul>
  8. 8. Structural Queries on Graph Data <ul><li>Undirected Graphs </li></ul><ul><ul><li>Structural similarity, substructure </li></ul></ul><ul><li>Directed Graphs </li></ul><ul><ul><li>Structural similarity, substructure, reachability </li></ul></ul><ul><li>Weighted Graphs </li></ul><ul><ul><li>Shortest paths, “best” matching substructure </li></ul></ul><ul><li>Labeled Graphs </li></ul><ul><ul><li>Labeled structural similarity, unlabeled structural similarity </li></ul></ul>
  9. 9. Structural Queries <ul><li>Substructure query </li></ul><ul><ul><li>Given a graph database G = {G_1, G_2, …, G_n} and a query graph Q, return all graphs G_i where Q is a subgraph of G_i. </li></ul></ul><ul><li>Structural similarity </li></ul><ul><ul><li>Given a graph database G = {G_1, G_2, …, G_n}, a query graph Q and a threshold t, return all graphs G_i where the edit distance between Q and G_i is at most t. </li></ul></ul><ul><ul><li>The edit distance between two graphs is the minimum number of edge modifications (additions, deletions) required to rewrite one graph into the other </li></ul></ul>
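The structural similarity query above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not a method from the talk: it assumes unlabeled, undirected graphs with the same number of nodes, and the names `make_graph`, `edit_distance` and `similar_graphs` are hypothetical. It tries every node mapping, so it is only usable on very small graphs.

```python
from itertools import permutations

def make_graph(nodes, edge_pairs):
    """Hypothetical helper: a graph is (node list, set of undirected edges)."""
    return (list(nodes), {frozenset(e) for e in edge_pairs})

def edit_distance(g, h):
    """Minimum number of edge additions/deletions that rewrite g into h.

    Assumes both graphs have the same number of nodes. Tries every
    node mapping, so the cost grows factorially with graph size.
    """
    g_nodes, g_edges = g
    h_nodes, h_edges = h
    if len(g_nodes) != len(h_nodes):
        raise ValueError("this sketch only compares graphs of equal order")
    best = None
    for perm in permutations(h_nodes):
        mapping = dict(zip(g_nodes, perm))
        mapped = {frozenset(mapping[u] for u in e) for e in g_edges}
        # Edges present in exactly one graph must be added or deleted.
        cost = len(mapped ^ h_edges)
        if best is None or cost < best:
            best = cost
    return best

def similar_graphs(db, query, t):
    """Names of all graphs in db within edit distance t of the query."""
    return [name for name, g in db.items() if edit_distance(query, g) <= t]

db = {
    "triangle": make_graph("abc", [("a", "b"), ("b", "c"), ("c", "a")]),
    "path": make_graph("xyz", [("x", "y"), ("y", "z")]),
    "empty": make_graph("uvw", []),
}
query = make_graph([1, 2, 3], [(1, 2), (2, 3)])
matches = similar_graphs(db, query, 1)
```

The factorial cost of even this toy version illustrates why pre-processing and indexing matter once the query has to run against a whole database.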
  10. 10. Structural Queries <ul><li>Subgraph isomorphism is NP-complete; graph isomorphism is in NP but is not known to be either in P or NP-complete </li></ul><ul><li>In graph databases, structure matching has to be performed against a set of graphs! </li></ul><ul><li>Proper storage, pre-processing and index structures are crucial if structural searches are to be practical </li></ul>
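  11. 11. Storing Graph Data Attributed Relational Graphs (ARGs) Example graph with nodes A, B, C, D and edge labels p, q, r, s, t, stored one edge per row as (node, node, label): (A, B, q); (B, C, s); (B, D, t); (A, C, p); (A, D, r)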
  12. 12. Storing Graph Data <ul><li>ARGs </li></ul><ul><ul><li>ARGs store a graph as a set of rows, each depicting an edge </li></ul></ul><ul><ul><li>Amenable to storage in an RDBMS and easy attribute searches using SQL </li></ul></ul><ul><ul><li>Costly structural searches, requiring complex nesting of SELECT statements </li></ul></ul><ul><ul><li>Each graph needs a separate table </li></ul></ul>
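As a rough illustration of the trade-off listed above, the sketch below stores one graph as edge rows in an in-memory SQLite table. The table name `g1_edges`, the column names and the triangle query are assumptions made for this example; the five rows follow the A, B, C, D example graph from the slides. The attribute query is a one-line SELECT, while even a small structural query (find all triangles) already needs a three-way self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per graph, one row per edge (hypothetical schema).
conn.execute("CREATE TABLE g1_edges (src TEXT, dst TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO g1_edges VALUES (?, ?, ?)",
    [("A", "B", "q"), ("B", "C", "s"), ("B", "D", "t"),
     ("A", "C", "p"), ("A", "D", "r")],
)

# Attribute query: which edge carries label 'p'?
attr_rows = conn.execute(
    "SELECT src, dst FROM g1_edges WHERE label = 'p'"
).fetchall()

# Structural query: all triangles. Each undirected edge is stored once,
# so a symmetric view is built first, then joined with itself three times.
conn.execute("""
    CREATE VIEW g1_sym AS
    SELECT src, dst FROM g1_edges
    UNION
    SELECT dst, src FROM g1_edges
""")
triangles = conn.execute("""
    SELECT e1.src, e1.dst, e2.dst
    FROM g1_sym e1
    JOIN g1_sym e2 ON e1.dst = e2.src
    JOIN g1_sym e3 ON e2.dst = e3.src AND e3.dst = e1.src
    WHERE e1.src < e1.dst AND e1.dst < e2.dst
    ORDER BY 1, 2, 3
""").fetchall()
```

Each additional edge in the query pattern adds another join, which is the cost the slide refers to.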
  13. 13. Storing Graph Data Example graph with nodes A, B, C, D and edge labels p, q, r, s, t. Maximum walks: A r D t B s C p A q B
  14. 14. Storing Graph Data <ul><li>Maximum walks </li></ul><ul><ul><li>Stores all walks of maximum possible length in the graph </li></ul></ul><ul><ul><li>Traversable graphs stored as a single sequence </li></ul></ul><ul><ul><li>Easy to answer attribute queries and reachability queries </li></ul></ul><ul><ul><li>Non-traversable graphs need multiple sequences </li></ul></ul><ul><ul><li>Variable record length for sequences </li></ul></ul><ul><ul><li>Significant pre-processing time for reducing graph to the best set of sequences </li></ul></ul>
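One way to read "traversable" is that the graph has a walk using every edge exactly once (an Eulerian trail). Under that assumption, the sketch below checks traversability and, when possible, produces a single node/label sequence with Hierholzer's algorithm. The function name `single_walk` and the edge-list format are hypothetical, and this is not necessarily how the storage model in the talk computes its sequences.

```python
from collections import defaultdict

def single_walk(edges):
    """Return [node, label, node, ...] covering every edge once, or None.

    edges is a list of (u, v, label) for an undirected graph. None is
    returned when the graph is not traversable (more than two nodes of
    odd degree, or not connected).
    """
    if not edges:
        return []
    adj = defaultdict(list)  # node -> list of (edge index, neighbor)
    for i, (u, v, _) in enumerate(edges):
        adj[u].append((i, v))
        adj[v].append((i, u))
    odd = sorted(n for n in adj if len(adj[n]) % 2 == 1)
    if len(odd) not in (0, 2):
        return None
    start = odd[0] if odd else next(iter(adj))

    used = [False] * len(edges)
    stack = [(start, None)]  # (node, index of the edge used to reach it)
    popped = []
    while stack:
        node = stack[-1][0]
        # Drop edges already consumed from the other endpoint.
        while adj[node] and used[adj[node][-1][0]]:
            adj[node].pop()
        if adj[node]:
            i, nxt = adj[node].pop()
            used[i] = True
            stack.append((nxt, i))
        else:
            popped.append(stack.pop())
    if not all(used):
        return None  # some edges lie in another component
    popped.reverse()
    walk = [popped[0][0]]
    for node, i in popped[1:]:
        walk += [edges[i][2], node]
    return walk

EDGES = [("A", "B", "q"), ("B", "C", "s"), ("B", "D", "t"),
         ("A", "C", "p"), ("A", "D", "r")]
walk = single_walk(EDGES)
```

For the example graph, nodes A and B have odd degree, so any single covering walk must start at one of them and end at the other.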
  15. 15. Storing Graph Data Linear DFS Tree (example: Glide). Example graph with nodes A, B, C, D and edge labels p, q, r, s, t, written as: A%1 /p/ C /s/ B%1q /t/ D%1r
  16. 16. Storing Graph Data <ul><li>Linear DFS Tree </li></ul><ul><ul><li>A sequence form of depth-first traversal of the graph </li></ul></ul><ul><ul><li>Suitable for any kind of undirected graphs (but not necessarily for directed graphs) </li></ul></ul><ul><ul><li>Suitable for attribute queries </li></ul></ul><ul><ul><li>Some techniques proposed for substructure queries over linear DFS trees </li></ul></ul><ul><ul><li>Large pre-processing time </li></ul></ul>
  17. 17. Storing Graph Data XML with IDREFS: Example graph with nodes A, B, C, D: <node id="A" adj="C D"> <node id="B"> <node id="C"> </node> <node id="D"> </node> </node> </node>
  18. 18. Storing Graph Data <ul><li>XML with IDREFS </li></ul><ul><ul><li>Reduces graph database to an XML base </li></ul></ul><ul><ul><li>Use XPath / XQuery engines for structural queries </li></ul></ul><ul><ul><li>Widely supported by a variety of XML parsers </li></ul></ul><ul><ul><li>Costly structure/sub-structure matching </li></ul></ul><ul><ul><li>Needs distinction between IDREF edges and hierarchy edges </li></ul></ul>
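The distinction between hierarchy edges and IDREF edges can be made concrete with a short sketch using Python's standard `xml.etree.ElementTree`. The element name `node` and the `id` and `adj` attributes follow the example on the previous slide; the function name `edges_from_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<node id="A" adj="C D">
  <node id="B">
    <node id="C"></node>
    <node id="D"></node>
  </node>
</node>
"""

def edges_from_xml(text):
    """Recover graph edges, keeping the two kinds of edge apart.

    Hierarchy edges come from element nesting; IDREF edges come from
    the space-separated ids in the adj attribute.
    """
    root = ET.fromstring(text)
    hierarchy, idref = set(), set()

    def visit(elem):
        for ref in elem.get("adj", "").split():
            idref.add(frozenset((elem.get("id"), ref)))
        for child in elem.findall("node"):
            hierarchy.add(frozenset((elem.get("id"), child.get("id"))))
            visit(child)

    visit(root)
    return hierarchy, idref

hierarchy_edges, idref_edges = edges_from_xml(DOC)
```

A structural query has to consult both sets, because the same graph could be serialized with a different choice of which edges become nesting and which become references.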
  19. 19. Graph Database Models <ul><li>“ Schema-less” collection of graphs </li></ul><ul><ul><li>Example: GraphGrep, Daylight ACD, gIndex </li></ul></ul><ul><li>Database as a graph </li></ul><ul><ul><li>Example: SUBDUE </li></ul></ul><ul><li>Database with schema and views </li></ul><ul><ul><li>Example: GRACE </li></ul></ul>
  20. 20. Structural Indexes <ul><li>Used for fast structure-based retrieval of graphs </li></ul><ul><li>Primarily meant for labeled undirected graphs </li></ul><ul><li>Usually support substructure and structural similarity searches </li></ul><ul><li>May return either exact matches (an NP-complete problem) or inexact matches based on polynomial-time heuristics </li></ul>
  21. 21. Structural Indexes GraphGrep (Giugno and Shasha 2002) Two index files: a “Fingerprint” file holding label-paths and a “Path” file holding id-paths, for paths from length 1 up to a maximum l_p
  22. 22. Structural Indexes GraphGrep (Giugno and Shasha 2002) Example database graph G1 with nodes 1:A, 2:B, 3:A, 4:D. Fingerprint file (Path: count in G1, count in G2): AA: 2, 0; AB: 2, 1; AD: 1, 1; BD: 1, 0; AAB: 2, 1; ABA: 2, 0
  23. 23. Structural Indexes GraphGrep (Giugno and Shasha 2002) Example database graph G1 with nodes 1:A, 2:B, 3:A, 4:D. Paths file (Path: id-paths in G1): AA: {1-3, 3-1}; AB: {1-2, 3-2}; AD: {1-4}; BD: {2-4}; AAB: {1-3-2, 3-1-2}; ABA: {1-2-3, 3-2-1}
  24. 24. Structural Indexes <ul><li>GraphGrep </li></ul><ul><ul><li>Stores all paths in member graphs up to a maximum length </li></ul></ul><ul><ul><li>Signature file narrows search space </li></ul></ul><ul><ul><li>Exact substructure matching possible when node id in query matches node id in member graphs </li></ul></ul><ul><ul><li>Exponential preparation time </li></ul></ul><ul><ul><li>Running time increases exponentially as max path length increases </li></ul></ul>
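A minimal sketch of the path-counting idea, under stated assumptions: path length is measured in edges, paths do not revisit a node, each id-path is counted once per direction, and the edge list for G1 is inferred from the id-paths shown on the slides. The names `label_paths` and `may_contain` are hypothetical, and the real GraphGrep hashes its fingerprints, which is not reproduced here.

```python
from collections import Counter

def label_paths(labels, edges, max_len):
    """Count simple paths with 1..max_len edges, keyed by label sequence."""
    adj = {n: set() for n in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = Counter()

    def extend(path):
        if len(path) > 1:
            counts["".join(labels[n] for n in path)] += 1
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for n in labels:
        extend([n])
    return counts

def may_contain(graph_fp, query_fp):
    """Filter step: a graph can contain the query only if it has at
    least as many occurrences of every label-path as the query does."""
    return all(graph_fp[p] >= c for p, c in query_fp.items())

# G1 as assumed from the slides: nodes 1:A, 2:B, 3:A, 4:D.
G1_LABELS = {1: "A", 2: "B", 3: "A", 4: "D"}
G1_EDGES = [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4)]
g1_fp = label_paths(G1_LABELS, G1_EDGES, 2)

# Query: a path A - B - D.
query_fp = label_paths({"x": "A", "y": "B", "z": "D"},
                       [("x", "y"), ("y", "z")], 2)
candidate = may_contain(g1_fp, query_fp)
```

The filter only narrows the search space; graphs that pass still need an exact match check. The number of paths grows quickly with `max_len`, which is the exponential cost noted above.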
  25. 25. Structural Indexes Hierarchical Conceptual Clusters (SUBDUE) (Jonyer, Cook, Holder 2001) Database Graph 1 Graph 2 Concept 1 Concept 2 Rest of Graph 1 Rest of Graph 2 Concept 1.1
  26. 26. Structural Indexes <ul><li>Hierarchical Conceptual Clusters </li></ul><ul><ul><li>Clusters the database into commonly occurring substructures </li></ul></ul><ul><ul><li>Database is organized as a hierarchical index </li></ul></ul><ul><ul><li>Clustering based on substructures that perform “best compression” by reducing graph description length </li></ul></ul><ul><ul><li>Number of clusters may increase exponentially </li></ul></ul><ul><ul><li>Compression / search time significant </li></ul></ul>
  27. 27. Structural Indexes Hierarchical Vector Spaces (Grace 1) (Srinivasa, Acharya, Khare, Agrawal, 2002) Example graph with nodes A, B, A, D. Level 1 vector: <ul><ul><li>A:A → 1 </li></ul></ul><ul><ul><li>A:B → 2 </li></ul></ul><ul><ul><li>A:D → 1 </li></ul></ul><ul><ul><li>B:D → 1 </li></ul></ul>
  28. 28. Structural Indexes Hierarchical Vector Spaces (Grace 1) Example graph with nodes A, B, A, D. Level 2 graphs and vectors, with level 2 nodes AA, BD, AB, AD: <ul><ul><li>AA:BD → 1 </li></ul></ul><ul><ul><li>AB:AD → 1 </li></ul></ul>
  29. 29. Structural Indexes <ul><li>Hierarchical vector spaces </li></ul><ul><ul><li>Hashes a graph onto vectors in a hierarchy of vector spaces </li></ul></ul><ul><ul><li>Higher level graphs are formed by replacing edges (vectors) of lower level by nodes </li></ul></ul><ul><ul><li>Compression of a graph may lead to several higher level graphs </li></ul></ul><ul><ul><li>Fast structural similarity searches; but based on inexact matching </li></ul></ul><ul><ul><li>View explosion anomaly during refinement </li></ul></ul>
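The level 1 vector can be sketched as a count of edges per unordered pair of node labels. The edge list below is an assumption chosen so that the result reproduces the level 1 vector shown for the A, B, A, D example; `level1_vector` and `l1_distance` are hypothetical names, and the L1 distance is only one plausible way to compare vectors, not necessarily the measure used in GRACE. Higher levels of the hierarchy are not covered.

```python
from collections import Counter

def level1_vector(labels, edges):
    """Count edges by the (sorted) pair of labels at their endpoints."""
    return Counter(
        ":".join(sorted((labels[u], labels[v]))) for u, v in edges
    )

def l1_distance(a, b):
    """Sum of absolute differences over all dimensions of two vectors."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

# Assumed example graph: nodes 1:A, 2:B, 3:A, 4:D.
LABELS = {1: "A", 2: "B", 3: "A", 4: "D"}
EDGES = [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4)]
vec = level1_vector(LABELS, EDGES)

# A second graph that lacks the A-A edge.
other = level1_vector(LABELS, [(1, 2), (2, 3), (1, 4), (2, 4)])
gap = l1_distance(vec, other)
```

Two different graphs can map to the same vector, which is why searches over this kind of index are fast but inexact.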
  30. 30. Structural Indexes <ul><li>gIndex (Yan, Yu, Han, 2004) </li></ul><ul><li>Mine database for frequent substructures (using gSpan) </li></ul><ul><li>Maintain index structure containing (size, substructure) pairs </li></ul><ul><li>Increase minsup as the size of the indexed substructure increases </li></ul>
  31. 31. Structural Indexes <ul><li>gIndex (Yan, Yu, Han, 2004) </li></ul><ul><li>Given a query graph q: </li></ul><ul><li>Mine the database along with q, and determine the set F of all frequent substructures in q </li></ul><ul><li>Reduce the search space to the graphs that contain every substructure in F </li></ul><ul><li>Perform graph matching against all graphs in the reduced search space </li></ul>
  32. 32. Graph Mining <ul><li>Given a database of graphs find all frequently occurring substructures in the database </li></ul>
  33. 33. Notes on Frequent Item-set Mining <ul><li>The Apriori algorithm is useful for mining frequent item-sets from transaction logs </li></ul><ul><li>Apriori relies on the fact that every subset of a frequent item-set is itself frequent, so candidate L item-sets need to be built only from the frequent (L-1) item-sets </li></ul><ul><li>The Apriori property also holds for frequent subgraphs </li></ul><ul><li>However, running the Apriori algorithm on a graph database requires many subgraph isomorphism checks! </li></ul>
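For reference, the item-set version of Apriori fits in a short sketch. Candidates of size L are formed by joining frequent (L-1) item-sets, pruned if any (L-1)-subset is not frequent, and only then counted against the transactions. The function name `apriori` and the toy transactions are illustrative, and `minsup` is taken here as an absolute count.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent item-set: support count}, level by level."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= minsup}
    frequent = {}
    size = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        size += 1
        # Join step: unions of two frequent sets that give a set of the new size.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == size}
        # Prune step: every (size-1)-subset must already be frequent.
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, size - 1))
                 and support(c) >= minsup}
    return frequent

TRANSACTIONS = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                {"b", "c"}, {"a", "b", "c"}]
result = apriori(TRANSACTIONS, 3)
```

For graphs, the subset test `itemset <= t` becomes a subgraph isomorphism test, which is what makes a direct translation expensive.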
  34. 34. Apriori Based Graph Mining <ul><li>Strategy for Apriori-based graph mining </li></ul><ul><ul><li>Use a re-write strategy to represent each graph in the database as a unique sequence </li></ul></ul><ul><ul><li>Substructure search reduces to a sub-sequence search </li></ul></ul><ul><ul><li>Use AprioriAll (Apriori for sequences) to mine the database </li></ul></ul><ul><ul><li>The best known rewrite mechanism to date is the one proposed in gSpan. </li></ul></ul>
  35. 35. gSpan Example graph: nodes A, B, A, D with visiting times 0, 1, 2, 3 and edge labels p, q, r, p. <ul><li>First build a DFS tree (shown in thick lines) </li></ul><ul><li>Mark each node by its visiting time in the DFS run (shown by numeral) </li></ul><ul><li>Write the graph as a sequence based on node visiting time. Append all back links from a node after the first forward link into the node. </li></ul>
  36. 36. gSpan Example graph: nodes A, B, A, D with visiting times 0, 1, 2, 3 and edge labels p, q, r, p. Sequence: (0,1,A,q,B)(1,2,B,r,A)(2,0,A,p,A)(1,3,B,p,D) Since a graph has many DFS trees, consider only the DFS tree that yields the sequence with the least lexicographic value.
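The rewrite described on these two slides can be sketched as follows. `dfs_code` emits one tuple per edge for a given visiting priority, placing a node's back links right after the forward link into it, and `min_dfs_code` brute-forces every visiting priority to pick the smallest sequence. Both names are hypothetical, the graph is assumed to be simple and connected, and the comparison is plain Python tuple ordering, which is a simplification of the ordering gSpan itself defines.

```python
from itertools import permutations

def dfs_code(labels, edges, order):
    """Edge sequence of the DFS that prefers nodes earlier in `order`.

    labels: node -> label; edges: {frozenset({u, v}): edge label}.
    Each tuple is (time_from, time_to, label_from, edge_label, label_to).
    """
    rank = {n: i for i, n in enumerate(order)}
    adj = {n: [] for n in labels}
    for e in edges:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    time = {}  # node -> visiting time
    code = []

    def visit(node, parent):
        time[node] = len(time)
        # Back links: already-visited neighbours other than the tree parent.
        back = [n for n in adj[node] if n in time and n != parent]
        for n in sorted(back, key=time.get):
            code.append((time[node], time[n], labels[node],
                         edges[frozenset((node, n))], labels[n]))
        for n in sorted(adj[node], key=rank.get):
            if n not in time:
                # Forward link: n will receive the next visiting time.
                code.append((time[node], len(time), labels[node],
                             edges[frozenset((node, n))], labels[n]))
                visit(n, node)

    visit(order[0], None)
    return code

def min_dfs_code(labels, edges):
    """Smallest code over all visiting priorities (factorial cost)."""
    return min(dfs_code(labels, edges, list(p)) for p in permutations(labels))

LABELS = {"a": "A", "b": "B", "c": "A", "d": "D"}
EDGES = {frozenset("ab"): "q", frozenset("bc"): "r",
         frozenset("ca"): "p", frozenset("bd"): "p"}
example_code = dfs_code(LABELS, EDGES, ["a", "b", "c", "d"])
canonical = min_dfs_code(LABELS, EDGES)
```

Because the minimum is taken over every possible DFS, two isomorphic graphs get the same sequence regardless of how their nodes are named, which is what lets substructure search fall back on sequence comparison.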
  37. 37. Filtration Based Technique (FBT) <ul><li>Proposed by Srinivasa and BalaSundaraRaman (Submitted after first revision to IEEE TKDE) </li></ul><ul><li>Opposite of Apriori construction on graphs but equivalent to Apriori on walks </li></ul><ul><li>Starts with an assertion that all graphs in the database are isomorphic </li></ul><ul><li>Filters away all edges that contradict such an assertion </li></ul><ul><li>Algorithm converges to the maximal common (frequent) subgraph. </li></ul>
  38. 38. Filtration Based Technique (FBT) <ul><li>Filtration is based on enumerating label-walks in the graphs. Label walks accentuate differences between graphs as the length of the walks increases… </li></ul>
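  39. 39. FBT Length-1 walks. Graph 1 (nodes A, B, C, A): AB, AB, BC, AC. Graph 2 (nodes B, A, C, B): AB, AB, BC, AC
  40. 40. FBT Length-2 walks. Graph 1 (nodes A, B, C, A): ABA, ABC, BCA, BAC, ABA, ACB. Graph 2 (nodes B, A, C, B): ABC, ACB, BCA, BAC, BAB, BAC. ABA occurs only in Graph 1 and BAB only in Graph 2.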
  41. 41. FBT <ul><li>1. Set i = 1 </li></ul><ul><li>2. Enumerate walks of length i from member graphs and organize them into different buckets based on label sequence </li></ul><ul><li>3. Discard buckets that don’t have minsup </li></ul><ul><li>4. i++ </li></ul><ul><li>5. Remove all graphs that don’t have walks of length i and keep them as intermediate results </li></ul><ul><li>6. Go to step 2 until no more walks exist </li></ul>
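The bucketing and minsup steps of the loop above can be illustrated with a rough sketch. It is not the FBT algorithm itself: the edge filtration and the handling of intermediate results are left out, walks are assumed never to reuse an edge (so the loop terminates), and the two example graphs are assumed shapes consistent with the walk lists on the earlier slides. The names `label_walks` and `frequent_label_walks` are hypothetical.

```python
from collections import defaultdict

def label_walks(labels, edges, length):
    """Label sequences of walks with `length` edges, no edge reused."""
    adj = defaultdict(list)  # node -> list of (edge index, neighbor)
    for i, (u, v) in enumerate(edges):
        adj[u].append((i, v))
        adj[v].append((i, u))
    found = set()

    def extend(node, used, seq):
        if len(used) == length:
            found.add(seq)
            return
        for i, nxt in adj[node]:
            if i not in used:
                extend(nxt, used | {i}, seq + (labels[nxt],))

    for n in labels:
        extend(n, frozenset(), (labels[n],))
    return found

def frequent_label_walks(db, minsup):
    """For each length i, the label sequences found in >= minsup graphs."""
    surviving = {}
    i = 1
    while True:
        buckets = defaultdict(set)  # label sequence -> graph ids
        for gid, (labels, edges) in db.items():
            for seq in label_walks(labels, edges, i):
                buckets[seq].add(gid)
        kept = {s: g for s, g in buckets.items() if len(g) >= minsup}
        if not kept:
            return surviving
        surviving[i] = kept
        i += 1

# Assumed shapes for the two example graphs.
DB = {
    "G1": ({1: "A", 2: "B", 3: "C", 4: "A"},
           [(1, 2), (4, 2), (2, 3), (1, 3)]),
    "G2": ({1: "B", 2: "A", 3: "C", 4: "B"},
           [(2, 1), (2, 4), (1, 3), (2, 3)]),
}
walks = frequent_label_walks(DB, 2)
```

With these graphs, every length-1 label sequence is shared, while at length 2 the sequences ABA and BAB fall below minsup, matching the point that longer walks accentuate the differences.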
  42. 42. FBT <ul><li>Very fast convergence, but can find only maximal common substructures </li></ul><ul><li>If two or more common substructures overlap, FBT cannot separate the substructures </li></ul><ul><li>Applied successfully to carcinogen dataset from US NTP, protein structures from PDB and Web traversal logs from Yahoo. </li></ul>
  43. 43. GRACE2 and Safari <ul><li>Second version of GRACE </li></ul><ul><li>Supports a query algebra for graph queries, views, and dynamic schemas </li></ul><ul><li>Query language called Safari </li></ul>
  44. 44. GRACE2 Data Model <ul><li>Member graphs </li></ul><ul><li>Node, edge and graph attributes </li></ul><ul><li>The “default” graph </li></ul><ul><li>Schema graphs and meta-graphs </li></ul>
  45. 45. Safari Constructs <ul><li>selectin <cond> <graphref> </li></ul><ul><ul><li>Use graphref as a schema and return a view of the schema based on cond </li></ul></ul><ul><li>selecton <cond> <graphref> </li></ul><ul><ul><li>Search for cond within the graph referred to by graphref and return a subgraph </li></ul></ul><ul><li>selectgraph <cond> <graphref> </li></ul><ul><ul><li>Retrieve the graph matching cond from the schema or meta-graph referred to by graphref. If more than one graph matches cond, another view is returned. </li></ul></ul>
  46. 46. References <ul><li>1. I. Jonyer, D.J. Cook, L.B. Holder. Graph-Based Hierarchical Conceptual Clustering. Journal of Machine Learning Research, Vol 2, 2001. </li></ul><ul><li>2. Rosalba Giugno, Dennis Shasha. GraphGrep: A Fast and Universal Method for Querying Graphs. Proc of ICPR 2002. </li></ul><ul><li>3. Srinath Srinivasa, Sumit Acharya, Rajat Khare, Himanshu Agrawal. Vectorization of Structure for Indexing Graph Databases. Proc of IASTED Int’l Conf on Information Systems and Databases, ISDB 2002, Tokyo, Japan. </li></ul><ul><li>4. Srinath Srinivasa, Sujit Kumar. A Platform Based on the Multi-Dimensional Data Model for Analysis of Bio-Molecular Structures. Proc of VLDB 2003, Berlin, Germany. </li></ul>
  47. 47. References <ul><li>5. Xifeng Yan, Jiawei Han. gSpan: Graph-Based Substructure Pattern Mining. </li></ul><ul><li>6. Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Substructure Based Approach. Proc of SIGMOD 2004. </li></ul><ul><li>7. Srinath Srinivasa, Martin Meier, Mandar R. Mutalikdesai, Gopinath P.S., Gowrishankar K.A. LWI and Safari: A New Index Structure and Query Model for Graph Databases. </li></ul>
  48. 48. Thank You! For more interaction, contact me at [email_address]