Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Towards Querying Big Graphs 
Arijit Khan 
Systems Group 
Department of Computer Science 
ETH Zürich 
arijit.khan@inf.ethz....
Big Data as Big Graph 
Knowledge Graph 
Bill Gates 
Sergey Brin 
Maryland 
Harvard 
Microsoft 
Stanford 
Jane Stanford 
Se...
More Graph Data 
Social Network Program Flow 
Biological Network 
Chemical Structure 
Transportation Network Image Data 
G...
Emerging Graph Queries 
(ICDE 2012, Tutorial) 
Find an Asian restaurants on my route to the bank that my 
friends rated hi...
My Work in Graphs 
Big-Graphs 
Graph Query 
Tutorial, ICDE 2012 
Heterogeneous 
Graph Search 
Uncertain Graphs 
Influence ...
Querying Knowledge Graphs 
Answering Web Questions Using Structured Data – Dream or 
Reality? - VLDB’09 Panel Discussion 
...
Subgraph Search: Application in 
Malware Detection 
- Is there a Malware embedded in the software? 
Security and Intrusion...
Why is Graph Querying Difficult? 
Heterogeneity  lack of standardized 
schema, or quite large schema (65K for 
DBpedia Da...
Why is Graph Querying Difficult? 
Heterogeneity  lack of standardized 
schema, or quite large schema (65K for 
DBpedia Da...
Graph Query 
Query: Find a project that employee “Bob” is working on, and that project is 
supervised by the same Team_Lea...
Graph Query 
Query: Find a project that employee “Bob” is working on, and that project is 
supervised by the same Team_Lea...
Graph Query 
Query: Find a project that employee “Bob” is working on, and that project is 
supervised by the same Team_Lea...
Graph Query 
Query: Find a project that employee “Bob” is working on, and that project is 
supervised by the same Team_Lea...
Approximate Subgraph Search 
for Graph Querying 
If two entities are close in the query graph, they should also be close i...
Approximate Subgraph Search 
for Graph Querying 
If two entities are close in the query graph, they should also be close i...
Approximate Subgraph Search 
for Graph Querying 
If two entities are close in the query graph, they should also be close i...
Graph to Vector Conversion 
Convert h-hop neighborhood of each node into a multi-dimensional 
vector. 
u1 
u2 u3 
u4 
u5 
...
Graph to Vector Conversion 
Convert h-hop neighborhood of each node into a multi-dimensional 
vector. 
u1 
u2 u3 
u4 
u5 
...
Node Matching Cost 
Node Matching Cost: 
Label Matching Cost + Difference between the neighborhood vectors 
Label Matching...
Node Matching Cost 
Node Matching Cost: 
Label Matching Cost + Difference between the neighborhood vectors 
Neighborhood 
...
Subgraph Matching Cost 
Subgraph Matching Cost: 
- Summation of individual 
node matching costs. 
u1 
u2 u3 
u4 
u5 
Data ...
Problem Formulation 
Given a data graph G, a query graph Q and the label difference threshold ϵ, find 
the minimum cost ma...
Subgraph Matching Cost Function 
Properties 
The cost of a subgraph isomorphic embedding = 0. 
Our subgraph matching cost ...
Subgraph Matching Cost Function 
Complexity 
The problem of finding the minimum cost graph matching is NP-hard. 
The probl...
Key Idea: Inference over Structure 
and Label Similarity 
Inference over Random 
Markov Field 
Each local factor depends 
...
Iterative Inference Algorithm: 
Loopy Belief Propagation 
If a node has ‘’good” neighbors, more likely it is a “good” matc...
Iterative Inference Algorithm: 
Loopy Belief Propagation 
If a node has ‘’good” neighbors, more likely it is a “good” matc...
Iterative Inference Algorithm: 
Loopy Belief Propagation 
If a node has ‘’good” neighbors, more likely it is a “good” matc...
Iterative Inference Algorithm: 
Loopy Belief Propagation 
If a node has ‘’good” neighbors, more likely it is a “good” matc...
Time Complexity: 
NeMa Subgraph Matching 
O(nG . nQ + T . nQ . mQ 
2. dQ) 
- nG = # of nodes in target network G 
- nQ = #...
O(nG . nQ + T . nQ . mQ 
2. dQ) 
- nG = # of nodes in target network G 
- nQ = # of nodes in query graph Q 
- mQ = Avg. # ...
Experiment Setup: 
NeMa Subgraph Matching 
extract subgraphs (7 nodes, diameter 3) from the data graph, add noises to the ...
Structural Noise: # edge updates in the query graph/ # edges in the extracted 
query graph. 
18/20 
Experiment Setup: 
NeM...
Structural Noise: # edge updates in the query graph/ # edges in the extracted 
query graph. 
Label Noise: Original node la...
Structural Noise: # edge updates in the query graph/ # edges in the extracted 
query graph. 
Label Noise: Original node la...
Experimental Results: 
NeMa Subgraph Matching 
0.92 
0.9 
0.88 
0.86 
0.84 
0.82 
0.8 
0.78 
0.76 
IMDB YAGO DBpedia 
Accu...
Graph Query by Example 
(with Chengkai Li, UT Arlington, ICDE 2014) 
Q1. Donald Knuth, Turing Award, Stanford University 
...
Conclusions 
Big-data as big-graphs 
Emerging queries on big-data 
User-friendly, efficient, and effective online-query-an...
Big Picture 
Machine 
Learning 
Databases 
Graph 
Algorithms 
Stream 
Algorithms 
Systems 
Distributed 
Computing 
Social ...
Upcoming SlideShare
Loading in …5
×

Dagstuhl seminar talk on querying big graphs

1,039 views

Published on

Dagstuhl Seminar on "Systems and Algorithms for Large Scale Graph Analytics"

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Dagstuhl seminar talk on querying big graphs

  1. 1. Towards Querying Big Graphs Arijit Khan Systems Group Department of Computer Science ETH Zürich arijit.khan@inf.ethz.ch
  2. 2. Big Data as Big Graph Knowledge Graph Bill Gates Sergey Brin Maryland Harvard Microsoft Stanford Jane Stanford Seattle Steve Woznaik Jerry Yang Apple NeXT Ajim Premji Wipro Yahoo! Silicon Valley Google Founded in founded founded nationality 1/20
  3. 3. More Graph Data Social Network Program Flow Biological Network Chemical Structure Transportation Network Image Data Graphs in Machine Learning
  4. 4. Emerging Graph Queries (ICDE 2012, Tutorial) Find an Asian restaurants on my route to the bank that my friends rated highly. Find a colleague of mine who goes to ETH Zurich, and who has a colleague working in University of Washington. Graph Pattern Matching Queries Graph Pattern Mining Queries Are Lady Gaga, Katy Perry, Britney Spears similar kind of singers? Given an executable, is it a malware? Graph Reachability Queries Who are the top-k most influential persons in Twitter? Graph Stream Queries Given a group of people, does their communication pattern resemble a terrorist network? 3/20
  5. 5. My Work in Graphs Big-Graphs Graph Query Tutorial, ICDE 2012 Heterogeneous Graph Search Uncertain Graphs Influence Maximization Graph Pattern Mining SIGMOD 2011 VLDB 2013 ICDE 2014 EDBT 2014 SDM 2011 GMatrix SIGMOD 2010 SIGMOD 2012 Tutorial, VLDB 2014 CloudMan 2012 CIKM 2012 SIGMOD 2014 Anomaly Detection Distributed Graph Systems Big-Data (Relational) Time-Series Stream Querying Graph Streams
  6. 6. Querying Knowledge Graphs Answering Web Questions Using Structured Data – Dream or Reality? - VLDB’09 Panel Discussion “People who like things I like” – an example of Facebook graph search query (January, 2013) Google Knowledge Graph (May, 2012) 4/20
  7. 7. Subgraph Search: Application in Malware Detection - Is there a Malware embedded in the software? Security and Intrusion Detection - Cesare et. al., TrustCom ’11 - Fredrikson et. al., IEEE Symposium on Security and Privacy ‘10 Function 9 Function 3 Function 7 Function 4 Function 2 Function 10 Function 1 Function 8 Function 11 Function 5 Function 6 Exception Char [], double Char[], double Long, float Exception An Example CallGraph Malware Pattern 5/20
  8. 8. Why is Graph Querying Difficult? Heterogeneity  lack of standardized schema, or quite large schema (65K for DBpedia Data). Uncertainty, noise, dynamic updates Massive volume Relational Data (Structured) vs. Heterogeneous Graph Data (Semi-structured) 6/20 Recursive joins  (useless) large intermediate results  Not scalable [Zeng et. al., VLDB ‘13]
  9. 9. Why is Graph Querying Difficult? Heterogeneity  lack of standardized schema, or quite large schema (65K for DBpedia Data). Relational Data (Structured) vs. Heterogeneous Graph Data (Semi-structured) 6/20 Uncertainty, noise, dynamic updates Massive volume Recursive joins  (useless) large intermediate results  Not scalable [Zeng et. al., VLDB ‘13]
  10. 10. Graph Query Query: Find a project that employee “Bob” is working on, and that project is supervised by the same Team_Leader who also works with employee “Alice”. Query Graph 1 Bob (Team Leader) Alice (Project) (Department) Bob (Employee) supervise (Project) work_in Knowledge Graph (Employee) ? (Employee) ? PM01 (Employee) Alice Y. DP0 John (Team Leader) PM02 Alice M. work_in Rob (Project) (Team Leader) work_on DP1 work_in (Employee) (Department) 7/20
  11. 11. Graph Query Query: Find a project that employee “Bob” is working on, and that project is supervised by the same Team_Leader who also works with employee “Alice”. Query Graph 1 Bob (Team Leader) Alice (Project) (Employee) ? (Employee) ? 7/20
  12. 12. Graph Query Query: Find a project that employee “Bob” is working on, and that project is supervised by the same Team_Leader who also works with employee “Alice”. Query Graph 1 Bob (Team Leader) Alice (Project) (Employee) ? (Employee) ? ? Query Graph 2 Bob (Team Leader) Alice (Project) (Employee) (Employee) ? ? (Project) 7/20
  13. 13. Graph Query Query: Find a project that employee “Bob” is working on, and that project is supervised by the same Team_Leader who also works with employee “Alice”. Query Graph 1 Bob (Team Leader) Alice (Project) (Department) Bob (Employee) supervise (Project) work_in Knowledge Graph (Employee) ? (Employee) ? ? Query Graph 2 Bob (Team Leader) Alice (Project) (Employee) (Employee) ? ? (Project) PM01 (Employee) Alice Y. DP0 John (Team Leader) PM02 Alice M. work_in Rob (Project) (Team Leader) work_on DP1 work_in (Employee) (Department) 7/20
  14. 14. Approximate Subgraph Search for Graph Querying If two entities are close in the query graph, they should also be close in the data graph. [VLDB 2013] 8/20 Relaxed Notion of graph matching  allows noise and small mismatch
  15. 15. Approximate Subgraph Search for Graph Querying If two entities are close in the query graph, they should also be close in the data graph. [VLDB 2013] Subgraph Isomorphism Subgraph Similarity Metrics NP-hard  time consuming. too strict to find approximate matches. Graph Edit Distance, Maximum Common Subgraph, # of Missing Edges. Not suitable to preserve closeness among entities. 8/20 Relaxed Notion of graph matching  allows noise and small mismatch
  16. 16. Approximate Subgraph Search for Graph Querying If two entities are close in the query graph, they should also be close in the data graph. [VLDB 2013] preserve h-hop neighborhood information of each query node. u1 u2 u3 u4 u5 2-hop neighborhood of u1 8/20 Relaxed Notion of graph matching  allows noise and small mismatch
  17. 17. Graph to Vector Conversion Convert h-hop neighborhood of each node into a multi-dimensional vector. u1 u2 u3 u4 u5 Information Propagation Distance between u and u’ 9/20
  18. 18. Graph to Vector Conversion Convert h-hop neighborhood of each node into a multi-dimensional vector. u1 u2 u3 u4 u5 Information Propagation Distance between u and u’  RG (u1) = { ⟨u2, 0.5⟩, ⟨u3, 0.5⟩, ⟨u4,0.25⟩ } Previous Applications of Information Propagation: Semi-supervised Learning [AI’ 08], Concept Propagation [CIKM ’06] 9/20
  19. 19. Node Matching Cost Node Matching Cost: Label Matching Cost + Difference between the neighborhood vectors Label Matching Cost Neighborhood Matching Cost L1 Difference Between Neighborhood Vectors u1 u2 u3 u4 u5 Target Network (G) v1 v2 v4 Φ f(v1,u1) Φ f(v2,u3) Φ f(v4,u4) Query Graph (Q) 10/20
  20. 20. Node Matching Cost Node Matching Cost: Label Matching Cost + Difference between the neighborhood vectors Neighborhood Matching Cost L1 Difference Between Neighborhood Vectors 10/20
  21. 21. Subgraph Matching Cost Subgraph Matching Cost: - Summation of individual node matching costs. u1 u2 u3 u4 u5 Data Graph (G) v1 v2 v4 f(v1,u1) f(v2,u3) Query Graph (Q) Φ Φ Φ f(v4,u4) all query nodes v Subgraph Matching Cost Model 11/20
  22. 22. Problem Formulation Given a data graph G, a query graph Q and the label difference threshold ϵ, find the minimum cost matching Φ arg min C(Φ) Φ such that, ΔL(lv, lu) ≤ ϵ for all v ∈ V(Q), u = Φ(v). u1 u2 u3 u4 u5 Data Graph (G) v1 v2 v4 f(v2,u3) Query Graph (Q) Φ f(v1,u1) Φ Φ f(v4,u4) Subgraph Matching Cost Model 12/20
  23. 23. Subgraph Matching Cost Function Properties The cost of a subgraph isomorphic embedding = 0. Our subgraph matching cost function can have false positives. If we permit only one-to-one node matches, there will not be any false positives. 13/20
  24. 24. Subgraph Matching Cost Function Complexity The problem of finding the minimum cost graph matching is NP-hard. The problem of finding the minimum cost matching is APX-hard. 14/20
  25. 25. Key Idea: Inference over Structure and Label Similarity Inference over Random Markov Field Each local factor depends only on a subset of variables Maximize over all variables NeMa Graph Matching Cost VQ = { v1, v2, …, vn } all neighbors v’ of the query node v all query nodes v 15/20
  26. 26. Iterative Inference Algorithm: Loopy Belief Propagation If a node has ‘’good” neighbors, more likely it is a “good” match. v1 v2 v3 v4 v5 a b c d e a a a b c b c b c e d d e c u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 u13 u14 Query Graph Data Graph Iterative Inference Algorithm 16/20
  27. 27. Iterative Inference Algorithm: Loopy Belief Propagation If a node has ‘’good” neighbors, more likely it is a “good” match. v1 v2 v3 v4 v5 a b c d e a a a b c b c b c e d d e c u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 u13 u14 Query Graph Data Graph Iterative Inference Algorithm 16/20
  28. 28. Iterative Inference Algorithm: Loopy Belief Propagation If a node has ‘’good” neighbors, more likely it is a “good” match. v1 v2 v3 v4 v5 a b c d e a a a b c b c b c e d d e c u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 u13 u14 Query Graph Data Graph Iterative Inference Algorithm 16/20
  29. 29. Iterative Inference Algorithm: Loopy Belief Propagation If a node has ‘’good” neighbors, more likely it is a “good” match. v1 v2 v3 v4 v5 a b c d e a a a b c b c b c e d d e c u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12 u13 u14 Query Graph Data Graph Iterative Inference Algorithm 16/20
  30. 30. Time Complexity: NeMa Subgraph Matching O(nG . nQ + T . nQ . mQ 2. dQ) - nG = # of nodes in target network G - nQ = # of nodes in query graph Q - mQ = Avg. # of matches per query node - dQ = Avg # of h-hop neighbors of a query node - T = # of iterations Time-Complexity Linear in the number of nodes in the target graph !!  17/20
  31. 31. O(nG . nQ + T . nQ . mQ 2. dQ) - nG = # of nodes in target network G - nQ = # of nodes in query graph Q - mQ = Avg. # of matches per query node - dQ = Avg # of h-hop neighbors of a query node - T = # of iterations Index node labels to speed up the candidate generation procedure. Pruning of unpromising candidate nodes in successive iterations of inference method. Can be parallelized at the level of each query node  Pregel Model. 17/20 Time Complexity: NeMa Subgraph Matching Time-Complexity Linear in the number of nodes in the target graph !! 
  32. 32. Experiment Setup: NeMa Subgraph Matching extract subgraphs (7 nodes, diameter 3) from the data graph, add noises to the extracted subgraphs. 18/20
  33. 33. Structural Noise: # edge updates in the query graph/ # edges in the extracted query graph. 18/20 Experiment Setup: NeMa Subgraph Matching extract subgraphs (7 nodes, diameter 3) from the data graph, add noises to the extracted subgraphs.
  34. 34. Structural Noise: # edge updates in the query graph/ # edges in the extracted query graph. Label Noise: Original node labels consist of several words. Add extra words in the node labels of the query graph. Label Noise is measured as | w  w | u v   # words in | w w | 1 u v wu and wv are the set of words in labels of query node v and its candidate node u, respectively. Results 35% label noise in query nodes. original label # extra words added 1 1 2 1 3 2 No of Extra words required to add for introducing 35% label noise 18/20 Experiment Setup: NeMa Subgraph Matching extract subgraphs (7 nodes, diameter 3) from the data graph, add noises to the extracted subgraphs.
  35. 35. Structural Noise: # edge updates in the query graph/ # edges in the extracted query graph. Label Noise: Original node labels consist of several words. Add extra words in the node labels of the query graph. Label Noise is measured as | w  w | u v   # words in | w w | 1 u v wu and wv are the set of words in labels of query node v and its candidate node u, respectively. Results 35% label noise in query nodes. original label # extra words added 1 1 2 1 3 2 No of Extra words required to add for introducing 35% label noise u is a considered a candidate of v if their label noise is at most equal to a predefined threshold (set as 50%). Unlabel one query node to simulate real-life query answering process. 18/20 Experiment Setup: NeMa Subgraph Matching extract subgraphs (7 nodes, diameter 3) from the data graph, add noises to the extracted subgraphs.
  36. 36. Experimental Results: NeMa Subgraph Matching 0.92 0.9 0.88 0.86 0.84 0.82 0.8 0.78 0.76 IMDB YAGO DBpedia Accuracy (Top-1 Match) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 IMDB YAGO Dbpedia Top-k Match Finding Time (sec) Top-1 Top-3 Top-5 BLINKS [SIGMOD ‘07] SAGA [Bioinfo ‘06] IsoRank [PNAS ‘08] gStore [VLDB ‘11] NeMa [Our Method, VLDB ‘13] Accuracy 0.52 0.75 0.63 0.59 0.91 Time (Top-1 Match) 1.92 sec 15.95 sec 4882.0 sec 0.92 sec 0.97 sec Comparison with State-of-the-art Keyword Search and Approximate Subgrph Search Methods, using IMDB Dataset Effectiveness of NeMa Efficiency of NeMa 19/20
  37. 37. Graph Query by Example (with Chengkai Li, UT Arlington, ICDE 2014) Q1. Donald Knuth, Turing Award, Stanford University Dennis Ritchie Turing Award Harvard Univ. Ken Thompson Turing Award UC Berkeley Peter Naur Turing Award Niels Bohr Inst. Q2. Soup, Jewish Cuisine, Chicken Soup Coffee Italian Latte Spice Spain Cuisine Paprika Dessert American Apple Pie 28M nodes, 47M edges, and 5428 distinct edge labels Related Work: 1) Query by Example [AFIPS ‘75] 2) Answering Table Augmentation Queries from Unstructured Lists on the Web [VLDB ‘09] 3) Set Expansion: SEAL [ICDM ‘07] 4) Google Sets and Google Squared 5) Exemplar Queries [VLDB ‘14] 20/20
  38. 38. Conclusions Big-data as big-graphs Emerging queries on big-data User-friendly, efficient, and effective online-query-answering algorithms Future Research Directions: Graph partitioning and load balancing, decoupling of graph storage from query processors; integration of graph-processing and data-flow systems
  39. 39. Big Picture Machine Learning Databases Graph Algorithms Stream Algorithms Systems Distributed Computing Social Computing BIG GRAPHS Semi-structured Data Stream Data Uncertain Data Social and Information Networks Knowledge Graphs RDF and Ontology Web/ Semantic Search Business Analytics Viral Marketing Health/ Medicare Natural Language Processing Computer Security

×