A query language for analyzing networks

Information networks are a popular way to represent information, especially in domains where the emphasis lies on the structural relationships between the entities rather than their features. Notable examples are online social networks and road networks. This special focus on network topology has led to the development of specialized graph databases. However, few of these databases offer a high-level declarative interface suited for analyzing information networks.
In this talk I present our work on developing a query language for analyzing networks. I will focus on the general principles we followed in the design of this language, and the main challenges related to developing it into a scalable tool for network analysis.

Published in: Technology

A query language for analyzing networks

  1. 1. A query language for analyzing networks. Anton Dries (based on joint work with Siegfried Nijssen)
  2. 2. Idea: a declarative language for manipulating and analyzing information networks. A “query language” (cf. SQL) with a special focus on querying connections. Simplicity / expressivity / flexibility.
  3. 3. Information networks: objects (“nodes”); connections between objects (“edges”); focus on structure (“topology”); a.k.a. “large single graph”.
  4. 4. Information networks HTTP://SPIKEDMATH.COM/382.HTML
  5. 5. Information networks. Examples: World Wide Web; social networks; bibliographical; transportation; biological.
  6. 6. Process (top-down approach): common tasks; query language; operational model (algebra); implementation & optimization; data management & storage.
  7. 7. Process (top-down approach): common tasks; query language [CIKM 2009]; operational model (algebra) [MLG 2010]; implementation & optimization: ?; data management & storage: graph databases (DEX, Neo, ...).
  8. 8. Common tasks: feature-based queries; structure-based queries; aggregation; basic graph problems (e.g. degree, shortest path); network analysis (e.g. centrality measures); ... Mainly path-based queries.
  9. 9. BiQL: “The BISON Query Language”
  10. 10. [Figure: example network of authors, publications and keywords (“graphs”, “data mining”, “machine learning”, “probabilities”), connected by “author of” and “has keyword” edges]
  11. 11. [Figure: the same network with “co-author” edges added between authors (co-authorship)]
  12. 12. Manipulation “query language”. SQL-style: loosely based on SQL syntax. One type of query: create a set of (new) objects.
        CREATE/UPDATE Domain<Vars> { Properties }
        FROM Path Expression
        WHERE Constraints
  13. 13. Example [Figure: the author/publication/keyword network with “co-author” edges]
        CREATE CoAuthor<A,B> { A <−>, B <−> }
        FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
  14. 14. Example (same figure and query, annotated)
        CREATE CoAuthor<A,B> { A <−>, B <−> }
          “object creation”: output specification
        FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
          “path expression”: structural selection (+ other operations)
  15. 15. Structural selection
        Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B, Publication P −> HasKeyword −> Keyword K
      [Figure: Authors A and B each linked through AuthorOf to Publication P; P linked through HasKeyword to Keyword K]
        Author A −> CoAuthor −> Author B −> CoAuthor −> Author C −> CoAuthor −> Author A
      [Figure: Authors A, B and C connected in a triangle by CoAuthor links]
  16. 16. Structural selection: regular expressions
        Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
      [E] is a list variable. Each expansion of the regular expression should lead to a valid (simple) path expression defining the same variables.
  17. 17. Structural selection
        Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
      [Figure: example graph with nodes n1, n2, n3, n4 and edges e1, e2, e3, e4, e5]
      Expansion with one edge: Node A −> Edge [E] −> Node B
        (A,E,B) = (n1, [e1], n2), (n1, [e3], n3), (n2, [e2], n3), (n2, [e4], n4), (n3, [e5], n4)
      Expansion with two edges: Node A −> Edge [E] −> Node −> Edge [E] −> Node B
        (A,E,B) = (n1, [e1,e2], n3), (n1, [e1,e4], n4), (n1, [e3,e5], n4)
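The expansion of the starred path expression on slide 17 can be sketched in Python as a bounded path enumeration. This is only an illustration, not BiQL's implementation: the `EDGES` dictionary, the `match_paths` function and the `max_edges` bound are assumptions introduced here, and the edge endpoints are read off the single-edge tuples listed on the slide. A full enumeration also yields `(n2, [e2,e5], n4)` and `(n1, [e1,e2,e5], n4)`, which the slide's listing does not show.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of the slide's example graph: edge id -> (source, target).
EDGES: Dict[str, Tuple[str, str]] = {
    "e1": ("n1", "n2"),
    "e2": ("n2", "n3"),
    "e3": ("n1", "n3"),
    "e4": ("n2", "n4"),
    "e5": ("n3", "n4"),
}


def match_paths(edges: Dict[str, Tuple[str, str]],
                max_edges: int) -> List[Tuple[str, List[str], str]]:
    """Return (A, E, B) for every directed path with 1..max_edges edges.

    Each tuple corresponds to one expansion of
    Node A -> Edge [E] -> (Node -> Edge [E] ->)* Node B.
    The bound keeps the enumeration finite even if the graph has cycles.
    """
    out_edges: Dict[str, List[Tuple[str, str]]] = {}
    for edge_id, (src, dst) in edges.items():
        out_edges.setdefault(src, []).append((edge_id, dst))

    results: List[Tuple[str, List[str], str]] = []

    def extend(start: str, node: str, used: List[str]) -> None:
        if used:  # at least one edge: record a match
            results.append((start, list(used), node))
        if len(used) == max_edges:
            return
        for edge_id, nxt in out_edges.get(node, []):
            used.append(edge_id)
            extend(start, nxt, used)
            used.pop()

    for start in sorted(out_edges):
        extend(start, start, [])
    return results


if __name__ == "__main__":
    for a, e, b in match_paths(EDGES, max_edges=3):
        print(a, e, b)
```

The explicit bound stands in for the constraints a real query would need: without it, a cyclic graph would produce an unbounded number of expansions.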
  18. 18. Output specification
        CREATE CoAuthor<A,B> { A <−>, B <−> }
        FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
      Reading the output clause: UPDATE/CREATE: update/create objects; CoAuthor: put them in this domain; <A,B>: for each combination of values; { A <−>, B <−> }: with these properties.
  19. 19. Output specification [Figure: example graph with nodes n1, n2, n3, n4 and edges e1, e2, e3, e4, e5]
        UPDATE <A> { nr_reach: count<B> }
        FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
      Matches (A,E,B): (n1, [e1], n2), (n1, [e3], n3), (n2, [e2], n3), (n2, [e4], n4), (n3, [e5], n4), (n1, [e1,e2], n3), (n1, [e1,e4], n4), (n1, [e3,e5], n4)
      Grouped by <A>: n1: ([e1], n2), ([e3], n3), ([e1,e2], n3), ([e1,e4], n4), ([e3,e5], n4); n2: ([e2], n3), ([e4], n4); n3: ([e5], n4)
  20. 20. Output specification (query and graph as on slide 19). Grouped by <A>: n1: ([e1], n2), ([e3], n3), ([e1,e2], n3), ([e1,e4], n4), ([e3,e5], n4); n2: ([e2], n3), ([e4], n4); n3: ([e5], n4)
  21. 21. Output specification (query and graph as on slide 19). Each <A> group is regrouped by <B>: n1: {n2: ([e1]); n3: ([e3]), ([e1,e2]); n4: ([e1,e4]), ([e3,e5])}; n2: {n3: ([e2]); n4: ([e4])}; n3: {n4: ([e5])}
  22. 22. Output specification (query and graph as on slide 19). Grouped by <B>: n1: {n2: ([e1]); n3: ([e3]), ([e1,e2]); n4: ([e1,e4]), ([e3,e5])}; n2: {n3: ([e2]); n4: ([e4])}; n3: {n4: ([e5])}
  23. 23. Output specification (query and graph as on slide 19). count over the <B> groups: n1: 3; n2: 2; n3: 1
  24. 24. Output specification (query and graph as on slide 19). UPDATE assigns the counts: n1: nr_reach: 3; n2: nr_reach: 2; n3: nr_reach: 1
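The grouping and counting steps of slides 19 to 24 can be reproduced with a few lines of Python. This is a sketch of the aggregation semantics as the slides present them, not the BiQL engine; `MATCHES` copies the eight tuples listed on slide 19, and the function name `count_distinct_b_per_a` is made up for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# The (A, E, B) matches as listed on slide 19.
MATCHES: List[Tuple[str, List[str], str]] = [
    ("n1", ["e1"], "n2"),
    ("n1", ["e3"], "n3"),
    ("n2", ["e2"], "n3"),
    ("n2", ["e4"], "n4"),
    ("n3", ["e5"], "n4"),
    ("n1", ["e1", "e2"], "n3"),
    ("n1", ["e1", "e4"], "n4"),
    ("n1", ["e3", "e5"], "n4"),
]


def count_distinct_b_per_a(matches: List[Tuple[str, List[str], str]]) -> Dict[str, int]:
    """Emulate UPDATE <A> { nr_reach: count<B> }: group by A, count distinct B."""
    reached: Dict[str, Set[str]] = defaultdict(set)
    for a, _edge_list, b in matches:
        reached[a].add(b)  # several paths to the same B count once
    return {a: len(bs) for a, bs in reached.items()}


nr_reach = count_distinct_b_per_a(MATCHES)
print(nr_reach)
```

Node n4 has no outgoing edges, so it appears in no match and receives no `nr_reach` value in this sketch.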
  25. 25. Object properties. Attribute definition: strength: count<P>; start: min<P>(P.year). Link definition: A −>, B −>; P <−
  26. 26. Examples
  27. 27. Co-authorship: adding a new relationship
        CREATE CoAuthor<A,B> { A −>, B −>, <− P, start: min<P>(P.year), end: max<P>(P.year), strength: count<P> }
        FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
      [Figure: authors A and B share publications P1 (year: 2008), P2 (year: 2008) and P3 (year: 2010); the resulting CoAuthor object has strength: 3, start: 2008, end: 2010]
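To make the co-authorship query on slide 27 concrete, the following Python sketch computes the same three attributes from the data shown in the figure. The dictionaries, the function name `co_author_objects`, and the decision to emit only ordered pairs with A different from B are assumptions made for this illustration; the slides do not say whether BiQL would also match A = B.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Data from the slide's figure: three publications shared by authors A and B.
PUBLICATION_YEAR: Dict[str, int] = {"P1": 2008, "P2": 2008, "P3": 2010}
AUTHOR_OF: List[Tuple[str, str]] = [
    ("A", "P1"), ("B", "P1"),
    ("A", "P2"), ("B", "P2"),
    ("A", "P3"), ("B", "P3"),
]


def co_author_objects(author_of: List[Tuple[str, str]],
                      years: Dict[str, int]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Emulate CREATE CoAuthor<A,B> {...}: one object per (A, B) combination."""
    authors_by_pub: Dict[str, Set[str]] = defaultdict(set)
    for author, pub in author_of:
        authors_by_pub[pub].add(author)

    # Collect the publications P matched for each ordered pair (A, B), A != B.
    pubs_by_pair: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for pub, authors in authors_by_pub.items():
        for a, b in permutations(sorted(authors), 2):
            pubs_by_pair[(a, b)].add(pub)

    return {
        pair: {
            "strength": len(pubs),                # count<P>
            "start": min(years[p] for p in pubs),  # min<P>(P.year)
            "end": max(years[p] for p in pubs),    # max<P>(P.year)
        }
        for pair, pubs in pubs_by_pair.items()
    }


print(co_author_objects(AUTHOR_OF, PUBLICATION_YEAR)[("A", "B")])
```

For the figure's data this gives strength 3, start 2008 and end 2010, matching the CoAuthor object shown on the slide.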
  28. 28. Size of neighborhood: transitive closure
        UPDATE <A> { netsize: count<B> }
        FROM Author A −> (CoAuthor [co] <− Author −>)* CoAuthor [co] <− Author B
        WHERE length(co) < 4
  29. 29. Distance based on shortest path
        CREATE Connection<A,B> { A −>, −> B, distance: min<E>(length(E)) }
        FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
      Variants of the distance property:
        distance: min<E>(length(E))
        distance: min<E>(sum(E.weight))
        distance: max<E>(product(E.probability))
  30. 30. Centrality measures
      Degree centrality: $C_D(v) = \frac{\deg(v)}{n-1}$
        UPDATE <A> { Cdegree: count<B>/(count<N>-1) }
        FROM Node A −− Edge −− Node B, Node N
      Closeness centrality: $C_C(v) = \frac{1}{\sum_{t \in V} \mathrm{dist}(v,t)}$
        UPDATE <A> { closeness: 1/sum<B>(min<AB>(AB.distance)) }
        FROM Node A −> Connection AB −> Node B
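The two centrality formulas on slide 30 can be checked on a small graph with plain Python. This is a minimal sketch, assuming an undirected, connected, unweighted graph with at least two nodes; the adjacency-list encoding and all function names are introduced here for illustration, and breadth-first search stands in for the Connection distances that the BiQL query would read.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]  # undirected adjacency lists (hypothetical encoding)


def bfs_distances(adj: Graph, source: str) -> Dict[str, int]:
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree_centrality(adj: Graph) -> Dict[str, float]:
    """C_D(v) = deg(v) / (n - 1), counting distinct neighbours."""
    n = len(adj)
    return {v: len(set(nbrs)) / (n - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj: Graph) -> Dict[str, float]:
    """C_C(v) = 1 / sum of dist(v, t) over all other nodes t."""
    result = {}
    for v in adj:
        total = sum(d for t, d in bfs_distances(adj, v).items() if t != v)
        result[v] = 1 / total
    return result


# A three-node path a - b - c.
PATH: Graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(degree_centrality(PATH))
print(closeness_centrality(PATH))
```

On the path a - b - c, the middle node b has the highest value under both measures, as expected.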
  31. 31. Query execution
  32. 32. Operational model. Query algebra operators: evaluate path expression (graph –> tuple); relational algebra (tuple –> tuple); construction operator (tuple –> graph). Used by the prototype implementation.
  33. 33. Operational model
        Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
      The “pattern match” operator is too broad: it enumerates all paths, which is exponential, even when only the shortest path is requested. Need for atomic graph operations (open question).
  34. 34. Pattern matching
        Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
      Homomorphism matching (no cycle check): more efficient than isomorphism, but cycles could lead to unbounded solutions. Use constraints and algebraic solutions to avoid infinite processing. Operator interaction: the “pattern match” operator is not atomic enough.
  35. 35. Avoiding unbounded solutions
        CREATE Distance<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
        FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
        CREATE ConnectionWeight<A,B> { A −>, −> B, distance: sum<E>(product(E.weight)) }
        FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
        CREATE PathCount<A,B> { A −>, −> B, numP: count<E> }
        FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
  36. 36. Fletcher’s algorithm [FLETCHER, 1980] [BATAGELJ, 1994]
        FOR k = 1..n
          FOR i = 1..n
            FOR j = 1..n
              C_{k,i,j} = C_{k-1,i,j} ⊕ (C_{k-1,i,k} ⊙ C_{k-1,k,k}* ⊙ C_{k-1,k,j})
          C_{k,k,k} = e⊙ ⊕ C_{k,k,k}
      where C_{0,i,j} is the weighted adjacency matrix; (S, ⊕, ⊙, e⊕, e⊙) is an algebraic semiring; a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ... is the closure operator; n is the number of nodes in the graph.
  37. 37. Fletcher’s algorithm. Dynamic programming approach. At step k, C_{k,i,j} contains the solution using paths containing only nodes 1...k. Some examples ...
  38. 38. Fletcher’s algorithm with (S, ⊕, ⊙, e⊕, e⊙) = (ℝ+, min, +, ∞, 0)
        FOR k = 1..n
          FOR i = 1..n
            FOR j = 1..n
              C_{k,i,j} = min(C_{k-1,i,j}, C_{k-1,i,k} + C_{k-1,k,j})
          C_{k,k,k} = 0
      a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ..., so C_{k,k}* = min(0, C_{k,k}, 2C_{k,k}, 3C_{k,k}, ...) = 0 (for C_{k,k} >= 0). This is the Floyd-Warshall shortest path algorithm.
  39. 39. Fletcher’s algorithm with (S, ⊕, ⊙, e⊕, e⊙) = ([0,1], +, ·, 0, 1)
        FOR k = 1..n
          FOR i = 1..n
            FOR j = 1..n
              C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
          C_{k,k,k} = 1 + C_{k,k,k}
      a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ..., so C_{k,k}* = 1 + C_{k,k} + C_{k,k}^2 + C_{k,k}^3 + ... = 1 / (1 - C_{k,k}) (for |C_{k,k}| < 1). Result: sum of all path weights.
  40. 40. Fletcher’s algorithm with (S, ⊕, ⊙, e⊕, e⊙) = (ℕ, +, ·, 0, 1)
        FOR k = 1..n
          FOR i = 1..n
            FOR j = 1..n
              C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
          C_{k,k,k} = 1 + C_{k,k,k}
      a* = 1 + a + a^2 + a^3 + ..., so C_{k,k}* = 1 if C_{k,k} = 0 (no cycle k –> k) and C_{k,k}* = ∞ if C_{k,k} > 0 (cycle k –> k). Result: number of paths.
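The semiring-parameterized recurrence on slides 36 to 40 can be sketched in Python as follows. This is an illustrative reading of the slides rather than the authors' code: the `Semiring` class, the function `fletcher`, and the three instances are names introduced here, the diagonal update is assumed to run once per value of k after the inner loops, and each step builds a fresh matrix from the previous one instead of updating in place. The shortest-path closure assumes non-negative weights, and the counting semiring treats 0 · ∞ as 0.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # the ⊕ operation
    times: Callable[[Any, Any], Any]  # the ⊙ operation
    one: Any                          # e⊙, the identity of ⊙
    star: Callable[[Any], Any]        # closure a* = e⊙ ⊕ a ⊕ a⊙a ⊕ ...


def fletcher(c0: List[List[Any]], s: Semiring) -> List[List[Any]]:
    """Apply the recurrence of slide 36 to the weighted adjacency matrix c0."""
    n = len(c0)
    c = [row[:] for row in c0]
    for k in range(n):
        prev = c
        loop = s.star(prev[k][k])  # closure of the cycles through node k
        c = [[s.plus(prev[i][j], s.times(s.times(prev[i][k], loop), prev[k][j]))
              for j in range(n)]
             for i in range(n)]
        c[k][k] = s.plus(s.one, c[k][k])  # add the empty path k -> k
    return c


# (R+, min, +, inf, 0): shortest paths; a* = 0 for non-negative weights.
SHORTEST = Semiring(plus=min, times=lambda a, b: a + b, one=0,
                    star=lambda a: 0)

# ([0,1], +, *, 0, 1): sum of all path weights; a* = 1 / (1 - a) for |a| < 1.
WEIGHT_SUM = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b,
                      one=1.0, star=lambda a: 1.0 / (1.0 - a))

# (N, +, *, 0, 1): number of paths; a* = 1 without a cycle, infinity otherwise.
PATH_COUNT = Semiring(plus=lambda a, b: a + b,
                      times=lambda a, b: 0 if a == 0 or b == 0 else a * b,
                      one=1, star=lambda a: 1 if a == 0 else math.inf)

INF = math.inf
# Example graph n1..n4 with edges n1->n2, n1->n3, n2->n3, n2->n4, n3->n4
# (unit weights are an assumption; the slides give no weights).
WEIGHTS = [[INF, 1, 1, INF], [INF, INF, 1, 1], [INF, INF, INF, 1], [INF] * 4]
ADJACENCY = [[0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]

print(fletcher(WEIGHTS, SHORTEST)[0][3])      # shortest distance n1 -> n4
print(fletcher(ADJACENCY, PATH_COUNT)[0][3])  # number of paths n1 -> n4
print(fletcher([[0.5]], WEIGHT_SUM)[0][0])    # self-loop of weight 0.5
```

Swapping the semiring changes the problem being solved while the triple loop stays the same, which is the point the slides make by listing three instantiations of one algorithm.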
  41. 41. Fletcher’s algorithm. Generalized algorithm for several connectivity problems: O(n^3) time complexity, O(n^3) or O(n^2) space complexity; for many problems the best known time complexity (exact, for arbitrary graphs); also works in the presence of cycles (thanks to the C_{k-1,k,k}* term). Applicability depends on the constraints on the path.
  42. 42. Fletcher’s algorithm with (S, ⊕, ⊙, e⊕, e⊙) = (ℝ, min, +, ∞, 0)
        CREATE Connection<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
        FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
        WHERE A.color = ‘blue’
      If e1e2 matches the path expression, then e1 and e2 must each match the path expression => has to compute all-pairs shortest paths.
  43. 43. Conclusion. A query language for analyzing networks. Focused on path-based analysis. Raises interesting questions. Some ideas on implementation and optimization.
  44. 44. Future work. Need for atomic graph operations. Fletcher’s algorithm: interaction with constraints; complex path expressions (not just Node-Edge-Node). Approximate answers: O(n^3) is very bad. Other metrics: flow-based, PageRank, ... Mining.
  45. 45. Thank you
