Relational databases are perhaps the most commonly used data management systems. In relational databases, data is modeled as a collection of disparate tables. In order to unify the data within these tables, a join operation is used. This operation is expensive as the amount of data grows. For information retrieval operations that do not make use of extensive joins, relational databases are an excellent tool. However, when an excessive amount of joins are required, the relational database model breaks down. In contrast, graph databases maintain one single data structure---a graph. A graph contains a set of vertices (i.e. nodes, dots) and a set of edges (i.e. links, lines). These elements make direct reference to one another, and as such, there is no notion of a join operation. The direct references between graph elements make the joining of data explicit within the structure of the graph. The benefit of this model is that traversing (i.e. moving between the elements of a graph in an intelligent, direct manner) is very efficient and yields a style of problem-solving called the graph traversal pattern. This session will discuss graph databases, the graph traversal programming pattern, and their use in solving real-world problems.

- 1. Graph Databases: Trends in the Web of Data. Marko A. Rodriguez, Graph Systems Architect. http://markorodriguez.com http://twitter.com/twarko http://slideshare.com/slidarko KRDB Trends in the Web of Data School, Brixen/Bressanone, Italy, September 18, 2010
- 3. Outline • Graph Structures, Algorithms, and Algebras • Graph Databases and the Property Graph • TinkerPop Open-Source Graph Product Suite • Real-Time, Real-World Use Cases for Graphs
- 4. Difficulty Chart [Figure: difficulty plotted against time, with the labels graphs, algebra, databases, indices, data models, software, algorithms, real-world, and conclusion.]
- 7. G = (V, E)
- 8. A Vertex There once was a vertex i ∈ V named tenderlove.
- 9. Two Vertices And then came along another vertex j ∈ V named sixwing. Thus, i, j ∈ V .
- 10. A Directed Edge Our tenderlove extended a relationship to sixwing. Thus, (i, j) ∈ E.
- 11. The Single-Relational, Directed Graph More vertices join, create edges and, in turn, the graph grows...
- 12. The Single-Relational, Directed Graph as a Matrix A single-relational graph defined as $G = (V, E \subseteq (V \times V))$ can be represented as the adjacency matrix $A \in \{0, 1\}^{n \times n}$, where $$A_{i,j} = \begin{cases} 1 & \text{if } (i, j) \in E \\ 0 & \text{otherwise.} \end{cases}$$
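- 13. The Single-Relational, Directed Graph as a Matrix [Figure: the graph $G$ drawn beside its adjacency matrix $A$.] $$A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$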
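As a minimal sketch of the definition on slide 12, the following Python builds a 0/1 adjacency matrix from an edge set. The helper name `adjacency_matrix` and the numbering of the vertices as 0 to 3 are assumptions for illustration; the edge set is read off the sixteen matrix entries on slide 13, taken in row-major order.

```python
def adjacency_matrix(n, edges):
    """Return the n x n matrix A with A[i][j] = 1 iff (i, j) is in edges."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


# Edge set read off the slide-13 matrix (vertex numbering is an assumption).
E = {(0, 1), (0, 2), (1, 0), (1, 3), (2, 0), (3, 1)}
A = adjacency_matrix(4, E)

for row in A:
    print(row)
```

Row `i` of the result lists the vertices that `i` points to, and column `j` lists the vertices that point to `j`.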
- 14. The Single-Relational, Directed Graph • All vertices are homogeneous in meaning: all vertices denote the same type of object (e.g. people, webpages, etc.).[1] • All edges are homogeneous in meaning: all edges denote the same type of relationship (e.g. friendship, works with, etc.).[2] [1] This is not completely true. All n-partite single-relational graphs allow for the division of the vertex set into n subsets, where $V = \bigcup_{i}^{n} A_i : A_i \cap A_j = \emptyset$. Thus, it is possible to implicitly type the vertices. [2] This is not completely true. There exists an injective, information-preserving function that maps any multi-relational graph to a single-relational graph, where edge types are denoted by topological structures. Thus, at a “higher level,” it is possible to create a heterogeneous set of relationships. Rodriguez, M.A., “Mapping Semantic Networks to Undirected Networks,” International Journal of Applied Mathematics and Computer Sciences, 5(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]
- 15. Applications of Single-Relational Graphs • Social: define how people interact (collaborators, friends, kin). • Biological: define how biological components interact (protein, food chains, gene regulation). • Transportation: define how cities are joined by air and road routes. • Dependency: define how software modules, data sets, functions depend on each other. • Technology: define the connectivity of Internet routers, web pages, etc. • Language: define the relationships between words.
- 16. The Limitations of Single-Relational Graph Modeling Friendship Graph Favorite Graph Works-For Graph Unfortunately, single-relational graphs are independent of each other. This is because G = (V, E)—there is only a single edge set E (i.e. a single type of relation).
- 17. Numerous Algorithms for Single-Relational Graphs We would like a more flexible graph modeling construct, but unfortunately, most of our graph algorithms were designed for single-relational graphs.[3] • Geodesic: diameter, radius, eccentricity, closeness, betweenness, etc. • Spectral: random walks, PageRank, eigenvector centrality, spreading activation, etc. • Assortativity: scalar, categorical, hierarchical, etc. • Others: ...[4] [3] For a fine book on graph analysis algorithms, please see: Brandes, U., Erlebach T., “Network Analysis: Methodological Foundations,” edited book, Springer, 2005. [4] One of the purposes of this presentation is to advocate for local graph analysis algorithms (i.e. priors-based, relative) vs. global graph analysis algorithms. Most popular graph analysis algorithms are global in that they require an analysis of the whole graph (or a large portion of a graph) to yield results. Local analysis algorithms are dependent on sub-graphs of the whole and, in effect, can boast faster running times.
- 18. How do we solve this? A multi-relational graph and a path algebra.
- 19. G = (V, E)
- 20. A Directed Edge
- 21. A Directed, Labeled Edge [Figure: an edge labeled friend from tenderlove to sixwing.] Let's specify the type of relationship that exists between tenderlove and sixwing. Thus, $(i, j) \in E_{\text{friend}}$.
- 22. Growing a Multi-Relational Graph [Figure: friend edges in both directions.] Let's make the friendship relationship symmetric. Thus, $(j, i) \in E_{\text{friend}}$.
- 23. Growing a Multi-Relational Graph [Figure: four friend edges.] Let's add marko to the mix: $k \in V$. This graph is still single-relational. There is only one type of relation.
- 24. Growing a Multi-Relational Graph [Figure: four friend edges and one favorite edge.] Let's add an $(i, l) \in E_{\text{favorite}}$. Now there are multiple types of relationships: $E_{\text{friend}}$ and $E_{\text{favorite}}$ (2 edge sets).
- 25. The Multi-Relational, Directed Graph • At this point, there is a multi-relational, directed graph: $G = (V, E)$, where $E = (E_0, E_1, \ldots, E_m \subseteq (V \times V))$.[5] • Vertices can denote different types of objects (e.g. people, places).[6] • Edges can denote different types of relationships (e.g. friend, favorite).[7] [5] Another representation is $G \subseteq (V \times \Omega \times V)$, where $\Omega \subseteq \Sigma^*$ is the set of legal edge labels. [6] Vertex types can be determined by the domain and range specification of the respective edge relation/label/predicate. Or, another way, by means of an explicit typing relation such as $\langle a, \text{type}, b \rangle$. [7] Edge types are determined by the label that accompanies the edge.
- 26. The Multi-Relational, Directed RDF Graph • This is the data model of the Web of Data: the RDF data model. • The RDF data model's vertex set is split into URIs ($U$), literals ($L$), and blank/anonymous nodes ($B$), such that: $G \subseteq ((U \cup B) \times U \times (U \cup B \cup L))$.[8] [8] Named graphs are a popular extension to the RDF data model. There are various serializations such as TriX and TriG. However, for the sake of brevity, this presentation will not discuss named graphs.
- 27. The Multi-Relational, Directed Graph as a Tensor A three-way tensor can be used to represent a multi-relational graph. If $G = (V, E = \{E_0, E_1, \ldots, E_m \subseteq (V \times V)\})$ is a multi-relational graph, then $A \in \{0, 1\}^{n \times n \times m}$ and $$A^k_{i,j} = \begin{cases} 1 & \text{if } (i, j) \in E_k : 1 \le k \le m \\ 0 & \text{otherwise.} \end{cases}$$ Thus, each edge set in $E$ represents an adjacency matrix and the combination of $m$ adjacency matrices forms a 3-way tensor.
- 28. The Multi-Relational, Directed Graph as a Tensor [Figure: the multi-relational graph $G$ (edges labeled friend and favorite) beside its tensor $A$, drawn as stacked adjacency-matrix slices labeled friend, favorite, and answers.]
- 29. Multi-Relational Graph Algorithms “Can we evaluate single-relational graph analysis algorithms on a multi-relational graph?”
- 30. The Meaning of Edge Meanings [Figure: tenderlove and marko each receive the same number of incoming edges, labeled loves or hates.] • Multi-relationally: tenderlove is more liked than marko. • Single-relationally: tenderlove and marko simply have the same in-degree. Given, let's say, degree centrality, tenderlove and marko are equal as they have the same number of relationships. The edge labels do not affect the output of the degree-centrality algorithm.
- 31. What Do You Mean By “Central?” [Figure: a multi-relational graph built around the question “What is your favorite bookstore?”, with edges labeled answer_for, answer_by, question_by, favorite, and friend.] Let's focus specifically on centrality. What is the most central vertex in a multi-relational graph? Who is the most central friend in the graph: by friendship, by question answering, by favorites, etc.?
- 32. Primary Eigenvector “What does the primary eigenvector of a multi-relational graph mean?”[9][10][11] [9] We will use the primary eigenvector for the following argument. Note that the same argument applies for all known single-relational graph algorithms (i.e. geodesic, spectral, community detection, etc.). [10] Technical details are left aside such as outgoing edge probability distributions and the irreducibility of the graph. [11] The popular PageRank vector is defined as the primary eigenvector of a low-probability fully connected graph combined with the original graph (i.e. both graphs maintain the same $V$).
- 33. Primary Eigenvector: Ignoring Edge Labels • If $\pi = B\pi$, where $B \in \mathbb{N}_+^{|V| \times |V|}$ is the adjacency matrix formed by merging the edge sets in $E$, then edge labels are ignored: all edges are treated equally. • In this “ignoring labels” model, there is only one primary eigenvector for the graph, and so one definition of centrality. • With a heterogeneous set of vertices connected by a heterogeneous set of edges, what does this type of centrality mean?
- 34. Primary Eigenvector: Isolating Subgraphs • Are there other primary eigenvectors in the multi-relational graph? • You can ignore certain edge sets and calculate the primary eigenvector (e.g. pull out the single-relational “friend” graph): $\pi = A^{\text{friend}}\pi$, where $A^{\text{friend}} \in \{0, 1\}^{|V| \times |V|}$ is the adjacency matrix formed by the edge set $E_{\text{friend}}$. • Thus, you can isolate subgraphs (i.e. adjacency matrices) of the multi-relational graph and calculate the primary eigenvector for those subgraphs. • In this “isolation” model, there are $m$ definitions of centrality, one for each isolated subgraph.[12] [12] Remember, $A \in \{0, 1\}^{n \times n \times m}$.
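The “isolation” model can be sketched with plain power iteration over one slice of the tensor. This is an illustrative sketch, not the presenter's code: the function name `primary_eigenvector`, the four-vertex friend matrix, and the fixed iteration count are all assumptions, and the sketch only converges when the isolated graph is strongly connected and aperiodic (the technical details the slides set aside).

```python
def primary_eigenvector(A, iterations=200):
    """Approximate the primary eigenvector of a non-negative matrix A.

    Uses power iteration: repeatedly apply pi <- A pi and rescale.
    Assumes the graph of A is strongly connected and aperiodic.
    """
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(A[i][j] * pi[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        pi = [x / total for x in nxt]  # rescale so the entries sum to 1
    return pi


# Hypothetical symmetric "friend" slice on four vertices:
# friendships 0-1, 0-2, 1-2, 2-3, each stored in both directions.
A_friend = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

pi = primary_eigenvector(A_friend)
most_central = max(range(len(pi)), key=lambda v: pi[v])
print(most_central, [round(x, 3) for x in pi])
```

Running the same function on a different slice (or on a merged matrix) gives a different vector, which is the point of the slide: each isolated subgraph carries its own notion of centrality.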
- 35. Ultimately what we want is...
- 36. Primary Eigenvector: Turing Completeness • What about using paths through the graph—not simply explicit one-step edges? • What about determining centrality for a relation that isn’t explicit in E (i.e. Ak ∈ A)? In general, what about π = Xπ, where X is a derived adjacency matrix of the multi-relational graph. For example, if I know who everyone’s friends are, then I know (i.e. can infer, derive, compute) who everyone’s friends-of-a-friends (FOAF) are. What about the primary eigenvector of the derived FOAF graph? • In the end, you want a Turing-complete framework—you want complete control (universal computability) over how π moves through the multi-relational graph structure.13 13 These ideas are expounded upon at great length throughout this presentation.
- 37. A Path Algebra for Evaluating Single-Relational Algorithms on Multi-Relational Graphs • There exists a multi-relational graph algebra for mapping single-relational graph analysis algorithms to the multi-relational domain.14 • The algebra works on a tensor representation of a multi-relational graph. • In this framework and given the running example, there are as many primary eigenvectors as there are abstract path deﬁnitions. 14 * Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, doi:10.1016/j.joi.2009.06.004, 2009. [http://arxiv.org/abs/0806.2274] * Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems, 21(7), pp. 727–739, doi:10.1016/j.knosys.2008.03.030, 2008. [http://arxiv.org/abs/0803.4355] * Rodriguez, M.A., Watkins, J.,“Grammar-Based Geodesics in Semantic Networks,” Knowledge-Based Systems, in press, doi:10.1016/j.knosys.2010.05.009, 2010.
- 38. The Operations of the Multi-Relational Path Algebra • $A \cdot B$: ordinary matrix multiplication determines the number of $(A, B)$-paths between vertices. • $A^{\top}$: matrix transpose inverts path directionality. • $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths. • $n(A)$: not generates the complement of a $\{0, 1\}^{n \times n}$ matrix. • $c(A)$: clip generates a $\{0, 1\}^{n \times n}$ matrix from a $\mathbb{R}_+^{n \times n}$ matrix. • $v^{\pm}(A)$: vertex generates a $\{0, 1\}^{n \times n}$ matrix from a $\mathbb{R}_+^{n \times n}$ matrix, where only certain rows or columns contain non-zero values. • $xA$: scalar multiplication weights the entries of a matrix. • $A + B$: matrix addition merges paths.
- 39. Primary Eigenvectors in a Multi-Relational Graph • Friend: $A^{\text{friend}} \pi$ • FOAF: $A^{\text{friend}} \cdot A^{\text{friend}} \pi \equiv (A^{\text{friend}})^2 \pi$ • FOAF (no self): $\left((A^{\text{friend}})^2 \circ n(I)\right) \pi$ [15] • FOAF (no friends nor self): $\left((A^{\text{friend}})^2 \circ n(A^{\text{friend}}) \circ n(I)\right) \pi$ • Co-Worker: $\left(A^{\text{works\_at}} \cdot (A^{\text{works\_at}})^{\top} \circ n(I)\right) \pi$ • Friend-or-CoWorker: $\left(0.65\, A^{\text{friend}} + 0.35 \left(A^{\text{works\_at}} \cdot (A^{\text{works\_at}})^{\top} \circ n(I)\right)\right) \pi$ • ...and more.[16] [15] $I \in \{0, 1\}^{|V| \times |V|} : I_{i,i} = 1$ is the identity matrix. [16] Note, again, that the examples are with respect to determining the primary eigenvector of the derived adjacency matrix. The same argument holds for all other single-relational graph analysis algorithms. In general, the path algebra provides a means of creating “higher-order” (i.e. semantically rich) single-relational graphs from a single multi-relational graph. Thus, these derived matrices can be subjected to standard single-relational graph analysis algorithms.
- 40. Deriving “Semantically Rich” Adjacency Matrices [Figure: the friend slice of the tensor $A$ is multiplied by itself and filtered with $n(I)$ to give the derived matrix $(A^{\text{friend}})^2 \circ n(I)$, labeled “friend-of-a-friend (no self)”.] Use the multi-relational graph to generate explicit edges that were implicitly defined as paths. Those new explicit edges can then be memoized [17] and re-used (time vs. space tradeoff), aka path reuse. [17] Memoization Wikipedia entry: http://en.wikipedia.org/wiki/Memoization.
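The operations on slide 38 and the derived matrices on slides 39 and 40 can be tried out on a tiny tensor. The sketch below uses plain Python lists; the function names, the four-vertex data set (three people and one employer), and the reading of the co-worker path as `works_at` followed by its transpose are assumptions for illustration.

```python
def mul(A, B):
    """A . B: entry (i, j) counts the paths that take an A-edge then a B-edge."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(A):
    """Invert path directionality."""
    return [list(column) for column in zip(*A)]


def hadamard(A, B):
    """A o B: entry-wise product, used here as a path filter."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def complement(A):
    """n(A): complement of a 0/1 matrix."""
    return [[1 - a for a in row] for row in A]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# Hypothetical data: vertices 0, 1, 2 are people; vertex 3 is an employer.
tensor = {
    "friend":   [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]],
    "works_at": [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
}
friend, works_at = tensor["friend"], tensor["works_at"]
no_self = complement(identity(4))

foaf = mul(friend, friend)
foaf_no_self = hadamard(foaf, no_self)
foaf_no_friends_nor_self = hadamard(foaf_no_self, complement(friend))
co_worker = hadamard(mul(works_at, transpose(works_at)), no_self)

print(foaf_no_self)
print(co_worker)
```

Each derived matrix is an ordinary single-relational adjacency matrix, so it could be stored (memoized) and handed to any single-relational algorithm.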
- 41. Beneﬁts, Drawbacks, and Future of the Path Algebra • Beneﬁt: Provides a set of theorems for deriving equivalences and thus, provides the foundation for graph traversal engine optimizers.18 Serves a similar purpose as the relational algebra for relational databases.19 • Drawback: The algebra is represented in matrix form and thus, operationally, works globally over the graph.20 • Future: A non-matrix-based, ring theoretic model of graph traversal that supports +, −, and · on individual vertices and edges. The Gremlin [http://gremlin.tinkerpop.com] graph traversal engine presented later provides the implementation before a fully-developed theory. 18 Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274] 19 Codd, E.F., “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 13(6), pp. 377–387, doi:10.1145/362384.362685, 1970. 20 It is possible to represent local traversals using vertex ﬁlters at the expense of clumsy notation.
- 44. The Simplicity of a Graph • A graph is a simple data structure. • A graph states that something is related to something else (the foundation of any other data structure).[21] • It is possible to model a graph in various types of databases:[22] relational database (MySQL, Oracle, PostgreSQL); JSON document database (MongoDB, CouchDB); XML document database (MarkLogic, eXist-db); etc. [21] A graph can be used to represent other data structures. This point becomes convenient when looking beyond using graphs for typical, real-world domain models (e.g. friends, favorites, etc.), and seeing their applicability in other areas such as modeling code (e.g. http://arxiv.org/abs/0802.3492), indices, etc. [22] For the sake of diagram clarity, the examples to follow are with respect to a single-relational, directed graph. Note that it is possible to model multi-relational graphs in these types of database as well.
- 45. Representing a Graph in a Relational Database [Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]

  | outV | inV |
  |------|-----|
  | A    | B   |
  | A    | C   |
  | C    | D   |
  | D    | A   |
- 46. Representing a Graph in a JSON Database [Figure: the same directed graph.] `{ "A": { "outE": ["B", "C"] }, "B": { "outE": [] }, "C": { "outE": ["D"] }, "D": { "outE": ["A"] } }`
- 47. Representing a Graph in an XML Database [Figure: the same directed graph.] `<graphml> <graph> <node id="A" /> <node id="B" /> <node id="C" /> <node id="D" /> <edge source="A" target="B" /> <edge source="A" target="C" /> <edge source="C" target="D" /> <edge source="D" target="A" /> </graph> </graphml>`
- 48. Deﬁning a Graph Database “If any database can represent a graph, then what is a graph database?”
- 49. Deﬁning a Graph Database A graph database is any storage system that provides index-free adjacency.2324 23 There is no “oﬃcial” deﬁnition of what makes a database a graph database. The one provided is my deﬁnition (respective of the inﬂuence of my collaborators in this area). However, hopefully the following argument will convince you that this is a necessary deﬁnition. Given that any database can model a graph, such a deﬁnition would not provide strict enough bounds to yield a formal concept (i.e. ). 24 There is adjacency between the elements of an index, but if the index is not the primary data structure of concern (to the developer), then there is indirect/implicit adjacency, not direct/explicit adjacency. A graph database exposes the graph as an explicit data structure (not an implicit data structure).
- 50. Defining a Graph Database by Example [Figure: a toy graph with vertices A, B, C, D, and E, and a gremlin (stuntman) that walks it.]
- 51. Graph Databases and Index-Free Adjacency [Figure: the toy graph with the gremlin at vertex A.] • Our gremlin is at vertex A. • In a graph database, vertex A has direct references to its adjacent vertices. • Constant time cost to move from A to B and C. It is dependent upon the number of edges emanating from vertex A (local).
- 52. Graph Databases and Index-Free Adjacency [Figure: the toy graph, labeled “The Graph (explicit)”.]
- 54. Non-Graph Databases and Index-Based Adjacency [Figure: the toy graph beside an index that lists each vertex with its adjacent vertices (e.g. A: B, C).] • Our gremlin is at vertex A.
- 55. Non-Graph Databases and Index-Based Adjacency • In a non-graph database, the gremlin needs to look at an index to determine what is adjacent to A. • $\log_2(n)$ time cost to move to B and C. It is dependent upon the total number of vertices and edges in the database (global).
- 56. Non-Graph Databases and Index-Based Adjacency [Figure: the index, labeled “The Index (explicit)”, beside the toy graph, labeled “The Graph (implicit)”.]
- 58. Index-Free Adjacency • While any database can implicitly represent a graph, only a graph database makes the graph structure explicit.[25] • In a graph database, each vertex serves as a “mini index” of its adjacent elements.[26] • Thus, as the graph grows in size, the cost of a local step remains the same.[27] [25] Please see http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_Large-Scale_Graph_Traversal.html for some performance characteristics of graph traversals in a relational database (MySQL) and a graph database (Neo4j). [26] Each vertex can be interpreted as a “parent node” in an index with its children being its adjacent elements. In this sense, traversing a graph is analogous in many ways to traversing an index, albeit the graph is not an acyclic connected graph (tree) (a vision espoused by Craig Taverner). [27] A graph, in many ways, is like a distributed index.
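The contrast between index-free and index-based adjacency can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular database is implemented: the `Vertex` class, both function names, and the sorted edge list standing in for a global index are hypothetical. The edges are the four rows of the relational table on slide 45.

```python
from bisect import bisect_left, bisect_right


class Vertex:
    """A vertex that holds direct references to its adjacent vertices."""

    def __init__(self, name):
        self.name = name
        self.out = []  # direct references: no global lookup needed


def neighbours_index_free(vertex):
    """Cost depends only on the number of edges leaving this vertex (local)."""
    return [v.name for v in vertex.out]


def neighbours_index_based(sorted_edges, name):
    """Binary-search a globally sorted (outV, inV) table: O(log n) to locate
    the first matching row, where n is the total number of edges (global)."""
    lo = bisect_left(sorted_edges, (name, ""))
    hi = bisect_right(sorted_edges, (name, "\uffff"))
    return [in_v for _, in_v in sorted_edges[lo:hi]]


edges = sorted([("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")])

# Build the same graph with direct references.
vertices = {name: Vertex(name) for name in "ABCD"}
for out_v, in_v in edges:
    vertices[out_v].out.append(vertices[in_v])

print(neighbours_index_free(vertices["A"]))
print(neighbours_index_based(edges, "A"))
```

Both calls return the same neighbours; the difference is that the second one has to search a structure whose size grows with the whole graph, while the first only touches the vertex it is standing on.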
- 59. Graph Databases Do Make Use of Indices [Figure: an index of vertices (by id) pointing into the graph with vertices A through E.] • There is more to the graph than the explicit graph structure. • Indices index the vertices by their properties (e.g. ids, name, latitude).[28] [28] Graph databases can be used to create index structures. In fact, in the early days of Neo4j, Neo4j used its own graph structure to index the properties of its vertices: a graph indexing a graph. A thought iterated many times over by Craig Taverner, who is interested in graph databases for geo-spatial indexing/analysis.
- 60. The Patterns of a Relational Database • In a relational database, operations are conceptualized set-theoretically, with the joining of tuple structures being the means by which normalized/separated data is associated.
- 61. The Pattern of a Graph Database • In a graph database, operations are conceptualized graph-theoretically, with paths over edges being the means by which non-adjacent/separated vertices are associated.[29] [29] Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” AT&Ti and NeoTechnology Technical Report, currently in review, 2010. [http://arxiv.org/abs/1004.1001]
- 62. What About Triple/Quad Stores? • In a triple/quad store, operations are conceptualized set-theoretically: pattern matching (e.g. SPARQL): ?pattern; inferencing (e.g. RDFS, OWL): ?pattern ⟹ triples. • In many implementations, the triple/quad store makes use of indices that combine subjects (?s), predicates (?p), and objects (?o).
- 63. Triple/Quad Stores, Graph Theory, and the Web of Data • The triple/quad store rides an interesting boundary between a relational and a graph database, though it is seen more set-theoretically. This is because, I believe, RDF/Web of Data is not presented/taught in terms of graphs and graph-theoretic operations.
- 64. Graph Databases and the Web of Data • In theory and ignoring performance, index and index-free models have the same expressivity and allow for the same manipulations. But such theory does not determine intention and the mental ruts that any approach engrains. • Can the graph traversal pattern become a staple in the Web of Data? Formulate SPARQL pattern matching in terms of traversing. Formulate inference in terms of traversing. Take advantage of graph theoretic models of data processing.
- 67. TinkerPop: Making Stuﬀ for the Fun of It • Open source software group started in 2008 focusing on graph data structures, graph query engines, graph-based programming languages, and, in general, tools and techniques for working with graphs. [http://tinkerpop.com] [http://github.com/tinkerpop] Current members: Marko A. Rodriguez (AT&Ti), Peter Neubauer (NeoTechnology), Joshua Shinavier (Rensselaer Polytechnic Institute), and Pavel Yaskevich (“I am no one from nowhere”).
- 68. TinkerPop Productions • Blueprints: Data Models and their Implementations [http://blueprints.tinkerpop.com] • Pipes: A Data Flow Framework using Process Graphs [http://pipes.tinkerpop.com] • Gremlin: A Graph-Based Programming Language [http://gremlin.tinkerpop.com] • Rexster: A RESTful Graph Shell [http://rexster.tinkerpop.com] Wreckster: A Ruby API for Rexster [http://github.com/tenderlove/wreckster] There are other TinkerPop products (e.g. Ripple, LoPSideD, TwitLogic, etc.), but for the purpose of this presentation, only the above will be discussed.
- 69. Blueprints: Data Models and their Implementations • Blueprints is like the JDBC of the graph database community. • Provides a Java-based interface API for the property graph data model: Graph, Vertex, Edge, Index. • Provides implementations of the interfaces for TinkerGraph, Neo4j, OrientDB, Sails (e.g. AllegroSail, Neo4jSail), and soon (hopefully) others such as InfiniteGraph, InfoGrid, Sones, and HyperGraphDB.[30] [30] HyperGraphDB makes use of an n-ary graph structure known as a hypergraph. Blueprints, in its current form, only supports the more common binary graph.
- 70. Pipes: A Data Flow Framework using Process Graphs Pipes • A dataﬂow framework with support for Blueprints-based graph processing. • Provides a collection of “pipes” (implement Iterable and Iterator) that are connected together to form processing pipelines. Filters: ComparisonFilterPipe, RandomFilterPipe, etc. Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc. Splitting/Merging: CopySplitPipe, RobinMergePipe, etc. Logic: OrPipe, AndPipe, etc.
- 71. Gremlin: A Graph-Based Programming Language • A Turing-complete, graph-based programming language that compiles Gremlin syntax down to Pipes (implements JSR 223). • Supports various language constructs: :=, foreach, while, repeat, if/else, function and path definitions, etc. Example expressions: `./outE[@label='friend']/inV`; `./outE[@label='friend']/inV/outE[@label='friend']/inV[g:except($ )]`; `g:key('name','Aaron Patterson')[0]/outE[@label='favorite']/inV/@name`
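The way a path expression such as `./outE[@label='friend']/inV` breaks down into a chain of small lazy steps can be imitated with Python generators. This sketch is not the Pipes or Gremlin API: the `Edge` tuple, the `Graph` class, the step functions `out_e`, `in_v`, and `except_`, and the three-person data set are all hypothetical stand-ins.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["out_v", "label", "in_v"])


class Graph:
    """Stores, per vertex, the list of its outgoing edges."""

    def __init__(self, edges):
        self.out_edges = {}
        for e in edges:
            self.out_edges.setdefault(e.out_v, []).append(e)


def out_e(graph, vertices, label):
    """Step: vertices -> their outgoing edges with the given label."""
    for v in vertices:
        for e in graph.out_edges.get(v, []):
            if e.label == label:
                yield e


def in_v(edges):
    """Step: edges -> the vertices they point to."""
    for e in edges:
        yield e.in_v


def except_(vertices, excluded):
    """Step: drop any vertex found in `excluded`."""
    for v in vertices:
        if v not in excluded:
            yield v


g = Graph([
    Edge("tenderlove", "friend", "sixwing"),
    Edge("sixwing", "friend", "tenderlove"),
    Edge("sixwing", "friend", "marko"),
    Edge("marko", "friend", "sixwing"),
])

start = ["tenderlove"]
friends = in_v(out_e(g, start, "friend"))
foaf = except_(in_v(out_e(g, friends, "friend")), set(start))
print(list(foaf))
```

Nothing is computed until the final `list(...)` pulls items through the chain, which mirrors the iterator-based design described for Pipes on slide 70.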
- 72. Rexster: A RESTful Graph Shell reXster • Allows Blueprints graphs to be exposed through a RESTful API (HTTP). • Supports stored traversals written in raw Pipes or Gremlin. • Supports adhoc traversals represented in Gremlin. • Provides “helper classes” for performing search-, score-, and rank-based traversal algorithms—in concert, support for recommendation. • Aaron Patterson (AT&Ti) maintains the Ruby connector Wreckster.
- 73. Typical TinkerPop Graph Stack [Figure: a layered stack in which a request of the form GET http://{host}/{resource} reaches graph stores such as Neo4j, NativeStore, and TinkerGraph.]
- 76. Using Graphs in Real-Time Systems • Most popular graph algorithms require global graph analysis. Such algorithms compute a score, a vector, etc. given the structure of the whole graph. Moreover, many of these algorithms have large running times: $O(|V| + |E|)$, $O(|V| \log |V|)$, $O(|V|^2)$, etc. • Many real-world situations can make use of local graph analysis:[31] search for x starting from y; score x given its local neighborhood; rank x relative to y; recommend vertices to user x. [31] Many web applications are “ego-centric” in that they are with respect to a particular user (the user logged in). In such scenarios, local graph analysis algorithms are not only prudent to use, but also beneficial in that they are faster than global graph analysis algorithms. Many of the local analysis algorithms discussed run in the sub-second range (for graphs with “natural” statistics).
- 77. Applications of Graph Databases and Traversal Engines: Searching, Scoring, and Ranking • Searching: given a power multi-set of vertices ($\hat{\mathcal{P}}(V)$) and a path description ($\Psi$), return the vertices at the end of that path.[32] $$\hat{\mathcal{P}}(V) \times \Psi \to \hat{\mathcal{P}}(V)$$ • Scoring: given some vertices and a path description, return a score. $$\hat{\mathcal{P}}(V) \times \Psi \to \mathbb{R}$$ • Ranking: given some vertices and a path description, return a map of scored vertices. $$\hat{\mathcal{P}}(V) \times \Psi \to (V \times \mathbb{R})$$ [32] Use cases need not be with respect to vertices only. Edges can be searched, scored, and ranked as well. However, in order to express the ideas as simply as possible, all discussion is with respect to vertices.
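The three signatures above can be sketched as three small Python functions. Several choices here are assumptions made only for illustration: a path description Ψ is modeled as a list of edge labels, a multi-set of vertices as a `collections.Counter`, the graph as a dictionary keyed by `(vertex, label)`, and the score as the number of paths that complete the traversal.

```python
from collections import Counter


def search(graph, start, path):
    """Multi-set of vertices x path -> multi-set of vertices at the path's end."""
    current = Counter(start)
    for label in path:
        nxt = Counter()
        for vertex, count in current.items():
            for target in graph.get((vertex, label), []):
                nxt[target] += count  # one entry per distinct path
        current = nxt
    return current


def score(graph, start, path):
    """Multi-set of vertices x path -> a real number (here: the path count)."""
    return float(sum(search(graph, start, path).values()))


def rank(graph, start, path):
    """Multi-set of vertices x path -> (vertex, score) pairs, best first."""
    return search(graph, start, path).most_common()


# Hypothetical data: user i has friends j and k, who have favorites x and y.
graph = {
    ("i", "friend"): ["j", "k"],
    ("j", "favorite"): ["x", "y"],
    ("k", "favorite"): ["x"],
}

psi = ["friend", "favorite"]  # "what do my friends favor?"
print(search(graph, ["i"], psi))
print(score(graph, ["i"], psi))
print(rank(graph, ["i"], psi))
```

Because every step starts from the given vertices and only follows their edges, the work done depends on the local neighborhood rather than on the size of the whole graph.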
- 78. Applications of Graph Databases and Traversal Engines: Recommendation • Recommendation: searching, scoring, and ranking can all be used as components of a recommendation. Thus, recommendation is founded on these more basic ideas. Recommendation aids the user by allowing them to make “jumps” through the data. Items that are not explicitly connected are connected implicitly through recommendation (through some abstract path $\Psi$). • The act of recommending can be seen as an attempt to increase the density of the graph around a user's vertex. For example, recommending user $i \in V$ places to visit $U \subset V$ will hopefully lead to edges of the form $\langle i, \text{visited}, j \rangle : \forall j \in U$.[33] [33] A standard metric for recommendation quality is how well it predicts the user's future behavior. That is, does it predict an edge?
- 79. There Is More Than “People Who Like X Also Like Y.” • A system need not be limited to one type of recommendation. With graph-based methods, there are as many recommendations as there are abstract paths. • Use recommendation to aid the user in solving problems (i.e. computationally derive solutions for which your data set is primed). Examples below are with respect to problem-solving in the scholarly community:[34] recommend articles to read (articles); recommend collaborators to work on an idea/article with (people); recommend a venue to submit the article to (venues); recommend referees to an editor for reviewing the article (people);[35] recommend scholars to talk to and concepts to talk to them about at the venue (people and tags). [34] Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” KRS-2009-02, 2009. [http://arxiv.org/abs/0905.1594] [35] Rodriguez, M.A., Bollen, J., “An Algorithm to Determine Peer-Reviewers,” Conference on Information and Knowledge Management (CIKM), pp. 319–328, doi:10.1145/1458082.1458127, 2008. [http://arxiv.org/abs/cs/0605112]
- 80. Real-Time, Domain-Specific, Graph-Based, Problem-Solving Engine [Figure: Graph Data Set + Library of Path/Traversal Expressions (Ψ1, Ψ2, Ψ3, Ψ4, Ψ5, ..., Ψn) = Real-Time, Domain-Specific, Graph-Based, Problem-Solving Engine.] Your domain model (i.e. graph dataset) determines what traversals you can design, develop, and deploy. Together, these determine which types of problems you can solve automatically/computationally for yourself and your users.
- 81. Applicable in Various, Seemingly Diverse Areas • Applications to a techno-social government (i.e. collective decision making systems).36 [Figure: two plots comparing direct democracy with dynamically distributed democracy; x-axes: percentage of active citizens (n); y-axes: proportion of correct decisions and error; shown next to an excerpt of the first paper cited below.] 36 * Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), 2009. [http://arxiv.org/abs/0901.3929] * Rodriguez, M.A., “Social Decision Making with Multi-Relational Networks and Grammar-Based Particle Swarms,” Hawaii International Conference on Systems Science (HICSS), pp. 39–49, 2007. [http://arxiv.org/abs/cs/0609034] * Rodriguez, M.A., Steinbock, D.J., “A Social Network for Societal-Scale Decision-Making Systems,” Proceedings of the North American Association for Computational Social and Organizational Science Conference, 2004. [http://arxiv.org/abs/cs/0412047]
- 82. A detour into the property graph data model...
- 83. Property Graphs and Graph Databases • Most graph databases support a graph data model known as a property graph. • A property graph is a directed, attributed, multi-relational graph. In other words, vertices and edges are equipped with a collection of key/value pairs.37 37 Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin of the American Society for Information Science and Technology, American Society for Information Science and Technology, 2010. [http://arxiv.org/abs/1006.2361]
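As a rough illustration of the definition above (directed, attributed, multi-relational), a property graph can be sketched in Python as vertices and labeled edges that each carry a dictionary of key/value pairs. The class and method names here are hypothetical and are not the API of any graph database; the sample ids, labels, and properties are borrowed from the toy graph used later in the deck, and the `created_at` placement is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)   # key/value pairs on the vertex

@dataclass
class Edge:
    id: int
    out_v: int        # tail vertex id (edges are directed)
    label: str        # edge type, which makes the graph multi-relational
    in_v: int         # head vertex id
    properties: dict = field(default_factory=dict)   # key/value pairs on the edge

class PropertyGraph:
    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vid, **props):
        self.vertices[vid] = Vertex(vid, props)

    def add_edge(self, eid, out_v, label, in_v, **props):
        self.edges.append(Edge(eid, out_v, label, in_v, props))

    def out_edges(self, vid, label=None):
        # Linear scan for brevity; a real graph database stores adjacency directly.
        return [e for e in self.edges
                if e.out_v == vid and (label is None or e.label == label)]

g = PropertyGraph()
g.add_vertex(1)
g.add_vertex(2, name="sixwing", location="West Hollywood", gender="male")
g.add_vertex(3, name="marko", location="Santa Fe", gender="male")
g.add_vertex(4)
g.add_edge(10, 1, "friend", 2)
g.add_edge(11, 1, "friend", 3)
g.add_edge(12, 1, "favorite", 4, created_at=123456)  # property placement assumed

names = [g.vertices[e.in_v].properties["name"] for e in g.out_edges(1, "friend")]
print(names)  # ['sixwing', 'marko']
```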
- 84. From a Multi-Relational Graph... [Figure: a multi-relational graph whose edges are labeled friend or favorite.]
- 85. ...to a Property Graph [Figure: the same graph with key/value pairs attached. Vertex properties shown: name=marko, location=Santa Fe, gender=male; name=sixwing, location=West Hollywood, gender=male; lat=11111, long=22222. Edge properties shown on the friend and favorite edges: created_at=123456, created_at=234567.]
- 86. Why the Property Graph Model? • Standard single-relational graphs do not provide enough modeling flexibility for use in real-world situations.38 • Multi-relational graphs do, and the Web of Data (RDF) world demonstrates this to be the case in practice. • Property graphs are perhaps more practical because not every datum needs to be “related” (e.g. age, name, etc.). Thus, the edge and key/value model is a convenient dichotomy.39 • Property graphs provide finer granularity on the meaning of an edge, as the key/values of an edge add extra information beyond the edge label. 38 This is not completely true: researchers use the single-relational graph all the time. However, in most data-rich applications, it is limiting to work with a single edge type and a homogeneous population of vertices. 39 RDF has a similar argument in that literals can only be the object of a triple. However, in practice, when represented in a graph database, there is a single literal vertex denoting that literal and thus, it is traversable like any other vertex.
- 87. Graph Type Morphisms [Figure: diagram of morphisms between graph types (property graph, weighted graph, labeled graph, semantic graph, rdf graph, directed graph, multi-graph, simple graph, undirected graph). Arrow labels: add weight attribute; remove attributes; remove edge labels; make labels URIs; remove directionality; remove loops, directionality, and multiple edges; no op.]
- 88. Toy Graph Dataset [Figure: toy property graph with vertices 1 to 6 connected by friend and favorite edges. Properties shown include name=marko, location=Santa Fe, gender=male; name=sixwing, location=West Hollywood, gender=male; lat=11111, long=22222; name=Bryce Canyon (vertex 6); name=charlie (vertex 5); and created_at=123456 and created_at=234567 on edges.] We will use the toy graph above to demonstrate Gremlin (to introduce the syntax).
- 89. Dataset Schema in Neo4j Neo4j [http://neo4j.org] is a “schema-less” database. However, ultimately, data is represented according to some schema whether that schema be explicit in the database, in the code interacting with the database, or in the developer’s head.40 Please note the schema diagrammed below is a non-standard convention.41 [Figure: Person (name=<string>, location=<string>, gender=<string>, type=Person); Place (name=<string>, lat=<double>, long=<double>, type=Place); edges: Person friend Person, Person favorite Place.] 40 A better term for “schema-less” might have been “dynamic schema.” 41 For expressive, standardized graph-based schema languages, refer to RDFS [http://www.w3.org/TR/rdf-schema/] and OWL [http://www.w3.org/TR/owl-features/] of the Web of Data community.
- 90. Dataset Schema in MySQL

      CREATE TABLE friend (
        outV INT NOT NULL,
        inV INT NOT NULL);
      CREATE INDEX friend_outV_index USING BTREE ON friend (outV);
      CREATE INDEX friend_inV_index USING BTREE ON friend (inV);

      CREATE TABLE favorite (
        outV INT NOT NULL,
        inV INT NOT NULL);
      CREATE INDEX favorite_outV_index USING BTREE ON favorite (outV);
      CREATE INDEX favorite_inV_index USING BTREE ON favorite (inV);

      CREATE TABLE metadata (
        vertex INT NOT NULL,
        _key VARCHAR(100) NOT NULL,
        _value VARCHAR(100),
        PRIMARY KEY (vertex, _key));
      CREATE INDEX metadata_vertex_index USING BTREE ON metadata (vertex);
      CREATE INDEX metadata_key_index USING BTREE ON metadata (_key);
      CREATE INDEX metadata_value_index USING BTREE ON metadata (_value);

- 91. Basic Gremlin

      gremlin> (1 + 2) * 4 div 5
      ==>2.4
      gremlin> "marko" + " a. " + "rodriguez"
      ==>marko a. rodriguez
      gremlin> func ex:add-one($x)
                 $x + 1
               end
      gremlin> foreach $y in g:list(1,2,3,4)
                 g:print(ex:add-one($y))
               end
      2
      3
      4
      5
- 92. Searching Example: Friends [Figure: the toy graph from slide 88.]

      gremlin> $_g := neo4j:open('/data/mygraph')
      gremlin> $_ := g:id-v(1)
      ==>v[1]
      gremlin> .
      ==>v[1]
      gremlin> ./outE
      ==>e[10][1-friend->2]
      ==>e[11][1-friend->3]
      ==>e[12][1-favorite->4]
      gremlin> ./outE[@label='friend']/inV/@name
      ==>sixwing
      ==>marko
      gremlin> ./outE[@label='friend']/inV/@gender
      ==>male
      ==>male
      gremlin> ./outE[@label='friend']
                 /inV[@location='Santa Fe']/@name
      ==>marko
- 93. Searching Example: Friends in SPARQL

  The name of tenderlove’s friends...

      SELECT ?y WHERE {
        ex:tenderlove ex:friend ?x .
        ?x ex:name ?y }

  The gender of tenderlove’s friends...

      SELECT ?y WHERE {
        ex:tenderlove ex:friend ?x .
        ?x ex:gender ?y }

  The name of tenderlove’s friends who live in Santa Fe...

      SELECT ?y WHERE {
        ex:tenderlove ex:friend ?x .
        ?x ex:livesIn ex:SantaFe .
        ?x ex:name ?y }
- 94. Searching Example: FOAF (No Friends, No Self) [Figure: the toy graph from slide 88.]

      gremlin> .
      ==>v[1]
      gremlin> ./outE[@label='friend']/inV
                 /outE[@label='friend']/inV
      ==>v[1]
      ==>v[1]
      ==>v[5]
      gremlin> (./outE[@label='friend']
                 /inV)[g:assign($x)]
                 /outE[@label='friend']
                 /inV[g:except($_)][g:except($x)]
                 /@name
      ==>charlie
- 95. Searching Example: FOAF (No Friends, No Self) in SPARQL

  The name of tenderlove’s friends’ friends who are not him or his friends.

      SELECT ?z WHERE {
        ex:tenderlove ex:friend ?x .
        ?x ex:friend ?y .
        ?y ex:name ?z .
        FILTER (?y != ex:tenderlove && NOT EXISTS { ex:tenderlove ex:friend ?y }) }
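For comparison with the Gremlin and SPARQL versions, the same "friends of friends, minus my friends and myself" search can be sketched over plain adjacency lists. The `FRIEND` and `NAME` tables below are hypothetical stand-ins for the toy graph (in particular, which friend links to vertex 5 is assumed), so this is an illustration rather than a transcription of the dataset.

```python
# Hypothetical friend adjacency lists and names, loosely modeled on the toy graph.
FRIEND = {1: [2, 3], 2: [1, 5], 3: [1]}
NAME = {1: "tenderlove", 2: "sixwing", 3: "marko", 5: "charlie"}

def foaf(me):
    """Names of my friends' friends, excluding my friends and myself."""
    friends = FRIEND.get(me, [])                      # ./outE[friend]/inV, kept as $x
    candidates = [ff for f in friends for ff in FRIEND.get(f, [])]
    excluded = set(friends) | {me}                    # g:except($x) and g:except($_)
    return [NAME[v] for v in candidates if v not in excluded]

print(foaf(1))  # ['charlie']
```

The exclusion set plays the role of the two `g:except` filters: one removes the starting vertex, the other removes the vertices saved during the first hop.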
- 96. Searching Example: Friend’s Favorites [Figure: the toy graph from slide 88.]

      gremlin> .
      ==>v[1]
      gremlin> ./outE[@label='friend']/inV
                 /outE[@label='favorite']/inV
      ==>v[6]
      ==>v[6]
      gremlin> ./outE[@label='friend']/inV
                 /outE[@label='favorite' and @created_at>234500]
                 /inV/@name
      ==>Bryce Canyon
- 97. Loading Identical Data into MySQL and Neo4j On my laptop. 10,000,000 edges are created between 100,000 vertices. Random assignment with 50% favorite-edges and 50% friend-edges. This is a dense, relatively unnatural graph—everyone is heavily connected.42 42 The largest Neo4j instance that I know of contained 100,030,002 (100 million) vertices, 3,041,030,000 (3 billion) edges, and 140,120,000 (140 million) properties. This was deployed on Amazon EC2 and was yielding FOAF traversals, on average, in ∼50ms (again, index-free traversal). Figures provided by Todd Stavish (Stav.ish Consulting [http://blog.stavi.sh/]).
- 98. Play Query “What do my friends’ friends favorite?”
- 99. Querying Random Vertices with Repeats

      mysql> SELECT count(favorite.inV) FROM friend as fa, friend as fb, favorite
               WHERE fa.outV=XXX AND fa.inV=fb.outV AND fb.inV=favorite.outV;
      29.72 sec -- vertex 110752
      0.330 sec -- vertex 110752 REPEAT
      10.10 sec -- vertex 145893
      11.64 sec -- vertex 126993
      0.250 sec -- vertex 126993 REPEAT
      14.37 sec -- vertex 136442
      6.990 sec -- vertex 154837
      0.240 sec -- vertex 154837 REPEAT

      gremlin> g:count(g:id(XXX)/outE[@label='friend']/inV
                 /outE[@label='friend']/inV/outE[@label='favorite']/inV)
      3.646 sec -- vertex 110752
      0.350 sec -- vertex 110752 REPEAT
      0.756 sec -- vertex 145893
      3.251 sec -- vertex 126993
      0.211 sec -- vertex 126993 REPEAT
      1.462 sec -- vertex 136442
      1.875 sec -- vertex 154837
      0.268 sec -- vertex 154837 REPEAT
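To make the comparison concrete, the sketch below runs the play query both ways on a tiny, hypothetical edge list: once as the three-way self-join from the slide (using Python's built-in `sqlite3` in place of MySQL, which is an assumption made so the example is self-contained) and once as a walk over adjacency lists. It shows only that the two formulations count the same paths; it says nothing about the timings above.

```python
import sqlite3
from collections import defaultdict

# Hypothetical edge lists as (outV, inV) pairs.
friend = [(1, 2), (1, 3), (2, 1), (2, 5), (3, 1)]
favorite = [(1, 4), (2, 6), (3, 6), (5, 6)]

# Relational formulation: two tables and a three-way join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE friend (outV INT NOT NULL, inV INT NOT NULL)")
db.execute("CREATE TABLE favorite (outV INT NOT NULL, inV INT NOT NULL)")
db.executemany("INSERT INTO friend VALUES (?, ?)", friend)
db.executemany("INSERT INTO favorite VALUES (?, ?)", favorite)

def count_by_join(v):
    sql = ("SELECT count(favorite.inV) FROM friend AS fa, friend AS fb, favorite "
           "WHERE fa.outV = ? AND fa.inV = fb.outV AND fb.inV = favorite.outV")
    return db.execute(sql, (v,)).fetchone()[0]

# Graph formulation: each vertex holds direct references to its neighbors.
friend_out = defaultdict(list)
favorite_out = defaultdict(list)
for out_v, in_v in friend:
    friend_out[out_v].append(in_v)
for out_v, in_v in favorite:
    favorite_out[out_v].append(in_v)

def count_by_traversal(v):
    return sum(len(favorite_out[ff])
               for f in friend_out[v]
               for ff in friend_out[f])

print(count_by_join(1), count_by_traversal(1))  # 3 3
```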
- 100. Web of Data Detour
- 101. A Traversal Detour Through the Web of Data [Figure: the Linked Open Data cloud diagram as of July 2009, showing interlinked datasets such as DBpedia, Freebase, GeoNames, MusicBrainz, DBLP, UniProt, and PubMed.] Image produced by Richard Cyganiak and Anja Jentzsch. [http://linkeddata.org/]
- 102. Defining the Web of Data • The Web of Data is similar to the Web of Documents (of common knowledge), but instead of referencing documents (e.g. HTML, images, etc.) with the URI address space, individual data are referenced.43, 44 (http://markorodriguez.com, foaf:fundedBy, http://atti.com) (http://markorodriguez.com, foaf:name, "Marko Rodriguez") (http://markorodriguez.com, foaf:age, "30") (http://markorodriguez.com, foaf:knows, http://tenderlovemaking.com) • In graph theoretic terms, the Web of Data is a multi-relational graph defined as G ⊆ (U ∪ B) × U × (U ∪ B ∪ L), where U is the set of all URIs, B is the set of all blank/anonymous nodes, and L is the set of all literals. 43 The Web of Data is also known as the Linked Data Web, the Giant Global Graph, the Semantic Web, the RDF graph, etc. 44 * Rodriguez, M.A., “Interpretations of the Web of Data,” Data Management in the Semantic Web, eds. H. Jin and Z. Lv, Nova Publishing, in press, 2010. [http://arxiv.org/abs/0905.3378] * Rodriguez, M.A., “A Graph Analysis of the Linked Data Cloud,” Technical Report, KRS-2009-01, 2009. [http://arxiv.org/abs/0903.0194]
- 103. Some of the Datasets on the Web of Data, listed as data set (domain): audioscrobbler (music), bbclatertotp (music), bbcplaycountdata (music), bbcprogrammes (media), budapestbme (computer), chebi (biology), crunchbase (business), dailymed (medical), dblpberlin (computer), dblphannover (computer), dblprkbexplorer (computer), dbpedia (general), doapspace (social), drugbank (medical), eurecom (computer), eurostat (government), flickrexporter (images), flickrwrappr (images), foafprofiles (social), freebase (general), geneid (biology), geneontology (biology), geonames (geographic), govtrack (government), homologene (biology), ibm (computer), ieee (computer), interpro (biology), jamendo (music), laascnrs (computer), libris (books), lingvoj (reference), linkedct (medical), linkedmdb (movie), magnatune (music), musicbrainz (music), myspacewrapper (social), opencalais (reference), opencyc (general), openguides (reference), pdb (biology), pfam (biology), pisa (computer), prodom (biology), projectgutenberg (books), prosite (biology), pubguide (books), qdos (social), rae2001 (computer), rdfbookmashup (books), rdfohloh (social), resex (computer), riese (government), semanticweborg (computer), semwebcentral (social), siocsites (social), surgeradio (music), swconferencecorpus (computer), taxonomy (reference), umbel (general), uniref (biology), unists (biology), uscensusdata (government), virtuososponger (reference), w3cwordnet (reference), wikicompany (business), worldfactbook (government), yago (general), ...
- 104. Web of Data Dataset Dependencies [Figure: dependency graph of the Web of Data datasets, showing which datasets link to which (e.g. dbpedia, freebase, geonames, musicbrainz, uniprot, pubmed).]
- 105. Web of Data Transforms Development Paradigm A new application development paradigm emerges. No longer do data and application providers need to be the same entity (left). With the Web of Data, it is possible for developers to write applications that utilize data that they do not maintain (right).45 [Figure: left, Applications 1 to 3 each pair their own processes with their own data structures on hosts 127.0.0.1, 127.0.0.2, and 127.0.0.3; right, the processes of Applications 1 to 3 sit on top of a shared Web of Data whose structures are spread over the same three hosts.] 45 Rodriguez, M.A., “A Reflection on the Structure and Process of the Web of Data,” Bulletin of the American Society for Information Science and Technology, 35(6), pp. 38–43, doi:10.1002/bult.2009.1720350611, 2009. [http://arxiv.org/abs/0908.0373]
- 106. Extending our Knowledge of Bryce Canyon National Park 46

      gremlin> $h := lds:open()
      gremlin> $_ := g:id-v($h, 'http://dbpedia.org/resource/Bryce_Canyon_National_Park')
      ==>v[http://dbpedia.org/resource/Bryce_Canyon_National_Park]
      gremlin> ./outE
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:reference -> http://www.nps.gov/brca/]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:iucnCategory -> "II"@en]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:numberOfVisitors -> "1012563"^^<xsd:integer>]
      ==>e[dbpedia:Bryce_Canyon_National_Park - skos:subject -> dbpedia:Category:Colorado_Plateau]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:visitationNum -> "1012563"^^<xsd:int>]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:abstract -> "Bryce Canyon National Park is a national park located in southwestern Utah in the United States..."@en]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:area -> "35835.0"^^<http://dbpedia.org/datatype/acre>]
      ==>e[dbpedia:Bryce_Canyon_National_Park - rdf:type -> dbpedia-owl:ProtectedArea]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:location -> dbpedia:Garfield_County%2C_Utah]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:nearestCity -> dbpedia:Panguitch%2C_Utah]
      ==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:established -> "1928-09-15"^^<xsd:date>]
      ...

  46 Linked Data Sail (LDS) was developed by Joshua Shinavier (RPI and TinkerPop) and connects to Gremlin through Gremlin’s native support for Sail (i.e. for RDF graphs). LDS caches the traversed aspects of the Web of Data into any quad-store (e.g. MemoryStore, AllegroGraph, HyperGraphSail, Neo4jSail, etc.).
- 107. Augmenting Traversals with the Web of Data Let’s extend our query over the Web of Data. Perhaps incorporate that into our searching, scoring, ranking, and recommendation.

      gremlin> $visits := ./outE[@label='dbpprop:visitationNum']/inV/@value
      ==>1012563
      gremlin> $acreage := ./outE[@label='dbpprop:area']/inV/@value
      ==>35835.0
      ### imagine wrapping traversals in Gremlin functions:
      ### func lds:acreage($h, $v) and func lds:visitors($h, $v)
      gremlin> ./outE[@label='friend']/inV/outE[@label='favorite']
                 /inV[lds:acreage($h, .) < 1000000 and lds:visitors($h, .) < 2000000]/@name
      ==>Bryce Canyon

  Thus, what do tenderlove’s friends favorite that are small in acreage and visitation?47 47 In Gremlin, it is possible to have multiple graphs open in parallel and thus, mix and match data from each graph as desired. Hence, as demonstrated by the example above, it is possible to mix Web of Data RDF graph data and Blueprints property graph data.
- 108. Using the Web of Data for Music Recommendation Yet another aside: using only the Web of Data to recommend musicians/bands with a simplistic, edge-boolean spreading activation algorithm.48

      gremlin> $_ := g:id('http://dbpedia.../Grateful_Dead')
      ==>v[http://dbpedia.../Grateful_Dead]
      gremlin> lds:spreading-activation(.)
      ==>Jerry Garcia Acoustic Band
      ==>BK3
      ==>Phil Lesh and Friends
      ==>Old and In the Way
      ==>RatDog
      ==>The Dead
      ==>Heart of Gold Band
      ==>Legion of Mary
      ==>The Tubes
      ==>Bob Dylan
      ==>New Riders of the Purple Sage
      ==>Bruce Hornsby
      ==>Donna Jean Godchaux
      ==>Kingfish
      ==>Jerry Garcia Band
      ==>Donna Jean Godchaux Band
      ==>The Other Ones
      ==>Bobby and the Midnites
      ==>Furthur
      ==>Rhythm Devils

  48 Please read the following for interesting, deeper ideas in this space: Clark, A., “Associative Engines: Connectionism, Concepts, and Representational Change,” MIT Press, 1993.
- 109. Another View of the TinkerPop Stack [Figure: resources served via GET http://{host}/{resource}, with a Local Dataset connected to the Web of Data through owl:sameAs links.]
- 110. Recommendation
- 111. Extending the Schema for Some Richer Examples For the last part of this presentation on recommendation, we will extend the data schema to include tags (a place can be tagged with a tag). This will allow for some richer examples.49, 50 [Figure: Person (name=<string>, location=<string>, gender=<string>, type=Person); Place (name=<string>, lat=<double>, long=<double>, type=Place); Tag (name=<string>, type=Tag); edges: Person friend Person, Person favorite Place, Place tagged Tag.] 49 Please note that 1.) “place” can be item/thing/book/music/etc. 2.) “favorite” can be likes/purchased/visited/etc. 3.) “tag” can be category/etc. A particular use case is presented, but with a little imagination, application to other schemas is, of course, plausible. 50 The following examples have experimental syntax that may differ slightly from the official Gremlin 0.5 release.
- 112. Recommendation Example: Friend Finder • Open Friendship Triangles: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 51 (people) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine who my friends are. 3. Determine who my friends’ friends are... 4. ...that are not already my friends or me. (weighted by the number of overlapping friends: more overlaps, more traversers at that user vertex) 5. Sort return map by number of traversers at those user/people vertices.

      $m := g:map()
      (./outE[@label='friend']/inV)[g:assign($x)]
        /outE[@label='friend']/inV
        /.[g:except($x)][g:except($_)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  51 $R_x \circ A^{\text{friend}} \cdot A^{\text{friend}} \circ n(A^{\text{friend}}) \circ n(I)$, where x is the user/person vertex. The in-degree centrality vector of the derived adjacency matrix determines the resultant V rank.
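The five steps above can be sketched in Python with a `Counter` standing in for the return map `$m`. The `FRIEND` table is hypothetical sample data, and this is one possible reading of the traversal, not the Gremlin implementation itself.

```python
from collections import Counter

# Hypothetical friend adjacency lists: person id -> list of friend ids.
FRIEND = {1: [2, 3], 2: [1, 4, 5], 3: [1, 5]}

def friend_finder(me):
    m = Counter()                               # step 1: return map
    friends = FRIEND.get(me, [])                # step 2: my friends
    excluded = set(friends) | {me}
    for f in friends:
        for ff in FRIEND.get(f, []):            # step 3: my friends' friends
            if ff not in excluded:              # step 4: not a friend, not me
                m[ff] += 1                      # one traverser per path
    return m.most_common()                      # step 5: sort by traversers

print(friend_finder(1))  # [(5, 2), (4, 1)]
```

Person 5 outranks person 4 because two of person 1's friends lead there, which is the "more overlaps, more traversers" weighting described in step 4.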
- 113. Recommendation Example: Follower Finder • People Similarity based on Favorites: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 52 (people) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what I favorite/like/prefer/purchased/etc. 3. Of those things I favorite, who else favorites them that are not me? (weighted user similarity based on taste: the more I share in common, the more traversers are at that user vertex). 4. Filter out those people that are my friends. 5. Sort return map by number of traversers at those people vertices.

      $m := g:map()
      (./outE[@label='favorite']/inV)[g:assign($x)]
        /inE[@label='favorite']/outV[g:except($_)]
        /outE[@label='friend']/inV[g:except($x)]/../..[g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  52 $R_x \circ A^{\text{favorite}} \cdot A^{\text{favorite}} \circ n(I) \circ n(A^{\text{friend}})$. The in-degree centrality vector of the derived adjacency matrix determines the resultant V rank.
- 114. Recommendation Example: Follower Finder 2 • People Similarity based on Tags: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 53, 54 (people) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine the tags associated with what I favorite. 3. What else is tagged with those tags? 4. Who favorites those tagged items that are not me?55 5. Sort return map by number of traversers at those people vertices.

      $m := g:map()
      ./outE[@label='favorite']/inV/outE[@label='tagged']/inV
        /inE[@label='tagged']/outV
        /inE[@label='favorite']/outV[g:except($_)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  53 $R_x \circ A^{\text{favorite}} \cdot A^{\text{tagged}} \cdot A^{\text{tagged}} \cdot A^{\text{favorite}} \circ n(I)$. The in-degree centrality vector of the derived adjacency matrix determines the resultant V rank. 54 Variations on this theme can be used for expertise identification. 55 A user’s friends could be recommended. This filter was ignored for the sake of brevity.
- 115. Recommendation Example: “Users Who Like x Also Like y” • Co-Favorited Places: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 56, 57 (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine who has favorited (i.e. liked) place x. 3. What else have they favorited that is not place x? 4. Sort return map by number of traversers at those place vertices.

      $m := g:map()
      $x/inE[@label='favorite']/outV
        /outE[@label='favorite']/inV[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  56 $R_x \circ A^{\text{favorite}} \cdot A^{\text{favorite}} \circ n(C_x)$. In-degree centrality of derived matrix determines rank. 57 This type of recommendation may be considered content-based recommendation. When two vertices share content (relations to other vertices), they are deemed similar. Co-relation, in general, is a pattern for content-based recommendation. Look back at the first three recommendation examples: “friend finder” (co-friend), “follower finder” (co-favorites), “follower finder 2” (co-tagged-favorites).
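A minimal Python sketch of the co-favorited traversal follows, assuming a hypothetical `FAVORITE` table (person id to favorited place ids) and an inverse index built from it to play the role of the incoming `favorite` edges.

```python
from collections import Counter, defaultdict

# Hypothetical data: person id -> places that person has favorited.
FAVORITE = {1: [4, 6], 2: [6], 3: [6, 7], 5: [4, 6]}

# Inverse index (place -> people), the counterpart of following inE[favorite].
FAVORITED_BY = defaultdict(list)
for person, places in FAVORITE.items():
    for place in places:
        FAVORITED_BY[place].append(person)

def also_like(x):
    m = Counter()                               # step 1: return map
    for person in FAVORITED_BY.get(x, []):      # step 2: who favorited x
        for place in FAVORITE[person]:          # step 3: what else they favorited
            if place != x:
                m[place] += 1
    return m.most_common()                      # step 4: sort by traversers

print(also_like(6))  # [(4, 2), (7, 1)]
```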
- 116. Recommendation Example: Places Related through Tags • Co-Tagged Places: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 58, 59 (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine the tags for place x. 3. What else is tagged the same as x that is not x? 4. Sort return map by number of traversers at those place vertices.

      $m := g:map()
      $x/outE[@label='tagged']/inV
        /inE[@label='tagged']/outV[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  58 $R_x \circ A^{\text{tagged}} \cdot A^{\text{tagged}} \circ n(I)$. In-degree centrality of derived matrix determines rank. 59 Yet another type of content-based recommendation, but items are similar to each other not because of co-favoriting, but because of co-tagging. Think about mixing and matching different similarities. How do you weight the different “co”-graphs (i.e. $aA^{\alpha} + bA^{\beta}$)? Statistical techniques can surface the significant factors.
- 117. Recommendation Example: Tags Related through Places • Co-Placed Tags: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 60, 61 (tags) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what has been tagged x. 3. What other tags do those items have that are not x? 4. Sort return map by number of traversers at those tag vertices.

      $m := g:map()
      $x/inE[@label='tagged']/outV
        /outE[@label='tagged']/inV[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  60 $R_x \circ A^{\text{tagged}} \cdot A^{\text{tagged}} \circ n(I)$. In-degree centrality of derived matrix determines rank. 61 In the previous example, items were related if they shared the same tags. In this example, tags are related if they are used to tag the same items. Anything can be deemed similar to anything else if there exist paths between such items, inferred or explicit. The path taken (Ψ) determines the meaning/type of similarity. Cognitive philosophers/psychologists see this as associativity through spreading activation.
- 118. Recommendation Example: Collaborative Filtering 1 • Basic Collaborative Filtering: $(V \times \Psi) \to (V \times \mathbb{N}^+)$ 62 (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what I favorite/like/prefer/purchased/etc. 3. Of those things I favorite, who else favorites them? (weighted user similarity based on taste: the more I share in common, the more traversers are at that person vertex). 4. Of those similar users, what do they favorite that I don’t already favorite? 5. Sort return map by number of traversers at those favorited places.

      $m := g:map()
      (./outE[@label='favorite']/inV)[g:assign($x)]
        /inE[@label='favorite']/outV
        /outE[@label='favorite']/inV[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  62 Related to “follower finder” from the previous slides. However, it takes the traversal one step further. Instead of simply finding who is similar to me with respect to favoriting, you then compute what those similar users also favorite. This is a classic case for path-reuse as an optimization.
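The basic collaborative filtering traversal can be sketched in Python as three nested hops: my favorites, the people who share them, and what those people favorite. The `FAVORITE` table is hypothetical sample data. As in the Gremlin expression, the starting user is not filtered out at the middle hop; it contributes nothing because everything it favorites is excluded at the end.

```python
from collections import Counter, defaultdict

# Hypothetical data: person id -> places that person has favorited.
FAVORITE = {1: [4, 6], 2: [6, 7], 3: [4, 6, 7, 8]}

# Inverse index (place -> people) for walking favorite edges backwards.
FAVORITED_BY = defaultdict(list)
for person, places in FAVORITE.items():
    for place in places:
        FAVORITED_BY[place].append(person)

def collaborative_filter(me):
    m = Counter()                                   # step 1: return map
    mine = FAVORITE.get(me, [])                     # step 2: what I favorite ($x)
    for place in mine:
        for other in FAVORITED_BY[place]:           # step 3: who else favorites it
            for candidate in FAVORITE[other]:       # step 4: what they favorite...
                if candidate not in mine:           # ...that I do not already
                    m[candidate] += 1
    return m.most_common()                          # step 5: sort by traversers

print(collaborative_filter(1))  # [(7, 3), (8, 2)]
```

A user who shares two favorites with me is reached twice at step 3, so that user's other favorites are counted twice. That is how taste similarity ends up weighting the result without a separate similarity computation.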
- 119. Recommendation Example: Collaborative Filtering 2 • Collaborative “Category” Filtering: $(V \times \Psi \times V) \to (V \times \mathbb{N}^+)$ (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what I favorite... 3. ...in category/tag x. 4. Of those things I favorite, who else favorites them? 5. Of those similar users, what do they favorite categorized/tagged x ... 6. ...that I don’t already favorite? 7. Sort return map by number of traversers at those favorited places.

      $m := g:map()
      (./outE[@label='favorite']/inV
        /outE[@label='tagged']/inV[@name='bar']/../..)[g:assign($x)]
        /inE[@label='favorite']/outV
        /outE[@label='favorite']/inV/outE[@label='tagged']/inV[@name='bar']
        /../..[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)
- 120. Recommendation Example: Collaborative Filtering 3 • Collaborative “Location” Filtering: $(V \times \Psi \times \mathbb{R}^4) \to (V \times \mathbb{N}^+)$ 63 (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what I favorite. 3. Of those things I favorite, who else favorites them? 4. Of those similar users, what do they favorite in bounding box x1, x2, y1, y2... 5. ...that I don’t already favorite? 6. Sort return map by number of traversers at those places.

      $m := g:map()
      (./outE[@label='favorite']/inV)[g:assign($x)]
        /inE[@label='favorite']/outV
        /outE[@label='favorite']/inV[@lat > $x1 and @lat < $x2 ...]
        /.[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)

  63 Location-filtering idea adapted from the Bonobo recommender engine by Nate Murray (AT&Ti).
- 121. Recommendation Example: Collaborative Filtering 4 • Collaborative “State of Mind” Filtering: $(V \times \Psi \times \mathbb{N}^+) \to (V \times \mathbb{N}^+)$ (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what I have favorited in the last x minutes. 3. Of those things I recently favorited, who else favorites them? 4. Of those similar users, what do they favorite that I don’t? 5. Sort return map by number of traversers at those favorited places.

      $m := g:map()
      (./outE[@label='favorite' and @created_at > 1234567]/inV)[g:assign($x)]
        /inE[@label='favorite']/outV
        /outE[@label='favorite']/inV[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)
- 122. Recommendation Example: Collaborative Filtering 5 • Collaborative “Zeitgeist” Filtering: $(V \times \Psi \times \mathbb{N}^+) \to (V \times \mathbb{N}^+)$ (places) 1. Create return map (i.e. $V \times \mathbb{N}^+$). 2. Determine what I have favorited. 3. Of those things I favorited, who else favorites them? 4. Of those similar users, what have they favorited in the last x minutes... 5. ...that I don’t already favorite? 6. Sort return map by number of traversers at those favorited places.

      $m := g:map()
      (./outE[@label='favorite']/inV)[g:assign($x)]
        /inE[@label='favorite']/outV
        /outE[@label='favorite' and @created_at > 1234567]
        /inV[g:except($x)][g:op-value('+',$m,.,1)]
      g:sort($m,'value',true)
- 123. ...keep going all day long.
- 124. A Cornucopia of Recommendations – Part 1 • It is possible to use offline statistical methods to determine which factors of a vertex contribute to user interest (e.g. PCA+KMeans to determine metadata contributing to shared interests). (slow) • Then, use online, real-time graph methods to incorporate those features into the traversal (i.e. to define Ψ). (fast) Mix various traversals together: $aA^{\alpha} + bA^{\beta} + \ldots + zA^{\zeta}$ (or other, perhaps non-linear combinations).64 64 Though not discussed in this presentation, sampling techniques can be used to increase the speed of a traversal. For example, ./outE[g:rand-real() > 0.5] only traverses, on average, 50% of the edges. Moreover, if edges have weights, those weights can be used to create probability distributions and thus, biased sampling can be implemented (i.e. random walks).
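One simple reading of the weighted mix above is to combine the ranked maps that individual traversals return for the same user, scaling each by its coefficient. The sketch below assumes exactly that: the two input maps and the weights 0.5 and 0.25 are hypothetical, and a linear combination is only one of the options the slide mentions.

```python
def mix(rankings, weights):
    """Linear combination of scored-vertex maps: sum of weight * score per vertex."""
    combined = {}
    for ranking, weight in zip(rankings, weights):
        for vertex, value in ranking.items():
            combined[vertex] = combined.get(vertex, 0.0) + weight * value
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical outputs of two traversals for the same starting vertex.
co_favorited = {4: 2, 7: 1}
co_tagged = {7: 3, 9: 1}

print(mix([co_favorited, co_tagged], [0.5, 0.25]))
# [(7, 1.25), (4, 1.0), (9, 0.25)]
```

The coefficients are where offline statistics can feed back into the online traversal: they can be tuned or learned separately while the traversals themselves stay fast.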
- 125. A Cornucopia of Recommendations – Part 2 • ...also, be creative. Develop numerous recommendation traversals for numerous problem-solving situations.65 • Make use of user click-behavior to determine usefulness. • ...Or, allow users to select which algorithms they want to apply (give them the option to select how they want to solve their problems). 65 For a fine review of graph-based techniques and ideas regarding recommendation, please see: * Mirza, B.J., Keller, B., Ramakrishnan, N., “Studying Recommendation Algorithms by Graph Analysis,” Journal of Intelligent Information Systems, 20(2), pp. 131–160, doi:10.1023/a:1021819901281, 2003. * Huang, Z., Zeng, D., Chen, H., “A Link Analysis Approach to Recommendation Under Sparse Data,” Proceedings of the Tenth Americas Conference on Information Systems, 2004. * Perugini, S., Goncalves, M.A., Fox, E., “Recommender System Research: A Connection-Centric Survey,” Journal of Intelligent Information Systems, 23(2), pp. 107–143, 2004. * Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks,” ACM Transactions on Information Systems, 27(2), pp. 1–20, doi:10.1145/1462198.1462199, 2009. [http://arxiv.org/abs/0807.0023]
- 126. Traversal Algorithms Simulate User Behavior • A traversal is like a simulation of the user(s). • If all the user had were direct links (i.e. a basic user-interface over the dataset), what path would they take to solve their problem? • Operationalize as a traversal and you have simulated (and sped up) their problem-solving behavior.[66][67] [66] Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence, 5712, pp. 813–820, doi:10.1007/978-3-642-04592-9_101, Springer-Verlag, 2009. [http://arxiv.org/abs/0904.0027] – see Faith in the Algorithm, in general: http://faithinthealgorithm.net. [67] Think of the graph data set as a conceptual graph: “things” and their relationships to each other, the world as index. Think how your mind composes, manipulates, and makes use of such structures to solve problems: to think, to infer, to creatively combine (i.e. join, traverse) ideas. Automate that process. ...automate the process that generates that process. [http://arxiv.org/abs/0704.3395]
- 127. Graph Traversal Model: Benefits and Drawbacks • Benefits: The solution is explainable (i.e. the factors/paths are known). Evaluations can happen in real-time and on live data.[68][69] Can easily develop/deploy new traversals for different problems.[70] • Drawbacks: If intuition fails, derive factors with offline statistical techniques.[71][72] [68] A user can add an edge and then recalculate a traversal. [69] It is noted that this depends on the complexity of the traversal and density of the graph. [70] For very rich data models, this is a promising proposition. [71] In the past, my method has been to use intuition to develop traversals, and then with sample data, validate/tweak the traversal [http://arxiv.org/abs/cs/0605112, http://arxiv.org/abs/0807.0023]. Also, for live systems with active users, using click-behavior is possible. [72] Think about deriving Ψ from the paths that the users take through the data. “Ruts,” given the law of large numbers, can expose the collective’s problem-solving behavior. In short, study your users to derive Ψ.
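One possible reading of "study your users to derive Ψ" is to count how often users follow each sequence of edge labels and keep the frequent sequences as candidate traversals. The sketch below is only that reading: the session format (a list of edge labels per user session), the function name `rut_weights`, and the `min_count` threshold are assumptions made for illustration and do not come from the presentation.

```python
from collections import Counter


def rut_weights(sessions, min_count=2):
    """Turn logged user paths into relative weights for candidate traversals.

    `sessions` is an iterable of edge-label sequences, one per user session.
    Sequences seen fewer than `min_count` times are treated as noise.
    Returns a dict mapping each surviving label sequence to its share
    of the surviving sessions.
    """
    counts = Counter(tuple(path) for path in sessions)
    frequent = {path: n for path, n in counts.items() if n >= min_count}
    total = sum(frequent.values())
    if total == 0:
        return {}
    return {path: n / total for path, n in frequent.items()}


# Hypothetical click log: three users took one path, one user took another.
sessions = [
    ["likes", "likedBy", "likes"],
    ["likes", "likedBy", "likes"],
    ["knows", "likes"],
    ["likes", "likedBy", "likes"],
]
print(rut_weights(sessions, min_count=1))
```

The resulting shares could serve as starting coefficients when mixing traversals, to be validated against sample data as footnote 71 suggests.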
- 128. Conclusion • Multi-relational graphs are more expressive than common, single-relational graphs. • The path algebra serves as a formal model for traversing through a graph in a complex manner. • Graph databases provide index-free adjacency, which yields fast traversals. • The Web of Data is a multi-relational graph distributed across computers worldwide. • ...traversal algorithms in the future of the Web of Data?
- 129. Acknowledgements • The ideas presented have been developed over the course of my time with the following institutions: University of California at Santa Cruz, Vrije Universiteit Brussel, Los Alamos National Laboratory, and AT&T Interactive. • My core collaborators: Alberto Pepe (Harvard), Johan Bollen (University of Indiana), Herbert Van de Sompel (LANL), Jennifer H. Watkins (LANL), Peter Neubauer (NeoTechnology), Joshua Shinavier (Rensselaer Polytechnic Institute), and Pavel Yaskevich (“No one, from no where.”) • The Neo4j team [http://neo4j.org] has been instrumental in influencing my thoughts with respect to the database considerations of graph processing. These people include Peter Neubauer, Emil Eifrem, Tobias Ivarsson, Johan Svensson, Mattias Persson... • My current institution of AT&Ti has provided me with ideas and support: Aaron Patterson, Rand Fitzpatrick, Nate Murray. • The greater TinkerPop [http://tinkerpop.com] community for their discussions, code submissions, and general excitement in the space.
