Gremlin: A Graph-Based Programming Language

Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation.

  • This is a Java-embeddable language that can perform queries on generalized graphs using a clear and concise XPath-based query language. In addition, it adds typical scripting-language constructs that make it a complete language. However, these additions are pretty weak, and I think it would have been better to restrict it to being a pure query language (like SQL) and use a host language (such as Java) for the other stuff.

    Another thing I thought was a little inelegant was the number of special reserved identifiers in the query language, such as 'outE', 'inE', 'outV', and 'inV'. I understand this was necessary to stay compatible with XPath, but I think it would have been better to move away from XPath and put such special identifiers in the syntax of the language itself.

    One thing I did like was how easy it is to define computed edges in the graph, sort of like views in SQL. This can be considered a kind of reasoning. I was particularly struck by the clear insight in the presentation that 'Graph-based reasoning is the process of making explicit what is implicit in the graph'.

Transcript

  • 1. Gremlin G = (V, E) A Graph-Based Programming Language Marko A. Rodriguez T-5, Center for Nonlinear Studies Los Alamos National Laboratory http://markorodriguez.com http://gremlin.tinkerpop.com February 25, 2010
  • 2. Abstract Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation. Center for Nonlinear Studies PostDoc Seminar – Los Alamos National Laboratory – February 25, 2010
  • 3. Acknowledgements
    • Marko A. Rodriguez [http://markorodriguez.com] designed, developed, tested, and documented Gremlin.
    • Peter Neubauer [http://www.linkedin.com/in/neubauer] aided in the design and the evangelizing of Gremlin.
    • Pavel Yaskevich [http://github.com/xedin] aided in the development of user defined functions in Gremlin.
    • Joshua Shinavier [http://fortytwo.net] provided initial conceptual support for Gremlin.
    • Ketrina Yim [http://csillustrated.berkeley.edu] designed the logo for Gremlin.
    • Gremlin-Users Group [http://groups.google.com/group/gremlin-users] provided much direction in the design and implementation of Gremlin.
  • 4. Outline • Introduction to Graphs and Graph Software • Basic Gremlin Concepts • Gremlin Language Description • Advanced Gremlin Concepts • Conclusions
  • 5. Outline • Introduction to Graphs and Graph Software • Basic Gremlin Concepts • Gremlin Language Description • Advanced Gremlin Concepts • Conclusions
  • 6. What is a Graph?
    • A graph (network) is composed of a collection of vertices (dots) and edges (lines). There are many types of graphs: directed/undirected, weighted, attributed, etc.
    [Figure: examples of graph types: vertex-labeled, vertex-attributed, edge-labeled, edge-attributed, directed, undirected, weighted, multi, hyper, half-edge, pseudo, regular, semantic, and resource description framework graphs.]
  • 7. Why Use a Graph?
    • A graph is a very general data structure that can be used to model various systems. A graph can model the structure of transportation, technological, bibliographic, etc. systems. A graph can model a list, a map, a tree, etc.
    • There are numerous graph algorithms that are defined independent of the domain of the graph model.
    • There are numerous graph databases, frameworks, packages, etc. that aid in the creation, manipulation, and analysis of graphs.
  • 8. Graph Databases, Frameworks, and Packages
    • Neo4j Graph Database [http://neo4j.org]
    • AllegroGraph Quad Store [http://www.franz.com/agraph]
    • HyperGraphDB [http://www.kobrix.com/hgdb.jsp]
    • Java Universal Network/Graph Framework [http://jung.sourceforge.net]
    • OpenRDF Sesame Framework [http://www.openrdf.org]
    • InfoGrid Graph Database [http://infogrid.org]
    • Filament Graph Toolkit [http://filament.sourceforge.net]
    • OWLim Semantic Repository [http://www.ontotext.com/owlim]
    • Sones Graph Database [http://www.sones.com]
    • NetworkX Graph Toolkit [http://networkx.lanl.gov]
    • iGraph Toolkit [http://igraph.sourceforge.net]
    • Blueprints Graph API [http://blueprints.tinkerpop.com]
    • ... and many more.
  • 9. What Makes Gremlin Different?
    • Gremlin is a domain specific language for working with graphs.
    • Gremlin is not an application programming interface (API).
    • Gremlin makes use of various graph databases, frameworks, packages.
    • Gremlin is a language that currently has a virtual machine implementation written in Java.
    • What can be succinctly expressed in Gremlin is verbose/clumsy to express in general purpose languages such as Java, Python, Ruby, etc.
    • Gremlin allows one to map single-relational graph analysis algorithms over to the multi-relational domain.
  • 10. Single-Relational Graphs
    • In single-relational graphs, all edges have the same meaning (e.g. all edges are either friendship, kinship, worksWith, knows, etc.).
        G = (V, E ⊆ (V × V))
    • Most graph algorithms are defined for single-relational graphs (e.g. centrality/ranking, clustering/community detection, etc.).
    [Figure: vertices person-a, person-b, and person-c joined by edges.]
    NOTE: These types of graphs are also known as directed, vertex-labeled graphs.
  • 11. Multi-Relational Graphs
    • In multi-relational graphs, edges can have different meanings.
        G = (V, E ⊆ (V × V), ω : E → Σ∗)
    • Most graph software is designed for multi-relational graphs (e.g. arbitrary objects as vertices and edges, knowledge-based reasoning systems, etc.).
    [Figure: vertices person-a, book-b, and book-c joined by edges labeled authored, read, and cites.]
    NOTE: These types of graphs are also known as directed, vertex/edge-labeled graphs.
  • 12. Gremlin and Multi-Relational Graphs
    • Gremlin provides a means to elegantly map single-relational graph analysis algorithms over to the multi-relational graph domain.
    • Gremlin provides an elegant way to do automated reasoning in multi-relational graphs using path expressions.
    These two points form the primary thesis of this presentation.
    Rodriguez, M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), 29–41, doi:10.1016/j.joi.2009.06.004, LA-UR-08-03931, http://arxiv.org/abs/0806.2274, December 2009.
  • 13. Property Graphs
    • Gremlin works with a type of multi-relational graph called a property graph.
      • Vertices and edges are labeled with unique identifiers.
      • Edges are directed, labeled, and can form loops.
      • Multiple edges of the same label can exist for the same vertex pair.
      • Vertices and edges can have any number of key/value pair properties/attributes.
    Property graphs are a relatively general graph structure that can be constrained to model other graph structures, though a property-based hypergraph would be the most general (see HyperGraphDB and the JUNG API).
  • 14. Property Graphs
    [Figure: the example property graph used throughout the talk.]
    Vertices:
        v[1]  name = "marko", age = 29
        v[2]  name = "vadas", age = 27
        v[3]  name = "lop", lang = "java"
        v[4]  name = "josh", age = 32
        v[5]  name = "ripple", lang = "java"
        v[6]  name = "peter", age = 35
    Edges:
        e[7]   1-knows->2     weight = 0.5
        e[8]   1-knows->4     weight = 1.0
        e[9]   1-created->3   weight = 0.4
        e[10]  4-created->5   weight = 1.0
        e[11]  4-created->3   weight = 0.4
        e[12]  6-created->3   weight = 0.2
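As a rough illustration of the property-graph model described on slides 13 and 14, the sketch below holds the example graph in plain Python dictionaries. The dictionary layout and the helper name `out_edges` are assumptions made for this example; they are not the Blueprints data model or any Gremlin API.

```python
# Example property graph from the talk, held in plain dictionaries.
# The layout below is an illustrative assumption, not the Blueprints API.

# vertex id -> key/value properties
vertices = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
    5: {"name": "ripple", "lang": "java"},
    6: {"name": "peter", "age": 35},
}

# edge id -> (out vertex, label, in vertex, key/value properties)
edges = {
    7: (1, "knows", 2, {"weight": 0.5}),
    8: (1, "knows", 4, {"weight": 1.0}),
    9: (1, "created", 3, {"weight": 0.4}),
    10: (4, "created", 5, {"weight": 1.0}),
    11: (4, "created", 3, {"weight": 0.4}),
    12: (6, "created", 3, {"weight": 0.2}),
}


def out_edges(vertex_id):
    """Return the ids of the edges whose tail (out vertex) is vertex_id."""
    return [eid for eid, (tail, _, _, _) in edges.items() if tail == vertex_id]


# Roughly what ./outE and ./outE/@weight return when the current object is v[1].
marko_out = out_edges(1)
marko_weights = [edges[eid][3]["weight"] for eid in marko_out]
```

Every vertex and edge carries its own identifier and an open-ended property map, which is the feature that separates a property graph from a plain labeled graph.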
  • 15. Outline • Introduction to Graphs and Graph Software • Basic Gremlin Concepts • Gremlin Language Description • Advanced Gremlin Concepts
  • 16. Gremlin System Architecture
    • The Gremlin console is a scripting environment which allows for the dynamic evaluation of Gremlin code.
    • Gremlin implements JSR 223 which allows Gremlin to also be used within the Java language and thus, as a virtual machine directly accessible to Java applications. Popular JSR 223 implementations include Jython, JRuby, and Groovy. For a fine list of implementations see https://scripting.dev.java.net.
    • Blueprints is a set of interfaces for abstract data structures such as graphs and documents. Implementations to these interfaces exist for various data management systems.
    • There exist many graph data management systems that span various graph data models (e.g. edge labeled graphs, RDF graphs, hypergraphs, etc.).
    [Figure: architecture diagram with the components Gremlin Console, Gremlin ScriptEngine, Neo4j, NativeStore, and TinkerGraph.]
  • 17. “Hello World” in the Gremlin Console
    marko$ ./gremlin.sh
             \,,,/
             (o o)
    -----oOOo-(_)-oOOo-----
    gremlin>
    gremlin> concat('goodbye', ' ', 'self')
    ==>goodbye self
  • 18. Simple Traversals in Gremlin
    gremlin> $_ := g:key('name','marko')
    ==>v[1]
    gremlin> .
    ==>v[1]
    gremlin> ./outE
    ==>e[7][1-knows->2]
    ==>e[9][1-created->3]
    ==>e[8][1-knows->4]
    gremlin> ./outE/@weight
    ==>0.5
    ==>0.4
    ==>1.0
    ./outE/@weight: “Get the current object(s). Then get the outgoing edges of those objects. Then get the weights of those edges.”
    $_ is a reserved variable meaning the root list of objects.
    [Figure: the example property graph from slide 14.]
  • 19. Simple Traversals in Gremlin
    gremlin> .
    ==>v[1]
    gremlin> ./outE[@label='created']/inV
    ==>v[3]
    gremlin> $_ := $_last
    ==>v[3]
    gremlin> ./@name
    ==>lop
    gremlin> g:map(.)
    ==>name=lop
    ==>lang=java
    ./outE[@label='created']/inV: “Get the current object(s). Then get the outgoing edges of those objects, where their labels equal ‘created’. Then get the incoming vertices of those ‘created’ edges.”
    $_last is a reserved variable meaning the last value evaluated.
    [Figure: the example property graph from slide 14.]
  • 20. Simple Traversals in Gremlin
    ./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
    ==>vadas
    [Figure: the example property graph from slide 14.]
  • 21. Simple Traversals in Gremlin
    ./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
    1. .: Get the current object(s).
    2. outE[@label='knows']: Get the outgoing edges of the current object(s), where their labels equal ‘knows’.
    3. inV[matches(@name,'va.{3}') and @age > 21]: Get the incoming vertices of those ‘knows’ edges, where the names of those vertices are 5 characters long, start with ‘va’, and whose age is greater than 21.
    4. @name: Get the name of those particular incoming vertices.
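For readers more used to a general-purpose language, the four steps above can be sketched in Python over a small copy of the example graph. The data layout and the function name `knows_matching` are assumptions for this sketch. `re.fullmatch` is used because the slide reads the pattern as "5 characters long, starting with va"; whether Gremlin's `matches()` anchors the pattern the same way is not confirmed here.

```python
import re

# Vertex properties and the out-edges of v[1] from the example graph.
vertices = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
}
edges = [(1, "knows", 2), (1, "created", 3), (1, "knows", 4)]  # (out, label, in)


def knows_matching(source, pattern, min_age):
    """Names of vertices `source` knows whose name fits `pattern` and age > min_age."""
    names = []
    for tail, label, head in edges:
        # Steps 1 and 2: outgoing 'knows' edges of the current vertex.
        if tail != source or label != "knows":
            continue
        # Step 3: filter the incoming (head) vertex on name and age.
        props = vertices[head]
        # fullmatch mirrors the slide's reading of the pattern (an assumption).
        if re.fullmatch(pattern, props["name"]) and props.get("age", 0) > min_age:
            # Step 4: project the name property.
            names.append(props["name"])
    return names


result = knows_matching(1, r"va.{3}", 21)
```

The explicit loop and conditionals are what the single path expression folds into its `[...]` filters.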
  • 22. Knowledge-Based Reasoning
    • Blueprints implements the Sesame SAIL interfaces and thus, Gremlin can be used over the many Resource Description Framework (RDF) triple/quad stores. In such cases, RDF is modeled as a property graph where the named graph component is the @ng edge property.
    • Gremlin makes use of the Sesame SAIL SPARQL engine to allow for queries based on graph-pattern matching.
        gremlin> sail:sparql('SELECT ?x ?y WHERE { ?x foaf:knows ?y }')
        ==>{y=v[http://ex.com#2], x=v[http://ex.com#1]}
        ==>{y=v[http://ex.com#4], x=v[http://ex.com#1]}
    • Gremlin is useful for knowledge-based reasoning using path expressions.
  • 23. Reasoning as Defining New Types of Adjacency
    • Graph-based reasoning is the process of making explicit what is implicit in the graph.
    • A reasoner takes a graph G and a collection of graph-patterns (i.e. transformation/rewrite rules) and creates a new graph G′ (usually, G ⊂ G′). G′ has new relationships/edges and thus, new definitions of vertex adjacency.
    • Example: The co-developers of person A are those people who have created the same software as person A and who are themselves, not person A (as person A has created the same software as him or herself).
    For these “co-developer” examples, we will use vertex 1 (marko) as the source of the reasoning process.
    [Figure: the example graph (marko, vadas, josh, peter, lop, ripple) with co-developer edges added alongside the created and knows edges.]
  • 24. The Co-Developers of Marko A. Rodriguez in SPARQL
    SELECT ?x WHERE {
      marko created ?y .
      ?z created ?y .
      ?z != marko .
      ?z name ?x
    }
    This query would return: josh and peter.
    [Figure: the example graph annotated with the query variables ?x, ?y, and ?z.]
  • 25. The Co-Developers of Marko A. Rodriguez in Gremlin
    gremlin> ./@name
    ==>marko
    gremlin> ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
    ==>josh
    ==>peter
    [Figure: the example graph with co-developer edges added.]
  • 26. The Co-Developers of Marko A. Rodriguez in Gremlin
    ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
    1. .: Get the current object(s) (i.e. vertex 1, denoting Marko).
    2. outE[@label='created']: Get the outgoing edges of the Marko vertex, where their labels equal ‘created’.
    3. inV: Get the incoming (i.e. head) vertices of those ‘created’ edges.
    4. inE[@label='created']: Get the incoming edges of those vertices, where their labels equal ‘created’.
    5. outV[g:except($_)]: Get the outgoing (i.e. tail) vertices of those ‘created’ edges, where those vertices are not the Marko vertex.
    6. @name: Get the name of those non-Marko vertices.
  • 27. Defining Co-Developers in Gremlin
    path co-developer
      ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]
    end
    Once defined, you can use it like any other path segment.
    gremlin> ./co-developer
    ==>v[4]
    ==>v[6]
    gremlin> ./co-developer/@name
    ==>josh
    ==>peter
  • 28. Defining Co-Developers in Java
    import java.util.ArrayList;
    import java.util.List;
    // Vertex, Edge, and Path come from the Blueprints/Gremlin libraries (imports not shown on the slide).

    public class CoDeveloperPath implements Path {
      public List invoke(Object root) {
        if (root instanceof Vertex) {
          // Collect every project the root vertex has created.
          List<Vertex> projects = new ArrayList<Vertex>();
          for (Edge edge : ((Vertex) root).getOutEdges()) {
            if (edge.getLabel().equals("created")) {
              projects.add(edge.getInVertex());
            }
          }
          // Collect the other creators of those projects.
          List<Vertex> coDevelopers = new ArrayList<Vertex>();
          for (Vertex project : projects) {
            for (Edge edge : project.getInEdges()) {
              if (edge.getLabel().equals("created") && edge.getOutVertex() != root) {
                coDevelopers.add(edge.getOutVertex());
              }
            }
          }
          return coDevelopers;
        } else {
          return null;
        }
      }
    }
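The co-developer rule can also be sketched in Python to show the two hops behind the path expression: out along `created` edges, then back in along `created` edges, excluding the starting vertex. The edge-list layout and the function name `co_developers` are assumptions made for this sketch and are not part of Gremlin or Blueprints.

```python
# Example graph from the talk: vertex names and (out vertex, label, in vertex) edges.
names = {1: "marko", 2: "vadas", 3: "lop", 4: "josh", 5: "ripple", 6: "peter"}
edges = [
    (1, "knows", 2),
    (1, "knows", 4),
    (1, "created", 3),
    (4, "created", 5),
    (4, "created", 3),
    (6, "created", 3),
]


def co_developers(source):
    """Vertices that created something `source` also created, excluding `source`."""
    # Hop 1 (outE[@label='created']/inV): the projects created by the source.
    projects = {head for tail, label, head in edges
                if tail == source and label == "created"}
    # Hop 2 (inE[@label='created']/outV[g:except($_)]): their other creators.
    return [tail for tail, label, head in edges
            if label == "created" and head in projects and tail != source]


marko_co_developers = [names[v] for v in co_developers(1)]
```

Storing each `(source, result)` pair as a new `co-developer` edge would materialize the inferred adjacency, which is the sense in which the talk calls this reasoning.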
  • 29. Outline • Introduction to Graphs and Graph Software • Basic Gremlin Concepts • Gremlin Language Description • Advanced Gremlin Concepts • Conclusions
  • 30. Gremlin Type System
    [Figure: type hierarchy diagram containing the types object, element, graph, number, string, boolean, map, list, vertex, and edge.]
  • 31. Predefined Paths and Properties
    [Figure: part of the example graph annotated with: vertex 1 out edges, vertex 3 in edges, edge 9 out vertex, edge 9 in vertex, edge 9 label, edge 9 id, vertex 4 id, and vertex 4 properties (name = "josh", age = 32).]
    object        property   description                              example
    graph         V          the vertex iterator of the graph         $g/V
    graph         E          the edge iterator of the graph           $g/E
    vertex/edge   @id        the identifier of the element            $v/@id
    vertex        outE       the outgoing edges of the vertex         $v/outE
    vertex        inE        the incoming edges of the vertex         $v/inE
    vertex        bothE      both in and out edges of the vertex      $v/bothE
    edge          outV       the outgoing tail vertex of the edge     $e/outV
    edge          inV        the incoming head vertex of the edge     $e/inV
    edge          bothV      both in and out vertices of the edge     $e/bothV
    edge          @label     the label of the edge                    $e/@label
  • 32. Predefined Functions
    g:assign()      g:remove-idx()   g:list()         g:sort()        g:print()
    g:assign()      g:load()         g:dedup()        g:map()         g:time()
    g:unassign()    g:save()         g:union()        g:keys()        g:p()
    g:id()          g:clear()        g:intersect()    g:values()      g:to-json()
    g:key()         g:close()        g:difference()   g:rand-nat()    g:from-json()
    g:add-v()       g:keys()         g:retain()       g:rand-real()   ...
    g:add-e()       g:values()       g:except()       g:prob()
    g:remove-ve()   g:map()          g:remove()       g:cont()
    g:idx-all()     g:get()          g:get()          g:halt()
    g:add-idx()     g:op-value()     g:op-value()     g:type()
    There are over 70 predefined functions. See the following for a description of each.
    http://wiki.github.com/tinkerpop/gremlin/core-function-library
    http://wiki.github.com/tinkerpop/gremlin/gremlin-function-library
  • 33. Working With Non-Graph Types
    gremlin> 1.2 + 6
    ==>7.2
    gremlin> 'this is a string'
    ==>this is a string
    gremlin> true() or false()
    ==>true
    gremlin> g:map('marko','lanl','peter','neotech','josh','rpi')
    ==>marko=lanl
    ==>peter=neotech
    ==>josh=rpi
    gremlin> g:list('graphs','hockey','motorcycles',6)
    ==>graphs
    ==>hockey
    ==>motorcycles
    ==>6.0
  • 34. Working With Non-Graph Types
    gremlin> $m := g:map('hobbies',g:list('hockey','graphs'), 'location', g:map('state','new mexico', 'city', 'santa fe', 'zipcode', 87501), 'age', 30)
    ==>location={zipcode=87501.0, state=new mexico, city=santa fe}
    ==>age=30.0
    ==>hobbies=[hockey, graphs]
    gremlin> $m/@age
    ==>30.0
    gremlin> $m/@hobbies[2]
    ==>graphs
    gremlin> $m/@location/@city
    ==>santa fe
  • 35. Variables
    • Variables in Gremlin are prefixed with a $ character.
    • There is a collection of reserved variables that all begin with $_.
      • $_ is the root list of objects.
      • $_last is the last result evaluated by the evaluator.
      • $_g is the “working graph” to reduce typing with graph functions.
    gremlin> $x := 1
    ==>1.0
    gremlin> $y := 2
    ==>2.0
    gremlin> $x + $y
    ==>3.0
  • 36. Language Statements
    Variable Assignment
        gremlin> $i := 1 + 5
        ==>6.0
        gremlin> $i
        ==>6.0
    Repeat
        gremlin> $i := 0
        ==>0.0
        gremlin> repeat 10
                   $i := $i + 1
                 end
        ==>10.0
    If/Else
        gremlin> if true()
                   $i := 1
                 else
                   $i := 2
                 end
        ==>1.0
    While
        gremlin> $i := 'g'
        ==>g
        gremlin> while not(matches($i, 'ggg'))
                   $i := concat($i,'g')
                 end
        ==>ggg
  • 37. Language Statements
    Foreach
        gremlin> $i := 0
        ==>0.0
        gremlin> foreach $j in 1 | 2 | 3
                   $i := $i + $j
                 end
        ==>6.0
    Path
        gremlin> path friend_name
                   ./outE[@label='knows']/inV/@name
                 end
        gremlin> ./friend_name
        ==>vadas
        ==>josh
    Function
        gremlin> func ex:hello($name)
                   concat('hello ', $name)
                 end
        gremlin> ex:hello('pavel')
        ==>hello pavel
    You can define functions and paths in native Gremlin (as demonstrated above) or in Java.
  • 38. XPath Filters
    • Use [ ] filters to filter objects in a path expression (i.e. “such that” or “where”).
    • The evaluated result of [ ] must be a number or boolean. If it is a number, it is treated as the position within an array (i.e. list). If it is a boolean, it is treated as whether to include or exclude the object from the next path in the sequence.
    gremlin> ./outE[@label='knows']
    ==>e[7][1-knows->2]
    ==>e[8][1-knows->4]
    gremlin> ./outE[@label='knows' and @weight>0.5]/inV[@age<21 or @name='josh'][true()][1]
    ==>v[4]
  • 39. Outline • Introduction to Graphs and Graph Software • Basic Gremlin Concepts • Gremlin Language Description • Advanced Gremlin Concepts • Conclusion
  • 40. A Grateful Dead Dataset: 2,500 concerts, 35,000 songs played, 600 songs, 30 years, 11 members, 1 band ... the Grateful Dead.
  • 41. A Grateful Dead Dataset
    • Vertices denote songs and artists.
      • type: “song” or “artist”.
      • name: name of song or artist.
      • performances: number of times song was played in concert.
      • song_type: whether the song was a “cover” or “original”.
    • Edges denote followed_by, sung_by, written_by.
      • weight: number of times a song was followed by another song over all concerts played.
    Rodriguez, M.A., Gintautas, V., Pepe, A., “A Grateful Dead Analysis: The Relationship Between Concert and Listening Behavior,” First Monday, 14(1), University of Illinois at Chicago Library, http://arxiv.org/abs/0807.2466, January 2009.
    NOTE: A portion of the raw dataset courtesy of Mark Leone http://www.cs.cmu.edu/~mleone/gdead/setlists.html
  • 42. A Grateful Dead Dataset
    [Figure: a concert setlist (Stanley Theater, Pittsburgh, PA, 11/30/79, 2nd set: Scarlet Begonias, Fire on the Mountain, Passenger, Terrapin Station, ...) shown beside the graph built from it. Song vertices (type="song"; names "Scarlet..", "Fire on..", "Pass..", "Terrap..") are linked by followed_by edges carrying weights (239, 1, and 2 appear) and point to artist vertices (type="artist"; names "Hunter", "Garcia", "Lesh") through sung_by and written_by edges.]
  • 43. A Grateful Dead Dataset – Load Data/Basic Stats
    gremlin> g:load('data/graph-example-2.xml')
    ==>true
    gremlin> count($_g/V)
    ==>809.0
    gremlin> count($_g/E)
    ==>8049.0
  • 44. A Grateful Dead Dataset – Out-Degree of Each Vertex
    gremlin> $degrees := g:map()
    gremlin> foreach $v in $_g/V
               $degrees[@name=$v/@name] := count($v/outE)
             end
  • 45. A Grateful Dead Dataset – Out-Degree of Each Vertex
    gremlin> g:sort($degrees, 'value', true())
    ==>PLAYING IN THE BAND=96.0
    ==>SUGAR MAGNOLIA=92.0
    ==>PROMISED LAND=89.0
    ==>GOOD LOVING=87.0
    ==>NOT FADE AWAY=86.0
    ==>I KNOW YOU RIDER=85.0
    ==>CASSIDY=83.0
    ==>DEAL=82.0
    ==>JACK STRAW=81.0
    ==>ONE MORE SATURDAY NIGHT=81.0
    ==>EL PASO=80.0
    ==>MEXICALI BLUES=79.0
    ...
  • 46. A Grateful Dead Dataset – Inspecting Single Vertex
    gremlin> $v := g:key('name','CHINA DOLL')[1]
    ==>v[129]
    gremlin> g:map($v)
    ==>name=CHINA DOLL
    ==>song_type=original
    ==>performances=114
    ==>type=song
    gremlin> $v/outE[@label='sung_by']/inV/@name
    ==>Garcia
  • 47. A Grateful Dead Dataset – Inspecting Single Vertex
    gremlin> $v/outE[@label='followed_by']/inV/@name
    ==>BIG RIVER
    ==>THROWING STONES
    ==>SAMSON AND DELILAH
    ==>TRUCKING
    ==>CASEY JONES
    ==>HIGH TIME
    ...
    gremlin> $v/outE[@label='followed_by']/@weight
    ==>2
    ==>8
    ==>1
    ==>2
    ==>1
    ==>1
    ...
  • 48. Introduction to PageRank
    • The remainder of this section will discuss the PageRank algorithm and its application to multi-relational graphs.
    • The arguments made and the examples presented generalize to all other single-relational graph algorithms. However, for the sake of brevity and consistency, only PageRank will be discussed.
  • 49. Introduction to Matrix-Based PageRank
    • PageRank is a centrality measure based on the primary eigenvector of a modified version of a graph. Let $A \in \mathbb{R}_+^{|V| \times |V|}$ denote the adjacency matrix representing the graph.
    • In order to ensure positive real values in the eigenvector, the graph must be strongly connected. PageRank induces strong connectivity by overlaying a low probability (defined by $\alpha \in [0, 1]$, usually 0.15) “teleportation” graph over the original graph. Let $B \in \left\{\frac{1}{|V|}\right\}^{|V| \times |V|}$ denote a teleportation adjacency matrix where every vertex is connected to every vertex with equal probability.
    $$C = (1 - \alpha)A + \alpha B, \quad \text{where } C \in \mathbb{R}_+^{|V| \times |V|}$$
    $$\lambda = \lambda C, \quad \text{where } \lambda \in \mathbb{R}_+^{|V|} \text{ is the PageRank vector over } V.$$
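One common way to find a vector satisfying λ = λC is power iteration: start from a uniform vector and multiply by C repeatedly. The sketch below is a minimal pure-Python version. It assumes A is row-normalized before mixing (a step the slide leaves implicit), and the three-vertex matrix is made up for illustration.

```python
def pagerank(adjacency, alpha=0.15, iterations=100):
    """Approximate the vector satisfying rank = rank * C by power iteration."""
    n = len(adjacency)
    # Row-normalize A so each row is a probability distribution (an assumption
    # the slide leaves implicit); rows without out-edges fall back to uniform.
    rows = []
    for row in adjacency:
        total = sum(row)
        rows.append([x / total for x in row] if total else [1.0 / n] * n)
    # C = (1 - alpha) * A + alpha * B, where every entry of B is 1 / n.
    c = [[(1 - alpha) * rows[i][j] + alpha / n for j in range(n)]
         for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Row vector times matrix: rank_j = sum_i rank_i * C[i][j].
        rank = [sum(rank[i] * c[i][j] for i in range(n)) for j in range(n)]
    return rank


# Hypothetical 3-vertex graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [1, 0, 0]]
ranks = pagerank(adjacency)
```

Because every row of C sums to 1, the entries of `ranks` keep summing to 1 and can be read directly as a probability distribution over the vertices.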
  • 50. Introduction to Random Walk-Based PageRank
    • PageRank can be implemented by a random walk.
    • Create a vertex counter map, $m : V \rightarrow \mathbb{N}^+$.
    • Place a walker on a random vertex in $V$. Denote the walker’s current vertex $i \in V$.
      1. Increment the vertex counter by 1 (i.e. $m(i) \leftarrow m(i) + 1$).
      2. The walker chooses a random adjacent vertex with probability $1 - \alpha$.
      3. The walker chooses a random vertex in $V$ with probability $\alpha$ (teleportation, matching $C = (1 - \alpha)A + \alpha B$).
      4. Rinse and repeat until $m$ reaches a stationary probability distribution (continually normalize $m$ if you want a probability distribution).
    • We will use this random walk model in the Gremlin examples to follow.
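The random-walk procedure can be sketched in a few lines of Python. Here α is treated as the teleportation probability, matching C = (1 − α)A + αB from the matrix formulation. The three-vertex graph, the function name, and the fixed seed are assumptions made for this example.

```python
import random


def random_walk_pagerank(out_neighbors, steps, alpha=0.15, seed=0):
    """Count vertex visits of a walker that teleports with probability alpha."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    vertices = sorted(out_neighbors)
    counts = {v: 0 for v in vertices}
    current = rng.choice(vertices)
    for _ in range(steps):
        # Step 1: increment the counter of the current vertex.
        counts[current] += 1
        neighbors = out_neighbors[current]
        if neighbors and rng.random() >= alpha:
            # Step 2: follow a random outgoing edge (probability 1 - alpha).
            current = rng.choice(neighbors)
        else:
            # Step 3: teleport to a random vertex (probability alpha, or dead end).
            current = rng.choice(vertices)
    return counts


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
steps = 10000
counts = random_walk_pagerank(graph, steps)
# Normalizing the counters gives an estimate of the stationary distribution.
shares = {v: n / steps for v, n in counts.items()}
```

The estimate is noisy for short walks, so in practice the step count is raised until the normalized counters stop changing appreciably.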
  • 51. PageRank over Multi-Relational Graphs
    • PageRank was designed for single-relational graphs (i.e. where all edges have the same meaning).
    • In a multi-relational graph, what does it mean to find the centrality of a vertex when vertices can be related by various types of edges? For example, if there exist “socializes with” and “met once” edges, then a person who has “met once” many people could be the most centrally located in the graph. Also, what if your graph has more than just “person”-type vertices (e.g. cars, pets, buildings, articles, etc.) and “person”-type edges (e.g. owns, walks, livesAt, cites, etc.)?
  • 52. PageRank over Multi-Relational Graphs
    • Calculating single-relational PageRank would yield Person as the most central vertex.
    • You can boolean filter certain edge labels (e.g. ignore type edges; in such cases, you would have the centrality scores over the knows social graph).
    • However, what if you only wanted to traverse knows edges if and only if the adjacent vertex knows more than 10 other people?
    • In the end, you want complete control (universal computability) over the paths that the traverser/walker can take through a graph.
    [Figure: vertices Herbert, Johan, Marko, Josh, Jen, ... each point to a Person vertex by a type edge and are linked to one another by knows edges.]
  • 53. PageRank over Multi-Relational Graphs
    • In multi-relational graphs, the meaning of your graph algorithm’s results is defined by your definition of adjacency.
    • With respect to random walk-based PageRank, define the path that the walker should take. That path is the definition of adjacency.
    • The stationary probability distribution created from this walk yields a path-dependent centrality.
    • Thus, in a multi-relational graph, there are many types of PageRanks that can be calculated, one for each type of path defined for a walker.
    Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, Knowledge-Based Systems, 21(7), 727–739, http://arxiv.org/abs/0803.4355, October 2008.
  • 54. PageRank over “Garcia Followed By” SubGraph
    • Define a path that will go from song-to-song by “followed by” edges and only traverse songs that are “sung by” Jerry Garcia.
    (./outE[@label='followed_by']/inV/outE[@label='sung_by']/inV[@name='Garcia']/../..)[g:rand-nat()]
    [Figure: panels A to D illustrating the path: from the current song (.), along followed_by edges to candidate songs, along sung_by edges to their singers (name="Garcia" twice, name="Weir" once), back up to the songs via /../.., and a random pick among them via g:rand-nat().]
  • 55. PageRank over “Garcia Followed By” SubGraph
    path garcia-followed_by
      (./outE[@label='followed_by']/inV/outE[@label='sung_by']/inV[@name='Garcia']/../..)[g:rand-nat()]
    end
    $m := g:map()
    $alpha := 0.15
    $_ := g:key('type', 'song')[g:rand-nat()]
    repeat 2500
      $_ := ./garcia-followed_by
      if count($_) > 0
        g:op-value('+',$m,$_[1]/@name, 1.0)
      end
      if g:rand-real() < $alpha or count($_) = 0
        $_ := g:key('type', 'song')[g:rand-nat()]
      end
    end
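The same control flow can be sketched in Python to show how a path definition acts as the walker's notion of adjacency. The four-song graph, the song names, and the function names below are hypothetical and stand in for the real dataset. Only the loop structure (take the path, count the landing song, teleport on a dead end or with probability alpha) follows the Gremlin program above.

```python
import random

# Hypothetical miniature dataset: song -> songs that followed it, song -> singer.
followed_by = {
    "song_a": ["song_b", "song_c"],
    "song_b": ["song_a", "song_c"],
    "song_c": ["song_a"],
    "song_d": ["song_c"],  # dead end: its only follower is not sung by Garcia
}
sung_by = {"song_a": "Garcia", "song_b": "Garcia",
           "song_c": "Weir", "song_d": "Garcia"}


def garcia_followed_by(song, rng):
    """One path step: a random following song sung by Garcia, or None."""
    candidates = [s for s in followed_by[song] if sung_by[s] == "Garcia"]
    return rng.choice(candidates) if candidates else None


def path_pagerank(steps, alpha=0.15, seed=0):
    """Random walk in which the path above is the only allowed move."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    songs = sorted(followed_by)
    counts = {}
    current = rng.choice(songs)
    for _ in range(steps):
        current = garcia_followed_by(current, rng)
        if current is not None:
            # Count the song the walker landed on.
            counts[current] = counts.get(current, 0) + 1
        if current is None or rng.random() < alpha:
            # Teleport to a random song on a dead end or with probability alpha.
            current = rng.choice(songs)
    return counts


counts = path_pagerank(2500)
```

Swapping `garcia_followed_by` for a different step function changes the meaning of the resulting ranking without touching the walk itself, which is the point the talk makes about path-dependent centrality.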
  • 56. PageRank over “Garcia Followed By” SubGraph
    gremlin> g:sort($m,'value',true())
    ==>CRAZY FINGERS=98.0
    ==>HES GONE=85.0
    ==>CHINA CAT SUNFLOWER=79.0
    ==>BERTHA=76.0
    ==>UNCLE JOHNS BAND=74.0
    ==>TERRAPIN STATION=72.0
    ==>GOING DOWN THE ROAD FEELING BAD=71.0
    ==>WHARF RAT=71.0
    ==>EYES OF THE WORLD=65.0
    ==>COLD RAIN AND SNOW=62.0
    ==>SHIP OF FOOLS=58.0
    ==>RAMBLE ON ROSE=53.0
    ==>CASEY JONES=51.0
    ==>DARK STAR=47.0
    ==>DEAL=46.0
    ...
  • 57. Universal Computation in Paths
    path path-name
      # any arbitrary computation can occur here
    end
    • A path definition can be used to define adjacencies.
      • Adjacency can be expressed as anything that can be computed by a Turing machine.
      • Path definitions are used to create “semantically meaningful” results from single-relational graph algorithms applied to multi-relational graphs.
      • Path definitions make explicit what is implicit in the structure of the graph. This has applications to knowledge-based reasoning.
    • A path definition can perform any arbitrary computation.
      • Path definitions can check/set vertex/edge properties.
      • Path definitions can create new vertices and edges.
      • Path definitions can call/define functions.
    This allows fine grained control over how your traverser/walker moves through a graph.
  • 58. Outline • Introduction to Graphs and Graph Software • Basic Gremlin Concepts • Gremlin Language Description • Advanced Gremlin Concepts • Conclusions
  • 59. The Current Gremlin EcoSystems
    • Webling: Web console for Gremlin (developed by Pavel Yaskevich w/ funding from Neo Technology)
    • Project Gargamel: Distributed Graph Computing (uses Linked Process and Gremlin)
    • ReXster: A Graph-Based Recommender Engine
  • 60. Thank You
    Please enjoy Gremlin at http://gremlin.tinkerpop.com ... My homepage is http://markorodriguez.com. Please feel free to contact me with any questions or comments.