1. Memoirs of a Graph Addict:
Despair to Redemption
Marko A. Rodriguez
Graph Systems Architect
http://markorodriguez.com
http://twitter.com/twarko
Winter Whirlwind Tour – Chicago to Malmö – January 10-14, 2011
January 8, 2011
2. Abstract
A graph database provides a means of linking together objects using direct
references. In other words, in order to determine if one object is adjacent
to another, no index lookup is required. In contrast to relational databases,
in a graph database, there is no notion of a join operation as the graph is
already an explicitly joined structure. Given a graph, problems are solved
using graph traversals–that is, directed walks over the objects and relations
that compose the graph. This lecture has three primary points of
discussion. The first is a description of graph database technology. The
second, a memoir of the speaker’s applied and theoretical work with
graphs. The third and final point, a review of an open source graph
processing stack currently being developed by AT&T Interactive and its
collaborators.
6. For 10 years now, I’ve dealt with a painful graph addiction...
Let me share my story with you.
9. Graph Data Structure Pieces: Part 1
element (has an id)
  vertex (thing, object, dot)
  edge (relation, join, line)
10. Single-Relational Graph
marko peter
neotech
tinkerpop
neo4j
gremlin blueprints
In single-relational graphs, things are related. Unfortunately, not a very useful structure
for most domain modeling situations. Relatedness is too generic—all edges have the
same meaning.
11. Graph Data Structure Pieces: Part 2
element (has an id)
  vertex (thing, object, dot)
  edge (relation, join, line), with a label
12. Multi-Relational Graph
knows
marko knows peter
member
neotech
member member
created
tinkerpop
neo4j
created created
imports
gremlin imports blueprints
By adding labels to the edges, it's possible to denote the type of relation that exists
between any two vertices. Now it's possible to denote different types of things and the
different ways in which they relate to one another.
13. Graph Data Structure Pieces: Part 3
element (has an id and a property map)
  vertex (thing, object, dot)
  edge (relation, join, line), with a label
property (key/value, attribute): key=value
property map: key1=value1, key2=value2
14. Property Graph
knows
marko knows peter
member
neotech
member member
created
tinkerpop
date=2009 date=2009 neo4j
created created
imports lang=java
use=graphdb
gremlin imports blueprints
lang=java
lang=java
use=api
use=traverse
Allow elements to have key/value properties. In particular, very useful for further
specifying the meaning of an edge. “When did TinkerPop create Gremlin?”
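A minimal sketch of the three pieces (id, label, property map) in plain Java. The class names are hypothetical and are not the Blueprints interfaces; the date=2009 and lang=java values are read loosely from the figure residue above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration classes; these are not the Blueprints interfaces.
class PgElement {
    final String id;
    final Map<String, Object> properties = new HashMap<>(); // key=value pairs
    PgElement(String id) { this.id = id; }
}

class PgVertex extends PgElement {
    final List<PgEdge> outEdges = new ArrayList<>();
    PgVertex(String id) { super(id); }
}

class PgEdge extends PgElement {
    final String label;       // the type of the relation
    final PgVertex outVertex; // tail of the edge
    final PgVertex inVertex;  // head of the edge
    PgEdge(String id, PgVertex out, String label, PgVertex in) {
        super(id);
        this.outVertex = out;
        this.label = label;
        this.inVertex = in;
        out.outEdges.add(this);
    }
}

class PropertyGraphSketch {
    /** "When did creator create thing?" is answered by a property on the edge. */
    static Object createdDate(PgVertex creator, PgVertex thing) {
        for (PgEdge e : creator.outEdges) {
            if (e.label.equals("created") && e.inVertex == thing) {
                return e.properties.get("date");
            }
        }
        return null; // no such edge
    }

    static Object demo() {
        PgVertex tinkerpop = new PgVertex("1");
        PgVertex gremlin = new PgVertex("2");
        gremlin.properties.put("lang", "java");
        PgEdge created = new PgEdge("3", tinkerpop, "created", gremlin);
        created.properties.put("date", 2009);
        return createdDate(tinkerpop, gremlin);
    }
}
```

The point of the sketch is that the answer lives on the edge itself, not on either vertex.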
15. Numerous Graph Types
[Figure: a collage of graph types: vertex-labeled, edge-labeled, vertex-attributed, edge-attributed, weighted, directed, undirected, simple, multi, pseudo, hyper, half-edge, semantic, and resource description framework graphs.]
Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin of the American Society for Information Science
and Technology, 36(6), pp. 35-41, 2010. [http://arxiv.org/abs/1006.2361]
16. Property Graph as a Rich Structure
weighted graph
add weight attribute
property graph
remove attributes remove attributes no op
labeled graph no op semantic graph no op directed graph
remove edge labels remove edge labels
make labels URIs no op
rdf graph multi-graph remove directionality
remove loops, directionality,
and multiple edges
simple graph no op undirected graph
A fun related thought: Rodriguez, M.A., “Mapping Semantic Networks to Undirected Networks,” International Journal of
Applied Mathematics and Computer Sciences, 4(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]
17. Graph Algorithms in Single-Relational Graphs
• Most graph algorithms are designed for single-relational graphs.1
Geodesic: shortest path, eccentricity, diameter, closeness centrality,
betweenness centrality, etc.
Eigenvector: spreading activation, pagerank, eigenvector centrality,
etc.
Assortative: scalar, assortative, etc.
1
Excellent book reviewing numerous graph algorithms: Brandes U., Erlebach, T., “Network Analysis:
Methodological Foundations,” Springer, 2005.
18. Graph Algorithms in Multi-Relational+ Graphs
• Most real-world software systems require multi-relational+ graphs. E.g.:
Who are the most central coauthors when all I know is wrote?
coauthor
coauthor
wrote
wrote wrote wrote wrote wrote
• A key concept when evaluating graph algorithms over multi-relational+
graphs is implicit adjacency/path descriptions/virtual edges/etc.2
2
Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis
Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274]
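A rough sketch of the coauthor example in Java: the implicit coauthor edge is derived by walking a wrote edge forward and another wrote edge backward. The method name, the map-based graph encoding, and the sample authors and papers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ImplicitCoauthor {
    /** wrote maps author -> works. Returns author -> coauthors (a virtual edge). */
    static Map<String, Set<String>> coauthors(Map<String, List<String>> wrote) {
        // Invert the wrote relation: work -> authors.
        Map<String, List<String>> writtenBy = new HashMap<>();
        for (Map.Entry<String, List<String>> e : wrote.entrySet()) {
            for (String work : e.getValue()) {
                writtenBy.computeIfAbsent(work, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Two distinct authors of the same work are coauthors.
        Map<String, Set<String>> result = new HashMap<>();
        for (List<String> authors : writtenBy.values()) {
            for (String a : authors) {
                for (String b : authors) {
                    if (!a.equals(b)) {
                        result.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
                    }
                }
            }
        }
        return result;
    }

    static Map<String, Set<String>> demo() {
        Map<String, List<String>> wrote = new HashMap<>();
        wrote.put("marko", Arrays.asList("paper1", "paper2"));
        wrote.put("peter", Arrays.asList("paper1"));
        wrote.put("josh", Arrays.asList("paper2"));
        return coauthors(wrote);
    }
}
```

Once the derived coauthor map exists, a single-relational algorithm (for example a centrality measure) can be run over it unchanged.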
20. The Simplicity of a Graph
• A graph is a simple data structure.
• A graph states that something is related to something else (the foundation
of any other data structure).3
• It is possible to model a graph in various types of databases.4
Relational database: MySQL, Oracle, PostgreSQL
JSON document database: MongoDB, CouchDB
XML document database: MarkLogic, eXist-db
etc.
3
A graph can be used to represent other data structures. This point becomes convenient when looking
beyond using graphs for typical, real-world domain models (e.g. friends, favorites, etc.), and seeing their
applicability in other areas such as modeling code (e.g. http://arxiv.org/abs/0802.3492), indices, etc.
4
For the sake of diagram clarity, the examples to follow are with respect to a single-relational, directed
graph. Note that it is possible to model multi-relational graphs in these types of database as well.
21. Representing a Graph in a Relational Database
outV | inV
-----------
A    | B
A    | C
C    | D
D    | A
22. Representing a Graph in a JSON Database
{
  "A" : {
    "outE" : ["B", "C"]
  },
  "B" : {
    "outE" : []
  },
  "C" : {
    "outE" : ["D"]
  },
  "D" : {
    "outE" : ["A"]
  }
}
23. Representing a Graph in an XML Database
<graphml>
  <graph>
    <node id="A" />
    <node id="B" />
    <node id="C" />
    <node id="D" />
    <edge source="A" target="B" />
    <edge source="A" target="C" />
    <edge source="C" target="D" />
    <edge source="D" target="A" />
  </graph>
</graphml>
24. Defining a Graph Database
“If any database can represent a graph, then what
is a graph database?”
25. Defining a Graph Database
A graph database is any storage system that
provides index-free adjacency.
26. Defining a Graph Database by Example
Toy Graph Gremlin
(stuntman)
B E
A
C D
27. Graph Databases and Index-Free Adjacency
B E
A
C D
• Our gremlin is at vertex A.
• In a graph database, vertex A has direct references to its adjacent vertices.
• Constant time cost to move from A to B and C: the cost does not grow with the size
of the graph. It depends only on the number of edges emanating from vertex A (local).
30. Non-Graph Databases and Index-Based Adjacency
B E
A B C A
B,C E D,E
D E
C D
• Our gremlin is at vertex A.
31. Non-Graph Databases and Index-Based Adjacency
B E
A B C A
B,C E D,E
D E
C D
• In a non-graph database, the gremlin needs to look at an index to determine what
is adjacent to A.
• O(log n) time cost to move to B and C, where n is the number of indexed elements.
The cost depends on the total number of vertices and edges in the database (global).
32. Non-Graph Databases and Index-Based Adjacency
The Index (explicit): A → B,C; B → E; C → D,E (keys D and E are also indexed)
The Graph (implicit): vertices A, B, C, D, E
34. Index-Free Adjacency
• While any database can implicitly represent a graph, only a
graph database makes the graph structure explicit.5
• In a graph database, each vertex serves as a “mini index”
of its adjacent elements.6
• Thus, as the graph grows in size, the cost of a local step
remains the same.7
5
Please see http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_
Large-Scale_Graph_Traversal.html for some performance characteristics of graph traversals in a
relational database (MySQL) and a graph database (Neo4j).
6
Each vertex can be interpreted as a “parent node” in an index with its children being its adjacent
elements. In this sense, traversing a graph is analogous in many ways to traversing an index, albeit the
graph is not an acyclic connected graph (tree). (a vision espoused by Craig Taverner)
7
A graph, in many ways, is like a distributed index.
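The contrast between the two kinds of adjacency can be sketched in Java as follows. This is an illustration only, not how any particular database is implemented: the Node class and both method names are hypothetical, and a TreeMap stands in for a global sorted index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class AdjacencySketch {
    // Graph-database style: a vertex holds direct references to its neighbors.
    static class Node {
        final String name;
        final List<Node> out = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** Index-free step: the cost depends only on the degree of v. */
    static List<String> viaReferences(Node v) {
        List<String> names = new ArrayList<>();
        for (Node n : v.out) {
            names.add(n.name);
        }
        return names;
    }

    /** Index-based step: one lookup in a global sorted index, O(log n) in its size. */
    static List<String> viaIndex(TreeMap<String, List<String>> index, String name) {
        List<String> hit = index.get(name);
        return hit == null ? Collections.<String>emptyList() : hit;
    }

    static boolean demo() {
        // Toy graph fragment from the slides: A points to B and C.
        Node a = new Node("A");
        Node b = new Node("B");
        Node c = new Node("C");
        a.out.add(b);
        a.out.add(c);
        TreeMap<String, List<String>> index = new TreeMap<>();
        index.put("A", Arrays.asList("B", "C"));
        // Both routes give the same answer; only the cost model differs.
        return viaReferences(a).equals(viaIndex(index, "A"));
    }
}
```

Adding a million unrelated vertices makes the TreeMap deeper but leaves the list held by vertex A untouched.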
35. Graph Query = Graph Traversal
• Graph databases are optimized for graph-theoretic operations
(e.g. graph traversals).
• Graph databases are not optimized for set-theoretic
operations (e.g. union, intersection, theta-join).
• The graph traversal pattern:8
Given some root set of elements, traverse in X fashion
to yield some side-effect and/or destination.
8
Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques
and Applications, eds. S. Sakr, E. Pardede, IGI Global, 2011. http://arxiv.org/abs/1004.1001
39. Oil production has dropped significantly. Any reserves that are left are too
expensive to purchase. Nations cannot transport food.9
Regions with poor agriculture yield famine.
9
Peak oil available at http://en.wikipedia.org/wiki/Peak_oil.
40. People are in shock, fear, and panic over the fall of
the modern world.
The world sees a 75% drop in human population.
41. The technology and knowledge of the modern world
still exists.
The social infrastructure doesn’t... A few rise to
create a new world order.10
10
Watkins, J.H., M.A. Rodriguez, “A Survey of Web-Based Collective Decision Making Systems,” Studies
in Computational Intelligence: Evolution of the Web in Artificial Intelligence Environments, eds. R. Nayak,
N. Ichalkaranje, and L.C. Jain, pp. 245-279, 2008. [http://escholarship.org/uc/item/04h3h1cr]
42. Collective Decision Making: Rise of the Machines
Four strong, brave men begin the journey to stability. Decisions need to be made
regarding how to determine and execute social goals. The distributed collective of
TinkerPop is created.
• Marko Rodriguez (former USA)
• Peter Neubauer (former Sweden)
• Josh Shinavier (former China)
• Pavel Yaskevich (former Belarus)
43. Collective Decision Making: Rise of the Machines
[Figure: two panels over the vertices marko, peter, josh, and pavel, titled Direct Democracy and Dynamically Distributed Democracy.]
Two examples will be presented for the same decision making scenario. One using direct
democracy as the aggregation algorithm and one using dynamically distributed
democracy as the aggregation algorithm.11
11
Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision
Making Systems Perspective,” First Monday, 14(8), 2009. [http://arxiv.org/abs/0901.3929]
44. Collective Decision Making: Direct Democracy
• “What percentage of our crop yield should we store as reserves?”
• The outcome is represented as a real value in [0, 1].
• Each individual has their opinion of the situation.
Marko (80% should be stored.)
Peter (50% should be stored.)
Josh (80% should be stored.)
Pavel (90% should be stored.)
45. Collective Decision Making: Direct Democracy
• In a direct democracy, everyone voices their opinion.
• The average of all voiced opinions is the final decision (even in binary decisions).
• For our society of 4, a pure direct democracy would yield
(0.8 + 0.5 + 0.8 + 0.9)/4 = 0.75.
46. Collective Decision Making: Direct Democracy
• If an individual abstains from participation, then their opinion is not considered.
• Assume only Peter and Pavel are there to participate. Marko and Josh are out hunting.
• For our society of 4 (with 2 voters), a pure direct democracy would yield
(0.5 + 0.9)/2 = 0.7.
|0.75 − 0.7| = 0.05 error.
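The two averages above can be reproduced with a few lines of Java. The class and method names are hypothetical; the opinions are listed in the order marko, peter, josh, pavel.

```java
class DirectDemocracy {
    /** Averages the opinions of participants only; abstainers are ignored. */
    static double decide(double[] opinions, boolean[] active) {
        double sum = 0.0;
        int voters = 0;
        for (int i = 0; i < opinions.length; i++) {
            if (active[i]) {
                sum += opinions[i];
                voters++;
            }
        }
        if (voters == 0) {
            throw new IllegalArgumentException("no participants");
        }
        return sum / voters;
    }

    static double demoEveryone() {
        double[] opinions = {0.8, 0.5, 0.8, 0.9};
        return decide(opinions, new boolean[] {true, true, true, true});
    }

    static double demoPeterAndPavelOnly() {
        double[] opinions = {0.8, 0.5, 0.8, 0.9};
        return decide(opinions, new boolean[] {false, true, false, true});
    }
}
```

The error in the slide is the absolute difference between the two results (up to floating-point rounding).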
47. Collective Decision Making: Representative Democracy
• Thomas Paine stated that when populations are small “some convenient
tree will afford them a State house”, but as the population increases it
becomes a necessity for representatives to “act in the same manner as
the whole body would act were they present.”12 13
12
Paine, T., “Common Sense,” 1776.
13
The role of the representative as an expert vs. a model is argued at length in Pitkin, H.F., “The
Concept of Representation,” University of California Press, 1972.
48. Collective Decision Making: DDD
• Dynamically distributed democracy (DDD) strikes a balance between
direct and representative democracy.
• An individual is at least a representative of themselves.
• An individual can also wield the power of those that abstain from
participation.
• Dynamically distributing representative power is the purpose of the
algorithm.
49. Collective Decision Making: DDD
• Peter believes that Josh and Marko are good decision makers.
• When Peter abstains, Marko and Josh wield his social power in equal parts (0.5).
• Like a friendship graph, but the edges denote “trust.”
“I believe that X has identical values to me and will behave as I do.”
“I believe that X is more expert than I and should make decisions.”
[Figure: trust edges peter → marko (0.5) and peter → josh (0.5).]
50. Collective Decision Making: DDD
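• Marko believes Josh is the key to humanity.
• Josh prefers people closer to his eastern home of former China.
• Pavel is of the former Soviet Union, and simply has no faith in anyone.
[Figure: trust graph with edge weights peter → marko 0.5, peter → josh 0.5, marko → josh 1.0, josh → peter 0.25, josh → pavel 0.75; pavel has no outgoing edges.]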
51. Collective Decision Making: DDD
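[Figure: trust graph with edge weights peter → marko 0.5, peter → josh 0.5, marko → josh 1.0, josh → peter 0.25, josh → pavel 0.75; pavel has no outgoing edges.]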
This is the trust-based social graph. Individuals can add/remove
outgoing edges from their vertex as they please. When decisions are
required, the current snapshot of the graph is used to compute the
collective decision.
52. Collective Decision Making: DDD
• In a dynamically distributed democracy, everyone can voice their opinion.
• The weighted average of all voiced opinions is the final decision.
• For our society of 4, a pure direct democracy would yield
(0.8 + 0.5 + 0.8 + 0.9)/4 = 0.75.
• When everyone participates, it’s a direct democracy.
53. Collective Decision Making: DDD
[Figure: the trust graph with every individual starting at vote power 1.0; opinions are marko 0.8, peter 0.5, josh 0.8, pavel 0.9.]
• Assume Marko and Josh go hunting, again. By abstaining, they diffuse their vote
power over their outgoing edges.
• By participating, Peter and Pavel aggregate vote power through their incoming edges.
• This diffusion process continues until all power has aggregated at participating
individuals.
54. Collective Decision Making: DDD
[Figure: intermediate vote power of 1.25 at peter and 1.75 at pavel.]
• Note that Marko fully trusts Josh’s decision-making abilities.
• However, given that Josh is not participating, Marko is implicitly stating that he
trusts Josh’s decision in choosing decision makers.
• Thus, Josh serves to route Marko’s power.
55. Collective Decision Making: DDD
[Figure: final vote power of 1.5 at peter and 2.5 at pavel.]
• In the end, Peter and Pavel have aggregated all the energy in the graph (albeit, to
different degrees).
• Now a weighted direct democracy is used to calculate the collective decision.
• The collective vote is
((1.5·0.5)+(2.5·0.9))/4 = 0.75.
|0.75 − 0.75| = 0.0 error.
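The diffusion walked through on the last few slides can be sketched in Java. This is an illustrative reading of the slides, not the published algorithm: the class name, the matrix encoding of the trust graph, the round cap, and the choice to divide by the population size (as the slide does) are all assumptions. Individuals are ordered marko, peter, josh, pavel.

```java
import java.util.Arrays;

class DddSketch {
    /**
     * trust[i][j] is the share of i's vote power handed to j when i abstains.
     * An abstainer with no outgoing edges simply loses their power here.
     */
    static double decide(double[][] trust, double[] opinions, boolean[] active) {
        int n = opinions.length;
        double[] power = new double[n];
        Arrays.fill(power, 1.0); // everyone starts with one unit of vote power
        // Push power held by abstainers along their outgoing trust edges.
        // The round cap guards against cycles made up only of abstainers.
        for (int round = 0; round < 1000; round++) {
            double[] next = new double[n];
            double moved = 0.0;
            for (int i = 0; i < n; i++) {
                if (active[i]) {
                    next[i] += power[i]; // participants keep what they hold
                    continue;
                }
                for (int j = 0; j < n; j++) {
                    next[j] += power[i] * trust[i][j];
                }
                moved += power[i];
            }
            power = next;
            if (moved < 1e-12) {
                break; // all remaining power sits with participants
            }
        }
        double weighted = 0.0;
        for (int i = 0; i < n; i++) {
            if (active[i]) {
                weighted += power[i] * opinions[i];
            }
        }
        return weighted / n; // the slide divides by the society size (4)
    }

    static double demo() {
        double[][] trust = {
            {0.0, 0.0, 1.0, 0.0},   // marko -> josh 1.0
            {0.5, 0.0, 0.5, 0.0},   // peter -> marko 0.5, peter -> josh 0.5
            {0.0, 0.25, 0.0, 0.75}, // josh -> peter 0.25, josh -> pavel 0.75
            {0.0, 0.0, 0.0, 0.0}    // pavel trusts no one
        };
        double[] opinions = {0.8, 0.5, 0.8, 0.9};
        boolean[] active = {false, true, false, true}; // marko and josh abstain
        return decide(trust, opinions, active);
    }
}
```

With these inputs the first round leaves 1.25 at peter and 1.75 at pavel, and the second leaves 1.5 and 2.5, matching the figures on the slides.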
56. Collective Decision Making: DDD
[Figure: two plots over the percentage of active citizens (100 down to 0), each comparing direct democracy with dynamically distributed democracy: the proportion of correct decisions, and the error.]
Fig. 5. The relationship between k and e^vote_k for direct democracy (gray line) and dynamically distributed democracy (black line). The plot provides the proportion of identical, correct decisions over a simulation that was run with 1000 artificially generated networks composed of 100 citizens each.
• As participation wanes, dynamically distributed democracy is able to simulate direct democracy.14
14
Rodriguez, M.A., Steinbock, D.J., “A Social Network for Societal-Scale Decision-Making
Systems,” Proceedings of the Computational Social and Organizational Science Conference, 2004.
[http://arxiv.org/abs/cs/0412047]
57. Collective Decision Making: Techno-Government
• In this model of decision making, there is no governmental body.
• Power is determined when a decision is needed.
• How are bills created? Wikilegislature?15
• What about different types of trust (e.g. “Marko trusts Josh in
engineering decisions only.”) — Hint: Multi-relational+ graphs. Tagging
legislature and tagging trust.16
15
Turoff, M., Roxanne-Hiltz, S., Bieber, M., Rana, A., “Collaborative Discourse Structures in Computer
Mediated Group Communications”, Hawaii International Conference on Systems Science (HICSS), 1998.
[http://web.njit.edu/~turoff/Papers/CDSCMC/CDSCMC.htm]
16
Rodriguez, M.A., “Social Decision Making with Multi-Relational Networks and Grammar-Based
Particle Swarms,” Hawaii International Conference on Systems Science (HICSS), pp. 39–49, 2007.
[http://arxiv.org/abs/cs/0609034]
58. “The founders of modern democracies provided a moral heritage that
remains highly regarded in societies today. However, it should be
remembered that it is the ideals that are valuable, not the specific
implementation of the systems that protect and support them. If
there is another implementation of government that better realizes
these ideals, then, by the rights of man, it must be enacted.”17
– Michael Scott
17
Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision
Making Systems Perspective,” First Monday, 14(8), University of Illinois at Chicago Library, 2009.
[http://arxiv.org/abs/0901.3929]
61. Humans no longer struggle to survive. They
struggle for eudaemonia. They seek the “good
daemon” within...
62. Eudaemonic Engine: Aristotle
• Being virtuous is repeatedly choosing correctly.
• Habitual correct behavior leads to eudaemonia – complete engagement in the world
(a complete sense of engagement/acceptance).18 19
• Can systems aid individuals in choosing correctly – in all aspects of life?
Aristotle David L. Norton
18
Aristotle, “Nicomachean Ethics”, 350 B.C.
19
Mihaly Csikszentmihalyi, “Flow: The Psychology of Optimal Experience”, Harper Perennial, 1990.
63. Eudaemonic Engine: Resource Modeling
But if the development of character is the moral objective, it is obvious that
[...] the choices of vocation and avocations to pursue, of friends to cultivate, of
books to read are moral for they clearly influence such development.20
• Web services are continuing to build richer models of humans, resources,
and the relationships between them.
• There exists an increasing reliance on such services to aid in decision
making: correct books (Amazon.com), correct movies (NetFlix.com),
correct music (Pandora), correct occupation (Monster.com), correct
friends (PointsCommuns.com), correct life partner (Match.com), etc.21
20
David L. Norton, “Democracy and Moral Development: A Politics of Virtue”, University of California Press, 1991.
21
Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the
International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, 5712, pp. 813–820, 2009.
[http://arxiv.org/abs/0904.0027]
64. Eudaemonic Engine: Mapping Person to Resource
movie
watch
article
read
time
person listen music
meet
friend
eat
food
Map an individual to actions on resources. However, how do we
model/expose the resources of the world?
67. Eudaemonic Engine: URIs of the Web of Data
http://dbpedia.org/resource/The Fountainhead
FLICKR
http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Ayn_Rand
foaf:depiction
flickr:Ayn_Rand
dbpprop:hasPhotoCollection
dbpedia:Ayn_Rand
DBPEDIA
dbpedia:Book
dbpedia:author
dbpedia:Fountain_Head rdf:type
68. Eudaemonic Engine: Datasets on the Web of Data
data set domain data set domain data set domain
audioscrobbler music govtrack government pubguide books
bbclatertotp music homologene biology qdos social
bbcplaycountdata music ibm computer rae2001 computer
bbcprogrammes media ieee computer rdfbookmashup books
budapestbme computer interpro biology rdfohloh social
chebi biology jamendo music resex computer
crunchbase business laascnrs computer riese government
dailymed medical libris books semanticweborg computer
dblpberlin computer lingvoj reference semwebcentral social
dblphannover computer linkedct medical siocsites social
dblprkbexplorer computer linkedmdb movie surgeradio music
dbpedia general magnatune music swconferencecorpus computer
doapspace social musicbrainz music taxonomy reference
drugbank medical myspacewrapper social umbel general
eurecom computer opencalais reference uniref biology
eurostat government opencyc general unists biology
flickrexporter images openguides reference uscensusdata government
flickrwrappr images pdb biology virtuososponger reference
foafprofiles social pfam biology w3cwordnet reference
freebase general pisa computer wikicompany business
geneid biology prodom biology worldfactbook government
geneontology biology projectgutenberg books yago general
geonames geographic prosite biology ...
69. Eudaemonic Engine: Transforms Development
A new application development paradigm emerges. No longer do data and application
providers need to be the same entity (left). With the Web of Data, it's possible for
developers to write applications that utilize data that they do not maintain (right).22
Application 1 Application 2 Application 3 Application 1 Application 2 Application 3
processes processes processes
processes processes processes
Web of Data
structures structures structures
structures structures structures
127.0.0.1 127.0.0.2 127.0.0.3 127.0.0.1 127.0.0.2 127.0.0.3
22
Rodriguez, M.A., “A Reflection on the Structure and Process of the Web of Data,” Bulletin of the American Society for
Information Science and Technology, 35(6), pp. 38–43, 2009. [http://arxiv.org/abs/0908.0373]
70. Now that there is a rich structure, what is the
process?
72. Eudaemonic Engine: Diffusion Processes on Graphs
A graph diffusion process will be used to determine the solution to one’s
problems.
• Graph traversing can be seen as a diffusion process over a graph.
• “Energy” moves over a graph and reverberates in regions where there
is recurrence (i.e. cycles).
• At some t in the future, the vertices with the greatest flow are the
solution to the problem.
78. Implementing a diffusion process is easy when the edges of the
graph are unlabeled.
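// Groovy sketch; getAdjacentVertices() is assumed to return a vertex's neighbors
def flow = [:].withDefault { 0 }   // vertex -> number of times it was reached
def current = [startVertex]
int steps = 10
for (int i = 0; i < steps; i++) {
  // move every walker to all of its adjacent vertices, flattening the result
  current = current.collectMany { it.getAdjacentVertices() }
  current.each { flow[it] += 1 }
}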
79. Eudaemonic Engine: Diffusion on a Property Graph?
likes emil
likes
linked
24
process
knows True Blood
likes wrote wrote
likes
likes
jen knows marko knows peter
occupation
occupation likes likes wrote occupation
intelligence The Wire gremlin tagged graphs
With different types of things being related by different types of relations,
you need to specify legal paths for the energy to flow over.
80. Eudaemonic Engine: Diffusion on a Property Graph
• Problem statement = Start vertices + path expression.
• Problem solution = Highest energy vertices at t.23 24 25
23
Examples presented next are basic due to the simplicity of the toy graph example used. In such cases,
queries as opposed to energy diffusions are best. In general, the purpose of an energy diffusion is to
expose recurrence/feedback in the graph. For the more technically inclined, think of it as determining the
eigenvector of the graph defined by the path expression.
24
Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems,
21(7), pp. 727–739, 2008. [http://arxiv.org/abs/0803.4355]
25
Rodriguez, M.A., Neubauer, P., “A Path Algebra for Multi-Relational Graphs,” 2nd International
Workshop on Graph Data Management (GDM11), 2010. [http://arxiv.org/abs/1011.0390]
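One simple way to read “start vertices + path expression” is sketched below in Java. It assumes the path expression is just an ordered list of edge labels and that energy splits evenly over the legal edges at each step. That is a simplification for illustration, not the grammar-based walker or path algebra of the cited papers, and the edges in demo() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathDiffusion {
    /** Each edge is {tail, label, head}. Returns the energy per vertex after the path. */
    static Map<String, Double> diffuse(List<String[]> edges, String start, List<String> path) {
        Map<String, Double> energy = new HashMap<>();
        energy.put(start, 1.0);
        for (String label : path) {
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Double> at : energy.entrySet()) {
                // Only edges with the current label are legal for this step.
                List<String> heads = new ArrayList<>();
                for (String[] e : edges) {
                    if (e[0].equals(at.getKey()) && e[1].equals(label)) {
                        heads.add(e[2]);
                    }
                }
                // Energy with no legal edge to follow is dropped.
                for (String head : heads) {
                    next.merge(head, at.getValue() / heads.size(), Double::sum);
                }
            }
            energy = next;
        }
        return energy;
    }

    static Map<String, Double> demo() {
        List<String[]> edges = Arrays.asList(
            new String[] {"marko", "knows", "peter"},
            new String[] {"marko", "knows", "jen"},
            new String[] {"marko", "likes", "gremlin"},
            new String[] {"peter", "likes", "The Wire"},
            new String[] {"jen", "likes", "The Wire"},
            new String[] {"jen", "likes", "True Blood"});
        // "What do the people marko knows like?"
        return diffuse(edges, "marko", Arrays.asList("knows", "likes"));
    }
}
```

In the demo, the likes edge leaving marko is ignored on the first step because the path expression only permits knows there.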
81. Eudaemonic Engine: Friend Recommendation
“Who are my friends’ friends that are not me or my friends?”26
26
marko.outE[[label:'knows']].inV.aggregate(x).outE.inV{!x.contains(it)}
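For readers who do not know Gremlin, the same question can be sketched in plain Java over an adjacency map. The method name and the knows edges in demo() are hypothetical. Unlike the traversal above, this sketch follows only knows edges on the second hop and filters out the start vertex explicitly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FriendRecommendation {
    /** knows maps a person to the people they know. */
    static Set<String> recommend(Map<String, List<String>> knows, String me) {
        Set<String> friends =
            new TreeSet<>(knows.getOrDefault(me, Collections.<String>emptyList()));
        Set<String> result = new TreeSet<>();
        for (String friend : friends) {
            for (String fof : knows.getOrDefault(friend, Collections.<String>emptyList())) {
                // Keep friends of friends that are neither me nor already my friends.
                if (!fof.equals(me) && !friends.contains(fof)) {
                    result.add(fof);
                }
            }
        }
        return result;
    }

    static Set<String> demo() {
        Map<String, List<String>> knows = new HashMap<>();
        knows.put("marko", Arrays.asList("peter", "jen"));
        knows.put("peter", Arrays.asList("emil", "marko"));
        knows.put("jen", Arrays.asList("peter"));
        return recommend(knows, "marko");
    }
}
```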
82. Eudaemonic Engine: Product Recommendation
“Who likes what I like? Of those things they like, what else do they like
that I don’t already like?”27
27
marko.outE[[label:'likes']].inV.aggregate(x).inE[[label:'likes']].outV.outE[[label:'likes']].inV{!x.contains(it)}
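A plain Java sketch of the same likes, liked-by, likes walk. The scoring is an assumption made for illustration: each candidate item is scored by the number of paths that reach it, so people who share more likes with the start vertex contribute more. The names and the likes edges in demo() are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ProductRecommendation {
    /** likes maps a person to the set of things they like. Returns item -> path count. */
    static Map<String, Integer> recommend(Map<String, Set<String>> likes, String me) {
        Set<String> mine = likes.getOrDefault(me, Collections.<String>emptySet());
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> other : likes.entrySet()) {
            if (other.getKey().equals(me)) {
                continue;
            }
            // Number of items this person shares with me (paths me -> item -> person).
            int shared = 0;
            for (String item : other.getValue()) {
                if (mine.contains(item)) {
                    shared++;
                }
            }
            if (shared == 0) {
                continue; // not reachable through a shared like
            }
            // Everything else they like is a candidate, once per shared item.
            for (String item : other.getValue()) {
                if (!mine.contains(item)) {
                    scores.merge(item, shared, Integer::sum);
                }
            }
        }
        return scores;
    }

    static Map<String, Integer> demo() {
        Map<String, Set<String>> likes = new HashMap<>();
        likes.put("marko", new HashSet<>(Arrays.asList("The Wire", "gremlin")));
        likes.put("jen", new HashSet<>(Arrays.asList("The Wire", "True Blood")));
        likes.put("peter", new HashSet<>(Arrays.asList("The Wire", "gremlin", "graphs")));
        likes.put("emil", new HashSet<>(Arrays.asList("True Blood")));
        return recommend(likes, "marko");
    }
}
```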
83. Eudaemonic Engine: Product Recommendation 2
“Who likes what I like and what do they like? What do the people I know
like? Of those things liked, what do I not already like?”
84. Eudaemonic Engine: Recommendation
• Different paths through a domain model expose different types of
recommendations.
• Individual path preferences allow for an ecosystem of traversals (different
problems can be solved over the same domain model).28 29 30
28
Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the
Scholarly Communication Process,” 2009. [http://arxiv.org/abs/0905.1594]
29
Rodriguez, M.A., “Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, and
Recommendation,” Technical Talk Seminar, AT&T Interactive, 2010.
[http://slidesha.re/bOCy4Q]
30
Traversal Patterns with Gremlin available at https://github.com/tinkerpop/gremlin/wiki/
Traversal-Patterns.
86. Life is good. Humans flourish. Virtuous men’s minds are filled
with wonderfully creative ideas. Inventions proliferate.
87. Advances in computer network technology yield a
new model of computing.
Computer networks are no longer the bottleneck for
speed. Accessing local and remote data is no longer
considered “different.” The distinction between
RAM, disk drive, and Web disappears.
88. Universal Computer: A Computational Substrate
On the Web...
• Represent data.
• Represent code.
• Represent virtual machines.
89. Universal Computer: Represent Data
• URIs form an infinite universal address space.
• A URI can denote a datum.
http://markorodriguez.com#self (Marko)
http://sws.geonames.org/4887398/about.rdf (Chicago)
http://data.nytimes.com/N38395718310308503251 (Malmö)
• RDF (Resource Description Framework) is a data model for linking URIs
into a multi-relational graph.
90. Universal Computer: Represent Data
127.0.0.2
127.0.0.1
atti:marko atti:bestFriend nm:puppy
atti:hasFur atti:hasFur
atti:numberOfLegs atti:numberOfLegs
"2"^^xsd:integer "false"^^xsd:boolean "4"^^xsd:integer "true"^^xsd:boolean
• The concept of atti:marko and the properties atti:numberOfLegs, atti:hasFur,
and atti:bestFriend are maintained by the AT&Ti graph server.
• The concept of nm:puppy is maintained by a New Mexico graph server.
• The data types xsd:integer and xsd:boolean are maintained by an XML standards
organization.
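The figure can be read as five RDF statements. A minimal Java sketch follows, holding them as subject, predicate, object triples in one list. Which literal belongs to which subject is inferred from the figure residue, and the class and method names are hypothetical; a real deployment would spread the statements over the two servers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TripleSketch {
    // Each entry is {subject, predicate, object}.
    static final List<String[]> TRIPLES = Arrays.asList(
        new String[] {"atti:marko", "atti:bestFriend", "nm:puppy"},
        new String[] {"atti:marko", "atti:numberOfLegs", "\"2\"^^xsd:integer"},
        new String[] {"atti:marko", "atti:hasFur", "\"false\"^^xsd:boolean"},
        new String[] {"nm:puppy", "atti:numberOfLegs", "\"4\"^^xsd:integer"},
        new String[] {"nm:puppy", "atti:hasFur", "\"true\"^^xsd:boolean"});

    /** All objects o such that (subject, predicate, o) is a statement. */
    static List<String> objects(String subject, String predicate) {
        List<String> out = new ArrayList<>();
        for (String[] t : TRIPLES) {
            if (t[0].equals(subject) && t[1].equals(predicate)) {
                out.add(t[2]);
            }
        }
        return out;
    }

    /** Two-step traversal that crosses from one naming authority to the other. */
    static List<String> legsOfBestFriend(String person) {
        List<String> out = new ArrayList<>();
        for (String friend : objects(person, "atti:bestFriend")) {
            out.addAll(objects(friend, "atti:numberOfLegs"));
        }
        return out;
    }
}
```

The traversal does not care which server minted a URI; the prefix is the only trace of it.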
91. Universal Computer: Represent Code
• Computing is a series of instructions — add, write, branch, goto...
• The URI address space and RDF glue can be seen as a computational
medium.31
_:123 rdf:type atti:Add
atti:left-op atti:right-op
rdfs:subClassOf
"3"^^xsd:int "7"^^xsd:int atti:Instruction
31
Rodriguez, M.A., “General-Purpose Computing on a Semantic Network Substrate,” Emergent Web
Intelligence: Advanced Semantic Technologies, eds. R. Chbeir, A. Hassanien, A. Abraham, and Y. Badr, pp.
57–104, 2010. [http://arxiv.org/abs/0704.3395]
92. Universal Computer: Represent Code
atti:marko atti:bestFriend nm:puppy
atti:hasMethod atti:isHappy
Method
atti:pet "false"^^xsd:boolean
atti:args
atti:block
_:1234 _:2345
atti:inst
rdf:1
_:3456
"animal"^^xsd:string
// make animal happy
Represent methods and their instructions attached to objects/classes.
93. Universal Computer: Represent Virtual Machines
Virtual Machine
atti:VM atti:marko atti:bestFriend nm:puppy
atti:hasMethod atti:isHappy
rdf:type
_:6789 atti:pc _:3456 atti:pet "false"^^xsd:boolean
atti:block
atti:inst
_:2345
write "true"^^xsd:boolean
Represent not only code, but the machines that execute it.
95. Global Data Structure
Data
Machine Architecture
API
Program Virtual Machine State
read/write
read/write
Virtual Machine Processes
...
127.0.0.1 Physical Machines 127.0.0.4
127.0.0.2 127.0.0.3
Physics
My Belief in Reality
96. Universal Computer: A Ramification
• Data, APIs, code, machine architectures, and virtual machines are within
the same global URI address space.
Code can be physically distributed across computers. For example,
an add instruction on 127.0.0.1 references a branch instruction on
127.0.0.2.
Hardware machines can be added or removed without altering the
state of computation — only the speed.
No developer concept of RAM-based memory addresses — the only
address space is the space of all URIs.
97. Universal Computer: Another Ramification
• Reflection down to the machine level.32
Most languages support the manipulation of code at runtime. In this
model, the virtual machine can be altered at runtime.
Code can rewrite the virtual machine that is evaluating the
code. (i.e. create lots of bugs.)
32
Rodriguez, M.A., The RDF Virtual Machine, LA-UR-08-03925, in review, 2009. [http://arxiv.org/
abs/0802.3492]
99. Man learns to encode themselves into the URI
address space...33 34
33
Egan, G., “Permutation City,” Eos Publisher, 1995.
34
Rodriguez, M.A., “From the Signal to the Symbol: Structure and Process in Artificial Intelligence,”
Center for Nonlinear Studies Post Doctorate Seminar, Los Alamos National Laboratory, Los Alamos, New
Mexico, 2008. [http://slidesha.re/hdqRn2]
102. TinkerPop Productions
• Blueprints: Data Models and their Implementations
[http://blueprints.tinkerpop.com]
• Pipes: A Data Flow Framework using Process Graphs
[http://pipes.tinkerpop.com]
• Gremlin: A Graph-Based Programming Language
[http://gremlin.tinkerpop.com]
• Rexster: A RESTful Graph Shell
[http://rexster.tinkerpop.com]35
35
Please see http://engineering.attinteractive.com/2010/12/a-graph-processing-stack/ for
a short review of these products.
Also TinkerPop’s homepage at: http://tinkerpop.com
103. Blueprints: A Property Graph Model Interface
Blueprints
• Blueprints is like the JDBC of the graph database community.
• Provides a Java-based interface API for the property graph data model.
Graph, Vertex, Edge, Index.
• Connectors to TinkerGraph, Neo4j, OrientDB, Sails (e.g. AllegroGraph,
HyperSail, etc.), and soon InfiniteGraph. Into the future, hope to support
InfoGrid, Sones, DEX, and HyperGraphDB.36
36 HyperGraphDB makes use of an n-ary graph structure known as a hypergraph. Blueprints, in its current
form, only supports the more common binary graph.
104. Creating a Neo4jGraph in Blueprints
// create a graph
Graph graph = new Neo4jGraph("/tmp/neo4j");
// add two vertices
Vertex a = graph.addVertex(null);
a.setProperty("name","marko");
Vertex b = graph.addVertex(null);
b.setProperty("name","peter");
// join the two vertices by a knows relation
Edge e = graph.addEdge(null,a,b,"knows");
e.setProperty("since","2007");
[Figure: vertex 0 (name=marko) is joined to vertex 1 (name=peter) by a knows edge (since=2007).]
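To make the shape of the property graph model concrete, here is a minimal in-memory sketch in plain Java. The class names (SketchGraph, SketchVertex, SketchEdge, SketchElement) are hypothetical stand-ins and are not the Blueprints interfaces; the sketch only assumes that vertices and edges carry key/value properties and that each vertex holds direct references to its incident edges.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the Blueprints classes.
class SketchElement {
    private final Map<String, Object> properties = new LinkedHashMap<>();
    void setProperty(String key, Object value) { properties.put(key, value); }
    Object getProperty(String key) { return properties.get(key); }
}

class SketchVertex extends SketchElement {
    // Direct references to incident edges: adjacency needs no index lookup.
    final List<SketchEdge> outEdges = new ArrayList<>();
    final List<SketchEdge> inEdges = new ArrayList<>();
}

class SketchEdge extends SketchElement {
    final SketchVertex outVertex; // tail of the edge
    final SketchVertex inVertex;  // head of the edge
    final String label;

    SketchEdge(SketchVertex outVertex, SketchVertex inVertex, String label) {
        this.outVertex = outVertex;
        this.inVertex = inVertex;
        this.label = label;
    }
}

class SketchGraph {
    private final List<SketchVertex> vertices = new ArrayList<>();

    SketchVertex addVertex() {
        SketchVertex v = new SketchVertex();
        vertices.add(v);
        return v;
    }

    // Returns the vertex at the given insertion position (0 for the first vertex added).
    SketchVertex getVertex(int position) { return vertices.get(position); }

    SketchEdge addEdge(SketchVertex out, SketchVertex in, String label) {
        SketchEdge e = new SketchEdge(out, in, label);
        out.outEdges.add(e); // both endpoints keep a pointer to the edge
        in.inEdges.add(e);
        return e;
    }

    // Rebuilds the two-vertex marko/peter example in this toy model.
    static SketchGraph markoKnowsPeter() {
        SketchGraph graph = new SketchGraph();
        SketchVertex a = graph.addVertex();
        a.setProperty("name", "marko");
        SketchVertex b = graph.addVertex();
        b.setProperty("name", "peter");
        SketchEdge e = graph.addEdge(a, b, "knows");
        e.setProperty("since", "2007");
        return graph;
    }
}
```

Walking from marko to peter is then a matter of following references: getVertex(0).outEdges.get(0).inVertex. A real graph database persists these structures; the toy model only illustrates the idea.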
105. Handy Features of Blueprints
• Supports automatic transactions
graph.setTransactionMode(AUTOMATIC -or- MANUAL)
In automatic mode, every manipulation of the graph is wrapped in a
transaction and committed.
• Supports automatic indices
graph.createIndex(AUTOMATIC -or- MANUAL)
In automatic mode, elements are added or removed from an index as
their properties are manipulated.
• Utility Suite
Blueprints Sail makes a graphdb into a traversal-based RDF store.
GraphML Reader/Writer library.
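As a rough illustration of what an automatic index does, the sketch below keeps a key/value index in step with property changes. AutoIndexedStore and its methods are hypothetical names chosen for this example, not the Blueprints index API; elements are reduced to integer ids to keep the sketch short.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an automatic index; not the Blueprints API.
class AutoIndexedStore {
    // element id -> (property key -> value)
    private final Map<Integer, Map<String, Object>> properties = new HashMap<>();
    // property key -> (value -> ids of the elements currently holding that value)
    private final Map<String, Map<Object, Set<Integer>>> index = new HashMap<>();

    void setProperty(int id, String key, Object value) {
        Object old = properties.computeIfAbsent(id, k -> new HashMap<>()).put(key, value);
        if (old != null) {
            unindex(id, key, old); // the stale entry leaves the index automatically
        }
        index.computeIfAbsent(key, k -> new HashMap<>())
             .computeIfAbsent(value, v -> new LinkedHashSet<>())
             .add(id);
    }

    void removeProperty(int id, String key) {
        Map<String, Object> props = properties.get(id);
        if (props == null) {
            return;
        }
        Object old = props.remove(key);
        if (old != null) {
            unindex(id, key, old);
        }
    }

    // Index lookup: which elements currently have key=value?
    Set<Integer> lookup(String key, Object value) {
        Map<Object, Set<Integer>> byValue = index.get(key);
        if (byValue == null) {
            return Collections.emptySet();
        }
        return byValue.getOrDefault(value, Collections.emptySet());
    }

    private void unindex(int id, String key, Object oldValue) {
        Map<Object, Set<Integer>> byValue = index.get(key);
        if (byValue != null && byValue.containsKey(oldValue)) {
            byValue.get(oldValue).remove(id);
        }
    }
}
```

In a manual mode, by contrast, the caller would be responsible for the index updates that setProperty and removeProperty perform here.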
106. Pipes: A Data Flow Framework using Process Graphs
Pipes
• Lazy data flow with support for Blueprints-based graph processing.
• Provides a collection of “pipes” (implement Iterable and Iterator)
that are connected together to form processing pipelines.
Filters: ComparisonFilterPipe, RandomFilterPipe, etc.
Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc.
Splitting/Merging: CopySplitPipe, RobinMergePipe, etc.
Logic: OrFilterPipe, AndFilterPipe, etc.
107. Pipes: Chained Iterators
This pipeline takes objects of type A and turns them into objects of type D
through a sequence of processing pipes...37
[Figure: objects of type A enter Pipe1, which emits B; Pipe2 turns B into C; Pipe3 turns C into D.
The three pipes together form the Pipeline.]
// pipe1 is a Pipe<A,B>, pipe2 a Pipe<B,C>, and pipe3 a Pipe<C,D>
Pipe<A,D> pipeline =
new Pipeline<A,D>(pipe1, pipe2, pipe3);
37 Though not discussed, splitting and merging is allowed as well (branching pipelines).
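The chained-iterator idea can be sketched with nothing but java.util.Iterator. MapPipe and ChainDemo below are hypothetical names for this illustration and are not classes from the Pipes library; the concrete types (String for A, Integer for B, Double for C, String for D) are likewise arbitrary choices. Each pipe pulls one object from the pipe before it only when asked, which is what makes the data flow lazy.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical one-to-one pipe: wraps an upstream iterator and transforms on demand.
class MapPipe<S, E> implements Iterator<E> {
    private final Iterator<S> starts;
    private final Function<S, E> step;

    MapPipe(Iterator<S> starts, Function<S, E> step) {
        this.starts = starts;
        this.step = step;
    }

    public boolean hasNext() { return starts.hasNext(); }

    // Nothing upstream is touched until a consumer asks for the next object.
    public E next() { return step.apply(starts.next()); }
}

class ChainDemo {
    // Counts how many objects the first pipe has processed so far.
    static int evaluations = 0;

    // A = String, B = Integer, C = Double, D = String
    static Iterator<String> build(List<String> starts) {
        Iterator<Integer> pipe1 = new MapPipe<String, Integer>(starts.iterator(), s -> {
            evaluations++;
            return s.length();
        });
        Iterator<Double> pipe2 = new MapPipe<Integer, Double>(pipe1, n -> n / 2.0);
        Iterator<String> pipe3 = new MapPipe<Double, String>(pipe2, x -> "D(" + x + ")");
        return pipe3;
    }
}
```

Building the chain does no work; ChainDemo.evaluations stays at 0 until next() is called on the last pipe, and it then grows by one per object pulled through.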
108. Pipes: A Simple Example
“What are the names of the people that marko knows?”
[Figure: vertex A (name=marko) has knows edges to B (name=peter) and C (name=pavel);
D (name=gremlin) is attached by two created edges.]
109. Pipes: A Simple Example
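// step 1: from a vertex to its outgoing edges
Pipe<Vertex,Edge> pipe1 = new VertexEdgePipe(Step.OUT_EDGES);
// step 2: drop edges whose label is NOT_EQUAL to "knows", so only knows edges pass
Pipe<Edge,Edge> pipe2 = new LabelFilterPipe("knows",Filter.NOT_EQUAL);
// step 3: from each edge to its incoming (head) vertex
Pipe<Edge,Vertex> pipe3 = new EdgeVertexPipe(Step.IN_VERTEX);
// step 4: from each vertex to its name property
Pipe<Vertex,String> pipe4 = new PropertyPipe<String>("name");
Pipe<Vertex,String> pipeline = new Pipeline<Vertex,String>(pipe1,pipe2,pipe3,pipe4);
pipeline.setStarts(new SingleIterator<Vertex>(graph.getVertex("A")));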
110. Pipes: A Simple Example
for(String name : pipeline) {
System.out.println(name);
}
peter
pavel
111. Pipes: A Simple Example
[Figure: the example graph annotated with the pipeline stages. VertexEdgePipe(OUT_EDGES) takes A
to its outgoing edges, LabelFilterPipe("knows") keeps the knows edges, EdgeVertexPipe(IN_VERTEX)
moves to B and C, and PropertyPipe("name") emits peter and pavel.]
116. Pipes: Easy to Create New Pipes
public class NumCharsPipe extends AbstractPipe<String,Integer> {
  public Integer processNextStart() {
    String word = this.starts.next();
    return word.length();
  }
}
When extending the base class AbstractPipe<S,E> all that is required is
an implementation of processNextStart().
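To show why a single method can be enough, here is a self-contained sketch of how such a base class might work. SketchPipe is a hypothetical stand-in, not the actual AbstractPipe source; it assumes that processNextStart() either returns the next result or lets a NoSuchElementException escape when the upstream iterator is exhausted. MinLengthFilterSketchPipe is likewise invented for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class; the iterator plumbing lives here, once.
abstract class SketchPipe<S, E> implements Iterator<E> {
    protected Iterator<S> starts;
    private E nextEnd;
    private boolean available = false;

    void setStarts(Iterator<S> starts) { this.starts = starts; }

    // Return the next result, or let NoSuchElementException escape when starts runs dry.
    protected abstract E processNextStart();

    public boolean hasNext() {
        if (available) {
            return true;
        }
        try {
            nextEnd = processNextStart(); // compute one result ahead and cache it
            available = true;
        } catch (NoSuchElementException e) {
            available = false;
        }
        return available;
    }

    public E next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        available = false;
        return nextEnd;
    }
}

// A one-to-one pipe: every word becomes its length.
class NumCharsSketchPipe extends SketchPipe<String, Integer> {
    protected Integer processNextStart() {
        return starts.next().length();
    }
}

// A filter pipe: keeps pulling from starts until a word is long enough.
class MinLengthFilterSketchPipe extends SketchPipe<String, String> {
    private final int minLength;

    MinLengthFilterSketchPipe(int minLength) { this.minLength = minLength; }

    protected String processNextStart() {
        while (true) {
            String word = starts.next();
            if (word.length() >= minLength) {
                return word;
            }
        }
    }
}

class NewPipeDemo {
    // Chains the filter into the length pipe and drains the result.
    static List<Integer> run(List<String> words, int minLength) {
        MinLengthFilterSketchPipe filter = new MinLengthFilterSketchPipe(minLength);
        filter.setStarts(words.iterator());
        NumCharsSketchPipe lengths = new NumCharsSketchPipe();
        lengths.setStarts(filter);
        List<Integer> results = new ArrayList<>();
        while (lengths.hasNext()) {
            results.add(lengths.next());
        }
        return results;
    }
}
```

Because every pipe is itself an iterator, one pipe can serve directly as the starts of the next.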
117. Pipes: Easy to Create New Pipes
Most of my projects are composed of lots of application specific Pipes.
That is, Pipes that are specific to my domain model and yield useful
jumps in the graph. For example, SameLikesPipe<Vertex,Vertex>.
From these domain specific Pipes, complex algorithms are created
through the piecing together of those Pipes. For example,
RecommenderPipe<Vertex,Map>.
[Figure: three layers labeled com.tinkerpop.pipes, domain specific, and complex traversal algorithms.]
118. Gremlin: A Graph-Based Programming Language
Gremlin G = (V, E)
• A graph traversal language that uses Groovy as its host language.
• Compiles Gremlin syntax down to Pipes (implements JSR 223).38
38 At the time of this presentation, Gremlin’s most recent stable release is 0.6, which is a standalone
language. To increase the flexibility of the language, 0.7-SNAPSHOT+ boasts the use of Groovy as the host
language.
119. Gremlin: Easily Compose Graph Related Pipes
Pipes is verbose...
Pipe<Vertex,Edge> pipe1 = new VertexEdgePipe(Step.OUT_EDGES);
Pipe<Edge,Edge> pipe2 = new LabelFilterPipe("knows",Filter.NOT_EQUAL);
Pipe<Edge,Vertex> pipe3 = new EdgeVertexPipe(Step.IN_VERTEX);
Pipe<Vertex,String> pipe4 = new PropertyPipe<String>("name");
Pipe<Vertex,String> pipeline = new Pipeline<Vertex,String>(pipe1,pipe2,pipe3,pipe4);
pipeline.setStarts(new SingleIterator<Vertex>(graph.getVertex("A")));
...relative to Gremlin.
g.v('A').outE[[label:'knows']].inV.name
120. Gremlin: The Simple Example
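[Figure: the example graph annotated with the Gremlin steps. g.v('A') selects vertex A (name=marko),
outE takes its outgoing edges, [[label:'knows']] keeps the knows edges, inV moves to B (name=peter)
and C (name=pavel), and name emits their names.]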
121. Gremlin: Defining a Step
“Who likes the same things that I like?”
Vertex.metaClass.same_likes =
{ _().outE[[label:'likes']].inV.inE[[label:'likes']].outV }
[Figure: vertices A through G connected by likes edges.]
122. Gremlin: Defining a Step
gremlin> g.v('A').same_likes
==>v[E]
==>v[F]
==>v[F]
==>v[G]
123. Gremlin: Defining a Step
gremlin> m = g.v('A').same_likes.group_count >> 1
gremlin> m
==>v[E]=1
==>v[F]=2
==>v[G]=1
v[F] is most similar, in terms of likes, to v[A].39
39 For a thorough review of such traversal patterns, please see: Rodriguez, M.A.,
“Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, and Recommendation,” July 2010.
[http://slidesha.re/bOCy4Q]
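The counting step is simple enough to sketch in plain Java. GroupCountSketch is a hypothetical helper written for this illustration, not Gremlin's implementation; vertices are reduced to their string ids, and the input is the E, F, F, G result stream from the same_likes traversal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper mirroring what a group count does to a traversal's results.
class GroupCountSketch {
    // Counts how many times each object appears in the traversal results.
    static Map<String, Integer> groupCount(Iterable<String> results) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String result : results) {
            counts.merge(result, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the key with the highest count, or null for an empty map.
    // On ties, the key encountered first wins.
    static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

Applied to the stream E, F, F, G, groupCount yields E=1, F=2, G=1 and mostFrequent picks F, matching the map m on this slide.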
124. Rexster: A RESTful Graph Shell
reXster
• Allows Blueprints graphs to be exposed through a RESTful API (HTTP).
• All communication is via JSON.
• Supports stored traversals written in raw Pipes or Gremlin.
• Supports ad hoc traversals represented in Gremlin.
• Provides “helper classes” for performing search-, score-, and rank-based
traversal algorithms; in concert, these support recommendation.
125. Rexster: URI Patterns
• http://localhost/graph/vertices: all the vertices in the graph
• http://localhost/graph/vertices/1: vertex with id 1 in the graph.
• http://localhost/graph/vertices/1/outE: outgoing edges of
vertex with id 1.
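A client might assemble these URIs as in the sketch below. RexsterUriSketch and its method names are hypothetical; the base address http://localhost/graph is taken from the patterns above, ids are assumed to be URL-safe, and fetchJson is an untested convenience built on the standard java.net.http client (Java 11+) that assumes a server is actually listening.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client-side helper; not part of Rexster.
class RexsterUriSketch {
    static final String BASE = "http://localhost/graph";

    // All the vertices in the graph.
    static URI allVertices() {
        return URI.create(BASE + "/vertices");
    }

    // The vertex with the given id.
    static URI vertex(String id) {
        return URI.create(BASE + "/vertices/" + id);
    }

    // The outgoing edges of the vertex with the given id.
    static URI outEdges(String id) {
        return URI.create(BASE + "/vertices/" + id + "/outE");
    }

    // Issues an HTTP GET and returns the response body as a string (JSON, per the slides).
    static String fetchJson(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

For example, fetchJson(RexsterUriSketch.vertex("1")) would request the vertex with id 1.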
{ "results": {
    "_type":"vertex",
    "_id":"1",
    "name":"aaron",
    "type":"person"
  },
  "query_time":0.1537 }
127. Conclusion
• Property graphs are convenient structures for modeling the real world.
• Graph databases provide index-free adjacency to ensure speedy
traversal over graphs.
• The graph is such a general data structure that it can be used for
numerous applications.
• TinkerPop provides a database-agnostic stack of technologies for
working with property graphs.
128. Acknowledgements
• Research collaborators: Daniel Steinbock (Stanford), Jennifer H.
Watkins (LANL), Alberto Pepe (Harvard), Joshua Shinavier (RPI), Johan
Bollen (LANL), Herbert Van de Sompel (LANL).
• TinkerPop contributors: Pavel Yaskevich (Riptano), Stephen Mallette
(Independent), Darrick Wiebe (Independent), Alex Averbuch (Swedish
Institute of CS), Peter Neubauer (Neo4j).
• Others: Emil Eifrem (Neo4j), Luca Garulli (Orient Technologies), Aaron
Patterson (AT&Ti).