Great, you have cleared your first hurdle by building your data model and loading your data into a graph, but you know that there's more. Now the real fun begins: finding out what secrets reside within your data.
We will use a data model we are all familiar with, family trees, and a common language, Apache TinkerPop's Gremlin, to demonstrate how you can begin applying some common graph analytical techniques (e.g. path analysis, centrality analysis, community detection) to pull interesting information from within your data.
- Who married their 1st cousin?
- Who is the most influential person in my family?
- Am I really only 6 degrees from Kevin Bacon?
By the end of this session you will have enough knowledge to begin running useful analytics on your graphs, or at least have a better appreciation for how you can use analytics to provide valuable insight into your data.
Background: nearly 20 years of full-stack development in .NET, C, Java/Scala, and pretty much everything else. Switched to working almost exclusively on big data problems several years ago. Spent the last few years leveraging graph databases to build out high-performance data platforms.
If you have questions on using .NET and graph databases feel free to come talk to me.
My current role is Sr. Architect for Data and Analytics, building out our next-generation data and analytics platform.
I like to think of this talk as "Things I wish I knew 18 months ago about graphs".
It is a well-known model; we are going to use a European royal family tree.
Based on GEDCOM, a 1995 standard by the LDS church. Basically it's a linked data structure where all records are atomic units (individual/family/name/note) that contain pointers to each other.
Start at a tree. Move to the root owner and to their name. Traverse out to families, then from families to other individuals and their names.
Here is what an example query on our model looks like….
As you can see, because this model was brought over from GEDCOM, the queries can be more verbose than one would normally strive for in order to retrieve what should be a relatively simple set of data.
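To make that verbosity concrete, here is a hedged sketch of the tree-to-names walk described above. Every vertex/edge label and property key ('tree', 'owner', 'name', 'family', 'member', 'unique_id') is an assumption about the GEDCOM-derived schema, not the exact model.

```groovy
// Walk: tree -> root owner -> families -> members -> names.
// All labels and property keys here are illustrative assumptions.
g.V().has('tree', 'unique_id', 'T1').   // start at the tree record
  out('owner').                         // move to the root individual
  both('family').                       // traverse out to family records
  both('member').                       // then to the other individuals
  out('name').                          // and finally their name records
  values('given_name', 'surname')
```

Even this simple "names in a tree" question needs several hops because names and families are separate atomic records pointing at each other.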
OLTP - Depth-first, serial stream processing to provide depth-first traversals into the data. Can be thought of as a stream processor where graph traversers arrive from the left -> an instruction is processed on those traversers -> mutated traversers are sent out the right.
OLAP - Unlike OLTP queries, OLAP queries are breadth-first, meaning that they run logically in parallel and use message passing to communicate between workers.
OLTP - Has its limitations, most notably certain complex operations (such as running PageRank, bulk loading, and global operations) which are not allowed or not appropriate for a transactional workload.
OLAP - This scatter/gather methodology allows for working at massive scales of data, but it also prevents some steps (such as path() and simplePath()) from being executed and makes others, such as order(), meaningless. It also has the disadvantage that some steps within a Gremlin query require all of the data to be in the same location to process. Steps such as count(), min(), max(), group(), etc. are known as barrier steps and require that all the data return to a single location to be processed before being sent back out to the workers.
OLTP - Use when your query is going to touch only a portion of the data or a subgraph, e.g. "What is the average age of people in my immediate family?"
OLAP - Use when your query is going to touch all, or a significant amount, of the data in the graph, e.g. "What is the average age of everyone in my family tree?"
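As a rough sketch of the difference, the same kind of question dispatched both ways might look like the following. The labels and properties ('individual', 'family', 'member', 'name', 'age') are assumptions about the schema.

```groovy
// OLTP: depth-first, touches only the subgraph around one person.
g.V().has('individual', 'name', 'Queen Victoria').
  both('family').both('member').
  values('age').mean()

// OLAP: breadth-first over the whole graph via a GraphComputer.
g.withComputer().
  V().hasLabel('individual').
  values('age').mean()
```

Same mean() step both times; what changes is the execution engine underneath and how much of the graph gets touched.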
Centrality analysis is about determining what is most important in your graph. This sort of analysis is quite common when performing social network analysis, looking for key infrastructure points, and examining biological networks.
Unfortunately, defining what it means to be important is really dependent on the circumstances. One other important thing to remember is that these sorts of algorithms measure the importance of a vertex in a graph, which may or may not be correlated with influence. For finding the most influential nodes in a graph there are other node influence metrics you would want to investigate.
Degree Centrality - a measure of the number of edges associated with a vertex
Degree centrality looks at the number of connections a vertex has and uses that to determine its relative importance. This can be further refined by using only inward or outward edges. In degree centrality, the larger the number, the more influential the vertex.
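A minimal degree-centrality sketch in Gremlin, assuming 'individual' vertices with a 'name' property:

```groovy
// Rank people by the number of edges touching them (degree centrality).
g.V().hasLabel('individual').
  project('name', 'degree').
    by('name').
    by(bothE().count()).        // swap for inE()/outE() for in/out degree
  order().by(select('degree'), desc).
  limit(5)
```

The by(bothE().count()) modulator is where the "measure" lives; everything else is just projection and sorting.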
Eigenvector Centrality - measures the importance of a vertex by using the relative importance of the adjacent vertices. I.e. if a vertex has many edges but is connected to few influential vertices, it will be ranked lower than a vertex with fewer edges whose adjacent vertices are more important.
PageRank - Made famous by Sergey Brin and Larry Page at Google for ranking web links. It works similarly to eigenvector centrality but adds a scaling factor to the results. This algorithm is well documented but far from something you would want to create yourself. Luckily, Gremlin has a prebuilt step to help us with this.
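Using that prebuilt step, a PageRank sketch might look like this. It must run under a GraphComputer (OLAP), and the 'name' property and result property name are assumptions.

```groovy
// pageRank() writes each vertex's score into a named property,
// which we can then order by and read back out.
g.withComputer().V().
  pageRank().
    with(PageRank.propertyName, 'pageRank').
  order().by('pageRank', desc).
  elementMap('name', 'pageRank').
  limit(5)
```

Note that if you forget withComputer(), the pageRank() step will refuse to run, since it is one of those operations that is OLAP-only.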
The interesting part about this answer is not necessarily the answer itself but the fact that each method produced distinctly different answers
This is an example of why you need to understand your question in order to choose the correct method.
Why do these examples matter?
Paths are the walks through the graph defined by a traversal.
The Path object contains:
- All labels ("as(xxxx)")
- All vertices
- All edges
- All side effects/data structures
Path traversals tend to be on the slow side, and they are computationally expensive because the entire path is stored for each traverser. This can expand exponentially as the size of the traversal grows.
Simple path queries are pretty much what they sound like: the shortest path between two vertices in a graph. To minimize the amount of computation required, simplePath() filters out paths that contain repeats.
Simple path queries are often useful when you want to find the shortest connection between two things, such as in a transportation network, between patterns or subgraphs, or in social network analysis.
Cyclic paths are paths that repeat back on themselves. Using something like a cyclic path can be a first step in trying to detect communities or clusters within your graph.
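A hedged sketch of surfacing such loops with cyclicPath(); the edge labels and the depth of 6 are illustrative assumptions.

```groovy
// Keep only walks that revisit an element - candidate family "cycles".
g.V().hasLabel('individual').
  repeat(both('family', 'member')).times(6).
  cyclicPath().    // the opposite of simplePath(): require a repeat
  path().
  limit(3)
```

In a family tree these cycles are exactly where intermarriage shows up, which is why this makes a decent first pass at cluster detection.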
How about we find the quickest lineage between Queen Victoria and Henry VIII instead?
There are a few key things to note here:
Where the "until" sits matters when doing a repeat(). If it is before the "repeat" it acts as a while/do loop; if it is after, it acts as a do/while loop.
Adding a timeLimit() to your traversal can help prevent a never-ending query.
If you examine the path objects returned by these two queries, you will notice that the difference between the simple and cyclic paths is that the cyclic path circles through her husband Albert to continue on to Henry VIII.
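Putting those notes together, a sketch of the lineage query might look like the following; the names, labels, and the 10-second limit are assumptions.

```groovy
// until() before repeat() = while/do; after repeat() = do/while.
g.V().has('individual', 'name', 'Victoria').
  until(has('name', 'Henry VIII')).
  repeat(timeLimit(10000).           // guard against a never-ending query
         both('family', 'member').
         simplePath()).              // drop this to allow cyclic walks
  path().
  limit(1)
```

Swapping simplePath() out (or replacing it with cyclicPath()) is what produces the Albert detour described above.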
Finding how you are related to others in your family tree is a rather straightforward matter of counting the ups and downs in generations found by the simplest path.
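One hedged way to get at those "ups and downs" is to count the hops along that simplest path. Since each generational step passes through a family vertex in this model, the generational distance is roughly half the path length; all names and labels below are assumptions.

```groovy
// Length of the shortest walk between two relatives.
g.V().has('individual', 'name', 'Victoria').
  until(has('name', 'Henry VIII')).
  repeat(both('family', 'member').simplePath()).
  path().
  limit(1).
  count(local)   // path length; roughly 2 elements per generation step
```

count(local) counts within the one Path object rather than across traversers, which is what we want here.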
Finding clusters of people in your tree can be used to help identify areas in your tree where familial marriages were common
Gremlin has the ability to work as both an imperative language and a declarative language.
In the imperative model you write queries as we described earlier: you start with some stream of vertices -> move left to right taking in data -> process that data -> then emit the processed data.
On the other hand, the declarative model works using a different approach. The user defines a base set of nodes -> then describes one or more patterns that the data needs to match. Once submitted to the Gremlin engine, the engine determines the optimal query to run to find that pattern within the graph.
One of the neat features of gremlin is that you are able to intermix the two types of syntax within the same traversal.
Personally, I find the declarative syntax powerful, but I struggle every time I work with it.
1. First we define a predicate containing all the males.
2. Next we define a predicate containing all of those men's wives.
3. We then define a predicate containing all of those men's cousins.
4. Finally we match everyone who is both a cousin and a wife.
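A declarative sketch of those four predicates using match(). Every label and edge name here ('sex', 'wife_of', 'grandparent') is an assumption about the schema — in particular 'grandparent' stands in for whatever multi-hop walk the real model needs to reach shared grandparents.

```groovy
// match() lets the engine choose the evaluation order of the patterns.
g.V().hasLabel('individual').match(
    __.as('man').has('sex', 'M'),                 // 1: all the males
    __.as('man').out('wife_of').as('wife'),       // 2: those men's wives
    __.as('man').out('grandparent').
       in('grandparent').as('cousin'),            // 3: share a grandparent
    __.as('cousin').where(neq('man')).
       where(eq('wife'))                          // 4: the cousin IS the wife
  ).
  select('man', 'wife').by('name')
```

The trailing select() is ordinary imperative Gremlin, which shows how the two styles can be intermixed in one traversal.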
While I understand this is an interesting traversal from an inquisitive perspective, it is relevant from a genetic aspect as well.
Endogamous populations, ones that marry within specific groups, have a greater genetic chance of inheriting familial genetic defects than people who marry within the larger population. While cousin marriages were common across many parts of the European royal family tree, one famous example was Queen Victoria, who married her first cousin Albert. Due to this close relationship between the parents, several of Queen Victoria's children ended up with hemophilia, which is a genetic defect on the X chromosome inherited from parents.
Why do these examples matter?
Well, in our business our customers are very interested in expanding their family trees. If we can use pattern matching algorithms to suggest potential matches in other people's family trees, then we can quickly and effectively provide them with the ability to expand their family trees.
When it comes down to it, this is a bit of a strange query.
Intermixing these sorts of graph analytical tools lets you gain more valuable insight into your data.
This query is also an example of how you can leverage both declarative and imperative syntaxes in the same query.
Graph Analytics For Fun and Profit
I am David Bechberger
Sr. Architect for Data and Analytics at Gene by Gene, a bioinformatics company specializing in genetic genealogy.
You can find me at:
What this talk isn’t
◎A thorough review of graph analytics
◎A review of all graph analytic frameworks
◎A deep dive into any of the techniques we cover
What this talk is
◎Where to start with Graph Analytics
◎OLTP and OLAP in Gremlin
◎Practical Examples using …..
◎We all have them
◎I know them well
◎They are natural
Or more specifically this
Example - Find the names of all family members in a tree
Gremlin Example - Finding the names of all family members
for tree owner
g.V().has('tree', 'unique_id', 'T1')
◎Tinkerpop supports both
◎Gremlin can be used to
query in either
◎But there are differences….
Apache Tinkerpop Gremlin OLTP and OLAP
OLTP:
◎ Depth First
◎ Lazy evaluation - low memory usage
◎ Real-time (ms/sub-second)
◎ Cannot run certain queries or steps (e.g. pageRank())
◎ Limited time a query can run
◎ Local operations

OLAP:
◎ Breadth First
◎ Eager evaluation - high memory usage
◎ Long running
◎ Some steps are prohibitive, like path()
◎ Barrier steps (count(), min(), max(), etc.)
◎ Global operations
What insights are we going to gain
◎Who in this tree is the most important?
◎Who in this tree is 6 degrees from Kevin Bacon?
◎Who in this tree married their first cousin?
Example - Who is the member of the most families?
Example - Who is the most important individual?
Similar to the Eigenvector Centrality but with scaling
Example - Whose lineage exerts the most influence over this tree?
Degree: Henry VIII (7), Ferdinand VII (5), Philip II (5)
EigenVector: Henry VIII (107539), Joan of the
PageRank: Edward III (0.774), William III (0.681)
Freeman Centrality …...
◎Who is the most important person in my family?
◎Who in my family history has been the most influential?
Who in this tree is 6 degrees from Kevin Bacon?
Example - How long is the lineage between Queen Victoria and Henry VIII?
Pattern Detection in Gremlin
◎Gremlin has the ability to be imperative and declarative
__.as('a').....as('b'), //predicate 1
__.as('b').....as('c'), //predicate 2
Example - Who is married to their first cousin?
1 Albert Augustus Charles Victoria /Hanover/
2 Leopold_I Margaret Teresa
3 Alexander_I the_Fierce Sybil
4 Philip_IV Mariana of_Austria
◎Merging trees together based on potential common ancestors using pattern matching
Example - Which women who married their first cousin had
the greatest number of families?
1 Victoria /Hanover/ 2
2 Margaret Teresa 3
3 Sybil 4
4 Mariana of_Austria 2