Graph Analytics For Fun and Profit

Hello!
I am David Bechberger
Sr. Architect for Data and Analytics at Gene by
Gene, a bioinformatics company specializing
i...

Great, you have cleared your first hurdle by building your data model and loading your data into a graph, but you know that there's more. Now the real fun begins: finding out what secrets reside within your data.
We will use a data model we are all familiar with, family trees, and a common language, Apache TinkerPop, to demonstrate how you can begin applying some common graph analytical techniques (e.g. path analysis, centrality analysis, community detection) to pull interesting information from within your data.
- Who's married their 1st cousin?
- Who is the most influential person in my family?
- Am I really only 6 degrees from Kevin Bacon?
By the end of this session you will have enough knowledge to begin running useful analytics on your graphs, or at least have a better appreciation for how you can use analytics to provide valuable insight into your data.

  1. Graph Analytics For Fun and Profit
  2. Hello! I am David Bechberger, Sr. Architect for Data and Analytics at Gene by Gene, a bioinformatics company specializing in genetic genealogy. You can find me at: @bechbd www.linkedin.com/in/davebechberger
  3. What we do at Swab Sequence Analysis Insight
  4. What this talk isn't
     ◎ A thorough review of graph analytic techniques
     ◎ A review of all graph analytic frameworks
     ◎ A deep dive into any of the techniques we discuss
  5. What this talk is
     ◎ Where to start with Graph Analytics
     ◎ OLTP and OLAP in Gremlin
     ◎ Practical Examples using …..
  6. Family Trees
     ◎ We all have them
     ◎ I know them well
     ◎ They are natural graphs
  7. Or more specifically this
     [Diagram: data model with vertices tree, individual, family, name and edges owns, member_of, is_known_as, is_spouse, is_first_cousin]
  8. Example - Find the names of all family members in a tree
     [Diagram: tree T1; families F1, F2; individuals I1, I2, I3, I4; names Bob, Steve, Joan, Rick; edges owns, member_of (Husband, Wife, Son), is_known_as]
  9. Gremlin Example - Finding the names of all family members for tree owner
       g.V().has('tree', 'unique_id', 'T1')
         .out('owns')
         .sideEffect(
           out('is_known_as').properties('full_name')
             .store('name')
         )
         .out('member_of').in('member_of')
         .sideEffect(
           out('is_known_as').properties('full_name')
             .store('name')
         )
         .cap('name')
  10. Apache TinkerPop Gremlin OLTP and OLAP
      ◎ TinkerPop supports both
      ◎ Gremlin can be used to query in either
      ◎ But there are differences...
  11. Gremlin OLTP versus OLAP
      OLTP
      ◎ Depth first
      ◎ Lazy evaluation - low memory usage
      ◎ Real-time (ms/sub-sec)
      OLAP
      ◎ Breadth first
      ◎ Eager evaluation - high memory usage
      ◎ Long running (min/hour)
  12. Limitations
      OLTP
      ◎ Cannot run certain queries or steps (e.g. pageRank, bulk loading)
      ◎ Limited time per query
      ◎ Local operations
      OLAP
      ◎ Some steps are prohibitive, like path(), simplePath(), etc.
      ◎ Barrier steps (count(), min(), max(), etc.)
      ◎ Global operations
  13. What insights are we going to gain
      ◎ Who in this tree is the most important?
      ◎ Who in this tree is 6 degrees from Kevin Bacon?
      ◎ Who in this tree married their first cousin?
  14. 1. Centrality Analysis: Finding Importance
  15. Degree Centrality: Count the edges
  16. Example - Who is the member of the most families?
        g.V().hasLabel('individual')
          .project('person', 'degree')
          .by('full_name')
          .by(bothE('member_of').count())
          .order().by(select('degree'), decr).limit(5)
  17. Eigenvector Centrality: Relative importance matters
      [Diagram: example graph with vertex scores .6, .3, .5, .4, .2, .2, .2]
  18. Example - Who is the most important individual?
        g.V().hasLabel('individual')
          .repeat(
            groupCount('m').by('full_name')
              .out('member_of').in('member_of')
              .timeLimit(100)
          ).times(5).cap('m')
          .order(local).by(values, decr)
          .limit(local, 5).next()
  19. PageRank: Similar to the Eigenvector Centrality but with scaling
      [Diagram: example graph with vertex scores 25, 3, 2, 5, 1, 3, 2, 22]
  20. Example - Whose lineage exerts the most influence over this family tree?
        g.withComputer().V().hasLabel('individual')
          .pageRank()
          .by(bothE('member_of')).by('rank')
          .order().by('rank', decr)
          .valueMap('full_name', 'rank').limit(5)
  21. Answer
      Degree
        Name                     Value
        Henry VIII               7
        Charlemagne              6
        Jan                      5
        Ferdinand VII            5
        Philip II                5
      EigenVector
        Name                     Value
        Mary                     149950
        Margret                  124221
        Henry VIII               107539
        Son                      90715
        Daughter                 86961
      PageRank
        Name                     Value
        Joan of the Tower        0.784
        Edward III               0.774
        Elenor                   0.774
        John of Eltham           0.719
        Frederick William III    0.681
  22. And many more...
      ◎ Closeness Centrality
      ◎ Betweenness Centrality
      ◎ Katz Centrality
      ◎ Freeman Centrality
      ◎ …...
  23. Practical Examples
      ◎ Who is the most important person in my family's history?
      ◎ Who in my family history has been the most prolific?
  24. 2. Path Analysis: Who in this tree is 6 degrees from Kevin Bacon?
  25. Path: How did you get there?
  26. Simple Path: Don't repeat yourself
  27. Cyclic Path: OK, then repeat yourself
  28. Sorry, not in this family tree
  29. How about this instead?
  30. Example - How long is the lineage between Queen Victoria and Henry VIII?
      SimplePath
        g.V('@I1@').repeat(timeLimit(60000)
          .out('member_of').in('member_of')
          .simplePath()).until(hasId('@I828@'))
          .path().limit(1).count(local)
      CyclicPath
        g.V('@I1@').repeat(timeLimit(60000)
          .out('member_of').in('member_of')
          .cyclicPath()).until(hasId('@I828@'))
          .path().limit(1).count(local)
  31. Answer
      SimplePath: 25 steps
      CyclicPath: 27 steps
  32. Practical Examples
      ◎ How am I related to X in my family?
      ◎ Does this family tree contain clusters of people?
  33. 3. Pattern Detection: Finding what is hidden
  34. Pattern Detection in Gremlin
      ◎ Gremlin has the ability to be imperative
        ○ g.V().in().out()......
      ◎ Or declarative
        ○ g.V().match(
            __.as('a').....as('b'), // predicate 1
            __.as('b').....as('c'), // predicate 2
            __.as('c').where('c', eq('b')).as('c')
          ).select('b', 'c')
  35. Example - Who is married to their first cousin?
        g.V().match(
          __.as('e').has('individual','sex','M').as('husband'),
          __.as('husband').in('is_spouse').as('wifes'),
          __.as('husband').both('is_first_cousin').as('cousin'),
          __.as('cousin').where('cousin',eq('wifes')).as('wife')
        ).select('husband','wife')
          .by('full_name').fold().unfold()
  36. Answer
        #  Husband                    Wife
        1  Albert Augustus Charles    Victoria /Hanover/
        2  Leopold_I                  Margaret Teresa
        3  Alexander_I the_Fierce     Sybil
        4  Philip_IV                  Mariana of_Austria
  37. Practical Examples
      ◎ Merging trees together based on potential common ancestors using pattern matching
  38. 4. Putting it all together
  39. Example - Which women who married their first cousin had the greatest number of families?
        g.V().match(
          __.as('e').has('individual','sex','M').as('husband'),
          __.as('husband').in('is_spouse').as('wifes'),
          __.as('husband').both('is_first_cousin').as('cousin'),
          __.as('cousin').where('cousin',eq('wifes')).as('wife')
        ).select('wife')
          .project('person','degree')
          .by('full_name')
          .by(bothE('member_of').count())
          .order().by(select('degree'), decr).limit(5)
  40. Answer
        #  Wife                  Degree
        1  Victoria /Hanover/    2
        2  Margaret Teresa       3
        3  Sybil                 4
        4  Mariana of_Austria    2
  41. Thanks! Any questions? You can find me at: dave@bechberger.com @bechbd www.linkedin.com/in/davebechberger

Editor's Notes

  • Background: nearly 20 years of full-stack development in .NET, C, Java/Scala, and pretty much everything else
    Switched to working almost exclusively on big data problems several years ago
    Spent the last few years leveraging graph databases to build out high performance data platforms

    If you have questions on using .NET and graph databases feel free to come talk to me.

    Current role is Sr Architect for data and analytics building out our next generation data and analytics platform
  • I like to think of this talk as “Things I wish I knew 18 months ago about graphs”
  • Well known model
    Going to use a European Royal Family Tree
  • Based on GEDCOM - 1995 Standard by the LDS church
    Basically it's a linked data structure where all records are atomic units (individual/family/name/note) that contain pointers to each other
  • Start at a tree
    Move to the root owner and to their name
    Traverse out to families
    Then from families to other individuals and their name
  • Here is what an example query on our model looks like….

    As you can see, because this model was brought over from GEDCOM, queries can be more verbose than one would like for retrieving what should be a relatively simple set of data
  • OLTP - Depth first: serial stream processing that provides depth-first traversals into the data.
    It can be thought of as a stream processor where graph traversers arrive from the left -> an instruction is processed on those traversers -> mutated traversers are sent out the right

    OLAP - Unlike OLTP queries, OLAP queries are breadth-first queries, meaning that they run in a logically parallel fashion and use message passing to communicate between vertices.
  • OLTP - Has its limitations, most notably that certain complex operations (such as running pageRank, bulk loading, and global operations) are not allowed or not appropriate for a transactional workload

    OLAP - This scatter/gather methodology allows for working on massive scales of data but also prevents some steps (such as path(), simplePath()) from being executed and others such as order() from being meaningful. It also has the disadvantage that some steps within a Gremlin query can require all of the data to be in the same location to process. Steps such as count(), min(), max(), group(), etc. are known as barrier steps and require that all the data return to a single location to be processed before being sent out to workers.


    OLTP - Use when your query is going to touch only a portion of the data or a subgraph e.g. Give me the average age of people in my family?

    OLAP - Use when your query is going to touch all/a significant amount of the data in the graph e.g. Give me the average age of everyone in my family tree?
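
One rough way to picture the two evaluation styles described above, sketched in plain Python rather than Gremlin. The toy graph and both function names are made up for illustration and are not part of TinkerPop; the point is only that the depth-first walk hands back one vertex at a time, while the breadth-first walk materialises each whole frontier before moving on.

```python
# Hypothetical toy graph: vertex -> list of adjacent vertices.
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def walk_depth_first(graph, start):
    """OLTP-style sketch: a generator that lazily yields one vertex at a time."""
    stack, seen = [start], set()
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        yield vertex
        # Push neighbours in reverse so the first neighbour is visited first.
        stack.extend(reversed(graph[vertex]))


def walk_breadth_first(graph, start):
    """OLAP-style sketch: eagerly build every frontier, level by level."""
    frontier, seen, levels = [start], {start}, []
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for vertex in frontier:
            for neighbour in graph[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return levels
```

Calling `next(walk_depth_first(GRAPH, "a"))` does only the work needed for the first result, which is the low-memory, real-time behaviour the OLTP column describes. `walk_breadth_first` holds a full level in memory at once, which is closer to the OLAP trade-off.
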
  • Centrality Analysis is about determining what is the most important in your graph. This sort of analysis is quite common when performing social network analysis, looking for key infrastructure points and examining biological networks.

    Unfortunately, defining what it means to be important is really dependent on the circumstances. One other important thing to remember is that these sorts of algorithms measure the importance of a vertex in a graph, which may or may not be correlated with its influence. For finding the most influential nodes in a graph there are other node influence metrics you would want to investigate.
  • Degree Centrality - a measure of the number of edges associated with a vertex

    Degree Centrality looks at the number of connections a vertex has and uses that to determine the relative importance. This can be further refined by using only inward or only outward edges. In degree centrality, the larger the number, the more influential the vertex
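
As a minimal sketch of "count the edges", here is the same idea in plain Python. The edge list is invented for illustration (the names are borrowed from the example tree in the slides, but who belongs to which family is an assumption), and `degree_centrality` is a hypothetical helper, not a TinkerPop call.

```python
from collections import Counter

# Hypothetical member_of edges as (individual, family) pairs.
MEMBER_OF = [
    ("Bob", "F1"),
    ("Bob", "F2"),
    ("Joan", "F1"),
    ("Steve", "F2"),
    ("Rick", "F2"),
]


def degree_centrality(edges):
    """Count member_of edges per individual, highest count first."""
    degree = Counter()
    for individual, _family in edges:
        degree[individual] += 1
    return degree.most_common()


ranking = degree_centrality(MEMBER_OF)
print(ranking[0])  # the individual attached to the most families
```

This plays the same role as the `project('person', 'degree')` traversal on the slide: one count per individual, sorted in descending order.
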
  • Eigenvector Centrality - a measure of a vertex's importance that takes the relative importance of its adjacent vertices into account. I.e. a vertex that has many edges but is connected to few influential vertices will be ranked lower than a vertex with fewer edges whose adjacent vertices are more important
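
To make "relative importance matters" concrete, here is a small power-iteration sketch in plain Python. It is a textbook-style approximation, not the groupCount traversal from the slides, and the five-vertex graph and the function name are assumptions chosen so that two vertices with the same degree end up with different scores.

```python
# Hypothetical undirected graph: a triangle a-b-c with a tail a-d-e.
ADJACENCY = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e"],
    "e": ["d"],
}


def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration: a vertex's score is the sum of its neighbours' scores."""
    scores = {vertex: 1.0 for vertex in adjacency}
    for _ in range(iterations):
        updated = {
            vertex: sum(scores[neighbour] for neighbour in neighbours)
            for vertex, neighbours in adjacency.items()
        }
        # Rescale so the largest score is 1.0 and the values stay bounded.
        largest = max(updated.values())
        scores = {vertex: value / largest for vertex, value in updated.items()}
    return scores


scores = eigenvector_centrality(ADJACENCY)
for vertex in sorted(scores, key=scores.get, reverse=True):
    print(vertex, round(scores[vertex], 3))
```

Vertices `b` and `d` both have two edges, but `b` scores higher because both of its neighbours sit in the well-connected triangle, while one of `d`'s neighbours is the leaf `e`.
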
  • PageRank - Made famous by Sergey Brin and Larry Page at Google for ranking web links. It works similarly to Eigenvector Centrality but adds a scaling factor to the results. This algorithm is well documented but far from something you would want to create yourself. Luckily Gremlin has a prebuilt step to help us with this.
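
For intuition only, here is a compact PageRank sketch in plain Python. It assumes the "scaling factor" mentioned above corresponds to the usual damping factor (0.85 is a common default, not a value taken from the talk), and the link graph and function name are invented. In practice the built-in `pageRank()` step is the better choice.

```python
# Hypothetical directed graph: vertex -> vertices it links to.
LINKS = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],
}


def pagerank(out_links, damping=0.85, iterations=50):
    """Iteratively share each vertex's rank across its outgoing links."""
    count = len(out_links)
    rank = {vertex: 1.0 / count for vertex in out_links}
    for _ in range(iterations):
        # Every vertex starts each round with the same small base share.
        updated = {vertex: (1.0 - damping) / count for vertex in out_links}
        for vertex, targets in out_links.items():
            if targets:
                share = damping * rank[vertex] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # A vertex with no outgoing links spreads its rank evenly.
                for target in out_links:
                    updated[target] += damping * rank[vertex] / count
        rank = updated
    return rank


ranks = pagerank(LINKS)
for vertex in sorted(ranks, key=ranks.get, reverse=True):
    print(vertex, round(ranks[vertex], 4))
```

The ranks always sum to 1, and `d`, which nothing links to, is left with only the base share.
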
  • The interesting part about this answer is not necessarily the answer itself but the fact that each method produced distinctly different answers

    This is an example of why you need to understand your question to choose the correct method
  • Why do these examples matter?

  • Paths are the walk through the graph defined by a traversal

    Path object contains
    All Labels “as(xxxx)”
    All Vertices
    All Edges
    All side effects/data structures

    Path traversals tend to be on the slow side and they are computationally expensive as the entire path is stored for each traverser. This can expand exponentially as the size of the
  • Simple path queries are pretty much what they sound like.
    Shortest path between two vertices in a graph.
    Minimize the amount of computation that is required
    simplePath filters out paths that contain repeats in them.

    Simple path queries are often useful if you want to find the shortest connection between two things such as in a transportation network, between patterns or subgraphs or in social network analysis
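
A small sketch of the simple-path idea in plain Python: a breadth-first search that drops any partial path as soon as it would revisit a vertex, so the first path that reaches the goal is a shortest one. The graph, the vertex ids, and the function name are hypothetical; this illustrates the filtering that `simplePath()` does, not how TinkerPop implements it.

```python
from collections import deque

# Hypothetical undirected graph of individuals; "I6" is unreachable.
RELATED = {
    "I1": ["I2", "I3"],
    "I2": ["I1", "I4"],
    "I3": ["I1", "I5"],
    "I4": ["I2", "I5"],
    "I5": ["I3", "I4"],
    "I6": [],
}


def shortest_simple_path(adjacency, start, goal):
    """Breadth-first search over paths that never repeat a vertex."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in adjacency[path[-1]]:
            # The simplePath() idea: discard paths that contain repeats.
            if neighbour not in path:
                queue.append(path + [neighbour])
    return None  # no simple path connects start and goal


print(shortest_simple_path(RELATED, "I1", "I5"))
```

Because every partial path is kept in the queue, memory grows quickly with graph size, which is the cost the path notes warn about.
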
  • cyclic paths are paths that repeat back on themselves.
    Using something like a cyclic path can be a first step in trying to detect communities or clusters within your graph
  • How about we find the quickest lineage between Queen Victoria and Henry VIII instead?
  • There are a few key things to note here:

    Where the “until” sits matters when you do a repeat. If it is before the “repeat” it is a while/do loop; if it is after, it is a do/while loop

    Adding a timeLimit to your traversal can help prevent a never-ending query
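
The two loop shapes can be sketched in plain Python to show why the placement matters. Both helper names are made up for illustration; `step` stands in for the body of the repeat and `done` for the until condition.

```python
def repeat_then_until(start, step, done):
    """Like repeat(...).until(...): do/while, so the body runs at least once."""
    value = step(start)
    while not done(value):
        value = step(value)
    return value


def until_then_repeat(start, step, done):
    """Like until(...).repeat(...): while/do, so the check happens first."""
    value = start
    while not done(value):
        value = step(value)
    return value


def add_one(number):
    return number + 1


def is_non_negative(number):
    return number >= 0


# The start value already satisfies the condition, so only the do/while moves.
print(repeat_then_until(0, add_one, is_non_negative))  # 1
print(until_then_repeat(0, add_one, is_non_negative))  # 0
```

When the starting vertex already satisfies the condition, the while/do form returns it untouched, while the do/while form takes one step first.
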
  • If you go in and examine the path objects returned by these two queries you will notice that the difference between the simple and cyclic paths is that the cyclic path circles through her husband Albert to continue on to Henry VIII
  • Finding how you are related to others in your family tree is a rather straightforward matter of counting the ups and downs in generations found by the simplest path.

    Finding clusters of people in your tree can be used to help identify areas in your tree where familial marriages were common
  • Gremlin has the ability to work as both an imperative language as well as a declarative language

    In the imperative model you usually write queries as we described earlier. You start with some stream over vertices -> you then move left to right taking in data -> processing that data -> then emitting the processed data

    On the other hand, the declarative model works using a different approach. In the declarative model the user defines a base set of nodes -> then describes one or more patterns that the data needs to match. Once submitted to the Gremlin engine, the engine determines the optimal query to run to find that pattern within the graph

    One of the neat features of Gremlin is that you are able to intermix the two types of syntax within the same traversal.

    Personally I find writing the declarative syntax powerful but I struggle every time I work with it.
  • 1. So what we are doing here is first defining a predicate containing all the males
    2. Next we are defining a predicate containing all of those men's wives
    3. We then are defining a predicate containing all of those men's cousins
    4. Finally we are matching everyone who is both a cousin and a wife
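
The four steps above can be sketched with plain Python sets. All of the data is invented, the ids are placeholders, and the function name is hypothetical. The is_spouse edges are assumed to point from wife to husband, which is what `in('is_spouse')` on the husband implies in the slide.

```python
# Hypothetical data: sex per individual, plus two kinds of edges.
SEX = {"H1": "M", "W1": "F", "H2": "M", "W2": "F", "C1": "F"}
IS_SPOUSE = [("W1", "H1"), ("W2", "H2")]         # (wife, husband)
IS_FIRST_COUSIN = [("H1", "W1"), ("H2", "C1")]   # undirected pairs


def married_first_cousins(sex, is_spouse, is_first_cousin):
    """Return (husband, wife) pairs where the wife is also a first cousin."""
    matches = []
    # Step 1: all the males.
    for husband in (person for person, value in sex.items() if value == "M"):
        # Step 2: that man's wives (incoming is_spouse edges).
        wives = {wife for wife, man in is_spouse if man == husband}
        # Step 3: that man's first cousins, following edges in both directions.
        cousins = {b for a, b in is_first_cousin if a == husband}
        cousins |= {a for a, b in is_first_cousin if b == husband}
        # Step 4: keep everyone who is both a cousin and a wife.
        for wife in sorted(wives & cousins):
            matches.append((husband, wife))
    return matches


print(married_first_cousins(SEX, IS_SPOUSE, IS_FIRST_COUSIN))
```

The set intersection in step 4 does the job of the `where('cousin', eq('wifes'))` clause: it keeps only the bindings where the two labels refer to the same vertex.
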
  • Yes, I understand this is an interesting traversal from an inquisitive perspective, but it is also relevant from a genetic perspective.

    Endogamous populations, ones that marry within specific groups, have a greater chance of inheriting familial genetic defects than people who marry within the larger population. While cousin marriages were common across many parts of the European Royal family tree, one famous example was Queen Victoria. She married her first cousin Albert. Due to this close relationship between parents several of Queen Victoria's children ended up with hemophilia, which is a genetic defect on the X chromosome inherited from parents.
  • Why do these examples matter?

    Well in our business our customers are very interested in expanding their family trees. If we are able to use pattern matching algorithms to suggest potential matches in other people’s family trees then we are able to quickly and effectively provide them with the ability to expand their family trees.
  • When it comes to it, this is a bit of a strange query.

    It shows how intermixing these sorts of graph analytical tools can give you more valuable insight into your data

    This query is also an example of how you can leverage both declarative and imperative syntaxes in the same query.
