Graph Analytics For Fun and Profit

Great, you have cleared your first hurdle by building your data model and loading your data into a graph, but you know that there’s more. Now the real fun begins: finding out what secrets reside within your data.
We will use a data model we are all familiar with, family trees, and a common language, Apache TinkerPop, to demonstrate how you can begin applying some common graph analytical techniques (e.g., path analysis, centrality analysis, community detection) to pull interesting information from within your data.
- Who's married their 1st cousin?
- Who is the most influential person in my family?
- Am I really only 6 degrees from Kevin Bacon?
By the end of this session you will have enough knowledge to begin running useful analytics on your graphs, or at least have a better appreciation for how you can use analytics to provide valuable insight into your data.

Published in: Software

  1. Graph Analytics For Fun and Profit
  2. Hello! I am David Bechberger, Sr. Architect for Data and Analytics at Gene by Gene, a bioinformatics company specializing in genetic genealogy. You can find me at: @bechbd, www.linkedin.com/in/davebechberger
  3. What we do: Swab → Sequence → Analysis → Insight
  4. What this talk isn’t
     ◎ A thorough review of graph analytic techniques
     ◎ A review of all graph analytic frameworks
     ◎ A deep dive into any of the techniques we discuss
  5. What this talk is
     ◎ Where to start with Graph Analytics
     ◎ OLTP and OLAP in Gremlin
     ◎ Practical Examples using …..
  6. Family Trees
     ◎ We all have them
     ◎ I know them well
     ◎ They are natural graphs
  7. Or more specifically this
     (schema diagram)
     Vertex labels: tree, individual, family, name
     Edge labels: owns, member_of, is_known_as, is_spouse, is_first_cousin
  8. Example - Find the names of all family members in a tree
     (example graph diagram)
     Vertices: tree T1; families F1, F2; individuals I1, I2, I3, I4; names Bob, Steve, Joan, Rick
     Edges: owns; member_of (with roles Husband, Wife, Son); is_known_as
  9. Gremlin Example - Finding the names of all family members for tree owner
         g.V().has('tree', 'unique_id', 'T1')
           .out('owns')
           .sideEffect(
             out('is_known_as').properties('full_name')
               .store('name')
           )
           .out('member_of').in('member_of')
           .sideEffect(
             out('is_known_as').properties('full_name')
               .store('name')
           )
           .cap('name')
  10. Apache TinkerPop Gremlin OLTP and OLAP
      ◎ TinkerPop supports both
      ◎ Gremlin can be used to query in either
      ◎ But there are differences…
  11. Gremlin OLTP versus OLAP
      OLTP
        ◎ Depth First
        ◎ Lazy Evaluation - Low memory usage
        ◎ Real-time (ms/sub-sec)
      OLAP
        ◎ Breadth First
        ◎ Eager evaluation - High memory usage
        ◎ Long Running (min/hour)
  12. Limitations
      OLTP
        ◎ Cannot run certain queries or steps (e.g. pageRank, bulk loading)
        ◎ Limited time per query
        ◎ Local operations
      OLAP
        ◎ Some steps are prohibitive like path(), simplePath(), etc.
        ◎ Barrier Steps (count(), min(), max(), etc.)
        ◎ Global Operations
  13. What insights are we going to gain
      ◎ Who in this tree is the most important?
      ◎ Who in this tree is 6 degrees from Kevin Bacon?
      ◎ Who in this tree married their first cousin?
  14. 1. Centrality Analysis: Finding Importance
  15. Degree Centrality: Count the edges
  16. Example - Who is a member of the most families?
          g.V().hasLabel('individual')
            .project('person', 'degree')
              .by('full_name')
              .by(bothE('member_of').count())
            .order().by(select('degree'), decr).limit(5)
  17. Eigenvector Centrality: Relative importance matters
      (diagram: graph with node scores .6, .3, .5, .4, .2, .2, .2)
  18. Example - Who is the most important individual?
          g.V().hasLabel('individual')
            .repeat(
              groupCount('m').by('full_name')
                .out('member_of').in('member_of')
                .timeLimit(100)
            ).times(5).cap('m')
            .order(local).by(values, decr)
            .limit(local, 5).next()
  19. PageRank: Similar to Eigenvector Centrality but with scaling
      (diagram: graph with node scores 25, 3, 2, 5, 1, 3, 2, 22)
  20. Example - Whose lineage exerts the most influence over this family tree?
          g.withComputer().V().hasLabel('individual')
            .pageRank()
              .by(bothE('member_of')).by('rank')
            .order().by('rank', decr)
            .valueMap('full_name', 'rank').limit(5)
  21. Answer
      Degree (Name: Value)
        Henry VIII: 7
        Charlemagne: 6
        Jan: 5
        Ferdinand VII: 5
        Philip II: 5
      EigenVector (Name: Value)
        Mary: 149950
        Margret: 124221
        Henry VIII: 107539
        Son: 90715
        Daughter: 86961
      PageRank (Name: Value)
        Joan of the Tower: 0.784
        Edward III: 0.774
        Elenor: 0.774
        John of Eltham: 0.719
        Frederick William III: 0.681
  22. And many more...
      ◎ Closeness Centrality
      ◎ Betweenness Centrality
      ◎ Katz Centrality
      ◎ Freeman Centrality
      ◎ …
  23. Practical Examples
      ◎ Who is the most important person in my family's history?
      ◎ Who in my family history has been the most prolific?
  24. 2. Path Analysis: Who in this tree is 6 degrees from Kevin Bacon?
  25. Path: How did you get there?
  26. Simple Path: Don’t repeat yourself
  27. Cyclic Path: OK, then repeat yourself
  28. Sorry, not in this family tree
  29. How about this instead?
  30. Example - How long is the lineage between Queen Victoria and Henry VIII?
      SimplePath
          g.V('@I1@').repeat(timeLimit(60000)
              .out('member_of').in('member_of')
              .simplePath()).until(hasId('@I828@'))
            .path().limit(1).count(local)
      CyclicPath
          g.V('@I1@').repeat(timeLimit(60000)
              .out('member_of').in('member_of')
              .cyclicPath()).until(hasId('@I828@'))
            .path().limit(1).count(local)
  31. Answer
      SimplePath: 25 steps
      CyclicPath: 27 steps
  32. Practical Examples
      ◎ How am I related to X in my family?
      ◎ Does this family tree contain clusters of people?
  33. 3. Pattern Detection: Finding what is hidden
  34. Pattern Detection in Gremlin
      ◎ Gremlin has the ability to be imperative
        ○ g.V().in().out()......
      ◎ Or declarative
        ○ g.V().match(
              __.as('a').....as('b'), // predicate 1
              __.as('b').....as('c'), // predicate 2
              __.as('c').where('c', eq('b')).as('c')
            ).select('b', 'c')
  35. Example - Who is married to their first cousin?
          g.V().match(
              __.as('e').has('individual','sex','M').as('husband'),
              __.as('husband').in('is_spouse').as('wifes'),
              __.as('husband').both('is_first_cousin').as('cousin'),
              __.as('cousin').where('cousin',eq('wifes')).as('wife')
            ).select('husband','wife')
              .by('full_name').fold().unfold()
  36. Answer (Husband, Wife)
      1. Albert Augustus Charles, Victoria /Hanover/
      2. Leopold_I, Margaret Teresa
      3. Alexander_I the_Fierce, Sybil
      4. Philip_IV, Mariana of_Austria
  37. Practical Examples
      ◎ Merging trees together based on potential common ancestors using pattern matching
  38. 4. Putting it all together
  39. Example - Which women who married their first cousin had the greatest number of families?
          g.V().match(
              __.as('e').has('individual','sex','M').as('husband'),
              __.as('husband').in('is_spouse').as('wifes'),
              __.as('husband').both('is_first_cousin').as('cousin'),
              __.as('cousin').where('cousin',eq('wifes')).as('wife')
            ).select('wife')
            .project('person','degree')
              .by('full_name')
              .by(bothE('member_of').count())
            .order().by(select('degree'), decr).limit(5)
  40. Answer (Wife, Degree)
      1. Victoria /Hanover/, 2
      2. Margaret Teresa, 3
      3. Sybil, 4
      4. Mariana of_Austria, 2
  41. Thanks! Any questions? You can find me at: dave@bechberger.com, @bechbd, www.linkedin.com/in/davebechberger
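
To make the "count the edges" idea behind degree centrality (slides 15 and 16) concrete outside a graph database, here is a minimal Python sketch. The edge list and the names in it are made up for illustration and are not the talk's data set; the function name `degree_centrality` is likewise hypothetical.

```python
from collections import Counter

# Hypothetical member_of edges as (individual, family) pairs; not the talk's data.
member_of = [
    ("Henry", "F1"), ("Henry", "F2"), ("Henry", "F3"),
    ("Anne", "F1"), ("Anne", "F4"),
    ("Mary", "F2"),
]

def degree_centrality(edges, top=5):
    """Count member_of edges per individual and return the highest counts first."""
    counts = Counter(person for person, _family in edges)
    return counts.most_common(top)

ranking = degree_centrality(member_of)
print(ranking)
```

This mirrors the shape of the Gremlin query (project a person and an edge count, order descending, keep the top five), but it only counts one edge label and ignores edge direction.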
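
Slide 19 describes PageRank as similar to eigenvector centrality but with scaling. One common way to read that is the damping factor used in the iterative formulation; the Python sketch below assumes that reading. The four-node graph, the default damping value of 0.85, and the function name `pagerank` are illustrative assumptions, not taken from the talk or from TinkerPop's implementation.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping each node to the nodes it links to.

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same small base score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # Split this node's damped rank evenly across its links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical graph: C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
most_influential = max(ranks, key=ranks.get)
print(most_influential)
```

The scores always sum to 1, so they can be read as each node's share of the total influence in the graph.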
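
The path questions on slides 24 to 32 ("How am I related to X?") can also be sketched without a graph database. The Python example below uses breadth-first search, which never revisits a vertex and so always yields a simple path; note that this is a different strategy from the `repeat()` traversal on slide 30, which returns the first path it finds. The vertex ids and the adjacency map are hypothetical.

```python
from collections import deque

def shortest_path(neighbors, start, goal):
    """Return the shortest list of vertices from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in neighbors.get(node, ()):
            # Skipping visited vertices keeps every path simple (no repeats).
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical relatives: each key lists the individuals reachable in one hop.
relatives = {
    "I1": ["I2", "I3"],
    "I2": ["I4"],
    "I3": ["I4"],
    "I4": ["I5"],
    "I5": [],
}

path = shortest_path(relatives, "I1", "I5")
print(path, len(path))
```

Counting the vertices in the returned list is the equivalent of the `.path().count(local)` at the end of the Gremlin query.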
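
The first-cousin pattern from slide 35 comes down to finding pairs that are connected by both an `is_spouse` edge and an `is_first_cousin` edge. A minimal Python sketch of that intersection follows; it assumes, as the Gremlin query suggests, that `is_spouse` edges point from wife to husband and that `is_first_cousin` is treated as undirected. All names and edges are invented for illustration.

```python
def married_first_cousins(spouse_edges, cousin_edges):
    """Return (husband, wife) pairs who are also first cousins."""
    # Store cousin pairs as unordered sets so edge direction does not matter.
    cousins = {frozenset(pair) for pair in cousin_edges}
    return sorted(
        (husband, wife)
        for wife, husband in spouse_edges
        if frozenset((husband, wife)) in cousins
    )

# Hypothetical edges: is_spouse as (wife, husband), is_first_cousin as pairs.
is_spouse = [("Wife A", "Husband A"), ("Wife B", "Husband B")]
is_first_cousin = [("Husband A", "Wife A"), ("Husband B", "Someone C")]

matches = married_first_cousins(is_spouse, is_first_cousin)
print(matches)
```

The declarative `match()` step expresses the same constraint as a set of patterns that must all hold for the same pair of vertices.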
